Need it within a week...
I need an implementation of the tool described in the CollGram profile (Bestgen & Granger 2014). Its measures are the average t-score and MI score for all bigrams in a student's text, calculated from a reference corpus.
I have a set of 300 essays written by students at different levels, and I would like to calculate the CollGram profiles for them based on the COCA corpus, which I already have. The lists of bigrams and word frequencies are available on the website, so calculating the MI and t-score for each bigram should not be difficult in concept.
For each pair of words (bigram), two collocability measures (MI and t) should be calculated from the frequencies of the constituent words. The calculation is mathematically simple; I am attaching the formulas as a *.docx file.
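As a rough sketch of that calculation: the attached formulas are not reproduced in this brief, so the code below assumes the usual corpus-linguistics definitions, MI = log2(O / E) and t = (O - E) / sqrt(O), where O is the observed bigram frequency and E = f(w1) * f(w2) / N is the frequency expected by chance. The function name and parameter names are hypothetical.

```python
import math

def mi_and_t(bigram_freq, w1_freq, w2_freq, corpus_size):
    """Return (MI, t) for one bigram from reference-corpus counts.

    Assumes the standard definitions; check them against the attached formulas.
    """
    # Frequency expected if the two words co-occurred only by chance.
    expected = w1_freq * w2_freq / corpus_size
    mi = math.log2(bigram_freq / expected)
    t = (bigram_freq - expected) / math.sqrt(bigram_freq)
    return mi, t

# Made-up counts, for illustration only (not real COCA figures).
mi, t = mi_and_t(bigram_freq=100, w1_freq=1000, w2_freq=2000, corpus_size=1_000_000)
```

The corpus size N must be the same reference-corpus word count that the word frequencies are based on.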
• I have around 300 texts of 20-600 words each, written by students of English (the spelling has been corrected). All bigrams must be extracted from each text; punctuation marks act as bigram boundaries, so no bigram may span one.
• The extracted bigrams must be looked up in the reference list (COCA corpus), which is discussed in point (1). If they are found, their two collocability measures must be retrieved.
• For each text, four lists must be produced: the bigrams that were found (one list of tokens and one of types), each with its two collocability measures, and the bigrams that were not found (again, tokens and types).
• For each text, the following values must be produced: the mean of each of the two measures, for tokens and for types separately, and the percentage of bigrams not found out of the total number of bigrams (for tokens and for types).
• The last operation should produce a nice table for the batch of text.
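The steps in the list above could be sketched roughly as follows. This is a minimal illustration, not a finished design: the `reference` dictionary mapping a word pair to its (MI, t) values is a hypothetical structure, the tokenisation rule (lower-cased ASCII words, internal apostrophes and hyphens kept) is an assumption, and all function names are made up for the example.

```python
import re

# A word is a run of letters/digits, optionally joined by ' or - (assumption).
WORD = r"[a-z0-9]+(?:['-][a-z0-9]+)*"
# Match a word (captured in group 1) or any other single non-space character.
TOKEN_RE = re.compile(rf"({WORD})|\S")

def extract_bigrams(text):
    """Return all bigram tokens; no bigram spans a punctuation mark."""
    # Punctuation becomes None, so pairs that touch it are dropped.
    tokens = [m.group(1) for m in TOKEN_RE.finditer(text.lower())]
    return [(a, b) for a, b in zip(tokens, tokens[1:]) if a and b]

def profile(text, reference):
    """Summarise one text; reference maps (w1, w2) -> (MI, t)."""
    tokens = extract_bigrams(text)
    result = {}
    for label, items in (("tokens", tokens), ("types", sorted(set(tokens)))):
        found = [b for b in items if b in reference]
        missing = [b for b in items if b not in reference]
        n = len(found)
        result[label] = {
            "found": found,
            "not_found": missing,
            "mean_mi": sum(reference[b][0] for b in found) / n if n else None,
            "mean_t": sum(reference[b][1] for b in found) / n if n else None,
            "pct_not_found": 100 * len(missing) / len(items) if items else None,
        }
    return result

# Made-up reference values, for illustration only.
demo_reference = {("i", "like"): (2.0, 10.0), ("not", "coffee"): (4.0, 3.0)}
demo = profile("I like tea, but not coffee. I like milk.", demo_reference)
```

Collecting one such `profile` result per file would give the rows for the batch table.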
If really necessary, I can provide a license for POS tagging.
I have a sample of the output for an analysed file:
1       2          3          4               5   6     7  8
bigram  freq_text  freq_COCA  mean freq_COCA  MI  MI>3  t  t>2,54
col 1 - a list of all bi-grams retrieved from a learner text (without punctuation marks)
col 2 - frequency of the bigram in a learner text
col 3 - frequency of the bigram in COCA. If blank, the bigram does not occur in COCA.
col 4 - mean frequency of the bigram in COCA per 1 million. If blank, the bigram does not occur in COCA.
col 5 - MI for the bigram calculated based on COCA. If blank, the bigram does not occur in COCA.
col 6 - "*" if MI>3
col 7 - t for the bigram calculated based on COCA. If blank, the bigram does not occur in COCA.
col 8 - "*" if t>2
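One possible way to build a single output row with these eight columns is sketched below. The function name and the shape of the `coca` argument are hypothetical. The two cut-offs are parameters because the sample header and the column description give different values for the t threshold.

```python
def output_row(bigram, freq_text, coca=None, mi_cut=3.0, t_cut=2.0):
    """Build one 8-column row.

    coca is a hypothetical tuple (freq, per_million, MI, t),
    or None when the bigram does not occur in the reference corpus.
    """
    if coca is None:
        # Columns 3-8 stay blank for bigrams not found in COCA.
        return [bigram, freq_text, "", "", "", "", "", ""]
    freq, per_million, mi, t = coca
    return [
        bigram, freq_text, freq, per_million,
        mi, "*" if mi > mi_cut else "",
        t, "*" if t > t_cut else "",
    ]

# Made-up values, for illustration only.
row_found = output_row("of the", 2, coca=(5000, 10.0, 3.5, 1.2))
row_missing = output_row("tea milk", 1)
```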
Input files may also contain <> tags, which should be removed.
I also want to be able to load multiple files. If I load more than one file into the program, I want the analysis above for each file plus cumulative results as:
This is only the beginning; I hope the freelancer doing this will be willing to continue developing this tool in follow-up projects.
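A minimal sketch of the tag removal, assuming the tags are simple angle-bracket markup with no ">" inside a tag (the function name is hypothetical):

```python
import re

def strip_tags(text):
    """Remove <...> tags and tidy up the whitespace left behind."""
    # Replace each tag with a space so neighbouring words do not merge.
    without_tags = re.sub(r"<[^>]*>", " ", text)
    return " ".join(without_tags.split())

cleaned = strip_tags("<p>Hello <b>world</b></p>")
```

Running this before bigram extraction keeps the angle brackets from being counted as punctuation boundaries.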