In Progress

Text Comparison - word frequencies calculation URGENT

Need it within a week...

I need to implement the tool described in the CollGram profile (Bestgen & Granger 2014) [login to view URL]. The CollGram measures are the average t-score and MI score for all bigrams in the student's text, calculated against a reference text corpus.

I have a set of 300 essays written by students at different levels, and I would like to calculate the CollGram profiles for them based on the COCA corpus (example here: [login to view URL]), which I already have. The lists of bigrams and word frequencies are available on the website, so calculating the MI and t-score for each bigram should not be difficult in concept.

For each pair of words (bigram), two collocability ratios (MI and t) should be calculated from the frequencies of the constituent words. The calculation from the formulas is mathematically simple. I am attaching the formulas as a .docx file.
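The attached .docx is not reproduced in this posting, so the following is only a minimal sketch that assumes the standard textbook definitions: expected frequency E = f(w1) * f(w2) / N, MI = log2(O / E) and t = (O - E) / sqrt(O), where O is the bigram frequency in the reference corpus and N is the corpus size. The function and parameter names are illustrative, and the formulas should be checked against the attachment before use.

```python
import math


def expected_frequency(freq_w1, freq_w2, corpus_size):
    """Frequency the bigram would have if the two words were independent."""
    return freq_w1 * freq_w2 / corpus_size


def mutual_information(freq_bigram, freq_w1, freq_w2, corpus_size):
    """MI = log2(observed / expected). Assumes the standard definition."""
    expected = expected_frequency(freq_w1, freq_w2, corpus_size)
    return math.log2(freq_bigram / expected)


def t_score(freq_bigram, freq_w1, freq_w2, corpus_size):
    """t = (observed - expected) / sqrt(observed). Assumes the standard definition."""
    expected = expected_frequency(freq_w1, freq_w2, corpus_size)
    return (freq_bigram - expected) / math.sqrt(freq_bigram)
```

Both functions expect a bigram frequency greater than zero; bigrams absent from the reference corpus have no score and are reported separately.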

• I have around 300 texts at my disposal, each 20-600 words long, written by students of English (the spelling has been corrected). All bigrams must be extracted from each text (punctuation marks act as bigram boundaries, so no bigram spans a punctuation mark).

• The extracted bigrams must be looked up in the reference list (the COCA corpus described above). If they are found, their two collocability ratios must be retrieved.

• For each text, four lists must be produced: a list of the bigrams that were found (one by tokens and one by types), along with their two collocability ratios, and a list of the bigrams that were not found (again by tokens and by types).

• For each text, the following values must be produced: the average of each of the two ratios, separately for tokens and for types, as well as the percentage of bigrams not found out of the total number of bigrams (for tokens and for types).

• The final step should produce a clear summary table for the whole batch of texts.
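The steps in the list above can be sketched roughly as follows. This is an illustration under assumptions, not a finished tool: the set of punctuation marks treated as boundaries, the lowercasing, the word pattern, and the shape of the `reference` mapping (bigram to an `(MI, t)` pair precomputed from COCA) are all hypothetical choices.

```python
import re
from collections import Counter

# Assumption: these marks end a bigram window; adjust to the project's rules.
BOUNDARY = re.compile(r'[.,;:!?()\[\]"“”]+')
# Assumption: a word is letters/digits, optionally joined by ' or -.
WORD = re.compile(r"[a-z0-9]+(?:['’-][a-z0-9]+)*")


def extract_bigrams(text):
    """Return all bigram tokens, never crossing a punctuation boundary."""
    bigrams = []
    for segment in BOUNDARY.split(text.lower()):
        words = WORD.findall(segment)
        bigrams.extend(zip(words, words[1:]))
    return bigrams


def weighted_mean(pairs):
    """Mean of values given as (value, weight) pairs, or None if empty."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        return None
    return sum(value * weight for value, weight in pairs) / total_weight


def collgram_profile(text, reference):
    """reference maps (w1, w2) -> (mi, t); hypothetical precomputed lookup."""
    counts = Counter(extract_bigrams(text))
    found = {bg: n for bg, n in counts.items() if bg in reference}
    missing = {bg: n for bg, n in counts.items() if bg not in reference}
    tokens, types = sum(counts.values()), len(counts)
    return {
        "found": found,
        "missing": missing,
        # Token averages weight each bigram by its frequency in the text.
        "mean_mi_tokens": weighted_mean([(reference[b][0], n) for b, n in found.items()]),
        "mean_t_tokens": weighted_mean([(reference[b][1], n) for b, n in found.items()]),
        # Type averages count each distinct bigram once.
        "mean_mi_types": weighted_mean([(reference[b][0], 1) for b in found]),
        "mean_t_types": weighted_mean([(reference[b][1], 1) for b in found]),
        "pct_missing_tokens": 100 * sum(missing.values()) / tokens if tokens else None,
        "pct_missing_types": 100 * len(missing) / types if types else None,
    }
```

Calling `collgram_profile` once per essay and collecting the returned dictionaries gives one row per text for the batch table.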

If really necessary, I can provide a [login to view URL] license for POS tagging.

I got samples of output for an analysed file:

1         2          3          4               5   6     7  8
(bigram)  freq_text  freq_COCA  mean freq_COCA  MI  MI>3  t  t>2,54

col 1 - a list of all bigrams retrieved from a learner text (without punctuation marks)

col 2 - frequency of the bigram in a learner text

col 3 - frequency of the bigram in COCA. If blank, the bigram does not occur in COCA.

col 4 - mean frequency of the bigram in COCA per 1 million. If blank, the bigram does not occur in COCA.

col 5 - MI for the bigram calculated based on COCA. If blank, the bigram does not occur in COCA.

col 6 - "*" if MI>3

col 7 - t for the bigram calculated based on COCA. If blank, the bigram does not occur in COCA.

col 8 - "*" if t>2
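As a rough illustration of the eight columns described above, the sketch below builds one output row per bigram. The function name, the `coca` dictionary keys and the two-decimal formatting are assumptions. The t threshold is left as a parameter because the sample header and the column description give different values for it.

```python
def format_row(bigram, freq_text, coca=None, mi_threshold=3.0, t_threshold=2.0):
    """Build the 8 output columns for one bigram as a list of strings.

    coca is None when the bigram does not occur in COCA; otherwise it is a
    dict with the hypothetical keys "freq", "per_million", "mi" and "t".
    """
    if coca is None:
        # Columns 3-8 stay blank for bigrams missing from the reference corpus.
        return [bigram, str(freq_text), "", "", "", "", "", ""]
    return [
        bigram,
        str(freq_text),
        str(coca["freq"]),
        f"{coca['per_million']:.2f}",
        f"{coca['mi']:.2f}",
        "*" if coca["mi"] > mi_threshold else "",
        f"{coca['t']:.2f}",
        "*" if coca["t"] > t_threshold else "",
    ]


def to_tsv(rows):
    """Join rows into tab-separated text that a spreadsheet can open."""
    return "\n".join("\t".join(row) for row in rows)
```

Tab-separated output is only one option; the same rows could be written with the `csv` module or exported to a spreadsheet.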

Input files may also contain tags in angle brackets (<...>), which should be removed.
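A minimal sketch of the tag removal, assuming that a tag is anything between `<` and the next `>` and that only the tag itself is dropped, not the text between an opening and a closing tag. The function name is illustrative.

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Remove <...> tags and collapse the whitespace left behind."""
    # Replace each tag with a space so neighbouring words do not get glued together.
    without_tags = TAG.sub(" ", text)
    return " ".join(without_tags.split())
```

Replacing a tag with a space would split a word that has a tag in the middle of it; if the essays contain such cases, replacing with an empty string may be the better choice.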

I also want to be able to load multiple files. If I load more than one file into the program, I should get the analysis described above for each file, plus cumulative results as:

This is only the beginning. I hope the freelancer who does this will be willing to continue developing the tool in follow-up projects.

Skills: C# Programming, C++ Programming, Java, Perl, Python


About the Employer:
( 164 reviews ) Czestochowa, Poland

Project ID: #10141081

Awarded to:


Hello! i reviewed Quantifying the development of phraseological competence methodology and i want offer you python script what will calculate MI and t-score . So you will just need put all documents to folder, scr…

$631 USD in 10 days
(18 Reviews)

8 freelancers are bidding on average $480 for this job


I am very proficient in c and c++. I have 16 years c++ developing experience now, and have worked for more than 6 years. My work is online game developing, and mainly focus on server side, using c++ under linux environ…

$250 USD in 7 days
(120 Reviews)

Hi, We have a team of Data Mining and Web Scraping experts. We have worked on many Data Mining techniques including Association Rule Mining, Clustering, Outlier Mining, Sentiment Analysis etc extensively in the pas…

$526 USD in 5 days
(76 Reviews)

Hi, client. I am a C++ programmer and mathematician. Please check my Profile/RecordList and tell me details. Looking forward to your response. Thanks.

$1000 USD in 7 days
(26 Reviews)

A proposal has not yet been provided

$350 USD in 7 days
(11 Reviews)

Hi, I have over 5 years of experience in Java development here is my detailed plan. 1. The application will be GUI Application 2. You will have the facility to select an individual file or a group of files or a…

$500 USD in 10 days
(3 Reviews)
$333 USD in 5 days
(3 Reviews)

hi there! I'm software engineer, having skills in python .. as I can see, you need strong parse instrument to work with text files. I advice you to make gui program for that. and I can help you with it in few days or l…

$250 USD in 4 days
(8 Reviews)