C code to index large text library and find similar -- 2

$200-600 USD

Closed

Posted

over 5 years ago

Paid on delivery

I need a mini-app (compiled C on Linux) that groups similar sentences together. I have 100,000 sentences (say, in a PostgreSQL DB, Unicode text). It must perform VERY fast: index each root word to a 16-bit integer (which reduces the memory footprint), then build a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar length, then iterate through them doing word-by-word comparisons (16-bit comparisons).

Two algos are acceptable: 1. Simple: take a source sentence and iterate through, XORing word by word (irrespective of word order or word frequency). If more than x words are outstanding, it is NOT a similar sentence. Here x would be 25% of the total number of words. We leave such a large gap so that we don't need to worry about word roots. On the resulting smaller data set, we then do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. This is a character-by-character comparison.

The app should read from a folder of .gz files that contain the text, and it could use a text boundary to distinguish each sentence. The output would need to be a new text file that sorts every sentence into groups of similarity, separated by a text boundary.

I need something very soon. A mediocre algorithm is fine. To be awarded: explain in 1-2 sentences your proposed approach, and bid a base amount plus a bonus on completion. Come in cheap, and get the big reward after you have delivered.
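For bidders who want something concrete to react to, the two-stage comparison in the brief could be sketched roughly as below. This is only an illustration under stated assumptions, not the required design: the function names (unmatched_words, is_candidate, bounded_levenshtein) are made up for this sketch, each sentence's word IDs are assumed to be pre-sorted (which discards word order), "total words" is read as the combined word count of both sentences, repeated words are counted individually, and the character stage compares bytes, which only approximates characters for multi-byte Unicode text.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stage 1: count word IDs that occur in only one of the two sentences.
 * Assumption: both ID arrays are sorted ascending beforehand. */
size_t unmatched_words(const uint16_t *a, size_t na,
                       const uint16_t *b, size_t nb)
{
    size_t i = 0, j = 0, unmatched = 0;
    while (i < na && j < nb) {
        if (a[i] == b[j])     { i++; j++; }
        else if (a[i] < b[j]) { i++; unmatched++; }
        else                  { j++; unmatched++; }
    }
    return unmatched + (na - i) + (nb - j);
}

/* Candidate if at most 25% of the combined words are unmatched
 * (one possible reading of "x = 25% of the total words"). */
int is_candidate(const uint16_t *a, size_t na,
                 const uint16_t *b, size_t nb)
{
    return unmatched_words(a, na, b, nb) * 4 <= na + nb;
}

/* Stage 2: Levenshtein distance with an upper bound.
 * Returns the distance if it is <= max_dist, otherwise max_dist + 1.
 * Stops as soon as every entry of the current row exceeds max_dist,
 * because the final distance can never drop below a row's minimum. */
size_t bounded_levenshtein(const char *s, const char *t, size_t max_dist)
{
    size_t ls = strlen(s), lt = strlen(t);
    size_t diff = ls > lt ? ls - lt : lt - ls;
    if (diff > max_dist)
        return max_dist + 1;          /* length gap alone is too large */

    size_t *row = malloc((lt + 1) * sizeof *row);
    if (!row)
        return max_dist + 1;          /* treat allocation failure as "not similar" */
    for (size_t j = 0; j <= lt; j++)
        row[j] = j;

    for (size_t i = 1; i <= ls; i++) {
        size_t diag = row[0];         /* value of D[i-1][j-1] */
        row[0] = i;
        size_t row_min = row[0];
        for (size_t j = 1; j <= lt; j++) {
            size_t up = row[j];       /* value of D[i-1][j] */
            size_t best = diag + (s[i - 1] == t[j - 1] ? 0 : 1);
            if (up + 1 < best)         best = up + 1;
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
            if (best < row_min) row_min = best;
        }
        if (row_min > max_dist) {     /* early exit: bound already exceeded */
            free(row);
            return max_dist + 1;
        }
    }
    size_t d = row[lt];
    free(row);
    return d > max_dist ? max_dist + 1 : d;
}
```

A caller would run is_candidate first and only pass surviving pairs to bounded_levenshtein, with max_dist set to roughly 10% of the sentence length in bytes, as the brief suggests.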

Project ID: 17629535

About the project

11 proposals

Remote project

Active 6 yrs ago

11 freelancers are bidding on average $422 USD for this job

@hbxfnzwpf

I am very proficient in C and C++. I have 16 years of C++ development experience, and have worked for more than 7 years. My work is online game development, mainly focused on the server side, using C++ under Linux. I have built many great projects in C++; for example, I made a tool that converts Java code into C++, garbage collection included, which was very similar to a compiler and very complex. I also made our own mobile game in C++, and I can show you a demo of the client if you like. I am also very proficient in Java. I have very good reviews on Freelancer.com and never miss a project once I accept the job; you can check my reviews. Trust me, please let an expert help you.

$400 USD in 5 days

4.9

(202 reviews)

7.3

@dinhfreedom

Dear Sir, your project attracted my attention at first glance, because I have extensive experience in C programming. I'm really confident about your project and very eager to join it. If we have a chance to cooperate, I'll do my best to provide a wonderful result. Looking forward to your response. Best regards.

$400 USD in 10 days

4.8

(78 reviews)

6.7

@erShashi

Hi, I must say this is a very interesting and challenging project. I have done some work on a similar project and researched how Twitter search works at large volume. I would suggest the Lucene search library to create indexes over your data, and then implementing an algorithm to get appropriate results. Lucene works very well with large volumes if the chosen indexes are good. About me: I am an ex-Microsoft employee with 8+ years of experience in software development and customization using a wide range of Microsoft technologies (C#, ASP.NET, MVC, WPF, Windows Forms, SQL databases, Azure, Sync Framework, etc.), mobile technologies (Android, Xamarin), and the server-side languages Node.js and Golang. Since I have previous experience with such applications, I think it will help in this project, if selected. For my previous work, you can visit my profile to see feedback from previous employers. Let me know if you find me suitable for this project, and share complete details. If you want more details, we can discuss over chat or Skype. Regards, Shashi

$588 USD in 25 days

4.9

(42 reviews)

5.4

@freelancerSolvit

.................................................................................................................................................................................................................

$444 USD in 10 days

5.0

(32 reviews)

4.8

@mdolgun

Hello, I am an expert in C/C++/Python/data structures/algorithms. For word indexing, I propose using a trie structure (character tree). Leaf nodes would carry the index value. We could also use a hash table for indexing, but a trie has the advantage of allowing approximate matching with up to n differences (insert/delete/substitute a character). After word indexing, we can use another trie structure (word tree) to find the most similar sentence, again with up to n differences (insert/delete/substitute a word). If such a sentence is found, the new sentence is inserted into its leaf bucket. Note that this is not an optimal algorithm, because an optimal algorithm like hierarchical clustering has O(n^3) complexity, which is infeasible for large data sets. We can talk about the details of the input/output format. I can deliver working code in 3 days, but for performance optimization I need 7 days. I suggest two milestones: a working program (3 days) and an optimized program (7 days). Best regards
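The trie-based word indexing proposed in this bid might look something like the following minimal sketch. It covers exact lookup only (the approximate matching and the word-level trie are left out), and the names WordIndex, word_index_get and NO_ID are hypothetical, chosen for this illustration. It assumes words arrive as NUL-terminated byte strings and uses a 256-way child array per node, which is simple but memory-hungry.

```c
#include <stdint.h>
#include <stdlib.h>

#define NO_ID 0xFFFFu   /* sentinel: no word ends at this node */

typedef struct TrieNode {
    struct TrieNode *child[256];   /* one slot per byte value */
    uint16_t id;                   /* 16-bit word index, or NO_ID */
} TrieNode;

typedef struct {
    TrieNode *root;
    uint16_t next_id;              /* next unused word index */
} WordIndex;

static TrieNode *node_new(void)
{
    TrieNode *n = calloc(1, sizeof *n);   /* children start out NULL */
    if (n)
        n->id = NO_ID;
    return n;
}

/* Returns 1 on success, 0 if the root node could not be allocated. */
int word_index_init(WordIndex *wi)
{
    wi->root = node_new();
    wi->next_id = 0;
    return wi->root != NULL;
}

/* Look up a word, assigning the next free 16-bit ID on first sight.
 * Returns NO_ID on allocation failure or once 65,535 IDs are used up. */
uint16_t word_index_get(WordIndex *wi, const char *word)
{
    TrieNode *n = wi->root;
    for (const unsigned char *p = (const unsigned char *)word; *p; p++) {
        if (!n->child[*p]) {
            n->child[*p] = node_new();
            if (!n->child[*p])
                return NO_ID;
        }
        n = n->child[*p];
    }
    if (n->id == NO_ID) {
        if (wi->next_id == NO_ID)
            return NO_ID;          /* ID space exhausted */
        n->id = wi->next_id++;
    }
    return n->id;
}

static void node_free(TrieNode *n)
{
    if (!n)
        return;
    for (int c = 0; c < 256; c++)
        node_free(n->child[c]);
    free(n);
}

void word_index_free(WordIndex *wi)
{
    node_free(wi->root);
    wi->root = NULL;
}
```

Note that a word can end at an inner node (for example "ca" inside "cat"), so in this sketch the ID is stored at whichever node terminates the word rather than strictly at leaves. A hash table would use less memory for the same exact-match job; the trie mainly pays off if the approximate matching described in the bid is added.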

$200 USD in 7 days

4.8

(5 reviews)

4.7

@magadhmindslx

Dear Sir, I have gone through the project description and am interested in taking it up. The posted bid amount is indicative; I can give a more accurate one once more details are shared. Looking forward to hearing from you. Thanks

$200 USD in 10 days

5.0

(15 reviews)

3.2

@mbenkendorf

Dear Employer, due to my own interest in such natural language processing problems, I have already developed your described approach into a first unoptimized prototype to see how fast it can process and group 100k sentences. Since there was no sample data attached, I took the first 100,000 sentences of Shakespeare's works. It takes about 120 seconds on my machine (Intel Core i7-6700) to group the 100k sentences, mainly because some buckets of sentences with the same length had 10,000-15,000 entries (I don't know how the production data is structured in comparison). Perhaps something like cosine similarity can bring an improvement in speed. Otherwise, my approach is very simple and straightforward: the input sentences are loaded into an in-memory structure, then the sentences are assigned to buckets determined by their length, so that each unprocessed sentence only needs to be compared to the sentences in the bucket its length falls into (plus maybe the one below and the one above). Best regards
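The length bucketing described in this bid could be sketched along these lines. This is not the bidder's actual prototype, which was not posted; the names Buckets, buckets_build and bucket_range are invented for the example, sentence length is assumed to be a word count, and "the one below and above" is taken to mean lengths L-1 through L+1.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t *start;    /* start[k] = number of sentences shorter than k;
                         has max_len + 2 entries */
    size_t *order;    /* sentence indices grouped by ascending length */
    size_t max_len;
} Buckets;

void buckets_free(Buckets *b)
{
    free(b->start);
    free(b->order);
    b->start = NULL;
    b->order = NULL;
}

/* Group n sentences by length with a counting sort.
 * len[i] is the length of sentence i. Returns 1 on success, 0 on failure. */
int buckets_build(Buckets *b, const size_t *len, size_t n)
{
    size_t max_len = 0;
    for (size_t i = 0; i < n; i++)
        if (len[i] > max_len)
            max_len = len[i];

    b->max_len = max_len;
    b->start = calloc(max_len + 2, sizeof *b->start);
    b->order = malloc((n ? n : 1) * sizeof *b->order);
    size_t *fill = calloc(max_len + 1, sizeof *fill);
    if (!b->start || !b->order || !fill) {
        free(fill);
        buckets_free(b);
        return 0;
    }

    for (size_t i = 0; i < n; i++)          /* count sentences per length */
        b->start[len[i] + 1]++;
    for (size_t k = 1; k <= max_len + 1; k++)   /* prefix sums -> offsets */
        b->start[k] += b->start[k - 1];
    for (size_t i = 0; i < n; i++)          /* place each sentence index */
        b->order[b->start[len[i]] + fill[len[i]]++] = i;

    free(fill);
    return 1;
}

/* Candidates for a sentence of the given length are
 * order[*first] .. order[*last - 1]: lengths length-1 to length+1. */
void bucket_range(const Buckets *b, size_t length,
                  size_t *first, size_t *last)
{
    size_t lo = length > 0 ? length - 1 : 0;
    size_t hi = length + 1 > b->max_len ? b->max_len : length + 1;
    if (lo > b->max_len)
        lo = b->max_len + 1;                /* beyond longest: empty range */
    *first = b->start[lo];
    *last = b->start[hi + 1];
}
```

Bucketing narrows the candidate set but does not remove the pairwise cost inside a crowded bucket, which matches the bidder's observation that buckets with 10,000-15,000 entries dominated the run time.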

$444 USD in 3 days

5.0

(4 reviews)

3.4

@TobiObadiah

Hi there, interesting project you have there. Here is my approach. I have a data structure library in C which is still in development but will meet this project's needs, as some of the data structures have already been implemented. The program will read the sentences into a list or stack. The goal will be to optimize the comparison (of words) of the sentences. Indexing root words will not be a problem, as lots of ways already come to mind, such as hashing. I am pretty confident in my approach to solving this.

$300 USD in 4 days