Company Name Matching Engine

Cancelled Posted Mar 31, 2012 Paid on delivery

I often need to match company names between two separate, large CSV files. Matching company names well is not a trivial task. Several algorithms and techniques should be considered, including Levenshtein edit distance, Smith-Waterman distance, Jaccard token distance, and weighting common company-name tokens differently from uncommon ones.
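To make two of these measures concrete, here is a minimal sketch in Python. It is an illustration only, not the requested tool: the function names are my own, and the Jaccard version assumes tokens are simply whitespace-separated, lowercased words.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over sets of whitespace-separated tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # two empty names are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Neither measure is sufficient on its own: two names that differ by a single character get an edit distance of 1 even when that character is the only thing that distinguishes two different companies.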

For example, given company names such as:

DSZ Investments, LLC

D.S.Z Investment Company

DSZ Investments, L.L.C

DSG Investments, LLC

The first three should be considered the same company, but the fourth should be considered a separate company even though its edit distance from the first is very small. The common token "Company" has to carry very low weight in the match, whereas the uncommon token "DSG" must carry a much heavier weight due to its rarity.
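One way to express this weighting is an inverse-document-frequency style weight per token, combined with a weighted Jaccard similarity. The sketch below is only a rough illustration under several assumptions: the normalization rules (dropping periods, a crude plural strip), the smoothing formula, and the extra corpus names beyond the four above are all hypothetical. In practice the token frequencies would come from the real CSV files.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Normalize a company name into tokens (assumed rules, not a standard)."""
    # Dropping periods collapses "D.S.Z" to "dsz" and "L.L.C" to "llc".
    text = name.lower().replace(".", "")
    tokens = re.findall(r"[a-z0-9]+", text)
    # Crude plural strip so "investments" matches "investment".
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def document_frequencies(corpus):
    """Count how many names each token appears in."""
    df = Counter()
    for name in corpus:
        df.update(set(tokenize(name)))
    return df


def token_weight(token, df, n_docs):
    """Smoothed IDF: rare tokens weigh more, ubiquitous tokens approach zero."""
    return math.log((n_docs + 1) / (df[token] + 1))


def weighted_similarity(a, b, df, n_docs):
    """Weighted Jaccard: weight of shared tokens over weight of all tokens."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    shared = math.fsum(token_weight(t, df, n_docs) for t in ta & tb)
    total = math.fsum(token_weight(t, df, n_docs) for t in ta | tb)
    if total == 0.0:
        return 1.0 if ta == tb else 0.0
    return shared / total


NAMES = [
    "DSZ Investments, LLC",
    "D.S.Z Investment Company",
    "DSZ Investments, L.L.C",
    "DSG Investments, LLC",
]
# Hypothetical filler names so that "Company" and "LLC" look common,
# as they would in a large real file.
CORPUS = NAMES + [
    "Acme Trading Company",
    "Northwind Company",
    "Initech, LLC",
    "Globex Investment Company",
]
DF = document_frequencies(CORPUS)

if __name__ == "__main__":
    for other in NAMES[1:]:
        score = weighted_similarity(NAMES[0], other, DF, len(CORPUS))
        print(f"{NAMES[0]!r} vs {other!r}: {score:.3f}")
```

With this toy corpus, the first name scores higher against the second and third names than against the DSG name, because the mismatched rare tokens "dsz" and "dsg" outweigh the shared common ones. A real tool would still need a cutoff threshold, tuned against known matches.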

I have attached a highly relevant document to this post. The principles it describes should be codified and integrated into the project.

Experience with this type of matching, or with designing these types of algorithms, would be very helpful. I work in a Unix environment and am looking for a command-line tool that can be run from the bash shell.

Please review the attached document and let's get the conversation going. Canned replies will be ignored.

Thanks for your interest in this project.

Script Install Shell Script

Project ID: #2727519

About the project

1 proposal Remote project Active Apr 22, 2012

1 freelancer is bidding on average $636 for this job

AnkSoftware

See private message.

$635.8 USD in 20 days
(4 Reviews)
5.0