We need an experienced MySQL 4.1.14 & PHP 4.3.9 developer to write a basic web crawler that uses a MySQL DB. We have a PHP script that parses web pages, but it must be changed to save data to MySQL.

PHP web crawler (see diagram)

- Written in PHP 4.3.9 using OOP, must be flexible & well documented, must return success or failure outcome & fail gracefully: doesn’t break if errors occur & returns error status

Running our PHP script

* List of URLs to be crawled created & passed to Queue each time script runs
* Date & time of each URL’s successful completion is recorded in DB

Queue

- FIFO Queue of URLs to be crawled. Rule for adding to Queue: URL has not been crawled before OR URL was last crawled over [60] days ago
- [60] day time frame must be flexible so it’s easy to change: NO hardcoding
- Each item must have status field: Empty status (not been touched), Pending status (currently processing), Failed status (processing failed). Only process Queue items w/ Empty or Failed status
- If script succeeds, remove URL & place in Archives. If script fails, URL stays in Queue to be re-crawled later. Our script saves current state in temp files on failure so re-crawling can resume using same state, but it needs to be changed to save state to DB

Scheduler

- Use Linux crond daemon to run web crawler every [30] seconds (NO PHP daemon). [30] second time frame must be flexible so it’s easy to change: NO hardcoding
- Scheduler to be optimized to run max of [2] concurrent sessions of our PHP script. Max concurrent sessions must be flexible so it’s easy to change: NO hardcoding
- Scheduler starts new crawling session IF: we haven't reached max concurrent sessions AND Queue has URL with Empty status

Build MySQL 4.1.14 DB

- DATA MODEL DESIGN MUST BE APPROVED BEFORE ANY DB WORK IS STARTED. Data model will incl. these entities:
  * Queue
  * Archive of successfully completed items
  * Meta data we parse out when crawling
  * Crawling session state (only used when crawling fails & current state is saved)
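
To illustrate the Scheduler requirements above, here is a minimal shell sketch of one possible approach, not a definitive design. A crontab entry fires at most once per minute, so a [30] second interval would probably need a small wrapper script that crond starts every minute and that launches sessions at offsets within that minute. The names `CRAWL_INTERVAL_SECONDS`, `CRAWL_MAX_SESSIONS`, `start_offsets`, and `should_start_session` are hypothetical and chosen only for illustration. The sketch covers just the configurable values and the start rule.

```shell
#!/bin/sh
# Hypothetical helpers for a wrapper script that crond runs once per minute.
# All names and default values here are illustrative assumptions.

# Tunables are read from the environment with defaults, so nothing is hardcoded.
CRAWL_INTERVAL_SECONDS="${CRAWL_INTERVAL_SECONDS:-30}"
CRAWL_MAX_SESSIONS="${CRAWL_MAX_SESSIONS:-2}"

# Print the second offsets within one minute at which a session may start.
# $1 = interval in seconds.
start_offsets() {
    interval="$1"
    # Guard against zero or negative intervals, which would loop forever.
    [ "$interval" -ge 1 ] || return 1
    offset=0
    while [ "$offset" -lt 60 ]; do
        printf '%s\n' "$offset"
        offset=$((offset + interval))
    done
}

# Succeed (exit status 0) only if a new crawling session may start.
# $1 = sessions currently running
# $2 = number of Queue items with Empty status
# $3 = maximum concurrent sessions
should_start_session() {
    [ "$1" -lt "$3" ] && [ "$2" -gt 0 ]
}
```

A wrapper built on these helpers could loop over `start_offsets "$CRAWL_INTERVAL_SECONDS"`, sleep until each offset, and call `should_start_session` with the current session count and Queue count before launching the PHP script. How those two counts are obtained (for example a lock table in the DB or a process listing) is left open here and would depend on the approved data model.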
## Deliverables
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
## Platform
LAMP environment