I need a server-side PHP/MySQL scraper that can run on shared hosting, so I can load-test one of my websites (I will provide the URL later).
My website is a job board. The job postings all follow one standardized format that does not change from posting to posting. I need the software to go through each page of the search results and scrape the following fields from each job posting into a MySQL database:
~ Job Title
~ Company Name
~ Employee Type
~ Job Type
~ Post Date
~ Contact (if it's listed)
~ Ref ID
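As a rough sketch only, the fields above could map to a MySQL table like the one below. The table name, column names, column sizes, and the extra `id` and `scraped_at` columns are all placeholders that are not part of the brief, and PHP 7.1+ with PDO is assumed.

```php
<?php
// Hypothetical schema: table name, column names, and sizes are illustrative guesses.
const JOB_POSTINGS_SCHEMA = <<<SQL
CREATE TABLE IF NOT EXISTS job_postings (
    id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ref_id        VARCHAR(64)  NOT NULL,
    job_title     VARCHAR(255) NOT NULL,
    company_name  VARCHAR(255) NOT NULL,
    employee_type VARCHAR(64)  NULL,
    job_type      VARCHAR(64)  NULL,
    post_date     DATE         NULL,
    contact       VARCHAR(255) NULL,  -- only listed on some postings
    scraped_at    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY uniq_ref_id (ref_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
SQL;

/** Create the table if it does not exist yet (assumes an open PDO MySQL connection). */
function createJobPostingsTable(PDO $pdo): void
{
    $pdo->exec(JOB_POSTINGS_SCHEMA);
}
```

The unique key on `ref_id` is one way to let the database itself reject a posting that has already been stored.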
I would like to have a settings page with the following options:
~ text box where I can modify the search string/URL without having to go into the code.
~ set maximum number of job posting scrapes per minute
~ set maximum number of scrapes per hour
~ ability to put in a list of proxy servers in a text area
~ set maximum number of scrapes per proxy server
~ "Use Proxy Servers? Yes/No" dropdown. When set to "Yes", the software automatically switches to the next proxy server once the maximum number of scrapes per proxy server is reached. There should be a fail-safe: if a proxy server is unresponsive, slow, or times out, the software should move on to the next proxy server on the list. When set to "No", proxy servers aren't used, and the software just scrapes according to the maximum scrapes per minute and per hour set in the settings above.
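The throttling and proxy-switching settings above could be sketched roughly as follows. Every class, method, and parameter name here is hypothetical, the 15-second timeout is an arbitrary default, and returning `null` once the proxy list is exhausted is an assumption, since the brief does not say what should happen at that point. PHP 7.1+ with the cURL extension is assumed.

```php
<?php
// All names below are hypothetical; this is a sketch, not a finished design.

/** Sliding-window limiter for the per-minute and per-hour caps. */
final class ScrapeLimiter
{
    private $timestamps = []; // Unix times of scrapes within the last hour
    private $perMinute;
    private $perHour;

    public function __construct(int $perMinute, int $perHour)
    {
        $this->perMinute = $perMinute;
        $this->perHour = $perHour;
    }

    /** True if one more scrape at time $now stays within both caps. */
    public function allows(int $now): bool
    {
        $newerThan = function (int $cutoff): array {
            return array_values(array_filter($this->timestamps, function ($t) use ($cutoff) {
                return $t > $cutoff;
            }));
        };
        $this->timestamps = $newerThan($now - 3600); // forget scrapes older than an hour
        return count($newerThan($now - 60)) < $this->perMinute
            && count($this->timestamps) < $this->perHour;
    }

    public function record(int $now): void
    {
        $this->timestamps[] = $now;
    }
}

/** Hands out proxies in list order, switching after a quota or a failure. */
final class ProxyRotator
{
    private $proxies;
    private $maxPerProxy;
    private $index = 0;
    private $used = 0;

    public function __construct(array $proxies, int $maxPerProxy)
    {
        $this->proxies = array_values($proxies);
        $this->maxPerProxy = $maxPerProxy;
    }

    /** Proxy for the next scrape, or null once the list is exhausted (assumption). */
    public function current(): ?string
    {
        if ($this->used >= $this->maxPerProxy) {
            $this->skip();
        }
        return $this->proxies[$this->index] ?? null;
    }

    public function recordScrape(): void
    {
        $this->used++;
    }

    /** Fail-safe: abandon the current proxy, e.g. after a timeout. */
    public function skip(): void
    {
        $this->index++;
        $this->used = 0;
    }
}

/** Fetch a URL, optionally via "host:port" proxy; null on cURL failure or timeout. */
function fetchPage(string $url, string $proxy = '', int $timeoutSeconds = 15): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    if ($proxy !== '') {
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}
```

In a scrape loop, the caller would check `allows(time())` before each request, call `skip()` whenever `fetchPage()` returns `null`, and call `recordScrape()` and `record(time())` after each successful fetch.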
Before scraping a job posting, the software also needs to check whether that posting has already been scraped, by comparing its Ref ID with the ones already in the database. If there's a match, the software should skip it and move on to the next job posting that hasn't been scraped yet.
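One possible way to do this check, sketched under assumptions: the Ref IDs found on a search results page are compared in a single query against a hypothetical `job_postings` table with a `ref_id` column. The function names are illustrative, and PHP 7.1+ with PDO is assumed.

```php
<?php
// Hypothetical helpers; the table and column names are assumptions.

/** Pure comparison step: candidate Ref IDs that are not in the known list. */
function selectUnscraped(array $candidateRefIds, array $knownRefIds): array
{
    $candidates = array_values(array_unique($candidateRefIds));
    return array_values(array_diff($candidates, $knownRefIds));
}

/** Return the Ref IDs from $refIds that are not yet stored in the database. */
function filterUnscrapedRefIds(PDO $pdo, array $refIds): array
{
    $refIds = array_values(array_unique($refIds));
    if ($refIds === []) {
        return []; // nothing to look up; also avoids an empty IN () clause
    }
    // One "?" placeholder per Ref ID, so the values are bound safely.
    $placeholders = implode(',', array_fill(0, count($refIds), '?'));
    $stmt = $pdo->prepare(
        "SELECT ref_id FROM job_postings WHERE ref_id IN ($placeholders)"
    );
    $stmt->execute($refIds);
    $known = $stmt->fetchAll(PDO::FETCH_COLUMN);
    return selectUnscraped($refIds, $known);
}
```

Checking a whole results page in one query keeps the number of database round trips low, which can matter on shared hosting.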
I would prefer that this project be done with PHP/MySQL, utilizing cURL, but I am very open to other suggestions.
Please let me know if you have any questions. I look forward to collaborating with you on this!