I work with words and data, especially recalcitrant, stubborn text data.
Most of the trouble people have with screen scrapers comes down to two things:
- Regular Expressions
- Processing of Streams
I'm experienced in both these areas.
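As a minimal illustration of how the two fit together, here is a sketch in Python. The pattern, the function name `extract_rows` and the sample markup are all made up for the example; a real site would need its own pattern, and probably a proper HTML parser for anything heavily nested.

```python
import io
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical pattern: pulls (label, value) out of rows shaped like
# '<td class="name">Alpha</td><td>12</td>'. Not taken from any real site.
ROW = re.compile(
    r'<td class="name">\s*(?P<label>[^<]+?)\s*</td>\s*'
    r'<td>\s*(?P<value>\d+)\s*</td>'
)

def extract_rows(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (label, value) pairs one line at a time.

    Works on any iterable of lines (a file, a socket wrapper, a list),
    so the whole page never has to sit in memory at once.
    """
    for line in lines:
        for match in ROW.finditer(line):
            yield match.group("label"), int(match.group("value"))

# Stand-in for a downloaded page; StringIO iterates line by line like a file.
sample = io.StringIO(
    '<tr><td class="name">Alpha</td><td>12</td></tr>\n'
    '<tr><td>noise</td></tr>\n'
    '<tr><td class="name"> Beta </td><td> 7 </td></tr>\n'
)
rows = list(extract_rows(sample))
print(rows)
```

The generator is the stream-processing half: it only looks at one line at a time, so the same function can be pointed at a saved file or a live response without changes.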
Around the time of the 2000 Olympics, the only source for the data was the website, which was deliberately obfuscated. I scraped all the current data from that website and combined it with historical figures for presentation in Business Intelligence tools.
I used tools like cURL, Perl and Mech for this back then, but I work day-to-day in Python and can use the modern equivalents.
I will do the work in half a day of core time, and you will get at least some material scraped, reflecting the priorities you set. About 100 pages of content should serve as an introduction, and I will be able to scale up for later work.
PS. If you are after tens of thousands of pages scraped, I believe you will want to look elsewhere.