Scrape Reddit in Java/Perl, dump into SQL tables. See the detailed description below.
## Deliverables
**The task is to scrape popular content from [login to view URL]:**
1. If you go to <[login to view URL]> you can see a list of the popular reddit topics sorted by subscriber count.
2. We want to scrape the following topics.
a. ALL the topics from the top 50 EXCEPT the following:
announcements
blog
askreddit
Iama
[[login to view URL]][1]
bestof
sex
minecraft
doesanybodyelse
trees
skyrim
explainlikeimfive
truereddit
b. IN ADDITION, we want to scrape the following topics that are outside the top 50:
gadgets
LifeProTips
wikipedia
environment
cooking
history
art
games
philosophy
photography
sports
math
health
seduction
psychology
c. Also from the [[login to view URL]][1] website we want the following feeds:
all
random
That's 54 topics in all.
3. For all the topics/feeds mentioned in 2, we want 1000 URLs each from the
"top scoring" / "links from this month" category. For example, for the topic "funny",
this would be 1000 URLs from this feed: [[login to view URL]funny/top/?sort=top&t=month][2]
(If you don't find 1000 URLs in the last month, we may have to fall back to top in the year.)
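Collecting 1000 URLs per feed means paginating: Reddit's public JSON API serves each listing page with up to 100 items (the `limit` parameter) and an `after` token naming the last item on the page, so roughly 10 pages per topic. A minimal sketch of building those page URLs (class and method names are placeholders, not part of any Reddit SDK):

```java
// Sketch: building paginated Reddit listing URLs for "top of the month".
// Reddit's JSON API returns at most 100 links per page and an "after"
// token (the fullname of the last item) to request the next page.
public class ListingUrlBuilder {
    static final String BASE = "https://www.reddit.com/r/";

    // URL for one page of a subreddit's top-of-month feed in JSON form.
    static String topOfMonthUrl(String topic, String after) {
        String url = BASE + topic + "/top/.json?sort=top&t=month&limit=100";
        if (after != null) {
            url += "&after=" + after;  // e.g. "t3_abc123" from the previous page
        }
        return url;
    }

    public static void main(String[] args) {
        // first page, then a follow-up page using a sample "after" token
        System.out.println(topOfMonthUrl("funny", null));
        System.out.println(topOfMonthUrl("funny", "t3_abc123"));
    }
}
```

The scraper would loop, reading the `data.after` field from each JSON response and stopping once 1000 URLs are collected or `after` comes back null.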
4. Store your results in a MySQL table with the following schema:
<id>, <url>, <topic name>, <score count>, <comment count>, <date submitted if available>
All the URLs from all the topics will be stored in this single table. This populated table
is the main deliverable, along with your scripts. The table will be populated on the Amazon machine
you bring up (as per 5 below), and you will copy the populated table to our server (server details
will be provided closer to task completion). So this table will have 54 topics × 1000 URLs = 54,000 rows in all.
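A possible MySQL rendering of that schema (column names and types are suggestions, not a spec; `date_submitted` is nullable since the date may not always be available):

```sql
-- Sketch of the deliverable table; adjust names/types as needed.
CREATE TABLE reddit_urls (
    id             INT UNSIGNED  NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url            VARCHAR(2048) NOT NULL,
    topic_name     VARCHAR(64)   NOT NULL,
    score_count    INT           NOT NULL,
    comment_count  INT           NOT NULL,
    date_submitted DATETIME      NULL  -- NULL when not available
);
```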
5. Your table and scripts should reside and run on the Amazon machine(s) you bring up for this task.
Ping Nick (cc'ed) regarding what you need and he'll give you our Amazon account details
to bring up the machines.
6. You may have to deal with rate limiting or throttling by reddit, so be prepared for this.
You may need multiple machines for the crawl/screen scrape, and you may need
multithreading to make your scripts finish in time.
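One common way to handle throttling is exponential backoff on failed or rate-limited requests. The base delay and cap below are assumptions to tune against observed behavior (Reddit's published guidance for unauthenticated clients has been on the order of one request every two seconds):

```java
// Sketch of an exponential-backoff helper for rate-limited (HTTP 429 /
// throttled) responses. Constants are assumptions, not Reddit-mandated values.
public class Backoff {
    static final long BASE_MS = 2000;   // assumed ~1 request per 2s baseline
    static final long CAP_MS  = 60000;  // never sleep longer than a minute

    // Delay before retry number `attempt` (0-based): BASE * 2^attempt, capped.
    static long delayMs(int attempt) {
        long d = BASE_MS << Math.min(attempt, 20);  // clamp shift to avoid overflow
        return Math.min(d, CAP_MS);
    }

    public static void main(String[] args) {
        // delays grow 2s, 4s, 8s, 16s, 32s, then cap at 60s
        for (int attempt = 0; attempt < 7; attempt++) {
            System.out.println(delayMs(attempt));
        }
    }
}
```

For scale: even at one request every two seconds, 54 topics × 10 pages (100 links per page) is about 540 requests, roughly 18 minutes of fetching, so the one-day budget in 7 should be comfortable unless HTML screen scraping forces many more requests.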
7. We want your script to finish running in less than a day.
8. You may need to do more research on the API documentation, but here is reddit's documentation:
[[login to view URL]reddit/wiki/API][3]. Figure out the most efficient way to do this: the API (if it exists),
or if not, screen scraping.
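One detail worth noting from Reddit's API rules: clients are expected to send a unique, descriptive `User-Agent` header, and generic or missing user agents are throttled much more aggressively. A sketch using the JDK's `java.net.http` client (Java 11+; the client name and contact address below are placeholders):

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: building a Reddit API request with a descriptive User-Agent,
// per Reddit's API access rules. The UA string here is a placeholder.
public class RequestBuilder {
    static HttpRequest listingRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "topic-scraper/0.1 (contact: you@example.com)")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = listingRequest(
                "https://www.reddit.com/r/gadgets/top/.json?t=month&limit=100");
        System.out.println(req.uri());
        System.out.println(req.headers().firstValue("User-Agent").orElse(""));
    }
}
```

The actual fetch would go through `java.net.http.HttpClient.send(...)` with this request, checking for 429 responses before parsing the JSON body.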
9. Use Java/Perl (preferred) on a Linux machine (preferred). Let us know if you HAVE to use something else
and we'll evaluate.