
Scrape Reddit

$100-350 USD

Cancelled
Posted about 12 years ago

Paid on delivery
Scrape Reddit in Java/Perl and dump the results into SQL tables. See the detailed description below.

## Deliverables

**The task is to scrape popular content from [login to view URL]:**

1. If you go to <[login to view URL]> you can see a list of the popular reddit topics sorted by subscriber count.
2. We want to scrape the following topics.
   a. ALL the topics from the top 50 EXCEPT the following: announcements, blog, askreddit, Iama, [[login to view URL]][1], bestof, sex, minecraft, doesanybodyelse, trees, skyrim, explainlikeimfive, truereddit
   b. IN ADDITION, we want to scrape the following topics that are outside the top 50: gadgets, LifeProTips, wikipedia, environment, cooking, history, art, games, philosophy, photography, sports, math, health, seduction, psychology
   c. Also from the [[login to view URL]][1] website we want the following feeds: all, random

   That's 54 topics in all.
3. For all the topics/feeds mentioned in 2, we want 1000 URLs each from the "top scoring" and "links from this month" category. For example, for the topic "funny", this would be 1000 URLs from this feed: [[login to view URL]funny/top/?sort=top&t=month][2] (If you don't find 1000 URLs in the last month, we may have to go to top in the year.)
4. Store your results in a MySQL table with the following schema: `<id>, <url>, <topic name>, <score count>, <comment count>, <date submitted if available>`. ALL the URLs from all the topics are stored in this single table. This populated table is the main deliverable, along with your scripts. The table will be populated on the Amazon machine you bring up (as per 5 below), and you will copy the populated table to our server (our server details will be provided closer to task completion). The table will have 54 topics x 1000 URLs = 54,000 rows in all.
5. Your table and scripts should reside and run on the Amazon machine(s) you bring up for this task. Ping Nick (cc'ed) regarding what you want and he'll give you our Amazon account details to bring up the machines.
6. You may have to deal with rate limiting or throttling by reddit, so be prepared for this. You may need to use multiple machines to do the crawl/screen scrape. You may need to use multithreading to make your scripts finish in time.
7. We want your script to finish running in less than a day.
8. You may need to do more research on the API documentation, but here is reddit's documentation: [[login to view URL]reddit/wiki/API][3] Figure out the most efficient way to do this: the API (if it exists) or, if not, a screen scrape.
9. Use Java/Perl (preferred) on a Linux machine (preferred). Let us know if you HAVE to use something else and we'll evaluate.
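The figures in the description (54 topics, 1000 URLs each, a one-day time limit, possible rate limiting) suggest a rough request budget. The Java sketch below works through that arithmetic and shows how paged listing URLs might be built. It is an illustration under assumptions, not the client's method: the page size of 100 links per request, the 2-second delay between requests, and the `limit` and `after` query parameters are assumptions to check against reddit's API documentation. `CrawlPlan`, `requestsNeeded`, `minimumSeconds`, and `pageUrl` are hypothetical names, and the URL prefix is passed in by the caller because the posting does not show it.

```java
/** Rough crawl budget for the job described above. Names and constants are illustrative. */
final class CrawlPlan {
    static final int TOPICS = 54;            // from the posting
    static final int URLS_PER_TOPIC = 1000;  // from the posting
    static final int PAGE_SIZE = 100;        // ASSUMPTION: links returned per listing request
    static final long DELAY_MS = 2000;       // ASSUMPTION: polite delay between requests

    /** Total listing requests: pages per topic (rounded up) times number of topics. */
    static int requestsNeeded(int topics, int urlsPerTopic, int pageSize) {
        int pagesPerTopic = (urlsPerTopic + pageSize - 1) / pageSize;
        return topics * pagesPerTopic;
    }

    /** Lower bound on single-threaded run time, counting only the delay between requests. */
    static long minimumSeconds(int requests, long delayMs) {
        return requests * delayMs / 1000;
    }

    /**
     * Builds the URL for one page of a topic's "top this month" feed.
     * topicPrefix is whatever precedes the topic name in the feed URL (not shown in the posting).
     * afterToken is the paging cursor from the previous page, or null for the first page.
     * ASSUMPTION: the listing accepts "limit" and "after" query parameters.
     */
    static String pageUrl(String topicPrefix, String topic, String afterToken) {
        StringBuilder url = new StringBuilder(topicPrefix)
                .append(topic)
                .append("/top/?sort=top&t=month&limit=")
                .append(PAGE_SIZE);
        if (afterToken != null) {
            url.append("&after=").append(afterToken);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        int requests = requestsNeeded(TOPICS, URLS_PER_TOPIC, PAGE_SIZE);
        System.out.println("Expected rows: " + (TOPICS * URLS_PER_TOPIC));
        System.out.println("Listing requests: " + requests);
        System.out.println("Minimum seconds at one request per "
                + DELAY_MS + " ms: " + minimumSeconds(requests, DELAY_MS));
        // "PREFIX/" is a placeholder; substitute the real feed prefix.
        System.out.println(pageUrl("PREFIX/", "funny", null));
    }
}
```

If those assumptions hold, the whole crawl is a few hundred requests, so a single machine with a fixed delay would fit comfortably inside the one-day limit, and multithreading or extra machines would only matter if the real limits turn out to be much stricter.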
Project ID: 2714705

About the project

Remote project
Active 12 yrs ago


About the client

Mountain View, United States
5.0
230
Member since Apr 12, 2008
