
Data crunching: 150 mailing list text files to UTF8 MySQL database

$30-50 USD

Completed
Posted almost 19 years ago

Paid on delivery
Goal: A PHP script that converts about 150 text files, containing posts from 10 years of 2 newsgroups, into a single MySQL table in UTF8 format for use with a search engine. All files to convert will be given at the start of the project.

Details: I'm trying to create a single MySQL table from 10 years of newsgroup postings. The postings are spread over 150 text files, each file containing a month of posts. There are a few issues, such as encoding and other conditions, described in the deliverables.

## Deliverables

A PHP script that will convert all .txt files in the same directory to a single MySQL table dump file with the following fields:

(1) sequential ID of post
(2) mailing list ID
(3) name of author
(4) email address of author
(5) date/time of post
(6) actual encoding of original post
(7) title of post
(8) full text of post
(9) full text of post with quoted text removed (for searching)

Issues:

(a) Because several mailing list systems were used, the format by which each post is separated and the format of the headers of each post differ. There are maybe 5 such formats in total. As an example, some of the files needing conversion are here: [login to view URL]

(b) The posts are mostly in the SJIS encoding. However, several are in EUC or ISO 2022-JP. The _actual_ encoding of each post needs to be checked, and the post needs to be converted to UTF8 before being stored in the database. This may be the trickiest part of the project, so make sure that you are comfortable with multi-byte Japanese encodings. For example, if you open one of the files found at the above website in a Web browser, some will only render properly when SJIS is selected as the encoding. Others will only render properly when ISO 2022-JP is selected as the encoding. The actual encoding for each post needs to be figured out and stored as field (6).

(c) All email addresses need to be obscured. For example, "someguy[at][login to view URL]" would need to be changed to "someguy[at]g...". This applies to the email address field (4) as well as to both full text fields (8) and (9). Note that [at] has been used here in place of the at sign, due to the RAC site restrictions.

(d) The dates and times for all posts need to be unified to the format used by MySQL, for sorting. This is stored in field (5).

(e) The full text field without quoted portions (9) is the same as the original text, but with all lines beginning with ">" removed, as well as all lines following a line containing "----- Original Message -----". You will need to be creative to find a good way to remove these portions, but catching 95% of them is acceptable.

## Platform

PHP 5
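Issues (b) through (e) can be illustrated with a short sketch. The posting asks for a PHP 5 script; the Python below is not that deliverable, only one possible way to express the per-post logic. All function names are hypothetical, the trial order of the encodings is an assumption, the date handling assumes an RFC 2822 style `Date:` header (the posting says the header formats differ), and converting times to UTC is a choice made here for sortability rather than something the posting specifies.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime

# Trial order is an assumption. EUC-JP goes before Shift_JIS because most
# Shift_JIS lead bytes (0x81-0x9F) are invalid in EUC-JP, whereas EUC-JP
# bytes frequently decode without error as Shift_JIS garbage.
CANDIDATES = [("EUC-JP", "euc_jp"), ("SJIS", "cp932")]


def detect_and_decode(raw):
    """Return (encoding label, decoded text) for one post's raw bytes."""
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
    if b"\x1b" in raw:
        try:
            return "ISO-2022-JP", raw.decode("iso2022_jp")
        except UnicodeDecodeError:
            pass
    try:
        return "US-ASCII", raw.decode("ascii")
    except UnicodeDecodeError:
        pass
    for label, codec in CANDIDATES:
        try:
            return label, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to the most common encoding, lossily.
    return "unknown", raw.decode("cp932", errors="replace")


# Keeps the local part, the at sign, and the first character of the domain.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9])[A-Za-z0-9.-]*")


def obscure_emails(text):
    """Issue (c): replace everything after the domain's first character."""
    return EMAIL_RE.sub(r"\1@\2...", text)


def strip_quotes(body):
    """Issue (e): drop '>' lines and everything from the forwarding marker on.

    Assumptions: leading whitespace before '>' is tolerated, and the marker
    line itself is dropped along with the lines that follow it.
    """
    kept = []
    for line in body.splitlines():
        if "----- Original Message -----" in line:
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept)


def to_mysql_datetime(date_header):
    """Issue (d): RFC 2822 date string -> 'YYYY-MM-DD HH:MM:SS' (UTC if zoned)."""
    dt = parsedate_to_datetime(date_header)
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")
```

For instance, `to_mysql_datetime("Tue, 05 Mar 2002 21:15:42 +0900")` yields `2002-03-05 12:15:42`, and `obscure_emails("mail someguy@example.com now")` yields `mail someguy@e... now`. Trial decoding is only a heuristic: a short Shift_JIS fragment can occasionally be valid EUC-JP as well, so a real converter would likely combine it with any charset declared in the post's headers.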
Project ID: 3808531

About the project

6 proposals
Remote project
Active 19 yrs ago

Awarded to:
$21.24 USD in 10 days, 3.6 (6 reviews), 1.4

6 freelancers are bidding on average $29 USD for this job
$42.50 USD in 10 days, 4.6 (84 reviews), 5.6
$23.80 USD in 10 days, 4.9 (46 reviews), 5.4
$38.25 USD in 10 days, 4.9 (12 reviews), 2.7
$42.50 USD in 10 days, 5.0 (1 review), 0.8
$8.50 USD in 10 days, 0.0 (0 reviews), 0.0

About the client

United States
5.0
1
Member since Jul 16, 2005
