Parse raw HTML data and build a database from it. See project details.
## Deliverables
Yahoo group message files were downloaded and saved in HTML format using a program called Hypermail. The files are in chronological order, numbered from [login to view URL] to 12502.html. They are stored in folders numbered 0 to 90. Each folder also contains metadata for its message files, in files named "[login to view URL]," "[login to view URL]," "[login to view URL]," "[login to view URL]," and "subject.html."
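To illustrate the layout described above, here is a minimal Python sketch that collects the message files in numeric order. It assumes that both the folder names and the message file names (minus the `.html` extension) are plain numbers, and that anything else in a folder is a metadata file to be skipped. The function name `list_message_files` is hypothetical.

```python
from pathlib import Path


def list_message_files(root):
    """Return message files under `root`, sorted by their numeric file name."""
    root = Path(root)
    files = []
    for folder in root.iterdir():
        # Assumption: message folders are named with digits only (0 to 90).
        if not (folder.is_dir() and folder.name.isdecimal()):
            continue
        for path in folder.glob("*.html"):
            # Assumption: message files have numeric names; metadata files
            # such as subject.html do not, so they are skipped here.
            if path.stem.isdecimal():
                files.append(path)
    # Numeric sort, so that 10.html comes after 2.html.
    return sorted(files, key=lambda p: int(p.stem))
```

Sorting by the integer value rather than the file name string matters because the files are numbered chronologically.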
I would like you to parse the contents of each HTML message file into a database based on the fields already present in each file:
* Re: The message subject matter
* From: The sender
* Date: The date the message was sent
* in reply to: To whom the message was sent
* Message content: Only the text written by the sender. Exclude content not sent by that person, which is generally separated by "---"
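As a rough illustration of the field extraction listed above, the following Python sketch uses only the standard library. It rests on several assumptions that would need to be checked against the actual Hypermail output: that each field appears as a `Label: value` line before the message text, that the labels are exactly "Re", "From", "Date" and "in reply to", and that content from other people begins at the first line starting with "---". The names `parse_message`, `html_to_text` and the record keys are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: header fields look like "Label: value", one per line.
HEADER_RE = re.compile(r"^(re|from|date|in reply to)\s*:\s*(.*)$", re.IGNORECASE)
KEYS = {"re": "subject", "from": "sender", "date": "sent", "in reply to": "in_reply_to"}


class _TextExtractor(HTMLParser):
    """Collect the text of an HTML document, breaking lines at block tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "li", "div", "pre"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def parse_message(html):
    """Split one message file into header fields and the sender's own text."""
    record = {"subject": None, "sender": None, "sent": None, "in_reply_to": None}
    body = []
    in_header = True
    for line in html_to_text(html).splitlines():
        stripped = line.strip()
        if in_header:
            if not stripped:
                continue  # ignore blank lines between header fields
            match = HEADER_RE.match(stripped)
            if match:
                record[KEYS[match.group(1).lower()]] = match.group(2).strip()
                continue
            in_header = False  # first non-header line starts the body
        if stripped.startswith("---"):
            break  # assumption: everything after "---" is from someone else
        body.append(line.rstrip())
    record["content"] = "\n".join(body).strip()
    return record
```

Cutting at the first "---" line is deliberately simple. Since the separator is only "generally" present, some messages would still need a manual check.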
I have no preference as to the programming language or database used. My only request is that the database be as user-friendly as possible, so that someone without any training in computer science or programming can easily run queries to find information in the database, or enter new data as may be required after the database has been built.
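One possible option, shown here only as a sketch, is a single-file SQLite database created with Python's built-in `sqlite3` module. Such a file can also be opened in a graphical tool for browsing and data entry. The table name, column names and helper functions below are hypothetical, and each record is assumed to be a dictionary with the keys `subject`, `sender`, `sent`, `in_reply_to` and `content`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id          INTEGER PRIMARY KEY,
    source_file TEXT UNIQUE,   -- e.g. "12502.html"
    subject     TEXT,
    sender      TEXT,
    sent        TEXT,          -- date kept as text, exactly as found in the file
    in_reply_to TEXT,
    content     TEXT
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_message(conn, source_file, record):
    """Insert one message; re-adding the same file replaces the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO messages "
        "(source_file, subject, sender, sent, in_reply_to, content) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            source_file,
            record["subject"],
            record["sender"],
            record["sent"],
            record["in_reply_to"],
            record["content"],
        ),
    )
    conn.commit()


def find_by_sender(conn, name):
    """Return (file, subject) pairs for senders whose name contains `name`."""
    cur = conn.execute(
        "SELECT source_file, subject FROM messages "
        "WHERE sender LIKE ? ORDER BY source_file",
        (f"%{name}%",),
    )
    return cur.fetchall()
```

For example, `find_by_sender(conn, "Smith")` would list every stored message whose sender field contains "Smith".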
I've enclosed a sample folder below.