Create an API that accepts a file via POST and returns HTML. Possible inputs:
1) Text file: Just spit out contents
2) Image file: OCR with tesseract
3) PDF file
-- a) Scanned: Split into individual pages, save each page as an image, then OCR with tesseract
-- b) Embedded text: Return HTML with pdftk
4) Other file: Attempt to read with LibreOffice running headless
5) URL + jQuery selector path: Fetch the HTML at the URL and return the HTML and images matched by the selector path (excluding some selectors). See attached website.json.
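The routing for inputs 1-4 could be sketched roughly as below. This is only an illustration in Python (the actual implementation would be Go or Node.js, per the options in this brief). The function names classify_upload and plan_for are hypothetical, and detecting a PDF by its "%PDF-" header and other types by file extension is an assumption, not a requirement.

```python
import mimetypes


def classify_upload(filename: str, head: bytes) -> str:
    """Map an uploaded file to one of the input categories 1-4.

    `head` is the first few bytes of the file. PDFs are detected by their
    "%PDF-" header; everything else falls back to the filename extension
    (an assumption made for this sketch).
    """
    if head.startswith(b"%PDF-"):
        return "pdf"
    guessed, _ = mimetypes.guess_type(filename)
    if guessed and guessed.startswith("image/"):
        return "image"
    if guessed and guessed.startswith("text/"):
        return "text"
    return "other"


def plan_for(kind: str, pdf_has_text: bool = False) -> str:
    """Name the processing step for a category.

    `pdf_has_text` is a hypothetical flag; how to tell a scanned PDF from
    one with embedded text is left open here.
    """
    if kind == "pdf":
        if pdf_has_text:
            return "extract embedded text"
        return "split into page images, then OCR with tesseract"
    steps = {
        "text": "return contents",
        "image": "OCR with tesseract",
        "other": "convert with headless LibreOffice",
    }
    return steps[kind]
```

For example, plan_for(classify_upload("scan.png", b"")) names the tesseract OCR step, while an unrecognized extension falls through to LibreOffice.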
Possible implementations:
A) Extend [login to view URL] with worker containers written in Golang.
B) Write the code in Node.js, using the same Docker container system with a RabbitMQ server (Jeff/Alex can help with the Docker setup).
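Either option passes work from the API to worker containers through a queue. As a rough sketch of what a queued job might look like, the following serializes a job to JSON so it can travel as a message body. The field names (id, kind, data) and the helpers encode_job and decode_job are hypothetical, and no actual RabbitMQ client calls are shown.

```python
import base64
import json
import uuid


def encode_job(kind: str, payload: bytes) -> str:
    """Pack a file into a JSON message body (hypothetical envelope)."""
    return json.dumps({
        "id": str(uuid.uuid4()),  # lets the API match a result to its request
        "kind": kind,             # e.g. "text", "image", "pdf", "other"
        "data": base64.b64encode(payload).decode("ascii"),
    })


def decode_job(message: str) -> tuple[str, str, bytes]:
    """Unpack a message on the worker side."""
    job = json.loads(message)
    return job["id"], job["kind"], base64.b64decode(job["data"])
```

Base64 keeps binary files safe inside JSON at the cost of roughly a third more bytes; for large files, passing a reference to shared storage instead may be preferable.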
References:
* open-ocr: [login to view URL]
* bash script that runs 1-4 above: [login to view URL]
* attached [login to view URL] for #5 above
Final deliverables will be released as an open source project on GitHub (Apache License).