HTML::Parser (hOCR is an HTML-based microformat) (link to spec here)
Use LWP or something to do the download and upload steps?
Look for something to help with the parser logic or state machine?
Tesseract
Is there a decent library for this so we won't have to wrap the command-line?
GraphicsMagick
What happened to PerlMagick? Where are the bindings for this?
Use a simple pseudo-state machine for each level of hOCR data:
There's some overall OCR data that can probably be ignored (it'll be per-page in this case)
hOCR supports columns, but ignore these for now (too complicated)
First state will be HOCR_PAGE
Maybe ignore this for OCR purposes and just use it to determine the right DjVu page to add the hidden text layer to?
Second state will be HOCR_PARA
Is it worth mapping this to DjVuLibre's equivalent concept? Maybe just ignore it.
Third state will be HOCR_LINE
Fourth state will be HOCR_WORD
A fifth possible state would be HOCR_CHAR; DjVuLibre supports character-level text, but I don't think it's worth dealing with
Each parsing state is a constant
Need a global var or lightweight object to keep track of current state
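A minimal sketch of the constants plus a lightweight tracker object, written in Python for illustration rather than Perl. The names `HocrState`, `CLASS_TO_STATE` and `StateTracker` are hypothetical, and the class names `ocr_page`, `ocr_par`, `ocr_line` and `ocrx_word` are assumed to be the hOCR classes that matter here. HOCR_CHAR and columns are left out, as per the notes above.

```python
from enum import IntEnum


class HocrState(IntEnum):
    """One constant per nesting level; larger value means deeper nesting."""
    NONE = 0
    HOCR_PAGE = 1
    HOCR_PARA = 2
    HOCR_LINE = 3
    HOCR_WORD = 4


# Assumed mapping from hOCR class attribute values to states.
CLASS_TO_STATE = {
    "ocr_page": HocrState.HOCR_PAGE,
    "ocr_par": HocrState.HOCR_PARA,
    "ocr_line": HocrState.HOCR_LINE,
    "ocrx_word": HocrState.HOCR_WORD,
}


class StateTracker:
    """Keeps the current state on a stack so end tags can restore the parent."""

    def __init__(self):
        self.stack = [HocrState.NONE]

    @property
    def current(self):
        return self.stack[-1]

    def can_enter(self, new_state):
        # Strict nesting: only a deeper level may open inside the current one.
        # Levels may be skipped (e.g. a line directly inside a page).
        return new_state > self.current

    def enter(self, new_state):
        if not self.can_enter(new_state):
            raise ValueError(f"{new_state.name} not allowed inside {self.current.name}")
        self.stack.append(new_state)

    def leave(self):
        if len(self.stack) == 1:
            raise ValueError("no open hOCR element to leave")
        return self.stack.pop()
```

Using a stack instead of a single global means that leaving a word automatically puts us back in the enclosing line, without any extra bookkeeping.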
HTML::Parser is event driven
Need to catch start tag events and end tag events
Need to check for valid events in each given state (not too many: hOCR is strictly nested and general HTML can be ignored; no tagsoup)
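The start/end tag handling could look roughly like this. It is a sketch in Python, with the standard-library `html.parser.HTMLParser` standing in for HTML::Parser (both deliver start-tag and end-tag events). The class name `HocrEvents`, the `HOCR_LEVELS` table and the short `VOID_TAGS` list are assumptions for illustration, and the code assumes well-formed input where every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Assumed hOCR classes and their nesting depth.
HOCR_LEVELS = {"ocr_page": 1, "ocr_par": 2, "ocr_line": 3, "ocrx_word": 4}
# Elements that never get an end tag; not an exhaustive list.
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


class HocrEvents(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # one entry per open element: hOCR class or None
        self.levels = [0]    # nesting depth of the open hOCR elements
        self.events = []     # recorded ("start" | "end", hocr_class) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        kind = next((c for c in classes if c in HOCR_LEVELS), None)
        if kind is not None:
            # Valid only if this element is deeper than the one we are in.
            if HOCR_LEVELS[kind] <= self.levels[-1]:
                raise ValueError(f"unexpected {kind} at depth {self.levels[-1]}")
            self.levels.append(HOCR_LEVELS[kind])
            self.events.append(("start", kind))
        # General HTML is remembered as None so its end tag can be matched.
        self.open_tags.append(kind)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_tags:
            return
        kind = self.open_tags.pop()
        if kind is not None:
            self.levels.pop()
            self.events.append(("end", kind))
```

End tags carry no attributes, so the parser has to remember which open elements were hOCR elements. That is what `open_tags` is for.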
Build a tree in memory, or implement this as a streaming algorithm spitting out the sexprs as we go along?
Is it worthwhile to spend time on a generic data structure for this that can be serialized to many formats?
Maybe it makes more sense to write it as a straight hocr2sexpr converter and spit out per-page .sexpr files?
This would make the overall algorithm dumber-but-simpler
And given we'll be wrapping command-line utilities in any case, we can't avoid the "dumb" part. Maybe try to get the "simple" part too then?
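For the per-page route, the wrapper around djvused could be as small as this. It is a sketch in Python; the function names and the file names in the usage note are hypothetical. It assumes djvused's `select` and `set-txt` commands and its `-e` and `-s` options behave as documented in the DjVuLibre man page, and that page numbers are 1-based.

```python
import subprocess


def djvused_args(djvu_path, page_number, sexpr_path):
    """Build the argument list for attaching one page's hidden text.

    Assumption: sexpr_path contains no spaces or quotes, because it is
    pasted into the djvused script string unquoted.
    """
    script = f"select {page_number}; set-txt {sexpr_path}"
    # -e runs the script, -s saves the document afterwards.
    return ["djvused", "-e", script, "-s", djvu_path]


def apply_page_text(djvu_path, page_number, sexpr_path):
    """Run djvused for a single page; raises CalledProcessError on failure."""
    subprocess.run(djvused_args(djvu_path, page_number, sexpr_path), check=True)
```

Calling `apply_page_text("book.djvu", 3, "p0003.sexpr")` once per page keeps each djvused run small, and a failure then only costs one page instead of the whole book.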
Then again, a fully streaming implementation is probably not much more complicated, all things considered, provided we can rely on djvused not crapping out on us too much
That's a big if: for book-length DjVu files, the DjVuLibre tools have crapped out rather a lot
Maybe it's better if we can operate on a "page-at-a-time" level?
Then again, we need to keep track of at least HOCR_PAGE and HOCR_LINE to generate working sexprs
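A sketch of what a streaming, page-at-a-time hocr2sexpr could look like, in Python with the standard-library `html.parser` standing in for HTML::Parser. The class name `Hocr2Sexpr` and the helper `quote` are hypothetical. Assumptions: each hOCR element carries a `bbox x0 y0 x1 y1` in its `title` attribute, measured from the top-left corner, while DjVu text zones are measured from the bottom-left, so the page height has to be remembered to flip the y coordinates (one likely reason the page state cannot be dropped). It also assumes djvused accepts lines nested directly in a page, and that escaping backslashes and double quotes is enough for word strings.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
# Assumed hOCR class -> DjVu zone type; paragraphs are ignored.
KINDS = {"ocr_page": "page", "ocr_line": "line", "ocrx_word": "word"}
VOID_TAGS = {"br", "img", "meta", "link", "hr", "input"}


def quote(text):
    """Wrap text in double quotes, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


class Hocr2Sexpr(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []      # zone type or None for each open element
        self.page_height = None  # needed to flip y coordinates
        self.word_text = None    # list of text chunks while inside a word
        self.out = []            # pieces of the page being built
        self.pages = []          # one finished sexpr string per page

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        kind = next((KINDS[c] for c in classes if c in KINDS), None)
        if kind is not None:
            m = BBOX_RE.search(attrs.get("title") or "")
            if m is None:
                raise ValueError(f"{kind} element without a bbox")
            x0, y0, x1, y1 = map(int, m.groups())
            if kind == "page":
                self.page_height = y1
                self.out = []
            elif self.page_height is None:
                raise ValueError(f"{kind} element outside a page")
            h = self.page_height
            prefix = "" if kind == "page" else " "
            # Flip from top-left origin (hOCR) to bottom-left origin (DjVu).
            self.out.append(f"{prefix}({kind} {x0} {h - y1} {x1} {h - y0}")
            if kind == "word":
                self.word_text = []
        self.open_tags.append(kind)

    def handle_data(self, data):
        if self.word_text is not None:
            self.word_text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.open_tags:
            return
        kind = self.open_tags.pop()
        if kind is None:
            return
        if kind == "word":
            self.out.append(" " + quote("".join(self.word_text).strip()))
            self.word_text = None
        self.out.append(")")
        if kind == "page":
            # Page is complete: hand it off and forget everything about it.
            self.pages.append("".join(self.out))
            self.page_height = None
```

Feed the hOCR text with `feed()` and read finished pages from `pages`. Only the current page is ever held in memory; in a real tool each finished page would be written to its own .sexpr file instead of being appended to a list.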