DjVu stuff

General DjVu process

  • Grab the jp2's from IA
    • The jp2's are typically higher resolution than the jpg's
  • Adjust the image files to match with book pages
    • In particular, delete any scan-artifact first and last pages before or after the actual book covers
  • Use GraphicsMagick to convert the jp2's to jpg's
    • Since DjVuLibre can't read JPEG2000
  • Use DjVuLibre (c44) to generate single-page .djvu files from the page images
  • Use Tesseract to do OCR of each page, spitting out .hocr files
  • Write some custom code to:
    • Merge the individual .djvu's into a multi-page .djvu for the whole book
    • Parse the hOCR data from Tesseract and generate DjVuLibre s-expressions
    • Use djvused to add a hidden text layer to the book .djvu
  • Upload to Commons
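
A minimal sketch of the per-page part of the process above, in Python rather than the Perl these notes lean towards. It only builds the command lines; nothing is executed. The function names and the file naming scheme (one .jpeg, .djvu and .hocr next to each .jp2) are assumptions made for illustration.

```python
from pathlib import Path


def page_commands(jp2: Path, lang: str = "eng") -> list[list[str]]:
    """Return the commands for one page, in the order they have to run."""
    stem = jp2.with_suffix("")        # tesseract appends .hocr to this itself
    jpeg = jp2.with_suffix(".jpeg")
    djvu = jp2.with_suffix(".djvu")
    return [
        # c44 can't read JPEG2000, so convert first
        ["gm", "convert", str(jp2), str(jpeg)],
        # single-page .djvu from the page image
        ["c44", str(jpeg), str(djvu)],
        # OCR the same image, writing <stem>.hocr
        ["tesseract", str(jpeg), str(stem), "-l", lang, "hocr"],
    ]


def merge_command(pages: list[Path], book: Path) -> list[str]:
    """Return the djvm command that bundles single-page files into one book."""
    return ["djvm", "-c", str(book), *map(str, pages)]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`, so a failing tool stops the run at the page that caused it.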

Rough outline algorithm notes

  • Relevant libraries:
    • HTML::Parser (hOCR is an HTML-based microformat) (link to spec here)
    • Use LWP or something to do the download and upload steps?
    • Look for something to help with the parser logic or state machine?
    • Tesseract
      • Is there a decent library for this so we won't have to wrap the command-line?
    • GraphicsMagick
      • What happened to PerlMagick? Where are the bindings for this?
  • Use a simple pseudo-state machine for each level of hOCR data:
    • There's some overall OCR data that can probably be ignored (it'll be per-page in this case)
    • hOCR supports columns, but ignore these for now (too complicated)
    • First state will be HOCR_PAGE
      • Maybe ignore this for OCR purposes and just use it to determine right DjVu page to add the hidden text layer to?
    • Second state will be HOCR_PARA
      • Is it worth mapping this to DjVuLibre's equivalent concept? Maybe just ignore it.
    • Third state will be HOCR_LINE
    • Fourth state will be HOCR_WORD
    • Fifth possible state will be HOCR_CHAR, and DjVuLibre supports it, but I don't think it's worth dealing with
  • Each parsing state is a constant
  • Need a global var or lightweight object to keep track of current state
  • HTML::Parser is event driven
    • Need to catch start tag events and end tag events
    • Need to check for valid events in each given state (not too many: hOCR is strictly nested and general HTML can be ignored; no tagsoup)
  • Build a tree in memory, or implement this as a streaming algorithm spitting out the sexprs as we go along?
  • Is it worthwhile to spend time on a generic data structure for this that can be serialized to many formats?
  • Maybe it makes more sense to write it as a straight hocr2sexpr converter and spit out per-page .sexpr files?
    • This would make the overall algorithm dumber-but-simpler
    • And given we'll be wrapping commandline utilities in any case, we can't avoid the "dumb" part. Maybe try to get the "simple" part too then?
    • Then again, a fully-streaming implementation is probably not that much more complicated, all things considered, provided we can rely on djvused not crapping out on us too much
    • That's a big if: for book-length djvu's, the DjVuLibre tools have crapped out rather a lot
    • Maybe it's better if we can operate on a "page-at-a-time" level?
  • Then again, we need to keep track of at least HOCR_PAGE and HOCR_LINE to generate working sexprs
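
A rough sketch of the page-at-a-time hocr2sexpr idea from the notes above. It uses Python's event-driven `html.parser` as a stand-in for HTML::Parser. Assumptions: the input uses the `ocr_page`, `ocr_line` and `ocrx_word` classes with a `bbox x0 y0 x1 y1` entry in the `title` attribute, and the page bbox starts at 0 0. Paragraphs, columns and characters are ignored, as suggested above. The class and function names are made up, and the string escaping covers only backslashes and double quotes.

```python
import re
from html.parser import HTMLParser

# hOCR class -> DjVu zone type; ocr_par and ocr_carea are deliberately skipped.
LEVELS = {"ocr_page": "page", "ocr_line": "line", "ocrx_word": "word"}
VOID = {"meta", "link", "br", "hr", "img"}   # elements with no end tag
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrToSexpr(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open = []        # one entry per open element: zone type or None
        self.height = 0       # page height, needed to flip the y axis
        self.word_head = ""   # "(word x0 y0 x1 y1" of the word being read
        self.text = []        # character data of the word being read
        self.out = []         # finished sexpr lines, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        zone = next((LEVELS[c] for c in classes if c in LEVELS), None)
        match = BBOX.search(attrs.get("title") or "")
        if zone and match:
            x0, y0, x1, y1 = map(int, match.groups())
            if zone == "page":
                self.height = y1
            # hOCR has its origin at the top left, DjVu at the bottom left.
            head = f"({zone} {x0} {self.height - y1} {x1} {self.height - y0}"
            if zone == "word":
                self.word_head, self.text = head, []
            else:
                self.out.append(head)
        else:
            zone = None
        self.open.append(zone)

    def handle_data(self, data):
        if "word" in self.open:   # also true inside <em>/<strong> in a word
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.open:
            return
        zone = self.open.pop()
        if zone == "word":
            text = "".join(self.text).strip()
            text = text.replace("\\", "\\\\").replace('"', '\\"')
            self.out.append(f'{self.word_head} "{text}")')
        elif zone:
            self.out.append(")")


def hocr_to_sexpr(hocr: str) -> str:
    """Convert one page of hOCR markup to a DjVu hidden-text sexpr."""
    parser = HocrToSexpr()
    parser.feed(hocr)
    parser.close()
    return "\n".join(parser.out)
```

The only state kept is a stack of open elements plus the page height, which is about the minimum needed to close each sexpr at the right end tag. That keeps it close to the streaming variant without building a tree in memory.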

Most manual workflow

  • gm mogrify -format jpeg '*.jp2'
  • tesseract inputpage.jpeg inputpage -l eng hocr
  • At a minimum need custom code to convert .hocr to .sexpr
    • NB! hOCR (origin at top left) and sexpr (origin at bottom left) use different coordinate systems, so every y value has to be flipped against the page height!
  • c44 inputpage.jpeg inputpage.djvu
  • djvm -c output.djvu page1.djvu ... pageN.djvu
  • djvused output.djvu -e 'select N; set-txt pageN.sexpr' -s
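
The last step is the one to repeat for every page, so here is a small sketch that builds the djvused call for one page. It assumes the sexpr for each page has already been written to its own file and that the file name contains no spaces, since no quoting is done inside the djvused script. The function name is hypothetical.

```python
from pathlib import Path


def set_txt_command(book: Path, page_number: int, sexpr_file: Path) -> list[str]:
    """Return the djvused call that attaches one page's hidden text layer."""
    if page_number < 1:
        raise ValueError("djvused page numbers start at 1")
    # select picks the page, set-txt reads the sexpr from the named file,
    # and -s saves the modified book in place.
    script = f"select {page_number}; set-txt {sexpr_file}"
    return ["djvused", str(book), "-e", script, "-s"]
```

Running one call per page keeps things at the "page-at-a-time" level, which makes it easier to see which page upset djvused if a long book fails partway through.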