Tesseract retraining project

Image pre-processing

Removing small specks can have a major effect on the OCR quality:

phab:F34558171 → pl’fint the pen,
phab:F34558172 → point the pen,

An image pre-processing step that removes these specks could substantially improve OCR accuracy.

Removing islands

Simple function to remove unconnected islands under a certain area with OpenCV. Expects a white-on-black binary image:

import cv2 as cv
import numpy as np

def remove_islands(img, min_area):

    # Label 8-connected white regions; label 0 is the black background
    nlabels, labels, stats, _ = cv.connectedComponentsWithStats(
        img, connectivity=8, ltype=cv.CV_32S)

    result = np.zeros(labels.shape, np.uint8)

    # Copy across only the regions that are big enough
    for label in range(1, nlabels):
        if stats[label, cv.CC_STAT_AREA] >= min_area:   # keep
            result[labels == label] = 255

    return result

The island size needs to be carefully chosen to avoid deleting things like colons and dots of i's.

By inverting the image, you can also delete small white specks in letters, though these do not seem to be as lethal to the OCR as black specks.

Fonts

18th-century text is often printed using either Caslon (the original) or something very like it. It usually has more ligatures than modern fonts.

A derivative of Adobe Caslon Pro may be possible.

Notable changes:

  • Much tighter kerning after a long-s (ſ) in the regular font (italic already kerned well)
  • Bar on t reduced in length (modern fonts have made the bar more prominent, which causes the old short-barred t's to be easily mistaken for i's or l's)
  • Less prominent serifs on r
  • More space before :;!?
  • Heavier top serifs on u, to try to avoid it being mistaken for o
  • Variants: (in PUA at U+E100)
    • Higher bar on 'e' - option ss01, since not every 'e' in the source texts has it - to reduce e → c errors
    • i,j with a missing dot - option ss02
    • t with truncated bar: ss03

To try:

  • Variant chars:
    • Add glyphs representing more damaged forms to the font to prevent overfitting of the model (the model becomes too "fixated" on the perfect form of the 't'). Probably add them as an ssXX font feature.
      • E.g. t with a truncated top is mistaken for i, r or c
      • e with a light centre line → c
      • i with a heavy dot → r

Generate the ground-truth data

Construct "clean" text for the fonts, variants, styles, etc. that you want:

model: eng_oldcaslon_longs
text:
  dir: corpus/eng_longs
fonts:
  - face: Old Caslon
    sizes:
      - 25
    variants:
      regular: {}
      italic:
        italic: true
      smallcaps:
        smallcaps: true
        ratio: 0.1
    features:
      - features:
        - ss01
        rate: 0.05
      - features:
        - ss02
        rate: 0.005
      - features:
        - ss03
        rate: 0.005
    process:
      - noise: 0.2
        erode: 3
      - noise: 0.3
        erode: 2
    include_clean: false
Generate the images and output to the tesstrain data directory
./generate.py -c configs/eng_oldcaslon_longs.yml -o ~/src/tesstrain/data -m eng_oldcaslon_longs
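The generate.py script itself is not shown here, so the following is only a guess at what the per-feature rate values in the config might mean: each feature set is switched on independently for a rendered line with that probability. FEATURE_RATES and pick_features are names invented for this sketch.

```python
import random

# Hypothetical mirror of the `features` list in the YAML config above
FEATURE_RATES = [
    (["ss01"], 0.05),
    (["ss02"], 0.005),
    (["ss03"], 0.005),
]

def pick_features(rates, rng):
    """Return the OpenType features to enable for one rendered line.

    Assumption: each entry is enabled independently with probability
    `rate`; the real generate.py may work differently.
    """
    enabled = []
    for features, rate in rates:
        if rng.random() < rate:
            enabled.extend(features)
    return enabled

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
counts = {"ss01": 0, "ss02": 0, "ss03": 0}
lines = 10_000
for _ in range(lines):
    for feature in pick_features(FEATURE_RATES, rng):
        counts[feature] += 1

for feature, n in counts.items():
    print(f"{feature}: enabled on {n / lines:.2%} of lines")
```

Under that reading, only about 1 line in 200 would carry a dotless i or a truncated t, so a small corpus may contain very few examples of the damaged forms.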

Once you have ground truth data

Set your model name in the shell (match the model name used above)
export MODEL_NAME=eng_oldcaslon_longs
Train the model
This will take a long time (hours, if you set a high MAX_ITERATIONS); go and proofread something
20000 iterations seems to work OK; after that, overfitting seems more likely than improvement (0.2% error seems to be around the lower limit for now)
make training MODEL_NAME=$MODEL_NAME START_MODEL=eng TESSDATA=~/src/tessdata_best

First it will read the training files and set up lstmf and box files, and you will see thousands of lines like this:

Tesseract Open Source OCR Engine v5.0.0-alpha-20210401-158-ge1761 with Leptonica
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "data/eng_oldprint-ground-truth/agrippa-occult.00420.png" -t "data/eng_oldprint-ground-truth/agrippa-occult.00420.gt.txt" > "data/eng_oldprint-ground-truth/agrippa-occult.00420.box"
+ tesseract data/eng_oldprint-ground-truth/agrippa-occult.00420.png data/eng_oldprint-ground-truth/agrippa-occult.00420 --psm 13 lstm.train

Then, it will start generating training output, and you will see the errors start to decrease.

At iteration 2132/30400/30400, Mean rms=0.148000%, delta=0.023000%, char train=0.071000%, word train=0.109000%, skip ratio=0.000000%,  New worst char error = 0.071000 wrote checkpoint.
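To watch the error rates without reading every line, the progress lines can be picked out of a saved training log. This is a minimal sketch that assumes the log lines look like the sample above; parse_progress is a made-up name, and only the first of the three iteration counters is kept.

```python
import re

# Matches progress lines in the format of the sample above
PROGRESS_RE = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+, "
    r".*?char train=(?P<char>[\d.]+)%, "
    r"word train=(?P<word>[\d.]+)%"
)

def parse_progress(line):
    """Return (iteration, char error %, word error %), or None if no match."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return int(m["iteration"]), float(m["char"]), float(m["word"])

sample = (
    "At iteration 2132/30400/30400, Mean rms=0.148000%, delta=0.023000%, "
    "char train=0.071000%, word train=0.109000%, skip ratio=0.000000%,  "
    "New worst char error = 0.071000 wrote checkpoint."
)
print(parse_progress(sample))
```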

At this point you can take any recent checkpoint file (one is generated every time the result gets 2% "better") for testing:

Create and copy the most recent .traineddata for use
make traineddata CHECKPOINT_FILES="$(ls -t data/$MODEL_NAME/checkpoints/*.checkpoint | head -1)" MODEL_NAME=$MODEL_NAME TESSDATA=~/src/tessdata_best
cp $(ls -t data/$MODEL_NAME/tessdata_best/*.traineddata | head -1) ~/.local/share/tessdata/$MODEL_NAME.traineddata
Use it!
tesseract --tessdata-dir ~/.local/share/tessdata -l $MODEL_NAME image.jpg -
When it's done, the .traineddata is ready
cp data/$MODEL_NAME.traineddata ~/.local/share/tessdata/$MODEL_NAME.traineddata
Continue training from that point (may need to increase MAX_ITERATIONS).
Beware that too much training on too little source data leads to overfitting: while the model may get better at the GT images, it gets less able to handle real-life images that are not quite the same.
make training MODEL_NAME=$MODEL_NAME START_MODEL=$MODEL_NAME TESSDATA=data MAX_ITERATIONS=50000
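The ls -t ... | head -1 idiom used above to grab the newest checkpoint can be mirrored in Python if the steps are ever scripted. This is a sketch only: newest_checkpoint is a hypothetical helper, and the data/MODEL_NAME/checkpoints layout is taken from the commands above.

```python
from pathlib import Path

def newest_checkpoint(model_name, data_dir="data"):
    """Return the most recently modified .checkpoint file, or None if there are none."""
    checkpoint_dir = Path(data_dir) / model_name / "checkpoints"
    checkpoints = list(checkpoint_dir.glob("*.checkpoint"))
    if not checkpoints:
        return None
    # Same ordering as `ls -t | head -1`: newest modification time wins
    return max(checkpoints, key=lambda p: p.stat().st_mtime)

if __name__ == "__main__":
    # Prints None when run outside a tesstrain checkout
    print(newest_checkpoint("eng_oldcaslon_longs"))
```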

Generate evaluation text

This can also be used to generate training data (but you will need a lot of it).

Generate an HOCR file of the image; using the model in question (hopefully!) gets you pretty close
tesseract /tmp/theimage.jpg /tmp/hocr --tessdata-dir ~/.local/share/tessdata -l eng_oldcaslon_longs hocr
Extract the HOCR file to image/text pairs (/tmp is where the image is).
hocr-extract-images -b /tmp /tmp/hocr.hocr theimage-%03d.png
Correct the text lines as needed.
This is a pain and really needs some kind of a tool to help
Copy/emplace the evaluation ground truths
Remember the text files need to end .gt.txt, not just .txt
cp -r your_images ~/src/tesstrain/data/eval_$MODEL_NAME
Generate .lstmf files
This also generates an all-lstmf file, which lists all the .lstmf files.
make lists MODEL_NAME=eval_$MODEL_NAME
Evaluate the model against those files
lstmeval --model "data/${MODEL_NAME}.traineddata" --eval_listfile "data/eval_${MODEL_NAME}/all-lstmf"
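The .gt.txt naming requirement mentioned above is easy to forget, so the renaming could be scripted. A minimal sketch, assuming the extraction step left a plain .txt file next to each line .png; rename_to_gt is a made-up helper name.

```python
from pathlib import Path

def rename_to_gt(directory):
    """Rename foo.txt to foo.gt.txt wherever a matching foo.png exists."""
    renamed = []
    for txt in sorted(Path(directory).glob("*.txt")):
        if txt.name.endswith(".gt.txt"):
            continue  # already named correctly
        if not txt.with_suffix(".png").exists():
            continue  # no matching line image, leave it alone
        target = txt.with_name(txt.stem + ".gt.txt")
        if target.exists():
            continue  # do not overwrite an existing ground truth
        txt.rename(target)
        renamed.append(target)
    return renamed

if __name__ == "__main__":
    for path in rename_to_gt("/tmp"):
        print("renamed to", path.name)
```

Files without a matching image are skipped on purpose, so stray notes in the same directory are not turned into ground truth by accident.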

Progress so far

  • Long-s usually recognised
  • Some confusion between italic h and b
  • Occasional mistaken t → c/r
 
58 Terræ-F1lius. n" x1,

founded upon this politick ſuppoſition, that when
they had got a new Frmcng houſe, they could ne-
ver want new books; but by what means ſocver
it was bu't, my lord Clarendon has the honour,

and we, his happy poſlcrity, the invaluable beneſic
of it,

I ſhould think it an undertaking well worthy
the laborious Mr. Hearne, to give the world an ac-
count, from year to year, of the many incompa-
rable tomes, which iſſue from that illuſtrious preſs.
This, I apprehend, would do great honour to the
univerſity, and to its leamed authors, ſince the cata-
logue would not be crouded with any of thoſe he-
retical, pernicious, and free-thinking tracts, which
are the noiſom ſpawn of other modern preſſes: we
ſhould ſind there no ill.meaning Eſſays upon human
Underſtandmg, no Oceana's, no Hypotheſes of Liber-
ty, no deſcants upon Original Contracls, nor en-
quiries into the Stare of Nature, no Appeals to the
Laity and common Senſe in matters of religion, no
vindications of Conſcience and privare Judgment,
no defences of Reſiſtance in any poſſible caſes, no
apologies for the Revolution, and the preſent Go-
vernment, &c. to ſully the Academical Types, and
reproach the ſclemn Imprimatur of the univerſity
——New, accurate Editions of primitive Fathers,
and antient Chronicles, or modern ſermons, and long
ſyſturas of Logick, Metaphyſicks, and School-divinity
are the ſolid productions of this auguſt Typographa-
um————Such are the effects, and ſuch the advan-
tages of reſtraining the lrcence of the preſs! How
would letters flouriſh? how would arts revive?
bow would religiou lift up her awful front? and
how wculd the church rejoyce, if ſuch a whole-
ſome check were put upon the preſs throughout
the world ? l

But Printixg is not the only, not the principal
uſe, tar which theſe ſupendous ſtone-walls weie

erected 3
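To put a rough number on output like the sample above, a character error rate can be computed against a corrected transcription. This is a plain Levenshtein-distance sketch and is not claimed to match how lstmeval counts errors; the corrected string in the example ("licence" for "lrcence") is an assumed correction, and both function names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]

def char_error_rate(ocr, truth):
    """Edit distance divided by the length of the corrected text."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr, truth) / len(truth)

ocr_line = "tages of reſtraining the lrcence of the preſs! How"
corrected = "tages of reſtraining the licence of the preſs! How"  # assumed correction
print(f"{char_error_rate(ocr_line, corrected):.2%}")
```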

Links