Help:Gadget-ocr

OCR gadgets

There are three OCR (optical character recognition) gadgets that can be enabled to produce new text for scan pages if the existing file does not have an acceptable text layer.

Usually, you will not need these gadgets, as you can now use the built-in "Transcribe text" button instead. It appears at the top right of the editing toolbar when you edit in the Page namespace. Instructions for that tool are at mw:Help:Extension:Wikisource/Wikimedia OCR.

Because the gadgets use different OCR engines, one of them may perform better than the others on certain pages.

Enable the gadgets in your gadget preferences.

Tesseract OCR

The "basic" OCR gadget uses Tesseract to generate new OCR text. Generally, this gadget is better than the Google OCR gadget at recognising text columns, but has more character errors.

This gadget uses an older version of Tesseract than the built-in OCR tool, so you may find that the built-in tool gives better results.

Google OCR

The Google OCR gadget submits the page image to Google for processing.

Accuracy is generally excellent, but text in columns is sometimes not recognised as such, so lines from adjacent columns end up interleaved.

Transkribus OCR

Wikisource also integrates with the Transkribus engine, which is most helpful for transcribing handwritten manuscripts. When you select Transkribus for OCR on a given page, you can choose between different Transkribus models that specialise in different handwriting styles.

Alternative sources

The Internet Archive (archive.org) automatically derives OCR files for uploaded items. This is explained in the OCR at the Internet Archive document, which describes the file formats. Behind the scenes, the Archive currently uses Tesseract, but because its Tesseract setup is tuned differently, its OCR results may differ from the default Tesseract results used on Wikisource. Previously, the Archive used a variety of OCR engines, some open source and some not, and you may find evidence of these older OCR results in the supplementary files in the sidebar on the archive.org page for a given item.

The Internet Archive produces both word-level and character-level OCR. The Archive's OCR is not directly integrated with Wikisource, so if you wish to use it you will need to download the files and load them into a tool manually.
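As a rough illustration of loading such a file manually, the sketch below pulls plain text out of an OCR file, assuming the file is in hOCR format. The class names `ocr_line` and `ocrx_word` come from the hOCR format itself. The sample markup, the `HocrWordExtractor` class and the `hocr_to_text` function are made up for this example and are not part of any Wikisource or Internet Archive tool.

```python
from html.parser import HTMLParser


class HocrWordExtractor(HTMLParser):
    """Collect the text of hOCR word elements, grouped by line."""

    def __init__(self):
        super().__init__()
        self.lines = []        # one list of words per ocr_line element
        self._in_word = False  # True while inside an ocrx_word span
        self._depth = 0        # nested spans inside the current word
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._in_word:
            # Character-level files nest further spans inside each word.
            if tag == "span":
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            self._in_word = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_word:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if not self._in_word or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1
            return
        word = "".join(self._buffer).strip()
        if word:
            if not self.lines:
                self.lines.append([])
            self.lines[-1].append(word)
        self._in_word = False


def hocr_to_text(hocr):
    """Return the recognised words, one output line per hOCR line."""
    parser = HocrWordExtractor()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


# Made-up sample in the shape of word-level hOCR output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Optical</span>
  <span class='ocrx_word' title='bbox 100 10 220 40; x_wconf 93'>character</span>
 </span>
 <span class='ocr_line' title='bbox 10 50 300 80'>
  <span class='ocrx_word' title='bbox 10 50 150 80; x_wconf 95'>recognition</span>
 </span>
</div>
"""

print(hocr_to_text(SAMPLE_HOCR))
```

The resulting text would still need to be pasted into the relevant page and proofread by hand, as with the output of any of the gadgets above.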