cleanup.js (Inductiveload's version)

Advanced OCR cleanup script

Installation

The basic script can be installed as follows:

mw.loader.load('//en.wikisource.org/w/index.php?title=User:Inductiveload/cleanup.js&action=raw&ctype=text/javascript');

Once installed, you will have access to the default cleanup tool configuration. You may wish to add more (work-specific) corrections, disable some, or change other settings such as the possible languages or long-s corrections.

Concept

The tool performs a list of common actions:

  • Collapsing of lines and hyphens where appropriate
  • Adding paragraphs where likely
  • Typographic fixes like spaces before/after commas
  • Removing obvious running headers and scanning watermarks
  • Fixing OCR errors:
    • This uses a large (hundreds of entries) list of replacements, mostly tested against an English wordlist to avoid false positives. For example, not many words end in rcs, so comparcs is likely to be compares. However, arcs and orcs are not changed.
    • A separate (partially complete) list of long-s scannos is also included, but this is more likely to produce false positives, so it is not on by default.
  • Extra user-defined functions
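
As a rough illustration of the kind of OCR rule described above, the sketch below fixes a trailing rcs only when at least three letters precede it, which leaves arcs and orcs alone. The function name and the regex are assumptions made for illustration; they are not taken from the script's actual replacement list.

```javascript
// Hypothetical rule: "...rcs" at the end of a longer word is probably "...res".
// Requiring three or more letters before "rcs" keeps short real words
// such as "arcs" and "orcs" untouched.
function fixTrailingRcs( text ) {
	return text.replace( /\b([a-z]{3,})rcs\b/gi, '$1res' );
}

console.log( fixTrailingRcs( 'He comparcs the arcs drawn by orcs.' ) );
```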

Configuration

Configuration is via a "standard" (for me) mw.hook, which is called with a configuration object for you to update.
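
A minimal sketch of such a handler might look like the following. The hook name 'cleanup.config' and the handler name configureCleanup are assumptions (the hook name is not given on this page), so check the script source for the real one. Only mw.hook( ... ).add( ... ) is the standard MediaWiki call.

```javascript
// Handler that receives the configuration object and updates it in place.
// The option names used here are the ones listed in the default config.
function configureCleanup( cleanupConfig ) {
	cleanupConfig.doLongSReplacements = true;
	cleanupConfig.possibleLanguages = [ 'en', 'fr' ];
	cleanupConfig.shortLineThreshold = 40;
}

// ASSUMPTION: 'cleanup.config' is a placeholder hook name, not confirmed
// by this page. The guard lets the snippet load outside MediaWiki too.
if ( typeof mw !== 'undefined' ) {
	mw.hook( 'cleanup.config' ).add( configureCleanup );
}
```

Any option that the handler does not touch keeps its default value.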

This is the default config object, which is what you will get if you do not add a config hook handler:

	const Cleanup = {
		logLevel: ERROR,
		enable: true,
		testFunctions: [],
		enableTesting: mw.config.get( 'wgTitle' ).endsWith( 'cleanup-test' ),
		portletCategory: 'page',
		activeNamespaces: [ 'page' ],
		actionTitle: 'WsCleanup',
		additionalOcrReplacements: [],
		disabledReplacements: [],
		cleanupFunctions: [],
		italicWords: [],
		doLongSReplacements: false,
		doTemplateCleanup: true,
		remove_running_header: true,
		replaceSmartQuotes: true,
		collapseSuspiciousParagraphs: true,
		shortLineThreshold: 45,
		possibleLanguages: [ 'en', 'fr', 'es', 'de', 'zh-pinyin' ],
		italiciseForeign: true,
		smallAbbreviations: [],
		smallAbbrTemplate: 'smaller',
		editSummary: '/* Proofread */',
		markProofread: true,
		cleanupAccesskey: 'c'
	};
logLevel
The logging level of the Cleanup functions. Set to 0 for DEBUG, 1 for INFO and 2 for ERROR
enable
Enable the cleanup script (false prevents any of it from being added to the UI)
testFunctions, enableTesting
For now, internal
portletCategory
The portlet category to add the tool link to
activeNamespaces
Namespaces to load in (does nothing in other namespaces)
actionTitle
The name in the sidebar
additionalOcrReplacements
A list of additional OCR fixes (see below)
disabledReplacements
A list of disabled replacements (see below)
cleanupFunctions
Additional functions to run at the end of the process (see below)
replaceSmartQuotes
Convert “smart quotes” to "straight quotes".
doLongSReplacements
Perform a set of fixes for badly-OCR'd texts using long-s. For example: affift → assist
collapseSuspiciousParagraphs
Collapse paragraphs together if they look "suspect". For example, if one ends without punctuation, and the next starts with a lowercase letter.
shortLineThreshold
Dirty hack to re-insert paragraphs lost in the DjVu round trip. Adds a paragraph break after a line shorter than this threshold if the line appears to end a sentence and the next line looks like a sentence start. The spiritual inverse of collapseSuspiciousParagraphs.
possibleLanguages
A list of the languages the work might contain. This disables some replacements that would be invalid in those languages. For example, in German und is valid, but in English it is more likely to be a scanno for and.
italiciseForeign
Italicise foreign words. This is a very short list at present.
smallAbbreviations
Abbreviations to put inside an {{asc}} template. Note, A.D. and B.C. are already templates.
smallAbbrTemplate
The template to use for small abbreviations.
editSummary
The edit summary to automatically add, if any
markProofread
Also mark the page as Proofread. Note: you still have to actually proofread the page yourself, this isn't magic!
cleanupAccesskey
The access key to use, e.g. c → Ctrl+Alt+c
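
The two paragraph heuristics (collapseSuspiciousParagraphs and shortLineThreshold) can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the function names, the punctuation sets and the exact conditions are all hypothetical.

```javascript
// Join two paragraphs when the first ends without punctuation and the
// second starts with a lowercase letter.
function collapseSuspiciousParagraphs( text ) {
	return text.replace( /([^\s.!?:;"'”’)\]])[ \t]*\n\n+(?=[a-z])/g, '$1 ' );
}

// Insert a blank line after a short line that ends a sentence when the
// next line starts with a capital letter (optionally after a quote mark).
function insertParagraphBreaks( text, shortLineThreshold ) {
	const lines = text.split( '\n' );
	const out = [];
	lines.forEach( ( line, i ) => {
		out.push( line );
		const next = lines[ i + 1 ];
		const isShort = line.length > 0 && line.length < shortLineThreshold;
		const endsSentence = /[.!?]["'”’]?$/.test( line );
		const nextStartsSentence = next !== undefined && /^["'“‘]?[A-Z]/.test( next );
		if ( isShort && endsSentence && nextStartsSentence ) {
			out.push( '' );
		}
	} );
	return out.join( '\n' );
}
```

Raising the threshold makes the second function insert more breaks, so a value near the typical full line length of the scan risks splitting paragraphs that merely happen to end a sentence at a line end.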

Additional replacements

This is a list of replacements to make. Each entry is a two-element array of the form [ /regex/, "replacement" ].

Often, you will make special replacements only in certain works:

// _title is assumed to hold the current page title, e.g. from mw.config.get( 'wgTitle' )
if ( _title.startsWith( 'The Chinese Review' ) ) {
    cleanupConfig.additionalOcrReplacements.push( ...[
        [ /\bYch\b/, 'Yeh' ],
        [ /\bBouring\b/, 'Bowring' ]
    ] );
}

The g flag is always applied to these regexes. If you need a non-global regex, use a cleanup function (see cleanupFunctions). Other flags (e.g. i) are kept. Replacement references like $1 work.
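
One way such a list could be applied is sketched below. The function applyReplacements is hypothetical and only mirrors the behaviour described above (g is forced, other flags are kept, $1 references work); the script's own implementation may differ.

```javascript
// Apply each [ regex, replacement ] pair in order, forcing the g flag
// while preserving any other flags on the regex.
function applyReplacements( text, replacements ) {
	for ( const [ regex, replacement ] of replacements ) {
		const flags = regex.flags.includes( 'g' ) ? regex.flags : regex.flags + 'g';
		text = text.replace( new RegExp( regex.source, flags ), replacement );
	}
	return text;
}

const fixed = applyReplacements( 'Ych met Ych and bouring', [
	[ /\bYch\b/, 'Yeh' ],             // becomes global: both occurrences change
	[ /\bBouring\b/i, 'Bowring' ],    // the i flag is kept
	[ /\b(\w+) met\b/, '$1 has met' ] // $1 refers to the captured word
] );
console.log( fixed );
```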

Disabled replacements

Disabled replacements are a list of replacements that should not be applied even though they are part of the normal script:

cleanupConfig.disabledReplacements.push( ...[
    [ /\berery/ ],
    [ /\bhighfer/, 'higher' ],
] );

If only one element is given in an item, all replacements with that regex are disabled. If two are given, the regex and the replacement must match for the disabling to happen. Only the "text" of the regex is compared, flags are not used.
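
The matching rule can be sketched as a small predicate. The name isDisabled and the sample built-in entries are assumptions for illustration; only the comparison logic follows the description above.

```javascript
// Return true if a built-in [ regex, replacement ] pair is switched off by
// any entry in disabledReplacements. Only the regex source is compared, so
// flags are ignored; a one-element entry matches every replacement that
// uses that regex.
function isDisabled( [ regex, replacement ], disabledReplacements ) {
	return disabledReplacements.some( ( [ disabledRegex, disabledReplacement ] ) =>
		disabledRegex.source === regex.source &&
		( disabledReplacement === undefined || disabledReplacement === replacement )
	);
}

const disabled = [
	[ /\berery/ ],
	[ /\bhighfer/, 'higher' ]
];

console.log( isDisabled( [ /\berery/g, 'every' ], disabled ) );     // regex-only entry
console.log( isDisabled( [ /\bhighfer/gi, 'higher' ], disabled ) ); // regex and replacement
console.log( isDisabled( [ /\bhighfer/g, 'highter' ], disabled ) ); // replacement differs
```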

Cleanup functions

Final functions to run. This is a list of functions that are given the body, header and footer editors (as in TemplateScript) as parameters, in that order. They run in the order listed.


cleanupConfig.cleanupFunctions.push( ...[
    function ( editor, header, footer ) {
        header.set( 'Some custom header!' );
    },
    function ( editor, header, footer ) {
        footer.set( 'Some custom footer!' );
    },
] );
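
Because additional replacements are always global, a one-off (non-global) replacement is a natural fit for a cleanup function. The sketch below assumes the body editor offers get() alongside the set() used above; the function name, the regex and the use of the {{c}} template are purely illustrative.

```javascript
// Hypothetical cleanup function: centre only the FIRST chapter heading
// found in the page body. The regex has no g flag, so later headings
// are left as they are.
function centreFirstChapterHeading( editor, header, footer ) {
	const text = editor.get(); // ASSUMPTION: the editor exposes get()
	editor.set( text.replace( /^CHAPTER ([IVXLC]+)\.?$/m, '{{c|CHAPTER $1.}}' ) );
}
```

It would be registered in the same way as the functions above, with cleanupConfig.cleanupFunctions.push( centreFirstChapterHeading ).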

See also

  • TemplateScript: a general-purpose tool that allows you to add actions to the sidebar. The script is mostly one huge TemplateScript action.