Page:Wikipedia and Academic Libraries.djvu/294

Wikisource as a Tool for OCR Transcription Correction

the original item. For example, an incorrectly spelled word in a book should be marked with the {{SIC}} tag, which records both the incorrect spelling used in the original, for historical accuracy, and the presumed correct spelling, to aid keyword search. It became apparent early in the project that fully meeting all of Wikisource’s guidelines would require the proofreader to spend a lot of time on each book. Since the Library’s chief motivation for engaging with Wikisource was to generate and extract high-quality transcriptions, working in complete compliance with the existing standards would slow the process so much as to make it infeasible. Instead, the project team worked with key members of the Wikisource community to develop new standards that would allow a better balance between transcription quality and throughput. The agreed approach focused on correct spelling and layout, while using some of the more common tags to ensure transcriptions aligned closely with the original text.
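To illustrate how a {{SIC}} annotation can serve both purposes, the following minimal Python sketch reads the tag from wikitext and returns either the spelling as printed or the presumed correction. It assumes the two-parameter form {{SIC|original|correction}} and does not handle nested templates; the function name resolve_sic and the sample sentence are hypothetical and are not part of the project described here.

```python
import re

# Assumed form: {{SIC|text as printed|presumed correction}}.
# Nested templates and extra parameters are deliberately not handled.
SIC_PATTERN = re.compile(r"\{\{SIC\|([^|{}]*)\|([^|{}]*)\}\}", re.IGNORECASE)


def resolve_sic(wikitext: str, prefer: str = "original") -> str:
    """Replace each SIC tag with the original or the corrected spelling."""
    if prefer not in ("original", "corrected"):
        raise ValueError("prefer must be 'original' or 'corrected'")
    group = 1 if prefer == "original" else 2
    return SIC_PATTERN.sub(lambda m: m.group(group).strip(), wikitext)


# Hypothetical sample sentence, not taken from any chapbook.
sample = "The {{SIC|marchant|merchant}} sailed at dawn."
faithful = resolve_sic(sample, prefer="original")
searchable = resolve_sic(sample, prefer="corrected")
```

Keeping both readings in one tag means a single transcription could feed a faithful display text and a search index with normalized spelling.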

Discussion around standards helped the team develop the project workflow, splitting the work into five discrete tasks, which are outlined below (“Wikiproject NLS Workflow,” 2020).

  1. Upload multipage PDFs of digitized chapbooks and their associated metadata to Wikimedia Commons using the Pattypan bulk upload tool.
  2. Create Index pages on Wikisource, link these to the files on Wikimedia Commons, and add another link and information to the project Excel spreadsheet.
  3. Generate an initial automated transcription using Wikisource’s Google OCR engine, then proofread to correct errors.
  4. Validate the proofread transcription, publish to Wikisource (transclude), and link to author pages and Wikidata.
  5. Export transcriptions as multipage PDFs, convert to single-page TXT files, remove header and footer information and tags, and reupload into the Library’s Digital Gallery.
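
The clean-up in step 5 could be sketched roughly as follows. This is only an illustration in Python, not the project's actual script: it assumes each page arrives as plain text whose first and last non-empty lines are the header and footer, and that any remaining markup consists of HTML-style tags and simple, non-nested templates. The function name clean_page and its parameters are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")   # HTML-style tags, e.g. <br />
TEMPLATE_PATTERN = re.compile(r"\{\{[^{}]*\}\}")  # simple, non-nested templates


def clean_page(text: str, header_lines: int = 1, footer_lines: int = 1) -> str:
    """Drop assumed header/footer lines, then strip tags and templates."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln]
    end = len(lines) - footer_lines if footer_lines else len(lines)
    body = lines[header_lines:end]

    cleaned = []
    for ln in body:
        ln = TAG_PATTERN.sub(" ", ln)
        # Caution: this removes a template together with any text it wraps,
        # so text-bearing templates would need dedicated handling.
        ln = TEMPLATE_PATTERN.sub(" ", ln)
        ln = re.sub(r"\s{2,}", " ", ln).strip()
        if ln:
            cleaned.append(ln)
    return "\n".join(cleaned)


# Hypothetical page content for demonstration only.
page = (
    "THE HISTORY OF TOM THUMB\n\n"
    "Once upon a time<br />there lived a small boy.\n"
    "{{rule}}\n"
    "He was no bigger than a thumb.\n\n"
    "3\n"
)
result = clean_page(page)
```

Counting header and footer lines is a simplification; a real pipeline would likely need a rule matched to how the exported files are actually laid out.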

In developing the steps outlined above, the project showed it is possible to set up an end-to-end workflow to successfully improve