Match and split

For use with User:Phe-bot. This tool takes edited text from the main namespace and applies it to consecutive scanned pages in the Page: namespace. The pages in Page: must already have a DjVu text layer of sufficient quality for a match to be performed. Texts waiting to be matched and split are listed at Category:Texts to be migrated to scans.

As of May 2024, an experimental version of these tools exists at toolforge:matchandsplit. Feel free to use and test the system and to report bugs. Note that with this version SodiumBot, rather than Phe-bot, performs the match and/or split.

Criteria for using this tool

This tool has the power to create a lot of damage if not used carefully. Read through this section before using it and ensure that all the criteria are met.

  • Has the Index file been uploaded and put in place?
  • Does the Index file have a text layer?
  • Is the Index file of type DjVu?
  • Is it the same work (volume of the work)?
  • Is it the same edition of the work?
    • Does the year of publication match?
    • Is it the same publisher?
    • Is the city of publication the same? This is particularly important for texts that were published in England and the US simultaneously. Differences in punctuation and orthography can cause proofreading headaches in the Page: namespace.
  • Has the mainspace text been wikified?
    • Are bold, italics, smallcaps, &c. in place?
    • Have text page numbers and running headers been removed?
    • Have ref tags been used for any footnotes?
  • Has the text been proofread by us to at least 75%? If not, there is no advantage in using Match & Split and side-by-side proofreading will be quicker in the long run.
  • Have the header and footer fields in the Index been appropriately set for the running header and footer? When Page: namespace pages are created, the contents of these fields are used to populate the header and footer fields.

When not to use this tool

  • If any of the above criteria are not met
  • If the text has been pasted from Project Gutenberg or Distributed Proofreaders, then we do not know that the editions used are the same. Side-by-side proofreading should be done on the OCR text layer in the Page: namespace and either transcluded to replace the pasted text or transcluded as a separate edition. This decision will be based on the similarity of the editions.
  • If the text has been pasted from Internet Archive, then side-by-side proofreading is the appropriate action. Transclusion should replace the pasted text. This is because very little proofreading takes place at IA and the text is the OCR layer that we already have.

Match and Split Process

Read through the whole process before commencing it and ensure that you understand the steps.

1) In Special:Preferences, on the Gadget tab, turn on the Phe-bot Match and Split option.

2) Identify the first page in the Page: namespace (from the DjVu file) that corresponds to the text in the main namespace. This requires reading the main namespace document and the same document in the Page: namespace.

MATCH

Adding MATCH

3) The initial edit is to add __MATCH__ and its associated text to the main namespace page, at the point in the text that corresponds to the first word on the respective Page: namespace page. There are two ways to do this:

  • the easy way is to click the button in the toolbar and paste in the first corresponding page link in the Page: namespace; or
  • from first principles, as follows.
Format:   ==__MATCH__:[[Page:La Vie littéraire, II.djvu/82]]==

i.e.,
  • wrap in  ==   ==
  • add  __MATCH__
  • add  :
  • add  [[Page:in namespace.djvu/xx]] (it will be a redlink at this point)
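The format above can be assembled mechanically. The following is a minimal Python sketch, not part of the tool itself; the function name `build_match_heading` and its parameters are hypothetical and exist only for illustration.

```python
def build_match_heading(index_file: str, page_number: int) -> str:
    """Return the MATCH heading in the format shown above.

    index_file  -- name of the DjVu file as used in the Page: namespace
    page_number -- position of the scanned page within that file
    (Both names are illustrative; the tool defines no such function.)
    """
    if page_number < 1:
        raise ValueError("page_number must be a positive integer")
    # Wrap in == ==, add __MATCH__, add ':', then the Page: link.
    return f"==__MATCH__:[[Page:{index_file}/{page_number}]]=="


if __name__ == "__main__":
    print(build_match_heading("La Vie littéraire, II.djvu", 82))
```

The resulting line is pasted into the main namespace text at the point where the first matching page begins.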

Note: All formatting below the MATCH will be moved into the Page: namespace, so place Categories, etc. for the work above the __MATCH__, and remove any coding that straddles the match, e.g., <div class="..."></div>

Hint: Paste the overarching formatting for the work, e.g., <div class=… and the like, from the end of the work to above the MATCH statement, and move it back upon completion of the SPLIT process.

4) If all is working properly, the __MATCH__ should be an active link.


Applying MATCH

5) Click the __MATCH__ link and the job will start.


Outcome from MATCH

6) When the job has finished:

  • the page will be saved and will reload;
  • the text will be aligned with the first page in the Page: namespace, and the [[Page:...]] link will now be active (blue);
  • all subsequently aligned Page: pages will be listed, inserted and actively linked;
  • a [split] tab will appear at the top of the page.


STOP!

7) Read the next step carefully, and ensure that the previous process has been both successful and correct before proceeding to split.


Verify MATCH

8) Verify the text has successfully been aligned to each of the Page: pages created.

  • This may fail if the text in the Page: namespace is of insufficient quality.
Notes
  • Errors to look for:
    • Quotation marks and similar paired formatting split either side of a page link or ===level=== markers.
    • Incomplete matching, marked by ===no match===. This can happen where successive non-text pages are encountered. In this case, realign with a subsequent Match statement and process the match again further down the page, repeatedly if necessary, before performing the Split.
  • Tables, particularly those that span page breaks, will not behave well and will have to be re-created in the Page: namespace.
  • Words moved either side of a page break. This is usually because of words that are hyphenated across the page break.
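Part of this verification can be assisted by a quick scan of the matched wikitext. The sketch below is only an aid, written in Python under stated assumptions: it assumes aligned pages appear as [[Page:...]] links and that unmatched stretches are marked by a "no match" heading as described above. The function name `summarise_match` is hypothetical, and the check does not replace reading the aligned text.

```python
import re

# Assumption: an unmatched stretch is marked by a heading such as ===no match===.
NO_MATCH_RE = re.compile(r"^=+\s*no match\s*=+\s*$", re.IGNORECASE | re.MULTILINE)
# Assumption: each aligned page is introduced by a [[Page:File.djvu/N]] link.
PAGE_LINK_RE = re.compile(r"\[\[Page:([^\]|]+?)/(\d+)\]\]")


def summarise_match(wikitext: str) -> dict:
    """List the page numbers linked in the text and the lines flagged as no match."""
    pages = [int(number) for _, number in PAGE_LINK_RE.findall(wikitext)]
    no_match_lines = [
        wikitext.count("\n", 0, m.start()) + 1  # 1-based line number of the marker
        for m in NO_MATCH_RE.finditer(wikitext)
    ]
    return {"pages": pages, "no_match_lines": no_match_lines}


if __name__ == "__main__":
    sample = (
        "==[[Page:Example.djvu/10]]==\n"
        "First page text.\n"
        "==[[Page:Example.djvu/11]]==\n"
        "Second page text.\n"
        "===no match===\n"
        "Unaligned text.\n"
    )
    print(summarise_match(sample))
```

Any line number reported under `no_match_lines` is a place where a further Match statement is needed before splitting.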

SPLIT

Applying SPLIT

9) To get Phe-bot to split the one page into the respective multiple pages, click the [SPLIT] link. The label will change to [splitting]. Please wait and do not click anything while the bot gets to work; it will take a little while. You can check on progress via the robot activity page.

  • If the robot activity page shows that the job is complete, yet the page still shows [splitting], it is okay to click on the originating page at this point.

10) When complete, the original page will reload, now transcluding its content using <pages /> syntax, with all the text moved to the respective Page: namespace pages.
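For orientation, a transclusion of this kind typically names the Index file and a page range. The Python sketch below builds such a tag; the helper name `build_pages_tag` and the end page 95 are hypothetical, and the exact tag the bot writes for a given work may differ (for example, it may include section attributes).

```python
def build_pages_tag(index_file: str, first_page: int, last_page: int) -> str:
    """Return a <pages /> tag transcluding first_page..last_page of an Index.

    This only illustrates the general shape of the syntax; it is not
    the bot's own code, and the page range used below is made up.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return f'<pages index="{index_file}" from={first_page} to={last_page} />'


if __name__ == "__main__":
    print(build_pages_tag("La Vie littéraire, II.djvu", 82, 95))
```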


Verify SPLIT

11) Verify that each page in the series has been split, then look to apply Proofread status through the normal editing process.

Notes
  • Chapters in the main namespace will not always end at the end of a Page:..., i.e., the next chapter may start on the same page. The Split process should create a <section> marker at the appropriate place. For the next chapter, insert the new Match statement at the appropriate place on the Page:... and save and process normally. The bot will create the relevant sections on the Page: and should continue smoothly.
  • Check that you have fixed any formatting, categories, and copyright tags that you placed before the MATCH to avoid them being moved with the page text.

Notes about <pagequality>

When the Split is undertaken, the text will show on the respective pages in Page: namespace as Not Proofread. This is because:

  • the automated matching algorithm isn't perfect, especially around hyphenated words
  • the original proofreading may be against a slightly different edition or contain formatting that isn't suitable for the Page namespace.

Normally, the work needs to be re-checked on a page-by-page basis. It's usually easier to correct issues like cross-page hyphenation and formatting before splitting the work.
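Cross-page hyphenation can be hunted down before splitting with a simple scan. The following Python sketch assumes that, after the match, each page boundary appears as a heading containing a [[Page:...]] link; that format and the function name `hyphenated_breaks` are assumptions made for illustration. It reports the pages whose preceding text ends in a hyphen.

```python
import re

# Assumption: a page boundary looks like ==[[Page:File.djvu/N]]== on its own line.
PAGE_HEADING_RE = re.compile(
    r"^=+\s*\[\[Page:[^\]]+/(\d+)\]\]\s*=+\s*$", re.MULTILINE
)


def hyphenated_breaks(wikitext: str) -> list:
    """Return page numbers whose preceding text ends with a hyphen."""
    flagged = []
    for m in PAGE_HEADING_RE.finditer(wikitext):
        # Text immediately before this page boundary, ignoring trailing whitespace.
        before = wikitext[: m.start()].rstrip()
        if before.endswith("-"):
            flagged.append(int(m.group(1)))
    return flagged


if __name__ == "__main__":
    sample = (
        "==[[Page:Example.djvu/10]]==\n"
        "The quick brown fox jum-\n"
        "==[[Page:Example.djvu/11]]==\n"
        "ped over the lazy dog.\n"
        "==[[Page:Example.djvu/12]]==\n"
        "The end.\n"
    )
    print(hyphenated_breaks(sample))
```

Each flagged page is a candidate for manual checking; a trailing hyphen may also be a legitimate dash, so the list is a prompt for review rather than a list of errors.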

See also
