User:Inductiveload/dp reformat

This script attempts to convert a raw text file of page-wise proofread text from Distributed Proofreaders (DP) into a format that the Help:Match and Split bot can insert into the Page: namespace.

Prerequisites

Install dp_reformat.js to your common.js page:

mw.loader.load("//en.wikisource.org/w/index.php?title=User:Inductiveload/dp_reformat.js&action=raw&ctype=text/javascript");

Enable the Match and Split gadget in your gadget preferences
Download a "concatenated page text file" from Distributed Proofreaders. These files are not available for "archived" projects, and DP will not provide access to them.
Upload the matching scan (usually DP will mention if they use an IA scan) and create an Index page.

Process

Open a new page. This can be in mainspace (if you will proofread it soon), in your user space, or even the Sandbox. You do not need to save yet.
Paste in the contents of the text file from Distributed Proofreaders
Click "Reformat DP text" in the side bar
Fill in the target Index file—this is the index page you created above
Fill in the target offset. This is needed because "page 1" according to DP is usually not the first page of the scan file (DjVu or PDF). If they have removed blank pages in the body of the work, you may need to work in sections.
- It is important to check this carefully, as incorrect splits need a bot and an admin (due to the redirects) to fix.
Click "Done".
The text should now be transformed into a split-ready format that looks something like this:

==[[Page:The ways of war - Kettle - 1917.pdf/7]]==
THE WAYS OF WAR
==[[Page:The ways of war - Kettle - 1917.pdf/8]]==
Page content

More page content
==[[Page:The ways of war - Kettle - 1917.pdf/9]]==
....

Make any other adjustments to the text now if you want. The content under each ==Heading== will be placed into the matching page, so it's easier to do bulk edits and replacements now, while it is still one file, rather than later.
Save the page now.
There should be a "split" tab at the top of the page, next to "Discussion".
Click this, and the Match and Split bot will start to move the text to the Page namespace. This can take a short while, as the bot does not create pages very fast.
- You can check the split has started here.
When it is done, the page will be updated with transcluded text from the Page namespace.
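To make the offset arithmetic concrete, here is a minimal sketch of how the split-ready text could be assembled. It is not the code from dp_reformat.js: the function name buildSplitText and the rule "scan page = DP page number + offset" are assumptions for illustration, so check how the tool actually interprets the offset before relying on it.

```javascript
// Hypothetical helper, not taken from dp_reformat.js.
// indexFile: file name of the Index page, e.g. "Some book.pdf"
// offset:    assumed to be (scan page number) - (DP page number)
// pages:     array of page texts, in DP order (DP page 1 first)
function buildSplitText(indexFile, offset, pages) {
  return pages
    .map((text, i) => {
      const scanPage = i + 1 + offset; // DP pages are counted from 1
      return `==[[Page:${indexFile}/${scanPage}]]==\n${text}`;
    })
    .join("\n");
}

// Example: DP page 1 sits on scan page 7, so the assumed offset is 6.
const splitText = buildSplitText(
  "The ways of war - Kettle - 1917.pdf",
  6,
  ["THE WAYS OF WAR", "Page content"]
);
console.log(splitText);
```

If blank pages were removed in the middle of the work, the same idea applies per section: each section gets its own offset, which is why you may need to run the tool more than once.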

Limitations

PGDP use a very general syntax like /# ... #/ to mark "special formatting", where we'd normally apply the formatting ourselves using, say, {{fine block}}. So these occurrences need to be dealt with on a case-by-case basis.
Line breaks are retained inside /* ... */ blocks using <poem>. This might not be right in all cases, but PGDP aren't more specific, so, again, it's a case-by-case thing.
Some diacritics might be missing. Let me know if you spot a mapping from something like [=a] to a character like ā and I'll add it.
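For anyone curious how these two conversions could work, here is a rough sketch. It is not the tool's actual code. Only the [=a] to ā mapping and the /* ... */ to <poem> behaviour come from the notes above; the function names, the other two diacritic entries, and the assumption that /* and */ sit on their own lines are mine.

```javascript
// Combining marks keyed by the symbol inside the DP bracket markup.
// Only "=" (macron) is taken from this page; the rest are assumed examples.
const COMBINING = {
  "=": "\u0304", // macron:                [=a] -> ā
  ":": "\u0308", // diaeresis (assumed):   [:u] -> ü
  "^": "\u0302", // circumflex (assumed):  [^e] -> ê
};

// Replace [<mark><letter>] with the precomposed character where one exists.
function convertDiacritics(text) {
  return text.replace(/\[([=:^])([A-Za-z])\]/g, (match, mark, letter) =>
    (letter + COMBINING[mark]).normalize("NFC")
  );
}

// Wrap /* ... */ blocks in <poem> so that line breaks survive.
// Assumes the /* and */ markers each sit on their own line.
function wrapNoWrapBlocks(text) {
  return text.replace(/\/\*\n([\s\S]*?)\n\*\//g, "<poem>\n$1\n</poem>");
}

console.log(convertDiacritics("t[=a]ne"));
console.log(wrapNoWrapBlocks("/*\nline one\nline two\n*/"));
```

Building the character from a base letter plus a combining mark and then normalising to NFC means one table entry covers every letter, rather than needing a separate entry for each accented character.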

Remember, this tool is not a substitute for proofreading, it's just an aid.

Other subdomains

Other language subdomains can be supported. I need to know the correct values for the domains dictionary, which defines how a domain handles certain formatting.

Configuration

Some configuration options are exposed.

Set them by adding a handler for the import_dp.config hook and setting the options in the provided config object. This is optional: if you do not, the defaults will be used.

  mw.hook("import_dp.config").add(function(cfg) {
    cfg.keep_comments = true;
  });

keep_comments
- true: convert [** comments] to wiki-style HTML comments (<!-- comments -->). Note: DP claims ownership of comments, so they are stripped by default.
- false: default, strip comments.
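As a rough illustration of what keep_comments controls, the sketch below shows one way the two behaviours could be implemented. It is not the code from dp_reformat.js: the function name handleComments is hypothetical, and writing kept comments as <!-- ... --> is my assumption about what a wiki-style comment means here.

```javascript
// Hypothetical helper, not taken from dp_reformat.js.
// Matches DP proofreader notes of the form [** some note].
function handleComments(text, cfg) {
  const pattern = /\[\*\*\s*([^\]]*)\]/g;
  return cfg.keep_comments
    ? text.replace(pattern, "<!-- $1 -->") // keep as an HTML comment (assumed form)
    : text.replace(pattern, "");           // default: strip the note entirely
}

const sample = "word[** check this] next";
console.log(handleComments(sample, { keep_comments: true }));
console.log(handleComments(sample, { keep_comments: false }));
```

Either way the note never shows up in the rendered page; keeping it only preserves it in the wikitext for whoever proofreads the page later.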