User talk:Xover/Archives/2020

Warning: Please do not post any new comments on this page.
This is a discussion archive first created in 2020, although the comments contained were likely posted before and after this date.
See current discussion or the archives index.

DjVu manipulation

Do I remember rightly that you have the skill to make minor corrections to DjVu files? I have a file that is in great condition, but two of the pages are reversed, which causes transclusion issues. Would you be able to swap the order of those two pages (and related OCR, etc.) and upload the corrected file? --EncycloPetey (talk) 22:08, 7 January 2020 (UTC)

@EncycloPetey: I can perform most of the likely manipulations of DjVu files, yes. If there are a large number of operations or pages (i.e. too tedious to do by hand) it may require making a new script (which may or may not be worthwhile depending on reusability potential). But swapping two pages should be a quick matter.
PS. Keep in mind that I can also regenerate a DjVu from the original page scan images, including generating a new OCR text layer. A lot of our "text layer is offset" problems with DjVu files are caused by phab:T219376 and these can be fixed by regenerating them from scans (my scripts have armoring to prevent triggering this MediaWiki bug). --Xover (talk) 05:38, 8 January 2020 (UTC)
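A minimal sketch of that kind of regeneration with the DjVuLibre command-line tools, assuming the page scans have already been extracted as sequentially numbered JPEGs (the file names are hypothetical, and a real script does considerably more):
  for f in page-*.jpg; do
    c44 "$f" "${f%.jpg}.djvu"        # encode each scan image as a single-page DjVu
  done
  djvm -c rebuilt.djvu page-*.djvu   # bundle the single-page files into one document
  # Note: this produces no text layer; OCR has to be generated and attached separately.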
Great! Then this should be a straightforward fix. The File:Aeneid (Conington 1866).djvu (on Commons) came from IA, and looked to be just fine but pages 81 and 82 turned out to be swapped in the scan. These are pp. 81 & 82 of the book (pp. 105 & 106 of the file). You can see the page numbers swapped in the Index, from when I discovered the problem yesterday.
I had checked the file to be sure that all pages were present before uploading, but I missed the fact that these two pages had their order swapped. As this is the only good first edition scan at IA, I'd rather have it corrected than resort to using a later edition or poorer scan. If you can simply swap those two pages and upload the corrected file to Commons, that should be all this file needs. --EncycloPetey (talk) 16:34, 8 January 2020 (UTC)
@EncycloPetey: Done. Swapping two pages is indeed easy: the only hard part is keeping straight in your head what goes where. :) --Xover (talk) 17:30, 8 January 2020 (UTC)
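For reference, a swap like this one (DjVu pages 105 and 106) can be sketched with the DjVuLibre tools; the file name here is hypothetical:
  djvused book.djvu -e 'select 105; save-page p105.djvu'   # extract page 105, text layer included
  djvm -d book.djvu 105             # delete it; the old page 106 now becomes page 105
  djvm -i book.djvu p105.djvu 106   # re-insert the saved page as page 106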

Thanks! I also see that Yale Shakespeare volumes have been unlocked now, but the Coriolanus does not have a DjVu file available. Are you able to create DjVu files from existing scans in other formats? The Coriolanus is at (external scan) from which I would need File:Coriolanus (1924) Yale.djvu to be made and uploaded to Commons.

The other two plays published in 1924 (Two Gentlemen of Verona and Cymbeline) have DjVu files available, and once I have verified the quality of the scans, I will upload them and start the Index pages. --EncycloPetey (talk) 21:48, 8 January 2020 (UTC)

@EncycloPetey: Done. Incidentally, I am interested to know if you notice any differences in OCR quality in this DjVu vs. the IA-generated DjVus of the rest of the series (good or bad). I am using a different OCR engine which sometimes gives better results (and in a very few cases, dramatically worse results), and which has some knobs I can tweak to attempt to get better results for a specific work. I am also using a custom script to convert from the OCR engine's structured output format (hOCR) to the DjVu format's structured text format (sexpr) and can make adjustments during that conversion that could conceivably be useful for improving the end result (what gets loaded into the edit field by MediaWiki/ProofreadPage). For the latter I have toyed with the idea of unwrapping lines (while leaving paragraph breaks intact) just to eliminate that tedious task for most books. That wouldn't be helpful for this particular work (or other plays or poems), but there may be other similar transformations that could benefit from being automated in that process. If you have any ideas I'd love to hear them! --Xover (talk) 11:08, 9 January 2020 (UTC)
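To give a flavour of the pipeline described above, a much-simplified sketch; the hOCR-to-sexpr conversion is the custom script mentioned, so it is only hinted at here, and the file names are hypothetical:
  tesseract page-0001.png page-0001 hocr    # OCR one page image; writes page-0001.hocr
  # ...convert page-0001.hocr to the DjVu sexpr text format (custom script, not shown)...
  djvused book.djvu -e 'select 1; set-txt page-0001.sexp; save'   # attach the text layer to page 1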
Thanks. When I get to this work, I will keep that in mind. As for your idea, note that eliminating line breaks in prose upsets some proofreaders because they find it easier to proofread comparing line by line, rather than having to hunt for the appropriate text in an undifferentiated block. --EncycloPetey (talk) 16:10, 9 January 2020 (UTC)

Scan quality...

Compare:

https://archive.org/details/b22382525/page/n23 in zoom with Page:A Short Account of the Botany of Poole.djvu/25

The former (as PDF) is of MUCH higher quality... Does MediaWiki have a problem with generating quality images from DjVu right now? ShakespeareFan00 (talk) 18:16, 14 January 2020 (UTC)

@ShakespeareFan00: The Internet Archive book reader (the web interface at IA) doesn't display a PDF; it just displays the scan images. I am not aware of any problem with image quality in DjVus in MediaWiki, but it does extract the IW44-encoded image in the DjVu and re-encode it as JPEG for the thumbnails, so some degradation will always be present. I also don't see any egregious problem with the scan image quality in Page:A Short Account of the Botany of Poole.djvu/25: it is by no means perfect, and poorer than what you see at IA, but that is due to excessive compression in whatever process generated that .djvu. It doesn't appear to be so bad that proofreading is impossible. It was however lacking OCR text, so I've regenerated it from the source scans, trying to preserve as much image quality as possible, and added a text layer. --Xover (talk) 19:16, 14 January 2020 (UTC)
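Incidentally, one way to inspect the image actually stored in a DjVu, bypassing MediaWiki's JPEG thumbnailing, is to decode the page with DjVuLibre's ddjvu:
  ddjvu -format=tiff -page=25 "A Short Account of the Botany of Poole.djvu" page25.tif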

Wharton's Old New York

In addition to the volume 4 scan problems, there is no DjVu available for the second volume (although there is a good scan).

I also lack the tools right now to straighten images that are slightly crooked. If you feel a little bit adventurous, and could provide straightened and cropped copies of the cover images for all four volumes, it would be appreciated. There is a folder on Commons for the series, in which the three available DjVu scans are already placed. --EncycloPetey (talk) 19:32, 14 January 2020 (UTC)

@EncycloPetey: I'll take a look. Could you specify what you mean by "the cover images" so I get the right ones? --Xover (talk) 19:36, 14 January 2020 (UTC)
The first page of the DjVu scans, the cover of the book, has an illustration in the upper right corner for each volume that also bears the title. These images are crooked in every scan, and straightened image files will be needed for transclusion. So I am not requesting that images be straightened within the file itself, but rather that the four cover images be extracted, cropped, and saved as separate files for use in proofreading the work. --EncycloPetey (talk) 19:41, 14 January 2020 (UTC)
@EncycloPetey: Done. Let me know if you want any further tweaks. --Xover (talk) 17:26, 15 January 2020 (UTC)
Thanks. --EncycloPetey (talk) 17:27, 15 January 2020 (UTC)

Index:East Anglia in the twentieth century.djvu out by one

How does one realign DjVu text and image layers where a faux image has been added by IA-upload? (discussion) I can wget the file over to toollabs easily enough, though the command-line manipulations to use with the DjVu tools are unknown to me. Could you please tell me those so I can fix this file? Thanks. — billinghurst sDrewth 03:42, 20 January 2020 (UTC)

@Billinghurst: Offset text layer is most likely caused by T219376 (short version: if a .djvu has even a single page with an error, MediaWiki gets confused when extracting the text layer). Anecdotally, IA-upload seems to generate a disproportionate number of cases of this (@Samwilson: FYI). If that's the case then simply deleting the offending page is the simplest fix. This can be done with the DjVu tools from the DjVuLibre project using djvm -d filename.djvu pagenum.
If the offending page is not one that can be deleted out of hand, or it turns out to not be an instance of T219376, you might be better off just pinging me to regenerate it from the source scans. There's a lot you can do with .djvu files, but it gets complicated really fast, and manually dealing with the text layer is both tedious and fraught.
But if you want to experiment, the short-short version of the DjVuLibre cheatsheet is:
  • djvm -d filename.djvu pagenum—delete a page, including its text layer, from a .djvu
  • djvm -i filename.djvu [pagenum]—insert a single page, that's already converted to DjVu, into your .djvu file
  • djvm -c filename.djvu 1.djvu 2.djvu 3.djvu …—create a .djvu by collecting a bunch of single-page .djvu files
  • c44 page.jpeg page.djvu—convert a scan image of a single page (in JPEG, PGM, or PPM format) into a single-page DjVu (NB! No text layer!)
  • djvutxt [--detail=page] [--page=pagenum] file.djvu—dump the text layer for file.djvu. If you give it --detail=page you'll get the same output MediaWiki uses. If you give it --detail=word you'll get the full detailed text structure, down to the individual word level, in sexpr-format. --page=pagenum specifies which page in the .djvu you want to dump, otherwise you get the whole file.
  • djvused—essentially an interpreter for a private little scripting language for DjVu files. Very powerful, but for advanced users only.
  • djvudump filename.djvu—dumps the internal structure of .djvu files. Useful for debugging them.
The hard parts are 1) keeping the correct page numbers and order straight in your head when manipulating files, and 2) OCR text which you can fake in a pinch but which really requires custom scripting to do right. --Xover (talk) 05:24, 20 January 2020 (UTC)
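As a worked example of the T219376 case, one might hunt down and drop a broken page like this (the page number is hypothetical):
  djvudump file.djvu | less                # inspect the per-page structure; look for the anomalous page
  djvutxt --detail=page file.djvu | less   # the text layer exactly as MediaWiki extracts it
  djvm -d file.djvu 12                     # delete the offending page (here: page 12)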

Could you please explain...

On December 30, 2019 you opened a discussion at Wikisource:Copyright_discussions about several dozen Guantanamo-related documents I uploaded here.

On January 19, 2020 you closed that discussion, and erased those documents.

Did you consider giving me a heads-up that this discussion was underway? If you did not do so, can I ask why you did not do so? Geo Swan (talk) 03:51, 20 January 2020 (UTC)

@Geo Swan: My apologies. I simply did not notice that you were the uploader (I had Sherurcij—who is no longer active here—down as the contributor of most of them for some reason). --Xover (talk) 04:44, 20 January 2020 (UTC)

Index:Hindu Tales from the Sanskrit.pdf

Another PDF thumbnailer/image generator issue, I think.

Compare the scan quality of Page:Hindu Tales from the Sanskrit.pdf/169 vs https://archive.org/details/in.ernet.dli.2015.7115/page/n168

Was otherwise planning on using Google OCR to get the text. 16:06, 23 January 2020 (UTC)

@ShakespeareFan00: In this instance the cause is clear: the uploaded PDF on Commons has a resolution of 543 × 892 pixels (which is pitiful for text), while the original scans at IA have a resolution of 3154 × 4366 pixels. The PDF version at IA was generated by the uploader (Digital Library of India), so it never was of any higher resolution than what's now on Commons.
If you want I can generate a DjVu from the original scans so we at least get the maximum resolution that's available, and which has an OCR text layer. You'd have to move the Index: and existing Page: pages over to the new file though, which can be a bit tedious. --Xover (talk) 19:56, 23 January 2020 (UTC)
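As an aside, poppler's pdfimages offers a quick way to check the resolution of the scan images embedded in a PDF like this one:
  pdfimages -list "Hindu Tales from the Sanskrit.pdf"   # lists width/height of every embedded image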
No objections. I'll pause on this one until the scans are updated.. ShakespeareFan00 (talk) 20:06, 23 January 2020 (UTC)
@ShakespeareFan00: Here File:Hindu Tales from the Sanskrit.djvu. --Xover (talk) 20:59, 23 January 2020 (UTC)

Template:Recto-verso header/testcases

Added a new option in the sandbox to cope with variant page numbering such as Roman numerals, alphabetics, or the ###a-###b etc. variants I've seen in some works. ShakespeareFan00 (talk) 10:07, 25 January 2020 (UTC)

Template:Statute table family

A while back I reworked these to enable them to be subst more cleanly, and updated them to use template styles in the header.

Any chance you could use something like AWB to do a mass subst of uses? It would greatly improve performance on the subpages of https://en.wikisource.org/wiki/Chronological_Table_and_Index_of_the_Statutes/Chronological_Table.

I could do it manually, but the replacement could easily be automated.

It's currently split up because the sheer number of template calls blew out the transclusion limits when it was made. As TemplateStyles need only be called ONCE per page, it should be possible to reduce this to five or six calls instead of the one or more per row currently.


When I did a similar change to the underlying pages here Short Titles Act 1896/First Schedule, I was able to bring the WHOLE first schedule back onto a single page, which is quite impressive. I hadn't updated the links in the root page yet because the split-version was easy to navigate. :)

I am not going to continue with attempting to repair Lint errors for the moment, it's causing too much frustration and friction. ShakespeareFan00 (talk) 12:58, 27 January 2020 (UTC)

@ShakespeareFan00: I don't have access to AWB (or other mass-change tool), sorry. You'll need to post that request at WS:S/H.
I'm pretty sure the TemplateStyles extension does deduplication of styles, so that repeat uses of a template with an associated TemplateStyles file should not lead to repeat inclusions of those styles. --Xover (talk) 13:41, 27 January 2020 (UTC)

Mentoring request...

Willing to take on this task? I may be a long-term contributor, but I'm still a relative noob when it comes to some non-content areas... ShakespeareFan00 (talk)

@ShakespeareFan00: You should always feel free to ask me for advice—and I might even be able to provide something resembling coherent advice once in a blue moon :)—but I'm not really sure what you expect me to do as a "mentor". --Xover (talk) 13:53, 27 January 2020 (UTC)
see w:WP:MENTOR ShakespeareFan00 (talk) 14:01, 27 January 2020 (UTC)
@ShakespeareFan00: I'm familiar with WP:MENTOR, but I've never seen it actually function as intended in a formal form. People are who they are, and they will either pick up on what needs amending on their own (possibly with a little friendly and informal help), or they will not (which on enWP usually means they will eventually end up blocked) regardless of any mentoring.
In your particular case, the only advice I can really offer, which I'm pretty sure you're already aware of yourself, is to stop banging your head against the wall before you get so frustrated that you end up venting it on the Scriptorium.
There are fundamental limitations of a wiki as a medium, and other limitations that stem from the fact that the WMF does not have infinite resources but nearly infinite competing interests and requests on those resources. That they are also imperfect and end up getting blinders regarding any problem that does not affect the Wikipedias is annoying, but very very human.
Once it becomes clear that a given issue cannot reasonably be solved on-wiki, the best course is to just drop it; either by finding a workaround that, while wrong, is at least a reasonable substitute (using headings instead of sidenotes for example), or by dropping the issue and finding something that is actually solvable to do (lint errors in ref tags is probably in the "unsolvable" category).
That "a certain other contributor" seems to be getting as frustrated by your perseverance and consequent frustration as your own frustration with the software, is, not to put too fine a point on it, not your fault. Do, by all means, try to avoid knowingly stepping on people's toes, of course. But at a certain point it is no longer constructive to walk on eggshells when this is clearly a case of just inherently rubbing each other the wrong way. --Xover (talk) 16:17, 27 January 2020 (UTC)
Well, on the basis that the response to the Phabricator ticket seems to be that, for well-founded reasons, they can't actively "break" the Cite extension (something I can fully understand), it looks like English Wikisource will have to request a fork of the extension (not that anyone here has any more resources or developer expertise in order to implement anything better). ShakespeareFan00 (talk) 16:22, 27 January 2020 (UTC)

Sidenotes , a thought....

Currently, for reasons lost to time, the sidenotes used on English Wikisource are largely span-based. There are good technical reasons for this, because sidenotes can occur at any point in a run of text.

However, there are situations where the current approach isn't ideal: with the current styles it's sometimes possible for sidenotes to overlap. This is what {{cl-act-p}} and the associated module were trying to resolve.

Independently of that, I'd attempted (by editing some of the underlying templates, {{Outside}} and {{outside2}} IIRC) to stop them overlapping by changing the styling in relation to the float and clear parameters.

Neither approach was compatible with certain other templates (like {{di}} and so on).

As sidenotes are not supported in CSS (yet) nor in HTML, some more creative approaches are needed going forward. (Completed works that already exist do not need changing.)

With legislation, the sidetitles can be relatively easily converted into sub-headings (which is essentially what I've attempted to do with some UK Statutes I transcribed.)

Then you have something complex like Ruffhead.....

What would be useful is something like the current sidenotes, compatible with the dynamic layouts, but which can accept a block-level element: for example, a Template:Outside LR/s and Template:Outside LR/e pair that could be used to wrap a sidenote, whilst allowing normal wikitext handling for paragraph breaks and other formatting, like {{center}} etc. (albeit within the sidenote margins, not the main page).

A fuller solution for sidenotes would also resolve the issue of lengthy sidenotes in close proximity overlapping when narrow margins are involved. This was the "clear-fix" approach that I had attempted to implement (badly) for {{Outside L}} and {{Outside R}} etc.

This won't be solved in the short-term, but a fresh perspective is needed.

(I will also note that {{numbered div}} has essentially the same structural issues that an earlier version of {{cl-act-p}} had; they are in essence the SAME template with only minor tweaks...) ShakespeareFan00 (talk) 14:24, 27 January 2020 (UTC)

@ShakespeareFan00: There is no general way to solve sidenotes without explicit software support in MediaWiki (which will probably require a CSS standard for it in order to be implemented). Keep in mind that the dynamic layouts and page numbers are not actually a feature of MediaWiki: they are a local javascript that GOIII hacked together (and which hasn't been maintained since he retired). There is no left and right margin that we can rely on to be there, and in which we can place a sidenote. The block vs. inline issue is also impossible to solve in the current software. We would need something that works like <ref>, that makes block vs. inline irrelevant, and where we can use something like {{smallrefs}} to control where they appear.
In other words, for all but the simplest cases, I would suggest to focus on acceptable alternatives; such as using footnotes instead of sidenotes, or the sub-headings you suggest, or… whatever gets the idea across even if it doesn't match the original. --Xover (talk) 16:26, 27 January 2020 (UTC)
Worth starting a Phabricator ticket for the long term? For something like Ruffhead you could also use a very specific multi-col layout (which is table-based), but that's not a good idea because of mobiles. As you indicate, it would need backend support, which is not currently on anyone's proposal list. ShakespeareFan00 (talk) 16:51, 27 January 2020 (UTC)
Have a look at the first few pages of Ruffhead for one approach I used to try to resolve the issue of overlapping sidenotes... It's by no means ideal, and it's STILL overly complex to implement. ShakespeareFan00 (talk) 16:52, 27 January 2020 (UTC)

Page:The Adventures of Tom Sawyer.djvu/111

The cause of the lint error is clear... the {{flow under/text/s}} is span-based, but an attempt is made here to include a paragraph break...

Not sure how this one could be solved, and there's no "visible" impact.. ShakespeareFan00 (talk) 15:26, 27 January 2020 (UTC)

@ShakespeareFan00: The correct solution is m:Community Wishlist Survey 2020/Archive/Support CSS Shapes module, but I believe I saw a Once a Week page where they had figured out how to do it without explicit Mediawiki support. It was in one of the recent (last month or so) examples posted on WS:S, but I can't find it just now. --Xover (talk) 16:31, 27 January 2020 (UTC)
I would tend to agree, but I'm not entirely sure how widely that's supported yet. ShakespeareFan00 (talk) 16:47, 27 January 2020 (UTC)
@ShakespeareFan00: well enough. It's too new to be used by Mediawiki, but I'd say we're in the borderland of where we can use it in the few extraordinary cases where nothing else will do. Browser support will just improve over time, and meanwhile it degrades reasonably gracefully (readable, if not pretty). --Xover (talk) 17:18, 27 January 2020 (UTC)

Nop/Nopt and why a backend fix is needed to cease the insanity..

https://en.wikisource.org/w/index.php?title=Page:A_Naval_Biographical_Dictionary.djvu/1366&oldid=9893395

{{Nopt}} is a useful template, when it works as intended... If it is used as a drop-in for {{nop}} here, it disrupts the last cell in the row.

This isn't a critical concern, but it is visual... Can you come up with some test-cases to nail down exactly what Mediawiki thinks it should be rendering here? ShakespeareFan00 (talk) 18:56, 27 January 2020 (UTC)

Another example: -https://en.wikisource.org/w/index.php?title=Page:A_grammar_of_the_Teloogoo_language.djvu/40&oldid=9844736
The workaround is seemingly to do

Header:

...
|-

Body:

{{nopt}}
|-

But I'm not sure that's the complete picture at all. ShakespeareFan00 (talk) 19:19, 27 January 2020 (UTC)

@ShakespeareFan00: I haven't really dug into this issue, but from what I recall what's happening is something like this:
When ProofreadPage's <pages … /> tag transcludes pages from the Page: namespace, it joins them together by removing any whitespace (modulo hyphenated words which are a special case). If you put {{nopt}} at the end of the first page, and Mediawiki table syntax (row start, say: |-) at the start of the second, the transclusion will lead to the following construct: <!-- nopt -->|-. Since the row start markup has to be at the beginning of the line this will not work.
When you put the {{nopt}} at the start of the second page, the ProofreadPage whitespace removal removes the newline and any space characters after the last cell on the first page, leading to the construct:
| the last table cell<!-- nopt -->
|-
The row start syntax is at the start of the line, so that works as intended. The HTML comment gets appended to the previous cell, but it is not rendered and so it works as intended.
In other words, both Mediawiki (ProofreadPage) and {{nopt}} work as intended, it's just the documentation for the latter that's deficient. I can't offhand think of any other way to handle this issue that would be any more elegant. --Xover (talk) 06:44, 28 January 2020 (UTC)

Numbered list over multiple pages..

The obvious starting point is {{numbered list}} but it would be nice if this could be converted into a module so it's not making more parser calls than there are list items to process.

I was thinking about having something like the commencing/continuing/completing params that {{TOCstyle}} uses.

Prompted by the Questions section of Page:Treasure Island (1909).djvu/312 & Page:Treasure Island (1909).djvu/313 ShakespeareFan00 (talk) 15:45, 28 January 2020 (UTC)

Module:List also exists, and amending that to use the TOCstyle commencing/continuing/completing syntax should be trivial for a Lua coder (which I'm not). Want to sandbox something? I think this would greatly improve the speed of transcribing some works. ShakespeareFan00 (talk) 16:01, 28 January 2020 (UTC)
@ShakespeareFan00: I don't have the spare cycles for this just now. I'll try to remember to take a look at some point, but do please feel free to remind me if it looks like I've forgotten.
Calling me a Lua coder would be an exaggeration, but I usually manage to muddle through. And producing something that spits out a list, including across pages, looks, at least superficially, like it should be doable. --Xover (talk) 09:21, 30 January 2020 (UTC)

Requesting clone/duplication of pages between works..

(Reposting this directly here, as it's been on Scriptorium help for a while without a response.)

Source: Index:2019-12-02-report-of-evidence-in-the-democrats-impeachment-inquiry-in-the-house-of-representatives.pdf
Pages: 1 to 123

to

Destination: Index:Impeachment of Donald J. Trump, President of the United States — Report of the Committee on the Judiciary, House of Representatives.pdf Pages: 217 to 339

As the latter pages seem to be the exact same report appended or annexed. ShakespeareFan00 (talk) 13:39, 1 January 2020 (UTC)

Thanks in advance. ShakespeareFan00 (talk) 13:39, 1 January 2020 (UTC)

@ShakespeareFan00: I don't have any tools to do mass moves (or other mass edits), sorry. Perhaps Mpaa can help? --Xover (talk) 06:43, 30 January 2020 (UTC)

Before I really break something, can you take a look at why this is continually saying there are missing or misnested italics, despite my continued efforts to balance up the relevant tags? Thanks. ShakespeareFan00 (talk) 21:36, 29 January 2020 (UTC)

@ShakespeareFan00: The transclusion setup for that seems excessively complicated, so I've asked the contributor who worked on it (who's still active on enWP) to comment on the reason for that. If there's no particular reason for the complexity, the first step will be to simplify that. Next most likely culprit are the work-specific (well, area-specific, but…) templates used, but because they lack docs they'll be a pain to untangle. But, in any case, first step is figuring out the transclusion setup; and as far as I can tell these are just lint errors (no visible effect) so there's no particular hurry. --Xover (talk) 07:14, 30 January 2020 (UTC)
I think the complex layout is to accommodate "amendments" made since the original was passed, so that Wikisource has various consolidated versions without needing to hold duplicative text of sections and clauses that are identical between versions.
The SLfoo series of templates seem to be redirects, and essentially set out the layout for the clauses. Unlike {{numbered div}} and the train-wreck {{cl-act-p}}, these don't have additional anchoring. I concur there is little documentation. ShakespeareFan00 (talk) 14:55, 30 January 2020 (UTC)

Hathi scan request

Would you be able to grab Al Aaraaf for me, in order to scan back Al Aaraaf, Tamerlane and Minor Poems ? —Beleg Tâl (talk) 15:52, 31 January 2020 (UTC)

@Beleg Tâl: What's the copyright status of the front matter? PD-US-no-notice? --Xover (talk) 17:14, 31 January 2020 (UTC)
I don't see a notice, so {{PD-US-no-notice}} appears correct. —Beleg Tâl (talk) 17:19, 31 January 2020 (UTC)
@Beleg Tâl: I also find no registration in 1933 or 1934, nor a renewal in 1960 or 1961, so we should be good either way. --Xover (talk) 17:26, 31 January 2020 (UTC)
@Beleg Tâl: Done: File:Al Aaraaf (1933).djvu. I've done minimal checking. Ping me if it's borked in some way. --Xover (talk) 17:51, 31 January 2020 (UTC)

Request to convert PDF

May I ask you to convert File:Guide through Carlsbad and its environs.pdf into djvu for me? I used to convert PDF files into djvu using some online converter, but recently I have been receiving very bad results from them (they often leave some pages blank during the conversion). I also tried to download some converter, but its results were even worse. Thanks! --Jan Kameníček (talk) 22:46, 2 February 2020 (UTC)

@Jan.Kamenicek: File:Guide through Carlsbad and its environs.djvu --Xover (talk) 08:58, 3 February 2020 (UTC)
Great, thanks very much! It is really bad that contributors are still forced to choose between struggling with bad PDF extraction in MediaWiki and struggling with DjVu conversion, without any Wikimedia help with either of these two problems. Not only does it slow the work down, it makes things so difficult that it is IMO one of the biggest obstacles to getting new contributors here. --Jan Kameníček (talk) 09:10, 3 February 2020 (UTC)
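For what it's worth, one self-serve route for this kind of conversion is the pdf2djvu tool, sketched below on the assumption that the PDF contains reasonable scan images (a poor source stays poor, and a missing text layer still needs OCR):
  pdf2djvu -o "Guide through Carlsbad and its environs.djvu" "Guide through Carlsbad and its environs.pdf"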

User:ShakespeareFan00/Sandbox

3 styles... If you can advise further let me know.. 22:43, 5 February 2020 (UTC)

Controversial request...

Do we have partial blocks on English Wikisource yet?

Owing to some concerns I have about my ability to effectively code certain templates, I was going to ask for some kind of limitation on my account, so that I HAVE to request changes to templates via talk pages, rather than edit the templates directly.

In effect, this would be a self-block request from the Template: namespace (but not Template talk:), until I feel able to not make the typos and syntax failures which have tragically led to the kinds of unreasonable behaviour in "edit-comments" and on the Scriptorium.

I can of course try to stop editing Templates manually, but ....

ShakespeareFan00 (talk) 11:08, 6 February 2020 (UTC)

@ShakespeareFan00: Partial blocks were deployed to enWS in June last year; we just haven't updated our policies to reflect it.
But I am hesitant to implement this request: when you know you should not do something, technical measures should not be needed to enforce it. In this case, it is by your own judgement that you should not edit in the Template: namespace so refraining from doing so should be easy to adhere to. If it will help you, I can add a strong admonishment to refrain from editing there: not because I've actually seen you make any edits that would merit that, but because it clearly causes you great frustration. We have endless backlogs of all sorts so there should be plenty of other things to do that will bring you pleasure rather than frustration. --Xover (talk) 19:21, 6 February 2020 (UTC)

Page:UKSI19810859.pdf/31

This seems to be another instance of P-wrapping weirdness. There should be normal paragraph spacing between the end of the paragraph at the top of the page and the continuation paragraph following it. There is apparently reduced spacing, more like that of a normal line, even though I'm not changing any top or bottom margins. Can you possibly sanity-check the template code underlying this to make sure I am not overlooking some blindingly obvious logic failure or typo? ShakespeareFan00 (talk) 22:33, 7 February 2020 (UTC)

These may be related to the long-standing dolevels and P-wrapping glitch that's on Phabricator (see T134469). ShakespeareFan00 (talk) 22:50, 7 February 2020 (UTC)
Wow.. I took a VERY careful look at my code, and moved where the anchors were placed... No spacing issues. Next question: how to insert the required leading, as P and DIV have different initial margins. I really, really need to check my old code more often. ShakespeareFan00 (talk) 23:26, 7 February 2020 (UTC)
@ShakespeareFan00: The tendency of MediaWiki to aggressively insert P tags combined with very detailed styling will make that very difficult to get consistent. It will very quickly devolve to needing a "whole page style" (to reset things like margins) that templates and TemplateStyles are a poor fit for.
I also have to note, I took a quick look at the templates in use here and ran away screaming. This is very complicated code for what is, superficially, relatively simple visual formatting. My hunch (and I could very well be wrong here) is that this is indicative of another area where you're fighting limitations of both MediaWiki and what HTML/CSS supports. That's partially coloured by a previous look I had at what would be the "theoretically correct and semantic" way to mark up this kind of content and concluding that HTML/CSS just doesn't offer the facilities to do that properly.
I'm thus not really sure I can offer any useful advice on this. This kind of text is on my big and unsorted "todo" list of things to try to figure out at some point, but I very rarely work on this kind of content so it's likely to get pushed back for other projects. --Xover (talk) 09:40, 8 February 2020 (UTC)
It's certainly feasible to do it in CSS. It's just that the MediaWiki overheads add additional conflicts/concerns. If it wasn't for the paged nature of the content, I would have suggested looking into CSS counters, so that numbering becomes a content attribute for a P or DIV as part of a ::before rule, assuming that's something MediaWiki actually supports in TemplateStyles. ShakespeareFan00 (talk) 11:36, 8 February 2020 (UTC)
(Sigh) I clearly can't write rules that work... On this page the text-indentation defaults now override the rule I set to make it zero for a continuation paragraph. Having to write several different classes for starts vs continuations says this template's design is fundamentally flawed somehow. I can reformat the existing usages; there aren't that many. But you've potentially lost me as a long-term contributor to this project if well-intentioned (and, in this instance, actually thought-through) approaches are going to continually have to fight limitations in the tools. ShakespeareFan00 (talk) 12:11, 8 February 2020 (UTC)
The tools are what they are. We can do what we can within the limits of what they allow, or we can keep fighting them and being perennially frustrated. I heartily recommend the former approach. There are plenty of things to do here that do not involve templates or numbered paragraphs at all. --Xover (talk) 13:53, 8 February 2020 (UTC)
BTW my reason for redoing this is so that eventually I don't have to implement a new template for each "different" level; I just add the relevant classes to the stylesheet. (BTW there isn't technically anything stopping someone else "classing" the section titles into floats, with suitable margin shifts...) ShakespeareFan00 (talk) 14:53, 8 February 2020 (UTC)

Template:Cl-act-p/testcases#Inline_titles

So something is, as I said previously, being mis-expanded, but despite the conversion to Lua code, I can't figure out where. You'd paused on the deletion of this to await feedback from another user, which still hasn't arrived. Either the glitch can be tracked down in the code (time-consuming) or the template needs to be completely re-written so the "train-wreck" is conclusively resolved. ShakespeareFan00 (talk) 07:36, 8 February 2020 (UTC)

Regarding Index talk:History of Oregon volume 1.djvu

To keep relevant discussions clear, I've responded there only to the decision that was already reached. But your general point is well taken, and next time I'm starting work on a new transcription with page-spanning footnotes, I'll probably adopt your convention. (We're too deep into this work for an easy transition, unless somebody wants to get clever with AWB or something.) -Pete (talk) 20:23, 9 February 2020 (UTC)

@Peteforsyth: Ah. I was wondering why the page contained only fragments of a discussion. :)
I just read that bit as a question and wanted to clarify the (lack of) policy point and threw in the advice while I was at it in case it was still relevant. If you want the page to function more as a style guide then feel free to refactor my comment (or remove it entirely). --Xover (talk) 05:18, 10 February 2020 (UTC)
Others responded to me on my user talk page and, I think, in an edit summary...so I can see how it would come across that way. If there's a recommended format for documenting conventions for a particular work, I'd be happy to see it and emulate it...I appreciate your validation of our approach, and the advice. -Pete (talk) 05:28, 10 February 2020 (UTC)

Quote templates (from User talk:Chrisguise)

Moved because OT…

@Levana Taylor: Would it be useful to have {{“ ‘}} and {{’ ”}} (like {{" '}}) to emit these combinations with hair spacing? Slightly easier to type and ditto logical. --Xover (talk) 20:09, 9 February 2020 (UTC)
There was some back-and-forth at the time of changing quote policy but nothing decided; one opinion (which I share, I think) was that it would be preferable to have a single template that works with any pair of quotes. In any case, {{sp}} is just a stopgap which can be changed by bot when we settle on something. Levana Taylor (talk) 20:42, 9 February 2020 (UTC)
@Levana Taylor: I meant: would you like me to create those two templates for you? We can't have such templates for every combination of quotation mark out there, but these two are very common and should have wide applicability. If you use those combinations a lot on OAW and having these templates would help a little, then I would be happy to make them for you. The "savings" relative to just using {{sp}} may not be worth the effort of incorporating new ones into your muscle memory, but I just wanted to note the option should you prefer that approach. --Xover (talk) 05:12, 10 February 2020 (UTC)
Yes, actually it’d be a good idea to have those three (not two) templates including {{“ ’}}. I’ll bet some people have been wondering why they don't exist. Levana Taylor (talk) 06:22, 10 February 2020 (UTC)
I come back to the guidance amendment which was not meant to encourage users to move towards more extensive use. People may wonder why policy change discussions can get bitter, when the boundaries are moved, and stretched and moved again. Remind me next time that this is what is going to happen, and I will be vociferous about opposing such initial changes as it seems that the defence needs to be made at the very beginning. — billinghurst sDrewth 09:54, 10 February 2020 (UTC)
@Billinghurst: I don't think I catch your meaning here? --Xover (talk) 10:00, 10 February 2020 (UTC)

Index talk:Blenheim Column of Victory

Sorry to bother you about something, but this has seemingly sat for 6 months. I noticed it when doing some Lint error repair work...ShakespeareFan00 (talk) 19:00, 10 February 2020 (UTC)

thanks

thank you unsigned comment by Anayguy (talk) 15:02, 11 February 2020‎ (UTC).

You're welcome. :) --Xover (talk) 16:12, 11 February 2020 (UTC)

DjVu from Google

Are you able to pull a scan from Google and generate a DjVu for upload to Commons? With the disappearance of Wikilivres, I've found that the copy of The Poems of Sappho that we had was (ironically) a small fragment of the complete text. It would be very valuable to have The Poems of Sappho hosted here, and a scan would be the obvious first step. --EncycloPetey (talk) 16:50, 15 February 2020 (UTC)

@EncycloPetey: Nope, sorry. Google does not provide access to this book, at least not here. HathiTrust has it, but with limited access. Not sure about the status because I find some comments to the effect that it is downloadable in the US, so that is probably worth checking. --Xover (talk) 18:09, 15 February 2020 (UTC)
It looks as though I have the option to download a PDF. Would that be a sufficient starting point for you? I could upload it here as a temporary file if that would be enough for you to work from. --EncycloPetey (talk) 18:40, 15 February 2020 (UTC)
@EncycloPetey: I can work from PDF, yes. This will slightly degrade the image quality due to double encoding—which may be an issue with heavily compressed low-resolution scans—but usually works fine. --Xover (talk) 18:52, 15 February 2020 (UTC)
I expect problems anyway because of the quoted Greek passages, but will upload to File:Poems of Sappho (Cox 1924).pdf. The only additional work required is that the first page is a Google notice that should be removed (without substitution) since it alters the odd/even page numbering. With the notice, the title page is an even page, but it should be an odd page. The converted DjVu can be uploaded to Commons, and the local PDF deleted. --EncycloPetey (talk) 18:56, 15 February 2020 (UTC)
@EncycloPetey: Done. File:The Poems of Sappho (1924).djvu and Index:The Poems of Sappho (1924).djvu (do, of course, feel free to tweak the Index to your preference: it's set up as a convenience, not an expression of opinion :)). On the Greek, this was the best I could do. It's not my area, but it looks to be roughly as good as the English text, or at worst a little bit worse. It will certainly need a competent hand (i.e. not mine!) to correct it in any case.
Please let me know if you find anything that needs tweaking. I can swap around images, or insert placeholders etc., and regenerate the DjVu fairly easily now while I have the source files sitting around. There are also some knobs I can tweak on the OCR to try to get better results if there are specific problems, but absent anything pathological the current version is probably within spitting distance of how good we can get it. --Xover (talk) 06:58, 16 February 2020 (UTC)
Thanks! --EncycloPetey (talk) 14:52, 16 February 2020 (UTC)

Re images: I see that pages 56 & 57 of the text (DjVu pages 62 & 63) are facsimile printings of a specific first edition text. These would be worth having as images. --EncycloPetey (talk) 16:51, 16 February 2020 (UTC)

@EncycloPetey: File:The Poems of Sappho (1924), p.56.png, File:The Poems of Sappho (1924), p.57.png. --Xover (talk) 20:07, 16 February 2020 (UTC)

template:do not move to Commons

Hi. Noticed that you have changed the parameter of this template on some works (eg. [1]) from expiry to expires … nada. Also to note that it is configured for the year of movement, rather than the year of expiry, ie. aimed at the 1 January date rather than the 31st Dec date. Yes it is different <shrug> we survive. — billinghurst sDrewth 06:07, 22 February 2020 (UTC)

@Billinghurst: Thanks. I always struggle to keep that parameter name straight, mostly because |expiry= is both bad grammar and reads awkwardly in a mnemonic sense. In the edit you link I would guess I was doing general cleanup and at some point changed my mind about some modification to the Commons template, "restoring" it by hand and ending up using my flawed recollection of what the parameter name is. I see |expires= isn't even a valid alias for |expiry= so that's not just pointless but actually broken too.
However, on the right year to use I'm confused. The docs clearly say (and the template code reflects) to use the last year of the copyright term in that parameter, but here you seem to be saying to use the first year after the copyright has expired? --Xover (talk) 06:36, 22 February 2020 (UTC)
I have later boot prints on the template, primarily around being able to have a parent category, and subsidiary works. I have nothing on the original nature of its design. Yes, it is done so the expiry shows in the new year when you can move it—so it is +1—which makes sense for its "voila!" moment, though confusing against the YoD for PD templates. <shrug> — billinghurst sDrewth 10:25, 22 February 2020 (UTC)
@Billinghurst: I'm sorry, but I'm not following. Are you saying you prefer to use the template in contravention of its documentation and the semantics of that parameter in its code? If so, what is the effect you are trying to achieve? --Xover (talk) 13:50, 22 February 2020 (UTC)
I am explaining what the template does, if the documentation does not match the action then the documentation is wrong. The template takes the year of expiry, not the last year of life of copyright as the PD-old-nn series takes. And not exactly enchanted with how you worded your statement. — billinghurst sDrewth 10:22, 23 February 2020 (UTC)
@Billinghurst: If my message caused offence then I apologise: that was certainly not my intention! I merely intended to indicate that I do not understand your preceding message, and to ascertain your intended meaning.
The documentation is unequivocal: it says {{Do not move to Commons|expiry=_last year of copyright_}}. But regardless of the documentation, if you look at the code of the template it is also clearly intended to be used with the last year of the copyright term. The parameter is used to display the "do not move banner" and to place the page into a "not suitable for commons year" category, both of which are removed or hidden once the copyright has expired (once current year is larger than the year given in |expiry=) and replaced by the category "media now suitable for commons".
So what I'm trying to figure out is what effect you're trying to achieve, partly because, unless what you want is in direct conflict with its design, it seems likely that the template can be modified such that gives you that effect without abusing (and, I stress, I here use that term in a purely technical sense!) the semantics of the parameter. --Xover (talk) 11:00, 23 February 2020 (UTC)

Index:Public School History of England and Canada

Found some scans of a possibly identical edition:- https://archive.org/details/publicschoolhist00robe/mode/2up

Is there a process for doing a replacement? ShakespeareFan00 (talk) 12:07, 22 February 2020 (UTC)

@ShakespeareFan00: No particular process, especially since this index is for individual image files rather than a DjVu or PDF. But if we can determine that it is the same edition I can generate a DjVu file with OCR and move the Index over. --Xover (talk) 13:53, 22 February 2020 (UTC)
It is the same edition, but it's not a simple replacement, as the Archive.org copy is missing the title page, which is present in the JPEG scans currently on Wikisource. Any DjVu file might need to be manually patched for the title pages... ShakespeareFan00 (talk) 10:00, 23 February 2020 (UTC)
Also - https://archive.org/details/publicschoolhist00robeuoft/page/n5/mode/2up which IS also identical apart from the rear cover (and has the title pages) ShakespeareFan00 (talk) 10:00, 23 February 2020 (UTC)


https://archive.org/search.php?query=public%20school%20history%20of%20england%20and%20canada - Amongst these editions there must be one that's identical. ShakespeareFan00 (talk) 10:00, 23 February 2020 (UTC)

Page realignments needed.. Thanks... ShakespeareFan00 (talk) 20:55, 28 February 2020 (UTC) Now Done ShakespeareFan00 (talk) 22:59, 28 February 2020 (UTC)

Poet Lore, volume 4

Hello. I would like to ask you for help with File:Poet Lore, volume 4, 1892.pdf. There are three pages missing, which I have extracted from some other copy (which has different pages missing). Two of them are here and they come before the title page with the poem by Tennyson (which is currently the 5th page of the file and should move to become 7th page of the file). The third missing page is frontispiece and is here. It should come before the first page of No. 1 (with the text A Modern Bohemian Novelist…). Could you then also convert the file into djvu, please?

Thanks very much. --Jan Kameníček (talk) 09:35, 4 April 2020 (UTC)

I have only now noticed the notification above. No problem, non-wiki duties have priority. Hope you are fine. --Jan Kameníček (talk) 12:47, 4 April 2020 (UTC)

Hyphenation with italics...

User:ShakespeareFan00/Sandbox/hyphtest

Based on some concerns I had I wrote some minimal test cases...

None of the current approaches is ideal: because of how the tags around the start get interpreted, they collapse into a bold with a stray ', which is clearly not what is typically desired.

The parser here IS working as designed, but the combined italics over a Page gap might be something that needs to be looked at again. (The other concern arises with follow refs as well.) ShakespeareFan00 (talk) 11:27, 10 May 2020 (UTC)

@ShakespeareFan00: I don't think I'm understanding correctly what problem you are trying to address here? Why can't the simple approach that I just added to your test cases be used? --Xover (talk) 17:25, 10 May 2020 (UTC)
That worked :) It's always the small things. It means I can update the Help: page accordingly; {{hws}} and {{hwe}} are not now needed. ShakespeareFan00 (talk) 19:38, 10 May 2020 (UTC)

List with a DIV..

User:ShakespeareFan00/Sandbox/Listglitch

This is getting silly. If MediaWiki can't handle very simple things like placing a DIV-based template inside a list, then certain parts of the code responsible for handling that kind of markup need to be completely re-thought. This isn't a new issue; it is PRECISELY the issue that was reported about 2-3 years ago. (And it has been known about in a related form since at least 2007!) Head meeting desk repeatedly (sigh). ShakespeareFan00 (talk) 01:00, 11 May 2020 (UTC)

@ShakespeareFan00: Again I'm not sure what the specific problem you're seeing is (your testcase page contains lots of things that may or may not be the issue you're concerned about). But for div inside list items, the most obvious issue isn't the div as such, but rather extraneous newlines in the template. Due to html whitespace rules these are usually effectively ignored, but in certain contexts newlines have semantics for the MediaWiki parser. Lists being one such context: list items in MediaWiki cannot contain raw newlines due to the simplified list syntax relative to html lists. I've removed the extraneous newlines in the EB1911 template you used as a test case just as a demonstration.
Also, in general, it often isn't an issue of MediaWiki being unable to handle whatever the issue is; but rather that when you have an extremely simplified syntax like wikimarkup, that's used to generate relatively complex things like full-blown html, you're going to run into limitations and tradeoffs. The lack of end tags in wikimarkup makes it impossible for the parser to function without inference, and when inference rules start stacking you can't easily tweak one without knock-on effects for others. This stuff is hard, in addition to suffering under Wikimedia's lack of resources and Wikipedia blinders. --Xover (talk) 05:39, 11 May 2020 (UTC)
Which in respect of that specifc template was the problem I was trying to solve.ShakespeareFan00 (talk) 08:10, 11 May 2020 (UTC)
The issue about line breaks inside list items is mentioned at w:Help:List#Nested_blocks_inside_list_items, but do normal people actually read (or know to look for) documentation, which may not even be on the same wiki as the one they are editing on? As I've said in the past, it would be nice if the inference rules were formally documented, so myself and other contributors aren't relying on finding a specific line in a Help: page (which may not even be on the same wiki) to learn what they MIGHT be, as opposed to what they actually are. The notes there also don't take into account the DIV/SPAN whitespace-handling issues that have caused confusion elsewhere. I am aware that this is a long-standing issue.

(ASIDE: Converting the PRE block to a SYNTAXHIGHLIGHT resolves the issue of line feeds in respect of source-code examples. Generally, if it's multi-line source code it should be using the latter, not the former, now. Where would be the best place to document this?) ShakespeareFan00 (talk) 08:25, 11 May 2020 (UTC)

The second issue concerns the generation of extra markers. The formatting should (ideally) be the same for the internal conversion vs. that placed inline? (I am wondering if what's being generated isn't the same code.) ShakespeareFan00 (talk) 08:46, 11 May 2020 (UTC)
And that proved to be correct. :)
*  Item 
** Sublist.

Doesn't open a new list for the sublist. Subtle, but easy once it's understood. ShakespeareFan00 (talk) 08:58, 11 May 2020 (UTC)

Long term , UKSI formatting...

Template:Uksi/styles.css

Requesting a review of the approach here, and the template family concerned. (The intent is to EVENTUALLY replace the need to have direct numbering except for proofreading, and do it all with CSS counters if they are supported.)

Not urgent, but the approach here should allow for some simplification of the mess that some of the higher level templates are.

It may, and I say may, also be possible to use something like this to make {{numbered div}} more usable, or even ultimately rescue the {{cl-act}} family. ShakespeareFan00 (talk) 22:01, 12 May 2020 (UTC)

@ShakespeareFan00: The approach looks generally fine, though quite complex. You need to keep in mind that you can end up in a situation where the solution has so many nuances and complexities that it ends up being just an extra level of abstraction and complexity. I don't know the source material this is intended for very well, and so I don't really have any good sense of whether that's a risk here, but it's one factor to keep in mind when designing such things.
I would also strongly caution against relying on CSS counters or any other algorithmic way to generate content when the details of that content matters (as paragraph numbering in legal texts does). Every single time that content is rendered the numbering is generated anew, and thus every single time there is a potential that something can go wrong (changes in MediaWiki's parser, changes in web browser CSS engines, different web browsers, etc. etc.). It also means it can change over time due to changes in the standards that define those algorithms. And algorithms are inherently more complex than hardcoding the content in the first place. CSS counters are close to programming complexity, but even HTML numbered lists share a lot of this type of fragility.
In addition, if we by some method generate part of the content of the page, we hide that content from those editing the page (which may be someone doing maintenance or looking to reuse parts elsewhere). That may be a worthwhile tradeoff if we gain a lot of value from it (which it looks like the UKSI family may well do), but it's another factor to keep in mind. I recall with horror the template—was it modern? I can't recall—that wrapped almost every other word in a template with multiple arguments, resulting in the whole page being just a soup of markup. If you've ever seen raw PostScript data… That's a perfect (extreme) example of why this is a problem.
That doesn't mean we can't use these approaches at all, but it does mean we need to be careful to not fall into the trap of making the solution so fancy that it defeats the purpose. --Xover (talk) 06:13, 13 May 2020 (UTC)

Play by Synge

Could you please create a DjVu for J. M. Synge's play The Playboy of the Western World from (external scan)? It was published in 1907 (Dublin) [1912 reprint] and the author died in 1909, so there should be no issues uploading to Commons. We have a dearth of works by Irish authors. --EncycloPetey (talk) 21:02, 14 May 2020 (UTC)

Since Xover is on semi-wikibreak, I queued this using the IA Upload tool. It should show up soon here: File:The Playboy of the Western World.djvu. If there are quality issues, Xover may know better than me how to correct them, but this should at least get things started. -Pete (talk) 21:07, 14 May 2020 (UTC)
Will that work if there is no DjVu available at IA? --EncycloPetey (talk) 21:13, 14 May 2020 (UTC)
Yes, the IA Upload tool has the ability to generate a DJVU based on the JP2 files at Internet Archive. It takes a bit longer to process (a few hours, I'd guess). By the way, I noticed that Synge's work "Riders to the Sea" appears to be quite significant as well, so I also uploaded that one and began match & split. I'm not 100% sure the transcription is for the edition it claims, though, as the transcription has a list of "Characters" where the original has a list of "Persons". Anyway, hopefully any differences are minimal and easily detected. -Pete (talk) 21:26, 14 May 2020 (UTC)
@EncycloPetey, @Peteforsyth: For simple cases with little need for manual page-fiddling and such, the computer does most of the work (I just remove the extra scan reference images at the beginning and end, and any botched images interspersed in the image series). I can usually find the time to grab the download and set it processing even when I'm otherwise busy. Case in point, I've just set the computer to crunching this scan so it should have a DjVu ready for upload by the time I'll be sufficiently caffeinated tomorrow morning. Let me know if you want it or if you prefer the ia-upload version (ia-upload grabs the OCR text from IA, which uses Abbyy Finereader, which some people prefer to the results Tesseract produces). --Xover (talk) 21:42, 14 May 2020 (UTC)
OK, thanks for the background, and sorry if I jumped the gun. I'll not interfere further on this one. -Pete (talk) 21:54, 14 May 2020 (UTC)
This one looks like a simple case; just the extra scan reference images from front and end to be removed. I couldn't say which OCR is to be preferred. --EncycloPetey (talk) 22:19, 14 May 2020 (UTC)
@EncycloPetey: I should have used the IA upload tool to remove page 1. I will remove the first and last pages tomorrow and upload a new version. -Pete (talk) 02:42, 15 May 2020 (UTC)
@Peteforsyth: If I've understood Xover's comment above, then he's already generating a replacement DjVu. --EncycloPetey (talk) 03:24, 15 May 2020 (UTC)
@Peteforsyth: Never worry about simply trying to be helpful! That's always going to be appreciated, and if well-meaning assistance should ever mess anything up I'll be sure to let you know (as I hope you will for me too). In this particular instance I very much doubt EncycloPetey minds having multiple options to choose from, and, as mentioned, it cost me very little effort so it would hardly be a waste worth mentioning even if it ultimately went unused.
@EncycloPetey: (and Pete) I took the liberty of uploading the DjVu I generated over the one Pete generated with ia-upload (rather than separately), because on checking I found that the ia-upload one had that darned annoying text layer offset problem. It's caused by an unfortunate interaction between the way ia-upload generates these DjVus and the really rather shockingly approximate API at the Internet Archive, and is almost impossible for software to correct for (I know Sam has looked at it). Essentially, IA is returning plain incorrect information about page numbers, and ia-upload relies on that page order being correct in order to extract OCR text from the XML file at IA and associate it with the right pages in the DjVu. Once one page is incorrect every subsequent page will be too, and multiple such errors will compound.
Regarding the OCR engines… ABBYY FineReader (a commercial product that IA uses) used to generate better quality OCR than Tesseract 3.x (the open source engine used by Phe's tools), so some people prefer the OCR text from IA. In my experience, Tesseract 4.x (a major rewrite with a completely new engine), ABBYY FineReader 8.x, and Google Vision (what the Google OCR gadget uses) have comparable quality results. Each has strengths and weaknesses, and some of them handle certain languages better, but I find them essentially interchangeable in terms of OCR results.
In any case, Pete uploaded the DjVu at File:The Playboy of the Western World.djvu and I've uploaded my version over that. Please do let me know if there's a problem with it or if it needs tweaking. And I'm happy to do these DjVus, so never hesitate to ask if you have need of that. --Xover (talk) 07:02, 15 May 2020 (UTC)
It's true, ia-upload has some annoying bugs like that! :-( It sounds like this file's all sorted now, but sometimes I find that the IA PDF is of good enough quality, and has a text layer, so the whole question of DjVu can be sidestepped. Sam Wilson 07:17, 15 May 2020 (UTC)
@Samwilson: It's my understanding that DjVu is strongly preferred, which is why I've been trying to hone my skills in generating them. But I'm not fully familiar with the reasons. (I know it's a more open format, which may be reason enough.) @Xover: Thanks for the explanations. If IA is generating info that is just plain wrong, do you know if anybody has informed them of that? I'd be happy to reach out if it wouldn't be redundant. -Pete (talk) 18:32, 15 May 2020 (UTC)
@Peteforsyth: So far as I know, nobody has talked to IA about this. If you're looking for a programming project I'm sure Sam would appreciate the help on ia-upload!
The text layer offset is discussed in phab:T194861. ia-upload is (I think) relying on the pre-generated XML file at IA combined with information from the API for information about the pages. As Mpaa comments there, the XML file does not always contain all the page images found in the .zip, and ia-upload's algorithm (processing them sequentially) compounds the problem. It's been a while since I looked at it, and my code uses a different approach, but as I recall what I found was that the scan reference images at the start and end, and any mis-scans in the middle, are not correctly reflected. I suspect they manually correct for these in their book reader. In any case, if you're trying to process the .zip files automatically you'll run into trouble with this.
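Not ia-upload's actual code, but the shape of the bug is easy to sketch: pair the image list and the OCR page list sequentially, and one missing entry shifts every page after it.
images = ["p001", "p002", "p003", "p004"]   # page images in the .zip
xml_pages = ["p001", "p003", "p004"]        # the XML skipped the mis-scan p002
for img, ocr in zip(images, xml_pages):
    print(img, "gets the text of", ocr)     # every page after p001 is now offset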
My code avoids this problem because it generates new OCR rather than trying to import IA's OCR, but this, obviously, has the downside that you don't get IA's OCR (which may have been the very thing you wanted). It also means there are unavoidable manual steps to prepare the scan images before processing, making it unusable as a general-use tool, unlike ia-upload which is just about as user friendly as it's possible to get this kind of tool. (I'm toying with the idea of making my tool available as an interactive tool at Toolforge/Labs, but I have limited time and I'm not sure there's all that much interest. The current commandline version is too hacky to be useful to anyone but the most techy.)
As for DjVu vs. PDF… There are lots of reasons. Ironically, the biggest reason the community tends to prefer DjVu is that MediaWiki's extraction of OCR text from PDF files is atrocious and much worse than its extraction of the same text from DjVu files. You can literally open the same PDF file in Acrobat and copy the text out and get better results than MediaWiki's. But this may be at least partly due to the biggest issue for me: there's a definite dearth of even semi-decent tools for working with PDF files, especially in an automated way. My guess is that this is because PDF is a wholly visually oriented format, so there is very little structure or sense to PDF files that a tool could manipulate.
DjVu by comparison has an extremely structured system with levels for a whole document, referenced sub-documents (you can even reuse binary chunks between pages for hyper-optimization!), pages divided into areas, and OCR text with regions, columns, paragraphs, lines, words, and characters; all with positions and extents. The DjVuLibre tools are designed to let you manipulate all this from the command line (I haven't tried doing it as a library, but it does support that), and are fairly decently scriptable. That DjVu(Libre) is free in all the ways that matter (licence, patents, open source, open specification, free-as-in-beer, etc.) where PDF has uncomfortable caveats on several points is a secondary but not unimportant concern.
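For instance, pulling the hidden text layer of a single page out for inspection is a one-liner with those tools (a sketch assuming DjVuLibre is installed, with Python as the glue; the file name and page number are placeholders):
import subprocess

# dump the hidden text layer of page 105 as an s-expression
sexpr = subprocess.run(
    ["djvused", "book.djvu", "-e", "select 105; print-txt"],
    capture_output=True, text=True, check=True,
).stdout
print(sexpr)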
And, ultimately, from what I can tell DjVu—because it is designed specifically for our use case—is much better suited for our needs at the format level. For example the ability to separate a page image into multiple layers, where a single-color solid background layer can be encoded efficiently as essentially a single pixel; areas that will be occluded by a higher layer can likewise be compressed away; advanced wavelet compression for "photographic" layers (anything but the simplest black and white stuff), but scaling down to simple bitonal (and thus highly space efficient) encoding. In one recent test a file went from several hundred MB to 3.5 MB because I decided to optimize for size rather than fidelity (it was a badly crushed B&W Google scan with lots of noise that the "photographic" compression didn't do well on, and which didn't suffer markedly from bitonal encoding).
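Those two encodings are separate DjVuLibre encoders, so the trade-off is easy to test on a single page (a sketch; cjb2 and c44 are the real tool names, the file names are placeholders):
import subprocess

# bitonal encoding: tiny files, fine for clean black-and-white pages
subprocess.run(["cjb2", "-clean", "page.pbm", "page-bitonal.djvu"], check=True)
# wavelet "photographic" encoding: keeps fidelity for anything more complex
subprocess.run(["c44", "page.ppm", "page-photo.djvu"], check=True)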
But I suspect that ultimately the community's preference for DjVu actually boils down to that poor OCR text extraction from PDFs in MediaWiki. The rest are somewhat too esoteric issues for most people here. --Xover (talk) 13:02, 16 May 2020 (UTC)
@Peteforsyth: IA has additional problems with scans in its library. Many of them are imported from Google, and quite often no one checks the scans for basic quality control. I have not infrequently found scans where some of the pages are upside-down, or scans where pages were missing (or duplicated), or scans where the corner text of every page was obscured by the thumb of the person doing the scanning, and many other issues. Given that these visually obvious issues occur in IA scans, it is unlikely that text layer offset (which cannot be seen easily) will be caught and corrected. --EncycloPetey (talk) 22:02, 16 May 2020 (UTC)

Whitespace (again)

Page:The record interpreter- a collection of abbreviations.djvu/449

Unless you leave 2 lines between plain-lists, the parser backend seems to collapse things.

Normally between 'grouped' items you would leave a single line feed.

One consistent rule to apply, all the time, would be nice. ShakespeareFan00 (talk) 14:24, 17 May 2020 (UTC)

Solved by adding additional parameters to {{plainlist/s}}. Review requested. Why would it not be possible to style the UL element directly? ShakespeareFan00 (talk) 19:57, 17 May 2020 (UTC)
@ShakespeareFan00: Presumably because the ul doesn't actually appear anywhere: MediaWiki's wikimarkup infers the ul from the presence of a list item. But {{plainlist}} could of course just emit the raw HTML directly instead of relying on wikimarkup. That would probably solve the extra margin too, as I'm pretty sure the default MediaWiki stylesheet adds a 1em margin for all div elements. Then again, you don't really need list markup for Page:The record interpreter- a collection of abbreviations.djvu/449: it would make just as much sense to terminate each line with a <br /> and then use the normal whitespace rules to separate groups. --Xover (talk) 20:18, 17 May 2020 (UTC)
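That is, something like this (placeholder entries, not the actual page content):
First entry.<br />
Second entry.<br />
Third entry.<br />

First entry of the next group.<br />
Second entry of the next group.<br />
The blank line between the groups is then handled by the normal paragraph rules, with no list markup involved.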

Cl-act-p

And why it should be deleted or re-written from scratch completely:

The exact problem was that:

{{cl-act-p/1|s1=1|text={{cl-act-h||Test heading}}{{lorem ipsum}}}} 

{{cl-act-p/1|s1=1|{{cl-act-h||Test heading}}{{lorem ipsum}}}}

Do not generate the SAME output: in the latter it's trying to put the ENTIRETY of what should be the 'text' inside the ID field of the DIV it's generating. This is a mistake somewhere in the Lua/markup combination that's generating it. It's too complex for me to understand what the code is doing, and thus it can't be fixed on a reasonable time scale; therefore it's time someone else started from scratch based on the test-cases and details provided.
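(For reference, one standard MediaWiki check worth running here, hedged since I haven't dug into the module: pass the body explicitly as a named 1= parameter, which rules out any ambiguity in how the unnamed argument is parsed:
{{cl-act-p/1|s1=1|1={{cl-act-h||Test heading}}{{lorem ipsum}}}}
If that renders correctly, the bug is in how the module reads positional arguments rather than in the markup itself.)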

(I've also reverted cl-act-t back to an earlier version because of some completely screwed-up margin handling in more recent revisions. If something isn't working, go back to a version that's broken in a way that WILL hopefully be understood better.) ShakespeareFan00 (talk) 15:17, 17 May 2020 (UTC)

At the very least can you check that the Lua code responsible for decoding arguments is decoding things correctly? (i.e. the embedded template should not be parsed separately, but as part of parameter 1 (unnamed)). ShakespeareFan00 (talk) 15:39, 17 May 2020 (UTC)

Well, after a LOT of headaches I got this working again. However, I still think it needs a rethink, as it was a pain to find where it broke down. It still needs a re-write or more extensive documentation. ShakespeareFan00 (talk) 23:45, 17 May 2020 (UTC)

Index:Arthur Cotton - The Madras Famine - 1898.djvu

Could you align the pages of this short work (35 pp.), please? I put up a message on WS:S, but no one responded. TE(æ)A,ea. (talk) 12:14, 2 June 2020 (UTC).

@TE(æ)A,ea.: Done. And apologies for not picking up the message at WS:S. --Xover (talk) 12:40, 2 June 2020 (UTC)
I thank you for fixing the OCR text. What I had meant to ask is for you to move the pages: /6 should be at /5, &c. TE(æ)A,ea. (talk) 12:42, 2 June 2020 (UTC).
@TE(æ)A,ea.: Done. Let me know if there are further issues. --Xover (talk) 13:00, 2 June 2020 (UTC)
It has worked. Thank you. TE(æ)A,ea. (talk) 18:07, 2 June 2020 (UTC).

Copyright records -

https://archive.org/details/copyrightrecords?sort=-date

Can someone arrange a mass upload of these to Commons? ShakespeareFan00 (talk) 07:11, 3 June 2020 (UTC)

@ShakespeareFan00: I'm surprised they're not already there. I'll try to mull a bit on what can be done there. --Xover (talk) 07:19, 3 June 2020 (UTC)
I've also asked @Fæ: over on Wikimedia Commons about this. I seem to recall they've been involved with mass upload projects before. ShakespeareFan00 (talk) 07:22, 3 June 2020 (UTC)
Commons:Commons:Batch_uploading/CCE ShakespeareFan00 (talk) 07:39, 3 June 2020 (UTC)

Poet Lore, volume 34

Hello Xover, I would like to ask you for help with volume 34 of Poet Lore.

There is quite a good copy from the University of California at HathiTrust, but it is missing some title pages and contents pages. It has got the title page and the contents page for the Spring number, but not for the other three numbers. What is more, the title page and the contents page of the Spring number needs to be moved after the title page of the whole volume. There is also another copy from the University of Michigan. It is a very bad copy with many missing or badly scanned pages, but it has got all the title and contents pages. Those for the Summer number are between pages 158 and 159, for the Autumn number between 316 and 317, and for the Winter number between 474 and 475. Could I ask you to add the missing pages to the California file and also to convert it to .djvu?

I have already downloaded both pdf files to my computer, so if it helped you, I could upload them to Commons so that you can get them from there. Otherwise it might be better to upload only the corrected file. --Jan Kameníček (talk) 21:47, 6 June 2020 (UTC)

@Jan.Kamenicek: File:Poet Lore, volume 34, 1923.djvu. Please check that the result is as you intended. --Xover (talk) 13:44, 7 June 2020 (UTC)
Absolutely perfect! Thank you very much. --Jan Kameníček (talk) 14:43, 7 June 2020 (UTC)

IA uploads via tablet

I would like to upload a series of books. There are problems though:

One, I wondered if I hit an upload limit when I had trouble with the children's poetry book.

Another, the best scan is Google and it is a kludge. https://archive.org/details/vol1lettersofmar00mary

Another problem is that if the libraries don't reopen I will never proof these. Me and this tablet won't.

I listened to this series from Librivox on my mp3 player. I think I had to remix them so the player would get the order right. They were great while I was knitting. I have no idea what my comprehension was (probably my mind wandered), but overall it was a nice time in a cold winter.

Another problem is I love them. For no reason.

Can we discuss these problems?--RaboKarbakian (talk) 19:18, 10 June 2020 (UTC)

@RaboKarbakian: Pardon the intrusion: re the tablet (assuming you do not have a PC), have you considered a Bluetooth keyboard? They're pretty cheap these days and they mean you can use the full tablet screen for display, and some even have a laptop-style touchpad in them. Perhaps then you can turn the tablet sideways to fit the Wikisource content nicely on screen? Happy to help with modifying CSS to get what you need on the screen.
I thought about things but didn't get to keyboards. I do image work. I am nearsighted, and the tablet is 7 inches at its widest. Great for video and audiobooks but not for image work.
Fair enough, I can't imagine how to do image stuff on a tablet without wanting to throw it at someone! What kind of tablet is it? There might be other options if it's all you have and you want to proofread (e.g. casting to a TV, HDMI/MHL video out, etc.). But, to be honest, nothing is going to be ideal. Still happy to lend a hand if you'd like to explore options. Inductiveloadtalk/contribs 17:27, 11 June 2020 (UTC)
I can also upload the scans if you like. Inductiveloadtalk/contribs 20:28, 10 June 2020 (UTC)
1844 volumes are now at Index:Letters of Mary Queen of Scots - Strickland - 1844 - Volume 1.djvu and Index:Letters of Mary Queen of Scots - Strickland - 1844 - Volume 2.djvu - I needed a test file for testing a script anyway. The margins are a bit tight, but it seems to have OCR'd OK. Inductiveloadtalk/contribs 14:28, 11 June 2020 (UTC)
That's great! I can't believe the ocr is good! (it had all the marks of problems....) Thank you so much!--RaboKarbakian (talk) 14:39, 11 June 2020 (UTC)
You're welcome. I fixed a bug in the OCR for v2, seems to be working throughout the file now. It's right on the edge of being a mess - another few mm less margin and it would cut off a bit of word on every line which would be rather painful! Inductiveloadtalk/contribs 17:27, 11 June 2020 (UTC)

┌─────────┘

@RaboKarbakian: Your unreasonable and irrational love of this work is clearly an issue for a professional… :-)

…but other than that, I'm happy to help with / discuss anything you please. Is there anything left that Inductiveload hasn't been able to help you with? (Thank you for helping out, btw, Inductiveload: very much appreciated!) --Xover (talk) 13:05, 12 June 2020 (UTC)

Handy script function: auto refs

Just a quick note tangential to the Easy LST thing. There's a new function in the Save/Load Actions script for auto-inserting references, which always has gotten me down. You might like it. User:Inductiveload/save_load_actions#Auto_refs for deets. Inductiveloadtalk/contribs 11:10, 16 June 2020 (UTC)

@Inductiveload: Oooh, nice! Thanks for the tip. --Xover (talk) 12:06, 16 June 2020 (UTC)

template:center

Hi, Xover. It seems that your last edit to {{center}} caused some problems. If you look e.g. at The Czechoslovak Review/Volume 2/Bohemian Needlework and Costumes, you can see that both the title "Bohemian Needlework and Costumes" (in xx-larger font) and the author "By Renata Tyrš" (in smaller font) are written on one line, while the author should be under the title. I have observed the same problem in other pages, too. --Jan Kameníček (talk) 21:41, 16 June 2020 (UTC)

@Jan.Kamenicek: *sigh* MediaWiki should be classified as a health hazard, it pushes my blood pressure so much. :(
{{center}} outputs (I'm simplifying):
<div style="text-align:center;">
{{{1}}}
</div>
Where {{{1}}} is whatever the first argument passed is (the two header lines in your example).
In HTML, the newline after the opening tag and before the closing tag explicitly do not matter (the whitespace rules in HTML essentially say they disappear), except in a context where you have explicitly said that you want whitespace to matter (think <pre>…</pre> tags; or <poem>…</poem>, but poem is not actually an HTML tag, it's provided by a MediaWiki extension). In other words, the "right" way to code that template is:
<div style="text-align:center;">{{{1}}}</div>
In that way the template will not cause extra unwanted newlines when inside <pre>…</pre>, and outside pre contexts it shouldn't matter.
However—and this is what's rage inducing in trying to deal sensibly with the formatting we need here on enWS—in between the template code and the final rendered HTML in the web browser, MediaWiki's parser inserts itself. In the former "with-linebreaks" case, what MediaWiki outputs to the web browser is this (I'm simplifying away intermediary formatting tags that do not matter for this example):
<div style="text-align:center;">
  <p><i>Bohemian Needlework and Costume</i></p>
  <p><b>By Renata Tyrš</b></p>
</div>
Note in particular those <p>…</p> tags!
In the latter "no-linebreaks" case, however, what gets output is this:
<div style="text-align:center;">
  <i>Bohemian Needlework and Costume</i>
  <b>By Renata Tyrš</b>
</div>
Now the <p>…</p> tags have disappeared! And since HTML's whitespace rules say that whitespace doesn't matter, the two header lines are shown on a single line in the browser.
What's happening is that the wikimarkup that we write (and of which the contents of templates are part) is first parsed by MediaWiki and turned into HTML, and that HTML is then sent to the web browser that re-parses it and renders it to the user. When MediaWiki parses the wikitext it applies heuristic rules to account for the differences between what humans write and what HTML requires, in this case regarding paragraphs in text. Humans separate paragraphs with two newlines, but HTML requires a paragraph to be surrounded by <p>…</p> tags. So MediaWiki tries to add those tags when it thinks the human intended to mark a paragraph, but to not do so if it thinks the human intended something other than a paragraph. Note in particular that this is not the human telling MediaWiki what is the intent; it's MediaWiki guessing.
And what's biting us here is that one of the main ways MediaWiki determines whether a paragraph is needed is whether there is whitespace between any surrounding element (the <div> from {{center}}) and the bit of text in question (your two headers in this case).
The bottom line here is that this is just simply not fixable. We could forbid multi-line use of {{center}} (so each of your header lines was wrapped in a separate {{center}} template) and go back and change every single existing use to conform to that. That's obviously not a workable solution (it can't be automated, and is way too much effort for extremely little gain). So the only alternative is to say that {{center}} cannot be used inside pre-like contexts (anywhere whitespace is significant), and to create a separate centering template to cover this use case. Not exactly an elegant solution, but it's the only feasible one I can come up with.
In any case, thank you so much for letting me know the change caused a problem (and good catch!). --Xover (talk) 07:17, 17 June 2020 (UTC)

Problem at "Discoveries and Inventions"

I have reviewed through p. 7 of the 24 pages enumerated in the index, supplying the missing texts.
When I opened p. 8, on the left-hand side – instead of the English text – I saw a copy of the Polish original that is on the right-hand side.
I don't know how this came about or when. Is there some way to get the English version into the left-hand side, starting on p. 8, so that I may complete my review of the article?
Thanks.
Nihil novi (talk) 07:56, 17 June 2020 (UTC)
@Nihil novi: What you're seeing is the OCR text from the hidden text layer in the DjVu file. This text is preloaded in the editing field because when you are proofreading (transcribing) a work the OCR text is a useful starting point. Since you won't really use the text you can just go ahead and delete it, and then copy and paste the relevant part of the text from On Discoveries and Inventions. On the earlier pages I and Ankry had already created the page with text copied from the existing translation, so on those pages that was what you saw, but on the later pages (physical page 12 in the DjVu, which is the page labelled 8 in the scanned work) there is no content saved yet and so the software preloads the OCR text. --Xover (talk) 12:23, 17 June 2020 (UTC)

New scan

Hello,

Some bad news and good news. The Korea Copyright Commission says that they don't have any information about presumption of death for Son Jin-tae (for Translation: Changse-ga (1930)), and the other resource I contacted didn't reply in four days, so Translation: Changse-ga (1930) should probably be removed for no positive evidence of PD status. That's the bad news.

The good news is that I've scanned the 1937 Japanese source I talked about, and translated one page in Page:Woncheon'gang bon-puri, page 1.jpg. The questions are:

  • A 1937 Japanese work where the two transcribers died in 1954 and 1960 is allowed, right? That's what you said in the talk page, but just to make sure. (I'm assuming it's the Japanese law that's relevant because the transcribers were both Japanese.)
  • The 1937 book is a collection of several dozen unrelated works, so are you allowed to post individual works instead of the whole book? Unfortunately I couldn't scan the entire book, only four individual works.
  • Can I translate directly without transcribing the Korean and Japanese on the multilingual Wikisource?

--Karaeng Matoaya (talk) 12:08, 5 June 2020 (UTC)

@Karaeng Matoaya: A pity about Translation: Changse-ga (1930). I'll delete it shortly. Please keep in mind that it can also be undeleted if new information comes to light.
For the 1937 work… Where was it first published? If it was first published in ROK then the citizenship of the authors does not (afaik) matter for Commons or enWS policy purposes (for international copyright law it might; but there are limits to how complicated we can make this absent an actual complaint). If it was first published in Japan we may have a problem. The term of protection in Japan in 1937 was pma. 50, meaning it was in copyright there until 2010. The URAA date for Japan is 1996, meaning at that point its US copyright was restored and runs according to US copyright rules. That means 95 years from publication, or until 2032. In which case neither Commons nor enWS can host it. If it was first published in ROK it entered the public domain in ROK in 1990 and is public domain in the US for failure to observe US formalities (copyright notice, renewal).
The policy for translation is at WS:T#Wikisource original translations. It does require that "A scan supported original language work must be present on the appropriate language wiki, where the original language version is complete at least as far as the English translation." Also, strictly by policy you cannot have only a part of a work (it'd be an "excerpt" which we do not allow). However, since we're here talking about parts that are works in themselves, it's entirely possible the community will accept them in spite of policy. It's not something I would—with my admin hat on—go out of my way to enforce, but I can't really predict with any certainty what the community's sentiment would be if it were to end up at WS:PD. The best would be if you could get a complete scan of the book as published (if you have a bunch of images I can generate a DjVu file from them; and might even get some half-decent OCR text too) and work from that. The community has historically had a really high bar for deleting scan-backed works if the scan is complete, even if only a small part of it is proofread (and I would guess the same would hold true for a translation that is otherwise in compliance with policy).
@Karaeng Matoaya: Regarding the copyright… Note that I am above assuming we do not recognise Japan's annexation of Korea as legal, and thus the country of publication to be Korea rather than Japan. I am certainly no expert on this so I could be entirely mistaken. If we consider this geographic area (ROK) to legally be Japanese territory at the relevant time period the copyright assessment would change accordingly. But as I understand it the annexation was never recognised by most of the world, and was made void in 1965. --Xover (talk) 18:07, 5 June 2020 (UTC)
While I don't have Volume I (the relevant volume, 1937) of the work on hand as the library is closed for the weekend, Volume II (1938) is freely available on the National Library of Korea site and for the publishing details Volume I should be identical (technically it's actually the same book, but in Korea and Japan people split longer volumes into 上 and 下). The Volume II edition notice page says that the book was published in Keijō-fu, so South Korean copyright law should apply according to what you've said. I've updated Commons accordingly.
For the scan I'll see what I can do, but the work is at least 580 pages and the institution has only one working scanner to go around, so it might take a while. Also, the first few pages have the institution's seal and other notes written over them—should these be removed by hand after scanning?--Karaeng Matoaya (talk) 02:36, 6 June 2020 (UTC)
Oh, and when you said you can generate a DjVu file from images I take, you mean combining the jpg images uploaded at Commons into a single DjVu, right? So there's no need to give the images to you directly or anything like that.--Karaeng Matoaya (talk) 02:43, 6 June 2020 (UTC)
@Karaeng Matoaya: The "seal and other notes", if they are annotated on original pages of the work, are usually kept intact. If they are inserts we sometimes remove them, or sometimes keep them if they have some measure of independent notability. My guess without seeing them would be that it's best to just leave them in.
In order to generate a DjVu I will need to have the files locally on my computer, of course, but where I download them from isn't all that important. If it is convenient to put them in a zip file somewhere I can download them all in one go that will be the easiest; but so long as they're available somewhere I can grab them without too much trouble. If what suits your workflow best is to upload them directly to Commons then do that. Just try to make sure you use a naming scheme that is predictable and consistent, and put them all in a category for the work so they're easy to find. I also do not have any comprehension of asiatic writing systems, so if the image files are named using anything but Arabic numerals (1, 2, 3, etc.) I will need assistance figuring out sort order and such.
Oh, and to get any kind of useful OCR results I will need to tell the OCR engine what languages to look for (it can guess, but its guesses aren't very good). I got the impression the work contains both Japanese and Korean (Hangul? Or Hanja? Both?), which would be relevant. If the work contains other languages or scripts that would also be relevant to know (when the time comes to run the OCR).
Regarding the scanning, almost 600 pages is a pretty tall order even with plentiful equipment, so I see the challenge there. But I also imagine the holding institution would be very interested in getting this work digitised and transcribed (and since our files are always freely licensed, they are guaranteed to get a copy they can use however they see fit). I really encourage you to contact them to discuss this project and possible ways to approach it. For example, they may be able to facilitate access to the library on the weekend to do the scanning (or some other time that would be convenient for you). Or they may want you to do the scanning in a specific way (special lighting, including special colour swatches, etc.) in order to get results up to digital preservation standards. Who knows, they may even offer to do the scanning for you, or know of someone interested in helping with the transcription (this could be undergrads working on a related thesis, or some historical society, or an academic institution, or...). Also, once a scan exists, it is possible there would be people interested in helping with various bits at the Korean Wikisource. The essence of crowdsourcing is that many hands make light loads; or, don't work yourself to death trying to tackle this job alone if there are others that could be induced to help. :)
The institution made a single PDF file of the full scanned text for me, though I did need to pay. I've uploaded the file at commons:File:朝鮮巫俗の研究 上券.pdf in case you want to run OCR on it, though I'm not sure how well that will work. The book is written top-to-bottom and right-to-left, in that order. The introductory text, table of contents, explanation, etc., are all in Japanese, but the kanji are kyūjitai and I'm not sure if the engine will account for them. For the primary sources, Korean (full Hangul) is given on top, but it uses the geminate s character, which has been deprecated for several decades and which I'm not sure the engine will account for (you normally can't write them on Korean keyboards). The Korean also uses the Japanese iteration mark 々, which is never used in modern Korean. Also, a lot of kanji unfortunately appear to be blurred.--Karaeng Matoaya (talk) 12:48, 12 June 2020 (UTC)
@Karaeng Matoaya: I extracted the page images from the PDF and generated a new DjVu file from them at File:朝鮮巫俗の研究 上券.djvu. I set the OCR engine to look for vertical Japanese and vertical Hangul, in that order, in every page. You can see one result at Page:朝鮮巫俗の研究 上券.djvu/317 (you can change "317" to any page in the book to see what the OCR for that page is).
It is possible to change the precedence of the two scripts for each page, and it may give better results on pages in Hangul. There are also variant settings for horizontal scripts, and we can also try to specify the languages instead of the scripts (it can sometimes give better results). There are also a couple other settings I can tweak to try to improve it if it's too bad to use.
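To give a flavour of the knobs involved, the per-page invocation is roughly of this shape (a sketch assuming Tesseract 4.x with the jpn_vert and kor_vert traineddata installed; the file names are placeholders):
import subprocess

# emit hOCR with vertical Japanese taking precedence over vertical Hangul
subprocess.run(
    ["tesseract", "page-0317.png", "page-0317",
     "-l", "jpn_vert+kor_vert", "hocr"],
    check=True,
)
Swapping the order to kor_vert+jpn_vert is the precedence change mentioned above.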
In any case, no OCR is going to be perfect; it's always a matter of "degrees of awful". The point is more to get something that just good enough that you can start from a text that you correct rather than having to retype it all from scratch. --Xover (talk) 16:02, 12 June 2020 (UTC)

┌───────────────────┘
I've transcribed the Korean original for about thirty-five pages on the multilingual Wikisource, including three full works. (I don't have a Japanese keyboard, and after transcribing the Japanese for one page I figured it would be wildly inefficient.) Is it fine for me to start working on Translation-space right away for the three transcribed works, with a link to the scan-backed multilingual Wikisource?--Karaeng Matoaya (talk) 10:30, 13 June 2020 (UTC)

@Karaeng Matoaya: Great work! Yes, the rule is there needs to be as much transcribed of the original as for the translation. BTW, is this perhaps a project for which it would make sense to try to recruit other Korean and Japanese speakers to help? @Jusjih: Thoughts? --Xover (talk) 11:12, 13 June 2020 (UTC)
Thanks for the encouragement and everything you've done for me, really! It can obviously be pretty difficult to get used to the norms of a new project, but you've explained everything so well. I'm not sure if WS has barnstars, but I would give you one.
I've made a translation page here, if you didn't see, and I've finished translating two of the three works; do tell me if anything is wrong. I'm not sure if I'll have the time to transcribe and translate any more pages for a while once the third is done, but I'll see what I can do. The very first work (The Princess Bari, also the longest work) is extremely valuable from a literary viewpoint and should probably be translated at some point down the line.--Karaeng Matoaya (talk) 15:13, 14 June 2020 (UTC)
@Karaeng Matoaya: Thank you for the kind words. I'm happy I was able to be of assistance!
On Translation:Studies on Korean Shamanism, Volume I: very nice work! The only issue is that the translation is supposed to be done page-by-page in the Page:-namespace, so that the text can be verified against the original, and afterwards transcluded to the Translation: namespace for presentation. You'll want to start from Index:朝鮮巫俗の研究 上券.djvu where all the pages are linked (note that until a proper pagelist to map logical pages to the physical pages in the DjVu file is added, all the page numbers listed there correspond to the order of pages in the DjVu and not the page numbers as printed in the book). You might find Help:Adding texts useful for an overview of the process (it's just that we're here doing a translation of a non-English work rather than transcribing an English-language one). --Xover (talk) 12:41, 16 June 2020 (UTC)
Thanks! The pagelist has been arranged. Just a few questions. Sometimes Korean syntax demands that the English translation reverse the order of lines (in most cases this is a sentence with a subordinate clause, which precedes the main clause in Korean but follows it in English). How should this work if the two lines are split between pages?
Another question: the Japanese translation has footnotes, which are contained on separate pages at the end of each section. Should these be translated on their own pages, or on the same page where the footnotes occur?
There are also times when the Japanese translation is inaccurate. For example, one of the three works I translated mentions gods moving through the night sky by using the Morning Star as their wonang. The Japanese translation translates this as "cowbell," but wonang means "cowbell" on a cow but "a metal loop on a horse's bridle" on a horse. From context it's obvious that the second meaning is what's intended. Should the Japanese translation be taken into any account in situations like these?
Finally, am I expected to finish translating the entire book before sending it to the Translation: namespace, or can I do it on a work-by-work basis? I'm asking because I'm working on Korean mythology-related things on a number of different Wikimedia projects, and I'd like to link to specific WS translations on Wikipedia but I'm not sure if I'll ever finish translating all six hundred pages.
Thanks in advance!--Karaeng Matoaya (talk) 02:51, 18 June 2020 (UTC)
@Karaeng Matoaya: Apologies for the tardy response: I was distracted by some high-priority interrupts and at the same time your question regarding community expectations took a bit of thinking. In any case…
Regarding subordinate clauses switching place in translation, this is a situation that just calls for common sense. Ideally a translated clause should be on the same page where the original expresses that meaning (so they can be compared), but if that's not possible then any division that has some kind of internal logic is probably fine. The Page: namespace is a work area where we don't cater excessively to end users, so if there is some inconsistency there it doesn't matter so long as the final transcluded result is consistent.
Footnotes are usually transcribed on the page where they appear in the original. To connect the citations with the endnotes you can use the {{authority reference}} template. It's a little technically complex (you kinda need to understand a bit about how MediaWiki works to understand what it does and how to use it), so do please feel free to ask for help with it. (really, experienced Wikisource contributors have trouble with it!)
On inaccurate Japanese translation: Wikisource primarily hosts previously published works as they were published. We don't "correct" typos or poor translations, much like we do not remove or tone down racism, misogyny, or other pervasive problems with historical texts. Once we have a complete faithful text in place, we do have an option for "annotated" editions. In such texts we can add footnotes that explain that the translation is incorrect and similar, but that's always in a separate copy that is clearly labelled as an annotated edition.
Finally, your last question, that caused me some consternation… :)
Wikisource policy in general does not permit "excerpts" (arbitrary subsets of a whole work). The exception would be where a published book is a compendium of subsets that are themselves works, such as a collection of plays or a biographical dictionary where each entry is a distinct work (it stands alone, has a different identifiable author, and the author may even have independent notability, etc.). These can be transcluded individually as they are completed. The tricky bit in this specific case is that while your creation myths would seem to qualify as distinct works in themselves, the commentary, footnotes, etc. are not a complete work without the context of the remaining 600 pages. This does put it in conflict with the policy as written, and thus it could end up being deleted. However, while it is a bit hard to predict with any certainty, I do not expect the community would actually object in this case. It would still be best, of course, to have the complete work proofread; but I think everyone will have understanding that that's a pretty huge undertaking.
But on the scope of the undertaking… Don't underestimate how much can be accomplished by chipping away at it, a little at a time, over a longer period. We have truly massive efforts like DNB00 (63 volumes!) that have been ongoing for 15(!) years. Also, do not underestimate the power of collaborative crowd-sourcing efforts. This is the kind of effort that might very well attract others interested in the subject matter, and a small group working together can get through a 600-page work in a week in some cases (translation will take longer than mere transcription, of course). It's still a big job, obviously, but don't despair about what prospects it has for completion. It is definitely doable!
In any case, again, apologies for the late reply; and please let me know if there is anything I can do to help! --Xover (talk) 07:29, 21 June 2020 (UTC)

IA mirroring at Commons...

c:User_talk:Fæ/CCE_volumes. Fæ is doing a MASSIVE batch upload.

So would it be feasible for someone to provide them with a list of all the pre-1870 (and possibly pre-1925) works linked to at IA from English Wikisource, be they DjVu or PDF, and which do not yet exist on Commons? ShakespeareFan00 (talk) 16:39, 27 June 2020 (UTC)

The scripts, I think, do a SHA-1 check, so they won't upload files already on Commons. However they can't detect (for technical reasons) if something exists in DjVu vs PDF form. Something else that might be needed is identifying works on Wikisource/Commons where a DjVu copy exists on WMF servers but not a PDF. The intent would be to eventually have one "good" copy, and avoid duplicated uploads between (PDF, DjVu). Detection of identical editions (but differently SHA-1'ed scans) is beyond the ability of automated tools.
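(A minimal sketch of that kind of check, for the curious: the Commons API's list=allimages supports lookup by SHA-1 hash; the file name here is a placeholder.)
import hashlib
import requests

# does Commons already have a file with exactly this content?
with open("scan.djvu", "rb") as f:
    sha1 = hashlib.sha1(f.read()).hexdigest()
r = requests.get("https://commons.wikimedia.org/w/api.php", params={
    "action": "query", "list": "allimages", "aisha1": sha1, "format": "json",
})
print(r.json()["query"]["allimages"])  # a non-empty list means it is already there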

Of course if there are any major collections that are Commons compatible, and which aren't under consideration, the page I linked has a section for suggesting possible future 'batches'.

I will also note that The Catalog of Copyright Entries now has a roughly complete set of 1891-1978 volumes at Commons. Not that those are going to get transcribed at English Wikisource any time soon. :( ShakespeareFan00 (talk) 16:39, 27 June 2020 (UTC)

@ShakespeareFan00: pre-1925 would have to be uploaded to Wikisource: Commons policy requires PD in source country (which we can't know without manual detective work), so pre-1925 is safe only for US works. --Xover (talk) 17:07, 27 June 2020 (UTC)
Noted. Is there a list of pre-1870 works linked from English Wikisource then? ShakespeareFan00 (talk) 17:50, 27 June 2020 (UTC)
I've also asked (c:User_talk:Fæ#Batch_uploading_for_English_Wikisource?) about possibly asking here if there are any 1870-1925 collections that, whilst they couldn't be on Commons, might be suited to local upload for English Wikisource. (Policy here differing from Commons, as you indicated.) Fæ seems to respond quickly in respect to this project, so "inviting" them to consider a proposal locally would be entirely reasonable. Can you think of someone able to write such a proposal? (I.e. Wikisource hosts a "mirror" of pre-1925 IA scans that, whilst PD in the US, can't be on Commons.)

(I note the precedent that English Wikisource has included works by Agatha Christie, Bertrand Russell and P. G. Wodehouse, even though those aren't necessarily PD globally yet.) ShakespeareFan00 (talk) 16:27, 28 June 2020 (UTC)

PDF quality

Index:Catalog of Title Entries of Books Etc. July 1-July 11 1891 1, Nos. 1-26 (IA catalogoftitleen11118libr).pdf

The PDF displays fine when viewed directly. The proofread page image is SIGNIFICANTLY degraded.

Practical suggestions on how to get the page image in the UI here on Wikisource to be of high quality would be appreciated. ShakespeareFan00 (talk) 19:02, 30 June 2020 (UTC)

Please delete this file

File:Japan.pdf is redundant now because there is a new djvu file that I have uploaded that is superior. Eltomas2003 (talk) 03:03, 8 July 2020 (UTC)

djvu of The Hussite Wars

I am going to proofread Index:The Hussite wars, by the Count Lützow.djvu, but the djvu file is of quite bad quality in comparison with File:The Hussite Wars, by the Count Lützow.pdf (compare e.g. the frontispiece in the djvu file with the frontispiece in the pdf file). May I ask you to make a better djvu file? --Jan Kameníček (talk) 12:34, 9 July 2020 (UTC)

@Jan.Kamenicek: Any particular reason it has to be that specific (Google-made) scan? --Xover (talk) 13:28, 9 July 2020 (UTC)
No. As far as no page is missing, it should not matter. --Jan Kameníček (talk) 13:32, 9 July 2020 (UTC)
@Jan.Kamenicek: Done. Replaced with what looks to me to be a far better scan, and the pagelist adjusted. Since I had the scan images easily available I also did the three images. Feel free to change, undo, overwrite anything I did there though if you prefer it otherwise or simply prefer doing something yourself. --Xover (talk) 16:43, 9 July 2020 (UTC)
Perfect, thanks very much for your help again. It was probably not necessary to upload the logo, which is already available in quite a good quality at Commons, but otherwise I am glad for the help with pictures too as I do not have to lose time with them. --Jan Kameníček (talk) 19:52, 9 July 2020 (UTC)
Only now I have noticed that the original file which I uploaded does not contain the map, so your file is better in this sense too 👍🏼 I just added other authors of the map in Commons, as Lützow provided only its translation. --Jan Kameníček (talk) 12:24, 11 July 2020 (UTC)
@Jan.Kamenicek: Good to hear the results were to your liking. My first rule of thumb is to go looking for a scan with colour images at IA: black and white scans indicates the scans have been crushed and preprocessed to reduce file size, which also suggests scan quality has not been a top priority. There's no guarantee a colour scan is high quality, but it's a useful first rule of thumb to filter the candidates. And IA almost always have the original scan images (i.e. not recompressed and collected into a PDF) which makes them convenient to work with and get good technical results. Everything else needs human judgement, but I find starting there gets me to the best option faster. --Xover (talk) 17:47, 11 July 2020 (UTC)

Untranscluded

The pages

have been missed. — billinghurst sDrewth 13:55, 13 July 2020 (UTC)

Patience! I'm not done yet. :) --Xover (talk) 13:57, 13 July 2020 (UTC)
@Billinghurst: There, that should do it. --Xover (talk) 14:10, 13 July 2020 (UTC)
Oh. I wasn't even looking at relative times, had just pulled up the day's list and was working through. <shrug> — billinghurst sDrewth 05:06, 14 July 2020 (UTC)
@Billinghurst: It was just funny because the "new message on your talk page" notification literally interrupted me when I had the edit window open to transclude them. :) But I appreciate the headsup; these could very easily have been forgotten about. --Xover (talk) 06:32, 14 July 2020 (UTC)
My editing and checking time usually doesn't align with others (too early for UK, too late for US), so time-checking didn't even cross my mind when I pulled up the list. I do the new authors, and transcluded work checks once a day in quiet time. I probably won't even learn a lesson. :-) — billinghurst sDrewth 08:20, 14 July 2020 (UTC)

Would you be willing to do some work for transclusion?

I thank you, once again, for creating this index. I have (personal) scanned copies of the following works of which I have uploaded the text to Wikisource:

The images, as you may see, have not been scanned with the highest of quality; however, those pages can be re-scanned. The scans are of two-page spreads (like this scan was); however, I noticed that it was you who fixed the pages. If you have any interest in doing this, I can upload the images to the English Wikisource, and the whole file can then be transferred to Wikimedia Commons. TE(æ)A,ea. (talk) 18:49, 19 June 2020 (UTC).

@TE(æ)A,ea.: I can certainly take a look. If the scans are regular I can automate the extraction. "Regular" here means that the square box of the two pages are at roughly the same pixel coordinates within the overall image file in each scan image. Or, at a minimum, that I can find a pixel offset at which the images can be split that will not cut off part of the page on some scan images.
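To sketch what that automated split looks like (assuming Pillow; the gutter offset and file names are placeholders found by inspecting one spread):
from PIL import Image

def split_spread(path, gutter_x):
    # split a two-page spread at a fixed x offset into verso and recto
    spread = Image.open(path)
    w, h = spread.size
    return spread.crop((0, 0, gutter_x, h)), spread.crop((gutter_x, 0, w, h))

left, right = split_spread("spread-001.tif", gutter_x=2480)
left.save("page-001.tif")
right.save("page-002.tif")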
If they are not sufficiently regular it'd be a pretty tedious and time-consuming manual task, but for a reasonable number of pages it'd still be doable.
Incidentally, while there are no existing tools (that I have found) that can do this task automatically for very irregular images, I do have a long-term idea to investigate various algorithms that might be effective and make a tool to do it. If I ever get around to that (no promises whatsoever on that front!), your scans for these works would be good data to test it against. --Xover (talk) 19:28, 19 June 2020 (UTC)
Oh, and a PS in the interest of "credit where credit is due", it was Inductiveload that uploaded Index:Address of Theodore Roosevelt NPP - 1912.djvu (and Slowking created the Index). :) --Xover (talk) 19:32, 19 June 2020 (UTC)
  • With the exception of the first two pages of The North Star, all of the scans should be almost exactly regular. I have plenty of works that I could scan in addition to these; among these, the 1952 Poems of Patriotism looks quite promising (and, I believe, is not copyrighted.)
  • One other question: could you check my formatting on this template? I don’t know too much about advanced MediaWiki formatting, but I hope I’ve not messed up too much. TE(æ)A,ea. (talk) 22:53, 19 June 2020 (UTC).
    @TE(æ)A,ea.: Template looks good. If it works as you intended it's probably correct. You'll probably want to pick a more descriptive name for it though; two- and three-letter names are usually best applied to project-wide templates and not work-specific ones. --Xover (talk) 15:33, 20 June 2020 (UTC)
    I believe it is intended for a “project:” there are ten volumes, and each volume probably has around 1,500 articles. I modeled the template after the system used for the Encyclopædia Britannica. TE(æ)A,ea. (talk) 18:22, 20 June 2020 (UTC).
    @TE(æ)A,ea.: I meant "project-wide" in the sense "All of English Wikisource". Even a large 10-volume work is still just a single work: that it's large justifies having a special template at all, but the short (2, 3, 4 letters, as a rough rule of thumb) names are best reserved for templates that have wider applicability. --Xover (talk) 05:32, 21 June 2020 (UTC)
    @TE(æ)A,ea.: I had a go. See if this looks ok: File:Touch Not-Taste Not (1833).djvu. --Xover (talk) 17:09, 20 June 2020 (UTC)
    I just created the Index: page, and everything looks good. TE(æ)A,ea. (talk) 17:17, 20 June 2020 (UTC).
    @TE(æ)A,ea.: File:The Gypsy Lad of Roumania (1914).djvu. --Xover (talk) 17:35, 20 June 2020 (UTC)
    @TE(æ)A,ea.: File:Songs of Long Ago (1903).djvu and File:Primary Christmas Songs (1913).djvu. --Xover (talk) 05:25, 21 June 2020 (UTC)
    Oh, and PS: if you're going to do any new scanning, a few points that can improve efficiency… It is easier if the top left corner of the book is in the top left corner of the scan image instead of the top right (it saves a manual calculation of the offset since the tools I use take pixel values relative to the top left corner). Image file formats are preferable to PDFs, so if your scanner can output in TIFF, PNG, or JPEG instead of PDF that would be preferable (my tools can't process PDF directly, so this adds an extra extraction step; and PDF tends to add extra lossy compression that gets re-encoded when extracted, and then re-encoded into DjVu, with each re-encoding reducing quality). It is also possible to automatically crop image edges that are a uniform colour; so if you are using a camera (vs. a flatbed scanner) you might want to experiment with lighting and exposure settings to see if you can get the background to clip (go completely white). This may not be practical to do without specialised equipment, but I mention it just in case since I don't know what your setup is. None of these points are critical; they're just stuff that might be nice if they're easy to do. The only really critical thing is that the images are consistent within each work, so I can apply the same crop and split settings to all of them. --Xover (talk) 05:53, 21 June 2020 (UTC)
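    (For the curious, the automatic cropping mentioned is only a few lines with Pillow; a sketch with placeholder file names, assuming the background really does clip to pure white:)
    from PIL import Image, ImageChops

    # crop away uniform white edges
    img = Image.open("scan-001.tif").convert("RGB")
    bg = Image.new("RGB", img.size, (255, 255, 255))
    bbox = ImageChops.difference(img, bg).getbbox()  # bounding box of non-white pixels
    if bbox:
        img.crop(bbox).save("scan-001-cropped.tif")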
    The images for The North Star are aligned in that manner—it was the last work I scanned (at that time), and I finally got the system working correctly. As such, those images should not have any background which would need to be eliminated. They are, however, in .pdf format; I will try to correct that the next time I scan images. TE(æ)A,ea. (talk) 11:12, 21 June 2020 (UTC).
  • Quick question—For the pages, do you need them to be given as the /pages of the (future) .djvu file, or can you just check by upload order? If I don’t have to relabel the pages, it will make the upload much easier. TE(æ)A,ea. (talk) 11:47, 27 June 2020 (UTC).
    @TE(æ)A,ea.: Upload order might be a challenge (I'd need to check; it's possible I could get it to work), but so long as an alphabetical sort of the file names give them in the correct order it doesn't matter. You could name them as "AAAAAAA.jpg", "AAAAAAB.jpg", "AAAAAAC.jpg"; or "AAAAAAA.jpg", "BBBBBBB.jpg", "CCCCCCC.jpg"; or "xxx.jpg", "yyy.jpg", "zzz.jpg"; or "0000001.jpg", "0000002.jpg", "0000003.jpg"; or any other scheme that will sort correctly. Or put another way, if you look at the files in the file list on your computer when sorted by file name and they are in the intended order, then I can produce the DjVu in the correct order. My tools only look at the order after an alphabetical sort of the file names; it does not care what naming scheme is used to get them to sort correctly.
    Incidentally, if you would prefer to upload in a ZIP file to Dropbox or something like that, that would actually be more convenient for me (and presumably for you) as each file will then not need to be first uploaded, then downloaded, then deleted here. For a smaller number of pages it doesn't matter, but when we get into the hundreds of pages the manual overhead for each file does start to add up. But we'll make it work whichever way is convenient for you. --Xover (talk) 13:20, 27 June 2020 (UTC)
    That’s what I was worried about—the files are currently listed based on page number, and so some of the pages won’t sort correctly. Alas, for The North Star, the pages will still be in .pdf format, but for the next batch, (coming in three days, hopefully,) I should be able to scan them into an image format. Is there any specific extension which you would like them in, (if I even have that option?) I’ll try to upload the files as a .zip file—if it works, I’ll add a hyper-link. TE(æ)A,ea. (talk) 15:09, 27 June 2020 (UTC).
    This hyper-link should work. TE(æ)A,ea. (talk) 15:24, 27 June 2020 (UTC).
    @TE(æ)A,ea.: The website says it can only be downloaded from the same device where "the conversion" was done? I haven't tried it myself, but a quick Google search suggests file.io might work for this kind of thing. BTW, the archive doesn't need to be ZIP. I can extract from tar, 7z, etc. archives too. ZIP is just usually the easiest for most people to use and is available for almost all common computers (Windows 8.1 and 10 has it built in, as does macOS, and several Linux distros; on other versions there's usually a free utility that can be installed). Some image hosting services also let you batch upload individual images, but download an automatically generated ZIP archive (Dropbox does this, I think, but I don't use Dropbox myself so I'm not sure).
    As for file formats, the two most important factors are resolution and compression. OCR needs every pixel of resolution it can get, so the higher the better. And lossy compression reduces quality, especially when an image is recompressed multiple times (as it will be when converting to DjVu), so it is important to pick a format with either lossless compression or as little compression as possible. JPEG is convenient to work with but usually introduces too much lossy compression. TIFF and PNG are also common formats, but they are very general and let you tweak a lot of settings for compression, so it's important they be configured properly. TIFF is probably your best bet, and with whatever format make sure any sliders are set to "High quality" rather than "Low file size". If there's a choice between "lossless" and "lossy", always pick the "lossless".
    Regarding file names: if the issue is just that page 10 sorts before page 2 (i.e. 1, 10, 11, … 18, 19, 2, 20, 21, …) then that's not a problem. And in a pinch, if you give me a list of filenames that are in the correct order (regardless of how they would sort), I can generate new names with a bit of scripting so that it ends up right. It's only if there's no rhyme or reason to it and I'll have to manually check each image to see where it belongs in the order that it becomes a tedious job. --Xover (talk) 16:43, 27 June 2020 (UTC)
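    (And the renaming itself really is trivial; a sketch, with a placeholder list supplied in the intended order:)
    import os

    # zero-pad new names so a plain alphabetical sort reproduces the given order
    ordered = ["cover.jpg", "title.jpg", "page-2.jpg", "page-10.jpg"]
    for i, name in enumerate(ordered, start=1):
        os.rename(name, "%07d.jpg" % i)  # 0000001.jpg, 0000002.jpg, ...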
  • Four new works—all as .tif files—here. TE(æ)A,ea. (talk) 21:07, 30 June 2020 (UTC).
    @TE(æ)A,ea.: Facts About the Civil War has no printed date. Do you have any information about publication date? Or alternately, information about its copyright status? --Xover (talk) 11:17, 4 July 2020 (UTC)
    @TE(æ)A,ea.: And the same goes for Gothic Gourmet. It appears to have been published at some point after 1963, and whether it qualifies as no-notice depends on that point being before 1977. --Xover (talk) 11:35, 4 July 2020 (UTC)
    • Gothic Gourmet was printed before end-of-year 1966; whether that was the school year, the calendar year, or some unspecified fiscal year, I am unaware. I am checking Facts… now, but it should be from c. 1955, although I may not have any information to verify that claim. TE(æ)A,ea. (talk) 12:33, 4 July 2020 (UTC).
  • I just realised—I have given you the incorrect order of pages in The North Star. The following is the correct order: /1, …, /12, /13, /16, /17, /14, /15, /18, &c. I could create the index by moving these pages around, but it would be better if the index could be corrected. I apologise for this. TE(æ)A,ea. (talk) 12:50, 4 July 2020 (UTC).
    @TE(æ)A,ea.: So current physical page 16–17 should be swapped so they appear before the current page 14–15 in the DjVu? --Xover (talk) 13:21, 4 July 2020 (UTC)
    Yes, that’s correct. TE(æ)A,ea. (talk) 13:38, 4 July 2020 (UTC).
    I have just created the indexes. /180 of Gothic Gourmet needs to be deleted (it’s not a page), and /392 and /393 of The North Star need to be combined, as they are the same page. Thank you for creating the files. TE(æ)A,ea. (talk) 15:38, 4 July 2020 (UTC).
    @TE(æ)A,ea.: Both should be fixed now. --Xover (talk) 17:11, 4 July 2020 (UTC)
    They are; thank you. TE(æ)A,ea. (talk) 18:02, 4 July 2020 (UTC).
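    (For reference, a page move like the one above comes down to three DjVuLibre commands per page, so it scripts easily. A rough sketch, assuming djvused and djvm are installed; the filename is illustrative:)

      import subprocess

      def move_page(djvu, src, dst):
          # Save physical page `src` out, remove it, and reinsert it at `dst`.
          subprocess.run(["djvused", djvu, "-e",
                          f"select {src}; save-page /tmp/page.djvu"], check=True)
          subprocess.run(["djvm", "-d", djvu, str(src)], check=True)
          subprocess.run(["djvm", "-i", djvu, "/tmp/page.djvu", str(dst)], check=True)

      # ... 13, 16, 17, 14, 15, 18 ... per the corrected order above
      move_page("The North Star.djvu", 16, 14)
      move_page("The North Star.djvu", 17, 15)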

@TE(æ)A,ea.: Right-o. By my count these should be all of them (in no particular order):

Please let me know if I missed any, or if any of them need fixing/modification. The working directories for these will sit around on my computer for a couple of weeks, in which period it will be easy for me to make little adjustments as needed (after I delete the working directories it depends on what kind of change is needed). And, of course, please feel free to ask if you have any more scans you need processed.

PS. I'm going to go ahead and delete the scan images that were uploaded here on enWS now. Let me know if you need them for some reason and I'll undelete them. --Xover (talk) 19:46, 4 July 2020 (UTC)

    • The individual scans hosted on Wikisource can be deleted, there’s no problem there. The left-hand images in Primary Christmas Songs have been clipped, as have the pages in two more scans which I had intended to send to you previously. I will re-scan those works, and send them to you with the next set of scan images. In addition, I am thinking of scanning in the absent two-page spread from here—would you be able to repair that file? Also, thank you for doing all this work. It is quite helpful. TE(æ)A,ea. (talk) 21:07, 4 July 2020 (UTC).
    • Actually, could you remove /20 from Facts, please? TE(æ)A,ea. (talk) 21:10, 4 July 2020 (UTC).
      • @TE(æ)A,ea.: Facts p. 20 has been removed. On Things Mother Used to Make, I can certainly patch the DjVu, but I have no tools to mass move Page: pages, so that will either have to be done manually or through a bot request. And no worries on the work; I'm happy to help. --Xover (talk) 07:29, 5 July 2020 (UTC)
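    (For what it's worth, such a mass move is not much code as a bot job. A sketch using Pywikibot, in which the title, insertion point, and last page number are all made-up placeholders:)

      import pywikibot

      # Inserting a two-page spread at position N shifts every later Page: up
      # by 2; move from the back so we never land on a title that still exists.
      site = pywikibot.Site("en", "wikisource")
      BASE = "Page:Things Mother Used to Make.djvu/"  # illustrative title
      N, LAST = 100, 250                              # assumed values

      for i in range(LAST, N - 1, -1):
          page = pywikibot.Page(site, f"{BASE}{i}")
          if page.exists():
              page.move(f"{BASE}{i + 2}",
                        reason="Shift pages for inserted spread", noredirect=True)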
  • Another set—four this time, as well. There is a fifth work, only partially scanned; I was wondering if you would be able to create a .djvu file from it. The pages are sideways, and, generally, alternate orientation; a white slip marks the top of blank pages. TE(æ)A,ea. (talk) 18:35, 7 July 2020 (UTC).
    @TE(æ)A,ea.: file.io is giving me a "file not found" error for https://file.io/lIcHSaO0. Could you have typoed the link, or perhaps it timed out?
    I can certainly rotate pages, and if there's consistency in the variations I can do it in bulk (script it). --Xover (talk) 13:32, 9 July 2020 (UTC)
    I presume that the page timed out—here is a new hyper-link. TE(æ)A,ea. (talk) 15:13, 9 July 2020 (UTC).
    @TE(æ)A,ea.: Still getting "not found". Did you try downloading it by any chance? file.io deletes the file once it has been downloaded, in addition to a time based expiry. --Xover (talk) 16:49, 9 July 2020 (UTC)
    The Web-site appears to no longer work with the type of file I upload—this hyper-link may work better. TE(æ)A,ea. (talk) 18:36, 9 July 2020 (UTC).
    @TE(æ)A,ea.: That worked. I'll try to get started on them tomorrow, time permitting. --Xover (talk) 18:55, 9 July 2020 (UTC)
    @TE(æ)A,ea.: Not yet, sorry. I've finished File:Modern Manners.djvu and uploaded it locally so you can fill in the information template before exporting to Commons. But I ran into trouble on Poems of Patriotism since the page positions are irregular (i.e. when I extract a certain rectangle from the images, one page is fine but in another the text is cut off), and have been trying to find some good way to compensate (I may end up having to do that part manually). If you plan to work on these any time soon I can put Poems of Patriotism aside and have a stab at the rest? The previous scans have been ok in this regard, so I'm guessing it's the binding on Poems … that's the cause of the irregularity. --Xover (talk) 06:42, 14 July 2020 (UTC)
    • Yes, it was really difficult to avoid having guttered text when scanning Poems; the binding is somewhat unusual. As I have several other works scanned in, and waiting for upload, there is no rush for the Poems, as long as you haven’t forgotten about it. I was going to scan in more works later to-day, and, (hopefully,) complete the single-page scan work, although that work may take too long, I fear. TE(æ)A,ea. (talk) 12:19, 14 July 2020 (UTC).
  • The new scans are great; however, File:The American Army in France, 1917-1919 (1920).djvu may be deleted, as the current scan is incomplete. I still need to scan the remaining pages from that work; I will send them to you once I have finished. TE(æ)A,ea. (talk) 12:47, 16 July 2020 (UTC).
    @TE(æ)A,ea.: Ok, that should be all of them:
    Based on your message I'll delete the last one soon (or we could just overwrite it with the updated version when it's ready). I ended up doing many of the steps on this one manually, but I believe it should be reasonably possible to automate, provided the files are as regular as these were. I'll need to write a little program to keep track of which pages should be rotated clockwise and which counterclockwise, but so long as it is consistently every other file and the size of the crop area is the same, this isn't all that hard. Not having to split a double-page scan makes it much simpler, so if that's a convenient way for you to scan works it might be preferable. Judging by this scan, doing it this way made both the horizontal and vertical axes much more regular and thus easier to crop correctly with automated tools.
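    The little program would look roughly like this (Python with Pillow; the crop box and the even/odd orientation rule are per-work guesses that would need dialling in):

      from pathlib import Path

      from PIL import Image

      CROP = (250, 180, 2450, 3400)  # left, upper, right, lower; tune per work
      Path("pages").mkdir(exist_ok=True)

      for i, src in enumerate(sorted(Path("scans").glob("*.tif"))):
          img = Image.open(src)
          # Assume even-numbered scans need a clockwise turn and odd ones a
          # counterclockwise turn; flip the test if a work runs the other way.
          img = img.rotate(-90 if i % 2 == 0 else 90, expand=True)
          img.crop(CROP).save(Path("pages") / src.name)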
    Case in point: I had to do Poems of Patriotism entirely by hand. There was just not sufficient common ground between the images to be able to extract the pages programmatically (primarily in the horizontal axis, but the vertical was borderline too).
    In any case, I've uploaded them locally for convenience. Please add {{book}} and the relevant license templates to them, and let me know if they should be renamed in some way, and then I'll transfer them to Commons.
    PS. Sorry it took so long to get this batch done. --Xover (talk) 15:38, 16 July 2020 (UTC)
    • I do realise that it would be easier to create a file if the pages are scanned individually; however, doing so greatly reduces the speed at which I can scan the images. For Modern Manners, I can scan about seven pages a minute; for The American Army in France, one page a minute. I am planning on scanning some other large works, (of the size of The American Army,) but with fewer pages. For all of these works, I have scanned the images separately, (at a higher quality,) so as to extract the images from those scan images. Don’t worry too much about the pace of uploads—I still need to proofread all of the works that have been uploaded. The names are all fine, by the way. TE(æ)A,ea. (talk) 15:52, 16 July 2020 (UTC).
    • For File:The American Army in France, 1917-1919 (1920).djvu, keep the file locally until I can scan in the remaining pages. TE(æ)A,ea. (talk) 15:58, 16 July 2020 (UTC).

Styles targeting #divMediaWiki-Proofreadpage_header_template

Quick question: with reference to MediaWiki:Gadget-Site.css, what, if any, elements have the ID #divMediaWiki-Proofreadpage_header_template? I find no reference to it anywhere, on enWS or in the ProofreadPage code, but I'm not 100% sure. Use of header=1 in the pages tag doesn't seem to introduce such an ID wrapping the header template, at least. Inductiveloadtalk/contribs 13:22, 15 July 2020 (UTC)

@Inductiveload: The short version is that it's unused and can be safely removed. Or, hey, give it background: red and browse around the site for a bit. :)
The long version is that the id was never generated automatically, despite the name looking like that's the case. Back on 27 September 2009 it was added manually by Jack Merridew while cleaning up inline styles in MediaWiki:Proofreadpage header template (which was then a brand new feature) and moving them to MediaWiki:Gadget-Site.css. Three months later it was removed and replaced with some inline style by ThomasV for unknown reasons.
The moral of the story is to be religious about documenting stuff, both inline in the code and in "design" type documentation on a talk or project page somewhere where it'll remain findable for poor future gnomes trying to figure out what the heck is going on. --Xover (talk) 14:10, 15 July 2020 (UTC)
Aha! Thanks, I missed it while trawling the history of that page. At least TemplateStyles CSS should be a bit less opaque since you can find what uses it, and it's closer to the relevant templates. Inductiveloadtalk/contribs 14:29, 15 July 2020 (UTC)

The Republic

I had started proofreading The Republic using this scan; however, the original work was moved to The Republic (Gutenberg edition) and my work to The Republic of Plato. The Gutenberg edition is the same as the scan, although with different formatting; as such, I was overwriting the pages of that edition with my transcription of the scan. However, I was stopped from completing this by the separation of editions, and had been meaning to ask you to resolve them; I seem to have forgotten to do so for some time, as I last edited the scan in April. TE(æ)A,ea. (talk) 16:27, 25 July 2020 (UTC).

@TE(æ)A,ea.: Just keep proofreading and transcluding at The Republic of Plato. Once you're done the other text will be speedyable as an unsourced text redundant with a scan-backed one. --Xover (talk) 17:31, 25 July 2020 (UTC)

Table-Talk

I’m sorry, I got quite carried away there. I didn’t mean to cause so many problems. I made the following changes: the main page was divided into two volumes, so that the front matter could be transcluded; I moved all pages of form Table-Talk/Essay X to Table-Talk/Volume 1/Essay X, as the second volume starts numbering of essays from 1; and I converted all pages of form Table-Talk/Title of Essay into redirection pages to Table-Talk/Volume 1/Essay N, as you had done for Table-Talk/Essay 1. I apologise for the trouble I have caused. TE(æ)A,ea. (talk) 21:39, 27 July 2020 (UTC).

@TE(æ)A,ea.: No worries; I know you were just trying to help. It's just that there's some truth to the adage "Too many cooks spoil the broth", and in this particular case I had deliberately left it the way it was until the proofreading was finished. No biggie. --Xover (talk) 10:37, 28 July 2020 (UTC)

Index:Narratives of the mission of George Bogle to Tibet.djvu

Dear Xover, thank you for taking care to identify and upload a consistent edition of Queen Victoria's Letters. It is a pleasure to work on it now.

In 2015 I had a very similar problem (missing pages) with the edition in the subject. I had uploaded 3 versions of this book to Wikidata before finding out they all had different missing pages.

I discussed that issue with someone on WS back then, and was told that on Google Books there were files to be found, that did not have those problems. Unfortunately, I am currently unable to find this discussion.

Could you please help find the non-faulty version of this book or fix the current scan? --Tar-ba-gan (talk) 23:08, 3 August 2020 (UTC)

@Tar-ba-gan: I've done what I could at Index:Narratives of the Mission of George Bogle to Tibet (1879).djvu, based on Internet Archive identifier: dli.csl.5002, which was the best scan I could find. I wasn't able to find a decent scan of the map here. If you track down one it is OK to just use that in the transcription even though this scan only has a partial map: it'll be a judgement call on what best serves our readers. Let me know if you need help moving anything over from Index:Narratives of the mission of George Bogle to Tibet.djvu. Keep in mind you can just move each page that you want to preserve over to its new position, using the "Move" command in the "More" menu on the page you want to move. If you let me know when you're done with the old Index:/Pages: I can delete them.
PS. Apologies that this took so long. It was a somewhat complicated case, and no decent scans to be found. --Xover (talk) 15:30, 7 August 2020 (UTC)
No wonder it took time to identify the more complete scan! I had tried and failed miserably, and this kind of text (Explorers/Himalayas) is "systematically" important for me (unlike Queen Victoria's letters) so I was quite frustrated about that for years. Thanks for solving this! --Tar-ba-gan (talk) 08:04, 8 August 2020 (UTC)
Dear Xover, after a bit of work I find that the situation is as peculiar as this: I think the old faulty scan with a couple pages missing cannot be removed until the new transcription project is finished. The thing is, the OCR (and occasionally the page preservation) of the most recent scan is quite bad, so the best I can do is to open new pages simultaneously in both scans and copy-paste the text from the older scan to the most recent one. --Tar-ba-gan (talk) 23:06, 10 August 2020 (UTC)
@Tar-ba-gan: Ouch! I'm sorry I couldn't get you a better starting point, but I don't think there is a lot I can do about the OCR quality. This scan just doesn't give the OCR engine a lot to work with (it's a combination of several factors, chiefly the lack of contrast between the text and the background, and the texture in whatever paper they printed this on; not to mention that the stamps in the header are confusing the OCR engine terribly). My only suggestion is to enable the Google OCR gadget in your preferences and try it on the bad pages. In pathological cases like this it can sometimes give much better results. Other than that we'd need a better scan to get better results, and I was unable to find one with all pages present etc.
I can generate a DjVu from any collection of images, so if you want to try to cobble a complete copy together with images from multiple scans that would be a possibility. It'd be rather a lot of fiddly manual work, so whether it's worth the effort depends on just how bad the current OCR quality is.
I'm sorry I couldn't be of more help here.
PS. Oh, and don't worry about the other scan. In the Index: and Page: namespaces there is no particular hurry. We just don't want duplicates and faulty scans sitting around indefinitely so that users waste time proofreading them. --Xover (talk) 07:41, 11 August 2020 (UTC)

Files for speedy deletion

Could you move the list to a sub-page? It takes up a lot of space on the main deletions page, and the deletions aren’t controversial. As for the listings, I am working on this month’s WS:PotM work right now, and won’t be able to get back to going through the files for a week or so. TE(æ)A,ea. (talk) 12:17, 12 August 2020 (UTC).

@TE(æ)A,ea.: The existing stuff on PD can just get closed and archived off on the usual timer (I'll do it on my next spin through processing that). But the rest of these you can just dump here on my talk, since it looks like I'm the only one processing these anyway. Transwikied files is the speedy criterion with the least potential for controversy, and if anybody had objections on principle they've had the chance to raise them on PD for a while now. --Xover (talk) 14:14, 12 August 2020 (UTC)

Additional deletions

Per this discussion, the following pages should be deleted:

By the way, I will be able to deal with more loose files soon, so you have that work to look forward to. TE(æ)A,ea. (talk) 21:14, 26 August 2020 (UTC).

@TE(æ)A,ea.: Done. And thanks! --Xover (talk) 06:03, 27 August 2020 (UTC)

occupational categories rejig

I have set up proof of concept conversions for some of the occupation categories

and the requisite Template:Category disambiguation and configured HotCat to not allow the category's addition, and instead to show the sub-cats. Hoping that you are a HotCat user and willing to test and confirm that this will work. — billinghurst sDrewth 05:12, 28 August 2020 (UTC)

Though maybe the template should be renamed to align with c:Template:MetaCatbillinghurst sDrewth 05:21, 28 August 2020 (UTC)
@Billinghurst: Limited testing, but it seems to work very well so far! The template is, I think, what enWP calls a "fully diffused" category, and they take their categories seriously over there, so it might be worthwhile to see if we could crib something from there for the template. --Xover (talk) 18:28, 28 August 2020 (UTC)
Grr! Sometimes I think MediaWiki is obtuse on purpose. this edit showed a new timestamp (18:29) when I previewed the change, and the edit history confirms it was saved at :29, but somehow the saved timestamp shows :28 and thus the ping didn't work. Infuriating!
In any case, Billinghurst, see above for the testing. I also had a quick look at the templates enWP uses for diffusing categories (vs. category disambiguation), which is listed in the navigation template at w:Template:Other category-header templates ("Maintenance" section). Most obviously relevant here would be w:Template:Container category and w:Template:Diffusing subcategory, both of which look reasonable, code-wise. But whether we want to treat this as disambiguation or diffusion I've no particular opinion on. --Xover (talk) 07:50, 29 August 2020 (UTC)

What is Property?

I have made the change, with the exception of chapter 4 and 5 of the “First Memoir,” as these were originally divided, but are not now. TE(æ)A,ea. (talk) 15:32, 7 September 2020 (UTC).

@TE(æ)A,ea.: Done. Please check that I didn't mess anything up. --Xover (talk) 16:28, 7 September 2020 (UTC)

Charter

Hi Xover, could you restore the Charter of Fundamental Rights of the European Union page please? It's very important. I can easily amend the annotations that you object to: this will be much easier than restarting the page. The relevant link is here: http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:12012P/TXT

It's linked to on the Wikipedia page, so we need something back up. Wikidea (talk) 15:18, 9 September 2020 (UTC)

@Wikidea: I have restored the text of the deleted page in User:Wikidea/Charter of Fundamental Rights of the European Union. Please do not move it to mainspace before it is in compliance with policy, particularly the annotations policy.
Wikisource primarily hosts previously published works as published, and allows editions annotated by contributors only as an adjunct to properly proofread previously published editions. For this particular case, you should find and upload the PDF of each of the 2000, 2004, and 2007 (iirc) editions of this act; set up an Index for each; and proofread each of them page by page. Once we have all the relevant originals transcluded (at, say, Charter of … Union (2000), Charter of … (2004), Charter of … (2007)) it would be acceptable to set up a comparison of them at Charter of Fundamental Rights of the European Union (Annotation).
Please also keep in mind that we do not use plain wikimarkup for things like headings, nor use automatically generated tables of contents. For a typical heading we would use some combination of formatting templates such as {{center}}, {{x-larger}}, etc.; and for the table of contents typically actual table markup mimicking the original, if there is one present, or {{AuxTOC}} otherwise.
Please feel free to ping me if you need assistance, or you can ask at the help section of the Scriptorium. --Xover (talk) 04:09, 10 September 2020 (UTC)
Thanks, much appreciated. Wikidea (talk) 15:16, 10 September 2020 (UTC)

"On Discoveries and Inventions" completed

I have finished supplying all the passages that I had left out in my original translation of Prus' 1873 lecture "On Discoveries and Inventions".
Due to differences of syntax between Polish and English, and to the way sentences are sometimes completed on a succeeding page, in places there will be mismatches, between the two language versions, at the end of a page.
I am struck by how germane the author's observations, set down 147 years ago, are to our day. This is reflected in several of my notes. These are clearly marked as the translator's, do not disturb Prus' text, and conceivably might interest some readers. However, I leave their retention to your judgment.
I hope that when the text is restored to Wikisource, a link can be provided to the scan of the original text that you were able to locate.
Kindly let me know, should there be questions about any part of my translation.
Would you recommend that I add "CC-BY-SA-3.0" to my other Wikisource translations that currently carry only a "GFDL" license?
Thanks.
Nihil novi (talk) 12:37, 21 June 2020 (UTC)
@Nihil novi: Excellent news, and great work!
Page mismatches due to language differences are to be expected, so that's nothing to worry about. The work now lives at Translation:On Discoveries and Inventions, and the old title (On Discoveries and Inventions) will remain valid as a redirect until such time as a different work of the same title is added here (at which time we'll presumably find some suitable way to disambiguate and link the new location). Please have a look through Translation:On Discoveries and Inventions to check that everything still looks good after transclusion. One thing in particular to look out for is two paragraphs running together. This happens when the end of a paragraph coincides with the end of a page. In these cases we have to manually tell the software to preserve the paragraph break by placing the template {{nop}} at the end of the first of the two pages.
The footnotes, and even the Wikipedia links, are problematic though: these both count as annotations and are not permitted in normal works. Annotated versions are supposed to go into a separate copy of the work that is clearly labelled as an annotated edition. In other words, we will have to do something to address that. However, for various technical reasons, I am not sure how we could sensibly do that just now; so for now I propose we just leave it as is and I'll try to come up with something there. If anyone should object in the mean time the links and footnotes can be easily removed (and since all old revisions are kept in the page history, can also be easily restored if needed).
I would also encourage you to go through each page and update the page status to "Proofread" for the pages you consider complete and finished (which should be all of them as I understood you). We can then try to find another Polish speaker to go through them and "Validate" them. This two-step transcription process is standard for English language works (two people independently verify that the transcription is correct), but we might as well employ it for translations too even if the situation is slightly different.
And, finally, I've updated the wikipage at Translation:On Discoveries and Inventions to use the requisite header template ({{translation header}}), which always sets the translator to "Wikisource". This is because such translations are considered to be collaborative and ongoing efforts, somewhat akin to Wikipedia articles. In reality this is unlikely to be a significant factor for this specific work (the idea was more aimed at something like a collaborative new translation of Tolstoï or Aristotle), but… In any case, I noticed you had set the translator's name to what I presume is your real name. Would you like us to credit you (and if so, by that name or just by your username here?) on the work's talk page? It will be a lot less visible, I'm sorry to say, but that is the standard way we have of doing it (using the {{textinfo}} template). Most contributions here are simply "credited" through the username appearing in each page's revision history, but translations are a little different so adding a separate note about it feels appropriate.
Regarding your other translations: yes, do please replace the {{GFDL}} tag with {{CC-BY-SA-3.0}} to avoid any confusion. Technically, every original contribution you make here is dual-licensed under both those licenses for historical reasons (there's some fine print about it just above the "Publish changes" button in the editing form when you edit a page), but when an explicit license tag that contradicts it is added to a page it causes confusion and may end up with the work being deleted. --Xover (talk) 14:52, 21 June 2020 (UTC)
Thank you.
I am reviewing "Translation:On Discoveries and Inventions", checking on paragraph divisions. So far, I find two paragraphs run together: "Until 1846,..." should mark the start of a new paragraph; but when I switch to editing mode, the text will not advance beyond the title page. I tried entering "nop" into my Wikisource translation text, but that does not split the two run-together paragraphs on the "Translation..." text. How can I accomplish this correction?
Nihil novi (talk) 22:56, 21 June 2020 (UTC)
Problem apparently solved: I see the correction now made in "Translation...".
Thanks.
Nihil novi (talk) 23:11, 21 June 2020 (UTC)
@Nihil novi: Good to hear! Let me also take the opportunity to thank you for your contributions, and for putting up with our at times arcane tools, practices, and policies. We're aware this all could be a lot more user friendly, but let's just say that that makes us appreciate anyone willing to stick it out despite the challenges even more! :) In any case, thanks for contributing, and do please feel free to ask me if you need help with anything else. You can also always ask at Wikisource:Scriptorium/Help, where the whole community will see it, in case I am not available (it's an entirely volunteer driven project, so individual people here tend to have unpredictable availability). --Xover (talk) 04:50, 22 June 2020 (UTC)
Thank you for patiently shepherding me through this process.
I am also indebted to you for prompting me to translate the missing passages.
At one time, I seriously considered emulating the monks who worked anonymously in their scriptoria.
Having, however, published papers and books under my name as their translator and sometimes their editor, for bibliographic reasons I would appreciate being credited with this translation, as previously, by my civilian name.
And could this translation also be listed on the "Author:Bolesław Prus" page and the "Author:Christopher Kasparek" page?
If an annotated edition of this piece is feasible, I think it could help connect the author's mind and times with the present-day reader's.
From what you write, no one should object to my substituting the "GFDL" license with "CC-BY-SA-3.0" at my other translations, and I will try to do so. Their original Polish texts are available for comparison on Wikisource.
I hope I may indeed again impose on you for advice.
Thank you.
Nihil novi (talk) 07:18, 22 June 2020 (UTC)
User:Piotrus has generously completed his review and validation of the English translation of On Discoveries and Inventions [2] by Bolesław Prus (Aleksander Głowacki).
I gather that the translation has now attained full rights of residence on Wikisource.
I would like to again thank you for encouraging me to complete the partial English translation, of some years back; for tutoring me on Wikisource procedures and techniques; and for offering your own helpful comments on the translation.
I wonder whether I could further impose on you: to credit the translator (Christopher Kasparek), as we discussed above? I fear that, were I to attempt doing this myself, someone would have to correct my errors made in the process.
I trust you are successfully maintaining social distance during this Covid–19 pandemic!
Many thanks,
Nihil novi (talk) 19:15, 14 September 2020 (UTC)
@Nihil novi: Great work; and kudos for the fortitude of sticking with it through the really rather less than user friendly tools and process! An eminently interesting work, and its implementation will stand as an example we can point future contributors to!
I have added a note to the translation's talk page—Translation talk:On Discoveries and Inventions—crediting you with contributing the translation. As a translation that has not been previously published (by a proper publishing house) our policy is to treat it like a collaborative work (i.e. so that Piotrus's contributions are acknowledged) by crediting it in the work's main header as a "Wikisource translation". In this particular case that's a little awkward, I feel, since you were clearly the main translator and it would be most natural to simply credit you as such; but our policy doesn't really allow for that, and it would have negative effects in other cases. But on the work's talk page we are free to explain the situation more specifically. It is also now featured in the "New texts" section of our Main Page.
In any case, if you want to translate more of Prus' works and need assistance, please do not hesitate to ask. --Xover (talk) 08:25, 17 September 2020 (UTC)
Thank you.
It is good to see Prus's prescient 1873 lecture now available on Wikisource in English, 147 years after he delivered it in Warsaw in Polish.
Do you happen to know how Wikisource came by the scan of the lecture's printed version?
Is there a straightforward way for Wikisource to obtain scans of other Polish public-domain works, perhaps from the Polish National Library?
Nihil novi (talk) 09:20, 17 September 2020 (UTC)

Technical noodling on Annotations

@Inductiveload: I'm a little short on spare cycles just now, so perhaps I could prevail on you to help me think through this a bit?

This work is now a scan-backed Wikisource translation and an annotated work. Annotations need to be in a separate and clearly labelled page. Since it is actually scan-backed (translated page by page in Page:-space) we can't (well, "shouldn't") just cut & paste, but must instead use multiple transclusion. Which means we need some technical facility to handle the difference dynamically.

This work uses two kinds of annotations: translator's notes (footnotes), and wikipedia links. My original thought (which I've not had the cycles to flesh out yet) was something like {{annotation note}} and {{annotation link}} that will output nothing/unlinked text in the Translation: but spit out <ref>…</ref> and wikilinked text (respectively) in the annotated version.

In addition to not thinking through the template details (there may be better ways, or the approach might be infeasible), I have no good idea how to distinguish between unannotated and annotated versions of a work. Some previous approaches have relied on the annotated version being on a /Annotated subpage, or on having …(Annotated) in the page name. None of those approaches have been good (but possibly for other reasons than the trigger). But I have no clear idea of alternative approaches.

Thoughts? Ideas? --Xover (talk) 09:53, 30 June 2020 (UTC)

@Xover: Hmm, it's a tricky one. Any template that needs selective output will have to be sensitive to the environment at render time. AIUI, one can only really key off the namespace and the page name. We can't really control the namespace (both will be Translation, I assume). Forcing a subpage like "/Annotated" is going to end in tears, because the two top-level works ("Work" and "Work/Annotated") will become interleaved under "Work" (regardless of whether you do "Work/Annotated/Sub/pages" or "Work/Sub/pages/Annotated"). You could pick out a title suffix like "(annotated)" with parser functions or modules.
Either way you'll bake the string "/Annotated" or "(annotated)" into the templates. At least you'd want to leave headroom for different annotations, so maybe a pattern like "(annotated( - XXX)?)".
As for the templates, care needs to be taken not to end up in the situation that {{modern}} did. Better hygiene of the template formatting might help here, but for any substantial level of annotation, the Wikicode is going to be a mess.
An alternate solution is to have only one version and provide selective visibility through JavaScript. Switching the CSS visibility of a class or two should work. This is how {{ls}} worked long, long ago: you could choose how it was displayed. The intention was to allow various options such as old orthography (long s, etc.), wikilinks, etc. to be under separate control. But it never gained traction, and it was then broken and not missed enough to be repaired. I have no idea what this would do to exports. This avoids needing two transcluded copies (and therefore you can't link to it separately), but doesn't really change the Wikicode. Inductiveloadtalk/contribs 10:22, 30 June 2020 (UTC)
@Inductiveload: Thinking out loud (and definitely not deeply): New namespace for annotations, Annotation:, with policy to say only two kinds are permitted, inter-project links (i.e. wikipedia) and footnotes. Annotation: can contain both annotated normal works and annotated Wikisource translations, and distinguishing is done with {{header}} vs. {{translation header}}. All works in the namespace must be scan-backed, already transcluded to either mainspace or translation, and must use one of the approved annotation templates (starting set: the two I sketched above; additional ones as we come up with them). No grandfather clause: existing works migrated there must also be migrated to be compliant with that policy. The namespace is the trigger for the annotation templates. Thoughts?
It also occurs to me that Spangineer has made an initiative to move WS:ANN to actual policy, which might "synergize" well with trying to introduce this kind of scheme. --Xover (talk) 18:32, 30 June 2020 (UTC)
That would probably be a decent solution. However, transcluding the same text twice, once to mainspace and once to annotation-space, does kind of pre-suppose that we only have one annotated version. You could imagine that there might be two annotations of the same work. Though I think this is unlikely to actually happen.
Also, proofreading the annotations would be annoying: the wikicode would be cluttered and you'd need a way to force annotations on and off in page space. Though I suppose that could have a simple gadget with a side-bar control.
Other policy-side thoughts all sound fairly sensible. I'm not sure if the "allowed annotation" list is a bit restrictive, but then again, I don't actually expect any completed "intense" annotations to exist any time soon. I could imagine line-by-line analysis of, e.g. Bhagavad-Gita (Besant 4th)/Discourse 1 or Translation:The Story of the Stone/Chapter 1 or something. But, TBH, that's beginning to stray off the Wikisource reservation and perhaps slightly into Wikibooks/versity territory. Inductiveloadtalk/contribs 09:53, 10 July 2020 (UTC)
@Inductiveload: To my mind, limiting annotations to one per (edition of a) work seems like a reasonable first approximation. At least absent counter-examples my thinking is that we do not want competing annotated versions: we want collaboration to make one single even better one. But this, and the limited kinds of permitted annotations I envision, is coloured by my desire to have a clean and fresh start here. Not "anything goes, and we'll dial back anything later deemed problematic", but "these specific deviations from the normal proofreading are permissible, and we'll consider any additional variants if a good use case comes up". With only wikilinks (and with guidance designed to avoid "sea of blue" problems, ala w:WP:OVERLINK) and footnotes—both eminently containable—it should be within reason in terms of editability. Or so is my hope at any rate. I imagine such annotations will either be added by the one first proofreading a work, and while doing the proofreading, in which case the annotation artefacts are what they want and not in the way; or they are added after the fact to a work that has already been proofread, in which case they will (obviously) not get in the way of the proofreading.
You're right that this model will not be a good fit for a line-by-line analysis or similar (possible, but not a good fit). I'm not sure what the solution for such annotations are. Some would certainly be Wikibooks/Wikiversity material, but I can imagine there being a significant grey area. I am comfortable kicking that can down the road though. I'm thinking a strictly limited starting point that is intended to be expanded—slowly and carefully—over time in order to maintain some semblance of control over scope and quality; not that it should never be expanded.
There's also a little voice nagging at me that either ProofreadPage or whatever they're using over at Wikibooks, if relevant, might conceivably be expanded in some way to allow for multiple "branches" off the same file/index. With Multi-content revisions and some of the related tech, you might have in-software support for creating both translations and annotations off the same proofread set of pages. Maybe, after proofreading, you could hit a "sync to sandbox"-type link to populate the "Annotation" slot of the Page: pages with a copy of the wikitext from the main proofread slot. Turn on and off annotations dynamically in mainspace, maybe? That's probably overkill right now, but as a sort of long term pie-in-the-sky type thing… --Xover (talk) 13:22, 10 July 2020 (UTC)

Your message

Creating OCR text does not in any way make it more difficult for others to proofread text. It significantly accelerates the proof reading process by creating division of labour and by attracting search engine traffic (which increases the number of proof readers). I have every intention of proof reading the pages in question. That said, if you are going to subject me to this kind of harassment, I will leave Egyptian Literature alone. James500 (talk) 12:00, 13 September 2020 (UTC)

I also have to question the attempt to assert ownership of a scan you have not edited for four months (as far as I can see). I do not appreciate being accused of having no intention of proof reading by someone who has not done any proof reading himself for that length of time. James500 (talk) 12:41, 13 September 2020 (UTC)
@James500: You are entitled to your opinion regarding the utility of creating such pages, but I have now given you my opinion on the matter and politely asked you to refrain from doing so. That you choose to cast such a request as "harassment" or an attempt to assert "ownership"—or, indeed, appear to treat anyone disagreeing with you on any subject or matter in a similar fashion—suggests to me that you may find contributing to a collaborative and consensus-based project challenging. In light of that you may wish to consider whether your approach here is really the one best suited to achieve a constructive result. There are many projects and services on the web where one may act unilaterally, but on a fundamentally collaborative one it is necessary to adapt to the needs of other contributors, even at the expense of one's own preferred approach.
PS. The common practice when someone leaves a message on your talk page is to respond there, adding a {{ping}} to let the one who left the message know that you have responded, and not to remove the message and then leave a disconnected reply on the other person's talk page. The reason is that this approach keeps replies in one place and in the context of the original message. Even if your personal preference is to do otherwise, I would encourage you to follow the common practice in order to facilitate better collaboration with other contributors.
PPS. One of the reasons I have not proofread any pages of that work in the last few months is that I have been busy with other tasks, primarily helping community members such as yourself with the tasks they need assistance with. Your feedback on that prioritization has been noted. But if you wish to actually proofread that work, in a collaborative fashion and in line with the style established for it, then I absolutely encourage you to do so. Collaboration, not unilateral and personally preferred action, is the core of this project, as I have previously emphasised. --Xover (talk) 13:09, 13 September 2020 (UTC)
Falsely accusing me of having no intention of proof reading pages is a personal attack and an assumption of bad faith. I have no problem characterising that kind of comment as harassment. Further, it was not clear whether your message was actually a request or an indication that you might use admin tools. If you are not proposing to use admin tools, you really should make that clear. James500 (talk) 13:45, 13 September 2020 (UTC)
@James500: Do you have some particular reason to expect it would be necessary to employ admin tools in connection with a polite request placed on your talk page? That seems like a rather odd assumption to make.
Contrariwise, based on a random sampling of the several hundred such pages you have created this month alone, none of which have actually been Proofread, it seemed a reasonable assumption that the same would be the case for the pages in question as well. If my assumption was in error, I invite you to demonstrate the mistake by Proofreading at least as many pages as you have created as Not Proofread. In fact, that would, in my opinion, be the very most desirable outcome.
In the mean time, I encourage you to strike your accusations of "harassment", "false accusations", "personal attack", "assumption of bad faith", and reiterated accusation of "harassment". That you choose a confrontational mode of interaction is one thing, but casually throwing around such accusations is not acceptable behaviour. --Xover (talk) 14:28, 13 September 2020 (UTC)
It is not appropriate for you to make comments about what you imagine is going on in my head. That includes assertions that I do or do not intend to do something. Nor is it appropriate for you to seek to put me to proof that I do or do not intend to do something. Especially after I have already started to correct the pages in question. James500 (talk) 14:46, 13 September 2020 (UTC)
I have struck out all of my comments because I do not wish to have any further interaction with you. Nor do I consent to such interaction. James500 (talk) 15:48, 13 September 2020 (UTC)
Thank you for striking out those accusations. That is appreciated. Regarding your wish to avoid further interaction, I shall certainly try to accommodate that. But I must stress that on a collaborative project it is generally not possible to reserve oneself from interacting with other contributors. --Xover (talk) 17:13, 13 September 2020 (UTC)

Quietly Night is Falling... (By I.S. Nikitin)

You deleted the poem "Quietly Night is Falling..." from the page https://en.wikisource.org/wiki/Author:Ivan_Savvich_Nikitin. I'm Anton Demin. It was my translation. Can you restore it? unsigned comment by Ad2271 (talk) 23:03, 16 September 2020‎ (UTC).

@Ad2271: Hi Anton. Thanks for getting in touch regarding this.
The issue here was that for all content that comes from elsewhere we need documentation, of some reasonable form, that it has either entered the public domain (its copyright protection has expired) or has been actively licensed under a compatible license (typically one of the Creative Commons licenses). For content created directly here by one of our contributors (such as this comment or your preceding one) the site's Terms of Service takes care of the licensing part: whenever you save an edit, licensing terms are listed under which you irrevocably agree to release your contribution under the CC BY-SA 3.0 License and the GFDL.
For translations there are additional complexities because the translation gets its own copyright that is independent of the original. And that was the matter at issue here: the original by Nikitin was in the public domain (copyright had expired), but the translation was credited to an "A.Demin" for whom we had no information to determine copyright status and no evidence of licensing under a compatible license. And under those circumstances our copyright policy requires us to delete the work in order to avoid violating copyright.
However, we do accept user translations, known as "Wikisource translations", by contributors to the site; and as direct contributions they are covered by the licensing terms imposed by the Terms of Service. These translations get a special naming prefix (Translation:) in the page name, a specially formatted header ({{translation header}} instead of {{header}}), and are credited as "translated by Wikisource" instead of a named physical person. Attribution to individual users of the site is done through the revision history of the page (what you get at the "View history" tab at the top of the page), and the idea is that almost all pages on the site will be the result of a collaboration, so crediting any individual user would generally be misleading or impractical.
In any case… In this specific case, and based on your message here, I think we can probably undelete the translation, move it to the Translation: namespace, and switch out the header; and then just link to the documentation of its authorship here. --Xover (talk) 06:29, 17 September 2020 (UTC)
@Ad2271: Ok, I've undeleted it and updated as described above. It is now available at Translation:Quietly Night is falling. --Xover (talk) 07:44, 17 September 2020 (UTC)
@Xover: Thanks a lot for the quick response and page recovery.--Ad2271 (talk) 12:08, 17 September 2020 (UTC)

We sent you an e-mail

Hello Xover/Archives/2020,

Really sorry for the inconvenience. This is a gentle note to request that you check your email. We sent you a message titled "The Community Insights survey is coming!". If you have questions, email surveys@wikimedia.org.

You can see my explanation here.

MediaWiki message delivery (talk) 18:48, 25 September 2020 (UTC)

Paragraphs in text-layers; or a hack

I was going to ask you if you thought we could get the paragraphs in DjVu files into the text layer as blank lines, then I found that you've not only requested it, but even have a patch pending for it at phab:T230415!

In the meantime, while the wheels grind slowly, perhaps you'd be interested in a filthy JS hack?

      // Insert para breaks when a short line looks like a paragraph end.
      let short_line_thresh = 45; // set per-work if 45 isn't right
      let lines = editor.get().split(/\r?\n/);

      for (let i = 0; i < lines.length - 1; i++) {
        // Short line followed by punctuation and a fresh sentence on the next line
        if ((lines[i].length < short_line_thresh) &&
            lines[i].match(/[.!?'"”’—]\s*$/) &&
            lines[i+1].match(/^\s*['"“‘A-Z0-9]/)) {
          lines[i] += "\n";
        }
      }

      editor.set(lines.join("\n"));

It obviously won't catch all paragraphs, but a majority have "short" last lines. Inductiveloadtalk/contribs 13:43, 5 October 2020 (UTC)

@Inductiveload: Ah, indeed, that's a useful little heuristic. Thanks! --Xover (talk) 14:10, 5 October 2020 (UTC)
It also occurs that when that change finally goes through, we have ~900k "red" OCR-dumped pages (#1, yay!) which will all be missing their paragraph breaks - is there an API call to fetch the text layer so we can have a "reload layer" gadget? Sometimes the existing OCR is better than what the OCR gadget gives, if the IA did a particularly good job. Inductiveloadtalk/contribs 14:35, 5 October 2020 (UTC)
@Inductiveload: I presume there is, but I've not looked closely for it. Phe's OCR tries to get the text layer already in the DjVu first, and only runs Tesseract (3.x, with custom language files) if that fails. If MW didn't provide some way to get at it, Phe's OCR gadget would have to download the entire (possibly 1GB+) DjVu file to get at a given page's text layer.
In addition, I know MW extracts the text layer and stores it in the database in one of the image metadata fields. This is probably/possibly what causes some DjVu files to fail to extract the text layer: it's overrunning a metadata-sized field with a huge blob of text with XML markup.
Oh, BTW, I suspect some users may have preferred Phe's OCR simply because it did better at extracting the existing text layer (cf. the Phab above), and not because its OCR was actually any better. As best I can tell, for 99% of cases it would have been giving the text layer from the DjVu and not actually new OCR. The remaining 1% of cases are probably related to multi-language support and fractur text. You may find poking around https://github.com/phil-el/phetools informative.
Oh, and… Note the difference between the number in "Not proofread" and the number in "Not scan-backed". We have a massively disproportionate number of Page: pages that are "Not proofread". I'll leave open the question of from whence these come. --Xover (talk) 14:55, 5 October 2020 (UTC)
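    (Untested thought: if that metadata field does hold the text layer, a "reload layer" gadget could presumably fetch it through the standard imageinfo API rather than touching the file itself. A sketch, with the title as a placeholder:)

      import requests

      # The DjVu handler stores the extracted text in the file's metadata, so
      # imageinfo should return it without downloading the (huge) DjVu itself.
      r = requests.get("https://commons.wikimedia.org/w/api.php", params={
          "action": "query",
          "titles": "File:Lord of the World - Benson - 1908.djvu",  # placeholder
          "prop": "imageinfo",
          "iiprop": "metadata",
          "format": "json",
      })
      page = next(iter(r.json()["query"]["pages"].values()))
      metadata = page["imageinfo"][0]["metadata"]  # list of {name, value} pairs
      print([field["name"] for field in metadata])  # find the text-layer field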
Hmm, interesting. I assumed some kind of OCR was happening, because the OCR from the (black and white) OCR button doesn't match the DjVu text layer, though it also seems suspiciously fast. E.g. Page:Lord of the World - Benson - 1908.djvu/42 has straight quotes in the text layer and curlies in the OCR-button'd text, and other differing scannos like "141/2" vs "1414" for 14½.
Re the very high "red" page ratio, I have noticed it before. Even more interesting to me perhaps is how frWS and deWS have such incredibly low problematic rates (I have made zero effort to actually find out what the policies are there; perhaps they just class pages needing images or foreign scripts as something other than "problematic", or do they have crack teams of scan-fixers and image-extractors?). Inductiveloadtalk/contribs 16:48, 5 October 2020 (UTC)

Uncovered files

Whilst going through the files I have recommended for deletion, I have come across the following other files, which may need your attention.

I have finished all of the files beginning with “A,” and listed them on WS:PD, as you have seen. TE(æ)A,ea. (talk) 18:34, 17 July 2020 (UTC).

Files for B

I also noticed File:British Indian Ocean Territory Constitution Order 10.06.2004.pdf, although I am not sure if this is acceptable for Wikimedia Commons. TE(æ)A,ea. (talk) 12:41, 18 July 2020 (UTC).

@TE(æ)A,ea.: Thanks. I think I've got all of them, except Bat Wing and the British Indian Ocean Territory Constitution Order, for which I am still trying to figure out the copyright situation. Archaeologia Britannica is currently at WS:PD so I left it to be handled there. Arizona Proposition 302 is of unclear copyright status, and may need a trip to WS:CV. --Xover (talk) 18:59, 20 July 2020 (UTC)

Files for C

As for the “Order” mentioned above, it is some form of U. K. government document, but I do not know if it falls under one of the Crown Copyright exemptions. When I finish listing all of the files for speedy deletion, I plan on going over all of the other files held locally—the ones properly unfit for movement to Wikimedia Commons—and marking them more accurately with expiry dates. TE(æ)A,ea. (talk) 21:52, 20 July 2020 (UTC).

Files for D

Files for E

The files for speedy deletion are listed here, so as not to waste space on your talk page. The files for review are the following.

I have been putting off the work for these for some time, but I’ve finally started the work again, so I will leave comments on this page every so often. In addition, the page I listed above will also be updated with new listings from time to time; remove the old listings as you see fit. TE(æ)A,ea. (talk) 19:58, 25 September 2020 (UTC).

Files for F

Files for G

In addition, the files associated with The Great American Fraud should be moved to Wikimedia Commons, as that work itself is acceptable there; however, it is probably preferable to clean up that page before moving the images. TE(æ)A,ea. (talk) 23:09, 16 October 2020 (UTC).

Files for H

@TE(æ)A,ea.: I think I'm all caught up as of H now. Thanks for all the effort you're putting into this! But a quick note:
Just because the scan of an edition we have here was published in the US doesn't mean the work as such was first published in the US. Case in point: File:Howards End.djvu. Howards End was first published in 1910 in the UK, and is subject to a pma. 70 term that lasts to 2040 there. Both the 1910 and 1921 editions are PD in the US because it is more than 95 years since they were published; which makes them OK at enWS (which only considers US status) but not at Commons (which also considers status in the country of first publication). There were at least a couple of cases of this up above. --Xover (talk) 19:01, 18 October 2020 (UTC)

DjVu vs PDF

Greetings! I'm fiddling with my scripts for uploading and whatnot, and I'm wondering what to do about the DjVu vs PDF question.

Motivation: I am hoping to upload the 50 volumes of The Works of the British Poets, since that should be a pretty decent baseline for scan-backing any orphan poems over time, but this question applies more generally to other multi-volume scan uploads, especially periodicals.

Since the IA has stopped making DjVus (sad face), most of these volumes are PDFs from the IA, and some (15/50) would be home-brewed DjVus from Hathi. The real question is: should I bother expending effort making sure volumes are all in one format or the other?

DjVus seem to thumbnail faster, but since the format is so unloved by everyone but us, is there any point regenerating the 35 IA PDFs into DjVus? The effort required to do this is not enormous (I just need to massage the IA OCR XML into my DjVu-making script instead of using Tesseract), but it's certainly not zero.

Alternatively, go the other way, and just upload the Hathi images raw to the IA and let them (slowly) generate a PDF and use that (benefits: adds the work to the IA too).

Alternatively again, ignore the 35 IA volumes and use the Hathi images and re-OCR them all (and/or gain access to the HT API for OCR, which I don't think I have, as non-institutional members only get web client access via Friend of UofM accounts).

The only "real" pain point in having a mix of PDF and DjVu, other than the slow PDF thumbnails, is probably that {{Works of the British Poets volumes}} would have to be manually finagled to use the right extensions on a per-volume basis. Inductiveloadtalk/contribs 15:14, 17 October 2020 (UTC)

@Inductiveload: Immediate thoughts…
  • Consistency has a quality of its own, or something like that. Having everything be the same format makes lots of little things simpler through simple removal of impedance and cognitive load.
  • DjVu gives us far better control and options for manipulation, including any doctoring. Any part of a DjVu can be extracted losslessly (but manipulating the page image does require a reencoding roundtrip).
  • MediaWiki does a far better job extracting a text layer from DjVu files than from PDF (I've tested with literally the same text layer: it's ridiculous). I suspect this is only partly due to shoddy coding in MW: it looks like PDF text layers have features less suitable for our use case (more page layout and formatting than logical separation and structure).
  • The IA XML is… fragile (cf. IA-upload's troubles). The ABBYY OCR has a slight edge over Tesseract in most cases, but I always prefer regenerating it from scratch: that gives me full control over the resulting text layer and avoids running into the related bugs in MW. I've very rarely run into a truly pathological case for Tesseract that ABBYY handles well: most of the time the difference borders on academic.
  • IA DjVus get some of their compression by background-separation, but they also scale down and use aggressive compression settings. It gives pathological results on certain (poor quality) inputs, so for anything I care about I regenerate it from the scans instead.
Bottom line, if I were to go at this, I would grab all the scan images, set my script to work on them, and then come back to check on it after a day or two. I wouldn't even use any of the existing DjVus: there's not much difference between generating 15, 35, and 50 DjVus; the computer is just replacing my space-heater for a little bit longer. --Xover (talk) 15:45, 17 October 2020 (UTC)
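For the curious, the assembly step of such a script boils down to DjVuLibre's page encoder plus one bundling call. A rough sketch, assuming JPEG page images and no MRC-style background separation:

      import subprocess
      from pathlib import Path

      # Encode each page with c44 (IW44 wavelets, no background separation),
      # then bundle the single-page DjVus into one volume.
      pages = []
      for jpg in sorted(Path("pages").glob("*.jpg")):
          out = jpg.with_suffix(".djvu")
          subprocess.run(["c44", str(jpg), str(out)], check=True)
          pages.append(str(out))

      subprocess.run(["djvm", "-c", "book.djvu", *pages], check=True)

(The OCR text layer would then be added per page afterwards, e.g. via djvused's set-txt command.)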
That makes sense; the heating in the office is electric anyway :-D. It's probably worth me iterating on my DjVu creation process a bit more. I might try to figure out from the IA peeps how they do the background separation - running it with a slightly less aggressive compression coefficient might allow better results without an excessive file-size tradeoff.
Something else that might give an edge would be figuring out how to train Tesseract for our works. We certainly have various "classes" of files we commonly see:
I wonder if the last two could benefit from dedicated Tesseract trainings which could even be integrated into the OCR button. Inductiveloadtalk/contribs 16:55, 17 October 2020 (UTC)
@Inductiveload: I'm no LSTM expert by any measure, but by my extremely limited understanding, it should be possible to train it for these classes (possibly excepting the flyspeck print: there's too little blood in that stone to begin with). But how much effort it'd take and how much improvement it'd yield is a different matter. I'm also not sure whether it'd be feasible to deal with long s and the ligatures through training alone: I suspect at least some such things will need explicit support in the engine (like italics etc.). What's more likely to work is training for the styles of fonts used in 1700s (+/-) printing, and the generally inconsistent typography and page layout. I also believe Tesseract does some word matching, so when orthography is different it will have extra trouble that custom training may eliminate.
Regarding file size, I am entirely ignoring that issue for now. It would be nice to get smaller files, but our DjVus are a drop in the bucket at Commons; and the file size has essentially zero effect on the load time etc. of thumbnails here (they're all run through Ghostscript with pathological runtimes into the tens of seconds completely unrelated to the efficiency of our compression). So while obese files offend my techie preference for efficiency, they have very little practical effect.
I also worry that background-separation will have marginal effect in our use case: it will require generating a mask for the actual text, which is going to be hard to automate at scale for the same reasons OCR has trouble with our works. But it would be extremely interesting to see how IA did that and what compression settings they used. Surely we could get much better compression without going as far as IA and running into the attendant problems. For my typical personal use there's also the possibility of applying custom settings per work, which wouldn't have been practical for IA; but that means I could try the aggressive settings first, and then just back off if the results are poor. --Xover (talk) 07:39, 18 October 2020 (UTC)
I asked at the IA forum and got a great response from one of their people. Tl;dr, the background segmentation is called "mixed raster content" (MRC) and was applied by a proprietary DjVu tool, and isn't provided in DjVuLibre (i.e. c44 and friends).
So unless we want to invent an MRC encoder for DjVu (see link in the reply above), we're stuck with what we've got in terms of encoding. I get that the filesize isn't really a major issue and the thumbnailer doesn't really care (and at least DjVu is faster than PDF), but it pains me to be uploading ~100MB files that could more reasonably be <10MB. And even if novels are OK, some works that are ~1000 pages of dense text can easily get a bit out of hand.
I'll investigate training Tesseract over time; I feel that with a bit of preprocessing (OpenCV??) and training we can likely improve it quite a bit. Even if it isn't full-auto, having "profiles" to regenerate OCR would be quite handy for classes of texts we usually struggle with. Inductiveloadtalk/contribs 11:30, 18 October 2020 (UTC)

Tesseract hint

FYI, I found out the hard way that if you are multithreading something that shells out to call Tesseract, the performance is pitiful (like, 20 times slower or worse) if you don't set the environment variable OMP_THREAD_LIMIT=1 to restrict each thread's Tesseract instance to a single thread. Inductiveloadtalk/contribs 21:33, 17 October 2020 (UTC)
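A minimal sketch of the fix when shelling out from Node (the tesseract arguments are illustrative):

// Keep each Tesseract instance single-threaded so the outer worker pool
// does the parallelising, instead of OpenMP thrashing the CPU.
const { execFile } = require( 'child_process' );
execFile(
    'tesseract', [ 'page_0001.png', 'page_0001' ], // writes page_0001.txt
    { env: Object.assign( {}, process.env, { OMP_THREAD_LIMIT: '1' } ) },
    function ( err ) { if ( err ) { throw err; } }
);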

@Inductiveload: Thanks. I'd actually stumbled across that issue while looking for a different bug in Tesseract, but I thought they'd changed the default to 1 for that very reason? I've not run into it in the wild, mainly because I rarely run multiple Tesseracts in parallel, but when I do I'm not sitting around waiting for it to finish (fire and forget). It would probably have bitten me hard on the WMCS backend for my OCR script though, as soon as it was used for more than my very occasional testing. Hmm. Which actually reminds me of something… --Xover (talk) 07:38, 18 October 2020 (UTC)

Tesseract sensitivity to spaces

Do you have any idea how to "de-sensitise" Tesseract from finding spaces around punctuation? It seems very keen to produce text like

“ Good men,” say I, “take of my wordes kepe :

I'm playing with Tesseract "-c" options, but nothing seems to do anything of interest. Inductiveloadtalk/contribs 09:36, 19 October 2020 (UTC)

@Inductiveload: Nope, sorry. I think it is inevitable since typographically there is a space there. You'd need to code specifically for this case to be able to correct for it. I fix this in JS in my ocr fixup scripts, and it's one of the fixups I'd considered adding to my OCR gadget. --Xover (talk) 11:29, 19 October 2020 (UTC)
I was thinking more about the space after the open double quote (image here), but even the spaces before the semi/colons seem like thinner spaces (a w:Thin space?) than the inter-word spacing. Oh well, I guess since so many OCRs have this, it's better to fix it up in JS, where the fix is more generally useful. Inductiveloadtalk/contribs 11:50, 19 October 2020 (UTC)
@Inductiveload: Hmm. It's possible this is not quite what it seems. Tesseract treats punctuation lexically as words, so in any context where words are being joined with spaces they will suddenly amass a space that isn't actually there in the input. I do that in my hOCR parser, and it's likely Tesseract does the same in its plain text output. It is an obvious way to implement it so likely to appear in many implementations. But in any case, the distinction gives scope to be smart about this. --Xover (talk) 18:31, 24 October 2020 (UTC)
Well, as long as Tesseract is producing “ and ” correctly, it's easy to post-process, because “[space] and [space]” are wrong. But when it's a straight quote, it's harder to work out what's right and wrong. Inductiveloadtalk/contribs 19:53, 24 October 2020 (UTC)
@Inductiveload: Yes, indeed. But for the curlies: importScript('User:Xover/ocrtoy.js'); and then hit the   button in the editor toolbar on a page that otherwise exhibits this problem. It just looks behind to see if the previous "word" was /^[“‘«]$/, and whether the current "word" is /^[”’»;]$/, and in either case it forcibly concatenates the current "word" onto the end of the previous word (instead of adding it to the array that will later be joined by spaces). --Xover (talk) 20:00, 24 October 2020 (UTC)
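For anyone reading along, the core of that concatenation step might look something like this (a sketch, not the actual gadget code):

// Glue punctuation "words" onto their neighbours so the final join
// doesn't insert a space that isn't in the scan.
function joinWords( words ) {
    var out = [];
    words.forEach( function ( word ) {
        var prev = out[ out.length - 1 ];
        if ( prev !== undefined && ( /^[“‘«]$/.test( prev ) || /^[”’»;]$/.test( word ) ) ) {
            out[ out.length - 1 ] = prev + word;
        } else {
            out.push( word );
        }
    } );
    return out.join( ' ' );
}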
In terms of general OCR post-processing, I've been collecting some useful heuristics into a JS script (with the help of a big wordlist and grep). There are some scannos that are blatantly bad orthography, like tlie, and sometimes you can work out where false spaces are. For example, diffi is more likely to be the prefix of the next word than either a stand-alone word or a suffix to another word (nothing in English ends in "diffi").
The JS isn't really ready for prime-time, but perhaps there are useful things in there you can use? It certainly seems to tidy things up quite a bit.
To be honest, I'm starting to wonder if, despite being painfully hip, some kind of machine-learning thing might actually be the way forward. Feed it piles and piles of OCR and piles of the same thing but corrected, and see if it can work out what "feels" right. Inductiveloadtalk/contribs 20:09, 24 October 2020 (UTC)
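As an illustration of the sort of heuristics involved (the patterns here are examples, not the actual script):

// Ordered (pattern, replacement) fixups applied in turn.
var FIXUPS = [
    [ /\btlie\b/g, 'the' ],            // blatantly bad orthography
    [ /\bdiffi (?=[a-z])/g, 'diffi' ]  // nothing ends in "diffi": rejoin the false space
];
function applyFixups( text ) {
    return FIXUPS.reduce( function ( t, f ) {
        return t.replace( f[ 0 ], f[ 1 ] );
    }, text );
}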

Hi..

Can you make this "make sense", thanks? I've tried to pagelist this three times, and it didn't make sense.

Rebuilding the file "page by page" if needed is strongly suggested. ShakespeareFan00 (talk) 18:49, 22 October 2020 (UTC)

@Xover: I can deal with this if you don't have bandwidth :-) Inductiveloadtalk/contribs 15:35, 23 October 2020 (UTC)
@Inductiveload: -- I am handling this. Hrishikes (talk) 15:43, 23 October 2020 (UTC)
Good luck with the De-Googling. May your numbers be consecutive and your scan complete! Inductiveloadtalk/contribs 15:47, 23 October 2020 (UTC)
Thank you both! --Xover (talk) 15:58, 23 October 2020 (UTC)

Phab ticket you might be interested in: phab:T267617

Hi! Quick heads up for a phab ticket that might interest a gadgety person: phab:T267617 Index page's page links should have the page index-in-file in them (e.g. as attribute). Inductiveloadtalk/contribs 10:30, 10 November 2020 (UTC)

@Inductiveload: Thanks. Incidentally, you can watch components and projects in Phabricator and be notified whenever a new task is registered for it. So for e.g. Wikisource and ProofreadPage, click the tag in an existing task, then navigate to its overview page in the left nav menu (by default you get its work board), and then use the watch button in the top right. --Xover (talk) 14:38, 10 November 2020 (UTC)

New OCR tool

Hello Xover,

I have tried your new gadget at the following pages with very good results:

  • Page:The Queens Court Manuscript with Other Ancient Bohemian Poems, 1852, Cambridge edition.djvu/110: Very good text recognition, only it inserts empty lines between most of the lines of the poem (not all). The same problem also appeared in 117, but these are only exceptions; other pages of the book were OK, and the OCR was sometimes even better than the original OCR layer. I do like the curly quotes and apostrophes, although other people may not be so happy about them (I guess it would be too difficult to let the user choose in some preferences).
  • Page:The Story of Prague (1920).djvu/206 Very good OCR competing with the original OCR layer. I like the empty lines between paragraphs which the original OCR layer did not have. Both of them have problems with acutes above some Czech vowels and they both transcribe "mediæval" as "medieval".
  • Page:The Bohemian Review, vol2, 1918.djvu/217 Your gadget has no problem reading the text in columns and beats the original OCR in line recognition again. The only problem is the header of the newspaper, whose text is quite well recognized in the original OCR but gives your gadget trouble.
  • Page:The Bohemian Review, vol2, 1918.djvu/237 This page is an extremely hard test for any OCR, as the two upper columns belong to one article and the two lower columns to another. The original OCR layer failed to recognize this and so did yours, but in fact I did not expect any success here and I would be really astonished if the result were different.

To sum it up, I do like your gadget, as it proved its usefulness in my tests. Although there is some space for improvements, imo it can replace the previous Phe's gadget, and I do thank you for its creation. It would be great if the gadget were not only an external tool, difficult for people other than you to repair in case of some problems in future, but if it could be open to the wider community, and ideally, if it could be a part of Mediawiki, so that it would not be so easy to ignore its potential failure in future, as happened with Phe's tool. --Jan Kameníček (talk) 20:32, 11 November 2020 (UTC)

@Jan.Kamenicek: Thank you: that was exceedingly thorough!
First, I need to clarify that what we're here talking about is all Phe's code. The new script I asked you to test is a copy of MediaWiki:Gadget-ocr.js, which adds the "OCR" button to the toolbar, sends the request to the https://phetools.toolforge.org/ backend service, and then adds the result to the text box. Much of the discussion in the Phabricator task was regarding various fixes to that backend service. You can see the sum total of the changes I made to the script here (all of it is tweaking how the script deals with the whitespace in the OCR output from the backend service). So all credit here goes to Phe; I've just been doing minor tweaks to try to get it working again.
In addition to this I've been working on my own, completely independent, OCR gadget; which I have mentioned in passing but not really shown to anybody yet (it's too primitive and buggy). That was motivated primarily by making something to tide the community over until WMF Community Tech comes up with a new and (hopefully) better supported OCR tool. Now that Phe's OCR is (hopefully) fixed the need for that is probably not as great, but I may still keep working on it in order to experiment with giving the user some more options. For example, whether to output curly or straight quotes, whether to unwrap lines within paragraphs, and possibly other such transformations. I am also looking at letting the user specify a primary and one or more secondary languages for a given page. Right now Phe's OCR assumes all text requested from enWS is in English, and so it will mostly not recognise any runs of text in other languages, except insofar as they are written in characters in common with English. For Chinese, Cyrillic, etc., or languages with extensive use of accents and ligatures (e.g. Polish), this is almost guaranteed to give poor results. By specifying that "This page is mostly in English, but it also contains some words in Polish" it is possible that we can get better OCR results for these pages.
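(For reference, Tesseract itself supports exactly this kind of language mixing on the command line, so the plumbing is mostly about exposing it. A sketch, assuming the eng and pol traineddata are installed, shelling out from Node:)

// "Mostly English, with some Polish" in Tesseract terms.
const { execFile } = require( 'child_process' );
execFile( 'tesseract', [ 'page_0001.png', 'page_0001', '-l', 'eng+pol' ],
    function ( err ) { if ( err ) { throw err; } } );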
In any case… Based on your testing and feedback above it sounds like the fixes I made to Phe's OCR have been about as successful as we can hope for, and we're at the point where we can update the main Gadget and announce that Phe's OCR is back up. --Xover (talk) 13:19, 12 November 2020 (UTC)
@what we're here talking about is all Phe's code: Ah, I see :-) Nevertheless, it does not make your credit any smaller! Thanks a lot for getting the tool to work, hardly anybody hoped it could still happen :-) --Jan Kameníček (talk) 22:07, 13 November 2020 (UTC)
@Jan.Kamenicek: Yeah, I had mostly given up hope of a fix, so when an opportunity presented itself I jumped at the chance. Hopefully this will tide the community over until Community Tech can build a new tool that is at least less dependent on a single contributor, even if there are limits to how many resources they can give it once it's built. --Xover (talk) 09:30, 14 November 2020 (UTC)

The story of Prague.djvu

Hello Xover. I have uploaded File:The story of Prague.djvu which I converted from File:The story of Prague.pdf, but the quality of the djvu file is very bad. May I ask you if you could convert it so that the original quality of the scan stayed? Thanks very much. --Jan Kameníček (talk) 15:58, 26 September 2020 (UTC)

Hello again. Meanwhile I found out that although visually the scans in .djvu are very poor, IndicOCR works very well, so it is not that urgent. It may still be good to convert it in a better way for visual purposes, but if you have better work to do, just forget it, it is really not necessary. --Jan Kameníček (talk) 09:35, 27 September 2020 (UTC)
Oh, now I see that I was too slow to write you; I should have made up my mind to write you earlier. Now it looks awesome, thank you very much!!! --Jan Kameníček (talk) 09:38, 27 September 2020 (UTC)
@Jan.Kamenicek: Done. The Internet Archive had the same scan so I used the scan images from there, simply because it is more convenient to download there.
I also found that IA had a scan of the 1920 second reprint of the work (which looks to be entirely identical) but in much higher scan resolution (about 2.7x) and generally good quality. So I uploaded that too at Index:The Story of Prague (1920).djvu. Since this is an image-heavy work I recommend prioritising the higher resolution scan unless there are specific reasons for preferring the first printing (beyond it being the first printing).
Incidentally, I also strongly recommend tacking on the year of publication in parenthesis after the work's title when uploading; even when you don't anticipate there ever being multiple editions transcribed. Multiple editions can come up for any number of unpredictable reasons, and even when they do not the year of publication in the file name helps put the file in context in a number of situations (telling what's what at a quick glance in a file list, for example). --Xover (talk) 09:40, 27 September 2020 (UTC)
Ad images: That is true, I will consider this. Unfortunately, I have already manually processed and uploaded the images acquired from the older edition, which was quite a lot of work :-( Had I noticed the edition and copy you pointed to, I would definitely have used that one for the images.
Ad parentheses: True, I will keep this advice in mind. --Jan Kameníček (talk) 09:56, 27 September 2020 (UTC)
So finally I decided to work on the 1920 edition, whose scans have better resolution and which you uploaded too (thanks very much for that). May I just ask whether you changed the position of the map in the book? The copy at IA has it a couple of pages later. Currently the position of the map causes a small problem: the list of illustrations says that an image of View of Prague in 1606 faces page no. 206, but in the uploaded copy it is the map that faces this page instead. According to the list of illustrations the map should face page 212. It is true that such a position (inside the book’s Index) does not make much sense; maybe it could be moved just in front of the Index (i.e. only behind the other three plates). It would not solve the problem of facing page 212 (which would have to be handled e.g. by SIC template) but it would solve the problem of facing page 206. What do you think? --Jan Kameníček (talk) 17:38, 17 November 2020 (UTC)
@Jan.Kamenicek: The details are hazy, but I seem to recall the map was placed in a way that was problematic for some reason, and I had to make a judgement call on where to put it. As I recall there were several such issues with the scan, but most had an obvious resolution. In any case, I think I still have the files sitting around so I'll take a look and see if I have anything intelligent to contribute. Your judgement will probably be much better than mine on this though, since you're more familiar with the work. --Xover (talk) 17:53, 17 November 2020 (UTC)
@Jan.Kamenicek: Oh, hmm, it comes back to me now… I moved the map where it is mainly based on the page numbering: with the map the four illustrations cover pp. 207–210, with the last page at 206 and the index starting at 211. Without it we're one page short. In view of the list of illustrations pegging the map to be on p. 212 (which I probably didn't notice at the time) I would be inclined to move it back to that position (well, facing 212, not on p. 212, but that's a minor quibble) and inserting a dummy blank page after the three other plates (before p. 211). What do you think? --Xover (talk) 18:07, 17 November 2020 (UTC)
Yes, I agree with this. Unfortunately, I am not able to manage djvus, may I ask you to handle it? It would be really helpful. --Jan Kameníček (talk) 18:57, 17 November 2020 (UTC)
@Jan.Kamenicek: Of course! I’ll try to get it done some time tomorrow. --Xover (talk) 19:00, 17 November 2020 (UTC)

Thou hast sprung my trappe carde

I see you there, fiddling in Marlowe! Do you think this could become a PotM or maybe make Christopher Marlowe a collab when the current one fades? He's kind of a "thing", but all his works here are a hot mess and need scan backing (except Ignoto!), and we could do with some "olde worlde" originals if possible too.

Re. curly quotes: they are straight in the OCR and I usually use straight through sheer laziness (Compose+<+' is fiddly) and inertia. I have no philosophical objections to them, and I do think they look better. Inductiveloadtalk/contribs 18:01, 12 November 2020 (UTC)

@Inductiveload: Yeah, the Early Modern classics are woefully patchy here. Marlowe is probably a good collab since his oeuvre is a reasonable size, unlike, say, Shakespeare or Middleton (thank god for poets who get themselves killed young!). On curlies, I automate it with a script cribbed from Sam, and may eventually get around to adding it as a per-work auto-fixup ala that header thingy. --Xover (talk) 18:08, 12 November 2020 (UTC)
@Inductiveload: Oh, I meant to mention… I found a couple of instances of {{dhr|$1}} in there (the title page I think) that looked like a buggy helper script at work. You may want to go looking for that one.
And, while I'm teaching grandpappy to suck eggs, since {{ts}} took the worst pain out of formatting tables, I've completely stopped using {{TOC begin}} and friends. Plain tables gives better control, less messy markup, and doesn't require recalling template-specific syntax for structure (with a table, the structure is explicit and in your head, and you look up any formatting you need; vs. the opposite for the various TOC templates). It was a bit of a pain to start, but it turns out there aren't that many variations in the tables so it quickly overtook the TOC templates in efficiency. I heartily recommend it! --Xover (talk) 08:07, 13 November 2020 (UTC)
Thanks, I fixed it. I got 99 problems and regexes are 98 of them (forgot the parens, so there was no capture group 1 >_<).
Re TOC, those templates certainly aren't ideal, for various reasons including bad interactions with ProofreadPage and the MW parser (cf. phab:T232477, which you know already). I wonder if TemplateStyles + CSS classes on the TR elements might be worth a try for a slightly more semantic feel? "Direct formatting" with {{ts}} or similar is a fairly blunt weapon IMO, though the blunter weapons can be more reliable, and the TS+class approach might tip towards overwrought? At least it's not quite as fraught as {{TOCstyle}}. Inductiveloadtalk/contribs 20:52, 13 November 2020 (UTC)
@Inductiveload: Hmm. Since what we're doing is essentially direct formatting, reproducing the original work rather than applying our own style to indicate the same semantics, I don't think @class is a good match. By using a template ({{ts}}) we get the same benefits of abstraction as a CSS class, but retain the convenience. I suppose we could create {{trs}} that emits @class, but I think the problem there is more that CSS styling tables is quirky as heck (or, it was when last I tried, but I'm not up to date).
But all this reminds me of an… experiment… I have ongoing: {{sbs}} and {{sss}} (with accompanying {{sbe}} and {{sse}} closing templates). The mnemonic is for "styled block start" and "styled span start", and both of them boil down to spitting out a div or span with the provided arguments as CSS class names, styled by TemplateStyles. They've got a couple of different goals, but the initial impetus was to find a better approach to styling poetry (I hate the poem tag, and detest long rows of br). In addition I grew tired of the scattershot of templates with inconsistent naming conventions and arguments, and spotty documentation, and annoying syntax weirdness when you try to nest templates or put all the text in a template argument or…
So for a typical poem I would do something like: {{sbs|fine centered-block pre-wrap}} … lines of poetry … {{sbe}}. Or for a title page where all the text is centered: {{sbs|centered-text}} … normal formatting, except no need for {{c}} … {{sbe}}. Since they're just div or span they can be arbitrarily and predictably nested, and the block vs. inline semantics are explicit in the template. And since what we're doing with the templates is applying styles, using classes is a pretty natural fit, and lets us reuse general CSS knowledge rather than inventing our own style language again for each template.
There's stuff they can't solve (hanging indent for wrapped poem lines being the standard example), and they're a bad fit for anything needing flexibility (no parameterized TemplateStyles). But a surprising amount of our most used templates mainly just apply a static formatting for which these are straightforward replacements. The lack of knobs and dials may also encourage a healthy shift away from obsessively trying to reproduce details that really aren't important, on which an inordinate amount of volunteer time is wasted and frustration generated (see SF00's periodic bursts of exasperation). I'm envisioning the docs to be a list of the available classes, each documenting standard workarounds for common issues, and with side-links to traditional formatting templates where knobs and dials can be tweaked if needed.
I'm trying these out on works I work on myself to get a feel for how well they work and what the "standard" workarounds for various problems will have to be. So far I'm pretty happy with {{sbs}} but find myself using {{sss}} comparatively little, mainly because the syntax gets more verbose than the old way for inline use (maybe it should be a meta-template and have a suite of wrappers applying each effect?). I'm still not completely convinced it will work to mix and match CSS classes this way without running into the same conflicts inconsistent templates do.
In any case, thoughts and input on these and this approach are very much welcome. If you want to try it out then keep in mind I don't really consider them stable so you'll need to be prepared for breakage. --Xover (talk) 09:22, 15 November 2020 (UTC)
For TOC tables (specifically TOCs) with row-based classing, my thinking is that something like this:
|- class=toc_row_1-1-1
| I
| Chapter 1
| 2
is simpler and the intent more "visible" than:
|-
| {{ts|vtp|ar|wnw}} | I
| {{ts|vtp|pl1|wa}} | Chapter 1
| {{ts|vbm|ar|pl1}} | 2
{{TOC begin}} and {{ts}} are roughly contemporaneous, and the reason for the former is that the style-spam in TOCs gets tiresome, repetitive and tricky to adjust later.
In nearly all other cases, my main concern is that centralising all the CSS into global classes, while clearly better from a DRY perspective, is also somewhat fragile, as the CSS classes will be shotgunned throughout thousands of pages and can break, and break silently due to how TemplateStyles works, if someone makes a well-meaning edit to, say, the fine class. This is why I have generally stayed away from "global" CSS (à la Template:Table class) and leaned more towards work-specific CSS like Template:Os Lusiadas (Burton, 1880)/errata.css.
Re poem, I suspect that a new extension or a new tag in the existing extension (say <ppoem>, where p stands for "proper") that does span-per-line and p/div-per-stanza is better than anything we can hack up on the wikicode side, even with module support. Inductiveloadtalk/contribs 12:02, 15 November 2020 (UTC)
@Inductiveload: Apples and oranges. {{ts}} is a shortcut for adding @style to table cells, and the equivalent would be {{trs}} (or whatever) to add table row styles. Because table rows are, by virtue of their semantics, more general than table cells, the arguments for having {{trs}} emit @class rather than @style are stronger. Personally I am not convinced @class makes sense at any level more granular than the page (and the most natural fit is at the work level), but at the table row level I am at least prepared to entertain an argument.
On CSS I agree on the general point, but I think that's a longer-term issue of better CSS support (PRP support for per-work CSS, maybe something like LESS, and a hierarchy of CSS to cover cases in between MediaWiki:Common.css and inline styles, beyond just TemplateStyles). Definitely agree a MW extension to replace the current poem tag is needed for a real solution, but I don't think that's realistic in any reasonable timeframe so I'm focussing on stuff that can (hopefully) be made to work within the current limitations. The CSS stuff in {{sbs}} being one prong, and a Lua module a possible alternative approach.
Of course, I am not at all sure anything short of an extension will work: the parser and remex insert themselves so aggressively that they tend to sabotage any even moderately complex markup and styling. --Xover (talk) 13:38, 15 November 2020 (UTC)

Gadget in progress

Just a quick note for something to play with if you have some cycles to spare one day (no action required, just for interest).

It's a "re-imagining" of the popups gadget. Using a slightly different plug-in-like architecture, I hope it can be a bit more flexible that the enWP-centric popups gadget. To try it:

mw.loader.load("/w/index.php?title=User:Inductiveload/popups_reloaded.js&action=raw&ctype=text/javascript");
mw.loader.load("/w/index.php?title=User:Inductiveload/popups_reloaded.css&action=raw&ctype=text/css", 'text/css');

Probably will spew a few errors to console on occasion and the UX is a bit jarring sometimes, but it's already better than the old popups for my nefarious purposes IMO. Inductiveloadtalk/contribs 00:23, 21 November 2020 (UTC)

@Inductiveload: Neat! Upgrading or replacing Popups has been on my wishlist for a long time; with the two main issues being improved styling and better support for previewing PRP-backed pages. I probably won't have time to play with it any time soon, but when things improve I'd love to take it for a spin. --Xover (talk) 12:33, 22 November 2020 (UTC)

The History of the Bohemian persecution (1650)

Hello. May I ask you to convert File:The History of the Bohemian persecution (1650).pdf into djvu? There is absolutely no hurry, I have enough work to do :-) --Jan Kameníček (talk) 19:08, 21 November 2020 (UTC)

@Jan.Kamenicek: File:The History of the Bohemian Persecution (1650).djvu. The OCR quality is… not great. You may want to try the Google OCR gadget to see if it does better on the worst pages. But at least the image resolution is ~2x the PDF version. Let me know if there are any out of order pages or other such issues that needs fixing. --Xover (talk) 00:49, 22 November 2020 (UTC)
Thanks very much! I expected the OCR layer would be bad, so it did not surprise me. However, comparing e.g. [3] with [4] I can see that in the PDF version the OCR recognizes long ſ, while in the DJVU version it replaces it with f, which did surprise me. I am mentioning it just as a curiosity; it is not a problem at all, as it needs to be replaced with "s" anyway and maybe the whole OCR needs to be replaced e.g. using the Google gadget (which seems sliiiiggghtly better). Thanks again. --Jan Kameníček (talk) 09:29, 22 November 2020 (UTC)
@Jan.Kamenicek: Tesseract (the OCR engine I use) does not recognise long s, so these will never have that right. It's trained on more modern texts so pre-18th century texts will be pretty hit and miss. Sorry. --Xover (talk) 12:29, 22 November 2020 (UTC)
I see, I did not know that you exchanged the OCR layer. I noticed that it was better than in the PDF (except the long s), but I thought that it was due to better OCR extraction from djvu than from pdf by Mediawiki. So I thank you for this too. --Jan Kameníček (talk) 13:10, 22 November 2020 (UTC)

ES6 in JS that may end up in a gadget

Heads up: if you use ES6 syntax (let, fat arrow, etc. etc.) in scripts, it will choke if you try to make it a gadget and you'll spend ages unpicking your shiny new hotness and replacing it with old and busted. Inductiveloadtalk/contribs 12:12, 1 December 2020 (UTC)

@Inductiveload: Hmm. You sure that's not just the normal scoping issues? What is it that breaks exactly? --Xover (talk) 12:34, 1 December 2020 (UTC)
@Xover: something in the ResourceLoader stack rejects it. You get errors something like
JavaScript parse error (scripts need to be valid ECMAScript 5): Parse error: Missing ; before statement in file 'MediaWiki:Gadget-sandbox.js' on line 4
You can try it out by enabling the "Sandbox" gadget in your user preferences. Line 4 of MediaWiki:Gadget-sandbox.js is the let x = 1; line. Inductiveloadtalk/contribs 12:53, 1 December 2020 (UTC)
@Inductiveload: Argh! Yeah, as usual the MW situation is a mess. phab:T75714 will give you the gist of it, but the issue seems to be the lack of a JS minifier written in PHP that supports ES6, combined with a lack of priority for the task due to IE still providing 3% of global hits on WMF sites. In essence I think that means ES6 is effectively blacklisted until the WMF raise the Grade A browser support criteria to include ES6. --Xover (talk) 13:23, 1 December 2020 (UTC)
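For reference, the kind of mechanical rewrite involved (illustrative):

// The same function in ES6 and in gadget-safe ES5:
let add = ( a, b ) => a + b;                     // parse error as a gadget
var add2 = function ( a, b ) { return a + b; };  // accepted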
@Inductiveload: The "shiny new hotness" is now at mw.loader.load("//en.wikisource.org/w/index.php?title=User:Xover/loupe.js&action=raw&ctype=text/javascript"); if you want to play. No testing to speak of, and written in full "scratching my own itch" mode, so expect breakage. It probably needs a toggle to turn it on and off, and the layering is off at the edges (that's prolly PRP's fault though), and the size is hard-coded, and… But, anyway, feel free to play with it (and to steal any bits you want obviously: I tipped over into actually cobbling this together when I saw your code for grabbing the thumbnail URL from the API, which I'd been procrastinating on figuring out, in the index grid thingy), or laugh and point derisively (it ain't pretty is what I'm saying). --Xover (talk) 15:14, 1 December 2020 (UTC)
Awesome! Very stylish!
I'm still unsure of the One True Way (TM) to configure gadgets (e.g. width) - so far the only response I got at MW is "use the options API", which is a good way in terms of UX and also when the data is available (i.e. right from the start), but perhaps rather limited in terms of being ale to drive the configuration programatically.
Furthermore, after digging about in PageNumbers.js I'm also unsure if mw.cookies or the Options API are a better bet for storing things like current visibility state.
BTW, I've made some notes at User:Inductiveload/Script development about "offline" development, which can be a bit less frustrating than saving every typo into a page's history! Inductiveloadtalk/contribs 17:36, 1 December 2020 (UTC)
@Inductiveload: I think the options API is a good fit for storing user preference, active choices that rarely change, while cookies are a better fit for state and things that can change based on an ad hoc toggle. For PageNumbers.js I'd think of user.options as analogous to {{default layout}}, and mw.cookie as parallel to the toolbar toggle. But don't think too hard about this with PageNumbers.js in mind: that code is a mishmash of different things bolted together and written under the constraints of what MW provided half a decade ago (and affected by cultural factors like fights over the default stylesheet etc.). It needs a thorough rethink before it's an apt use case for anything.
You might also keep in mind mw.storage for other kinds of things you want to stash away. My auto-header script stores the current chapter title there, in a per-work key. It's semi-persistent (webstorage is size-capped) and limited to the current browser, but for a lot of use cases it's plenty good enough.
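Side by side, the three stashes look something like this (key names hypothetical):

// Preference vs. state vs. stash; all three APIs are core MW.
new mw.Api().saveOption( 'userjs-loupe-width', '300' ); // durable server-side preference
mw.cookie.set( 'loupe-visible', '1' );                  // ad hoc, toggleable state
mw.storage.set( 'loupe-chapter', 'Chapter IV' );        // semi-persistent, local-only stash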
All of that is to say, I'm not seeing the use case you have in mind when you say perhaps rather limited in terms of being ale to drive the configuration programatically (too much ale? ;D). I'd love a more full-fledged preferences framework (ala what Apple gives you in Cocoa), but for what it is, the stuff in MW seems… adequate. I feel far more stifled by the design and limitations of OOUI (and MW HTML output when viewed as an API). But it seems likely I'm missing something there?
On offline development, I'm mostly just lazy and find the solutions both awkward and overkill for my needs (I don't write all that much code here). Your notes will be a great help overcoming the "lazy" bit though, if I ever feel I need to take the plunge. They're very helpful and we should stash them somewhere prominent where others in need can find them.
PS. The loupe has been updated, with some general cleanup and fixes after a bit more testing (also de-ES6-ified). It turned out nicely enough that I'm starting to think of Gadget-ifying it. It currently breaks other interactions with the page image so it probably needs some way to toggle it on and off. (definitely needs more testing too, I'm half-arsing this in stolen snatches of time) Thoughts, on this that or the other? --Xover (talk) 18:04, 5 December 2020 (UTC)
Sounds sensible re cookies vs options API. I feel like there's probably a fairly wide grey area where neither is wrong.
Re "perhaps rather limited in terms of being ale to drive the configuration programatically: What I mean is things like being able to do very general setup like (psuedocode)
if (pagename.startsWith("Foo") or namespace === "Page") {
    config.replacements.push([
        /myregex([a-z])/,
        function(x) { return "[[" + x + "]]"; }
    ]);
}
where allowing all possible such things would be awkward to express in a way that can be stored and manipulated only through an HTML form (unless it's eval'd JS or something evil like that). Setting strings or ints is one thing, but the beauty (and curse) of allowing JS gadgets is that you can have some really powerful configuration options.
I'll investigate the loupe more when I have a mo Inductiveloadtalk/contribs 17:16, 6 December 2020 (UTC)
@Inductiveload: Ah, I see. Well, to me, that's really neither settings nor state; we're talking something akin to a plugin or an extension. And, no, MW doesn't really have any good facility for that use case. The best you can probably do there is a JSON content model page in the user's user space with a Special:BlankPage subpage and a JS GUI for managing it. Which is needlessly complicated and labour-intensive, admittedly. Or possibly you could do it using runtime hooks, where userspace scripts can add regexes to the gadget code through the window object (which I don't think they fence, but I could be wrong) or custom events (ditto). But, yeah, no great options.
In fairness, I'm not really sure how you would design a good system for that in the MW architecture. Our needs here are pretty unique, and they are distinctly non-regular (varying from work to work), creating a need for far more flexibility at the end-user level than most other use cases. --Xover (talk) 18:07, 6 December 2020 (UTC)
The thing I've done/seen done before without going through window is a MW hook fire()/hook(), but, like I was fretting about before, this will only work if you can send the hook() before the fire(). Which admittedly is not so bad when the action fires upon a user's action (e.g. a toolbar click, save, etc), but gives me pause when it happens right away (e.g. page load), as the gadget/user JS relative load ordering is AFAIK totally undefined, though if the gadget waits for DOM ready, user common.js "almost" certainly would be complete. But, if the user config happened, say, in a callback from some external AJAX load (e.g. checking if that page had an IA descriptions available via the File info page), it could easily come after the DOM settling. Inductiveloadtalk/contribs 11:13, 7 December 2020 (UTC)
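For reference, the fire()/hook() pattern sketched out (gadget and hook names hypothetical). If I read the docs right, mw.hook memorizes the last fire, so a late add() is still called with the most recent data, which should take some of the sting out of the ordering worry:

// Gadget side: publish the config object to anyone listening.
mw.hook( 'userjs.mygadget.config' ).fire( config );

// User common.js side: mutate the config when the hook arrives.
mw.hook( 'userjs.mygadget.config' ).add( function ( config ) {
    config.replacements.push( [
        /myregex([a-z])/,
        function ( x ) { return '[[' + x + ']]'; }
    ] );
} );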

Started in good faith, but in places the scans aren't readable. Once again, viewing the book directly on IA shows no quality degradation.

PDF compression issues? ShakespeareFan00 (talk) 20:22, 5 December 2020 (UTC)

@ShakespeareFan00: I've regenerated a new DjVu from the source scans (at 3.5x resolution) and migrated the index and pages. I didn't spot the quality issues that triggered your request (didn't look that hard), but if the new version is significantly better you may want to point Fæ at the before and after (specific pages); both because I happened to see they've discussed the relative quality of PDF vs. DjVu, and because it's relevant to the IA upload project. IAs PDFs are not particularly high quality, and even their DjVus are sometimes pathologically bad. For master data you really really want the original .jp2 scan files (in a pinch converted to plain .jpeg, but preferably not); the DjVu (and PDFs) are a derived format to make it convenient for various secondary purposes. (I'd follow up directly but I just don't have the spare cycles these days). --Xover (talk) 14:18, 6 December 2020 (UTC)
Generally, the IA PDFs or DjVus have been a LOT better than ex-Google Books scans. So it's only 3 works from IA I've had to ask for regeneration on.

(The purpose of Fæ's projects on Commons was a "backup" option; individual scans can be regenerated as needed. See also the phabricator ticket I lodged earlier. If it were possible to find an existing IA PDF and do a high-quality re-gen semi-automatically, I'd be happy, but at the moment manual requests are a suitable workaround.) ShakespeareFan00 (talk) 15:25, 6 December 2020 (UTC)

Sandboxes and pagenumbers

Daily WTF updates:

  • I have adjusted the sandbox gadget to pull from "User:" + wgUserName + "/gadget-sandbox.js" so anyone can play (see the sketch after this list).
  • I may (may) have found a way to make it work. It's a hack and I don't know why it works.
  • I found having Firefox dev tools open and the HTTP cache disabled prevents the bug from manifesting, even when it's loading as a gadget. And even then it doesn't happen very often for me.
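The wiring is presumably something along these lines (a sketch, not the actual gadget definition):

// Resolve the per-user page name at runtime and load it raw.
mw.loader.load(
    '/w/index.php?title=' +
    encodeURIComponent( 'User:' + mw.config.get( 'wgUserName' ) + '/gadget-sandbox.js' ) +
    '&action=raw&ctype=text/javascript'
);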

Such fun.

Also, I made a Hathi OCR downloader and changed my hOCR parser to a push parser like the cool kids; then, after getting it all done and finding that all the text layers were coming out misaligned >_<, I discovered Hathi's OCR is great at words and terrible at page segmentation, so it's still better to just throw it at Tesseract. Humph. Inductiveloadtalk/contribs 05:08, 16 December 2020 (UTC)

@Inductiveload: It's definitely timing-triggered: the first time the offsets are calculated the page geometry is not in its stable state, so that final .refresh_offsets is needed to shift them into the correct position. I was able to reliably reproduce the issue yesterday (Safari, macOS, sandbox gadget, testing on B's test case) so I'll look for a less-timing-dependent way to fix it. I have a hunch the basic issue may be that we're modifying the DOM during pagenumbers.init while running inside $(document).ready(), but with no gate to make sure the DOM has (re)coalesced before calculating the offsets.
Loading an arbitrarily named page in wgUserName's userspace as a gadget sounds… iffy. It has security implications, performance implications, and it affects the timing (which is one major reason to load this as a gadget in the first place) by, among other things, waiting on mw.util. It's exceedingly clever, but I think it's also excessively clever: the gadgets have barely been touched in years, and right now you and I are the only ones touching anything here. Sam might if they ever get any spare time again, but other than that the need for this is close to zero. I'm not feeling the cost-benefit on this one, is what I'm saying.
Hathi's OCR quality: that comports with my experience. Both Hathi and IA (and Google) have some aspects that are better on some works, but there is no clear general winner for all works. --Xover (talk) 07:32, 16 December 2020 (UTC)
Re the gadget sandbox: you have to enable the gadget and ignore the "DONT USE THIS GADGET" notice in the process. And only interface admins (i.e. me) can edit other users' JS anyway, so loading your own JS should be safe enough, considering we also load JS from all over the place, including other wikis (like mulWS and ruWP). And if I were to go rogue, I'd hit MW:common.js, not a gadget with probably 0 or 1 users on a good day. The reason I did it was so I could load something as a gadget too without co-opting the sandbox.
Re the timings - a MutationObserver might be the thing to use, but the question is which mutations? Inductiveloadtalk/contribs 07:46, 16 December 2020 (UTC)
@Inductiveload: Sandbox: Yeah, don't read too much into those comments. I'm not sending up a red flag, just saying the cost-benefit doesn't add up for me vs. just having two or three user-specific sandbox gadgets. The calculus would change if there was a bigger need (we don't want 5+ sandbox gadgets). Your observation about cross-loading JS is a much bigger concern regarding security and attack vectors and potential impact.
PageNumbers: yeah, watching mutations is one track I plan to explore, but I'm not sure we have a good place to watch or that these events will necessarily reflect stable page geometry (I'm a bit fuzzy on how closely DOM and rendered geometry track each other in modern browsers). I'm thinking it's more likely we can do something like hoisting the .wrapAll() out so it runs earlier, and either attach a .load handler sooner (so it actually gets run), hook into .ready() in a way that triggers off the .wrapAll(), or possibly a Promise-based solution.
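One possible shape for that gate, as a sketch (with the usual caveat that a load handler attached after the event has already fired never runs):

// Recalculate only once all subresources have loaded and the page
// geometry should finally be stable.
$( window ).on( 'load', function () {
    pagenumbers.refresh_offsets();
} );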
BTW, I've started adding the dynlayout-exempt class to the key templates we want excluded: {{header}}, {{license}}, and {{authority control}}. My thinking is this is a determination that belongs in the template, when we actually control the generation of the page element; and we'll save the JS manipulation for things we don't control (like the edit box). It'll be a cleaner separation and make for cleaner code. I'm also toying with some ideas we could explore long-term to make MW and PRP directly support this use case (but we should think that through well before bugging the devs; maybe see it in connection with sidenotes?). Things would be much easier if we had a PRP equivalent to #mw-content-text to glom onto, and a dedicated left and right column (div) to stuff things into without needing the .wrapAll() stunt in local JS. If that came from MW or PRP it could more easily be Grid or Flexbox based, and save us some trouble. But I digress…
The separation of concerns track also got me thinking about how to handle the various styles involved in PageNumbers.js. I'd originally thought to make everything stylesheets in MediaWiki:-space that we just apply from JS. But some of this is stuff where we are not just applying a style, but toggling or changing values for a property (i.e. we might need to unapply a given property rather than rely on it getting overwritten). This especially goes for properties that are not really part of the individual layouts (most of what's in the current gadget .css), so it may be that that's a dividing line. I haven't concluded (or thought all that deeply) on it, but figured I'd toss it out there in case you had any thoughts. Having a dynamic layout boil down to a MediaWiki:dynlayout-someid.css and possibly a simple configuration (for things like the displayed name) in the gadget is very tempting (and I think it might just be possible to avoid hardcoding anything regarding individual layouts in JS too, but I haven't gotten around to testing that out yet). --Xover (talk) 09:15, 16 December 2020 (UTC)
@Inductiveload: Oh, and I'm guessing the reason this works is that you're attaching a new .ready() handler, and these are executed in the order they were attached, so it ends up executing sequentially after the current .ready() handler; which just happens to be late enough that the page geometry is final. I don't see any obvious reason that variant should be any more gated on final rendering than just calling the function there directly would be. --Xover (talk) 13:25, 16 December 2020 (UTC)
I get it, and the absurdity of adding a ready hook inside the ready hook doesn't escape me. So, I guess what I'm saying is...
setInterval(pagenumbers.refresh_offsets, 1000);
:trollface: Inductiveloadtalk/contribs 14:33, 16 December 2020 (UTC)
@Inductiveload: Heh heh! Don't think it didn't occur to me! :-)
BTW, MutationObserver won't work: it triggers off DOM changes, not property changes (like .offsetTop). --Xover (talk) 14:48, 16 December 2020 (UTC)

Anglo-Saxon Riddles of the Exeter Book

Can I request a download of a scan from HathiTrust? —Beleg Tâl (talk) 17:24, 16 December 2020 (UTC)

@Beleg Tâl: A 1963 translation, by a translator who died in 1964. Copyright? --Xover (talk) 17:42, 16 December 2020 (UTC)
The work is tagged as {{PD-US-no-renewal}}; it entered PD in 1992. You can view the uploader's rationale here. —Beleg Tâl (talk) 18:21, 16 December 2020 (UTC)
@Beleg Tâl: File:Anglo-Saxon Riddles of the Exeter Book (1963).djvu. As an experiment I tried a relatively aggressive compression profile (it's 4MB). Please check that it didn't destroy the quality before M&S. --Xover (talk) 18:45, 16 December 2020 (UTC)
What settings did you use, out of interest? Works of Bentham come out well over 2GB (I didn't use a size limit, so it's kinda my fault, but still). Inductiveloadtalk/contribs 19:28, 16 December 2020 (UTC)
@Inductiveload: The scan was black and white already, and with very good background separation (color-wise), so I simply converted to PBM and used the cjb2 encoder. I didn't tweak any of the encoder settings: the output was 2.6MB (the rest are the color covers, and one internal page with some gray tones, that I added manually afterwards from JPEG), versus 90MB for the same test using PGM for input. --Xover (talk) 21:01, 16 December 2020 (UTC)
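In DjVuLibre terms the whole bitonal path is pleasantly short; roughly this, shelling out from Node as in the other sketches on this page (file names illustrative):

// cjb2 for the bitonal PBM pages, c44 for the colour covers, djvm to bundle.
const { execFileSync } = require( 'child_process' );
execFileSync( 'cjb2', [ '-clean', 'page_0001.pbm', 'page_0001.djvu' ] );
execFileSync( 'c44', [ 'cover.jpg', 'cover.djvu' ] );
execFileSync( 'djvm', [ '-c', 'book.djvu', 'cover.djvu', 'page_0001.djvu' ] );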

Category:Ready for export

So I finally got round to creating this category and Help:Preparing for export. Which is why I've been hunting things like {| align=center.

Feel free to throw your favourite things into it. Pending phab:T270387 it's not that much use, but one day soonish hopefully we can have an OPDS catalog like an actual library.

So far my main pain points have been:

  • Old formatting like {| align=center, which comes out more like text-align: center; on my reader (see the note after this list).
  • Pages with TOCs on subpages
  • Pages using dotted TOC leaders, which, despite some expedient hacks to ws-noexport the most egregiously broken elements, still don't render correctly on all devices. So far I've not put them in the category, but actually they're not nearly as bad as they were, they're just a bit raggedy rather than totally borked.
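For reference, the usual replacement for the first of those is margin-based centring, which exports cleanly:

{| style="margin-left: auto; margin-right: auto;"

in place of the deprecated {| align=center.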

Inductiveloadtalk/contribs 07:09, 18 December 2020 (UTC)

Re Lua per Template talk:header/year

Thanks for the fix, I will give it a prod in a while.

I understand about conversion to Lua, though it is outside of my knowledge base, which is the one reason that I hesitate to push. When we do it, I hope that it is in a number of smaller components, so that it is easier to identify issues, easier to update and document, and maybe then more usable across the range of places/namespaces where we re-use the logic. I cannot fathom getting right output from Module:WikidataIB <call me simple>. I also want to know that when we do it we review our metadata aspects, as I think that we are still too blurry for Wikidata to inhale our data well. Stuff that is outside my comfort zone. — billinghurst sDrewth 20:27, 22 December 2020 (UTC)

@Billinghurst: Yeah, one main goal in converting to Lua is to re-architect it to reduce duplicate code spread around a million places. It should be possible to have a single Lua header module backing at least {{header}} and {{translation header}}, and with reusable functions that can be invoked elsewhere if needed. I am also hoping we can clean up and modernise the HTML we output, and to let the different formatting between header/translation live in a style sheet, but that may take some unspooling of the spaghetti that's accumulated over the years.
On the metadata I can't really comment intelligently as I haven't really look into it that closely. But I can say that if Module:WikidataIB confuses you that's probably mostly to do with the information model and interfaces that Wikidata itself provides: I find these deeply weird and confusing every time I try to do anything but the very simplest operation on Wikidata. I'm sure it makes perfect sense to those steeped in it, but to anyone else it just looks plain alien. It's the same sort of feeling I used to get when talking to the RDF folks two decades ago: you can tell they're really really smart, but you're never quite sure they know what planet we're currently on. --Xover (talk) 22:43, 24 December 2020 (UTC)
  I will continue to tidy up the drudgery and get our templates in order and that underlying alignment, especially biographical and encyclopaedic. I will also spread my search for some metadata experts who can assist us. More my skillset. — billinghurst sDrewth 23:20, 24 December 2020 (UTC)

So many directions to look...

It is very hard to track all the different methods of altering the display of texts, so I'm always looking behind the curtain. I found a use of {{sbs}} and then wondered what the hey was behind

<templatestyles src="sbs/styles.css" />

After a while I found the Wikipedia: page for templatestyles and a couple of other bits, though it is a stumbling block that there is no descriptive write-up for the non-ivory-tower people.

Anyway, finally got back to trying to figure out why this was needed on the particular page, and still don't know 'why'.

But I did notice that Template:Sbs/styles.css has two mentions of

 div.ws-template-sbs.smaller {
   font-size: 83%;
 }
 div.ws-template-sbs.smaller {
   font-size: 83%;
 }

Anyway anyway, is templatestyles usage supposed to be work/genre-specific, or just another collection of personal choices? Shenme (talk) 04:50, 27 December 2020 (UTC)

Found some more hints? Would be nice if centralized:
Shenme (talk) 09:41, 27 December 2020 (UTC)
@Shenme: In the wider Wikimedia and MediaWiki universe, TemplateStyles is not particularly an end-user feature. It is an extension to MediaWiki that was designed specifically to solve the problem of templates hard-coding a lot of physical formatting in a way that causes some technical and practical problems, and which makes it hard to tweak the formatting because you have to edit the convoluted template syntax. So instead, the TemplateStyles extension gives you an HTML-like tag—<templatestyles src="styles.css" />—that lets you load a CSS stylesheet at that point in the template (usually at the start) and in a way that ensures deduplication (using the same template multiple times will only load the style rules once) and scoping (all the CSS selectors are scoped to only apply to the content area).
On Wikisource however, we have somewhat more need of specific formatting: on Wikipedia all articles should look roughly the same, but each of our works has unique formatting quirks. Which means we have need of the functionality of TemplateStyles as somewhat more of an end-user accessible feature, rather than a template-developer feature. We have feature requests in to extend both the TemplateStyles extension and Proofread Page extension to enable various easy ways to add a per-work stylesheet to our works through the Index: page. One major thing we're missing is the ability to pass variables to the TemplateStyles stylesheets, so that we can do things like let the end user specify a precise letter spacing in em, the way we can with hard-coded formatting in a template. This won't show up in the near future, unfortunately, and in the meantime we can't really go all-in on TemplateStyles; but we're still trying to use it in the places where it makes sense (anywhere we need formatting but we don't need to give the end user unlimited flexibility to tweak values). That is, the main use of it is technical right now.
{{BookCSS}} is a somewhat manageable way to use TemplateStyles to enable per-work stylesheets. It gives the end user a template-based interface to specify the stylesheet for a work, and a centralised location to store the stylesheets. You could in principle use a <templatestyles src="styles.css" /> tag anywhere manually, so {{BookCSS}} is just about user friendliness and manageability.
{{sbs}} on the other hand is currently more of an experiment. While our formatting templates provide unlimited flexibility for tweaking, most of the time we do not actually use that flexibility: we just use default values, or we use a small number of values falling into a broad category (think small, medium, or large; vs. a numeric value from 0–100). These can fairly easily be handled by TemplateStyles at the expense of flexibility that is rarely used and mostly not really needed even when we do use it. In addition I have been bothered by the inconsistent interfaces and names provided by our formatting templates. We have {{center block}} and {{block center}}. Some templates need the text they apply to in a parameter to the template, some automatically fall back to start and end templates, and some provide /s and /e variants. Some need a unit specified for parameters ({{gap|2em}}), and some hard-code the unit and take only a number ({{bar|2}}).
{{sbs}}/{{sbe}} (mnemonic: styled block start and end) and their inline siblings {{sss}} and {{sse}} (mnemonic: styled span start and end) are an experiment to see if we can improve this using TemplateStyles and a lightweight template wrapper around a stylesheet where all the interesting stuff is defined. By using CSS selectors ("style names") as the template parameters, what we're effectively doing is just adding CSS classes to the div and span elements. The templates are always start+end templates, and each invocation of them can apply any formatting for which we have rules in the stylesheet. Staying close to CSS also means we can re-use "muscle-memory" and know-how for those who have worked with CSS before, and Google will be an effective help for looking stuff up in general CSS resources.
So far {{sbs}} is a qualified success, but {{sss}} looks like it'll be a bit too verbose to be worthwhile. There are some unsolved problems, and it remains to be seen if the community will take to this approach, so if you want to play with it in mainspace you should be prepared for the possibility you may have to go back and redo pages without it (if it gets actively deprecated or something). That's why there's no documentation for it yet: I am purposefully leaving the bar a bit high so people won't get fooled into using it without knowing the caveats that currently apply.
Hope that was helpful, and please don't hesitate to ask if there is anything else that is unclear or you're wondering about. --Xover (talk) 12:37, 27 December 2020 (UTC)

Zawis and Kunigunde

Hello. May I ask you to have a look at File:Zawis and Kunigunde (1895).djvu? The IA uploader added an extra first page, and now I have found out that all the OCRs are shifted too; e.g. page 15 has the OCR layer of page 16. I reported it to the phabricator, but now I need this particular file fixed. It would help me very much, but there is no hurry. --Jan Kameníček (talk) 14:02, 30 December 2020 (UTC)

@Jan.Kamenicek: Done. Please check the result. --Xover (talk) 15:26, 30 December 2020 (UTC)
Oh, that was quick, and the result is great. Thanks very much! --Jan Kameníček (talk) 15:49, 30 December 2020 (UTC)

Fonts

I just tried to prod along phab:T166138.

A couple of other fonts we could do with are sans and serif "Outline" fonts, because clever though {{font outline}} is, it doesn't look great. E.g. the second line of this page for sans. But I can't immediately see good SIL-licensed candidates. Any ideas? Priority: lowest.

Also for reference, phab:T270743 tracks the ability to export used ULS fonts. Inductiveloadtalk/contribs 11:49, 31 December 2020 (UTC)