"On Discoveries and Inventions" completed

I have finished supplying all the passages that I had left out in my original translation of Prus' 1873 lecture "On Discoveries and Inventions".
Due to differences of syntax between Polish and English, and to the way sentences are sometimes completed on a succeeding page, in places there will be mismatches, between the two language versions, at the end of a page.
I am struck by how germane the author's observations, set down 147 years ago, are to our day. This is reflected in several of my notes. These are clearly marked as the translator's, do not disturb Prus' text, and conceivably might interest some readers. However, I leave their retention to your judgment.
I hope that when the text is restored to Wikisource, a link can be provided to the scan of the original text that you were able to locate.
Kindly let me know, should there be questions about any part of my translation.
Would you recommend that I add "CC-BY-SA-3.0" to my other Wikisource translations that currently carry only a "GFDL" license?
Nihil novi (talk) 12:37, 21 June 2020 (UTC)
@Nihil novi: Excellent news, and great work!
Page mismatches due to language differences are to be expected, so that's nothing to worry about. The work now lives at Translation:On Discoveries and Inventions, and the old title (On Discoveries and Inventions) will remain valid as a redirect until such time as a different work of the same title is added here (at which time we'll presumably find some suitable way to disambiguate and link the new location). Please have a look through Translation:On Discoveries and Inventions to check that everything still looks good after transclusion. One thing in particular to look out for is two paragraphs running together. This happens when the end of a paragraph coincides with the end of a page. In these cases we have to manually tell the software to preserve the paragraph break by placing the template {{nop}} at the end of the first of the two pages.
The footnotes, and even the Wikipedia links, are problematic though: these both count as annotations and are not permitted in normal works. Annotated versions are supposed to go into a separate copy of the work that is clearly labelled as an annotated edition. In other words, we will have to do something to address that. However, for various technical reasons, I am not sure how we could sensibly do that just now; so for now I propose we just leave it as is and I'll try to come up with something there. If anyone should object in the mean time the links and footnotes can be easily removed (and since all old revisions are kept in the page history, can also be easily restored if needed).
I would also encourage you to go through each page and update the page status to "Proofread" for the pages you consider complete and finished (which should be all of them as I understood you). We can then try to find another Polish speaker to go through them and "Validate" them. This two-step transcription process is standard for English language works (two people independently verify that the transcription is correct), but we might as well employ it for translations too even if the situation is slightly different.
And, finally, I've updated the wikipage at Translation:On Discoveries and Inventions to use the requisite header template ({{translation header}}), which always sets the translator to "Wikisource". This is because such translations are considered to be collaborative and ongoing efforts, somewhat akin to Wikipedia articles. In reality this is unlikely to be a significant factor for this specific work (the idea was more aimed at something like a collaborative new translation of Tolstoï or Aristotle), but… In any case, I noticed you had set the translator's name to what I presume is your real name. Would you like us to credit you (and if so, by that name or just by your username here?) on the work's talk page? It will be a lot less visible, I'm sorry to say, but that is the standard way we have of doing it (using the {{textinfo}} template). Most contributions here are simply "credited" through the username appearing in each page's revision history, but translations are a little different so adding a separate note about it feels appropriate.
Regarding your other translations: yes, do please replace the {{GFDL}} tag with {{CC-BY-SA-3.0}} to avoid any confusion. Technically, every original contribution you make here is dual-licensed under both those licenses for historical reasons (there's some fine print about it just above the "Publish changes" button in the editing form when you edit a page), but when an explicit license tag that contradicts it is added to a page it causes confusion and may end up with the work being deleted. --Xover (talk) 14:52, 21 June 2020 (UTC)
Thank you.
I am reviewing "Translation:On Discoveries and Inventions", checking on paragraph divisions. So far, I find two paragraphs run together: "Until 1846,..." should mark the start of a new paragraph; but when I switch to editing mode, the text will not advance beyond the title page. I tried entering "nop" into my Wikisource translation text, but that does not split the two run-together paragraphs on the "Translation..." text. How can I accomplish this correction?
Nihil novi (talk) 22:56, 21 June 2020 (UTC)
Problem apparently solved: I see the correction now made in "Translation...".
Nihil novi (talk) 23:11, 21 June 2020 (UTC)
@Nihil novi: Good to hear! Let me also take the opportunity to thank you for your contributions, and for putting up with our at times arcane tools, practices, and policies. We're aware this all could be a lot more user friendly, but let's just say that that makes us appreciate anyone willing to stick it out despite the challenges even more! :) In any case, thanks for contributing, and do please feel free to ask me if you need help with anything else. You can also always ask at Wikisource:Scriptorium/Help, where the whole community will see it, in case I am not available (it's an entirely volunteer driven project, so individual people here tend to have unpredictable availability). --Xover (talk) 04:50, 22 June 2020 (UTC)
Thank you for patiently shepherding me through this process.
I am also indebted to you for prompting me to translate the missing passages.
At one time, I seriously considered emulating the monks who worked anonymously in their scriptoria.
Having, however, published papers and books under my name as their translator and sometimes their editor, for bibliographic reasons I would appreciate being credited with this translation, as previously, by my civilian name.
And could this translation also be listed on the "Author:Bolesław Prus" page and the "Author:Christopher Kasparek" page?
If an annotated edition of this piece is feasible, I think it could help connect the author's mind and times with the present-day reader's.
From what you write, no one should object to my substituting the "GFDL" license with "CC-BY-SA-3.0" at my other translations, and I will try to do so. Their original Polish texts are available for comparison on Wikisource.
I hope I may indeed again impose on you for advice.
Thank you.
Nihil novi (talk) 07:18, 22 June 2020 (UTC)
User:Piotrus has generously completed his review and validation of the English translation of On Discoveries and Inventions [1] by Bolesław Prus (Aleksander Głowacki).
I gather that the translation has now attained full rights of residence on Wikisource.
I would like to again thank you for encouraging me to complete the partial English translation, of some years back; for tutoring me on Wikisource procedures and techniques; and for offering your own helpful comments on the translation.
I wonder whether I could further impose on you: to credit the translator (Christopher Kasparek), as we discussed above? I fear that, were I to attempt doing this myself, someone would have to correct my errors made in the process.
I trust you are successfully maintaining social distance during this Covid–19 pandemic!
Many thanks,
Nihil novi (talk) 19:15, 14 September 2020 (UTC)
@Nihil novi: Great work; and kudos for the fortitude of sticking with it through the really rather less than user friendly tools and process! An eminently interesting work, and its implementation will stand as an example we can point future contributors to!
I have added a note to the translation's talk page—Translation talk:On Discoveries and Inventions—crediting you with contributing the translation. As a translation that has not been previously published (on a proper publishing house) our policy is to treat it like a collaborative work (i.e. so that Piotrus's contributions are acknowledged) by crediting it in the work's main header as a "Wikisource translation". In this particular case that's a little awkward, I feel, since you were clearly the main translator and it would be most natural to simply credit you as such; but our policy doesn't really allow for that, and it would have negative effects in other cases. But on the work's talk page we are free to explain the situation more specifically. It is also now featured in the "New texts" section of our Main Page.
In any case, if you want to translate more of Prus' works and need assistance, please do not hesitate to ask. --Xover (talk) 08:25, 17 September 2020 (UTC)
Thank you.
It is good to see Prus's prescient 1873 lecture now available on Wikisource in English, 147 years after he delivered it in Warsaw in Polish.
Do you happen to know how Wikisource came by the scan of the lecture's printed version?
Is there a straightforward way for Wikisource to obtain scans of other Polish public-domain works, perhaps from the Polish National Library?
Nihil novi (talk) 09:20, 17 September 2020 (UTC)

Technical noodling on Annotations

@Inductiveload: I'm a little short on spare cycles just now, so perhaps I could prevail on you to help me think through this a bit?

This work is now a scan-backed Wikisource translation and an annotated work. Annotations need to be in a separate and clearly labelled page. Since it is actually scan-backed (translated page by page in Page:-space) we can't (well, "shouldn't") just cut&paste but use multiple transclusion. Which means we need some technical facility to handle the difference dynamically.

This work uses two kinds of annotations: translator's notes (footnotes), and wikipedia links. My original thought (which I've not had the cycles to flesh out yet) was something like {{annotation note}} and {{annotation link}} that will output nothing/unlinked text in the Translation: but spit out <ref>…</ref> and wikilinked text (respectively) in the annotated version.

In addition to not thinking through the template details (there may be better ways, or the approach might be infeasible), I have no good idea how to distinguish between unannotated and annotated versions of a work. Some previous approaches have relied on the annotated version being on a /Annotated subpage, or on having …(Annotated) in the page name. None of those approaches have been good (but possibly for other reasons than the trigger). But I have no clear idea of alternative approaches.

Thoughts? Ideas? --Xover (talk) 09:53, 30 June 2020 (UTC)
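
For illustration only, the intended behaviour of the two proposed templates can be modelled in a few lines of Python. This is a sketch, not wikitext and not an existing implementation: the function names mirror the proposed {{annotation note}} and {{annotation link}}, the `annotated` flag stands in for whatever trigger ends up distinguishing the two versions, and the `w:` interwiki prefix for Wikipedia links is an assumption.

```python
def annotation_note(note: str, annotated: bool) -> str:
    """Model of the proposed {{annotation note}} (hypothetical).

    Emits a footnote in the annotated version and nothing otherwise.
    """
    return f"<ref>{note}</ref>" if annotated else ""


def annotation_link(text: str, target: str, annotated: bool) -> str:
    """Model of the proposed {{annotation link}} (hypothetical).

    Emits wikilinked text in the annotated version and the bare,
    unlinked text otherwise.
    """
    # Assumption: Wikipedia targets are written with the "w:" prefix.
    return f"[[w:{target}|{text}]]" if annotated else text


if __name__ == "__main__":
    for flag in (False, True):
        line = (
            "Delivered in "
            + annotation_link("Warsaw", "Warsaw", flag)
            + annotation_note("Translator's note.", flag)
        )
        print(flag, line)
```

The same page text would then be transcluded twice, and only the flag would differ between the plain and the annotated rendering.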

@Xover: Hmm, it's a tricky one. Any template that needs selective output will have to be sensitive to the environment at render time. AIUI, one can only really control the namespace and the page name. We can't really control the namespace (both will be Translation, I assume). Forcing a subpage like "/Annotated" is going to end in tears, because the two top level works ("Work" and "Work/Annotated") will become interleaved under "Work" (regardless of if you do "Work/Annotated/Sub/pages" or "Work/Sub/pages/Annotated"). You could pick out a title suffix like "(annotated)" with parser functions or modules.
Either way you'll bake the string "/Annotated" or "(annotated)" into the templates. At least you'd want to leave headroom for different annotations, so maybe a pattern like "(annotated( - XXX)?)".
As for the templates, care needs to be taken to not end up with the situation that {{modern}} ended up in. Better hygiene of the template formatting might help here, but for any substantial level of annotation, the Wikicode is going to be a mess.
An alternate solution is to have only one version and provide selective visibility through JavaScript. Switching the CSS visibility of a class or two should work. This is how {{ls}} worked long long ago, you could choose how it was displayed. The intention was to allow various options such as old orthography (long-s, etc), Wikilinks, etc to be under separate control. But it never gained traction and it was then broken and not missed enough to be repaired. I have no idea what this would do to exports. This avoids needing two transcluded copies (and therefore you can't link to it separately), but doesn't really change the Wikicode. Inductiveloadtalk/contribs 10:22, 30 June 2020 (UTC)
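
As a rough illustration of the title-suffix idea above, the pattern "(annotated( - XXX)?)" could be matched along these lines. This is a Python sketch rather than a parser function or Lua module; the function name, the choice to look only at the root page (the part before the first "/"), and the sample titles are all assumptions made for the example.

```python
import re
from typing import Optional

# Literal "(annotated)" or "(annotated - LABEL)" at the end of the root title.
ANNOTATED_SUFFIX = re.compile(r" \(annotated(?: - (?P<label>[^()]+))?\)$")


def annotation_label(page_title: str) -> Optional[str]:
    """Classify a page title by its annotation suffix (hypothetical helper).

    Returns None for an unannotated work, "" for a plain "(annotated)"
    suffix, and the label text for "(annotated - LABEL)".
    """
    # Assumption: the suffix sits on the root page, so subpages such as
    # "Work (annotated)/Chapter 1" inherit it from the part before "/".
    root = page_title.split("/", 1)[0]
    match = ANNOTATED_SUFFIX.search(root)
    if match is None:
        return None
    return match.group("label") or ""


if __name__ == "__main__":
    for title in (
        "Translation:On Discoveries and Inventions",
        "Translation:On Discoveries and Inventions (annotated)",
        "Translation:Some Work (annotated - notes)/Chapter 1",
    ):
        print(repr(title), "->", repr(annotation_label(title)))
```

Keeping the optional label in the pattern leaves the headroom for more than one annotated version, at the cost of baking the suffix string into whatever does the matching.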
@Inductiveload: Thinking out loud (and definitely not deeply): New namespace for annotations, Annotation:, with policy to say only two kinds are permitted, inter-project links (i.e. wikipedia) and footnotes. Annotation: can contain both annotated normal works and annotated Wikisource translations, and distinguishing is done with {{header}} vs. {{translation header}}. All works in the namespace must be scan-backed, already transcluded to either mainspace or translation, and must use one of the approved annotation templates (starting set: the two I sketched above; additional ones as we come up with them). No grandfather clause: existing works migrated there must also be migrated to be compliant with that policy. The namespace is the trigger for the annotation templates. Thoughts?
It also occurs to me that Spangineer has made an initiative to move WS:ANN to actual policy, which might "synergize" well with trying to introduce this kind of scheme. --Xover (talk) 18:32, 30 June 2020 (UTC)
That would probably be a decent solution. However, transcluding the same text twice, once to mainspace and once to annotation-space, does kind of pre-suppose that we only have one annotated version. You could imagine that there might be two annotations of the same work. Though I think this is unlikely to actually happen.
Also, proofreading the annotations would be annoying: the wikicode would be cluttered and you'd need a way to force annotations on and off in page space. Though I suppose that could have a simple gadget with a side-bar control.
Other policy-sided thoughts all sound fairly sensible. I'm not sure if the "allowed annotation" list is a bit restrictive, but then again, I don't actually expect any completed "intense" annotations to actually exist any time soon. I could imagine line-by-line analysis of, e.g. Bhagavad-Gita (Besant 4th)/Discourse 1 or Translation:The Story of the Stone/Chapter 1 or something. But, TBH, that's beginning to stray off the Wikisource reservation slightly and perhaps slightly into Wikibooks/versity territory. Inductiveloadtalk/contribs 09:53, 10 July 2020 (UTC)
@Inductiveload: To my mind, limiting annotations to one per (edition of a) work seems like a reasonable first approximation. At least absent counter-examples my thinking is that we do not want competing annotated versions: we want collaboration to make one single even better one. But this, and the limited kinds of permitted annotations I envision, is coloured by my desire to have a clean and fresh start here. Not "anything goes, and we'll dial back anything later deemed problematic", but "these specific deviations from the normal proofreading are permissible, and we'll consider any additional variants if a good use case comes up". With only wikilinks (and with guidance designed to avoid "sea of blue" problems, ala w:WP:OVERLINK) and footnotes—both eminently containable—it should be within reason in terms of editability. Or so is my hope at any rate. I imagine such annotations will either be added by the one first proofreading a work, and while doing the proofreading, in which case the annotation artefacts are what they want and not in the way; or they are added after the fact to a work that has already been proofread, in which case they will (obviously) not get in the way of the proofreading.
You're right that this model will not be a good fit for a line-by-line analysis or similar (possible, but not a good fit). I'm not sure what the solution for such annotations are. Some would certainly be Wikibooks/Wikiversity material, but I can imagine there being a significant grey area. I am comfortable kicking that can down the road though. I'm thinking a strictly limited starting point that is intended to be expanded—slowly and carefully—over time in order to maintain some semblance of control over scope and quality; not that it should never be expanded.
There's also a little voice nagging at me that either ProofreadPage or whatever they're using over at Wikibooks, if relevant, might conceivably be expanded in some way to allow for multiple "branches" off the same file/index. With Multi-content revisions and some of the related tech, you might have in-software support for creating both translations and annotations off the same proofread set of pages. Maybe, after proofreading, you could hit a "sync to sandbox"-type link to populate the "Annotation" slot of the Page: pages with a copy of the wikitext from the main proofread slot. Turn on and off annotations dynamically in mainspace, maybe? That's probably overkill right now, but as a sort of long term pie-in-the-sky type thing… --Xover (talk) 13:22, 10 July 2020 (UTC)

Uncovered files

Whilst going through the files I have recommended for deletion, I have come across the following other files, which may need your attention.

I have finished all of the files beginning with “A,” and listed them on WS:PD, as you have seen. TE(æ)A,ea. (talk) 18:34, 17 July 2020 (UTC).

Files for B

I also noticed File:British Indian Ocean Territory Constitution Order 10.06.2004.pdf, although I am not sure if this is acceptable for Wikimedia Commons. TE(æ)A,ea. (talk) 12:41, 18 July 2020 (UTC).

@TE(æ)A,ea.: Thanks. I think I've got all of them; except Bat Wing and the British Indian Ocean Territory Constitution Order for which I am still trying to figure out the copyright situation. Archaeologia Britannica is currently at WS:PD so I left it to be handled there. Arizona Proposition 302 is of unclear copyright status, and may need a trip to WS:CV. --Xover (talk) 18:59, 20 July 2020 (UTC)

Files for C

As for the “Order” mentioned above, it is some form of U. K. government document, but I do not know if it falls under one of the Crown Copyright exemptions. When I finish listing all of the files for speedy deletion, I plan on going over all of the other files held locally—the ones properly unfit for movement to Wikimedia Commons—and marking them more accurately with expiry dates. TE(æ)A,ea. (talk) 21:52, 20 July 2020 (UTC).

Files for D

Files for E

The files for speedy deletion are listed here, so as not to waste space on your talk page. The files for review are the following.

I have been putting off the work for these for some time, but I’ve finally started the work again, so I will leave comments on this page every so often. In addition, the page I listed above will also be updated with new listings from time to time; remove the old listings as you see fit. TE(æ)A,ea. (talk) 19:58, 25 September 2020 (UTC).

Files for F

Files for G

In addition, the files associated with The Great American Fraud should be moved to Wikimedia Commons, as that work itself is acceptable there; however, it is probably preferable to clean up that page before moving the images. TE(æ)A,ea. (talk) 23:09, 16 October 2020 (UTC).

Files for H

  • File:HC-afrikaans.pdf (and the associated page) should be deleted, as it is not written in the English language;
  • File:HC - Frans.pdf (and the associated page) should be deleted, as it is not written in the English language;
  • File:Hamilton Korea full view text.pdf should be deleted (and the associated pages overridden) by this superior file, as the current file is quite inadequate;
  • File:Heathen-frontis.jpg should be moved to Wikimedia Commons, as are already the other images from that work;
  • File:Hector Macpherson - Herschel (1919).djvu should be moved to Wikimedia Commons, as (as is indicated on the title page) it was published simultaneously in the U. K. and the U. S., and can therefore be considered to have been first published in the United States according to Commons procedure (I believe);
  • File:Heralds of God.djvu should be moved to Wikimedia Commons, because it is in the public domain in the U. S. (the country where it was first published);
  • File:How to Write Music.djvu should be moved to Wikimedia Commons, because it is in the public domain in the U. S. (the country where it was first published) and
  • File:Howards End.djvu should be moved to Wikimedia Commons, because it is in the public domain in the U. S. (the country where it was first published). TE(æ)A,ea. (talk) 23:09, 16 October 2020 (UTC).
@TE(æ)A,ea.: I think I'm all caught up as of H now. Thanks for all the effort you're putting into this! But a quick note:
Just because the scan of an edition we have here was published in the US doesn't mean the work as such was first published in the US. Case in point: File:Howards End.djvu. Howards End was first published in 1910 in the UK, and is subject to a pma. 70 term that lasts to 2040 there. Both the 1910 and 1921 editions are PD in the US because it is more than 95 years since they were published; which makes them ok at enWS (which only considers US status) but not at Commons (which also considers status in the country of first publication). There were at least a couple of cases of this up above. --Xover (talk) 19:01, 18 October 2020 (UTC)
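
The arithmetic behind that distinction can be sketched in Python. This is a deliberately simplified model, not legal advice and not either project's actual policy logic: it assumes a flat 95-years-from-publication rule for the US and a life-plus-N rule for the country of first publication, both running to the end of the calendar year, and it ignores renewals, notices, and every other special case. The function names are hypothetical, and the author's death year (1970) is supplied as an assumption consistent with the "lasts to 2040" figure above.

```python
def us_public_domain(publication_year: int, as_of_year: int) -> bool:
    """Simplified US test: protected through the end of the 95th year after publication."""
    return as_of_year > publication_year + 95


def pma_public_domain(author_death_year: int, as_of_year: int, term_years: int = 70) -> bool:
    """Simplified life-plus-N test: protected through the end of death year + N."""
    return as_of_year > author_death_year + term_years


def hosting_options(publication_year: int, author_death_year: int, as_of_year: int) -> dict:
    """Where a scan could live under the two rules described in the note above."""
    us_ok = us_public_domain(publication_year, as_of_year)
    source_ok = pma_public_domain(author_death_year, as_of_year)
    return {
        "enWS": us_ok,                   # only US status is considered
        "Commons": us_ok and source_ok,  # US and country of first publication
    }


if __name__ == "__main__":
    # Howards End: first published 1910 (UK); death year 1970 is an assumption.
    print(hosting_options(1910, 1970, as_of_year=2020))
    print(hosting_options(1910, 1970, as_of_year=2041))
```

Under these assumptions the 1910 edition is acceptable locally in 2020 but only becomes acceptable for Commons once the pma. 70 term has run out.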

Index:Narratives of the mission of George Bogle to Tibet.djvu

Dear Xover, thank you for taking care to identify and upload a consistent edition of Queen Victoria's Letters. It is a pleasure to work on it now.

In 2015 I had a very similar problem (missing pages) with the edition in the subject. I had uploaded 3 versions of this book to Wikidata before finding out they all had different missing pages.

I discussed that issue with someone on WS back then, and was told that on Google Books there were files to be found, that did not have those problems. Unfortunately, I am currently unable to find this discussion.

Could you please help find the non-faulty version of this book or fix the current scan? --Tar-ba-gan (talk) 23:08, 3 August 2020 (UTC)

@Tar-ba-gan: I've done what I could at Index:Narratives of the Mission of George Bogle to Tibet (1879).djvu, based on Internet Archive identifier: dli.csl.5002, which was the best scan I could find. I wasn't able to find a decent scan of the map here. If you track down one it is ok to just use that in the transcription even though this scan only has a partial map: it'll be a judgement call on what best serves our readers. Let me know if you need help moving anything over from Index:Narratives of the mission of George Bogle to Tibet.djvu. Keep in mind you can just move each page that you want to preserve over to its new position, using the "Move" command in the "More" menu on the page you want to move. If you let me know when you're done with the old Index:/Pages: I can delete them.
PS. Apologies that this took so long. It was a somewhat complicated case, and no decent scans to be found. --Xover (talk) 15:30, 7 August 2020 (UTC)
No wonder it took time to identify the more complete scan! I had tried and failed miserably, and this kind of text (Explorers/Himalayas) is "systematically" important for me (unlike Queen Victoria's letters) so I was quite frustrated about that for years. Thanks for solving this! --Tar-ba-gan (talk) 08:04, 8 August 2020 (UTC)
Dear Xover, after a bit of work I find that the situation is as peculiar as this: I think the old faulty scan with a couple pages missing cannot be removed until the new transcription project is finished. The thing is, OCR and occasionally page preservation of the most recent scan is quite bad, so the best I can do is to open new pages simultaneously in both scans and copypaste the text from the older scan to the most recent one. --Tar-ba-gan (talk) 23:06, 10 August 2020 (UTC)
@Tar-ba-gan: Ouch! I'm sorry I couldn't get you a better starting point, but I don't think there is a lot I can do about the OCR quality. This scan just doesn't give the OCR engine a lot to work with (it's a combination of several factors, chiefly the lack of contrast between the text and the background, and the texture in whatever paper they printed this on; not to mention that the stamps in the header are confusing the OCR engine terribly). My only suggestion is to try enabling the Google OCR gadget in your preferences and try that on the bad pages. In pathological cases like this it can sometimes give much better results. Other than that we'd need a better scan to get better results, and I was unable to find one with all pages present etc.
I can generate a DjVu from any collection of images, so if you want to try to cobble a complete copy together with images from multiple scans that would be a possibility. It'd be rather a lot of fiddly manual work, so whether it's worth the effort depends on just how bad the current OCR quality is.
I'm sorry I couldn't be of more help here.
PS. Oh, and don't worry about the other scan. In the Index: and Page: namespaces there is no particular hurry. We just don't want duplicates and faulty scans sitting around indefinitely so that users waste time proofreading them. --Xover (talk) 07:41, 11 August 2020 (UTC)

Files for speedy deletion

Could you move the list to a sub-page? It takes up a lot of space on the main deletions page, and the deletions aren’t controversial. As for the listings, I am working on this month’s WS:PotM work right now, and won’t be able to get back to going through the files for a week or so. TE(æ)A,ea. (talk) 12:17, 12 August 2020 (UTC).

@TE(æ)A,ea.: The existing stuff on PD can just get closed and archived off on the usual timer (I'll do it on my next spin through processing that). But the rest of these you can just dump here on my talk since it looks like I'm the only one processing these anyway. Transwikied files is the speedy criterion with the least potential for controversy, and if anybody had objections on principle they've had the chance to raise them on PD for a while now. --Xover (talk) 14:14, 12 August 2020 (UTC)

Additional deletions

Per this discussion, the following pages should be deleted:

By the way, I will be able to deal with more loose files soon, so you have that work to look forward to. TE(æ)A,ea. (talk) 21:14, 26 August 2020 (UTC).

@TE(æ)A,ea.: Done. And thanks! --Xover (talk) 06:03, 27 August 2020 (UTC)

occupational categories rejig

I have set up proof of concept conversions for some of the occupation categories

and the requisite Template:Category disambiguation and configured HotCat to not allow the category's addition, and instead to show the sub-cats. Hoping that you are a HotCat user and willing to test and confirm that this will work. — billinghurst sDrewth 05:12, 28 August 2020 (UTC)

Though maybe the template should be renamed to align with c:Template:MetaCat — billinghurst sDrewth 05:21, 28 August 2020 (UTC)
@Billinghurst: Limited testing, but it seems to work very well so far! The template is, I think, what enWP calls a "fully diffused" category, and they take their categories seriously over there, so it might be worthwhile to see if we could crib something from there for the template. --Xover (talk) 18:28, 28 August 2020 (UTC)
Grr! Sometimes I think MediaWiki is obtuse on purpose. This edit showed a new timestamp (18:29) when I previewed the change, and the edit history confirms it was saved at :29, but somehow the saved timestamp shows :28 and thus the ping didn't work. Infuriating!
In any case, Billinghurst, see above for the testing. I also had a quick look at the templates enWP uses for diffusing categories (vs. category disambiguation), which is listed in the navigation template at w:Template:Other category-header templates ("Maintenance" section). Most obviously relevant here would be w:Template:Container category and w:Template:Diffusing subcategory, both of which look reasonable, code-wise. But whether we want to treat this as disambiguation or diffusion I've no particular opinion on. --Xover (talk) 07:50, 29 August 2020 (UTC)

What is Property?

I have made the change, with the exception of chapter 4 and 5 of the “First Memoir,” as these were originally divided, but are not now. TE(æ)A,ea. (talk) 15:32, 7 September 2020 (UTC).

@TE(æ)A,ea.: Done. Please check that I didn't mess anything up. --Xover (talk) 16:28, 7 September 2020 (UTC)


Hi Xover, could you restore the Charter of Fundamental Rights of the European Union page please? It's very important. I can easily amend the annotations that you object to: this will be much easier than restarting the page. The relevant link is here: http://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:12012P/TXT

It's linked to on the Wikipedia page, so we need something back up. Wikidea (talk) 15:18, 9 September 2020 (UTC)

@Wikidea: I have restored the text of the deleted page in User:Wikidea/Charter of Fundamental Rights of the European Union. Please do not move it to mainspace before it is in compliance with policy, particularly the annotations policy.
Wikisource primarily hosts previously published works as published, and allows editions annotated by contributors only as an adjunct to properly proofread previously published editions. For this particular case, you should find and upload the PDF of each of the 2000, 2004, and 2007 (iirc) editions of this act; set up an Index for each; and proofread each of them page by page. Once we have all the relevant originals transcluded (at, say, Charter of … Union (2000), Charter of … (2004), Charter of … (2007)) it would be acceptable to set up a comparison of them at Charter of Fundamental Rights of the European Union (Annotation).
Please also keep in mind that we do not use plain wikimarkup for things like headings, nor use automatically generated tables of contents. For a typical heading we would use some combination of formatting templates such as {{center}}, {{x-larger}}, etc.; and for the table of contents typically actual table markup mimicking the original, if there is one present, or {{AuxTOC}} otherwise.
Please feel free to ping me if you need assistance, or you can ask at the help section of the Scriptorium. --Xover (talk) 04:09, 10 September 2020 (UTC)
Thanks, much appreciated. Wikidea (talk) 15:16, 10 September 2020 (UTC)

Your message

Creating OCR text does not in any way make it more difficult for others to proofread text. It significantly accelerates the proof reading process by creating division of labour and by attracting search engine traffic (which increases the number of proof readers). I have every intention of proof reading the pages in question. That said, if you are going to subject me to this kind of harassment, I will leave Egyptian Literature alone. James500 (talk) 12:00, 13 September 2020 (UTC)

I also have to question the attempt to assert ownership of a scan you have not edited for four months (as far as I can see). I do not appreciate being accused of having no intention of proof reading by someone who has not done any proof reading himself for that length of time. James500 (talk) 12:41, 13 September 2020 (UTC)
@James500: You are entitled to your opinion regarding the utility of creating such pages, but I have now given you my opinion on the matter and politely asked you to refrain from doing so. That you choose to cast such a request as "harassment" or an attempt to assert "ownership"—or, indeed, appear to treat anyone disagreeing with you on any subject or matter in a similar fashion—suggests to me that you may find contributing to a collaborative and consensus-based project challenging. In light of that you may wish to consider whether your approach here is really the one best suited to achieve a constructive result. There are many projects and services on the web where one may act unilaterally, but on a fundamentally collaborative one it is necessary to adapt to the needs of other contributors, even at the expense of one's own preferred approach.
PS. The common practice when someone leaves a message on your talk page is to respond there, adding a {{ping}} to let the one who left the message know that you have responded, and not to remove the message and then leave a disconnected reply on the other person's talk page. The reason is that this approach keeps replies in one place and in context of the original message. Even if your personal preference is to do otherwise I would encourage you to follow the common practice in order to facilitate better collaboration with other contributors.
PPS. One of the reasons I have not proofread any pages of that work in the last few months is that I have been busy with other tasks, primarily helping community members such as yourself with the tasks they need assistance with. Your feedback on that prioritization has been noted. But if you wish to actually proofread that work, in a collaborative fashion and in line with the style established for it, then I absolutely encourage you to do so. Collaboration, not unilateral and personally preferred action, is the core of this project, as I have previously emphasised. --Xover (talk) 13:09, 13 September 2020 (UTC)
Falsely accusing me of having no intention of proof reading pages is a personal attack and an assumption of bad faith. I have no problem characterising that kind of comment as harassment. Further, it was not clear whether your message was actually a request or an indication that you might use admin tools. If you are not proposing to use admin tools, you really should make that clear. James500 (talk) 13:45, 13 September 2020 (UTC)
@James500: Do you have some particular reason to expect it would be necessary to employ admin tools in connection with a polite request placed on your talk page? That seems like a rather odd assumption to make.
Contrariwise, based on a random sampling of the several hundred such pages you have created this month alone, none of which have actually been Proofread, it seemed a reasonable assumption that the same would be the case for the pages in question as well. If my assumption was in error, I invite you to demonstrate the mistake by Proofreading at least as many pages as you have created as Not Proofread. In fact, that would, in my opinion, be the very most desirable outcome.
In the mean time, I encourage you to strike your accusations of "harassment", "false accusations", "personal attack", "assumption of bad faith", and reiterated accusation of "harassment". That you choose a confrontational mode of interaction is one thing, but casually throwing around such accusations is not acceptable behaviour. --Xover (talk) 14:28, 13 September 2020 (UTC)
It is not appropriate for you to make comments about what you imagine is going on in my head. That includes assertions that I do or do not intend to do something. Nor is it appropriate for you to seek to put me to proof that I do or do not intend to do something. Especially after I have already started to correct the pages in question. James500 (talk) 14:46, 13 September 2020 (UTC)
I have struck out all of my comments because I do not wish to have any further interaction with you. Nor do I consent to such interaction. James500 (talk) 15:48, 13 September 2020 (UTC)
Thank you for striking out those accusations. That is appreciated. Regarding your wish to avoid further interaction, I shall certainly try to accommodate that. But I must stress that on a collaborative project it is generally not possible to reserve oneself from interacting with other contributors. --Xover (talk) 17:13, 13 September 2020 (UTC)

Quietly Night is Falling... (By I.S. Nikitin)

You deleted poem "Quietly Night is Falling..." from page https://en.wikisource.org/wiki/Author:Ivan_Savvich_Nikitin. I'm Anton Demin. It was my translation. Can you restore it? unsigned comment by Ad2271 (talk) 23:03, 16 September 2020‎ (UTC).

@Ad2271: Hi Anton. Thanks for getting in touch regarding this.
The issue here was that for all content that comes from elsewhere we need documentation, of some reasonable form, that it has either entered the public domain (its copyright protection has expired) or has been actively licensed under a compatible license (typically one of the Creative Commons licenses). For content created directly here by one of our contributors (such as this comment or your preceding one) the site's Terms of Service take care of the licensing part: whenever you save an edit, the licensing terms listed there state that you irrevocably agree to release your contribution under the CC BY-SA 3.0 License and the GFDL.
For translations there are additional complexities because the translation gets its own copyright that is independent of the original. And that was the matter at issue here: the original by Nikitin was in the public domain (copyright had expired), but the translation was credited to an "A.Demin" for whom we had no information to determine copyright status and no evidence of licensing under a compatible license. And under those circumstances our copyright policy requires us to delete the work in order to avoid violating copyright.
However, we do accept user translations, known as "Wikisource translations", by contributors to the site; and as direct contributions they are covered by the licensing terms imposed by the Terms of Service. These translations get special naming (a Translation: prefix in the page name), a specially formatted header ({{translation header}} instead of {{header}}), and are credited as "translated by Wikisource" instead of a named physical person. Attribution to individual users of the site is done through the revision history of the page (what you get at the "View history" tab at the top of the page), and the idea is that almost all pages on the site will be the result of a collaboration so crediting any individual user would generally be misleading or impractical.
In any case… In this specific case, and based on your message here, I think we can probably undelete the translation, move it to the Translation: namespace, and switch out the header; and then just link to the documentation of its authorship here. --Xover (talk) 06:29, 17 September 2020 (UTC)
@Ad2271: Ok, I've undeleted it and updated as described above. It is now available at Translation:Quietly Night is falling. --Xover (talk) 07:44, 17 September 2020 (UTC)
@Xover: Thanks a lot for the quick response and page recovery.--Ad2271 (talk) 12:08, 17 September 2020 (UTC)

We sent you an e-mail

Hello Xover,

Really sorry for the inconvenience. This is a gentle note to request that you check your email. We sent you a message titled "The Community Insights survey is coming!". If you have questions, email surveys@wikimedia.org.

You can see my explanation here.

MediaWiki message delivery (talk) 18:48, 25 September 2020 (UTC)

The story of Prague.djvu

Hello Xover. I have uploaded File:The story of Prague.djvu which I converted from File:The story of Prague.pdf, but the quality of the djvu file is very bad. May I ask you if you could convert it so that the original quality of the scan stayed? Thanks very much. --Jan Kameníček (talk) 15:58, 26 September 2020 (UTC)

Hello again. Meanwhile I found out that although visually the scans in .djvu are very poor, IndicOCR works very well, so it is not that urgent. It may still be good to convert it in a better way for visual purposes, but if you have better work to do, just forget it, it is really not necessary. --Jan Kameníček (talk) 09:35, 27 September 2020 (UTC)
Oh, now I see that I was too slow in writing to you; I should have made up my mind to write earlier. Now it looks awesome, thank you very much!!! --Jan Kameníček (talk) 09:38, 27 September 2020 (UTC)
@Jan.Kamenicek: Done. The Internet Archive had the same scan so I used the scan images from there, simply because it is more convenient to download there.
I also found that IA had a scan of the 1920 second reprint of the work (which looks to be entirely identical) but in much higher scan resolution (about 2.7x) and generally good quality. So I uploaded that too at Index:The Story of Prague (1920).djvu. Since this is an image-heavy work I recommend prioritising the higher resolution scan unless there are specific reasons for preferring the first printing (beyond it being the first printing).
Incidentally, I also strongly recommend tacking on the year of publication in parentheses after the work's title when uploading, even when you don't anticipate there ever being multiple editions transcribed. Multiple editions can come up for any number of unpredictable reasons, and even when they do not, the year of publication in the file name helps put the file in context in a number of situations (telling what's what at a quick glance in a file list, for example). --Xover (talk) 09:40, 27 September 2020 (UTC)
Ad images: That is true, I will consider this. Unfortunately, I have already manually processed and uploaded the images acquired from the older edition, which was quite a lot of work :-( Should I have noticed the edition and copy you have pointed to, I would definitely use that one for the images.
Ad parentheses: True, I will keep this advice in mind. --Jan Kameníček (talk) 09:56, 27 September 2020 (UTC)
So finally I decided to work on the 1920 edition, whose scans have better resolution and which you uploaded too (thanks very much for that). May I just ask whether you changed the position of the map in the book? The copy at IA has it a couple of pages later. Currently the position of the map causes a small problem: the list of illustrations says that an image of View of Prague in 1606 faces page no. 206, but in the uploaded copy it is the map that faces this page instead. According to the list of illustrations the map should face page 212. It is true that such a position (inside the book’s Index) does not make much sense; maybe it could be moved just in front of the Index (i.e. only behind the other three plates). It would not solve the problem of facing page 212 (which would have to be handled e.g. by SIC template) but it would solve the problem of facing page 206. What do you think? --Jan Kameníček (talk) 17:38, 17 November 2020 (UTC)
@Jan.Kamenicek: The details are hazy, but I seem to recall the map was placed in a way that was problematic for some reason, and I had to make a judgement call on where to put it. As I recall there were several such issues with the scan, but most had an obvious resolution. In any case, I think I still have the files sitting around so I'll take a look and see if I have anything intelligent to contribute. Your judgement will probably be much better than mine on this though, since you're more familiar with the work. --Xover (talk) 17:53, 17 November 2020 (UTC)
@Jan.Kamenicek: Oh, hmm, it comes back to me now… I moved the map where it is mainly based on the page numbering: with the map the four illustrations cover pp. 207–210, with the last page at 206 and the index starting at 211. Without it we're one page short. In view of the list of illustrations pegging the map to be on p. 212 (which I probably didn't notice at the time) I would be inclined to move it back to that position (well, facing 212, not on p. 212, but that's a minor quibble) and inserting a dummy blank page after the three other plates (before p. 211). What do you think? --Xover (talk) 18:07, 17 November 2020 (UTC)
Yes, I agree with this. Unfortunately, I am not able to manage djvus, may I ask you to handle it? It would be really helpful. --Jan Kameníček (talk) 18:57, 17 November 2020 (UTC)
@Jan.Kamenicek: Of course! I’ll try to get it done some time tomorrow. --Xover (talk) 19:00, 17 November 2020 (UTC)

Paragraphs in text-layers; or a hack

I was going to ask you if you thought we could get the paragraphs in Djvu files into the text layer as blank lines, then I found that you've not only requested it, but even have a patch pending for it at phab:T230415!

In the meantime, while the wheels grind slowly, perhaps you'd be interested in a filthy JS hack?

      // Insert para breaks when a short line looks like a paragraph end.
      let short_line_thresh = 45; // set per-work if 45 isn't right
      let lines = editor.get().split(/\r?\n/);

      for (let i = 0; i < lines.length - 1; i++) {
        // Short line ending in punctuation, with a fresh sentence starting the next line
        if ((lines[i].length < short_line_thresh) &&
            lines[i].match(/[.!?'"”’—]\s*$/) &&
            lines[i+1].match(/^\s*['"“‘A-Z0-9]/)) {
          lines[i] += "\n"; // becomes a blank line once the lines are re-joined on "\n"
        }
      }
      // lines.join("\n") then goes back through the editor object's setter
      // (that part is not shown in this snippet).


It obviously won't catch all paragraphs, but a majority have "short" last lines. Inductiveloadtalk/contribs 13:43, 5 October 2020 (UTC)
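On the "set per-work" comment in the snippet: one possible way to pick the threshold without hand-tuning is to derive it from the page's own line lengths. The following is only a sketch under that assumption. The function name `estimateShortLineThreshold` and the 0.75 fraction are hypothetical and come from neither the hack above nor any gadget.

```javascript
// Hypothetical helper: guess a "short line" threshold from the text itself.
// Assumption: a typical full line is about as long as the median non-empty
// line, and anything under `fraction` of that counts as "short".
function estimateShortLineThreshold(text, fraction = 0.75) {
  const lengths = text
    .split(/\r?\n/)
    .map((line) => line.trim().length)
    .filter((len) => len > 0)
    .sort((a, b) => a - b);
  if (lengths.length === 0) {
    return 0; // nothing to measure on an empty page
  }
  const median = lengths[Math.floor(lengths.length / 2)];
  return Math.floor(median * fraction);
}
```

A page made up mostly of very short lines (verse, tables) would give a misleading median, so a fixed per-work value may still be the safer choice there.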

@Inductiveload: Ah, indeed, that's a useful little heuristic. Thanks! --Xover (talk) 14:10, 5 October 2020 (UTC)
It also occurs that when that change finally goes through, we have ~900k "red" OCR-dumped pages (#1, yay!) which will all be missing their paragraph breaks - is there an API call to fetch the text layer so we can have a "reload layer" gadget? Sometimes the existing OCR is better than the OCR gadget gives if the IA did a particularly good job. Inductiveloadtalk/contribs 14:35, 5 October 2020 (UTC)
@Inductiveload: I presume there is, but I've not looked closely for it. Phe's OCR tries to get the text layer already in the DjVu first, and only runs Tesseract (3.x, with custom language files) if that fails. If MW didn't provide some way to get at it, Phe's OCR gadget would have to download the entire (possibly 1GB+) DjVu file to get at a given page's text layer.
In addition, I know MW extracts the text layer and stores it in the database in one of the image metadata fields. This is probably/possibly what causes some DjVu files to fail to extract the text layer: it's overrunning a metadata-sized field with a huge blob of text with XML markup.
Oh, BTW, I suspect some users may have preferred Phe's OCR simply because it did better at extracting the existing text layer (cf. the Phab above), and not because its OCR was actually any better. As best I can tell, for 99% of cases it would have been giving the text layer from the DjVu and not actually new OCR. The remaining 1% of cases are probably related to multi-language support and Fraktur text. You may find poking around https://github.com/phil-el/phetools informative.
Oh, and… Note the difference between the number in "Not proofread" and the number in "Not scan-backed". We have a massively disproportionate number of Page: pages that are "Not proofread". I'll leave open the question of from whence these come. --Xover (talk) 14:55, 5 October 2020 (UTC)
Hmm, interesting. I assumed some kind of OCR was happening, because the OCR from the (black and white) OCR button doesn't match the DjVu text layer, though it also seems suspiciously fast. E.g. Page:Lord of the World - Benson - 1908.djvu/42 has straight quotes in the text layer and curlies in the OCR-button'd text, and other differing scannos like "141/2" vs "1414" for 14½.
Re the very high "red" page ratio, I have noticed it before. Even more interesting to me perhaps is how frWS and deWS have such incredibly low problematic rates (I have made zero effort to actually find out what the policies are there; perhaps they just class pages needing images or foreign scripts as something other than "problematic", or do they have crack teams of scan-fixers and image-extractors?). Inductiveloadtalk/contribs 16:48, 5 October 2020 (UTC)

DjVu vs PDF

Greetings! I'm fiddling about with my scripts for uploading and whatnot and I'm wondering what to do about the DjVu vs PDF question.

Motivation: I am hoping to upload the 50 volumes of The Works of the British Poets since that should be a pretty decent baseline for scan backing any orphan poems over time, but this question applies more generally to other multi-volume scan uploads, especially periodicals.

Since the IA has stopped making DjVus (sad face), this means that most of these volumes are PDFs from the IA and some (15/50) would be home-brewed DjVus from Hathi. The real question is: should I bother expending effort making sure volumes are all in one format or the other?

DjVus seem to thumbnail faster, but since the format is so unloved by everyone but us, is there any point regenerating the 35 IA PDFs into DjVus? The effort required to do this is not enormous (just need to massage the IA OCR XML into my DjVu-making script instead of using Tesseract), but it's certainly not zero.

Alternatively, go the other way, and just upload the Hathi images raw to the IA and let them (slowly) generate a PDF and use that (benefits: adds the work to the IA too).

Alternatively again, ignore the 35 IA volumes and use the Hathi images and re-OCR them all (and/or gain access to the HT API for OCR, which I don't think I have, as non-institutional members only get web client access via Friend of UofM accounts).

The only "real" pain point in having a mix of PDF and DjVu, other than the slow PDF thumbnails, is probably that {{Works of the British Poets volumes}} would have to be manually finagled to use the right extensions on a per-volume basis. Inductiveloadtalk/contribs 15:14, 17 October 2020 (UTC)

@Inductiveload: Immediate thoughts…
  • Consistency has a quality of its own, or something like that. Having everything be the same format makes lots of little things simpler through simple removal of impedance and cognitive load.
  • DjVu gives us far better control and options for manipulation, including any doctoring. Any part of a DjVu can be extracted losslessly (but manipulating the page image does require a reencoding roundtrip).
  • MediaWiki does a far better job extracting a text layer from DjVu files than from PDF (I've tested with literally the same text layer: it's ridiculous). I suspect this is only partly due to shoddy coding in MW: it looks like PDF text layers have features less suitable for our use case (more page layout and formatting than logical separation and structure).
  • The IA XML is… fragile (cf. IA-upload's troubles). The ABBYY OCR has a slight edge over Tesseract in most cases, but I always prefer regenerating it from scratch: that gives me full control over the resulting text layer and avoids running into the related bugs in MW. I've very rarely run into a truly pathological case for Tesseract that ABBYY handles well: most of the time the difference borders on academic.
  • IA DjVus get some of their compression by background-separation, but they also scale down and use aggressive compression settings. It gives pathological results on certain (poor quality) inputs, so for anything I care about I regenerate it from the scans instead.
Bottom line, if I were to go at this, I would grab all the scan images, set my script to work on them, and then come back to check on it after a day or two. I wouldn't even use any of the existing DjVus: there's not much difference between generating 15, 35, and 50 DjVus; the computer is just replacing my space-heater for a little bit longer. --Xover (talk) 15:45, 17 October 2020 (UTC)
That makes sense, the heating in the office is electric anyway :-D. Probably it's worth me iterating on my DjVu creation process a bit more. I might try to figure out from the IA peeps how they do the background separation - running it with a slightly less aggressive compression coefficient might allow better results without an excessive file-size tradeoff.
Something else that might give an edge would be figuring out how to train Tesseract for our works. We certainly have various "classes" of files we commonly see:
I wonder if the last two could benefit from dedicated Tesseract trainings which could even be integrated into the OCR button. Inductiveloadtalk/contribs 16:55, 17 October 2020 (UTC)
@Inductiveload: I'm no LSTM expert by any measure, but by my extremely limited understanding, it should be possible to train it for these classes (possibly excepting the flyspeck print: there's too little blood in that stone to begin with). But how much effort it'd take and how much improvement is a different matter. I'm also not sure whether it'd be feasible to deal with long s and the ligatures through training alone: I suspect at least some such things will need explicit support in the engine (like italics etc.). What's more likely to work is training for the styles of fonts used in 1700s (+/-) printing, and the generally inconsistent typography and page layout. I also believe Tesseract does some word matching, so when orthography is different it will have extra trouble that custom training may eliminate.
Regarding file size, I am entirely ignoring that issue for now. It would be nice to get smaller files, but our DjVus are a drop in the bucket at Commons; and the file size has essentially zero effect on the load time etc. of thumbnails here (they're all run through Ghostscript with pathological runtimes into the tens of seconds completely unrelated to the efficiency of our compression). So while obese files offend my techie preference for efficiency, they have very little practical effect.
I also worry that background-separation will have marginal effect in our use case: it will require generating a mask for the actual text, which is going to be hard to automate at scale for the same reasons OCR has trouble with our works. But it would be extremely interesting to see how IA did that and what compression settings they used. It seems impossible that we couldn't get much better compression, without going as far as IA and the attendant problems. For my personal typical use there's also the possibility to apply custom settings per work, which wouldn't have been practical for IA; but that means I could try the aggressive settings first, and then just back off if the results are poor. --Xover (talk) 07:39, 18 October 2020 (UTC)
I asked at the IA forum and got a great response from one of their people. Tl;dr, the background segmentation is called "mixed raster content" (MRC) and was applied by a proprietary DjVu tool and isn't provided in DjVuLibre (i.e. c44 and friends).
So unless we want to invent an MRC encoder for DjVu (see link in the reply above), we're stuck with what we've got in terms of encoding. I get that the filesize isn't really a major issue and the thumbnailer doesn't really care (and at least DjVu is faster than PDF), but it pains me to be uploading ~100MB files that could more reasonably be <10MB. And even if novels are OK, some works that are ~1000 pages of dense text can easily get a bit out of hand.
I'll investigate training Tesseract over time, I feel that it's likely that with a bit of processing (OpenCV??) and training we can likely improve it quite a bit. Even if it isn't full-auto having "profiles" to regenerate OCR would be quite handy for classes of texts we usually struggle with. Inductiveloadtalk/contribs 11:30, 18 October 2020 (UTC)

Tesseract hint

FYI, I found out the hard way that if you are multithreading something that shells out to call Tesseract, the performance is pitiful (like, 20 times slower or worse) if you don't set the environment variable OMP_THREAD_LIMIT=1 to restrict each thread's Tesseract instance to a single thread. Inductiveloadtalk/contribs 21:33, 17 October 2020 (UTC)
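For reference, a minimal Node.js sketch of the hint above: each Tesseract child process is started with `OMP_THREAD_LIMIT=1` in its environment while a small pool runs several of them at once. It assumes a `tesseract` binary on the PATH. The helper names (`tesseractEnv`, `ocrPage`, `ocrAll`) and the pool size are hypothetical, and Node runs these as concurrent child processes rather than threads.

```javascript
const { execFile } = require("child_process");

// Build the environment for one Tesseract child process.
// OMP_THREAD_LIMIT=1 restricts each instance to a single OpenMP thread.
function tesseractEnv(baseEnv = process.env) {
  return { ...baseEnv, OMP_THREAD_LIMIT: "1" };
}

// OCR one image; using "stdout" as the output base makes Tesseract print
// the recognised text instead of writing a file.
function ocrPage(imagePath, lang = "eng") {
  return new Promise((resolve, reject) => {
    execFile(
      "tesseract",
      [imagePath, "stdout", "-l", lang],
      { env: tesseractEnv(), maxBuffer: 16 * 1024 * 1024 },
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Run at most `workers` Tesseract processes at a time over all images.
async function ocrAll(imagePaths, workers = 4, run = ocrPage) {
  const results = new Array(imagePaths.length);
  let next = 0;
  async function worker() {
    while (next < imagePaths.length) {
      const i = next++;
      results[i] = await run(imagePaths[i]);
    }
  }
  await Promise.all(Array.from({ length: workers }, worker));
  return results;
}
```

With the limit in place, parallelism comes only from the number of workers, so that number is the one to tune against the available cores.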

@Inductiveload: Thanks. I'd actually stumbled across that issue while looking for a different bug in Tesseract, but I thought they'd changed the default to 1 for that very reason? I've not run into it in the wild, mainly because I rarely run multiple Tesseracts in parallel, but when I do I'm not sitting around waiting for it to finish (fire and forget). It would probably have bitten me hard on the WMCS backend for my OCR script though, as soon as it was used for more than my very occasional testing. Hmm. Which actually reminds me of something… --Xover (talk) 07:38, 18 October 2020 (UTC)

Tesseract sensitivity to spaces

Do you have any idea how to "de-sensitise" Tesseract from finding spaces around punctuation? It seems very keen to produce text like

“ Good men,” say I, “take of my wordes kepe :

I'm playing with Tesseract "-c" options, but nothing seems to do anything of interest. Inductiveloadtalk/contribs 09:36, 19 October 2020 (UTC)

@Inductiveload: Nope, sorry. I think it is inevitable since typographically there is a space there. You'd need to code specifically for this case to be able to correct for it. I fix this in JS in my ocr fixup scripts, and it's one of the fixups I'd considered adding to my OCR gadget. --Xover (talk) 11:29, 19 October 2020 (UTC)
I was thinking more about the space after the open double quote (image here), but even the spaces before the semi/colons seem like thinner spaces (a w:Thin space?) than the inter-word spacing. Oh well, I guess since so many OCRs have this it's more useful to fix up in JS so it's more generally useful. Inductiveloadtalk/contribs 11:50, 19 October 2020 (UTC)
@Inductiveload: Hmm. It's possible this is not quite what it seems. Tesseract treats punctuation lexically as words, so in any context where words are being joined with spaces they will suddenly amass a space that isn't actually there in the input. I do that in my hOCR parser, and it's likely Tesseract does the same in its plain text output. It is an obvious way to implement it so likely to appear in many implementations. But in any case, the distinction gives scope to be smart about this. --Xover (talk) 18:31, 24 October 2020 (UTC)
Well, as long as Tesseract is producing “ and ” correctly, it's easy to post-process, because “[space] and [space]” are wrong. But when it's a straight quote, it's harder to work out what's right and wrong. Inductiveloadtalk/contribs 19:53, 24 October 2020 (UTC)
@Inductiveload: Yes, indeed. But for the curlies: importScript('User:Xover/ocrtoy.js'); and then hit the button in the editor toolbar on a page that otherwise exhibits this problem. It just looks behind to see if the previous "word" was /^[“‘«]$/, and whether the current "word" is /^[”’»;]$/, and in either case it forcibly concatenates the current "word" onto the end of the previous word (instead of adding it to the array that will later be joined by spaces). --Xover (talk) 20:00, 24 October 2020 (UTC)
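The look-behind described above could be sketched roughly as follows. This is an illustration built from the two regular expressions quoted in the comment, not the actual contents of ocrtoy.js, and the function name `joinWords` is hypothetical.

```javascript
// Join OCR "words" with spaces, except around quote marks and punctuation
// that the OCR engine emitted as stand-alone "words".
const OPENERS = /^[“‘«]$/;  // previous "word" is a lone opening mark
const CLOSERS = /^[”’»;]$/; // current "word" is a lone closing mark

function joinWords(words) {
  const out = [];
  for (const word of words) {
    const prev = out.length > 0 ? out[out.length - 1] : null;
    if (prev !== null && (OPENERS.test(prev) || CLOSERS.test(word))) {
      // Glue the current word onto the previous one instead of pushing it,
      // so the final join does not put a space between them.
      out[out.length - 1] = prev + word;
    } else {
      out.push(word);
    }
  }
  return out.join(" ");
}
```

For example, `joinWords(["“", "Good", "men,", "”", "say", "I"])` gives `“Good men,” say I`. Straight quotes are deliberately left alone, since from a lone `"` it is not possible to tell an opening mark from a closing one.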
In terms of general OCR post-processing, I've been collecting some useful heuristics into a JS script (with the help of a big wordlist and grep). There are some scannos that are blatantly bad orthography like tlie and sometimes you can work out where false spaces are. For example, diffi is more likely to be the prefix of the next word than either as a stand-alone word or as a suffix to another word (nothing English ends "diffi").
The JS isn't really ready for prime-time, but perhaps there are useful things in there you can use? It certainly seems to tidy things up quite a bit.
To be honest, I'm starting to wonder if, despite being painfully hip, some kind of machine-learning thing might actually be the way forward. Feed it piles and piles of OCR and piles of the same thing but corrected and see if it can work out what "feels" right. Inductiveloadtalk/contribs 20:09, 24 October 2020 (UTC)
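The two heuristics mentioned above (scannos that are never valid words, and fragments that can only be the start of the next word) might look something like this in JS. The tables hold only the two examples from the comment, and the names `SCANNOS`, `DANGLING_PREFIXES` and `fixScannos` are made up for the sketch. A real list would have to come from the wordlist work described.

```javascript
// Illustrative tables only; real entries would be mined from a wordlist.
const SCANNOS = new Map([["tlie", "the"]]);     // never valid English
const DANGLING_PREFIXES = new Set(["diffi"]);   // can only begin a word

function fixScannos(text) {
  // 1. Replace whole-word scannos with their correction.
  let fixed = text.replace(/\b[A-Za-z]+\b/g, (word) => SCANNOS.get(word) ?? word);
  // 2. Drop the false space after a fragment that must be a prefix of the
  //    following (lower-case) word.
  fixed = fixed.replace(/\b([A-Za-z]+) (?=[a-z])/g, (match, frag) =>
    DANGLING_PREFIXES.has(frag) ? frag : match
  );
  return fixed;
}
```

The lookups are case-sensitive and whole-word, so a capitalised "Tlie" at the start of a sentence would need its own entry or a case-folding step.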

Index:A Collection of the Public General Statutes of the United Kingdom 1805 (45 George III).pdf


Can you make this "make sense", thanks? I've tried to pagelist this three times, and it didn't make sense.

Rebuilding the file "page by page" if needed is strongly suggested. ShakespeareFan00 (talk) 18:49, 22 October 2020 (UTC)

@Xover: I can deal with this if you don't have bandwidth :-) Inductiveloadtalk/contribs 15:35, 23 October 2020 (UTC)
@Inductiveload: -- I am handling this. Hrishikes (talk) 15:43, 23 October 2020 (UTC)
Good luck with the De-Googling. May your numbers be consecutive and your scan complete! Inductiveloadtalk/contribs 15:47, 23 October 2020 (UTC)
Thank you both! --Xover (talk) 15:58, 23 October 2020 (UTC)

Phab ticket you might be interested in: phab:T267617

Hi! Quick heads up for a phab ticket that might interest a gadgety person: phab:T267617 Index page's page links should have the page index-in-file in them (e.g. as attribute). Inductiveloadtalk/contribs 10:30, 10 November 2020 (UTC)

@Inductiveload: Thanks. Incidentally, you can watch components and projects in Phabricator and be notified whenever a new task is registered for it. So for e.g. Wikisource and ProofreadPage, click the tag in an existing task, then navigate to its overview page in the left nav menu (by default you get its work board), and then use the watch button in the top right. --Xover (talk) 14:38, 10 November 2020 (UTC)

New OCR tool

Hello Xover,

I have tried your new gadget at the following pages with very good results:

  • Page:The Queens Court Manuscript with Other Ancient Bohemian Poems, 1852, Cambridge edition.djvu/110: Very good text recognition, only it inserts empty lines between most of the lines of the poem (not all). The same problem appeared also in 117, but these are only exceptions, other pages of the book were OK and the OCR was sometimes even better than in the original OCR layer. I do like the curly quotes and apostrophes, although other people may not be so happy about them (I guess it would be too difficult to let the user choose in some preferences).
  • Page:The Story of Prague (1920).djvu/206 Very good OCR competing with the original OCR layer. I like the empty lines between paragraphs which the original OCR layer did not have. Both of them have problems with acutes above some Czech vowels and they both transcribe "mediæval" as "medieval".
  • Page:The Bohemian Review, vol2, 1918.djvu/217 Your gadget has no problem reading the text in columns and beats the original OCR in line recognition again. The only problem is the header of the newspaper, whose text is quite well recognized in the original OCR, but causes problems for your gadget.
  • Page:The Bohemian Review, vol2, 1918.djvu/237 This page is an extremely hard test for any OCR, as two upper columns belong to one article and two lower columns to another article. The original OCR layer failed to recognize this and so did yours, but in fact I did not expect any success here and I would be really astonished if the result were different.

To sum it up, I do like your gadget as it proved its usefulness in my tests. Although there is some room for improvement, imo it can replace the previous Phe’s gadget and I do thank you for its creation. It would be great if the gadget were not only an external tool that is difficult for people other than you to repair in case of future problems, but if it could be open to the wider community, and ideally, if it could be a part of MediaWiki, so that a potential failure could not be ignored as easily as happened with Phe’s tool. --Jan Kameníček (talk) 20:32, 11 November 2020 (UTC)

@Jan.Kamenicek: Thank you: that was exceedingly thorough!
First, I need to clarify that what we're here talking about is all Phe's code. The new script I asked you to test is a copy of MediaWiki:Gadget-ocr.js, which adds the "OCR" button to the toolbar, sends the request to the https://phetools.toolforge.org/ backend service, and then adds the result to the text box. Much of the discussion in the Phabricator task was regarding various fixes to that backend service. You can see the sum total of the changes I made to the script here (all of it is tweaking how the script deals with the whitespace in the OCR output from the backend service). So all credit here goes to Phe; I've just been doing minor tweaks to try to get it working again.
In addition to this I've been working on my own, completely independent, OCR gadget; which I have mentioned in passing but not really shown to anybody yet (it's too primitive and buggy). That was motivated primarily by making something to tide the community over until WMF Community Tech comes up with a new and (hopefully) better supported OCR tool. Now that Phe's OCR is (hopefully) fixed the need for that is probably not as great, but I may still keep working on it in order to experiment with giving the user some more options. For example whether to output curly or straight quotes, whether to unwrap lines within paragraphs, and possibly other such transformations. I am also looking at letting the user specify a primary and one or more secondary languages for a given page. Right now Phe's OCR assumes all text requested from enWS are in English, and so it will mostly not recognise any runs of text in other languages, except insofar as they are written in characters in common with English. For Chinese, Cyrillic, etc., or languages with extensive use of accents and ligatures (i.e. Polish etc.), this is almost guaranteed to give poor results. By specifying that "This page is mostly in English, but it also contains some words in Polish" it is possible that we can get better OCR results for these pages.
In any case… Based on your testing and feedback above it sounds like the fixes I made to Phe's OCR have been about as successful as we can hope for, and we're at the point where we can update the main Gadget and announce that Phe's OCR is back up. --Xover (talk) 13:19, 12 November 2020 (UTC)
@what we're here talking about is all Phe's code: Ah, I see :-) Nevertheless, it does not make your credit any smaller! Thanks a lot for getting the tool to work, hardly anybody hoped it could still happen :-) --Jan Kameníček (talk) 22:07, 13 November 2020 (UTC)
@Jan.Kamenicek: Yeah, I had mostly given up hope of a fix, so when an opportunity presented itself I jumped at the chance. Hopefully this will tide the community over until Community Tech can build a new tool that is at least less dependent on a single contributor, even if there are limits to how many resources they can give it once it's built. --Xover (talk) 09:30, 14 November 2020 (UTC)

Thou hast sprung my trappe carde

I see you there, fiddling in Marlowe! Do you think this could become a PotM or maybe make Christopher Marlowe a collab when the current one fades? He's kind of a "thing", but all his works here are a hot mess and need scan backing (except Ignoto!), and we could do with some "olde worlde" originals if possible too.

Re. curly quotes they are straight in the OCR and I usually use straight through sheer laziness (Compose+<+' is fiddly) and inertia. I have no philosophical objections to them, and I do think they look better. Inductiveloadtalk/contribs 18:01, 12 November 2020 (UTC)

@Inductiveload: Yeah, the Early Modern classics are woefully patchy here. Marlowe is probably a good collab since his oeuvre is a reasonable size, unlike, say, Shakespeare or Middleton (thank god for poets who get themselves killed young!). On curlies, I automate it with a script cribbed from Sam, and may eventually get around to adding it as a per-work auto-fixup à la that header thingy. --Xover (talk) 18:08, 12 November 2020 (UTC)
@Inductiveload: Oh, I meant to mention… I found a couple of instances of {{dhr|$1}} in there (the title page I think) that looked like a buggy helper script at work. You may want to go looking for that one.
And, while I'm teaching grandpappy to suck eggs, since {{ts}} took the worst pain out of formatting tables, I've completely stopped using {{TOC begin}} and friends. Plain tables give better control, less messy markup, and don't require recalling template-specific syntax for structure (with a table, the structure is explicit and in your head, and you look up any formatting you need; vs. the opposite for the various TOC templates). It was a bit of a pain to start, but it turns out there aren't that many variations in the tables so it quickly overtook the TOC templates in efficiency. I heartily recommend it! --Xover (talk) 08:07, 13 November 2020 (UTC)
Thanks, I fixed it. I got 99 problems and regexes are 98 of them (forgot the parens, so there was no capture group 1 >_<).
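For anyone following along, the failure mode looks roughly like this. The helper script itself isn't shown here, so this is only a sketch assuming JavaScript-style replacement semantics; the pattern and the sample input are invented for illustration.

```javascript
// Invented sample input; the real page content and regex are not shown in this thread.
var line = '{{dhr|2em}}';

// Without parentheses there is no capture group, so "$1" has nothing to refer to
// and ends up in the output as literal text.
var buggy = line.replace(/\{\{dhr\|[^}]*\}\}/, '{{dhr|$1}}');

// With parentheses around the argument, "$1" expands to the captured text.
var fixed = line.replace(/\{\{dhr\|([^}]*)\}\}/, '{{dhr|$1}}');
```

Here `buggy` comes out as `{{dhr|$1}}` and `fixed` as `{{dhr|2em}}`.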
Re TOC, those templates certainly aren't ideal, for various reasons including bad interactions with ProofreadPage and the MW parser (cf. phab: T232477, which you know already). I wonder if TemplateStyles + CSS classes on the TR elements might be worth a try for a slightly more semantic feel? "Direct formatting" with {{ts}} or similar is a fairly blunt weapon IMO, though the blunter weapons can be more reliable, and the TS+class approach might tip towards overwrought? At least it's not quite as fraught as {{TOCstyle}}. Inductiveloadtalk/contribs 20:52, 13 November 2020 (UTC)
@Inductiveload: Hmm. Since what we're doing is essentially direct formatting, reproducing the original work rather than applying our own style to indicate the same semantics, I don't think @class is a good match. By using a template ({{ts}}) we get the same benefits of abstraction as a CSS class, but retain the convenience. I suppose we could create {{trs}} that emits @class, but I think the problem there is more that CSS styling tables is quirky as heck (or, it was when last I tried, but I'm not up to date).
But all this reminds me of an… experiment… I have ongoing: {{sbs}} and {{sss}} (with accompanying {{sbe}} and {{sse}} closing templates). The mnemonic is for "styled block start" and "styled span start", and both of them boil down to spitting out a div or span with the provided arguments as CSS class names, styled by TemplateStyles. They've got a couple of different goals, but the initial impetus was to find a better approach to styling poetry (I hate the poem tag, and detest long rows of br). In addition I grew tired of the scattershot of templates with inconsistent naming conventions and arguments, and spotty documentation, and annoying syntax weirdness when you try to nest templates or put all the text in a template argument or…
So for a typical poem I would do something like: {{sbs|fine centered-block pre-wrap}} … lines of poetry … {{sbe}}. Or for a title page where all the text is centered: {{sbs|centered-text}} … normal formatting, except no need for {{c}} … {{sbe}}. Since they're just div or span they can be arbitrarily and predictably nested, and the block vs. inline semantics are explicit in the template. And since what we're doing with the templates is applying styles, using classes is a pretty natural fit, and lets us reuse general CSS knowledge rather than inventing our own style language again for each template.
There's stuff they can't solve (hanging indent for wrapped poem lines being the standard example), and they're a bad fit for anything needing flexibility (no parameterized TemplateStyles). But a surprising amount of our most used templates mainly just apply a static formatting for which these are straightforward replacements. The lack of knobs and dials may also encourage a healthy shift away from obsessively trying to reproduce details that really aren't important and on which an inordinate amount of volunteer time and frustration (see SF00's periodic bursts of exasperation) is wasted. I'm envisioning the docs to be a list of the available classes, each documenting standard workarounds for common issues, and with side-links to traditional formatting templates where knobs and dials can be tweaked if needed.
I'm trying these out on works I work on myself to get a feel for how well they work and what the "standard" workarounds for various problems will have to be. So far I'm pretty happy with {{sbs}} but find myself using {{sss}} comparatively little, mainly because the syntax gets more verbose than the old way for inline use (maybe it should be a meta-template and have a suite of wrappers applying each effect?). I'm still not completely convinced it will work to mix and match CSS classes this way without running into the same conflicts inconsistent templates do.
In any case, thoughts and input on these and this approach are very much welcome. If you want to try it out then keep in mind I don't really consider them stable so you'll need to be prepared for breakage. --Xover (talk) 09:22, 15 November 2020 (UTC)
For TOC tables (specifically TOCs) with row-based classing, my thinking is that something like this:
|- class=toc_row_1-1-1
| I
| Chapter 1
| 2
is simpler and the intent more "visible" than:
| {{ts|vtp|ar|wnw}} | I
| {{ts|vtp|pl1|wa}} | Chapter 1
| {{ts|vbm|ar|pl1}} | 2
{{TOC begin}} and {{ts}} are roughly contemporaneous, and the reason for the former is that the style-spam in TOCs gets tiresome, repetitive and tricky to adjust later.
In nearly all other cases, my main concern is that centralising all the CSS into global classes, while clearly better from a DRY perspective, is also somewhat fragile, as the CSS classes will be shotgunned throughout thousands of pages and can break, and break silently due to how TemplateStyles works, if someone makes a well-meaning edit to, say, the fine class. This is why I have generally stayed away from "global" CSS (à la Template:Table class) and leaned more towards work-specific CSS like Template:Os Lusiadas (Burton, 1880)/errata.css.
Re poem, I suspect that a new extension or a new tag in the existing extension (say <ppoem>, where p stands for "proper") that does span-per-line and p/div-per-stanza is better than anything we can hack up on the wikicode side, even with module support. Inductiveloadtalk/contribs 12:02, 15 November 2020 (UTC)
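As a rough illustration of the span-per-line, div-per-stanza structure, here is a sketch of the transformation in JavaScript. It only shows the shape of the output, not how an extension would actually implement it; the function names and the class names (`poem-stanza`, `poem-line`) are made up.

```javascript
// Escape the characters that would otherwise be read as HTML markup.
function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical renderer: blank lines separate stanzas, each stanza becomes a div,
// and each line within it becomes a span.
function renderPoem(text) {
  return text
    .trim()
    .split(/\n\s*\n/)
    .map(function (stanza) {
      var lines = stanza.split('\n').map(function (line) {
        return '<span class="poem-line">' + escapeHtml(line.trim()) + '</span>';
      });
      return '<div class="poem-stanza">' + lines.join('\n') + '</div>';
    })
    .join('\n');
}
```

With one element per line, things like hanging indents for wrapped lines become a plain CSS matter. Note that this sketch trims each line, so leading indentation would need to be carried some other way (a class or a marker).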
@Inductiveload: Apples and oranges. {{ts}} is a shortcut for adding @style to table cells, and the equivalent would be {{trs}} (or whatever) to add table row styles. Because table rows are, by virtue of their semantics, more general than table cells, the arguments for having {{trs}} emit @class rather than @style are stronger. Personally I am not convinced @class makes sense at any level more granular than the page (and the most natural fit is at the work level), but at the table row level I am at least prepared to entertain an argument.
On CSS I agree on the general point, but I think that's a longer term issue of better CSS support (PRP support for per-work CSS, maybe something like LESS and a hierarchy of CSS to cover cases in between MediaWiki:Common.css and inline styles, beyond just TemplateStyles). Definitely agree a MW extension to replace the current poem tag is needed for a real solution, but I don't think that's realistic in any reasonable timeframe so I'm focussing on stuff that can (hopefully) be made to work within the current limitations. The CSS stuff in {{sbs}} being one prong, and a Lua module a possible alternative approach.
Of course, I am not at all sure anything short of an extension will work: the parser and remex insert themselves so aggressively that they tend to sabotage any even moderately complex markup and styling. --Xover (talk) 13:38, 15 November 2020 (UTC)

Gadget in progress

Just a quick note for something to play with if you have some cycles to spare one day (no action required, just for interest).

It's a "re-imagining" of the popups gadget. Using a slightly different plug-in-like architecture, I hope it can be a bit more flexible than the enWP-centric popups gadget. To try it:

mw.loader.load("/w/index.php?title=User:Inductiveload/popups_reloaded.css&action=raw&ctype=text/css", 'text/css');

Probably will spew a few errors to console on occasion and the UX is a bit jarring sometimes, but it's already better than the old popups for my nefarious purposes IMO. Inductiveloadtalk/contribs 00:23, 21 November 2020 (UTC)

@Inductiveload: Neat! Upgrading or replacing Popups has been on my wishlist for a long time; with the two main issues being improved styling and better support for previewing PRP-backed pages. I probably won't have time to play with it any time soon, but when things improve I'd love to take it for a spin. --Xover (talk) 12:33, 22 November 2020 (UTC)

The History of the Bohemian persecution (1650)

Hello. May I ask you to convert File:The History of the Bohemian persecution (1650).pdf into djvu? There is absolutely no hurry, I have enough work to do :-) --Jan Kameníček (talk) 19:08, 21 November 2020 (UTC)

@Jan.Kamenicek: File:The History of the Bohemian Persecution (1650).djvu. The OCR quality is… not great. You may want to try the Google OCR gadget to see if it does better on the worst pages. But at least the image resolution is ~2x the PDF version. Let me know if there are any out of order pages or other such issues that need fixing. --Xover (talk) 00:49, 22 November 2020 (UTC)
Thanks very much! I expected the OCR layer would be bad, so it did not surprise me. However, comparing e.g. [2] with [3] I can see that in the PDF version the OCR recognizes long ſ, while in the DJVU version it replaces it with f, which did surprise me. I am mentioning it just as a curiosity, it is not a problem at all, as it needs to be replaced with "s" anyway and maybe the whole OCR needs to be replaced, e.g. using the Google gadget (which seems sliiiiggghtly better). Thanks again. --Jan Kameníček (talk) 09:29, 22 November 2020 (UTC)
@Jan.Kamenicek: Tesseract (the OCR engine I use) does not recognise long s, so these will never have that right. It's trained on more modern texts so pre-18th century texts will be pretty hit and miss. Sorry. --Xover (talk) 12:29, 22 November 2020 (UTC)
I see, I did not know that you exchanged the OCR layer. I noticed that it was better than in the PDF (except the long s), but I thought that it was due to better OCR extraction from djvu than from pdf by Mediawiki. So I thank you for this too. --Jan Kameníček (talk) 13:10, 22 November 2020 (UTC)

ES6 in JS that may end up in a gadget

Heads up: if you use ES6 syntax (let, fat arrow, etc. etc.) in scripts, it will choke if you try to make it a gadget and you'll spend ages unpicking your shiny new hotness and replacing it with old and busted. Inductiveloadtalk/contribs 12:12, 1 December 2020 (UTC)

@Inductiveload: Hmm. You sure that's not just the normal scoping issues? What is it that breaks exactly? --Xover (talk) 12:34, 1 December 2020 (UTC)
@Xover: something in the ResourceLoader stack rejects it. You get errors something like
JavaScript parse error (scripts need to be valid ECMAScript 5): Parse error: Missing ; before statement in file 'MediaWiki:Gadget-sandbox.js' on line 4
You can try it out by enabling the "Sandbox" gadget in your user preferences. Line 4 of MediaWiki:Gadget-sandbox.js is the let x = 1; line. Inductiveloadtalk/contribs 12:53, 1 December 2020 (UTC)
@Inductiveload: Argh! Yeah, as usual the MW situation is a mess. phab:T75714 will give you the gist of it, but the issue seems to be the lack of a JS minifier written in PHP that supports ES6 combined with lack of priority to the task due to IE still providing 3% of global hits on WMF sites. In essence I think that means ES6 is effectively blacklisted until the WMF raise the Grade A browser support criteria to include ES6. --Xover (talk) 13:23, 1 December 2020 (UTC)
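For reference, the kind of rewrite this forces looks roughly like the sketch below. The ES6 lines in the comments are illustrative examples only, apart from `let x = 1;`, which is the line quoted from the sandbox gadget above; the variable names are otherwise invented.

```javascript
// ES6 originals (would trip the "valid ECMAScript 5" check quoted above):
//   let x = 1;
//   const doubled = [1, 2, 3].map(n => n * 2);
//   for (let i = 0; i < 3; i++) { callbacks.push(() => i); }

// ES5 equivalents: "var" instead of "let"/"const", function expressions instead of arrows.
var x = 1;
var doubled = [1, 2, 3].map(function (n) {
  return n * 2;
});

// "var" is function-scoped, so a per-iteration binding needs an explicit wrapper
// function to capture the current value of the loop counter.
var callbacks = [];
for (var i = 0; i < 3; i++) {
  (function (index) {
    callbacks.push(function () {
      return index;
    });
  }(i));
}
```

The loop is the one place where a mechanical `let` to `var` swap changes behaviour: without the wrapper function every callback would see the final value of `i`.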
@Inductiveload: The "shiny new hotness" is now at mw.loader.load("//en.wikisource.org/w/index.php?title=User:Xover/loupe.js&action=raw&ctype=text/javascript"); if you want to play. No testing to speak of, and written in full "scratching my own itch" mode, so expect breakage. It probably needs a toggle to turn it on and off, and the layering is off at the edges (that's prolly PRP's fault though), and the size is hard-coded, and… But, anyway, feel free to play with it (and to steal any bits you want obviously: I tipped over into actually cobbling this together when I saw your code for grabbing the thumbnail URL from the API, which I'd been procrastinating on figuring out, in the index grid thingy), or laugh and point derisively (it ain't pretty is what I'm saying). --Xover (talk) 15:14, 1 December 2020 (UTC)
Awesome! Very stylish!
I'm still unsure of the One True Way (TM) to configure gadgets (e.g. width) - so far the only response I got at MW is "use the options API", which is a good way in terms of UX and also when the data is available (i.e. right from the start), but perhaps rather limited in terms of being able to drive the configuration programmatically.
Furthermore, after digging about in PageNumbers.js I'm also unsure if mw.cookie or the Options API is a better bet for storing things like current visibility state.
BTW, I've made some notes at User:Inductiveload/Script development about "offline" development which can be a bit less frustrating than saving every typo into a page's history! Inductiveloadtalk/contribs 17:36, 1 December 2020 (UTC)