The Scriptorium is Wikisource's community discussion page. Feel free to ask questions or leave comments. You may join any current discussion or start a new one; please see Wikisource:Scriptorium/Help. Project members can often be found in the #wikisource IRC channel webclient. For discussion related to the entire project (not just the English chapter), please discuss at the multilingual Wikisource. There are currently 423 active users here.



Re-purpose WikiProject OCR to WikiProject Scans

I propose to re-purpose the defunct (6 years since the last edit) WikiProject OCR to be "WikiProject Scans" with a focus on scan-backing works and acquiring scans for new works. OCR is rarely a problem for us any more, since the OCR button(s) work well even when the scans don't have an existing text layer. Adding OCR to scans can remain a part of WikiProject Scans, but it will also acquire new competencies:

  • Acquisition and set-up of scans for existing works
  • Repair of existing scans (patching pages, placeholders, etc.)
  • Tracking of backlogs for scan-related maintenance

Based on the link saying "users can request scans" at Wikisource:WikiProject, OCR probably already has these aspects, but it's been forgotten. Inductiveloadtalk/contribs 14:33, 6 July 2020 (UTC)

  •   Support. James500 (talk) 15:30, 6 July 2020 (UTC)

Bot approval requests


For creating the remaining red links of A Chinese and English vocabulary, in the Tie-chiu dialect#Contents. I might use AWB for future tasks, but this task will probably be manual. Suzukaze-c (talk) 05:15, 18 June 2020 (UTC)

Repairs (and moves)

Designated for requests related to the repair of works (and scans of works) presented on Wikisource

Other discussions

PD-anon-1923 again

The discussion of Happy Public Domain Day! has slipped into the archives without reaching a conclusion, so I would like to remind everyone that the last suggestion in the above-mentioned discussion was to create {{PD-US|year of death}} and deprecate {{PD/1923}} and {{PD-anon-1923}}. Is this solution OK?

BTW: if we decide to keep calling the license templates for pre-1925 works {{PD/1923}} and {{PD-anon-1923}}, it would be necessary at least to adapt the latter one so that it could be used for 1924 anonymous works too. --Jan Kameníček (talk) 16:21, 20 February 2020 (UTC)

  Support the change — I don't really care but it makes sense —Beleg Tâl (talk) 16:36, 20 February 2020 (UTC)
  •   Support likewise —Nizolan (talk) 01:54, 21 February 2020 (UTC)
  •   Oppose because the name emphasizes US. The point of the templates is to cover both US status and international status. A template that names the US will cause confusion, especially to newcomers. --EncycloPetey (talk) 02:02, 21 February 2020 (UTC)
    @EncycloPetey: So in your opinion, even fixing a math error requires consensus? Without consensus we should believe 1+1=3 rather than 1+1=2? --Liuxinyu970226 (talk) 01:37, 1 April 2020 (UTC)
    Changes to established templates require consensus. We've had previous discussions and the community is divided on the issue concerning these templates. Proceeding with a change when the community has expressed such division is inappropriate because of the community discussion, not because of my opinion. --EncycloPetey (talk) 02:05, 1 April 2020 (UTC)
  •   Support. We are US-centric in our copyright approach. Given the number of times I've had to look up these type of templates here and on Commons, I might buy the idea that we should copy them, but otherwise, I think this is going to be as non-confusing as we get.--Prosfilaes (talk) 04:35, 21 February 2020 (UTC)
  •   Comment In your proposal, how do we code the year of the author's death for anonymous works? --EncycloPetey (talk) 04:38, 21 February 2020 (UTC)
    I am afraid I do not understand the question: anonymous works do not have any known author. I propose that for anonymous works we would have a template with similar wording as {{PD-anon-1923}}, but it would be called {{PD-anon-US}}. --Jan Kameníček (talk) 09:42, 21 February 2020 (UTC)
    That's also problematic, because the US is just one place that we display license information for. The current template displays that information for both the US and for countries with 95 years pma. --EncycloPetey (talk) 19:46, 21 February 2020 (UTC)

  Comment If there is a consensus to act, my recommendation is that we just move/rename the templates

  • pd/1923|yyyy -> PD-US|yyyy, yyyy=YoD, displays two templates as now
  • PD-1923 -> PD-US, where no $1 parameter it displays the one template
  • PD-anon-1923 -> PD-anon-US|yyyy, year of publication

and update the documentation around the place. Do any required tidying of template internals, and fix double redirects. No need to deprecate anything; just move to the new nomenclature and do not worry about the old usage, or anyone continuing to use it, as it does not matter. — billinghurst sDrewth 11:15, 21 February 2020 (UTC)
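As a rough illustration of the rename listed above, the mapping could be applied to wikitext with a small script. This is only a sketch under stated assumptions: the rename table is copied from the three bullets, the helper name `rename_templates` is hypothetical, and the matching is deliberately naive (it lowercases the whole template name, which is looser than MediaWiki's own rules). Because redirects would keep the old names working, nothing like this would be required.

```python
import re

# Rename table taken from the three bullets in the comment above.
# Keys are lowercased old names (an assumption made to keep matching simple).
RENAMES = {
    "pd/1923": "PD-US",
    "pd-1923": "PD-US",
    "pd-anon-1923": "PD-anon-US",
}

# Matches the opening of a template call and captures its name,
# e.g. "{{PD/1923" in "{{PD/1923|1943}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?=\||\}\})")


def rename_templates(wikitext: str) -> str:
    """Replace old licence template names; parameters are left untouched."""

    def swap(match: re.Match) -> str:
        new_name = RENAMES.get(match.group(1).lower())
        if new_name is None:
            return match.group(0)  # not one of ours: leave as-is
        return "{{" + new_name

    return TEMPLATE_NAME.sub(swap, wikitext)


print(rename_templates("{{PD/1923|1943}} {{PD-1923}} {{PD-anon-1923|1924}}"))
```

The parameters keep their existing meaning (year of death for PD-US, year of publication for PD-anon-US), so only the template name changes.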

  •   Oppose Firstly, because of the US emphasis. Yes, we follow US copyright law, but we also serve an international readership, not to mention contributors who are also bound by the copyright laws of other countries. Secondly, I think replacing "PD-1923" with "PD-US" is confusing. "PD-US" sounds like a generic template for "this work is PD in the US", but under this proposal it would mean "this work is PD in the US for the specific reason that it was published more than 95 years ago". BethNaught (talk) 22:16, 21 February 2020 (UTC)
    I do not understand in what way "the readership" is concerned in this… They see only the text of the template which is going to stay the same. --Jan Kameníček (talk) 23:08, 21 February 2020 (UTC)
      Comment I do not think that the suggested name of the template is more American-centred than the old one. E.g. {{PD/1923|1943}} has got two parts: "1923" is the American part referring to the American copyright laws, and the parameter "1943" is international referring to the countries where PD depends on the year of death. Nothing would change, only the American part would be called "US" instead of the nowadays non-sensical 1923, I really do not see any problem in that. --Jan Kameníček (talk) 23:08, 21 February 2020 (UTC)
    @BethNaught: The thing is that the only consideration we give to copyright compliance with regard to hosting is to the US copyright. Unlike Commons, we don't really care whether it is copyright in the country of origin. It is for this reason that I am reasonably comfortable with just stating PD-US and variants. The additional PD-old-70 and variants are for information only. — billinghurst sDrewth 00:43, 22 February 2020 (UTC)
  •   Comment I think this is an important issue, and I'd like to weigh in. I'm probably as familiar as (almost) any Wikimedian with the considerations around copyright law in various countries. But I do not see a clear statement of what the problem is that we're aiming to solve, or what the pros and cons are. I'm sure if I took an hour or two to dig through various archives, I could probably figure it out, but I'm not likely to have the time for that...nor should we expect every voter to do that. So given all that, I'm inclined to gently oppose, simply because I can't figure out what's going on, and it seems unwise to make a change that is difficult for community members to evaluate. Is it possible to sum up the issues more concisely so that I can give it more proper consideration, without having to do all the research myself? -Pete (talk) 22:44, 21 February 2020 (UTC)
    The problem I see is this: until 2019 it made good sense to have a template called PD-1923, because it referred to the fact that only pre-1923 works were in the public domain. However, the situation has changed: the cutoff is currently 1925-01-01 (or 1924-12-31) and it shifts every year. I find it very confusing to call the template for pre-1925 works PD-1923 (why 1923???). At the same time it does not make sense to change the name of the template every year (PD-1923, …, PD-1925, …); it would be better to find a fitting universal name. --Jan Kameníček (talk) 23:16, 21 February 2020 (UTC)
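The shifting cutoff described above can be written down as simple arithmetic. The sketch below models only the "published more than 95 years ago" rule mentioned in this thread; it ignores every other route into the US public domain (notices, renewals and so on), and the function names are hypothetical.

```python
from datetime import date

US_TERM_YEARS = 95  # the 95-year rule discussed in this thread


def first_protected_year(today: date) -> int:
    """Earliest publication year NOT yet covered by the 95-year rule."""
    return today.year - US_TERM_YEARS


def published_long_enough_ago(publication_year: int, today: date) -> bool:
    """True if a work published in that year falls under the 95-year rule."""
    return publication_year < first_protected_year(today)


today = date(2020, 2, 21)
print(first_protected_year(today))             # 1925
print(published_long_enough_ago(1924, today))  # True
print(published_long_enough_ago(1925, today))  # False
```

Since the result depends on the current year, a template name that hard-codes a single year would go stale every January, which is the naming problem raised here.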
    Ah, that's very helpful @Jan.Kamenicek:, thank you. I had misunderstood, I thought you were proposing a change to the functionality in addition to the name change.
    I agree that changing the name (a) such that it specifies "US" and (b) such that it references the 95 year rule, rather than the (now outdated) 1923 rule would be worthwhile. I agree with others that we should be cautious about US centrism; but the reality is, with a current title that assumes that it relates to US law, without stating it, we already have a high degree of US centrism in the title. In my view, it's better to state "US" as part of the name, to make it clear to editors (who are the primary audience for a template name) that it's about US law. So, my suggestion would be {{PD-US-95}} or similar. That conveys that it's about US law, and it's about the 95 year rule. Text on the template page/docs could clarify that the 1923 rule is now outdated, and subsumed under the 95 year rule.
    A related issue that I find confusing: I don't understand why we need two separate templates for {{PD-1923}} and {{PD/1923}}. I think this proposal only relates to the latter; would we be leaving PD-1923 intact? A decision on this is probably a matter for a separate discussion, but I'd like to know for sure what the intent of this proposal is. -Pete (talk) 23:45, 21 February 2020 (UTC)
    PD-1923 involves no decision-making: it applies just a single template and does not add the PD-old-nn variants. It has been used where we have been unable to determine a date of death, or for corporate publications, which do not have PMA decisions. I addressed above that these would morph into PD-US, though we would need to handle them as parameterless. — billinghurst sDrewth 00:51, 22 February 2020 (UTC)
    Jan, that's not quite correct. Works published before 1923 are still in PD in the US for the same reason they were before. The 1923 date was a cutoff date beyond which we have never had to check. What has changed is that works that were under copyright later than that (from 1923 and 1924), and had their copyright renewed at one point, have now had that copyright protection expire. The works published before 1923 were not eligible for renewal and entered PD for a different reason than the works published in 1923 and 1924. It is one view to see the date as a shifting cutoff, but the cause of works from 1923 and 1924 entering public domain is actually different from those that were published prior to 1923. --EncycloPetey (talk) 03:13, 22 February 2020 (UTC)
    All works published more than 95 years ago are out of copyright because of the time since publication, no matter whether that's due to copyright notices, or renewals, or being in copyright for a full long term. For a work published before 1923, we've never been concerned about copyright notices or renewals, nor how long work published with copyright notice and renewal got in copyright. Why does it matter that a work published in 1924 may have got 95 years of copyright, whereas a work published in 1922 may have only got 75, when we don't really care about that 95 or 75 in the first place? We have no tag for "published abroad before non-US works got copyright in the US in 1891", because we don't care; it has always been sufficient for our purposes to say that it was published before 1923, and I don't see why it is not now sufficient to say that it was published more than 95 years ago.--Prosfilaes (talk) 04:59, 22 February 2020 (UTC)
    @Prosfilaes: I am presuming that this is in reference to the primary notice about copyright within the US, not the secondary notice for PD-old-nn which relates to copyright elsewhere in the world. The secondary notice can still apply for those of us not in the US, which is why we added it. — billinghurst sDrewth 05:08, 22 February 2020 (UTC)
    Yes, the primary notice. There's no need to worry about now-historical features of non-US countries, but certainly helpful to list the years since death.--Prosfilaes (talk) 05:18, 22 February 2020 (UTC)
    Yes and no. There are authors who have works published prior to 1925 who died late enough to still have works in copyright in their home country, so those notices are still very pertinent per Category:Media not suitable for Commons. — billinghurst sDrewth 05:30, 22 February 2020 (UTC)
    Right; I didn't mean to imply we should change the current secondary notices.--Prosfilaes (talk) 06:42, 22 February 2020 (UTC)
  •   Support U.S. copyright is of primary concern to Wikisource. Fixing the license so more 1923 and 1924 works appear on Wikisource even if still under copyright in other countries is so important. Abzeronow (talk) 19:46, 16 March 2020 (UTC)
  •   Support as this seems like the least problematic solution to the problem, and it doesn't make sense for us to keep delaying a resolution. Kaldari (talk) 18:09, 14 April 2020 (UTC)
  •   Comment It looks as though some people are hedging their bets: arguing for deprecating the template on the one hand but arguing for improving the template on the other. Since the template content has now changed, before this discussion has concluded, then procedurally we should recast all votes, since the template named in this discussion thread no longer has the content it had at the start of this discussion. --EncycloPetey (talk) 20:42, 24 April 2020 (UTC)
    Hedging their bets? It is somehow improper to try and improve Wikisource for now, whether or not this template gets deleted? If we're going to get pedantic about policy, where is it written on the English Wikisource that we should recast all votes?--Prosfilaes (talk) 06:41, 25 April 2020 (UTC)
    No need to restart the votes, as the changes have been reverted. The template is the same as it was before the voting started. No changes should be made to any template if there is a discussion and voting ongoing about its future. If the changes were allowed and at the same time we would have to restart the voting after every change, we may never come to a conclusion; not everybody has time to vote about the same problem again and again. --Jan Kameníček (talk) 09:50, 25 April 2020 (UTC)
  •   Support If there must need a consensus to fix math wrongs, let it be. --Liuxinyu970226 (talk) 09:01, 7 May 2020 (UTC)

Time to talk nomenclature of author classification by occupation

We have been dodging the fix for a while, and it is probably time to start what I think could be a long conversation about what we could do, probably followed by another about how we will do it. (People may prefer that this be put on a separate RFC subpage rather than here at WS:S.)

We have long had occupation categories for author pages (subcats of category:authors by occupation and category:authors by nationality). Below that we have the plainest of names, which do not make clear that the pages in them should be from our Author: namespace.

At the same time we have not had a good set of categories for biographies of people (as the subject), though I have been working on the creation of these subcats (through category:biographies of people by nationality and category:biographies of people by occupation) though these creations will be well behind the corresponding author set.


To me, it is time that we start to have clarity to our author category nomenclature, and I don't know exactly what people would want, though we could be as basic as converting Category:Physicians

  • Category:Physicians (authors); or
  • Category:Author:Physicians; or
  • something more natural text Category:Physicians as authors

All have strengths, and with the HOTCAT system if we utilise {{category redirect}} we can even utilise a couple of schemes and HotCat will put to the designated target. Personally if using HotCat getting it to differentiate quickly is always my ideal.

Ultimately we would have to decide whether the existing Category:Physicians is permanently going to point to its author derivative for all time, or for a temporary time whilst we migrate and settle down the setup, and at a point possibly become a category "disambiguation" page that points to the alternatives for physicians.

And the final question: are we creating separate (and maybe matching) category hierarchies, one for author namespace pages and another for main ns pages, splitting portal ns pages between one or the other? Note that in the general subject categories, when we get some layers down, we start to pick up both forms, so the names do become necessary. For example, Category:Medicine will have below it (somewhere) both biographies and author pages of physicians.

Further, for something like sailors we could have Category:…
  • Sailors as authors or Sailors (authors) or Authors:Sailors
  • Biographies of sailors
  • Stories of sailors new category created just now to take pages
and reasonably all three have been added to category:Sailors. The more that I look at it, I think those generic names should be disambiguation categories, especially as HotCat c:Help:Gadget-HotCat has means to manage disambiguation targets.
Do I take this as either a) people don't care, and I can go ahead and do as I please? or b) this is way too big a conversation, and you numb my brain? or c) what the hell are you talking about? — billinghurst sDrewth 04:43, 17 May 2020 (UTC)
A bit of both column B and C. I think your problem statement and the rough thrust of your proposed course of action make sense. I see nothing in the above I actively disagree with.
I think category names should always be "fully qualified" rather than rely on its parent categories to provide part of its definition. I also think long category names are a good thing, and natural language construction of them the best approach. We can have Category:Physicians as a top(ish) level cat, but it would just be a container for Category:Authors who are physicians and Category:Biographies of physicians and Category:Novels about physicians, and so forth (possibly with intermediate "…by topic" container cats).
But I would also very strongly urge that we start by making a guide/principles/documentation/policy/whatever page that describes the entire scheme (principles, examples, guidance on tricky cases, what pages should have what kind of categories, and what pages should have no categories, etc.), before asking the community to actually decide. This stuff is going to be de facto policy (even if we don't call it that) and will be "enforced" (I use the term loosely) through various mechanisms, so fleshing it out more is necessary before properly deciding. And if we have good guidance and a sensible category naming scheme and hierarchy, it's going be much easier for the community to help clean up / maintain it without stepping on each others' toes. --Xover (talk) 07:16, 17 May 2020 (UTC)
We already have Help:Categorization so let us work from there. I am going to push through changes based on your PHYSICIANS example above, as HotCat will give useful restrictions, and I want to test those in action as I go. — billinghurst sDrewth 14:27, 16 June 2020 (UTC)
Not "Author as ...": too much typing to get to the keyword that is the point of differentiation. So my recommendation is "Occupation as authors". — billinghurst sDrewth 04:00, 17 June 2020 (UTC)
  Question Is it possible the cats can be inhaled from Wikidata: e.g. author's item has "nationality: French" and "occupation: painter, poet", then a template and/or module looks at that and auto-cats into Category:French authors, Category:Painters and Category:French poets based on a set of rules: what categories we want, and how they are laid out at enWS (e.g. we have a cat for French poets but not Lithuanian poets, so Lithuanians go into Category:Poets). It seems a bit of a shame to expend so much effort on categories here when all the data is (or should be) at Wikidata.
Of course any categories not captured by Wikidata properties can still be manually added. Inductiveloadtalk/contribs 14:57, 16 June 2020 (UTC)
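To make the rule in the question above concrete, here is a minimal sketch of the fallback logic: use the nationality-specific category when it exists locally, otherwise fall back to the plain occupation category. Everything here is an assumption for illustration: the `EXISTING` set stands in for whatever categories enWS actually has, the function name is hypothetical, plurals are formed naively by adding "s", and a real implementation would more likely be an on-wiki Lua module reading Wikidata than a Python script.

```python
# Hypothetical stand-in for the categories that exist locally.
EXISTING = {
    "Category:French authors",
    "Category:French poets",
    "Category:Painters",
    "Category:Poets",
}


def categories_for(nationality, occupations, existing=EXISTING):
    """Pick categories for an author from nationality and occupation data."""
    cats = []
    national = f"Category:{nationality} authors"
    if national in existing:
        cats.append(national)
    for occupation in occupations:
        specific = f"Category:{nationality} {occupation}s"  # e.g. French poets
        generic = f"Category:{occupation.capitalize()}s"    # e.g. Poets
        if specific in existing:
            cats.append(specific)
        elif generic in existing:
            cats.append(generic)
    return cats


print(categories_for("French", ["painter", "poet"]))
# ['Category:French authors', 'Category:Painters', 'Category:French poets']
print(categories_for("Lithuanian", ["poet"]))
# ['Category:Poets']
```

Keeping the list of existing categories as data, separate from the rule, would let the category layout change without touching the logic.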
@Inductiveload: Ultimately I believe so, though we still need to build the categories and the structure. Though that is only going to work for the authors, AND we have to do more for our works. WHAT I would hope that we achieve at a point in time is through a WS biographical article, to the WD biographical item at and through main subject, and then access the person data and categorise the biographical work. (AGAIN I need an available structure, and to invert the current name structure). — billinghurst sDrewth 15:34, 16 June 2020 (UTC)

Existing maintenance

I have been very slowly cleaning main namespace articles from our author categories, and am down to fewer than 200, or thereabouts

which will be a reasonable sweep, though not perfect.

Some of the remaining maintenance is people categories (our somewhat contentious people categories) eg.


billinghurst sDrewth 17:49, 6 May 2020 (UTC)

Replacing image of musical score with Lilypond in non-scan-backed works?

A recent discussion elsewhere has brought to light an issue that could benefit from wider community input.

How should we deal with replacing a score image with Lilypond in non-scan-backed works?

Our practice is generally to use Lilypond (gory details) to present musical scores to our readers—including replacing existing images of scores with Lilypond—even if they are not pixel-for-pixel identical. This is generally good and desirable for several reasons, including the ability to automatically generate an audio version of the score.

However, we have a large legacy of works that are not scan-backed, and that include the score as an image in the page. If we replace that image with Lilypond then we also break any possibility of validating the score, which is an even more fundamental practice on the project.

I have not found any specific guidance on this issue anywhere, so I am hoping the community can chime in with some views on how to handle this issue going forward. It's not the biggest problem we have, but it has led to disagreements, so a common course on it would be useful.

Some possibilities that point themselves out:

  • Ignore it. Just replace the image with Lilypond and ignore the lack of validation. The work in question is not scan-backed in any case, so this matters little.
  • Don't do it. Independent validation is fundamental to our processes, and when something precludes that it should not be done. We can live with static images of music scores on these works.
  • Link image in textinfo. We have other information about works in the {{textinfo}} template on the work's talk page, so the image can live there and be available for validation.
  • Include image thumbnail. The image of the score can be included directly on the work's page as a thumbnail so it is available for validation. This is how many non-scan-backed works include other illustrations so it's good enough for this case too.
  • Tag the score with a notice. We can put a text tag on the Lilypond score that makes clear that it is not yet validated, and links to or explains how to find the original image for verification.
  • Something else. A much better idea that the community will come up with in this discussion. 😎

I don't think this is the sort of issue that has a clear-cut right and wrong answer, so it is best dealt with by discussion and seeing if we can extract some rough consensus (as opposed to holding a vote or something like that). In other words, I would very much appreciate any and all thoughts and opinions on this; including "I really don't care about this!" because that's also useful guidance from the community. --Xover (talk) 08:48, 11 May 2020 (UTC)

  • Note: Pinging possibly interested contributors: Beeswaxcandle, EncycloPetey, Beleg Tâl. If there are others who may have a particular interest or relevant perspective on this, please ping them to this discussion. --Xover (talk) 08:52, 11 May 2020 (UTC)
  • I have seen a number of pages like this, especially from Portal:Sheet music. I believe that the preferable solution would be to create an index which includes the images, transclude the pages (if any Lilypond has been written), and leave it like any other work. (It could look like this if there is no Lilypond.) By the way, is there any reliable way for cross-page creation of Lilypond files? I think that it would help with this? TE(æ)A,ea. (talk) 11:35, 11 May 2020 (UTC).
  • I think the best solution which would keep the advantages of using Lilypond instead of a static image would be to link to the source image using {{textinfo}}. This keeps everything simple enough that people who wish to contribute new works in this fashion can do it without difficulty while allowing easy double-checking (without having to comb through the page history...). Similarly, if there are non-trivial differences between the Lilypond and the image then it would pose no problem to tag the page and notify the relevant persons of perceived problems (assuming the person who notices it does not know how to fix it). 14:01, 12 May 2020 (UTC)
  • My personal opinion is that labour-intensive improvements like Lilypond markup are not worth the effort for non-scan-backed works. Find a scan and back it first, and then you can craft the Lilypond score to match. Otherwise, I would just ignore it. —Beleg Tâl (talk) 03:12, 11 June 2020 (UTC)

Tech News: 2020-24

21:11, 8 June 2020 (UTC)

Validating texts against alternate sources

Quick question about what to do in a situation that arises occasionally. Sometimes a scan has illegible text, either because of bad printing, damaged pages or bad scanning (e.g. folded pages, obscuration, blurring or out of focus areas; generally, but not always, scans done by a particular company). Without access to alternative scans, that work cannot be completed. Sometimes, we'll even patch two or more incomplete scans of the same work together to make a complete version.

However, it's also common that works are published in many forms (republished, syndicated, etc., especially periodicals), and you could complete the text from a different source. For example the content of Page:Melancholy consequences of two sea storms.pdf/7 has a combination of scan defects (obscured left margin) and printing defects. The same content is available in various other places, for example The European Magazine, May 1796, which must have been within 8 years of the original (the event described was in 1786), another book and The Mariner's Chronicle, all of which appear consistent with the damaged scan.

How do we feel about proofreading the damaged scan from an alternate source that isn't simply a different scan of the same work, assuming that the alternative source is entirely consistent with the damaged one (e.g. if there's a 2em gap, the alternate has "the" in that place and it makes grammatical sense)? Naturally, the source that was used to fill in the gaps should be indicated, e.g. on the index talk page. Inductiveloadtalk/contribs 17:36, 9 June 2020 (UTC)

@Inductiveload: I would say that's a judgement call: if the missing or illegible bit is short and you're confident the alternate source is accurate, I'd say go for it. The longer/larger the missing part is and the less certainty can be had that the alternate source is accurate, the harder the substitution is to justify. So long as we're talking single words, even multiple instances of single words on a page (a cut off margin, say), and there's no particular reason to suspect changes between editions, I would pretty much always call it ok. Don't overthink it, is probably apt advice here. --Xover (talk) 17:57, 9 June 2020 (UTC)
Great, thanks. That's what I thought, I just don't recall ever seeing the question asked. Obviously if you can't have confidence the text is the same then it's a no-go. Generally, except for the very worst scans, it's just a word here and there or maybe a bit of margin or a stray finger. Inductiveloadtalk/contribs 18:09, 9 June 2020 (UTC)
It also turns out there is a "secret" (as in, the documentation wasn't transcluded and there's no reference to it anywhere other than on one user's pages) template created recently: {{reconstruct}}, which is exactly what I'm talking about. That seems a sensible solution to mark out text that's "knowable" but not actually in the scan. (Turns out it was mentioned here: Wikisource:Scriptorium/Archives/2020-04#Reconstruction_on_context,_and_clipped_scans... but clearly I missed it). Inductiveloadtalk/contribs 11:23, 10 June 2020 (UTC)
I was just going to say we should create a template for such purposes but good to see one has already been created. But I think we need to make it more obvious that the text is reconstructed and from what source. I think it is very important we make it known that the text is reconstructed and not part of the source and shouldn't just be added without the reader knowing. Maybe with a tooltip? a footnote? or maybe with an explanation after the reconstructed text? Maybe highlight the reconstructed text with a background color? What do you think? I'm thinking highlight the background text with a light color and add a footnote with an explanation that the text is reconstructed and from which source. @ShakespeareFan00: created the template. What do you think? Jpez (talk) 17:15, 11 June 2020 (UTC)
I think this is true only for reconstructing a completely lost text whose wording nobody can be certain of. But if we have a bad scan with some missing words which can be added from another source and if there is no reasonable doubt that the text of both sources is identical, we can add the text without disturbing the reader with some unnecessary templates reducing the enjoyment of reading. E. g. if all the legible text of our scanned work is identical with the text of the alternative source, we can often assume that also the missing words along the edge of a badly scanned page would be identical as well. However, our attitude should depend also on the ratio of the missing text and other factors, so we should consider it individually in different cases. --Jan Kameníček (talk) 17:46, 11 June 2020 (UTC)
Yeah I agree that if there's a word or a few letters missing or illegible we shouldn't go too crazy, especially if it's obvious what it is that is missing or can be found elsewhere. I do this myself many times, adding what it is that is obviously missing, and no one will ever be the wiser. But I guess there is a limit, and we are Wikisource after all, and it's important that we provide source backed texts. So I think that it's important that we should mention where inserted text has come from in many cases especially when it comes to larger blocks of text. Jpez (talk) 19:11, 11 June 2020 (UTC)
I think leaving a note on the Index talk (and maybe Page, or in a <!--comment-->) should be sufficient. I made a quick {{reconstructed from}} template which can put the sources in a nice list and add a category to the index. Feel free to hack with it! Inductiveloadtalk/contribs 20:54, 11 June 2020 (UTC)

Main page on mobile

Hi, all,

From the Tech News above, there is an upcoming change that will affect the display of MediaWiki (or maybe just Wikimedia) main pages on mobile web.

Removing the hack from back-end and instead using TemplateStyles for special main-page formatting will give the communities more control of how it displays on mobile, so I think this will be good in the long term.

But in the immediate term, it would mean the Wikisource main page will display in two columns rather than the current single column. It looks okay to me on a tablet, but on a phone it will likely not be pretty.

JDLRobson has a quick CSS tweak at the Phab ticket (Option 1) which looks like it should only affect Minerva and not have other side-effects.

Is that something we can put into production for a quick live test (will need an admin as Main Page is protected), or do we want to create a separate page for testing and discussion?

Pelagic (talk) 22:17, 9 June 2020 (UTC)

@Pelagic: Working on it at Main Page/sandbox new. --Xover (talk) 23:09, 9 June 2020 (UTC)
Thanks, Xover. That looks awesome already! (And I learned something about CSS grid layouts.) Struck part of my original, since the existing layout doesn't work that way. Pelagic (talk) 06:54, 10 June 2020 (UTC)

Stale policy: shortcuts

Quick prod at some stale-looking policy: Wikisource:Shortcut says "Reserved for Wikisource project reference pages (WS: namespace) only.", a statement which hails from back in 2006.

This is clearly not true in practice, as WS:CUTS lists official shortcuts in several namespaces. I suggest replacing it with something that allows shortcuts in (off the top of my head) Main, Portal, Wikisource, Translation, Help and Category (I can't think why we'd need shortcuts to File, Page, Author, Index, User, Gadget; and Template, Module and MediaWiki are doubtful to me). And then add something that prevents over-keen shortcut creation for every single work, i.e. only large collective works like EB1911 and the other current denizens of Category:Mainspace pages with shortcuts are eligible. Basically any work that is eligible for {{engine}} is eligible for a shortcut, but not mandatory. Perhaps:

Shortcuts are allowed to the following namespaces: Main, Portal, Wikisource, Translation, Help and Category.
Shortcuts to the main namespace are allowed only for large collective works (e.g. encyclopedias, newspapers, etc.) that would benefit from the ability to access them quickly. Not every work needs a shortcut.

Obviously standard sanity overrides apply, as always. But if we're going to bother to have policy written down, we should make sure it reflects reality. Inductiveloadtalk/contribs 08:49, 10 June 2020 (UTC)

@Inductiveload: Very much agree on your last point, and the rest of your thoughts sound sensible. If you write that up I'd support the change. --Xover (talk) 08:58, 10 June 2020 (UTC)
The indented text above is a suggestion for the new text. I have purposely kept it short and simple and slightly open ended (e.g. define "large", "collective" and "benefit") in order to not pre-emptively shut down reasonable shortcut targets, while also not advocating a policy-driven shortcut creation spree (e.g. as would be implied if you said "large collective works should have shortcuts"). Perhaps it could say "generally allowed only" to imply a bit of latitude for sensible exceptions.
I can add the musing about {{engine}} but that's more a rule of thumb and there have been cases of over-keen addition of that template in the past. Plus, AFAIK, {{engine}} usage isn't regulated by policy, so using it to inform policy is ill-defined.
Any wording can always be refined if it turns out to be deficient, it's not life-and-death. Inductiveloadtalk/contribs 11:17, 10 June 2020 (UTC)

Translation header template -- seems to have just started rendering in duplicate

Translation header template -- seems to have just started rendering in duplicate.

For example


Nissimnanach (talk) 15:14, 10 June 2020 (UTC)

It's not only translations, all headers are doing it today. Not sure why, but it's something to do with Javascript. Inductiveloadtalk/contribs 15:36, 10 June 2020 (UTC)
@Nissimnanach, @Inductiveload: There was a new MediaWiki deployed today (1.35/wmf.36) and it looks like some change to its HTML output broke local javascript here. I'm guessing it's the darned dynamic layouts / pagenumbers scripts causing the havoc: they move the HTML that the header template emits around in the DOM in… careless ways. Note for example that there are now two #mw-content-text nodes in these pages(!). And the code, besides being spread over multiple files, is loaded unconditionally from (and partially is implemented in) Common.js and runs for all users, so it's a right royal pain to debug. --Xover (talk) 17:18, 10 June 2020 (UTC)
@Xover: yep, I was trying to poke at that as it's the only place I could see mw-content-text and headerContainer being used locally. But as you say, Common.js is pretty impervious to debugging. Maybe as everything is broken right now anyway, now is a good time to rip it out into a turn-offable module (e.g. PageNumbers.js), since it can't get much more borked?
Having two #mw-content-text elements can't be intentional; isn't the whole point of IDs that they should be unique?
Also the heederContainer bit looks pretty suspect. Inductiveloadtalk/contribs 17:25, 10 June 2020 (UTC)
@Inductiveload: Don't get me started. (IDs should be unique; heeder is not a typo(!))
@Billinghurst: or anyone else with global-interface-admin: could you try disabling (delete, comment out, whatever) lines 135–142 in MediaWiki:Common.js and see if that fixes this problem? It'll break some other stuff, but it'll let us narrow down where the problem lies. --Xover (talk) 17:41, 10 June 2020 (UTC)
Meh. I'm having trouble catching the culprit in the debugger. MediaWiki:PageNumbers.js is what wraps #mw-content-text in three div elements (#pageContainer, #regionContainer, and #columnContainer). MediaWiki:Common.js lines 135–142 then picks out the header in the DOM and moves it outside those wrappers. And MediaWiki:DisplayFooter.js automatically generates a footer based on the header. I've been trying to set breakpoints in the debugger to catch the bit that's causing problems, and I can catch the double header (which is just a symptom) but I've so far failed to catch whatever it is that causes duplicate #mw-content-text. And despite breakpoints on the relevant line in PageNumbers.js I am failing to catch the DOM before those three wrapper div elements are added (which should be impossible).
I'll have to go offline for a bit now, but I'll get back to this when I can (breadcrumbs above if anybody else wants to try tackling this in the mean time). --Xover (talk) 18:23, 10 June 2020 (UTC)
BTW, line 141 in Common.js is what is adding the duplicate header ($( 'div#headerContainer' ).prependTo( $( 'div#mw-content-text' ) );). But it is getting thrown off by something else that I've not been able to track down. Possibly related to the html5 changes in the latest Tech News, and possibly also involving the previous issue about over-specified selectors (note the element name in the jQuery selectors there). --Xover (talk) 18:40, 10 June 2020 (UTC)
Note: in jQuery, $( 'div#mw-content-text' ) will return a list of two items if there are two elements with that ID. According to the jQuery prependTo docs, "If there is more than one target element, however, cloned copies of the inserted element will be created for each target except the last." So this is the culprit. Inductiveloadtalk/contribs 19:03, 10 June 2020 (UTC)
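The cloning rule quoted above can be illustrated with a toy model. This is plain JavaScript with ordinary objects standing in for DOM nodes, not jQuery itself; prependToAll, header and contentNodes are made-up names for illustration only.

```javascript
// Toy model (no jQuery, no DOM) of the documented rule: with several targets,
// every target except the last receives a clone, and the original element
// is moved into the last target.
function prependToAll(element, targets) {
	targets.forEach(function (target, index) {
		var isLast = index === targets.length - 1;
		// Clone for every target but the last; the original goes into the last.
		var node = isLast ? element : Object.assign({}, element, { clone: true });
		target.children.unshift(node);
	});
	return targets;
}

// Two containers standing in for the two elements that share one ID.
var header = { id: 'headerContainer', clone: false };
var contentNodes = [ { children: [] }, { children: [] } ];
prependToAll(header, contentNodes);

// Count how many header nodes exist across all containers afterwards.
var headerCount = contentNodes.reduce(function (count, target) {
	return count + target.children.length;
}, 0);
console.log(headerCount); // 2: one clone plus the original
```

With a single target the original is simply moved and nothing is cloned, which would be consistent with the header only showing up twice once a second #mw-content-text node appeared.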

Also, while 1911 Encyclopædia Britannica/Nuisance has two headers, the pagination at the side works correctly (via the translucent inclusion); however, the next page, 1911 Encyclopædia Britannica/Nukha, uses ‹div class=indented-page>{{page break|485|left}} and the page number is being mixed into the text. -- PBS (talk) 17:35, 10 June 2020 (UTC)

@Nissimnanach, @Inductiveload, @PBS, @ネイ, @Jan.Kamenicek: This looks like a Mediawiki bug (probably in Vector), cf. phab:T255073. --Xover (talk) 20:05, 10 June 2020 (UTC)

Ok, Jdlrobson just uploaded a patch for this, so it's possible we'll get it fixed in an out-of-cycle deploy relatively soon (the alternative is the next scheduled deploy on next Wednesday). If that doesn't pan out I think I have a workaround we can implement locally (we just need to flag down an interface-administrator). Details on request, but I'm hoping we can avoid it. --Xover (talk) 21:03, 10 June 2020 (UTC)

  Comment The fix has been rolled out, though there is now an ugly artefact of visible <div class=indented-page> see 1911 Encyclopædia Britannica/Nukha. In the main namespace on transcluded pages PrP ignores that class, and in non-transcluded pages it should function. I cannot put time into working out the issue. — billinghurst sDrewth 00:02, 11 June 2020 (UTC)

This is now resolved. The indented-page issue appears to have been a follow-on problem that disappeared when page caches were purged. --Xover (talk) 06:05, 13 June 2020 (UTC)
I have just made this edit to 1911 Encyclopædia Britannica/Pittsburg (Pennsylvania) edit comment: "‹div class=indented-page>{{page break|678|left}} – {{page break|682|left}}" and the page is now displaying the problem that user:billinghurst describes. -- PBS (talk) 20:45, 15 June 2020 (UTC)

Interwiki is coming back

As work-edition model based interwikis slowly become real in Wikisource, maybe somebody would be interested in the discussion that I had with Tpt in this phabricator ticket recently. This presents my POV, which may differ from others' POV in this matter. However, I think we still do not know how this model (and its implementations) should behave with complex cases, so probably we should wait for an initial implementation and then start a wider inter-community discussion somewhere. Ankry (talk) 19:11, 11 June 2020 (UTC)

@Ankry: I thought this was going to be solved by the Community Tech team any moment because of meta:Community Wishlist Survey 2020/Wikisource/Inter-language link support via Wikidata, i.e. a wish that was chosen by the community to be fulfilled at the end of 2019. --Jan Kameníček (talk) 10:05, 12 June 2020 (UTC)

Transcluding a collection of interesting tidbits from the PSM project

I bookmarked about 160 paragraphs from The Popular Science Monthly which I found interesting and whimsical on their own, and especially when compared with the knowledge of today. Started to assemble a small sample HERE to give the community an idea of what I am talking about. I would like to transclude them to the main namespace in the next couple of months. Are there any objections or suggestions? — Ineuw (talk) 21:25, 11 June 2020 (UTC)

@Ineuw: I am uncertain what it is that you're proposing to do. Do you mean you intend to create a mainspace page with an arbitrary selection of interesting excerpts? If so, I would say that'd be against our policy on excerpts and, I suppose, annotations. Interesting as those tidbits are, they would be your selection and not reflect any published collection. --Xover (talk) 06:13, 13 June 2020 (UTC)
No problem.— Ineuw (talk) 17:09, 14 June 2020 (UTC)

Easy LST overhaul with handy new features

Hi, I have overhauled the Easy LST gadget to allow (what I think) is a pretty handy thing: automatic replacements of text on save.

For example, you can configure it to replace <<cxlsc|XXXX>> with {{center|{{x-larger|{{small cap|XXXX}}}}}}. It will be replaced when you preview the page, so there is no further action you need to take.
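A minimal sketch of how a replacement like this could work, assuming a simple <<name|text>> shorthand. The names shorthandRules and expandShorthand are hypothetical and are not taken from the gadget, whose actual configuration format is described in its own documentation.

```javascript
// Hypothetical rule table: each shorthand name maps to a wikitext wrapper,
// with $1 standing for the text inside the shorthand.
var shorthandRules = {
	cxlsc: '{{center|{{x-larger|{{small cap|$1}}}}}}'
};

// Replace every <<name|text>> whose name has a rule; leave unknown names alone.
function expandShorthand(wikiText, rules) {
	return wikiText.replace(/<<([a-z]+)\|([^<>]*)>>/g, function (match, name, body) {
		if (!Object.prototype.hasOwnProperty.call(rules, name)) {
			return match;
		}
		// A replacer function keeps any "$" characters in the body literal.
		return rules[name].replace('$1', function () {
			return body;
		});
	});
}

console.log(expandShorthand('<<cxlsc|XXXX>>', shorthandRules));
// {{center|{{x-larger|{{small cap|XXXX}}}}}}
```

Keeping the rules in a table rather than hard-coding them means each user could supply their own shorthands without touching the replacement logic.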

You can also configure it to perform generic actions on load and save, which could be used for auto-generating a running header or page numbers without fragile templates. This could be done with other scripts, but this gadget provides a perfect hook-in location for these functions.

I combined this stuff with the Easy LST gadget we have because the LST thing requires overwriting the click handler for the Save buttons, and having a separate gadget would allow only one to work, unless there was some really delicate interdependency. Since the functionality is exactly the same (text transform on load and save), it seems easy to combine. Other changes:

  • It is no longer naughtily placing its functions in the global namespace, it's all inside an IIFE
  • General tidy-up and linting
  • Can disable it with JS, so if you do want to disable Easy LST dynamically based on namespace or page name or something, you can do that

The script is at User:Inductiveload/easy_lst.js, there is some simple documentation at User:Inductiveload/easy_lst. As usual, you can try it out with

importScript('User:Inductiveload/easy lst.js');

If people like it, it can be gadgetised by putting it in the right place, which will need an Interface Admin. With no configuration, there should be no difference visible to the user. I find it rather useful, hope you do too! Inductiveloadtalk/contribs 20:42, 12 June 2020 (UTC)

@Inductiveload: Awesome! And I'm ecstatically happy to see someone take an interest in our technical aspects: we have an insane amount of site-local custom CSS and JS that was not particularly well designed (in aggregate; I don't necessarily mean individual pieces of it), is prone to break due to changes in browsers or MediaWiki or skins, and often isn't actually needed anymore. We also have lots of opportunities for improvement and utility that we aren't taking advantage of.
That being said, I do have some comments on this particular script…
First of all, ResourceLoader modules do not execute in global scope; they get a "private" closure from RL's implementation method, so the IIFE is not needed. MediaWiki:Common.js, Special:MyPage/common.js, and similar do execute in global scope, but that's by design (for compatibility reasons) and is a completely different issue.
Second, you'll want to use jQuery to bind the event handlers using something like $('#wpPreviewWidget').click(on_save);; or, since you're installing the same handler for all the buttons, just attach it to the nearest parent with $('.editButtons').click(on_save);. The event handlers are executed sequentially in the order they are attached, so there's no problem with multiple Gadgets attaching handlers to the same nodes; and events bubble up from innermost elements to their parents until there's a handler for that event, so you can just install one handler for a containing element when you don't need to differentiate between the buttons. Which brings me to my final point…
As a general rule, it is a bad idea to mix multiple functionalities into the same script. This is true both from the user perspective (easy labelled sections isn't really a "replace text strings" type of function from a user perspective), and from a technical perspective (it bloats the code and makes it harder to figure out what it does, and creates a tight coupling between things that are functionally independent of each other). You want the least possible lines of code in a Gadget that still gets the job done. There're exceptions, of course, particularly when you get into huge utilities like Twinkle, where you might want to collect common code into a shareable library, but for anything we're futzing about with here on enWS the rule of thumb should hold in almost all cases.
My suggestion would therefore be to let Easy LST be Easy LST (well, it could certainly do with a more descriptive name and a link to docs from its Gadgets prefs entry!), and implement the new functionality as a separate Gadget. Easy LST is the kind of thing we want to enable by default for all users; the other code is more of a power-user feature that must be explicitly opted in to. --Xover (talk) 07:58, 13 June 2020 (UTC)
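The point above about several handlers coexisting on the same node can be tried without jQuery or a page. This sketch assumes a runtime that provides the standard EventTarget and Event globals (a browser, or a recent Node.js); saveButton and the two handler labels are made up for illustration.

```javascript
// Stand-in for the element that both gadgets attach to.
var saveButton = new EventTarget();
var callLog = [];

// Two independent "gadgets" each register their own handler.
saveButton.addEventListener('click', function () {
	callLog.push('easy-section-syntax');
});
saveButton.addEventListener('click', function () {
	callLog.push('auto-replace');
});

// One click runs both handlers, in the order they were attached.
saveButton.dispatchEvent(new Event('click'));
console.log(callLog.join(', ')); // easy-section-syntax, auto-replace
```

By contrast, assigning a handler directly to a property such as onclick replaces whatever was assigned there before, which is why only one script "wins" with that older style of registration.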
@Xover: Great tips, thanks! I'll look at doing it that way. I had used the old-skool handler registration directly from the original, and didn't think my way out the box. Is there any way to debug a script "as a Gadget"? I saw the functions in the global scope when I was manually loading the script from common, I didn't realise Gadgets get special treatment. Inductiveloadtalk/contribs 10:28, 13 June 2020 (UTC)
@Inductiveload: Not that I know of, no. If you have 2FA set up, once you get +sysop back, you can request interface-admin permissions and set up a copy in a hidden gadget (it can default to on, but then disable itself unless the user is your account). It's hacky and awkward, but it might do in a pinch. --Xover (talk) 11:39, 13 June 2020 (UTC)
@Xover:, OK, I'll hack along as I am for now and maybe revisit in future.
I also hope I'm doing the config hook thing correctly - I got the idea and method from the lint check tool, which seems to be pretty "modern" JS.
Also, what level of JS modernity do we target? In particular, can we use "let" rather than "var"? CanIUse says it's got about 95% support, mostly missing IE 11 (c. 2013). Inductiveloadtalk/contribs 12:28, 13 June 2020 (UTC)
Hmm, looks like the config hook thing will not work reliably (or at all?) in a gadget, because the fire/add function ordering is not defined. Anyone know the canonical way to post configs into gadgets? The current gadget looks at a global variable, which seems almost as dodgy. Inductiveloadtalk/contribs 14:57, 13 June 2020 (UTC)
@Inductiveload: Oh, by the way, since you mentioned headers and footers, you might find some use of a… dingus… I made to make finishing Match&Split works easier (that operates on similar principles to your script). See my message to BT at "Lab rat?". It's really hacky and prone to fail, but for my own use it's been really helpful. In short it's a user script that tries to eliminate any manual labour involved in setting page headers and footers. For the large volume of pages, the only manual job is to remember to tell it the new chapter title once the chapter changes. I plan to eventually clean it up and document it, but as it's a rather narrow use case it's not been a priority. Feedback and suggestions are always welcome though. --Xover (talk) 12:00, 13 June 2020 (UTC)
@Xover: I'll take a look. User:Inductiveload/Running header.js has a similar aim, but it copies content from the previous (actually n-2th) page. It's a manual Templatescript thing, but an option to auto-insert if the page is being created should be quite possible. Inductiveloadtalk/contribs 12:28, 13 June 2020 (UTC)
@Xover: a related issue I just opened that would be useful for your script (and I've run into the need myself before): phab:T255345 Inductiveloadtalk/contribs 13:59, 13 June 2020 (UTC)

Take 2

@Xover: I've split it back out into 2 scripts. The jQuery suggestion for the click handler was (of course) right, so the two can live in harmony, side-by-side. The Easy LST script is now called "Easy Section Syntax" and is fundamentally unchanged, though a little bit tidier. I haven't been able to find a "correct" way to deal with the configuration as yet, so if a user does want to configure it (e.g. disable it on certain pages, etc), they have to turn the gadget off, then load it after their config in their user JS. I have therefore kept the IIFE, since it's needed in debug contexts and in user-configured contexts, and I don't think a double IIFE will cause any drama.

Probably the auto-replace script will stay as a user script rather than a gadget if this is how configuration has to work. Slightly annoying as it won't appear in the gadget list, but since it would need config anyway on a per-user basis, the only functional difference is an extra importScript call. Inductiveloadtalk/contribs 22:49, 14 June 2020 (UTC)

@Inductiveload: You can still access globals as a Gadget, so just proxy the config through the window object. Something like:
// Custom replace functions are added to this array
window.SLAactions = [];

// A custom function
function fixOCR (wikiText) {
	// Modify wikiText here, then return the result
	return wikiText;
}

// Add custom function to the array
window.SLAactions.push(fixOCR);
If you also need traditional "user preferences" style configuration, you can link the user to Special:BlankPage/SLAactions, catch that page in the script and display a prefs UI there, and then save the choices to the user's MW prefs using mw.user.options.set(). Or you can let the user use mw.config.set() from Special:MyPage/common.js and then just not persist the values. There's a simple example of using such a pref in User:Xover/Gadget-NopInserter.js (see WS:S#Update to NopInserter Gadget for context). --Xover (talk) 06:42, 15 June 2020 (UTC)
@Xover: Right, but is there any guarantee this:
// common.js
window.SLAactions = {...};
runs before the Gadget reads from it? It doesn't really matter for something like HotCat, where the user will only interact with it once common.js has almost certainly executed and completed. But for Easy LST (or SLAction, if there are load actions), if the user set a config, say that disabled the gadget, the gadget needs to know that before it storms ahead and messes with the text.
The hook method forces this by fire()ing before loading the script at all (thus before the add() is called). This introduces extra delay, as you have to do "load, run, load, run", rather than just "load, run, hook_fire"/"load, wait, hook_add, run" in parallel.
For very simple, static, config, the prefs-style would work fine, as mw.config.get("myPrefs"); should be available to the gadget. But say your config was "only do Easy LST if the page isn't DNB", what to do? Unless you have a dedicated page filter baked into Easy LST (code, complexity, size and maintenance overhead), you can't do it, and even if you had a page filter, maybe the user wants to key on phase of the moon, or user agent, etc etc etc.
My workaround so far is to allow the gadget to be run as a user script or a gadget, so the interested user can configure it and the default user can just get the defaults loaded by the gadget mechanism (faster and easier). In the case of SLAction, it'll probably have to be only a user script, as there's no huge benefit to gadgetising since the user will need to list their own actions in JS anyway. But if there were a way, it would speed things up a tiny bit. Inductiveloadtalk/contribs 10:07, 15 June 2020 (UTC)
@Inductiveload: I'm having trouble tracking down information on the load order here. But I'm not sure it matters in this case: you can't start modifying the page until the DOM coalesces anyway ($.ready), and by that point any code in user scripts should have finished running. Or so I would assume in any case. --Xover (talk) 18:14, 16 June 2020 (UTC)
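One way to make the load order matter less is to read the configuration only when the gadget actually runs, rather than when its script loads. The following is a sketch under that assumption: createGadget and the SLAdisabled flag are hypothetical, SLAactions is the array name suggested earlier in this thread, and a plain object stands in for window so the idea can be tried outside a browser.

```javascript
// Hypothetical gadget: configuration is read from the global object at run
// time, so it does not matter whether the user script ran before or after
// this script was loaded, as long as it ran before run() is called.
function createGadget(globalObject) {
	return {
		run: function (wikiText) {
			// SLAdisabled is an assumed opt-out flag, not an existing setting.
			if (globalObject.SLAdisabled === true) {
				return wikiText;
			}
			var actions = globalObject.SLAactions || [];
			// Apply each user-supplied action in turn.
			return actions.reduce(function (text, action) {
				return action(text);
			}, wikiText);
		}
	};
}

// Simulated load order: the gadget loads first, the user config arrives later.
var fakeWindow = {};
var gadget = createGadget(fakeWindow);
fakeWindow.SLAactions = [
	function (text) { return text.replace(/tbe/g, 'the'); }
];

// Stand-in for the DOM-ready callback, which runs after both scripts.
console.log(gadget.run('tbe cat sat on tbe mat')); // the cat sat on the mat
```

This does not settle whether user scripts are guaranteed to finish before DOM-ready; it only shows that a late read removes the dependency on which script loads first.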
I feel like you might be right, but it's quite tricky to test out, since the gadgets are so undebuggable, and debug mode is not representative. As long as User:Common.js "gets on with it" and doesn't (say) apply config within a callback from a slow load or something (which could run after the DOM settles, even if the main "thread" of common.js is complete), I imagine common.js would be done, and it does seem that way when I place a breakpoint at the end of User:common.js, or at least the main thread is done before functions like $(function(){this_one;}) fire.
My concern stems from mw:Extension_talk:Gadgets/Archive#Gadget_scripts_loaded_too_early, which indicates this was once an issue back in 2008, but the thread died and I can't tell what the upshot was. I also asked at mw:Topic:Vo7zs0yio0xtp78g. The only suggestion so far is a fancified version of what you suggest with Special:BlankPage.
Perhaps I'm faffing about pointlessly for 99.9% of practical cases, but I'd quite like to know the One True Way to do this, especially if it concerns a default gadget.
† Actually my user:common.js is loaded from a local server, so what I think of as my common.js is already inside an async callback, but it loads fast. Inductiveloadtalk/contribs 09:08, 17 June 2020 (UTC)

Address by Theodore Roosevelt before the convention of the National Progressive Party in Chicago

Could an index be created for this page from the scan images here? The page is one of a number created by User:Progressingamerica recently, with links to on-line scans, but without a scan basis. TE(æ)A,ea. (talk) 21:42, 13 June 2020 (UTC).

File:Address of Theodore Roosevelt NPP - 1912.djvu. Annoyingly they seemed to be blocking automated downloads and the basic user agent trick didn't work. >:(. Inductiveloadtalk/contribs 22:15, 13 June 2020 (UTC)
there you go Index:Address of Theodore Roosevelt NPP - 1912.djvu - text layer good. more of a pamphlet, the special collections librarians are more used to manuscripts, than books. Slowking4Rama's revenge 23:47, 13 June 2020 (UTC)

Should Category:Formatting templates and Category:Typography templates be combined?

I noticed these two categories, which seem to have a great deal of overlap, not only in their descriptions, but also in their contents. Is there any reason to separate the categories? If not, could someone (perhaps using a bot) deprecate one of the categories? TE(æ)A,ea. (talk) 21:16, 14 June 2020 (UTC).

Tech News: 2020-25

21:37, 15 June 2020 (UTC)

Auto-generated reports -- to portal: namespace?

Miraclepine and I started a nascent conversation about the generation of reports of curated lists from Wikidata that might be of interest, especially in lieu of categorisation, which is hard work and forever incomplete here. Also, with WD's data being regularly updated and curated, it seems that we are wasting the power of how it can help us outside of the main namespace.

For example if we had a page Portal:Royal Victorian Order we could have section(s) that generate a series of lists of authors for each class of the order. Here we would be looking to use something like d:Wikidata:Listeria to do the heavy-lifting. (Not that I am an expert in Listeria)

Does the community see that the portal namespace is the place to generate such output?

I know that the Wikidatans like to generate those interesting queries, eg. d:Wikidata:Status updates/2020 06 15 and it would be great if we could have those pages generated in a curated means in a ready namespace.

Interested to hear the community's thoughts. — billinghurst sDrewth 13:58, 16 June 2020 (UTC)

@Billinghurst: As a Wikidatan, I would like to note here that I recently prepared this SPARQL query which may be useful for the Listeria lists. This one is for GCVOs, but may be used for any award as long as that award is listed in the items' "award received" property. It may take seconds to load.
#People on Wikisource with GCVOs
SELECT ?item ?itemLabel ?article WHERE {

    ?item wdt:P166 wd:Q12192290 . # person with GCVOs
    ?article schema:about ?item .
    ?article schema:isPartOf <>.
    SERVICE wikibase:label {
       bd:serviceParam wikibase:language "en"
    }
}

ミラP 00:56, 17 June 2020 (UTC)
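Since the query above is said to work for any award recorded under "award received", it could be generated from a parameter. A rough sketch follows; buildAwardQuery is a made-up helper, and the wiki URL passed in the example is an assumption on my part, because the URL inside the angle brackets is not shown in the query as posted.

```javascript
// Build the query for any award item. Both arguments come from the caller;
// only the overall query shape follows the query posted above.
function buildAwardQuery(awardQid, wikiUrl) {
	if (!/^Q[1-9][0-9]*$/.test(awardQid)) {
		throw new Error('Expected a Wikidata item ID such as Q12192290');
	}
	return [
		'SELECT ?item ?itemLabel ?article WHERE {',
		'    ?item wdt:P166 wd:' + awardQid + ' .',
		'    ?article schema:about ?item .',
		'    ?article schema:isPartOf <' + wikiUrl + '> .',
		'    SERVICE wikibase:label {',
		'       bd:serviceParam wikibase:language "en"',
		'    }',
		'}'
	].join('\n');
}

// Assumed target wiki; substitute whichever project the list is meant for.
var gcvoQuery = buildAwardQuery('Q12192290', 'https://en.wikisource.org/');
console.log(gcvoQuery);
```

Validating the item ID before splicing it into the query text guards against a typo producing a query that silently matches nothing.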

  I know that I have been adding that award data as I generate the entries from TIWW. Though of course, the information about entries here is quite sparse compared to the data that one enters into the main subject, and referencing the data additions at WD from the local data is just a complete PITA. Also one is always inhibited with the issue that WD is not great for having data on institutions from early 20thC and before.

I sometimes wonder what it is that people want to see in the generative/informational space that could be Wikisource. And when we do output, what is the output that is desired. I do know that ages ago Charles Matthews started Portal:British Museum, which seems to me a good sort of Listeria-target page if the data has been added to WD, and we could conceptually look to construct such pages either as complete pages or as constructed subpages. — billinghurst sDrewth 01:36, 17 June 2020 (UTC)

On the general idea of using Listeria here: I would be in favour. It's a versatile technology, and given knowledge of SPARQL not hard to apply.

d:Wikidata:ScienceSource focus list/Main subject needed is an example of mine where, in a Wikidata namespace, it is combined with a focus list (use of P5008 on Wikidata) to provide a maintenance list. Custom focus lists on Wikidata need only the creation of a single item identifying the underlying project, and are a neat way to handle curated sets on Wikidata.

I think such lists would be suitable in both the portal and Wikisource: namespaces here. Charles Matthews (talk) 04:32, 17 June 2020 (UTC)

yes, i would support migrating hand curated lists in general to listeria. you could also try a DNB , EB1911, etc. author list. eventually bibliographic metadata will be from wikidata (wikicite) as well. Slowking4Rama's revenge 12:33, 18 June 2020 (UTC)
I support using Listeria. It has the added advantage that, with minor tweaks, a query from here can be used in any other language version of Wikisource, or vice versa. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:55, 20 June 2020 (UTC)

British works published after 1925

Help:PD says that works published outside the US between 1925 and 1977 are public domain if they were PD in their home country as of 1 January 1996 and never published in the US prior to that date. According to Wikipedia:Copyright law of the United Kingdom, "Under the 1995 Regulations (set out below), the period of author's copyright was further extended, to the lifetime of the author and 70 years thereafter" so I understand the 1996 rule cannot be applied to British works whose authors died after 1925. Unfortunately, it does not say what the British laws provide for works published in the UK after 1925 (e.g. in 1941) whose author is not known. May I ask for advice? --Jan Kameníček (talk) 21:59, 18 June 2020 (UTC)

If it's fiction (i.e. an "artistic work") then it might be PD if it was published before 1950. Commons has a template that summarises that. I'm not really sure though. —Sam Wilson 00:49, 19 June 2020 (UTC)
Thanks very much. Although the work I am speaking about now is not fiction, so I cannot apply this, it is useful to know for the future. --Jan Kameníček (talk) 06:21, 19 June 2020 (UTC)
The UK gives 70 years from publication for anonymous works; so combined with the lack of the rule of the shorter term, UK anonymous works published after 1925 are still under copyright in the US.--Prosfilaes (talk) 04:27, 19 June 2020 (UTC)
@Prosfilaes: Thanks for the reply. I know that it is 70+ from publication for anonymous works in the UK now, what I was not sure was whether it had been 70+ from publication for anonymous works in 1996 too. --Jan Kameníček (talk) 06:21, 19 June 2020 (UTC)
@Jan.Kamenicek: It's probably better if you just give us the details about the specific work. It's hard to guess what factors may apply without that. --Xover (talk) 07:44, 19 June 2020 (UTC)
@Xover: Here is a catalogue entry. However, there is one problem with the entry: they state that it was published in Czechia (Česko), which must be an error. The work itself does not indicate the place of publication; it only says "Compiled on the occasion of the Czechoslovak Army Exhibition, Stratford-upon-Avon, January, 1941". Czechia was occupied by Germany, which was at war with the UK, where Czechs were trying to form some military units to fight against Germany. I cannot imagine that a work for an army exhibition in Britain was published in occupied Czechia. I think it must have been published directly in Britain. However, it is not that important, just three pages of text, so if it is too complicated, I can leave it. --Jan Kameníček (talk) 08:49, 19 June 2020 (UTC)
@Jan.Kamenicek: I can't prove it, but it seems almost certain that they went from 50 to 70 for anonymous works at the same time they went from life+50 to life+70.--Prosfilaes (talk) 01:21, 20 June 2020 (UTC)
UK went from 50 to 70 years in 1995, presumably through the legislation that aligned it all with the EU requirements. — billinghurst sDrewth 06:06, 20 June 2020 (UTC)
Thanks very much! I went through the linked text and it says that 50 years were replaced by 70 years for anonymous works in the 1995 regulations too :-( So PD-anon-1996 cannot be applied to this work. --Jan Kameníček (talk) 08:31, 20 June 2020 (UTC)
@Jan.Kamenicek: Do you have a source file? I'd like to read it; and I believe it would be eligible to be uploaded to Commons. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:36, 20 June 2020 (UTC)
This catalogue entry names "Czechoslovak Army" as the publisher; I wonder if this was considered as de jure publication "in" Czechoslovakia, given the Czechoslovak government-in-exile? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:49, 20 June 2020 (UTC)
@Pigsonthewing: Not yet, but I could scan it (the book is not allowed to be borrowed away from the library, so I just have to take my own scanner and scan it there).
Ad publisher: I understand what you mean but I am not able to decide this :-( --Jan Kameníček (talk) 11:00, 20 June 2020 (UTC)

Always needing to confirm when leaving Page editing?

Does anyone else find that they are always asked to confirm when leaving a Page NS editing form, even when nothing in the form has been changed? I think I've got a fix for it, but maybe it's not an issue or is even helpful? —Sam Wilson 03:48, 19 June 2020 (UTC)

@Samwilson: I've not noticed this happening, and quick testing now was not able to reproduce in latest-ish Safari. If needed I can check a couple of other browsers. --Xover (talk) 05:24, 19 June 2020 (UTC)
@Xover: Hmm interesting, thanks. Maybe it's a Firefox- and Chrome-only issue? I see it reliably in both those browsers when I click edit on any page and then hit any link to leave the page. Actually, if I click a link super quickly (I guess before JS has finished loading) it doesn't happen, but every other time it does. —Sam Wilson 05:37, 19 June 2020 (UTC)
@Samwilson: On macOS I cannot reproduce in Safari, Firefox, Chrome, or Vivaldi in their latest release versions. On Windows I can reproduce it in Internet Explorer 11 and Chrome 83 (the only two I can easily access). --Xover (talk) 07:41, 19 June 2020 (UTC)
@Samwilson: I have got the same experience and it is pretty annoying. It would be brilliant if you were able to fix it! --Jan Kameníček (talk) 06:26, 19 June 2020 (UTC)
BTW, I have tried three browsers and I have this problem in all Chrome, FF and Opera. --Jan Kameníček (talk) 09:03, 19 June 2020 (UTC)
@Samwilson: Can you confirm that your preferences do not have "Warn me when I leave an edit page with unsaved changes" ticked? It isn't happening to me with Firefox. — billinghurst sDrewth 05:49, 20 June 2020 (UTC)
Noting that there will be pages where there are zero observable changes as there can be changes happening in the background of ProofreadPage. Try Page:The Catholic encyclopedia and its makers.djvu/162 and see if you get the same issue? — billinghurst sDrewth 05:51, 20 June 2020 (UTC)
@Billinghurst: Yep, I have that preference enabled, but I'm getting the warning even when there are no changes. It does happen on the page you linked. I'm not sure what changes in the background you're talking about, but it seems to me that it's poor UX to get this warning. If a user opens an edit page, and doesn't do or click on anything at all, and then tries to navigate away, they shouldn't get a confusing message about "data you have entered may not be saved". I think I've found the fix; see the above phab ticket. —Sam Wilson 05:36, 22 June 2020 (UTC)
@Samwilson: save the page and see if there is a micro-edit, or whether it is a false artefact. — billinghurst sDrewth 06:40, 22 June 2020 (UTC)
@Billinghurst: Nope, saving it produces no diff. I'm pretty sure it's a bug. —Sam Wilson 01:13, 23 June 2020 (UTC)

People moving unsourced pages to make an edition

Twice in the last couple of weeks I have seen editors moving unsourced versions of work to be subpages of a specific work. I have reverted those moves and left notes to the users. I just wanted to flag, to the community who may be watching, that this is not a practice that we want. If we have sourced pages that came from a work, then we can move them to be subpages; if they are unsourced, they, however, become their own version, and we create a versions page. We are happy to have editions, that is a good thing, not a negative. If you need a hand to tidy up, or to create version pages, etc. then please ask for it, as I doubt anyone, especially any admin, minds assisting. If an old unsourced page is problematic then utilise WS:PD. — billinghurst sDrewth 06:46, 22 June 2020 (UTC)

As a person to whom this post is undoubtedly referring, I have to take issue with this.
The Wikisource USP is supposed to be side by side transcription. The most recent case where the above person has undone things is Coleridge's poem 'Monody on the Death of Chatterton'. According to the Versions page for it, there are 5 different versions on the site (Coleridge kept revising it), none of which is sourced (i.e. has an index page). I moved the 1796 version to Coleridge's 'Poems on Various Subjects', which I have started transcribing from Index:Poems on Various Subjects - Coleridge (1796).djvu, since the Versions page stated that it came from that source. So now there are two versions of the same thing, from the same source (one verifiable, the other not (and which also contains several obvious errors)). How is that 'a good thing'?
If you look at the statistics pages, the English Wikisource is by far and away the worst of the sites for unsourced work. As part of the transcription work I do, I try to do some that migrates unsourced texts (e.g. Milton's 'Paradise Regain'd', Addison's 'Cato', Johnson's 'Rasselas') to help improve this, but one wonders what the point of doing so is. Chrisguise (talk) 09:05, 22 June 2020 (UTC)
Not sure I'm following the rest of this issue, but I'll note that an unsourced text that is redundant to scan-backed text is eligible for speedy deletion. And, again without necessarily having direct relevance to this discussion, I think scan-backing existing unsourced texts is one of the most important tasks for the project, and among the best for improving our overall quality. Light bulb comes on… Oh, wait a minute… @Chrisguise: Did you move the page before scan-backing it? There's no real reason to do that, and when you do it can look like what you're doing is creating a new unsourced edition from random bits and pieces (or putting unsourced text inside the structure of a scan-backed edition). I would strongly recommend either proofreading or using Match&Split and then transcluding over the unsourced text first and then moving it within the work's structure afterwards. If you need to check the transclusion while the proofreading is still in progress you can easily do that in a sandbox (or with just preview in an edit window without saving). --Xover (talk) 10:38, 22 June 2020 (UTC)
(edit conflict)
I am not taking this issue up with specifics with anyone in this forum, this is the more general reference and reminder to community about what we do.
Yes, the preferred means for text reproduction is side-by-side transcription followed by transclusion. That does not mean that we move an unsourced version. The work that you are referring to is simply unsourced, and you don't know the source that OR used to bring the work here; it is quite unlikely that it was directly from the 1796 edition, so it is its own version. That just means that when you have your new version we move Monody on the Death of Chatterton (1796) to Monody on the Death of Chatterton (1796) unsourced (or similar) and make the former page the {{versions}} page. As I mentioned, at that point we can address whether we want to delete the unsourced version. The mention on the disambiguation page is just that, where it was published; it is not the source of our work, which is unsourced. We don't presume.
Who gives a toss about the statistics page, that is not what drives us; there is no WORST, there is what there is. We deal with deletions by community consensus, not by someone unilaterally overwriting a version. — billinghurst sDrewth 10:51, 22 June 2020 (UTC)
Not that this has any relevance to the general point you're making here, but in this specific instance it actually happens that Ottava specified which edition it was from, and it was Ottava that added it to the versions page with the reference to the edition. Personally I would have been satisfied with just the year given in the title (the early editions of this poem are relatively well mapped), but the additional datum should be sufficient even for a stringent standard of evidence, I would argue. But this is of course an exception: many if not most unsourced texts are not so unequivocally identifiable to a specific edition (which is one reason I would typically vote delete in a proposed deletion discussion for such texts). --Xover (talk) 11:12, 22 June 2020 (UTC)
don't know why you are elevating here. add your sourced and unsourced version here Monody_on_the_Death_of_Chatterton. it appears to be a textual match, compared to other versions. redirect or transclusion, is better than move. Slowking4Rama's revenge 15:29, 22 June 2020 (UTC)
I don't see the point in having an unsourced version alongside a source version. A version that has some clear sourcing that's different from the sourced version might be worth keeping, but an unsourced version might have copyright problems, might have deliberate or accidental changes or errors, and has no audience. Anyone who doesn't care should be using the sourced version, and anyone who does care will be using the sourced version or find the correct, if untranscribed, version on or HathiTrust. We shouldn't be adding replaced unsourced works to versions pages and make things more complex for no value; we should just be deleting them.--Prosfilaes (talk) 16:02, 23 June 2020 (UTC)
Maybe I am misunderstanding this, but if an unsourced work is (essentially) identical to a portion of a work that is in progress, I think the best thing to do is move the unsourced work to a subpage of the sourced work and then replace the unsourced content with a transcluded scan-backed copy. I do this a lot. It keeps the redirects and links all tidy and ensures that the work is improved by the addition of a source and a scan. —Beleg Tâl (talk) 16:47, 24 June 2020 (UTC)
As for my page moves, (being the other person here attacked,) I have created the Index: page here for the work which contained a large number of the hymns which I had moved.

Tech News: 2020-26

18:48, 22 June 2020 (UTC)

Changes to the Main Page are coming before 13 July

Regarding "A temporary fix helped wikis make their main pages more mobile friendly. This was in 2012. It has not been recommended since 2017. It will not work after 13 July." above.

This change in MediaWiki necessitated changes to our Main Page, because the current version will be somewhat broken (on mobile) after the cutoff (mobile, desktop).

To address this I have made a new version at Main Page/sandbox new (mobile, desktop) that uses a different approach to achieving the same effect as the Mobile Frontend's "legacy transform"; slightly cleans up and modernises the code; and otherwise tries to make as few changes as possible.

These changes will have to be put in place no later than July 12 (I'll probably aim for doing it a little earlier), and before that it would be very good to have as extensive testing as possible in order to shake out the inevitable problems. Please include operating system and version (Windows, macOS, Linux, etc.); web browser (Safari, Chrome, Firefox, etc.) and version; type and preferably model of device (PC, Mac, iPhone, Galaxy, etc.); and, of course, a description of the problem. Reports confirming the absence of problems are also very useful!

Pinging users who have previously expressed an interest in this issue (but the more testers the better!): Pelagic, Billinghurst.

@Inductiveload: You have qualifications / interest relevant to this I believe?

@Mpaa: You're pretty technical. You around, and is this your cuppa?

As noted above this change is trying to mimic the existing version as best possible under the circumstances, so it is not "Main Page 2.0" or anything else that could be called a "redesign" or "new design". However I do think there's some potential for an actual redesign there to freshen up the design (a little; our design should be a little stodgy), make better use of available space, and perhaps rethink what we put on the mainpage and in what order they should be. Possibly some of the supporting templates could be beneficially rethought too to make them simpler to use. But that sort of thing would require proper community discussions unhampered by the looming deadline of MediaWiki changes, so that'll have to wait for a different occasion. --Xover (talk) 08:13, 23 June 2020 (UTC)

@Xover: It seems to work pretty slickly for me. I only have one, non-critical observation:
  • The headings come out as UPPER CASE when the page goes to single-col mode. If you write them in "Title Case" in the template, and apply a CSS "text-transform: uppercase;" in the dual-col CSS only, you can avoid that.
I have a few more in terms of general design on the page, but I don't want to mess with it in the same breath as fixing this. Inductiveloadtalk/contribs 09:08, 23 June 2020 (UTC)
@Inductiveload: Fixed. The current version has sentence case so I went with that rather than title case. --Xover (talk) 10:00, 23 June 2020 (UTC)
Agreed, there are lots of things we could discuss, but let's not make a bikeshed-tarball. Maybe a sub-component of WS:IGD (under the "I")? I have other problems with various issues regarding CSS for export, printing, etc, as well as gadgets and sidenotes, which could all do with a central "sweat the small stuff" location. Inductiveloadtalk/contribs 09:08, 23 June 2020 (UTC)
WS:GEARS where the gearheads collect issues and try to fix stuff? :) --Xover (talk) 10:05, 23 June 2020 (UTC)

Publication date of files uploaded to en.ws

When uploading a file to, the uploading form asks to choose some licensing. One of the offered options is "First published in the United States before 1923", while the uploader should be offered "… before 1925". Can it be changed locally or is it necessary to ask at phabricator? --Jan Kameníček (talk) 08:45, 23 June 2020 (UTC)

@Jan.Kamenicek: The options in that dropdown come from MediaWiki:Licenses and can be changed there. On the presumption that the change is not controversial I have gone ahead and made that change. If that should not be the case then please let me know and I'll revert it while we discuss the issue. --Xover (talk) 09:02, 23 June 2020 (UTC)
Yes, that is what I meant, thanks for quick reaction. --Jan Kameníček (talk) 09:29, 23 June 2020 (UTC)
BTW, is it possible to make it automatic, so that it does not have to be changed manually every year, to prevent the current situation when the options were updated 2 years later than needed? --Jan Kameníček (talk) 09:34, 23 June 2020 (UTC)
@Jan.Kamenicek: Nope. Extensions (like ParserFunctions) are not run in that context so we can't (easily) make that value calculated. --Xover (talk) 09:49, 23 June 2020 (UTC)
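For reference, the arithmetic behind the dropdown label is simple, even though it cannot be evaluated inside MediaWiki:Licenses itself. The following is only a sketch of how a bot or a yearly reminder script might compute the label, assuming the 95-year US publication term that gives 1925 for 2020; the function names are hypothetical and not part of any existing tool.

```python
from datetime import date
from typing import Optional

# Assumption: US works enter the public domain 95 years after publication,
# which yields "before 1925" in 2020, as discussed above.
US_PUBLICATION_TERM_YEARS = 95


def us_pd_cutoff_year(today: Optional[date] = None) -> int:
    """Return the year N such that works first published before N are PD in the US."""
    today = today or date.today()
    return today.year - US_PUBLICATION_TERM_YEARS


def licence_label(today: Optional[date] = None) -> str:
    """Build the dropdown wording for the given date (hypothetical helper)."""
    return f"First published in the United States before {us_pd_cutoff_year(today)}"


if __name__ == "__main__":
    # Date of this discussion; the label should mention 1925.
    print(licence_label(date(2020, 6, 23)))
```

The output would still have to be pasted into MediaWiki:Licenses by hand or by an approved bot each January.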

A not insubstantial effort at Commons...

As reported previously, Internet Archive is facing a lawsuit over copyright concerns.

There was a proposal at Commons to mirror public domain works from Internet Archive to Commons.

This is partly a courtesy notification to let Wikisource contributors and admins know about this.

Relevant threads at Commons:

As what is being 'mirrored' is intended to be public-domain books (and some other resources which are "compatible" with Commons licensing), it would be appreciated if contributors and admins from Wikisource with relevant expertise and the time could assist in reviewing what's been uploaded so far and what may be coming shortly. The aim of this reviewing would be to ensure that anything still in copyright, but inadvertently uploaded, is rapidly removed.

In addition it would be helpful (if feasible) to have a complete list of IA identifiers for "public-domain" works linked to from Wikisource as External Links, but which for whatever reason do not yet have a local copy available for transcription. This is so that if needed an appropriate script could be used to ensure the appropriate files for those identifiers are uploaded in future batches.
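As a rough illustration of how such a list could be compiled, the sketch below pulls Internet Archive identifiers out of a chunk of wikitext. It assumes the links use the usual archive.org/details/, /download/ or /stream/ URL forms; the function name and the sample identifiers are hypothetical, and fetching the wikitext (from a dump or the API) is left out.

```python
import re
from typing import List

# Assumption: external links point at archive.org/details/<id>,
# archive.org/download/<id>/... or archive.org/stream/<id>/...
IA_URL = re.compile(
    r"https?://(?:www\.)?archive\.org/(?:details|download|stream)/([^/\s?#|\]}]+)",
    re.IGNORECASE,
)


def extract_ia_identifiers(wikitext: str) -> List[str]:
    """Return the distinct IA identifiers found in wikitext, in order of first appearance."""
    seen = {}
    for match in IA_URL.finditer(wikitext):
        seen.setdefault(match.group(1), None)  # dict keeps insertion order
    return list(seen)


if __name__ == "__main__":
    sample = (
        "[https://archive.org/details/example-id-1 Scan] and "
        "https://archive.org/stream/example-id-2/example-id-2_djvu.txt"
    )
    print(extract_ia_identifiers(sample))
```

The resulting identifiers would still need to be checked against existing local files and against copyright status before any upload.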

I am likely to be away from Wikisource and Commons for a while however... ShakespeareFan00 (talk) 21:37, 24 June 2020 (UTC)

Empty space at the top of a page

How can I get rid of the empty space at the top of Page:A Book of Czech Verse.pdf/24? --Jan Kameníček (talk) 16:42, 25 June 2020 (UTC)

Please see if this is what you meant. There is space on top because of the font-size line-height.— Ineuw (talk) 17:53, 25 June 2020 (UTC)
@Ineuw: The result is what I needed, thanks very much. However, the way to achieve it is far from being intuitive and, honestly, I do not understand it much :-(
I am also not sure about the reason (font-size line-height). When I look at the original layout of the page, it seems to me that there are empty lines at the very top of the page (CTRL+A makes it visible), not a line with a larger line-height. I have also seen many other pages with the title written in a larger font where no empty space appears above the title (example). What is the reason that this page needs such a complicated solution while other pages do not? --Jan Kameníček (talk) 18:05, 25 June 2020 (UTC)
One more thing: Can the block be aligned to the left? The problem of {{Bilingual}} is that every pair of pages has to be transcluded separately, and so it is impossible to align the text to the center as one block. Therefore it is better to align it to the left, which ensures that the left edge of the text is in one line with the text of the other pages. --Jan Kameníček (talk) 18:14, 25 June 2020 (UTC)
@Ineuw: Here is the result after transclusion to show how the pages got misaligned. --Jan Kameníček (talk) 19:06, 25 June 2020 (UTC)
@Ineuw: I have noticed that you tried to simplify it and align the poem to the left, thanks very much for that, but I still do not like this solution very much. While originally the problem was only a redundant empty line at the top, now two problems have appeared instead:
  1. The block from page 16 is not aligned with the block of the following page no. 18, see the transcluded result at A Book of Czech Verse/J. Kollár, although this probably could be solved by adding the table and margin parameters to the poem in page 18 too.
  2. The author and the title of the poem are aligned to the center of the page, not the center of the block, which is visible when I for example change the width of the browser’s window. --Jan Kameníček (talk) 21:41, 25 June 2020 (UTC)
I really do not understand why these complicated things are needed. The original layout looked simple to me, so why does the empty line appear there? Is it connected with the problem that Xover tried to solve with the {{center}} and then returned it back? --Jan Kameníček (talk) 21:41, 25 June 2020 (UTC)
There is a very easy way, had I known that page two is a translation. I will fix it.— Ineuw (talk) 22:28, 25 June 2020 (UTC)
@Jan.Kamenicek: A_Book_of_Czech_Verse is the end result but the only way I could do this, is to place them in a table in a left and right cell and with a center cell to space the two columns apart.— Ineuw (talk) 00:14, 26 June 2020 (UTC)
@Ineuw: This looks much better! The only disadvantage I can see are the missing little numbers linking to the individual pages with English translation, which is imo very important too :-(
  Comment Why are we producing the Czech at all, it doesn't belong here, it should be at csWS? It should only appear here by interwiki transclusion, not through our proofreading and displaying. — billinghurst sDrewth 06:01, 26 June 2020 (UTC)
@Billinghurst: I would gladly transcribe it to but there are several problems:
  1. Czech WS uses different formatting templates which fail after the page is transcluded to English WS.
  2. Local admins at do not like the proofreading extension for some reason and so it is very badly maintained there and solving a problem often takes weeks or months. At the same time the request for Addition of csWS to global bots was absolutely ignored by the admins and failed as a result. So even if we found some working solution now, I am not sure whether it can be guaranteed that it will work in the future too. --Jan Kameníček (talk) 07:37, 26 June 2020 (UTC)
Still not a particular reason why the Czech text ends up here. I left a message for Danny to see what we can sort out. — billinghurst sDrewth 11:32, 26 June 2020 (UTC)
@Billinghurst: One more thing: I was thinking about the transwiki problems generally and have forgotten about an insurmountable problem connected with this particular work, and that is copyright: {{PD-US-no-renewal}} is not an acceptable license at --Jan Kameníček (talk) 14:25, 26 June 2020 (UTC)
@Jan.Kamenicek: no-renewal really just means "public domain in the US"; there is no legal difference between PD-old, Pd/1923, PD-no-notice, and PD-no-renewal. Is the issue perhaps that the work is only PD in the US, but still in copyright in Czechia? --Xover (talk) 14:59, 26 June 2020 (UTC)
Yes, exactly, some of the poems are still in copyright in Czechia. --Jan Kameníček (talk) 20:30, 26 June 2020 (UTC)
@Jan.Kamenicek: There's an invisible U+FEFF character (zero-width no-break space, formerly used as a Unicode w:Word joiner) between the | in {{block left}} and the opening { of {{c}} in your original version of the page. Due to its position in the markup stream it effectively ends up giving you an additional seemingly-blank line in the output (the character is invisible, so the line is visually empty, but it triggers the generation of a line-box that takes up space in the browser). --Xover (talk) 10:32, 26 June 2020 (UTC)
@Xover, @Ineuw: I have remembered that in the past I did something similar and it worked well and so I found it, see Page:Modern Czech Poetry, 1920.djvu/56. So I copied the code and adjusted it, and suddenly it started working as I desired, see A Book of Czech Verse/J. Kollár. What is really strange is that the code seems absolutely the same as the one which did not work, as you can see in the diff. I am glad it works now, but I have no clue why…
In fact there must be some invisible difference, as the line with {{lang|cs|inline=no|{{block left|{{c|{{larger|{{al|Ján Kollár|JAN KOLLÁR}}}}<br /> is highlighted in the diff, although visually I do not see anything different. --Jan Kameníček (talk) 20:26, 26 June 2020 (UTC)
Wonderful. I am glad you found what you preferred. — Ineuw (talk) 01:13, 27 June 2020 (UTC)
And I thank you for the effort to help me. Although I did not use it finally, I learnt a couple of good tricks.
However, I am really curious about the difference between the two seemingly identical "lines 1" at [15] and why they give different outputs… --Jan Kameníček (talk) 19:56, 27 June 2020 (UTC)
@Jan.Kamenicek: See my reply above. --Xover (talk) 20:22, 27 June 2020 (UTC)
@Xover: Ah, sorry, I did not get it at once, I was reading too quickly. These invisible characters make problems too often… Is there any way to make them visible in the editing windows? --Jan Kameníček (talk) 06:44, 28 June 2020 (UTC)
@Jan.Kamenicek: In theory, yes; but in practice, probably not. I use a programmers’ text editor that makes such characters visible. And if you know or suspect such a character is there you can often discover it by using the arrow keys to move the insertion point character by character. Most invisible characters will cause the insertion point to seemingly not move, because it is technically moving but over an invisible character so visually it stays put. In this specific case it also had disproportionate effect due to an unfortunate position in the markup, and it took me quite a bit of effort to figure out what was going on. Most of the time such invisible characters will either not create issues that anybody even notices, or their presence will be obvious from their effects. --Xover (talk) 09:16, 28 June 2020 (UTC)
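For anyone without such an editor, a small script can do the same job on copied wikitext. This is a minimal sketch, assuming that the troublesome characters are Unicode "format" characters (general category Cf, which includes U+FEFF, U+2060 and U+200B); the function name is hypothetical.

```python
import unicodedata
from typing import List, Tuple


def find_invisible_characters(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, 'U+XXXX', name) for every format-category (Cf) character."""
    hits = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            code_point = f"U+{ord(char):04X}"
            hits.append((index, code_point, unicodedata.name(char, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    # Markup similar to the case above, with U+FEFF hidden after the pipe.
    sample = "{{block left|\ufeff{{c|text}}}}"
    for index, code_point, name in find_invisible_characters(sample):
        print(index, code_point, name)
```

Pasting the suspect line into the sample string shows where the hidden character sits, so it can be deleted with the arrow-key method described above.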

I thought you should know (useful tip for transcriptions!)

Just for you to know, in one of the last Windows 10 updates, they added an "emoji" menu, [Win + . (dot) ], that gives you access to not only emojis, but a variety of unicode symbols and letters, as well as mathematical signs. It's slightly more convenient than the "Special characters" tab. Take care. --Ninovolador (talk) 19:30, 26 June 2020 (UTC)

Adding texttip functionality to Template:Reconstruct

The title speaks for itself. If somebody is not familiar with the formatted output of Template:Reconstruct, the angle brackets look strange and (in my opinion) harm the authenticity of the text. Adding a "texttip" parameter to the template could provide additional context for the reconstruction; the word or phrase as it appears in the scan could be provided, for example. --EnronEvolved (talk) 17:15, 27 June 2020 (UTC)

Transclusion of pages from Oldwikisource

I am working on Index:A Book of Czech Verse.pdf which is a bilingual book with poems in Czech and English. It was written in such a way so that readers who can read both languages could compare the original versions of poems and their English translations. I believe it is worth keeping this intention after its transcription to English Wikisource too, similarly as I have done with Modern Czech Poetry.

It is quite natural that both English and Czech Wikisource can be interested in this work. Although I think it would be easier to transcribe it separately to both wikisources, I am considering Billinghurst’s suggestion to transcribe part of the above mentioned Index:A Book of Czech Verse.pdf outside and then transclude it here. There was also an interesting suggestion at Czech Wikisource to transcribe either the whole work or its Czech part to oldwikisource (some poems are in copyright in Czechia and so the whole Czech part cannot be accepted at and then both English and Czech Wikisource can transclude what they want from there. If this is possible, I will be happy to transcribe it there, but first I would like to ask some questions to people who have some experience with both oldwikisource and interwiki transclusion:

  1. Is there an easy way to apply the same formatting I am used to applying at English Wikisource in such a case? Having a look, e.g., at the Wikidata item of the template:larger, I cannot see its equivalent at mul. The templates I have used so far when transcribing the above mentioned work include e.g. {{lang}} (including its inline=no parameter), {{larger}}, {{block left}}, {{dhr}}, {{ditto}}… I am afraid that if I used some other templates instead, they would not work at after transclusion.
  2. How can the links be treated? If a page contains e.g. the name Ján Kollár, and it is going to be transcluded to both Czech and English Wikisource, is there a way how to link the name to Author:Ján Kollár when the page is transcluded to, and at the same time to cs:Autor:Ján Kollár when it is transcluded to

Answers to these questions would help me to consider whether the interwiki transclusion is worth the effort or whether it is better to transcribe it to both wikisources directly (using their local templates).

Another thing to consider is the "danger" of having some texts outsourced: if something goes wrong at the other wiki from which we transclude the text and they do not fix it, nobody may notice it for a long time here. --Jan Kameníček (talk) 18:59, 28 June 2020 (UTC)

Books from the Biodiversity Heritage Library

A mass upload of books from the Biodiversity Heritage Library, to Commons, is underway. Please assist in categorising them, and make use of them on Wikisource. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:07, 29 June 2020 (UTC)

Tech News: 2020-27

16:30, 29 June 2020 (UTC)

Feedback on movement names

Hello. Apologies if you are not reading this message in your native language. Please help translate to your language if necessary. Thank you!

There are a lot of conversations happening about the future of our movement names. We hope that you are part of these discussions and that your community is represented.

Since 16 June, the Foundation Brand Team has been running a survey in 7 languages about 3 naming options. There are also community members sharing concerns about renaming in a Community Open Letter.

Our goal in this call for feedback is to hear from across the community, so we encourage you to participate in the survey, the open letter, or both. The survey will go through 7 July in all timezones. Input from the survey and discussions will be analyzed and published on Meta-Wiki.

Thanks for thinking about the future of the movement, --The Brand Project team, 19:39, 2 July 2020 (UTC)

Note: The survey is conducted via a third-party service, which may subject it to additional terms. For more information on privacy and data-handling, see the survey privacy statement.

WMF is planning a rebranding affecting all projects

I strongly encourage everyone to go take this survey and to participate in the discussion at the linked venues. Note that the survey closes in 4 days (we got this notification three weeks after they started the survey)! The Board of Directors of the Wikimedia Foundation has already decided that there will be a rebranding, with a final decision on the brand set for August 2021 (that is, just over 12 months from now). At that point they may move forward with one of the options; pause or adjust the rebranding process; or abandon it altogether. The options they have presented and want the community to "help refine" are:

                 | Option #1                     | Option #2                          | Option #3
                 | Wikipedia as a network        | Wikipedia as a Movement            | Wiki + Wikipedia
Movement         | Wikipedia Network             | Wikipedia Movement                 | Wiki
Movement tagline | Part of the Wikipedia Network | Part of the Wikipedia Movement     | A Wiki Project (for projects); A Wiki Organization (for organizations)
User groups      | Wikipedia Group Penguins      | Wikipedia Group Penguins           | Wikigroup Penguins
Chapters         | Wikipedia Network Antarctica  | Wikipedia Organization Antarctica  | Wikipedia Foundation Antarctica
Foundation       | Wikipedia Network Trust       | Wikipedia Organization             | Wikipedia Foundation

That is, Wikisource will get one of the taglines: Part of the Wikipedia Network, Part of the Wikipedia Movement, or A Wiki Project. And the project will be hosted by either the Wikipedia Network Trust, the Wikipedia Organization, or the Wikipedia Foundation. The names of the projects appear unlikely to change as a result of this process, so we'll still be "Wikisource", and I don't see any push to change project logos either (but other branding elements definitely might).

The central point for the Board is to more strongly identify themselves with Wikipedia, because that's what potential donors recognise, which is why the options above are just three variants of "Wikipedia". --Xover (talk) 10:39, 3 July 2020 (UTC)

Policy on substantially empty works

[This is imported from WS:PD, where it applies to multiple current proposals, and several other works].

We have quite a few cases of works that are "collective" or "encyclopaedic" in that they comprise many standalone articles of individual value, which are basically just "shell pages", with no substantial content of any sort, not even imported scans or Index pages. For example, and this isn't intended to make any statement about these specific works, they're just examples and they may well get some work done soon during their respective WS:PD discussions:

Based on the usual rate of editing for things like that, unless dragged up into a process like WS:PD, they'll remain that way for a very, very long time. I think perhaps there might be a case to host a mainspace page for this work, even though there is zero, or almost zero, actual content. Do we want:

  • Mainspace pages where there is a tiny bit of information like header notes, scan links and maybe detective work on the talk page (not in this case). This provides a place for people to incrementally add content. It also gives "false positive" blue links, since there is actually no "real" content from the work itself, or
  • Do not have a mainspace page until there's some content. Only host this in terms of author/portal scan links, much like we do for something like a novel.

Personally, I lean (gently) towards #2, but with a fairly low bar for how much content is needed. Say, Indexes, basic templates, a title page and one example article. Ideally, a completed TOC if practical, especially for periodical volumes/numbers. It is fair to not wish to transcribe entire volumes of these works, it is fair to not want to import dozens of scans when you only wanted one, it is fair to only want an article or two, but it's not fair, IMO, to expect the first person who wants to add an article to have to do all the groundwork themselves, despite having been lured in with a blue link. That onus feels more like it should be on the person creating the top-level page in the first place.

I do see some value in periodical top pages with decent lists of volumes and scans where known, because these are often tricky and fiddly to compile from Google books/IA/Hathi, so it's not useless work, even if there are no imported scans (though imported is better than not).

We have a large handful of collective works listed for deletion right now, at various levels of "no real content", and, furthermore, every single periodical that gets added can fall into this situation unless the person who adds it does the groundwork, so I think we could have a think about what we really want to see here. Inductiveloadtalk/contribs 15:43, 3 July 2020 (UTC)

  • I believe that, if there is no scan as an Index: page, the main-namespace page should not exist unless it is being actively completed or is already mostly completed. A few pages (of the volume itself) are not very helpful, and are entirely useless if there is no scan given. TE(æ)A,ea. (talk) 15:59, 3 July 2020 (UTC).
  • I think such preparatory information would ideally be on more centralized WikiProject pages (for the broad subject), both for clarity and to assist in keeping different efforts consistent -- but that it certainly should be retained as visible to non-admins. I think that the red vs blue link issue is minor (but not totally negligible) and outweighed by the disadvantages of hiding the history of previous efforts. I strongly encourage redirecting such pages to appropriate WikiProject pages (after copying over the details there). JesseW (talk) 18:11, 3 July 2020 (UTC)
  • @JesseW: I agree that history shouldn't be deleted, but I think we should approach this in terms of what we want to see from these works, rather than what to do with the handful of examples at PD. There are hundreds of periodicals we could have but don't, and this applies to those as well. If we can come to a conclusion about what is and isn't wanted, we can make all the deletion requested works conform to that easily enough. Inductiveloadtalk/contribs 20:55, 3 July 2020 (UTC)
  • I think these pages are necessary to list index pages and external scans of multi-volume works (such as encyclopaedias and periodicals) especially if they are wholly or partly anonymous or have many authors or are simply large. I think it makes no difference whether such pages are in the mainspace, the portal space or the project space (except that it is harder to find pages outside the mainspace). The point is that these works often have so many volumes (often dozens or hundreds) that they must have their own page, and cannot be merged into a larger portal or wikiproject. If the community starts insisting on index pages, what will happen is the rapid upload of a large number of scans for the periodicals that already have their own page. Likewise if the community insists on transclusion. I also think it is reasonable to have a contents page in the mainspace, as it allows transclusion of articles. Most importantly, new restrictions should not immediately apply to existing pages that were created before the introduction of the restrictions. This is necessary to prevent a bottleneck. James500 (talk) 23:55, 3 July 2020 (UTC)
move the works to a maintenance category, and i will work them; delete them and i will not: i find your sword of Damocles demotivating. Slowking4Rama's revenge 01:55, 5 July 2020 (UTC)
@User:Slowking4: I am not proposing a sword of Damocles. I agree that the imposition of deadlines is counter-productive. I do not support the deletion of any of these pages. I would prefer to see them improved. James500 (talk) 04:38, 5 July 2020 (UTC)
TEA is on his usual deletion spree. not a fan. will not be finding scans to save texts, any more. he can do it. Slowking4Rama's revenge 00:15, 6 July 2020 (UTC)
The entire point of moving this here, and not staying at WS:PD, is to decouple from the emotions that get stirred up in a deletion discussion. Let's keep deletion out of this. If we come up with some idea of what we do and don't want, then we can go back to WS:PD and decide what to do. I imagine that all that will be needed will be a fairly limited amount of housework to bring those works up to some standard that we can decide on here, and all the collective works there will be easy keeps. Hopefully with some kind of consensus that we can point at to outline a minimum viable product for such works going forward. There are hundreds and thousands of dictionaries, encyclopedias, periodicals and newspapers that we could/will, quite reasonably, have only snippets of. How do we want to present them? What, exactly, is the minimum threshold? Let's head all those future deletion proposals off at the pass, because deletion proposals often cause friction. Inductiveloadtalk/contribs 00:47, 6 July 2020 (UTC)

I don't think this needs to be much of an issue going forward -- we all agree that it's OK to create Index pages for scans, even if none of the Pages have been transcribed yet; so the only case where this would come up is recording research where no scan has yet been identified as suitable to be uploaded. And for that, I still think a WikiProject page is the right location, not mainspace. (Or, if you must, your userpage.) JesseW (talk) 00:59, 6 July 2020 (UTC) I realized I may not have been clear enough here -- in my view, the ideal process goes like this:

  1. Decide on a work you are interested in (in this case, a periodical/encyclopedic one) -- don't record that anywhere on-wiki (except maybe your user page)
  2. Find and upload (to Commons) a scan of one part/issue/etc of the work.
  3. Create a ProofreadPage-managed page in the Index: namespace for the scan. (You can stop after this point, without worry that your work will later be discarded.)
    1. Put further research (on other editions, context, possible wikification, etc.) on that Index_talk page.
    2. Proofread a complete part of the scan (an article from the magazine issue, a chapter from the book, an entry from an encyclopedia, etc.) and transclude it to the mainspace (and create necessary parent pages), and put the further research on the Talk: page of the parent mainspace entry.

If you can't find any scan, and don't want to leave your working notes on your user page, put them on a relevant WikiProject's page.

If you come across such research done by others and misplaced, follow the above process to relocate it to an appropriate place, then redirect the page where you found it to the new location. That's my proposal. JesseW (talk) 01:08, 6 July 2020 (UTC)

@JesseW: It's not clear to me in your above whether when you use the term "index" you refer to a ProofreadPage-managed page in the Index: namespace, or a general wikipage in the main namespace on which an index-like structure (and/or a ToC, or similar) is manually created. Could you clarify? --Xover (talk) 05:14, 6 July 2020 (UTC)
I meant the namespace. Clarified now. JesseW (talk) 05:17, 6 July 2020 (UTC)
  • Hoo-boy. Y'all sure know how to pick the difficult issues…
    My general stance is that: 1) scans and Index: (and Page:) namespace pages have no particular completion criteria to meet to merit inclusion, and can stay in whatever state indefinitely (there may be other reasons to get rid of them, but not this); and 2) the default for mainspace is that only scan-backed complete and finished works that meet a minimum standard for quality should exist there.
    That general stance must be nuanced in two main ways: 1) there must be some kind of grandfather clause for pre-existing pages; and 2) there must exist exceptions for certain kinds of works that meet certain criteria. I won't touch on the grandfather clause here much, except to say I'm generally in favour of making it minimal, maybe something like "No active effort to get rid of older works, but if they're brought to PD for other reasons they're fair game". The design of a grandfather clause for this is a whole separate discussion, and an intelligent one requires analysis of existing pages that would be affected by it. It is always preferable to migrate pages to a modern standard, so a grandfather clause is by definition a second choice option.
    Now, to the meat of the matter: the exceptions…
    We have a clear policy to start from: no excerpts. Works should either be complete as published, or they should not be in mainspace. But quite apart from the historical practices that modify this (which are somewhat subjective and inconsistent, so I'll ignore them for now), there are some fairly obvious cases that suggest a need for more nuance than a simple bright-line rule alone provides. The major ones that come to mind are: 1) massive never-completed projects like EB1911 or the New York Times (EB because it's big; NYT because new PD issues are added every year); 2) compilations or collections of stand-alone works with plausible claim to independent notability.
    For encyclopedias and encyclopedia-like things, we have to accept some subsets due to sheer scale of work. But when that is the grounds for exception, there needs to be some minimum level of completion. I'm not sure I can come up with a specific number of pages/entries or percentage, but it needs to be more than just a single entry (and, obviously, only complete entries). For this kind of exception to apply, I think it needs to be a requirement that the framing structure for it is complete: that is, the mainspace page should give a complete overview of the relevant work even if most of it is redlinks. That includes title pages and other prolegomena when relevant. For a periodical like the NYT, that means complete lists of issues with dates and other such relevant information (e.g. name changes etc.). For preference, these kinds of things should be in Portal: namespace or on a WikiProject page until actually complete, but that will not always be practical (EB1911 and NYT are examples of this). Mainspace or Portal:-space should never contain external links (i.e. to scans) or links to Index: or Page: space (except the implied link of transclusion and the "Source" tab in the MW UI provided by ProofreadPage).
    For exception claimed under independent notability there are a couple of distinct variants.
    Newspaper or magazine articles need to have a certain level of substance in addition to a specific identifiable byline (possibly anonymous or pseudonymous, and possibly identified after the fact by some other source, such as the Letters of Junius) in order to qualify. It is not enough to ipso facto be a newspaper article, a magazine article, a poem, or an encyclopedia entry. On the one hand we have things like dictionaries and thesauri, where an entry could be as little as two words. Or a one-sentence notice without byline in a newspaper. Or two rhymed lines (technically a poem) within a 1000-page scholarly monograph.
    To merit this exception it should be reasonable to argue that the "work" in question should exist as a stand-alone mainspace page (not that we generally want that; but as a test for this exception, it should be reasonable to make such an argument). This would clearly apply to moderately long entries in the EB1911 written by a known author that has their own Wikipedia article. It would apply to short stories or novella-length serialisations in literary magazines by authors that have later become famous (or "are still …"). It would apply to various longer-form journalistic material from identifiable journalists (again, rule of thumb is notable enough for enWP article), including things in magazines that have similar properties. For most periodicals the most relevant atomic (indivisible) part is the issue not the entry or article, but with some commonsense exceptions.
    It would, generally, not apply to things that are works by a single author, like a scholarly monograph that just happens to be arranged in "entries" rather than chapters. It would not apply to things that are essentially lists or tables of data. It would not apply to short entries in something encyclopedia-like or entries that are not by an identifiable author. The OED for example, iirc, is a collective work where entries are by multiple not individually identifiable authors (and each entry is mostly very short too); only the overall editor is usually cited.
    For works claiming this exception too, the framing structure should be complete, even if most of it is redlinks. The same general rules about Portal:/WikiProject and no external or Index:-space links apply. An exception would be for periodicals where new issues enter the public domain every year; and we should generally avoid including even redlinks for the non-PD issues here (but may allow them in a WikiProject page). For non-periodical works in multiple volumes where some volumes were published after the PD cutoff, including listings for the non-PD volumes (but not links to scans; those are a copyvio issue) is ok.
    Poems, short stories, and novellas are a special class of works here. A lot of these were first published in a magazine (possibly serialized), and a lot of them exist as multiple editions in substantially the same form. Some exist in multiple versions. These should all primarily exist the same way as chapters as part of their various containing works; but there are some cases where we might want to have, for example, a series of connected pages of the poems of Emily Dickinson. I am significantly ambivalent about this practice, as it amounts to making our own "edition" or "collection" of her poems (in violation of several of our other policies), but I acknowledge that it is an established practice and it is something that has definite value to our readers. It may be that it is actually a practice that should be governed by its own dedicated policy rather than being handled within these other general policies.
    For the sake of example; applying this to the works Inductiveload listed at the start of this thread would shake out something like this:
    Auction Prices of Books—This work appears to have no sensible subdivisions and is in any case by a single author. I see no obvious reason to grant this work an exception, except under sheer volume of work and even there I would want to see both a substantial proportion completed and some kind of ongoing effort towards completion (no particular time frame, but definitely not infinite and definitely not as an effectively abandoned project). In a deletion discussion I would very likely vote to delete the mainspace pages here (but, as nearly always, to keep the Index: and Page: namespace artifacts). I don't see this as a reasonable candidate for a Portal:, nor really a good fit for a WikiProject (though I probably wouldn't object to a WikiProject if someone really wanted one).
    Central Law Journal/Volume 1—A single volume is too little, so I would want to see a complete structure for the entire Central Law Journal, with level of detail for each volume similar to the one existing volume. Each article in the journal can be individually considered for a stand-alone work exception; but for the collection I would want to see at minimum a full issue finished to justify having the mainspace structure, and preferably multiple issues (in a deletion discussion I might insist on multiple issues). Index: and Page:-space artefacts can, of course, stay. A Portal: might make sense for selections from the journal, of articles that meet the standalone work exception. A WikiProject to coordinate work and track links to scans etc. might be a decent fit here, if someone wanted that. As it currently stands I would probably vote delete for the mainspace artefacts (with option to move whatever content has reuse value to a non-mainspace page for preservation; and undeleting if someone wants to work on something is a low bar).
    A Critical Dictionary of English Literature—The top level mainspace page has near-zero value, existing only to link to the single transcribed entry. For a credible claim to exception to exist it would need to be a complete framework for the work as a whole, and significantly more than a single entry must be complete. I would probably also want to see ongoing work, unless a substantial percentage of the entries were complete. The single finished entry is eligible to claim a standalone work exception, but I think it probably would not meet my bar for that (I might be wrong; and the rest of the community might judge it differently). In a deletion discussion I would probably vote to delete all the mainspace artifacts here (as always keeping Index:/Page: stuff) but with a definite possibility that I might be persuaded on the one completed entry (an absolute requirement for convincing me would be to scan-back it: as a separate issue, my tolerance for grandfathering of non-scan-backed works is small, and effectively zero for new/non-grandfathered works).
    Bradshaw's Monthly Railway Guide—Would need a full framework and a number of individual issues finished to merit a mainspace page. I see no credible subdivisions for a standalone work exception, but might be persuaded otherwise if, say, one of the train tables was used as a (reliable primary) source in a Wikipedia article (implying some sort of notability beyond just being raw data). In a deletion discussion I would probably vote to delete all mainspace artifacts here. If anyone made the argument, I would entertain the notion that there is value in treating train tables like poems, and hosting a series of train tables like we do Dickinson's poems; but that would require a substantial number of them completed.
    For everything above my stance is nuanced by a willingness to accept temporary exceptions for things that are actively being worked: active being operative, but with no particular deadline to complete the work. We have differing amounts of time available, and some works are so labour-intensive or tedious to do, that my personal threshold for "active" is a pretty low bar to clear. If it's months and years between every time you dip in and do a bit I might start to get antsy, but days or weeks probably won't faze me. And that the projected time to completion is very long at that pace is not particularly a problem so long as it is not infinite. Within those parameters I would always tend to err on the side of letting contributors just get on with it in peace, regardless of any of the policy-like rules sketched above.
    I also want to emphasise that I think this is a very difficult issue to deal with. There are a lot of competing concerns, and a lot of grey areas that will likely take individual discussions to resolve. My balance point on this issue is partly formed by a broader concern about our overall quality (we have waay too many works of plain sub-par quality, and too many not up to modern standards) and a hope that by preventing the creation of these kinds of works (rather than deleting them after creation) we will be able to retain the good and desirable exceptions without dragging down quality, and without the traumatic and stressful events that deletions and proposed deletion discussions are.
    And for that very reason I am grateful this issue was brought up here for discussion, and I hope we can end up with some clear guidance, possibly in the form of a policy page, going forward. And in any case, since it will create de facto policy, this is a discussion that needs to stay open for a good long while (there are several community members that have not yet commented whose opinion I would wish to hear before closing this), and depending on how well we manage to structure the consensus, may also require a formal vote (up in the #Proposals section). --Xover (talk) 09:03, 6 July 2020 (UTC)
  •   Oppose. It is becoming clear that a policy on incomplete works in the mainspace is going to place enormous pressure on individual editors. I think it would be more effective to start a wikiproject devoted to scan-backing works that lack scans and so on. James500 (talk) 12:14, 6 July 2020 (UTC)
    • @James500: FYI, this thread was made in order to provide an exception to the current policy of "no excerpts". A literal reading of the policy as it stands has a plausible chance of coming down delete on the mainspace pages over at WS:PD. This thread is a chance to come up with a better way to support such partial collective works. That we have several substantially incomplete and abandoned collective works lolling around in mainspace is actually the result of laxity in respect to stated policy (not to say I think it's a bad thing). The deletion proposals, whatever you may think of them, are actually not in contradiction to policy. That said, as always, there is scope to adjust policy. Which is what this is.
    • Now, in terms of a WikiProject to scan back works, I think that is a good idea. See #Re-purpose_WikiProject_OCR_to_WikiProject_Scans above, which proposed to reboot Wikiproject OCR as a scan-backing Wikiproject. Inductiveloadtalk/contribs 14:40, 6 July 2020 (UTC)
      • The policy says "When an entire work is available as a djvu file on commons and an Index page is created here, works are considered in process not excerpts." A literal reading of that policy is that no scan-backed work is an excerpt (it is expected to be completed eventually). Further the policy refers to "Random or selected sections of a larger work". A literal reading of that expression is that it does not include lists of scans, or auxiliary content tables, as they are not "sections" (they are not part of the work), and that not every incomplete portion of a work is either "random or selected" (which would not include starting from the beginning and getting as far as you can, with intent to finish later). I could probably argue that an encyclopedia article or periodical article is a complete work. James500 (talk) 15:16, 6 July 2020 (UTC)
  • Nice wall of text, Xover (and I say that with great respect!) -- it generally makes sense and sounds good to me. As another hopefully illustrative example, take The Works of Voltaire, which I've been digging thru lately. I think this would very much satisfy your criteria as a large work, with sufficient scaffolding to justify the mainspace pages that exist for it. I would love to hear others' thoughts on that. JesseW (talk) 16:07, 6 July 2020 (UTC)
    @JesseW: Yeah, apologies for the length. Brevity is just not my strong suit.
    The Works of Voltaire probably qualifies on sheer scale of work, yes. I don't think the current wikipage at The Works of Voltaire is quite it though: as it currently stands it is more WikiProject than something that should sit in mainspace (its contents are for Wikisource contributors, to organise our effort, not our readers, who want to read finished transcriptions). It also mixes a work page with a versions page in a confusing way. So I would probably say… Move the current page to Wikisource:WikiProject Voltaire; create a new The Works of Voltaire as a pure versions page, linking to…; The Works of Voltaire (1906), that is set up as a work page with the cover and title (and other relevant front matter) of the first volume, and an AuxTOC (and possibly also the {{Works of Voltaire}} volume navigation template). I don't know how tightly coupled the volumes of this edition are (does the first volume have a common ToC or index of works for all the volumes?), so some flexibility on format may be needed to make sense. But as a base rule of thumb it should start from a regular works page and deviate only as needed to accommodate this work (mainly the size is different).
    In any case… With a volume or two completed (they're only ~350 pages each) I'd be perfectly happy having something like that sitting around. With less than that I'd possibly be a bit more iffy, but it's hard to put any kind of hard limit on that. And with somebody actively working on it I'd be in no hurry whatsoever regardless of current level of completion.
    PS. I'm pretty sure a large proportion of the contents of these volumes are works that would qualify under "standalone works" that could exist independently in mainspace, regardless of what's done with the The Works of Voltaire page. Even his individual poems and essays can presumably make a credible claim here (because it's Voltaire; less famous authors would have a higher bar). Better as part of the edition, but also acceptable on their own. --Xover (talk) 16:56, 6 July 2020 (UTC)
  • @JesseW: I personally take no issue with this page's existence (actually I think it's a nice work and a good way to allow an important author's works to be slotted in piece-by-piece). I have some general comments which overlap with this thread (written before Xover's reply, so pardon overlap):
    • First off, I differ with Xover in terms of the scan links: I think they're better than nothing, and I don't see much value in duplicating the volume list onto an auxiliary page just to add scan links. However, I can sympathise with the sentiment that our mainspace shouldn't direct users off-wiki (or at least off-WMF). But if we don't have the scans, and that's what the user wants, they're leaving anyway. Real answer: import moar scans!
    • No scan links are necessary where the volume exists in mainspace and is scan-backed (e.g. v3)
    • Ext scan links should only be used when there is no Index page or imported scan. Use {{small scan link}} or {{Commons link}} when possible (e.g. v2)
    • The first volume list could probably be in an AuxTOC to mark it out as WS-generated content.
    • The "Other editions" section belongs on an auxiliary namespace page (Talk, Portal or Wikisource). I suggest the Talk page is best in this case. Inductiveloadtalk/contribs 17:35, 6 July 2020 (UTC)
  • @Xover: I am in agreement with the majority of what you say. Particularly, I think a framework around any collective work (be it a single-volume biographical dictionary or a 400-issue literary review spanning 80 years) is the critical prerequisite, plus at least some scans, the more the merrier. Where I think I differ:
    • I am inclined to be a bit more relaxed in terms of how much of a work we need. As long as a single article exists, it's not "trivial" (e.g. only a short advert or some incidental text like a "note to correspondents", as opposed to an actual article), it's well-formatted and scan-backed, and a complete framework exists, including front matter and a TOC, such that it is easy for anyone to slot in new pieces, I'd be fairly happy. Lots of periodicals have all sorts of tricky bits like tables of stocks or weather tables and writing into policy that those must be proofread in order to get the "real" articles into mainspace would be a chilling effect, in my opinion. If you allowed an exception, it would be verbose and tricky to capture the spirit without saying "unless, like, it's totally, like, hard, man".
    • I am not dead against scan links in the mainspace at the top level, when such a top-level page exists. See my comments on Voltaire above. I am against them where they could sensibly be on an Author page and they are the only mainspace content.
    • I am ambivalent on the presence of, e.g., disjointed train timetables. It's not my thing to have a smattering of random timetables, but as long as they're individually presented nicely, it's not too offensive to my sensibilities. I might question the sanity of someone who loves doing tables that much, but whatever floats the boats! Also, I think that this might circle back to "good for export" - a mark which certainly would require completed issues or volumes. If you want to get that box ticked, you have to do it all.
    • Re the "notability" aspect of individual articles, I'm not really bothered by that, as I don't think we'll see a flood of total dross because few people really want to take the time to transcribe 1867 articles about cats in a tree from the Nowhere, Arizona Daily Reporter, and, actually I think some of the "dross" can be quite interesting in a slice-of-life kind of a way (always assuming well-formed and scan-backed). And the real dross is usually so bad (no scans, raw OCR, etc) that it can be dealt with outside of this topic. I think part of the value of WS is the tiny, weird and wonderful, not just in blockbusters like War and Peace and Pulitzers. I think I might like to see more of our articles strung together thematically via Portals, but that's another day's issue. Inductiveloadtalk/contribs 17:35, 6 July 2020 (UTC)
      • @Inductiveload: We appear to be mostly in agreement. But… instead of me dropping another wall of text on the remaining points of disagreement, maybe that means we're in a position to try to hash out a draft guidance / policy type page with the rough framework? Then we could go at the remaining issues point by point. Because I think I'm in with a decent chance to persuade you to my point of view on at least some of them, but this thread is fast getting unwieldy (mostly my fault). It would also probably be easier for the community to relate to now, and much easier to lean on in the future. --Xover (talk) 18:31, 6 July 2020 (UTC)
        • @Xover: If there are no more comments forthcoming after a couple of days, I think that makes sense. I don't want to railroad it: considering we have at least one !vote for "do nothing", I'd like to see if there are any other substantially different opinions floating about. Inductiveloadtalk/contribs 17:41, 7 July 2020 (UTC)

The quantity of text here has grown far faster than my ability to absorb it, so rather than continue to put it off, here's my position: I don't see any problem with transcriptions that are scan-backed, even if the transcription only covers a small fraction of the entire scan. If Sally chooses (say) to transcribe a favorite story, that happened to be published in an issue of Harper's back in the 1890s, and goes to the trouble of uploading the full issue, but only creates pages for the one story that interests her, I think that's great. It doesn't matter to me whether she intends to work on the other pages or not. If it's not scan-backed, but it's fairly high quality, I am personally willing to do some work trying to locate a scan and match it up to the text; I'd rather we take that approach, than deletion, though of course deletion is the better option in some cases where the scan is very hard to come by.

If all this has been said above, or if I've misunderstood the topic, my apologies. Please take this comment or leave it, as appropriate. -Pete (talk) 02:00, 8 July 2020 (UTC)

Apologies, I see I had missed the point.

I disagree with Xover's statement that a top-level page for a publication, with a link only to a single article within the publication, has "near-zero value." Such a page can serve an important function linking content together in ways that help the reader (and search engines) find the content they're looking for, or understand the context around it. For instance, A Critical Dictionary of English Literature is linked from the relevant Wikidata entry. The banner on the Wikisource page clearly tells a Wikisource reader that they won't find a full transcription here; and with a simple edit, it could link to a full scan on another site, or (with perhaps a little more effort) even transcription links here on Wikisource. This page has been here since 2010; we don't have any way of knowing what links might have been created elsewhere in the intervening decade. (I do think that new pages like this should not be created without a scan at Commons to be linked to.) -Pete (talk) 02:12, 8 July 2020 (UTC)

A brainstorm...checking OCRs against one another

I had an idea for a text processing script, and I'm curious if something like this has been tried before, exists, etc.

Let's use this example. Internet Archive has four separate scans of this text, All Over Oregon and Washington. For the purpose of expressing my idea clearly, let's assume that all four are the same edition of the book, but different physical books, and possibly run through different OCR engines. The goal would be to get an even more accurate OCR of the original text, by comparing all four.

A program could find which line corresponds to which, and then if two or three of them agree on a specific line, but others disagree, it would take the line with the strongest agreement. Then it would output a single text file representing its best estimate of the most accurate OCR. This would eliminate some errors resulting from notes and scribbles on the page, dust in the scanning apparatus, and maybe some OCR engine errors as well.

Is this an idea that exists in any software already? And/or, for anybody with coding skills...can you assess how "doable" it would be? -Pete (talk) 21:42, 3 July 2020 (UTC)

Assuming the page matching is done (either based on image, OCR or manual), I imagine the process would be something like comparing each line of each page against each line of the other page using something like w:Levenshtein distance. Then, when you have a list of sets of likely line matches, you need to figure out which is "best". For that, you can use a majority-vote system, but you'd need at least 3 scans. Another heuristic I can think of might be to check if words that fail a spell check pass the spell check in an alternative scan. Then take the one that passes. In your work, for example, two different scans have:
topography, scenei^, soil, climate, productions, and improve- 
-^ topography, scenery, soil, climate, productions, and improve- 
So you can see you might be able to choose "scenery", here. On the other hand, the "scenery" scan has junk at the start that "scenei^" doesn't. And sometimes no scan might have a correct word:
The Columhia s log-book certainly does not betray 
The OdunibialB log-book certainly does not betray 
In this case, you might figure that "Columhia" is closer to a real word and choose that (correcting it is optional).
I think that while you certainly might be able to get some benefit out of this, the need for multiple scans that can be page-matched and line-matched (often one OCR breaks lines or interleaves columns, which will be awkward) makes it a bit unwieldy in the general case. I think you'd probably get comparable results from a post-processing script that exploits knowledge of common OCR mistakes combined with knowledge of what letters can and cannot appear in words. For example, "mhia" never occurs in English, so you can probably correct to "mbia". The advantage is that it's independent of having multiple scans. I have such a script that I've been working on: User:Inductiveload/cleanup.js, but it needs quite a bit more work. But you should be able to use it right now. There are configuration options (e.g. you can turn on a rudimentary long-s corrector, or disable corrections that might damage German words like "und" -> "and").
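As an illustration of the single-scan approach, here is a minimal Python sketch of a rule-based corrector. It is not the code in User:Inductiveload/cleanup.js; the names `RULES`, `fix_word` and `fix_text` are hypothetical, and only the "mhia" to "mbia" rule comes from the discussion above.

```python
import re

# Hypothetical rule table: letter sequences assumed never to occur in
# English, mapped to the sequence the OCR most likely misread.
# Only "mhia" -> "mbia" is taken from the discussion; a real table
# would have to be built and checked by hand.
RULES = {
    "mhia": "mbia",
}


def fix_word(word: str, rules: dict[str, str] = RULES) -> str:
    """Apply every substitution rule to a single word."""
    for bad, good in rules.items():
        word = word.replace(bad, good)
    return word


def fix_text(text: str, rules: dict[str, str] = RULES) -> str:
    """Correct alphabetic runs only, leaving punctuation and spacing alone."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_word(m.group(0), rules), text)


print(fix_text("The Columhia s log-book certainly does not betray"))
# prints: The Columbia s log-book certainly does not betray
```

The rules here are case-sensitive and purely lexical, so the missing apostrophe in "Columbia s" is left alone. A per-language rule table would be one way to avoid damaging words in other languages.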
Another thing that might boost OCR accuracy is training Tesseract on "similar" works specifically. Tesseract 4 is trained on vast amounts of synthetic text in hundreds of fonts; it's not impossible that retraining on ground truth data from exactly the kind of work you want might improve things. Many works we have are very similar in terms of typesetting, so perhaps retraining on, say, the Google scan of this work might improve things for other Google scans, which generally don't OCR quite as well as the "full-colour" IA scans (though the IA uses Abbyy, not Tesseract). That said, I have tried to train Tesseract 4 in the past, and it's a real pain to generate suitable ground truth image/text pairs, and you really need lots and lots. And if you "overtrain" the network, you might need multiple models, one for "Google scans", one for "low quality newsprint", one for "1700s printing", etc. Inductiveload (talk/contribs) 14:49, 4 July 2020 (UTC)

Tech News: 2020-28

20:18, 6 July 2020 (UTC)