Scriptorium (Help)
The Scriptorium is Wikisource's community discussion page. This subpage is especially designated for requests for help from more experienced Wikisourcers. Feel free to ask questions or leave comments. You may join any current discussion or start a new one. Project members can often be found in the #wikisource IRC channel (a web client is available).

This page is automatically archived by Wikisource-bot

Have you seen our help pages and FAQs?

Multiple works on the same topic

I have recently transcribed four works about the forger, William Booth. He is not known to have ever published anything, so does not require an 'Author:' page. How can I group the works together? Would Wikisource policy support a category, or a portal page? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 08:02, 11 August 2020 (UTC)

Wikisource:Portal guidelines suggests that a portal would be appropriate. Moreover, person-based categories are generally not used here. BethNaught (talk) 08:12, 11 August 2020 (UTC)
@Pigsonthewing: definitely portal, you can use either {{person}} or {{portal header}}. Categorise to category:People in portal namespace. We would also link that portal to the person item in WD. — billinghurst sDrewth 23:17, 11 August 2020 (UTC)
Done. Why doesn't {{person}} automatically apply category:People in portal namespace? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 09:06, 12 August 2020 (UTC)
Probably just hasn't had the focus as they were developed separately. {{person}} desperately needs to be converted to be Wikidata native as default. — billinghurst sDrewth 01:23, 20 August 2020 (UTC)

Anchors for Sidenotes


Affected templates: {{left sidenote}} {{right sidenote}} {{LR sidenote}} {{RL sidenote}}

Page:Ruffhead - The Statutes at Large - vol 2.djvu/52 has sidenotes, and these can be associated with a particular section of the text. Whilst I could add {{anchor}}, it would be sensible if anchors could be added from the sidenote templates directly, so as to limit the number of templates needed in a long work.

The fix would be to do something like:

  <span {{#if:{{{@|{{{anchor|}}}}}}|id="{{{@|{{{anchor|}}}}}}"}} ...

in the relevant templates, so that the x.y style numbering suggested previously could be included directly, for this and other works using this template family. unsigned comment by ShakespeareFan00 (talk) .

Interwiki to Translation namespace not appearing

The page Translation:Tikunei_Zohar is linked to its source-language he.wikisource page. However, although the link appears on the he: page, it does not appear on the en: page.

Importing from PGDP

The following discussion is closed and will soon be archived:
We stepped into this one with perhaps more enthusiasm than sense, so best let this lie for now.

I was wondering if there is a way to import a project from PGDP to Wikisource. The works on PGDP have images and corrected text with formatting. I know that the formatting will need to be wikified, is there a tool to do this? Or is this a request best posted on Phabricator? Languageseeker (talk) 16:07, 26 February 2021 (UTC)

@Languageseeker: Please don't. Project Gutenberg texts are not generally of any particular edition of a work (amalgamations of multiple editions in some cases), and their transcribers sometimes "innovate" in various ways (modernised or americanised spellings, for example). Works here should generally start with uploading a scan and then proofreading against that scan; and the raw OCR in the scan will usually be a lot better for that than the PG text. If you don't care about the fidelity of the text, why not just read it on PG directly? --Xover (talk) 16:23, 26 February 2021 (UTC)
Totally with you on account of Project Gutenberg. I would advocate for a removal of all texts from Project Gutenberg. However, it's not Project Gutenberg, but Distributed Proofreaders that feeds into Project Gutenberg. They have proofread texts with formatting and the original images. So, we would get proofread texts that we could compare and validate against the original images. See, for example, [1] (login required) Languageseeker (talk) 17:31, 26 February 2021 (UTC)
@Languageseeker: My apologies. I have obviously not been entirely clear on the distinction between PG and DP. Having a quick look at their guidelines it appears at least mostly compatible with our practices, so they could certainly be one source of text for us (provided what they actually output matches the guidelines, which I haven't checked). We'd need to find some technical way to import page by page to a scan hosted here so we can run our own Proofreading (just with a better starting point than OCR) and to make sure our texts are validatable to that scan for our readers. Possibly a mechanism akin to Help:Match and split, and it would probably require DP to have something API-ish that we could consume, but overall it should be feasible. --Xover (talk) 18:53, 26 February 2021 (UTC)
@Xover: Created a phabricator task. Hope it gets done. Languageseeker (talk) 20:27, 26 February 2021 (UTC)
@Languageseeker: this is an interesting idea, but an importer would almost certainly be done as an external tool that constructs a matching DJVU file from page images, feeds data in over the MediaWiki API, and then uploads the pages. I do wonder if it can be fully automated. The biggest worry so far, after logging in and sniffing around a bit, is that I cannot find page images for the "complete" works, nor a reliable link to something like the IA.
Also, I'm rather jealous of their velocity: even with such a huge number of review stages, they're clocking 140 works a month.
The other problem is that they do not format works to our level: for example, "--" instead of "—", capitals rather than small caps (they do mark this up), no centering, no sizing, etc.
On the subject of Phabricator, I've recently been wishing for a way to track enWS tasks, since they often have dependencies. Does anyone know if we can use Phabricator for that? Can we ask for a project? For example "move {{header}} to module logic". Inductiveloadtalk/contribs 20:53, 26 February 2021 (UTC)
(e/c) At present, the instructions for Match&Split specifically exclude DP works. However, IF a DP work is based on a single edition and the other criteria are met, then the Match & Split tool is fine. Certainly some of Laverock's contributions were done this way, and the EB11 project is also utilising a version of the process. We would still require the normal enWS validation process. Beeswaxcandle (talk) 21:00, 26 February 2021 (UTC)
@Beeswaxcandle:, DP provides a file split by page, so you can in theory do better than M&S. However, you do need to figure out where the scan came from (hopefully the IA) and work out the offset (their page 1 is not the front cover) or construct a scan from the DP images, if present. The bigger challenge will be to write a parser for their markup, because it'd be a shame to junk it all. Inductiveloadtalk/contribs 21:30, 26 February 2021 (UTC)
So, I made a really shonky script to import a DP page-by-page text file: User:Inductiveload/dp_reformat.js, using the magic of regex. It seems to have worked OK: Index:The ways of war - Kettle - 1917.pdf. However, the biggest issue I see is that once DP "archives" a project, the links to the marked-up source are removed from public view, as well as the page images. I'm unsure of why they do this, but it makes it all-but impossible to do a perfect match/split on the work, even if you can hook it up with a matching edition's scan. Inductiveloadtalk/contribs 10:52, 1 March 2021 (UTC)
The long and short is that DP does not appear willing to share their archived projects. So, making a tool that is specific to DP makes little sense. However, I still think that it makes sense to create a tool that would allow us to import OCR from a different source or replace the image files. So, I made a different phabricator ticket. Languageseeker (talk) 14:36, 1 March 2021 (UTC)
@Languageseeker: as long as you can massage the text into "Match and Split" format, you can already drive mass page uploads directly though the normal Wikisource interface. For the case of the User:Inductiveload/dp_reformat.js, this script will (attempt to) transform raw DP text into split-ready text with as much wikiformatting as it can. I will add some quick docs at User:Inductiveload/dp_reformat. It might not work for every type of DP project (since AIUI, different projects have different formatting standards). Inductiveloadtalk/contribs 15:06, 1 March 2021 (UTC)
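To illustrate what "split-ready" means here, below is a minimal sketch (in Python, not the actual JavaScript of dp_reformat.js) of splitting a DP concatenated text file into per-page chunks. The page-separator format and the sample text are assumptions for the example, not verified against DP's current output:

```python
import re

# Hypothetical separator format: DP concatenated texts mark page boundaries
# with lines like "-----File: 001.png-----...". Splitting on those lines
# yields one text chunk per scan image, ready to be offset-matched against
# the Index pages.
PAGE_SEP = re.compile(r"^-+File: (\S+?)\.(?:png|jpg)-*.*$", re.MULTILINE)

def split_dp_text(concatenated: str) -> list[tuple[str, str]]:
    """Return (image_name, page_text) pairs from a DP concatenated file."""
    parts = PAGE_SEP.split(concatenated)
    # parts = [preamble, name1, text1, name2, text2, ...]
    names = parts[1::2]
    texts = [t.strip("\n") for t in parts[2::2]]
    return list(zip(names, texts))

sample = (
    "-----File: 001.png----------------------\n"
    "First page text.\n"
    "-----File: 002.png----------------------\n"
    "Second page text.\n"
)
pages = split_dp_text(sample)
```

Each chunk could then be mapped, with an offset, onto the corresponding Page: namespace page, which is essentially what a Match-and-Split-style workflow needs.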
@Inductiveload: Your script is utterly amazing. I'm astonished. I've used it on several books and it is great. I do have one bug and one suggestion:
  • Bug: If the offset is negative, you need to type the number first before you can insert a minus sign.
  • Request: Can you make the Index menu a drop-down menu so that the tool can be used on non-English Wikisource pages? For example, for French it would be "Livre" and "Page".
Also, is it possible to redo the match and split if the results are incorrect? I started one for Index:The American encyclopedia of history, biography and travel (IA americanencyclop00blak).pdf and it turns out they removed several blank pages from inside the book, so I would need to rerun the script. In the past, when I tried to do this, it failed silently.
BTW: Everything from [2] upwards is still available on PGDP. It might be good to do a collective project to add these to Wikisource before the files are archived. Languageseeker (talk) 18:20, 2 March 2021 (UTC)
@Languageseeker: I looked at the offset and I think it's a bug in OOUI (phab:T276265).
Re the Index drop down, the namespace names "Index" and "Page" are canonical, so the script should just work at other Wikisources. E.g. s:it:Index:Peregrinaggio di tre giovani figliuoli del re di Serendippo.djvu works, even though the local namespace is "Indice". Let me know if it does not.
As for fixing a bad split, this is best fixed by a bot and admin, otherwise the redirects make a mess. Let me know the range to be moved and the offset and I'll sort it for you.
I am working on salvaging the texts at F2 levels (~1600). Inductiveloadtalk/contribs 19:26, 2 March 2021 (UTC)
Haha, you're awesome. Thanks for salvaging those texts. The ones that are posted to PG are archived first, so it probably makes sense to salvage those first. It's such a rich source.
For the problem with the merge, starting with Page:The American encyclopedia of history, biography and travel (IA americanencyclop00blak).pdf/22, the text should be moved +2 pages. So 22 has the text for 24. Languageseeker (talk) 21:15, 2 March 2021 (UTC)
I tried splitting a French book and it mostly works, except that Modèle:Nop and Modèle:Ch don't work. Languageseeker (talk) 21:15, 2 March 2021 (UTC)
@Languageseeker: Move underway for the misaligned pages. In future, please check before splitting that the alignment is correct. It is annoying, I know, but if they mess with the pages, them's the breaks.
Re the French templates, I guess it's possible to handle the other subdomains, as long as you know what to map each formatting element to. E.g. I think the French Wikisource has an equivalent of our "nop". But it will take a little bit of a fiddle to do so. You can also do the replacements in a text editor if you know what you want to replace with.
Also, I wonder where to put the text files - they total over 950MB when uncompressed! Maybe the IA? Inductiveloadtalk/contribs 23:18, 2 March 2021 (UTC)
Thanks. I checked a few pages in the beginning and this one tricked me.
The IA might be a good place, or you can batch upload them to Commons. It might be good to store the original OCR for the future. You never know.
A few more markup rules for your script: [ae] = æ; [oe] = œ; {{...}} = … Languageseeker (talk) 02:19, 3 March 2021 (UTC)
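Replacements like the ones listed above are straightforward to express as a regex table. This is a hedged Python sketch of the idea only (dp_reformat.js is JavaScript, and its actual rules may differ):

```python
import re

# Illustrative DP-to-Wikisource replacements: bracketed ligatures and "--"
# for an em-dash are rewritten to the characters Wikisource uses directly.
DP_TO_WS = [
    (re.compile(r"\[ae\]"), "æ"),
    (re.compile(r"\[oe\]"), "œ"),
    # "--" but not part of a longer dash run such as "---"
    (re.compile(r"(?<!-)--(?!-)"), "—"),
]

def reformat_dp(text: str) -> str:
    for pattern, replacement in DP_TO_WS:
        text = pattern.sub(replacement, text)
    return text

print(reformat_dp("C[ae]sar's man[oe]uvre--at dawn"))
# prints: Cæsar's manœuvre—at dawn
```

Adding a new rule is then just another (pattern, replacement) pair, which is what makes this table style easy to extend per subdomain.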
@Languageseeker: Commons doesn't accept random zips/txt files, though. Anyway fill your boots:
Re the OCR, I'm not really sure about that; as long as we have the scan, we can OCR to our hearts' content.
{{...}} is actually a WS template, it's designed to prevent a line break in the middle of ". . ." Inductiveloadtalk/contribs 09:09, 3 March 2021 (UTC)
Given DP don't have a publicly available archive of their completed projects, and have stated that is intentional, I'm not sure harvesting everything that is available and posting it ... to a publicly available archive ... is a great way to make friends! Nickw25 (talk) 08:24, 4 March 2021 (UTC)

All, it's important to note the position of DP in relation to the activity in this space. In summary, DP have stated a view that WS should not use their in-progress texts, in line with their community wishes. This is stated by the DP General Manager in the forum discussion on this topic at their site, which can be easily located. It was stated on the first phab ticket above that DP pointed the enquirer to in-progress texts rather than archived ones. That was never stated publicly there. I don't know if it was stated privately or not, although it is no longer the position, if it ever was. It's also stated by their administrators in the same forum that the bulk harvesting of texts (presumably the same referenced above) was so disruptive it caused their server to become unresponsive for a period. Maybe unintentional, although destabilising other projects' servers to harvest information they don't want harvested cannot be the standard for a WMF project. Given this, I'd think it's reasonable for WS volunteers to refrain from harvesting and bulk importing content from DP, given their community wishes. Disclaimer that I'm a volunteer at DP as well, and have been for many years on and off. To be clear, I'm a standard volunteer there, as here, and have no more knowledge other than what has been publicly posted on their forums. Nickw25 (talk) 02:35, 6 March 2021 (UTC)

@Nickw25: I'm somewhat familiar with the situation. Inductiveload archived the project pages and concatenated texts because DP removes the images and text from DP soon after they get posted to PG. It did not cause the server crash. We asked if it was possible to just obtain the concatenated text afterwards and they said no. I'm not sure what the exact issue is. It seems to be a moral/philosophical issue rather than a legal one. They asserted no copyright claim, just a statement that downloading texts is a subversive activity that disrupts the core mission of the site. I'm sure that millions of authors who have had their texts lapse into the public domain would like to restrict copying of their work with a similar argument, but that is not how the PD works. If the text is an exact mechanical reproduction of a text in the PD, then the reproduction is in the PD. You cannot copyright a PD work. Languageseeker (talk) 02:55, 6 March 2021 (UTC)
The issue is not the public domain status of the material. If a physical library has a work that is in the public domain, you're not entitled to photocopy it if the library says their policy is no photocopying; you are still subject to the terms and conditions of entrance to the library and any conditions they put on your access to the resource. Given those files require being a member to access, DP is entitled to set the terms and conditions for said access. I've been around there for a long time, and I understand why they have arrived at the conclusion they have -- they've arrived at it many times before this request. Either way, most websites have a fairly dim view of that kind of scraping. While I think it is a shame they aren't open to sharing their outputs more widely, nobody is being forced to volunteer there. Unfortunately such aggressive tactics don't win friends and influence people; they just result in more technical barriers being implemented and tighter T&Cs than would have otherwise existed. Frankly, harvesting everything like that is a form of information colonialism in my opinion; it comes from a position of deep entitlement, shows no respect for the community that is there, and damages the reputation of WMF projects along the way. If you disagree that strongly, those energies might be better directed to improving WS to learn from the attributes of DP that make it so effective, despite not necessarily having all that many more volunteers than WS. Nickw25 (talk) 04:44, 6 March 2021 (UTC)
@Nickw25: I read carefully through the Code of Conduct and there is nothing that prohibits the copying of texts to other sites. It only states "Volunteers or guests must not intentionally harm or subvert DP processes or systems." If downloading a concatenated file harms or subverts their website, why do they have that option on every project page? If they want to write a stricter code of conduct, then I would welcome that. I think that volunteers should be allowed to know exactly what terms they are agreeing to. My intention was never to go over there to scrape all their files and I never did. I cannot speak for anybody else. I was hoping to be able to gain access on a case-by-case basis and not a massive batch import. For example, if we decided to add "Waterless Cooking for Better Meals, Better Health," then I hoped that we could ask DP for the concatenated text file and receive it. We might do so because the current version on PG is not disability friendly with dark text on a blue background. Even with the files on the IA, it would take weeks if not months of continuous work to match and split them all. Languageseeker (talk) 05:25, 6 March 2021 (UTC)
I agree that their code of conduct could be clearer around their expectations, I suspect that that will now be followed up. The ability to download is a little ambiguous. DP has been around since before Wikipedia; when it was put there I doubt anyone thought to use it for anything other than DP things. That said, in reality, if you quietly downloaded a few projects that were still active that you had some kind of 'personal use' for, it would have very very likely gone under the radar, especially if you were cautious to remove non public-domain elements (i.e. volunteer annotations). My interpretation of their code of conduct is that doing that at scale however, would be frowned upon as I interpret harvesting as subverting systems (they were made for human use). A bit like using the photocopier at work for personal use, the occasional page is OK but don't make a habit of it. One of those tricky to navigate grey zones, and like most small organisations, it takes getting to know the culture + what is actually written to figure out how things work--and the internet can make that more difficult.
Anyway, looking at Waterless Cooking; copyright logistics on that one aside, provided you can source a scan-set (DP do make their scans publicly available when they archive a project), my question would be what is stopping WS from working back from the final PG file? The PG file has the page numbering embedded, and the text could easily be matched up, copied, and pasted into WS (or a tool could be developed to extract based on the embedded page numbering). The transcriber's note acknowledges a couple of silent corrections, which is what we'd have to keep an eye out for when touching it up.
I would say many of the 'reasons against' PG on WS seem outdated, and mainly affect earlier titles in PG's collection. For many years, everything that goes to PG from DP has been expected to be from a single edition, doesn't get modernised, generally has a list of corrections made, and at worst has a few silent corrections to obvious errors that WS needs to keep an eye out for. I'd say DP-processed files at PG would be the most reusable items from PG for WS, given how DP work. It's certainly better than starting with raw OCR! I'd add the DP team have a point that the post-processed file on PG might be better, even if a bit more tedious to work with. I've done 4.5k pages in their formatting rounds, and previously post-processed a handful of titles there. I assure you things do slip through the 3 proofing (transcription) rounds. The PP and SR processes are quite likely to pick up on a number of those. Nickw25 (talk) 10:16, 7 March 2021 (UTC)
@Nickw25: I agree entirely, and I think you put the issue forcefully and cogently. In this particular case we've treated DP like some faceless entity rather than as a community with a culture and values of their own. While we may not have done anything wrong in any formal sense, we've acted hastily, and quite possibly brashly, without sufficient concern for a kindred project and its community. In comparable circumstances there would be outrage on the Wikimedia side.
There are probably some irreconcilable philosophical differences between our two projects (I suspect we're in The Cathedral and the Bazaar territory) that may make close cooperation effectively impossible. But we share similar enough interests and goals that we really should be cooperating wherever possible, and coexisting in an environment of mutual respect. That probably means we need to start by building some cross-cultural understanding so that we have a basis for dialogue, which in turn might let us identify what we actually agree on (which is probably a lot more than what we disagree on). --Xover (talk) 09:14, 6 March 2021 (UTC)
@Xover: I wouldn't entirely agree with this statement. The original proposal was to write a parser to allow for the importation of PGDP text files on a case-by-case basis. Then, we found out that DP remove the text files and images about three months after posting to PG. The immediate question became whether it would make sense to write a dedicated parser if we cannot get access to the files. So, I asked my friend, who is a DP volunteer, if she could find out whether we could get access to the files. The basic answer was absolutely not, under any circumstance; DP never has and never will share its in-process files with anyone. PG is all you get. At that point, I believe, Inductiveload independently decided to archive the text files so that we could have more time. Since I was worried about copyright, I asked my friend to find out if the files are under copyright. DP would not give her a direct answer.
I don't think that time would help at all. DP simply believes that its sole purpose is to create ebooks for PG. When it comes to user contributions, the idea is that they belong to DP and, more specifically, the site administrators. They characterized Wikisource as a project that produces inferior ebooks that will never meet their standard of quality. I'm not happy about the outcome or how the conversation went, but they left no room for dialogue. Languageseeker (talk) 02:31, 7 March 2021 (UTC)
I'd agree Xover that I hope at some point in the future there can at least be some mutual understanding. I'd never say never languageseeker. Anyway, a few points on all the discussion above:
  • Above Languageseeker stated DP said they won't provide access under any circumstances. As such I think we just need to draw a line under it, respect their decision and wishes, and move on, even if we disagree. I also think it is important we don't bulk import items that were harvested. To do so risks diminishing WS's reputation in the DP community, but also at Project Gutenberg (where there is volunteer crossover) and potentially more broadly--it's a small world (even if most of us can't go anywhere right now!).
  • From what I can tell, DP shut the conversation down only after it became apparent that the bulk harvesting of texts had occurred against their wishes. I'd suggest that is really what crossed the line with them. It's an unfortunate series of events, although, regardless of who did what or how we got there, it was unsurprisingly interpreted as related and determined to be a bad-faith move on their part. I suspect they don't want to hear from anyone claiming to be from WS for a while.
  • As background, to understand the culture of DP, you need to consider them akin to a small local all-volunteer community group in your neighborhood and approach them as such. Anyone who has spent some time perusing their forums would be aware they are a very small organisation, overseen with a traditional governance structure (i.e. a board), run on a shoe-string budget of a couple of hundred dollars a month, and kept going by a very small group of people that have basically made it their FT job. There is no WMF on standby to take care of a whole bunch of foundationals or step in if required, and that no doubt contributes to their independent mindset. PG is similar. Even so, the community votes in the Board and has a say in the leadership and strategic direction. While some folks there might like them to be more open, it is clear, in WikiSpeak, that the Community Consensus on DP is they produce for PG and they are happy with that scope for now. There has been a lot of community upheaval at various points in their 20-year history. At the moment their community is settled and calm. As such, I suspect DP's leadership group are equally prioritising community harmony - as the leader of their community should. One good way to do that is keep the mission clear! As with small community groups, change doesn't come quickly, and rarely happens because an outsider turns up. Such change takes many months, if not years, generally has a strong and trusted advocate within the organisation, and is quite complex and doesn't always succeed. It certainly doesn't happen inside of a week. Nor need it; DP and WS have both been doing their thing for the better part of 2 decades, there is no reason for haste. Nickw25 (talk) 10:16, 7 March 2021 (UTC)
They shut down the conversation before the harvesting happened. It was precisely their unwillingness to share the archived works that led to the bulk harvesting. I don't want to be specific because the conversations were private, but they claim ownership of the work of their volunteers and will not share any archived material. In the end, my own analysis is that they are afraid that Wikisource will destroy their community by attracting away volunteers. They have spent decades building the community and they don't want to lose it. Which I completely understand and accept. They want to be left alone and I'm leaving them alone. I don't think that wikisource should approach them for help in the future. Languageseeker (talk) 02:32, 8 March 2021 (UTC)
Appreciate you've come to your conclusions, although I would not agree that, after both being around for this long, they are threatened by WS. At any rate, the issue is now not what happened, the legalese or interpretation of their T&Cs, what misunderstandings took place, or why they may not like WS. In the long term, what WS should not do is engage DP the way that occurred on this occasion. We must expect genuine collaborations to take time, be based on trust + mutual benefit, and start from a place of curiosity, rather than WS just wanting something.
The central issue now is the ethics of using that material. I've previously stated my personal expectation is that WMF projects uphold the highest standards. You state they have asserted ownership, and they've otherwise stated WS does not have permission to use the material. The raw download files, as is, contain material that is quite arguably not in the public domain. If I'm not mistaken, your edit history shows you continue to import their files in bulk, as recently as a few hours ago? I'm curious: do you intend to continue, given all the above? Given the potential damage to WS's reputation and that DP are unwilling collaborators in this instance, where is the community consensus to proceed, especially given this course of action does not align with WMF values as far as I can tell? Those values include statements like we will be "caring neighbors" and "humbly learn from our mistakes". I'd respectfully submit it is time we do those two things in relation to this matter. There is much work to do on WS -- why persist with something that has the potential to be so damaging to our community? It's not a race to see who can get the most books, and the way we do things is just as, if not more, important than what we do. Nickw25 (talk) 11:13, 8 March 2021 (UTC)
@Nickw25: I think it is not overstepping if I suggest that the community at Wikisource now acknowledge that we've acted like a bit of a bull in a china shop here, and that this can beneficially be conveyed to the DP community. We seem to have several philosophical differences that make not just cooperation, but even just dialogue, a challenge. But, ultimately, I think both projects share a fundamental goal of making these works available; and both projects value accuracy and quality in the texts we produce. That, to me, seems like enough of a foundation to build on. Let's try to at least keep the lines of communication open and eyes open for any issue where our interests might align (increasingly draconian copyright, as one obvious shared concern). If someone who volunteers on both projects feels like acting as a bit of an informal liaison, I think that would probably be a very good idea. --Xover (talk) 15:58, 8 March 2021 (UTC)
I am sorry for my actions in over-eagerly downloading the text. I have removed the public IA item. I didn't intend to cause such a problem, it was a "I wonder if I can" exercise that got out of hand. It was rude and thoughtless to proceed at such a rate. Inductiveloadtalk/contribs 13:36, 8 March 2021 (UTC)
I apologize as well. I never meant to cause the community any harm. I truly hoped that we could establish some form of collaboration, but they merely want to be left alone. Apologies were extended to the senior leadership of PGDP who consider the matter resolved. Languageseeker (talk) 13:49, 8 March 2021 (UTC)
  This section is considered resolved, for the purposes of archiving. If you disagree, replace this template with your comment. Xover (talk) 14:12, 30 March 2021 (UTC)

Project to Match and Split OCR from Distributed Proofreaders

The following discussion is closed and will soon be archived:
We stepped into this one with perhaps more enthusiasm than sense, so best let this lie for now.

Distributed Proofreaders has several thousand projects to proofread and correct texts against a single-edition work. Most of their projects derive their scans from the IA or Google. However, they archive the scans after posting to PG. The project is to download the proofread texts, match them with their appropriate scans, and post them here. Inductiveload has created an awesome tool that makes it possible to preserve much of the formatting and easily get the text into a format ready for match-and-split.

Here are the requirements:

  1. Install User:Inductiveload/dp_reformat.js

Project Instructions

  1. Pick a project from one that Inductiveload archived at [3].
  2. The "Concatenated Text File" is in the zip file and the Project Description is in the matching HTML.
  3. See if the Project Description gives the original location for the scans. If it doesn't, you'll need to manually match the work.
  4. Create an index file for the work and make sure to create page numbers.
  5. Go to the Sandbox. Select Edit and paste in the text from the text file inside the zip that you downloaded from Distributed Proofreaders.
  6. Select Reformat DP text in the Tool section of the left side bar.
  7. You will need to paste in the Index name for the work and calculate the offset. Distributed Proofreaders deletes the first few pages, so you will need to do a bit of math to get the correct number.
  8. Either select Show Preview or Publish Changes. Verify the pages. Distributed Proofreaders sometimes remove blank pages from within the work and you will need to check a few pages to make sure that the offset is correct.
    1. If there are multiple offsets, you will need to do the match and split in stages.
  9. A tab called "Split" will appear next to the Discussion tab. Select Split. If you want to verify that your split started, visit [4]
  10. Once you have done this, record that you have imported the work at User:Languageseeker/PGDP.
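The offset arithmetic in steps 7 and 8 can be sketched as follows (a minimal illustration; the helper names and the example page numbers are hypothetical, not part of any tool):

```python
# Distributed Proofreaders text starts at its first proofread page, while
# the Index file numbers every scan page, so the two numberings differ by
# a constant offset (until a removed blank page shifts it again).

def scan_offset(known_scan_page: int, known_dp_page: int) -> int:
    """Offset to add to a DP page number to get the scan page number."""
    return known_scan_page - known_dp_page

def scan_page(dp_page: int, offset: int) -> int:
    """Scan page in the Index file for a given DP page number."""
    return dp_page + offset

# Hypothetical example: DP page 1 corresponds to scan page 9 (eight
# front-matter pages were skipped), so DP page 42 lands on scan page 50.
offset = scan_offset(known_scan_page=9, known_dp_page=1)
```

If the work has multiple offsets (step 8.1), recompute the offset at each point where DP removed a page and split the run there.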

Disclaimer, this is a personal project and has no official sanction. Just asking for help from the community. Languageseeker (talk) 23:25, 2 March 2021 (UTC)

Follow-up note: Inductiveload's script will work on non-English Wikisources, but the markup will need to be manually updated.

Are the downloaded texts coming from Gutenberg? We have had poor success with their "proofread" texts. They often mix-and-match editions, have modernizations, or other problems that make them incompatible with Wikisource standards. --EncycloPetey (talk) 00:09, 3 March 2021 (UTC)
No, Distributed Proofreaders takes a single source, usually from IA or Google Books, proofreads the book against the source images, and then processes it and posts it to PG. They are strict about matching the text to the source images prior to posting to PG. During post-processing, before they post to PG, they sometimes correct errata and deviate from the source text. The files on Distributed Proofreaders match the source text. You can take a look at the couple of books that I matched-and-split from Distributed Proofreaders as an example. The major thing that is lost in the Distributed Proofreaders sources is the header and footer, but this is easier to add in than proofreading the text. Languageseeker (talk) 01:08, 3 March 2021 (UTC)
Just commenting on the concerns about PG. Many (most?) of their texts now come from DP, who expect them to be a single edition. Most of DP's final output will list the changes made, and many also state if further silent changes were made. PG has come a ways from the days of refusing to acknowledge which edition something was prepared from (as I understood they did). There is certainly a subset of projects from PG that are lower risk from a WS perspective. They'd have been processed at DP (identifiable in the PG credits line), have an easily identifiable scan set (sometimes referenced in the PG text, otherwise traceable via the DP project comments, or a bit of investigative work) and have a transcriber's note stating that silent changes were not made. Nickw25 (talk) 02:41, 6 March 2021 (UTC)

Also posting here to link to my comments above, that DP do not want WS using their in-flight works for this purpose. See that post for more detail. Nickw25 (talk) 02:45, 6 March 2021 (UTC)

  This section is considered resolved, for the purposes of archiving. If you disagree, replace this template with your comment. Xover (talk) 14:12, 30 March 2021 (UTC)

3 versions of Pratt's History of MusicEdit

There is an unsourced and incomplete version in the Main The History of Music and two transcription projects: Index:Pratt - The history of music (1907).djvu and Index:Pratt - The history of music (1907 Preface Variant).djvu. If The History of Music could be moved to The History of Music (Pratt, unsourced), I can take it from the redirect. Thanks.--RaboKarbakian (talk) 17:08, 7 March 2021 (UTC)

Pratt - The history of music (1907).djvu is the original printing; Pratt - The history of music (1907 Preface Variant).djvu is a later reprint of the 1907 edition with a new preface, a list of deaths after the appendix, and the removal of blank pages. The transcription is probably sourced from Pratt - The history of music (1907).djvu, but, outside of the images, I'm not sure how much value it contains as it was last edited in 2007 and the text in Pratt - The history of music (1907).djvu comes from a good source. Languageseeker (talk) 17:36, 7 March 2021 (UTC)
They have a {{versions}} here. I was told to ask an admin to move things due to clean-up of redirects being easier with some tools they have. There is a broken, unfinished version in Main that needs moving. So, I ask here for that.--RaboKarbakian (talk) 15:53, 8 March 2021 (UTC)
@RaboKarbakian: Just transcribe, and when you have the pages, just transcluded into the existing pages. No point in moving an unsourced, unfinished work, we just overwrite with something verifiable. — billinghurst sDrewth 09:58, 11 March 2021 (UTC)
@Billinghurst: I would have suggested to delete the 2007 work that has nothing to transcribe as I understand the definition of that word here. But the way of the sourcerers, usually, is to move it from the Main namespace. I might be completely confused, so perhaps you can provide a link to where it is to be transcribed at and I will work at it....--RaboKarbakian (talk) 14:21, 11 March 2021 (UTC)
@Billinghurst, @RaboKarbakian: Seconded, there is a sourced copy with good text to replace the unsourced copy. Languageseeker (talk) 14:30, 11 March 2021 (UTC)
I am presuming that the name is okay, and that the chapter structure is okay, so just transclude in place. There is no requirement to delete, just replace. Nothing is gained by deleting; and nothing is gained by moving an incomplete work that will never be completed. — billinghurst sDrewth 07:53, 12 March 2021 (UTC)
@RaboKarbakian: Do you still need assistance with this or has it been resolved? --Xover (talk) 14:07, 30 March 2021 (UTC)
@Xover: as far as I am concerned, I left this to billinghurst's competent management....--RaboKarbakian (talk) 14:23, 30 March 2021 (UTC)

A minor gadget/edit toolbar distractionEdit

On activation of both OCR gadgets, with the first edit page refresh, the Tesseract OCR icon is separated from the Google OCR icon by the "Insert template" icon. Afterwards, the two OCR icons are side by side, and it's very distracting. Would it be a major correction to keep the two separated by anything? — Ineuw (talk) 23:28, 13 March 2021 (UTC)

Asking for suggestions for a minor javascript correction with an image this timeEdit

Can someone suggest how I can correct this issue for myself? I copied User:Ineuw/Gadget-ocr.js to my namespace, hoping to modify the script because, on the first edit, the [Template button] is positioned between the two OCR buttons (which is very helpful) and afterwards it's positioned to the left as in this image. (Having both side by side is distracting). Half my thanks in advance, many thanks for a successful solution.— Ineuw (talk) 19:29, 16 March 2021 (UTC)

Tom UrtonEdit

The following discussion is closed and will soon be archived:

I happened to read one of your letters from Tom Urton, who lives in Norton, England with great interest. I live in Santa Rosa, Sonoma Co, California, and my great-grandmother was Kate Elizabeth Urton. It is not a common name. A few years ago we went to Norton and to St James Church, but we could not go inside. I am a genealogist and have a lot of information about the Urtons, but I am stuck about 1470. Tom, I hope you see this note! I sent a letter to you last December, but it was returned the first week of March and said the address was incorrect. I used the address you posted with your note on Wikisource. I am afraid to write my address/email here because it would be available to everyone in the world. I hope you will write again so that I can see your address or an address that you use. I live in Spring Lake Village, Santa Rosa, CA 95409. Suzanne (Tharp) Guerra

  This section is considered resolved, for the purposes of archiving. If you disagree, replace this template with your comment. Xover (talk) 13:49, 30 March 2021 (UTC)

Appendix 2 of Katherine Mayo's book, "General Washington's Dilemma"Edit

Help requested to upload to Wikisource please

I wish to upload Appendix 2 of Katherine Mayo’s book, “General Washington’s Dilemma” to Wikisource, since it is missing from the New York (1938) edition and only available in the London versions. I wish to upload it here: but would far rather upload the text, with references and Wikilinks, which has been prepared by me here (full explanations are there too): I think this would be far more useful. The text commences: “The following is a faithful copy...”. Copyright approval has been obtained here: see (Nthep (talk) 15:07, 17 March 2021 (UTC)). I am essentially looking for some kind editor who would do this for me since I’m too old to have good IT skills and this is likely to be well beyond me. Help would be massively appreciated. Once done, there are three main pages where this will need to be linked, the main one being,_2nd_Baronet but I should be able to copy the outcome to the other pages. Many thanks in advance to the person willing to help me out here. If this really can only be done with either jpeg or pdf documents, then I will have to ask my daughter to do this since she has a copy of the correct edition of the book. Arbil44 (talk) 18:24, 18 March 2021 (UTC)

Thank you for your willingness to contribute to Wikisource. Do you have a full copy of the book? I know that you say that only the Appendix is different, but there can be subtle differences between editions. Also, we prefer to have our works scan-backed. Is there any place where we can find a scan of the book? Languageseeker (talk) 19:26, 18 March 2021 (UTC)
Thank you Languageseeker. I have the book (well my daughter does now) and it is the London: Jonathan Cape, 1938 edition, pp.263-268. The New York Harcourt, Brace edition does not have an Appendix 2, and that is the reason I would like it uploaded to Wikisource here. I have six scanned pages of Appendix 2, in jpeg. format. Would that be acceptable? If you were able to send me an internal email, I could then send these scanned images to you? That said, I wish you could use my Sandbox 4 edition! It is a faithful copy and I put a great deal of work into it, including lots of Wikilinks, and of course the references needed (2 of them) as well. However, if that must go to waste, the important thing is to get it up on Wikisource please. Arbil44 (talk) 01:32, 19 March 2021 (UTC)
Just adding some information here from my sandbox 4.
The following is a faithful copy of Appendix 2 of General Washington's Dilemma by Katherine Mayo. This appendix appears in the London: Jonathan Cape, 1938 edition, pp.263-268, and in the New York/London: Kennikat Press 1970 edition, pp 263-268, but not in the New York, Harcourt, Brace & Co., 1938 edition, which is also online here:[a 1]
All references to The Hon. R. Fulke Greville, of the First Foot Guards, are now known to refer to Lieutenant and Captain The Hon. Henry Greville of the 2nd Foot Guards (now known as the Coldstream Guards[a 2] Arbil44 (talk) 01:39, 19 March 2021 (UTC)
  1. Mayo, Katherine (1938). General Washington's Dilemma. New York: Harcourt, Brace and Company. 
  2. "The Thirteen Officers and Their Regiments". The Journal of Lancaster County's Historical Society 120 (3): 100. 2019. OCLC 2297909. 
@Arbil44: None of your proofreading would go to waste if we scan-backed this work; it would be placed next to the scan images. Otherwise no-one else can validate this book, as it appears to be not otherwise available online. For more information, Help:Beginner's guide to proofreading provides a quick intro to how this normally works.
But you can email me the complete scan images of the book (ideally including covers and blank pages—a scan of only part of the book is not ideal) at my username at and I'll make them into a file that we can use to scan-back your edition of this book and set it up at Wikisource. A Google Drive/Dropbox link or similar is fine too if the images are too big to email.
Thanks for contributing to Wikisource and I look forward to helping you realise your goal. Inductiveloadtalk/contribs 22:14, 20 March 2021 (UTC)
Sorry, I'm a bit of a tech idiot and I don't really understand what you have said, but I have uploaded the Appendix 2 of the Mayo book (six pages) and you can find them here: [5]. However, the entire book is available online here: [6] with the exception of the Appendix 2. I would simply say that my daughter now has my copy of the book and I couldn't possibly ask her to scan the entire book, when it is already available at HathiTrust. If that has to happen I'm afraid I will probably have to abandon this quest! That will be a great pity as two of the most important letters written regarding The Asgill Affair are the only element comprising Appendix 2. Thanks for your email address, but now I have uploaded the 6 pages to Wikimedia I imagine you will be able to find them there? I don't know who does the proofreading, but I have typed up the entire Appendix 2 here: Please remember that my notes regarding the editions which do and don't have an appendix are important, as are my notes regarding the different spellings of Asgill's name and the totally incorrect particulars of Greville's name. Arbil44 (talk) 00:44, 21 March 2021 (UTC)
Book added at Index:General_Washington’s_Dilemma_(1938). I'll try to merge and split asap. Languageseeker (talk) 01:34, 21 March 2021 (UTC)
Thank you. I think the page numbers are labelled incorrectly. I had some trouble with this during the upload process. My apologies for that. Arbil44 (talk) 08:31, 21 March 2021 (UTC)
Please note: This appendix appears in the London: Jonathan Cape, 1938 edition, pp.263-268, and in the New York/London: Kennikat Press 1970 edition, pp 263-268, but NOT in the New York, Harcourt, Brace & Co., 1938 edition, but all other pages of that edition appear in the online edition.Arbil44 (talk) 09:28, 21 March 2021 (UTC)
All the various issues (where the appendix is and is not - who Asgylle and Asgyle really is - and the bad transcription by Earl Spencer with all details regarding Greville) are all covered in my Sandbox 4. Arbil44 (talk) 09:38, 21 March 2021 (UTC)

Languageseeker please could you correct the publisher's name, because Harcourt Brace is wrong. That edition is the one which does not have an Appendix 2. Arbil44 (talk) 09:09, 23 March 2021 (UTC)

Index:Hans Holbein the younger (Volume 2).djvuEdit

Was looking through this, and found some 'bonus' images and other ephemera in the scans.

I've marked the file as problematic, so that a further discussion can be had here.

The images look like they are of Holbein (or similar-era) paintings (so PD-art). If they can be identified it would be reasonable to retain them.

However, the copyright status of the ephemera is unclear. Do I mark the ephemera for blanking given the unclear status? (It's also not clear if they are contemporaneous with the rest of the book.)

Example : News clipping of unknown date Page:Hans_Holbein_the_younger_(Volume_2).djvu/30 ? ShakespeareFan00 (talk) 22:46, 18 March 2021 (UTC)

@ShakespeareFan00: If it's not part of the book as published then mark the pages as without text. If they are additionally of unclear or dubious copyright status then flag the specific pages and I can excise them. Just looking at the index it wasn't clear to me which pages this was concerning. --Xover (talk) 13:48, 30 March 2021 (UTC)
I think you are both overthinking this. Like I did with my failed 9 page djvu file. Look at the publication date. — Ineuw (talk) 00:24, 3 April 2021 (UTC)

small caps names within italic textEdit

Esme Shepherd (talk) 16:34, 20 March 2021 (UTC) I have been formatting many dramatic instructions that are in italics except the character names, which are in small caps. The usual form is therefore 'text in italics' Name 'text in italics', there being spaces between the text blocks and the name. Mostly, this works fine but sometimes the result is 'text in italics'Name'text in italics' without spaces. I don't know why this is so, and all I can do is format it as 'text in italics ' Name ' text in italics' to provide the spaces. Is there any rationale that differentiates these cases or is it just random?

@Esme Shepherd: Can you give a link to a page where it happens? --Jan Kameníček (talk) 17:46, 20 March 2021 (UTC)

Esme Shepherd (talk) 10:17, 21 March 2021 (UTC) It isn't easy to spot them retrospectively, so I will post you the next one I find. The Exeunt at the bottom needs separating from the character name.

Pardon if I have misunderstood, but there is an optical effect created by the slope of italic text. The example given 'looks' fine to me, there is a space. CYGNIS INSIGNIS 12:53, 22 March 2021 (UTC)

Esme Shepherd (talk)Yes, I have put a space where one is not usually required. It may have something to do with the italics but compare the following page 'Re-enter Leonora' (no space here). There is also a longer passage on without spaces, where a character name is preceded by an italic t. Also sometimes, the word following the character name needs a space before it. I haven't located an example of this yet.

Esme Shepherd (talk) 19:33, 22 March 2021 (UTC)Okay, I think we are working towards eliminating the problem. It was just a puzzle and annoyance. I still don't understand why, but at least I can overcome it! Thank you.

@Esme Shepherd: There is no need to add any extra spaces, I have removed the extra space that you added to Page:Dramas 1.pdf/284 and the result is as expected. Such extra spaces should definitely not be inserted. If you do not see the space, it can be caused by the effect described above by Cygnis insignis. This effect can be stronger in some browsers than in others, but it is not the reason to add any extra space which does not belong there. --Jan Kameníček (talk) 21:19, 22 March 2021 (UTC)

Esme Shepherd (talk) 19:31, 23 March 2021 (UTC)All spaces on proofread pages have now been removed and the rest will soon follow. The closing up still appears sometimes on transcluded pages and doesn't look good, but the spaces are confirmed by 'copy and paste', so I'm happy with that.

Index from Image Files and Haithi TrustEdit

@Xover, @Inductiveload, @ShakespeareFan00: Today, I experimented with creating an Index from a set of images for three reasons

  1. There is the possibility of adding JP2 support that will enable the usage of image files from IA and other sources. However, the Wikimedia Foundation has made it clear that it will not support images in zip files. Therefore, Wikisource will need to create Index files from individual images to leverage the advantages of JP2 files.
  2. Books can be downloaded from Hathi Trust as individual images. If uploaded to the IA, then IA converts them from the original files to JP2 and then to PDF, incurring serious quality losses. One file went from 2 GB of JPEG files to a 100 MB PDF.
  3. Understanding the problems of Index files created from images can help prepare this site for the future.

For the purpose of review, I created Index:A dictionarie of the French and English tongues based on 2 GB of images of a book from 1611 and Index:Shen of the Sea based on 137 MB of a book from 1925.

Here are my major findings:

Hathi TrustEdit

  1. Images for a book are stored in a mixture of JPEG and PNG files. The images stored on Hathi Trust have no extension, and the extension must be added from the MIME type of the file. Therefore, most books downloaded from Hathi Trust will contain a mixture of image formats.
  2. Hathi Trust does not return an error when it goes beyond the last page of a book. It just keeps on sending the back cover.
  3. You need a program like TrID to add an extension based on the MIME type.
  4. Images can be downloaded sequentially from Hathi Trust.
  5. Hathi Trust throttles downloads and either automatic retrying or a 4 second delay between requests is needed.
  6. Images downloaded from Hathi Trust must be renamed into sequential order during download.
  7. Pattypan makes it easy to upload the image files to Commons and generates a sequence of images that can be used to generate the list of pages for the Index file.
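The download findings above can be sketched in Python (the URL template passed to `download_pages` is hypothetical; only the throttling delay, the MIME-type-to-extension mapping, and the zero-padded sequential renaming reflect the findings in the list):

```python
import time
import urllib.request

# Finding 1/3: files arrive with no extension; pick one from the MIME type.
EXT_FOR_MIME = {
    "image/jpeg": "jpg",
    "image/png": "png",
}

def extension_for(content_type: str) -> str:
    """Map a Content-Type header to a file extension; 'bin' if unknown."""
    mime = content_type.split(";")[0].strip().lower()
    return EXT_FOR_MIME.get(mime, "bin")

def sequential_name(seq: int, ext: str, width: int = 4) -> str:
    """Finding 6: zero-padded sequential filename so pages sort correctly."""
    return f"{seq:0{width}d}.{ext}"

def download_pages(page_url_template: str, total_pages: int, delay: float = 4.0):
    """Download pages 1..total_pages with a delay between requests (finding 5).

    total_pages must be known in advance because of finding 2: Hathi Trust
    keeps sending the back cover instead of erroring past the last page.
    """
    for seq in range(1, total_pages + 1):
        with urllib.request.urlopen(page_url_template.format(seq=seq)) as resp:
            ext = extension_for(resp.headers.get("Content-Type", ""))
            with open(sequential_name(seq, ext), "wb") as out:
                out.write(resp.read())
        time.sleep(delay)  # be polite: Hathi Trust throttles downloads
```

The resulting mixed .jpg/.png sequence is then ready for a Pattypan batch upload as in finding 7.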

Advantages of an Index File generated from ImagesEdit

  1. Much faster loading time.
  2. Easier to add in missing pages.
  3. Images can be cropped directly.
  4. Text is easier to read due to higher quality images.
  5. Easier to add missing pages or replace damaged pages.
  6. No need to rely on the IA upload tool or try to figure out why your file won't upload to Commons, because individual 400 KB to 2.5 MB files are no problem for Commons and Pattypan makes batch uploads trivial.

Disadvantages of Index generated from ImagesEdit

  1. No OCR layer
  2. Merge and Split doesn't work
  3. index preview.js does not work
  4. Preview Pagelist does not work
  5. Uploading 1,000 images to Commons takes a long time.

Suggested ChangesEdit

  1. Add a third parameter page_sequence for the Pages category on Index ns
    1. Currently, when adding pages from images the code is [[Image|Page Name]]. This creates Page:Image. This is out of sync with the way Pages are created from DJVU or PDF files; namely, Page:Index_name/Sequence.
      1. For example, the same book Index:Shen of the Sea creates Page:Mdp.39015056023214 001.jpeg, Page:Mdp.39015056023214_028.png, and Page:Mdp.39015056023214 280.jpg.
      2. They would make more sense as Page:Shen of the Sea/1 Page:Shen of the Sea/28 Page:Shen of the Sea/280
    2. The current approach to creating Pages from separate image files will always break Merge and Split because of the possibility of mixed-format images.
    3. The new approach should be [[Image|Sequence|Page Name]] creating Page:Index_name/Sequence.
    4. For Indexes created from images with the two-parameter syntax ([[Image|Page Name]]), a bot should exist to automatically add a numerical sequence starting with 1. This should also be done on the creation of an Index.
  2. Fix index preview.js and Preview Pagelist to handle Index ns with images from individual files.
    1. Having a sequence number for images can help.
  3. Automatically run OCR for all the images and create a text layer.
  4. The Scans section on the Index ns does not make sense because of the possibility of multiple MIME types. Instead of asking the user to manually enter the file type, the Index ns should automatically list all MIME types present.
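The naming difference described in change 1 can be sketched like this (illustrative helpers only, not ProofreadPage code; they just show the two title schemes side by side):

```python
# Status quo for image-backed Indexes: the Page title mirrors the image
# filename, so mixed-format uploads produce mixed, opaque titles.
def current_page_title(image_name: str) -> str:
    return f"Page:{image_name}"

# Proposed scheme: derive the title from the Index name and a sequence
# number, matching how DjVu/PDF-backed Indexes already work.
def proposed_page_title(index_name: str, sequence: int) -> str:
    return f"Page:{index_name}/{sequence}"
```

Under the proposed scheme, tools like Merge and Split could address a sequential range of pages instead of querying and matching individual image names.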

Here is a page created with text generated by Google OCR and an Image generated by the Crop tool.

I've probably forgotten a few things, so please ask questions. Languageseeker (talk) 03:03, 21 March 2021 (UTC)

@Languageseeker: While I appreciate your enthusiasm and zeal here, you're misunderstanding the problems and prescribing inapt solutions to the wrong problems. The above is a reasonably accurate summary of the status quo (which we're painfully aware of), and "being able to use images from Commons for an Index:" is roughly a description of the desired goal (which we have previously articulated). But getting from here to there is going to take sustained effort from the community in specifying the solution, followed by advocacy and recruitment to find the developers able to do the work, and then a significant amount of developer resources over a fairly long stretch of time (including ongoing maintenance). And due to the existing platform infrastructure, into which any solution for our needs is going to have to fit, this will not be green field development: it will involve not just "a developer" hacking together some new functionality in Proofread Page, but multiple developers from multiple teams at the WMF working on multiple components of the technology stack. It is also very probable that the features we need do not all actually exist in the stack and will need to be developed more or less from scratch (and without breaking anything in the process). And because these do not yet exist out of the box we are actually going to need some developer assistance in just coming up with a specification that is even remotely implementable. Meanwhile, we can't even get minimal ongoing maintenance of our core software components (except by the kindness of volunteers in their very limited spare time) or bug fixes for which there is a patch provided applied. So, to put it succinctly, this problem only looks simple if you ignore all the hard parts.
I wouldn't for all the world want to put a damper on your enthusiasm, but right now you're flailing about all over the place without the background to direct the energy constructively (hint: it's not in getting Commons to ban duplicate scans in PDF and DjVu because you think the issue has any similarities with VHS vs. Betamax). Slow down. Learn. Discuss. And then figure out how to direct your energies where they can do the most good. There is no simple short term solution that none of us have been able to come up with: there are only hard long-term solutions that will take all of us pulling together in a sustained effort. --Xover (talk) 09:04, 21 March 2021 (UTC)
@Languageseeker: Re Hathi Trust: FYI you can get the number of pages in the book from the Data API (along with the image data itself). You need a free UoM Friend account to get an API key. You can also find it in the HTML for a book (<span data-slot="total-seq">254</span>, but the Data API is tidier.
The file extension is easy to work out from the returned mime-type of the image data. Generally PNG is bitonal and JPG is coloured. If you do make a DJVU from the images, you should use this information as it cuts the filesize by an order of magnitude. Inductiveloadtalk/contribs 13:51, 21 March 2021 (UTC)
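A sketch of pulling the page count out of the book page HTML, based on the `total-seq` span quoted above (the regex fallback is my assumption; as noted, the Data API is the tidier route):

```python
import re

def total_pages(html: str) -> int:
    """Extract the page count from a Hathi Trust book page's HTML.

    Looks for the <span data-slot="total-seq">N</span> element; raises
    ValueError if the marker is absent (e.g. the page layout changed).
    """
    match = re.search(r'data-slot="total-seq"[^>]*>(\d+)<', html)
    if match is None:
        raise ValueError("total-seq not found in page HTML")
    return int(match.group(1))
```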
@Xover: Thank you both for your thoughtful replies. I agree that I probably do need to learn more and that many of these changes are far more complicated. Do you think it would be possible to change the Pages generated from an image-based index to Page:Index_name/Sequence instead of Page:Image Name, or would that also require a deep restructuring of the platform infrastructure?
@Inductiveload: Thanks for the advice. I actually discovered that TrID [7] can do this as well. For me, the question is what Source do I put down on an Index ns if I use the JPG/PNG files from Hathi Trust. Languageseeker (talk) 14:07, 21 March 2021 (UTC)
@Languageseeker: Without knowing the code intimately, so caveat etc.… I think that would be a relatively contained change only in the Proofread Page extension. I don't want to speculate about the complexity / how much work that change would be without knowing the code, but it doesn't obviously require any major surgeries anywhere. But a better question is why do it? What does it gain us? The individual page names could essentially be random strings for all it matters: it's the Index that ties it together and the software knowing what the sequence of pages is. So long as we can use the next/previous buttons on each page, and transclude sequences of pages (from/to) using <pages … /> what does the page naming matter?
The Match & Split tool doesn't support non-multipage formats, but if that's what you want to use why not pursue making it support that? It's not really the mixture of image formats that's the problem there, but rather that it assumes it's a single multi-page format file. But the source code is available and I have access to the relevant server if need be. Of the two JS tools one is developed by a long-time enWS contributor, and the other by a more recent contributor as a student Google Summer of Code project, and both of them are active and responsive to queries. Both tools should technically be able to work with an image-based Index, albeit possibly with code that is too hacky to want to implement in production (I don't think there's a clean API in place for the necessary information yet). If you're having trouble using one of those tools for a specific project that's the level at which you'll want to pursue it. --Xover (talk) 15:34, 21 March 2021 (UTC)
@Xover: The major reason would be to harmonize the way we make Pages for PDF and DJVU indexes with the way we do so for images. For PDF and DJVU, the system uses Page:Index_name/Sequence and for Images Page:Image Name. This means that every piece of code has to take into account these two systems. Also, for tools, such as merge and split, you would need to query the list of images and then match to individual images instead of a sequential range of pages. Languageseeker (talk) 16:41, 21 March 2021 (UTC)

Index:A discovery that the moon has a vast population of human beings.djvuEdit

The following discussion is closed and will soon be archived:
Request appears to have been otherwise resolved. @ShakespeareFan00: If you still need this doctored in some way feel free to hit me up on my talk page.

Can someone trim this down? It seems there are some additional clippings in it, that aren't part of the original. ShakespeareFan00 (talk) 22:21, 25 March 2021 (UTC)

Why? Just mark them as no text and move on. Creating work for others where there is no value, and an easy solution. — billinghurst sDrewth 14:50, 27 March 2021 (UTC)
  This section is considered resolved, for the purposes of archiving. If you disagree, replace this template with your comment. Xover (talk) 11:22, 30 March 2021 (UTC)

Help: Need to move Index and its Pages because of a spelling errorEdit

The following discussion is closed and will soon be archived:
All pages, file, and index migrated to new name.

Years ago, I copied and pasted OCR text for the Index name, and I missed a typo in the name. The scanned "g" in Egypt ended up as the letter "q". Index:On the Desert - Recent Events in Eqypt.djvu. I would like to save rather than delete and re-install, but don't know how to move the index and pages. — Ineuw (talk) 06:36, 27 March 2021 (UTC)

  Doing... --Xover (talk) 13:20, 27 March 2021 (UTC)
  Done File rename requested at Commons, and a temporary redirect established. @Ineuw: --Xover (talk) 14:33, 27 March 2021 (UTC)
I have moved the file at Commons, though I wonder why we even bothered. There is no need, and it creates a lot of work for next to no value. — billinghurst sDrewth 14:48, 27 March 2021 (UTC)
@Xover: Many many thanks for taking the time out and correcting my error. I should be able to do it by now. What tools do you use to rename the pages? — Ineuw (talk) 19:02, 27 March 2021 (UTC)
@Ineuw: Pywikibot has built-in support for moving pages, you just need to massage up a list of from->to page names. --Xover (talk) 19:04, 27 March 2021 (UTC)
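The "massage up a list of from->to page names" step can be sketched like this (the title pattern follows the Eqypt/Egypt rename above; the pair list is illustrative and would be fed to whatever move tool you use, e.g. Pywikibot or the Page shifter script):

```python
# Hypothetical templates for the rename in this thread: the misspelled
# "Eqypt" index pages move to the corrected "Egypt" title.
OLD = "Page:On the Desert - Recent Events in Eqypt.djvu/{n}"
NEW = "Page:On the Desert - Recent Events in Egypt.djvu/{n}"

def move_pairs(first: int, last: int):
    """Build (old_title, new_title) pairs for pages first..last inclusive."""
    return [(OLD.format(n=n), NEW.format(n=n)) for n in range(first, last + 1)]
```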
@Ineuw: I also have a handy script for exactly this purpose: User:Inductiveload/Scripts/Page shifter. Inductiveloadtalk/contribs 20:32, 27 March 2021 (UTC)
  This section is considered resolved, for the purposes of archiving. If you disagree, replace this template with your comment. Xover (talk) 11:16, 30 March 2021 (UTC)

Index:The star in the window (1918).djvu has problematic pages, need to be replacedEdit

The following discussion is closed and will soon be archived:
All pages migrated to a new index.

I need some help, because pages 286 through 293 of The Star in the Window are missing for some reason, and are just a bunch of copies of pages 284 and 285. This would be the entire content of Chapter 31.

The pages on the DJVU need to be replaced without damaging the Index page. This DJVU came from Internet Archive, but there is another version at Google Books, which is located here: File:The Star in the Window (Grosset & Dunlap).pdf. The pages 286-293 are correctly shown there, and the actual book content between the two versions is exactly the same (trust me).

I have no idea how to do this, so could someone please replace the problematic pages at Index:The star in the window (1918).djvu with the same correct pages from File:The Star in the Window (Grosset & Dunlap).pdf? Thank you! PseudoSkull (talk) 17:16, 27 March 2021 (UTC)

Pages will be proofread as normal from the other scan but are marked as problematic for now until this is fixed. PseudoSkull (talk) 17:19, 27 March 2021 (UTC)
I'll upload the Harvard copy of the Stokes version later. Languageseeker (talk) 20:53, 27 March 2021 (UTC)
New version at Index:The Star in the Window.pdf Languageseeker (talk) 03:13, 28 March 2021 (UTC)
@Inductiveload, @Xover: Could one of you move the text from Index:The star in the window (1918).djvu to Index:The Star in the Window.pdf? The offset is -1. Languageseeker (talk) 15:05, 28 March 2021 (UTC)
Yes, please. I have tried myself. PseudoSkull (talk) 15:23, 28 March 2021 (UTC)
I also made this request at Commons to convert that to a DjVu and keep the index we already made. If migrating the page contents is what we must do instead I can do that much. But it will take a damn long time... PseudoSkull (talk) 15:25, 28 March 2021 (UTC)
Migration in progress. PseudoSkull (talk) 16:04, 28 March 2021 (UTC)
I aborted my migration due to User talk:PseudoSkull#Please do not bulk move Page: namespace pages. @Xover, @Inductiveload, @Billinghurst, @Jan.Kamenicek: Any of you can migrate the rest in whatever way you wish, since you are the ones with more efficient tools. My bot got to Page:The_star_in_the_window_(1918).djvu/89. None of the pages in the transclusion (except in the front matter) have been migrated to the better scan. PseudoSkull (talk) 22:18, 28 March 2021 (UTC)
The migration needs to be from Index:The star in the window (1918).djvu to Index:The Star in the Window.pdf. PseudoSkull (talk) 22:20, 28 March 2021 (UTC)
More info: Only Chapter 1 up to Chapter 30 need to be changed to reflect the correct scan, since I'm proofreading the rest of it on the correct scan. PseudoSkull (talk) 23:29, 28 March 2021 (UTC)
@PseudoSkull:   running ; pages are being moved, and the transclusions are being updated. As has been mentioned previously, please do not do titles as hard links
| title = [[The Star in the Window (Stokes)|The Star in the Window]]
please use relative links like
| title      = [[../|The Star in the Window]]
Thanks. — billinghurst sDrewth 02:28, 29 March 2021 (UTC)
@Billinghurst: Ah yes, thank you for doing that. And I apologize for repeating a previous mistake; I must have copied from elsewhere and not seen that error. I will try to do things differently in the future. PseudoSkull (talk) 02:32, 29 March 2021 (UTC)
  This section is considered resolved, for the purposes of archiving. If you disagree, replace this template with your comment. Xover (talk) 11:15, 30 March 2021 (UTC)

Paragraphs with a left margin across page-breaks

I have come across numerous instances where a left-indented paragraph runs from one page to the next. There does not seem to be any way of running this paragraph together on transclusion. Am I correct in this? See, for example: and the subsequent page. Esme Shepherd (talk) 10:23, 31 March 2021 (UTC)

In this case, use the "split" templates {{left margin/s}} and {{left margin/e}}, just like the {{block center}} equivalents. These are undocumented, so it's hardly surprising you didn't find them! Inductiveloadtalk/contribs 11:16, 31 March 2021 (UTC)

Thank you, that's brilliant! I had experimented with this, but I must have had the formulation wrong! Esme Shepherd (talk) 09:55, 1 April 2021 (UTC)

Black Beauty, versions and translations

No big deal as I put them on Author:Anna Sewell, but there is a translation and a film of Black Beauty (silent though). There are other versions with different illustrators, but I haven't compiled that list.

Maybe Black Beauty could be moved to Black Beauty (first edition)? Or not. Just let me know.--RaboKarbakian (talk) 16:17, 31 March 2021 (UTC)

Another perfectly good option is to tell me that I was given bad guidance and allow me to move it myself. I am annoyed to be here asking, which often means that I am being annoying.--RaboKarbakian (talk) 17:18, 31 March 2021 (UTC)
Labeling something as "first edition" can be problematic, as there can be a first book edition, first magazine edition, first paperback edition, first edition in the US, first edition in the UK, etc. It is much better to use the date and/or publisher and/or place of publication to identify the edition rather than an edition number. --EncycloPetey (talk) 17:37, 31 March 2021 (UTC)
And the items you added to Author:Anna Sewell were not written by Sewell, so they should not be placed on her Author page. Nor should you link in an Author page to a Wikipedia article about the work she didn't write from a title that would be expected to link to the work itself on Wikisource. --EncycloPetey (talk) 17:45, 31 March 2021 (UTC)
@RaboKarbakian: First, you only need to ask for help in moving pages if you are uncertain about the correct page names etc. or there are a large number of pages involved (think "more than about five" as a rule of thumb). In the latter case both because with a large number of pages any messes are also going to be large and because it is much easier to get an admin to do the move than to go back and clean up redirects etc. Black Beauty is a case where it's a good idea to ask for help for both those reasons. So I think the advice you were given that led you here in this case was very good.
Because… I'm not sure any move is warranted here. We generally don't preemptively create versions or disambiguation pages (yes, there are exceptions), and so far as I can see we currently only have one work with that name and one edition of that work. Once we have an edition of Black Beauty Retold in Words of one Syllable ready to transclude we might want to look at what to do, which might be a disambiguation page or might be to put the latter at Black Beauty Retold in Words of one Syllable. If it were concluded to use Black Beauty for both, the original would live at Black Beauty (1877) and the other at Black Beauty (1905) (or, often, disambiguating using the author's last name because we usually don't have multiple editions of the same work). --Xover (talk) 18:10, 31 March 2021 (UTC)
@EncycloPetey: Everything you said was true, although, I have been treating "one syllable" works as translations (as others are here). Everything you removed from Anna Sewell belongs at Black Beauty which is the reason I am here.--RaboKarbakian (talk) 18:29, 31 March 2021 (UTC)
@Xover: The Main space name is interesting, in that it is a pain. At commons, another sourcerer was naming cats: Title (YYYY, publisher) which I started to follow there, leaving Title open for all editions, or Title (Author) for problem titles, or Title (YYYY, Author) for the prolific and revisionists. Whatever name you (all) think works will be just fine.--RaboKarbakian (talk) 18:29, 31 March 2021 (UTC)
@EncycloPetey: What author page linking? I am confused.--RaboKarbakian (talk) 18:31, 31 March 2021 (UTC)
I am referring to the links you incorrectly added as part of the discussion you started. You placed links on a page where they should not have been placed. If you need to keep notes, you can place them in your User space. Or you could place them on the Talk page for Black Beauty. But please do not place works by one author onto the Author page of a different author. --EncycloPetey (talk) 18:42, 31 March 2021 (UTC)

Transclusion not wiki-formatting heading

Page:CTSS programmer's guide.djvu/53 is fine by itself, but not when transcluded on Compatible time-sharing system: A programmer's guide

Thanks, Phillipedison1891 (talk) 15:03, 1 April 2021 (UTC)

Nevermind, was able to fix it. Phillipedison1891 (talk) 15:04, 1 April 2021 (UTC)

Setting up Merge and Splits for The Complete Works of Geoffrey Chaucer

I just uploaded the scans for all 7 volumes of Author:Geoffrey_Chaucer#Collected_works. The first 6 volumes have text from Gutenberg done by PGDP. For that reason, they have page numbers. Is there any way to merge-and-split these texts? Languageseeker (talk) 01:45, 2 April 2021 (UTC)

Missing page images of a linked djvu file?

I created this eight-page article as a .djvu file, which displays correctly in my desktop DjVu app. But here the page images are not showing, while the text layer is re-created with the OCR. — Ineuw (talk) 04:18, 2 April 2021 (UTC)

Page images are missing and OCR error

I installed this 9-page document. The page images are missing, but the OCR succeeded, except on the last page, where OCR generates an error. Whenever someone has the time, please look at what's wrong. Thanks. — Ineuw (talk) 13:09, 2 April 2021 (UTC)

IOError: (invalid url?)
@Ineuw: That file claims to have a resolution of 19,204 × 26,458 pixels (about 10x what's typical), but still only 9.31 MB. I'll dig a bit, but my initial guess is that this file is broken in some way. --Xover (talk) 13:33, 2 April 2021 (UTC)
Uhm. How did you extract the 9 pages? And for that matter, why? You can proofread and transclude only those 9 pages even if the file and index contain many more. --Xover (talk) 13:35, 2 April 2021 (UTC)
Definitely a funky file. It's got indirect chunks, looks to put the text layer in annotation blocks, and claims to be an insane resolution. What tool created this file? I'll try to generate a DjVu of the whole volume, but it'll have to wait until later today or tomorrow. --Xover (talk) 13:42, 2 April 2021 (UTC)
@Xover: Please don't waste your time. I will try it again in a different way to learn how to do it. These were made from 9 JP2 pages converted to PNG, then uploaded to Convertio to convert to 9 separate djvu pages (I have no offline djvu conversion tool), which were stitched together with djvm in Windows. Go ahead and laugh. :-) — Ineuw (talk) 22:19, 2 April 2021 (UTC)
@Ineuw: Regardless of how roundabout that process sounds (happy to provide guidance, but tl;dr if you have djvm you should have c44, which would convert a JPG input directly to DJVU): why not just upload the entire document, which even comes with the OCR? And then it allows proofreading of the rest of The World's Work v. 14 by others. Inductiveloadtalk/contribs 22:59, 2 April 2021 (UTC)
@Inductiveload: You are absolutely right. There is no excuse for my approach, except that I was exploring (playing) to see the end results. The djvudump displayed everything that's wrong. So I went back to the drawing board and found c44.exe, as well as the scripts posted on Wikimedia Commons. About uploading the complete volume: I try not to upload books which I have no interest in proofreading, so this seemed to be an alternative and a teaching moment. Only because it's 9 pages.— Ineuw (talk) 00:18, 3 April 2021 (UTC)
@Inductiveload, @Xover: I converted the .jpg page images with c44 and then assembled them with djvm. It's about 20% of the previous uploads, but the same problem exists. The text comes through but not the page image. Could you please look at it. — Ineuw (talk) 23:06, 4 April 2021 (UTC)
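For reference, the per-page workflow described here (c44 on each page image, then djvm to bundle) can be sketched as a short shell loop. This is a sketch only: it assumes DjVuLibre's c44, djvm, and djvudump are installed, and the file names page-*.jpg and article.djvu are placeholders, not the actual files from this thread.

```shell
# Encode each page image as a single-page DjVu, then bundle the pages.
# Assumes DjVuLibre is installed; file names are placeholders.
for f in page-*.jpg; do
  c44 "$f" "${f%.jpg}.djvu"        # IW44 wavelet encoding at default quality
done
djvm -c article.djvu page-*.djvu   # -c: create a bundled multi-page document
djvudump article.djvu              # inspect the structure for obvious problems
```

c44 also accepts PPM/PGM input directly, so a separate JPG conversion step is often unnecessary.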
@Ineuw: A big part of the issue is that you tagged the images with Internet Archive identifier: worldswork14gard on Commons, which prevents the IA tool from uploading the file. Languageseeker (talk) 23:27, 4 April 2021 (UTC)
Thanks for explaining. This is not working out for me. The volume is 700 pages and is not worth uploading in my opinion. So, I will delete it here, and ask for a deletion at the commons.— Ineuw (talk) 23:59, 4 April 2021 (UTC)
@Ineuw: I checked the new version of the file and it looked just fine, including showing the images in the Page: namespace. If you're still seeing broken images it is probably a caching issue or similar. The only thing wrong with your new version is that it doesn't have a text layer in the file itself (let me know if you want instructions for adding one: it's complicated and inconvenient, but entirely doable). --Xover (talk) 00:48, 5 April 2021 (UTC)
@Languageseeker: And just what in the world does that have to do with anything? --Xover (talk) 00:48, 5 April 2021 (UTC)
@Xover: The IA tool checks if there is a file tagged with {{IA|worldswork14gard}} on Commons. Even if it's an image, the IA tool will not allow you to upload the file, stating that the file already exists. I tried uploading the entire file with the IA tool, and the images that Ineuw uploaded to Commons and tagged with the IA link prevented the uploading of the actual book. Languageseeker (talk) 00:55, 5 April 2021 (UTC)
@Languageseeker: Yes, that is roughly how the ia-upload tool works. However, as ia-upload was involved nowhere in Ineuw's problem, why are you bringing it up at all, much less framing it as a causal factor for the problems they were having? --Xover (talk) 10:06, 5 April 2021 (UTC)
@Ineuw: It's actually easier to manage a single 700 page volume than managing an extracted article. You don't need to proofread the entire thing, just the part that interests you. Languageseeker (talk) 00:46, 5 April 2021 (UTC)

@Languageseeker: Thanks for the correction on Commons, and I will do that with future uploads.— Ineuw (talk) 00:50, 5 April 2021 (UTC)

@Ineuw: If an uploaded image, DjVu, or PDF comes from IA, then the file's information page should definitely contain the IA identifier or another link to IA. ia-upload was designed to avoid duplicate uploads based on an assumption that most works available on IA were not, and probably never would be, uploaded to Commons. That assumption has been rendered inaccurate over the last couple of months thanks to a way-overzealous bulk upload of as many of IA's PDFs (mostly low-quality, and with awkward autogenerated filenames and the raw IA bibliographic metadata) as the bot could get its paws on (mostly constrained by copyright). This state of affairs most likely means that the ia-upload duplicate checking in its current form is no longer feasible, and will either have to be removed or rewritten to work in a significantly different way. At which point the problem Languageseeker is talking about, which affects one single specialised uploader tool, will disappear, but we will still need good information about the source of media files on Commons. --Xover (talk) 10:06, 5 April 2021 (UTC)

Uploading Large PDFs to Commons

I don't seem to have a lot of luck uploading large PDFs to Commons. I've tried Chunked Uploader and it does not work. Does anybody have any suggestion? For example, I want to create a PDF for [8]. Languageseeker (talk) 20:29, 2 April 2021 (UTC)

@Languageseeker: I use just Upload Wizard and imo it should be able to handle this file too. --Jan Kameníček (talk) 22:22, 2 April 2021 (UTC)
Upload Wizard refuses documents over 100MB.--Prosfilaes (talk) 23:16, 2 April 2021 (UTC)
@Prosfilaes: that's the Basic Upload - the Wizard goes up to 2GB, I think (it uses chunked uploading). Inductiveloadtalk/contribs 23:27, 2 April 2021 (UTC)
Actually, it should be up to 4GB. --Jan Kameníček (talk) 08:30, 3 April 2021 (UTC)
@Languageseeker: I've been running into phab:T278104 on and off for a while with API uploading and the upload wizard, perhaps it's that?
On the other hand, this document produces a 55 MB DJVU from the 494MB of Hathi images, so perhaps that's a better way forward? If you must have a PDF and you want to crush it down, JBIG2 encoding the PNGs produces a PDF around 12MB, but I don't have tools to combine the JPGs with the PNGs as a PDF so only the PNGs are JBIG2'd, and I don't have tools to write the OCR into PDFs. Also the PDF is mind-expandingly slow to render compared to the DjVu. Inductiveloadtalk/contribs 23:27, 2 April 2021 (UTC)
@Inductiveload: Yep, that's the exact error that I'm getting. I'll just wait until that bug gets fixed. I'm trying to preserve the image quality because of the illustrations. Thanks for your help. Languageseeker (talk) 00:48, 3 April 2021 (UTC)
@Languageseeker: The illustrations are already pretty damaged by Google's compression, so IMO it's not particularly critical (especially as it's way easier to extract the images from the existing JPGs at Hathi than from a PDF that another user wouldn't know has or hasn't re-encoded the image). As was said by Nemo_bis in phab:T277921, Commons isn't attempting to compete with Hathi/IA for storage of endless terabytes of "raw" (not that it really is raw, see below) scan images. Because what's the point?
Even then, the 36 JPGs in this file total 35MB, so, on top of the ~12MB of lossless JBIG2-encoded bitonal images, you could still produce a PDF under 50MB, without a byte of data loss from the Hathi scan (except in the Google watermarks). But the PDF will render like molasses, because JBIG2 is very slow to decode. So I'd still suggest going for DjVu, and if the image quality from the default c44 encoder settings is not good enough for whatever reason, you can set that manually. For example:
$ c44 -decibel 50 mdp.39015011058198.0001.jpg page765.djvu
$ ddjvu page765.djvu -format=pnm page765_from_djvu.pnm
$ compare -metric PSNR mdp.39015011058198.0765.jpg page765_from_djvu.pnm diff.png
Which is kind of what you expect, since we asked for 50. 50 dB of PSNR is really rather good (way over JPG quality=90). In fact, since 255 levels corresponds to ~48 dB, it's essentially perfect (below the quantization error of the actual 8-bit image, but since the two aren't quite identical I'm obviously missing something). This is the difference map between the input JPG and the 50 dB c44 encoding. White means identical.
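(The ~48 dB figure mentioned here is the PSNR implied by 8-bit quantization alone, i.e. 20·log10(255). A quick one-liner reproduces it:)

```shell
# 8-bit images have 255 quantization levels, so quantization alone caps
# the peak signal-to-noise ratio at 20 * log10(255) ≈ 48.13 dB; a measured
# PSNR of ~50 dB is therefore effectively "perfect" for 8-bit data.
awk 'BEGIN { printf "%.2f dB\n", 20 * log(255) / log(10) }'
# prints: 48.13 dB
```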
Which is all kind of moot, because although the Hathi JPGs may be set at Q=95, they're encoding substantial compression noise, probably from before the data ever reached HT, which implies that 95 is far from representative of the paper-to-user Q factor and using a Q=95 level of compression is mostly just a waste of bits:
Striving to store data that's already totally swamped by compression noise is not particularly useful (in the context of Wikisource), IMO. Sure, reducing compression damage at each step is a nice goal, but once the data is trashed to n dB (where n << 50), what are you hoping to achieve by worrying about further lossless encoding? You have to ask yourself what exactly you are trying to achieve, or it's going to turn into a classic w:XY problem. Inductiveloadtalk/contribs 16:41, 3 April 2021 (UTC)
I'm trying to make sure that users do not have to go through Help:Image_extraction to crop an image from a file. I know that Google scans are of inferior quality, but they are often all we have. There is nothing wrong with lossless compression, but lossy compression alters the image. As you know, getting images from HathiTrust is difficult. So why make users go through extra work?
Yes, DJVU can compress more, but DJVU is no longer being actively developed. It's one major bug away from following the fate of Lilypond phab:T257066. If the security team discovers a major security bug in the DJVU viewer, who will fix it? What if the code becomes incompatible with the latest release of Debian? As for JBIG2, it's dangerous to use because it can alter the image, see JBIG2.
I'm not asking to import the entire IA or HathiTrust, but I want to make sure that the images are of the highest quality, because the quality of monitors is continuously improving. A higher-quality image will last longer. If the scans come from IA, I don't care, because I know that we can pull the scans at any time. For HathiTrust, I'm not so sure, because it already imposes restrictions. Downloading from HathiTrust at this moment places Wikisource in National Portrait Gallery and Wikimedia Foundation copyright dispute territory. Languageseeker (talk) 00:16, 4 April 2021 (UTC)
FYI, not that I'm saying JBIG2 is ideal (due to the insane decode time making them truly miserable to use on all but the most monstrous CPUs), but jbig2 operates in lossless mode by default (it's lossy if you set -s).
And even if you do just use PNG, remember to make the pages bitonal first, because the only reason the Hathi PNGs are not bitonal is the Google watermark. That will save you hundreds of MBs per file. Inductiveloadtalk/contribs 07:11, 4 April 2021 (UTC)
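One way to do the bitonal conversion suggested here is with ImageMagick (a sketch only; the file names are placeholders, and the 50% threshold is an arbitrary starting point you would tune per scan):

```shell
# Reduce a grayscale-plus-watermark page scan to a true 1-bit image.
# -threshold 50% maps every pixel to pure black or white; -type Bilevel
# makes ImageMagick store the result as an actual 1-bit PNG.
convert page.png -colorspace Gray -threshold 50% -type Bilevel page-bw.png
```

The resulting 1-bit PNGs are typically a small fraction of the grayscale originals, which is where the hundreds of MBs per file are saved.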

Transcriptions of an audio work

Hello Wikisource editors, we have been publishing (in Apple Podcasts, and the like) and also donating to Wikimedia Commons a podcast series, under the standard Creative Commons Attribution-Share Alike 4.0 International. We are considering creating a Wikisource page with the transcription of those podcast episodes. It seems that Wikisource welcomes transcripts of audio (WS:SCOPE), but more guidance, especially to confirm whether this contribution is within the scope of Wikisource, would be much appreciated. JCPod (talk) 19:52, 3 April 2021 (UTC)

Wikisource:What Wikisource includes should give you an idea of what we include. For works published after 1925, the work should meet our equivalent of "notable". Podcasts generally do not meet that criterion, as they do not pass through peer review or editorial controls. --EncycloPetey (talk) 19:58, 3 April 2021 (UTC)
@JCPod: while it's pretty unlikely a modern "self-published" work like a podcast meets WS:WWI, I think it sounds like something Wikibooks would allow, since it's essentially a book? I don't speak for them, but you could ask at wikibooks:Wikibooks:Reading room. Inductiveloadtalk/contribs 20:27, 3 April 2021 (UTC)
Thank you both for your prompt responses. JCPod (talk) 20:58, 3 April 2021 (UTC)

  Comment Aside from this specific case example, we need to take a better look at how we handle transcriptions of audio works, especially progressive transcriptions. Are we going to work in the Index: / Page: ns from a file at Commons, and look to go through the double process of validating? How would we get the snippets of sound into files, etc.? We have done something with video, and I think that it is time we looked to better formulate these media types. Needs guidance in Help: namespaces for video and audio files. PseudoSkull would be our current lead exponent. — billinghurst sDrewth 00:13, 4 April 2021 (UTC)

Author creation requested

Can anyone help to create the author page for Bruneian sultan Hassanal Bolkiah? I'm working on his Syariah Penal Code Order, 2013, and other emergency enactments solely made by him. In particular, I'm not sure how to deal with all of those authority control scribble-scrabbles. Many thanks.廣九直通車 (talk) 13:54, 5 April 2021 (UTC)

@廣九直通車:   Done See Author:Hassanal Bolkiah. I am uncertain about the best copyright tag to use, so I've stuck EdictGov there for now. --Xover (talk) 19:45, 5 April 2021 (UTC)

Looking for help with some Hebrew characters in a French Champollion book about hieroglyphs!


I'm active on the French Wikisource, and I'm working on a book by Jean-François Champollion about hieroglyphs... In this book, there is THIS PAGE with a text in Hebrew characters... As I'm not good at the Hebrew language or script, I'm looking for some help to correct the page. Any help would be welcome. Thanks Lorlam (talk) 18:50, 5 April 2021 (UTC)

@Ineuw: Is this something you are able to help out with? --Xover (talk) 19:38, 5 April 2021 (UTC)
Done. — Ineuw (talk) 19:52, 5 April 2021 (UTC)
Many thanks for your help — Ineuw :-) --Lorlam (talk) 21:19, 5 April 2021 (UTC)

Paginated text without scan

Following from this lengthy discussion and others on English Wikipedia w:en:Talk:Sir Charles Asgill, 2nd Baronet#General Washington's Dilemma by Katherine Mayo, Anne User:Arbil44 has transcribed a hard-to-find historical letter at w:en:user:Arbil44/New_sandbox4. It's well out of copyright. Anne has retained the original pagination and headers. Would someone be able to help copy this across to Wikisource with the appropriate page structure? Or advise me how to do it? (For example, without a scan, do we still use the Index: namespace to assemble the pages?)

Note, I don’t want to ask Anne to go back and add a scan, I get the sense that she has become somewhat frustrated in her interactions with Wikipedia and I don’t want to make things worse. So I’m hoping we can accept this as a non-scan-backed text as it is.

Pelagic (talk) 01:34, 6 April 2021 (UTC)

@Pelagic: I made the 6 pages into an index: Index:General Washington's Dilemma - Mayo - 1938 - Appendix 2.djvu. I'm not quite sure how it should be transcluded to mainspace, as it's just a fragment of a complete work. Inductiveloadtalk/contribs 01:38, 7 April 2021 (UTC)

Index:UN Treaty Series - vol 1.pdf, etc

This work and subsequent volumes of the United Nations Treaty Series are in English and French. I see the first volume proofing only English, so I would like to ask if separate indexes would have to be made on French Wikisource to proofread the French portions. If so, I am making more indexes here to encourage proofreading, but I do not have a reliable OCR.--Jusjih (talk) 05:01, 6 April 2021 (UTC)

@Jusjih: frWS would need separate index pages, yes. But they should mostly be able to just copy the data we have here if we have ones they don't already have. And, of course, they can use the same File: on Commons.
What's your problem with OCR? --Xover (talk) 07:35, 6 April 2021 (UTC)
Thanks. I wonder if reliable OCR is available online.--Jusjih (talk) 18:04, 6 April 2021 (UTC)
Perhaps this site already has OCR when creating pages in the Page namespace? I just added some well-formatted covers of the United Nations Treaty Series, but we will have to mark the year published from Volume 401 onwards.--Jusjih (talk) 00:47, 7 April 2021 (UTC)

Transcribing directly from webpages (Highway Code)

Hi, I believe the current Highway Code, published by the British government's Department for Transport, falls under the CC-BY-compatible Open Government Licence and thus would be eligible for inclusion (we already have a 1931 edition and parts of the 2008 Traffic Signs Manual). But how would I go about copying it here? I know scans are preferred for verifiability - would it be appropriate to print the webpages to PDF and upload them to Commons, or is a URL sufficient attribution? If so, how do I create the relevant pages without a scan? --Wodgester (talk) 17:01, 7 April 2021 (UTC)

I would "print" web pages into PDFs, upload them to Commons saying that the source web pages have been converted to PDFs, create indexes here, then proofread the pages.--Jusjih (talk) 20:49, 8 April 2021 (UTC)
Thanks for the help @Jusjih! I've started an index. --Wodgester (talk) 16:17, 9 April 2021 (UTC)
You are very welcome and I see the PDF well describing the tools used.--Jusjih (talk) 01:48, 10 April 2021 (UTC)