Wikisource:Scriptorium/Help

The Scriptorium is Wikisource's community discussion page. This subpage is especially designated for requests for help from more experienced Wikisourcers. Feel free to ask questions or leave comments. You may join any current discussion or start a new one. Project members can often be found in the #wikisource IRC channel (a web client is available).

Have you seen our help pages and FAQs?

Add 1939 report of the Commission on the Palestine Disturbances of August 1929

I was wondering how best to start putting https://commons.wikimedia.org/w/index.php?title=File:Report_of_the_Commission_on_the_Palestine_Disturbances_of_August_1929_cmd_3530.djvu&page=2 , or alternatively https://unispal.un.org/pdfs/Cmd5479.pdf , here. There are two aspects: first, how to add it at all, and second, how to handle the rather complex formatting. ThurnerRupert (talk) 00:55, 5 June 2024 (UTC)

To add the text here, begin by creating the Index:Report of the Commission on the Palestine Disturbances of August 1929 cmd 3530.djvu. Then proofread each page to match the original. If the formatting is complex, then this might not be a good choice for creating your first work here. You might try some of the community collaboration works listed through the main page before starting a challenging work. --EncycloPetey (talk) 00:59, 5 June 2024 (UTC)

Question about updating PDF later

So I am working on scanning a large U.S. government online document into a PDF. I currently have 47 pages. I know the Commons allows you to post a new version of an image/PDF. My question is: would that screw up stuff over here on Wikisource? So say I uploaded the PDF as-is right now with 47 pages to work on transcribing those (basically 3 chapters of it done). If I then later go back and scan in more images, should I upload that as an entirely separate thing on the Commons or just upload a new version of the PDF? Basically, what should I do in this circumstance? Note, there are probably 200-ish (or more) pages in the document. WeatherWriter (talk) 20:36, 7 June 2024 (UTC)

It's best to transcribe from a complete PDF. Yes, altering a PDF during transcription can create problems. This is one reason we have a setting option on the Index page to indicate a scan needs to be repaired before proofreading. --EncycloPetey (talk) 20:47, 7 June 2024 (UTC)

WeatherWriter: If you’re just adding pages at the end of the document, then there shouldn’t be any problem. It is not best practice, however. TE(æ)A,ea. (talk) 21:52, 7 June 2024 (UTC)
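
The difference between appending and inserting pages can be illustrated with a small sketch. It assumes that each transcribed page is tied to its position in the scan (as in titles like Page:Dictionary of Hymnology 1908.pdf/44); the file name `Example.pdf` and the page contents are made up for illustration, and this is not how the software itself is implemented.

```python
def page_titles(index_name, scan_pages):
    """Map each Page: title (keyed by 1-based scan position) to the scan content there."""
    return {
        f"Page:{index_name}/{pos}": content
        for pos, content in enumerate(scan_pages, start=1)
    }


def misaligned(index_name, old_scan, new_scan):
    """Return the existing Page: titles whose scan content differs after re-upload."""
    before = page_titles(index_name, old_scan)
    after = page_titles(index_name, new_scan)
    return sorted(title for title in before if after.get(title) != before[title])


# Hypothetical scans: the same file before and after a new upload.
old_scan = ["cover", "title", "chapter 1"]
appended = old_scan + ["chapter 2"]                     # new pages at the end
inserted = ["cover", "errata", "title", "chapter 1"]    # new page in the middle

print(misaligned("Example.pdf", old_scan, appended))
print(misaligned("Example.pdf", old_scan, inserted))
```

Under that assumption, appending leaves every existing transcription aligned with its image, while inserting a page shifts every transcription after the insertion point.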

Deciphering bl capital

Sort of silly question, but there, I can't manage to understand what the first letter of the title is; it's in a {{bl}} variant I'm not familiar with. (There is no TOC in this book, so I can't use that.) It's not a common word either, ending in "idigeigi". Google OCR says it's an H, but it doesn't look like that. A G, maybe? — Alien333 (what I did & why I did it wrong) 15:23, 10 June 2024 (UTC)

Ha, that's a tricky one, but there's an identical character on page 26 which confirms that it's an H. —Beleg Tâl (talk) 15:29, 10 June 2024 (UTC)

Good find, thank you. — Alien333 (what I did & why I did it wrong) 15:49, 10 June 2024 (UTC)

Wrong text layer and OCR

On this page, the text layer is offset (for some reason). However, when I tried to generate OCR for the page, it generated OCR from the same (wrong) page! TE(æ)A,ea. (talk) 23:43, 12 June 2024 (UTC)

It's not offset for me, at least as far as I can tell. --EncycloPetey (talk) 23:55, 12 June 2024 (UTC)

I've created the text from what I see. If it is the correct page, then you may need to clear your browser cache to see the correct page. --EncycloPetey (talk) 23:58, 12 June 2024 (UTC)

EncycloPetey: What does this page look like to you? To me it looks like the start of the next section. If that is the case, then the text layer and OCR generation bases are both correct, but the images are offset. TE(æ)A,ea. (talk) 01:19, 13 June 2024 (UTC)
To me it's just page 389, and no new section. --EncycloPetey (talk) 01:33, 13 June 2024 (UTC)
Something is very weird here. When the page first loads it shows one page scan, and then the page image reloads and shows a different page. The text layer loaded into the editor seems to be the one for the page that was briefly displayed, and so does not match the new page image. Since I've never visited this page before it can't be a browser cache issue. But it could still be a MediaWiki thumbnail cache issue, or that Proofread Page is getting stale text data from the API. @Sohom Datta: You may be interested in this issue. Xover (talk) 06:15, 13 June 2024 (UTC)
But that's not happening to me, so it's not occurring for everyone. FWIW, I'm running Firefox on macOS. --EncycloPetey (talk) 07:25, 13 June 2024 (UTC)
For me, I get the image of page 391 and the OCR of page 389, though I'm not sure which is offset. (Firefox on Ubuntu) — Alien333 (what I did & why I did it wrong) 07:58, 13 June 2024 (UTC)

I'm also seeing page 389 for both the text and image (which matches the downloaded PDF, for the record). Arcorann (talk) 12:18, 13 June 2024 (UTC)
I’ve tried from Edge and Google Chrome on two different computers—neither of which has a cache for the page—and both show the right OCR (p. 389) but the wrong page (p. 391, which is the beginning of the next section). This makes it rather difficult to proofread. TE(æ)A,ea. (talk) 15:02, 13 June 2024 (UTC)
Was this resolved? — ineuw (talk)

Tilted manually scanned pages

Righted tilted page

This book was manually scanned, and the resulting text layer needs to be retyped manually.

I can straighten the page (one of the many), but then, so what? What options do I have? — ineuw (talk) 04:04, 16 June 2024 (UTC)

@Ineuw: I don't understand the question. Why do you want to straighten this page? If you're asking how to straighten all pages in this scan then I advise against it: it's a lot of manual work, and the benefits are limited. Xover (talk) 06:28, 16 June 2024 (UTC)

Thanks. That's what I thought, but I needed an experienced opinion. I will retype it when needed. — ineuw (talk) 06:33, 16 June 2024 (UTC)

Help fixing Index:Tropical Cyclone Report – Hurricane Katrina.pdf

I am unsure what I did, but the pagelist for the PDF is not displaying. I tried to display commons:File:Tropical Cyclone Report – Hurricane Katrina.pdf. WeatherWriter (talk) 16:07, 18 June 2024 (UTC)

Fixed. In general, you can try purging the file page on enWS (e.g. File:Tropical Cyclone Report – Hurricane Katrina.pdf) by adding ?action=purge to the end of the URL, then purging the index page. —CalendulaAsteraceae (talk • contribs) 17:43, 18 June 2024 (UTC)

I think it works better to purge it at Commons, and also the ?action=purge only works (I think) if you're already in index.php, so it's simpler to use one of the gadgets that do that. — Alien333 (what I did & why I did it wrong) 07:12, 19 June 2024 (UTC)
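
For anyone who prefers to build the purge link by hand, here is a minimal sketch. It assumes the usual `/w/index.php` entry point seen in Wikimedia URLs; the helper name `purge_url` is made up for this example. Because the title goes in the query string, `action=purge` is joined with `&` rather than a second `?`.

```python
from urllib.parse import urlencode


def purge_url(site, title):
    """Build an index.php URL that asks the wiki to purge its cached copy of a page."""
    query = urlencode({"title": title.replace(" ", "_"), "action": "purge"})
    return f"https://{site}/w/index.php?{query}"


# Purge the file page first (on Commons or enWS), then the index page.
file_title = "File:Tropical Cyclone Report – Hurricane Katrina.pdf"
index_title = "Index:Tropical Cyclone Report – Hurricane Katrina.pdf"
for site, title in [
    ("commons.wikimedia.org", file_title),
    ("en.wikisource.org", file_title),
    ("en.wikisource.org", index_title),
]:
    print(purge_url(site, title))
```

Opening such a link while logged in may show a confirmation prompt before the purge runs; the purge gadgets mentioned above do the same job with one click.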

Scan resolution (question for the technical people)

I'm getting frustrated with the poor quality of the scan image when proofreading A Dictionary of Hymnology. Have a look at Page:Dictionary of Hymnology 1908.pdf/44—the fine print is barely legible, even though I have increased the "Scan resolution in edit mode" to 2000. When viewing the PDF directly, the print is perfectly crisp.

I am guessing that the Wikimedia software takes the scan image at its default resolution, heavily JPG-compresses it, then increases the resolution of the compressed image, rather than scaling up before converting and compressing. This results in high-fidelity images of JPEG artefacts instead of actually usable scan images. I also have found a related task phab:T38597, to replace JPG with PNG in these images, which would presumably mitigate this issue—but this ticket is ten years old and hasn't been touched for years.

Anyway, my question is this: is there any way to improve the scan image inside ProofreadPage? Or do I just have to open the PDF in a separate window (which is what I have been doing)? —Beleg Tâl (talk) 18:06, 2 July 2024 (UTC)

Don't know much about it, but there was a discussion a few months ago about the same problem and there the answer given was to use DjVu, not PDF. — Alien333 (what I did & why I did it wrong) 18:36, 2 July 2024 (UTC)

Lol thanks, should have searched the archives first :D —Beleg Tâl (talk) 18:49, 2 July 2024 (UTC)

Taking a quick look at the code, the PdfHandler extension generates JPGs, which are then retrieved by us. Which JPG is retrieved might vary, but it doesn't regenerate the images at a higher resolution if the original conversion is a poor representation. MarkLSteadman (talk) 21:31, 2 July 2024 (UTC)

It may be possible to regenerate the PDF outside and then upload it such that the conversion goes more smoothly. MarkLSteadman (talk) 21:36, 2 July 2024 (UTC)

I use User:Inductiveload/jump to file, which is a very useful workaround if the file is from one of the sources it supports, although it is a workaround rather than a proper fix. —CalendulaAsteraceae (talk • contribs) 02:00, 3 July 2024 (UTC)
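
To check which JPG the server hands back for a given page and width, one option is to ask the MediaWiki API directly. The sketch below only builds the query URL and makes no request. It assumes the `imageinfo` parameters `iiurlwidth` and `iiurlparam` (with a value such as `page44-2000px` for paged files) behave as described in the API documentation, and that the file is hosted on Commons; the helper name `thumb_query_url` is made up, so treat this as something to verify rather than a confirmed recipe.

```python
from urllib.parse import urlencode


def thumb_query_url(site, file_title, page, width):
    """Build an api.php query asking for the thumbnail URL of one page of a paged file."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "titles": file_title,
        "iiprop": "url",
        # Assumption: iiurlwidth must agree with the width named in iiurlparam.
        "iiurlwidth": width,
        "iiurlparam": f"page{page}-{width}px",
    }
    return f"https://{site}/w/api.php?{urlencode(params)}"


# Example values taken from the page discussed above; the hosting site is assumed.
print(thumb_query_url(
    "commons.wikimedia.org", "File:Dictionary of Hymnology 1908.pdf", 44, 2000
))
```

If the response includes a thumbnail URL, opening that image on its own may show whether the blur is already present in the server-generated JPG or is introduced later in the browser.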