Wikisource:Scriptorium/Help
add 1939 report of the commission on the palestine disturbances of august 1929
I was wondering how best to start adding https://commons.wikimedia.org/w/index.php?title=File:Report_of_the_Commission_on_the_Palestine_Disturbances_of_August_1929_cmd_3530.djvu&page=2 , or https://unispal.un.org/pdfs/Cmd5479.pdf, here. I have two questions: first, how to add it at all, and second, how to handle the rather complex formatting. ThurnerRupert (talk) 00:55, 5 June 2024 (UTC)
- To add the text here, begin by creating the Index:Report of the Commission on the Palestine Disturbances of August 1929 cmd 3530.djvu. Then proofread each page to match the original. If the formatting is complex, then this might not be a good choice for creating your first work here. You might try some of the community collaboration works listed through the main page before starting a challenging work. --EncycloPetey (talk) 00:59, 5 June 2024 (UTC)
Question about updating PDF later
So I am working on scanning a large U.S. government online document into a PDF. I currently have 47 pages. I know the Commons allows you to post a new version of an image/PDF. My question is: would that screw up stuff over here on Wikisource? Say I uploaded the PDF as-is right now, with 47 pages, to work on transcribing those (basically 3 chapters of it are done). If I then later go back and scan in more images, should I upload that as an entirely separate file on the Commons, or just upload a new version of the PDF? Basically, what should I do in this circumstance? Note that there are probably 200-ish (or more) pages in the document. WeatherWriter (talk) 20:36, 7 June 2024 (UTC)
- It's best to transcribe from a complete PDF. Yes, altering a PDF during transcription can create problems. This is one reason we have a setting option on the Index page to indicate a scan needs to be repaired before proofreading. --EncycloPetey (talk) 20:47, 7 June 2024 (UTC)
- WeatherWriter: If you’re just adding pages at the end of the document, then there shouldn’t be any problem. It is not best practice, however. TE(æ)A,ea. (talk) 21:52, 7 June 2024 (UTC)
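The difference between appending pages and inserting them can be illustrated with a small sketch. It assumes only that each transcription lives at a title of the form Page:Filename.pdf/N, where N is the position of the page in the file; the function names and the example file name are hypothetical and are not part of any Wikisource tool.

```python
def page_titles(filename, pages):
    """Map each Page: title to the scan content found at that position."""
    return {f"Page:{filename}/{i}": content
            for i, content in enumerate(pages, start=1)}


def mismatched_titles(filename, old_pages, new_pages):
    """List existing Page: titles whose scan content changes after re-upload."""
    old = page_titles(filename, old_pages)
    new = page_titles(filename, new_pages)
    # A title is affected when the new file shows different content there.
    return [title for title in old if new.get(title) != old[title]]


# Hypothetical three-page scan, then two possible re-uploads.
original = ["title page", "chapter 1", "chapter 2"]
appended = original + ["chapter 3"]
inserted = ["title page", "errata", "chapter 1", "chapter 2"]

print(mismatched_titles("Example.pdf", original, appended))
print(mismatched_titles("Example.pdf", original, inserted))
```

Under this assumption, appending leaves every existing title pointing at the same scan, while inserting a page in the middle shifts every later transcription away from its image.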
Deciphering bl capital
Sort of silly question, but there, I can't manage to understand what the first letter of the title is, it's in a {{bl}} variant I'm not familiar with. (there is no TOC in this book, so can't use that). It's not a common word either, ending in "idigeigi". Google OCR says it's an H, but it doesn't look like that. A G, maybe? — Alien333 (what I did & why I did it wrong) 15:23, 10 June 2024 (UTC)
- Ha, that's a tricky one, but there's an identical character on page 26 which confirms that it's an H. —Beleg Tâl (talk) 15:29, 10 June 2024 (UTC)
- Good find, thank you. — Alien333 (what I did & why I did it wrong) 15:49, 10 June 2024 (UTC)
Wrong text layer and OCR
On this page, the text layer is offset (for some reason). However, when I tried to generate OCR for the page, it generated OCR from the same (wrong) page! TE(æ)A,ea. (talk) 23:43, 12 June 2024 (UTC)
- It's not offset for me, at least as far as I can tell. --EncycloPetey (talk) 23:55, 12 June 2024 (UTC)
- I've created the text from what I see. If it is the correct page, then you may need to clear your browser cache to see the correct page. --EncycloPetey (talk) 23:58, 12 June 2024 (UTC)
- EncycloPetey: What does this page look like to you? To me it looks like the start of the next section. If that is the case, then the text layer and OCR generation bases are both correct, but the images are offset. TE(æ)A,ea. (talk) 01:19, 13 June 2024 (UTC)
- To me it's just page 389, and no new section. --EncycloPetey (talk) 01:33, 13 June 2024 (UTC)
- Something is very weird here. When the page first loads it shows one page scan, and then the page image reloads and shows a different page. The text layer loaded into the editor seems to be the one for the page that was briefly displayed, and so does not match the new page image. Since I've never visited this page before it can't be a browser cache issue. But it could still be a MediaWiki thumbnail cache issue, or that Proofread Page is getting stale text data from the API. @Sohom Datta: You may be interested in this issue. Xover (talk) 06:15, 13 June 2024 (UTC)
- But that's not happening to me, so it's not occurring universally for everyone. FWIW, I'm running Firefox on a Mac OS. --EncycloPetey (talk) 07:25, 13 June 2024 (UTC)
- For me, I get the image of page 391 and the OCR of page 389, though I'm not sure which is offset. (Firefox, Ubuntu) — Alien333 (what I did & why I did it wrong) 07:58, 13 June 2024 (UTC)
- I'm also seeing page 389 for both the text and image (which matches the downloaded PDF, for the record). Arcorann (talk) 12:18, 13 June 2024 (UTC)
- I’ve tried from Edge and Google Chrome on two different computers—neither of which has a cache for the page—and both show the right OCR (p. 389) but the wrong page (p. 391, which is the beginning of the next section). This makes it rather difficult to proofread. TE(æ)A,ea. (talk) 15:02, 13 June 2024 (UTC)
Tilted manually scanned pages
This book was manually scanned, and the resulting text layer needs to be retyped manually.
I can straighten the page (one of the many), but then, so what? What options do I have? — ineuw (talk) 04:04, 16 June 2024 (UTC)
- @Ineuw: I don't understand the question. Why do you want to straighten this page? If you're asking how to straighten all pages in this scan then I advise against it: it's a lot of manual work, and the benefits are limited. Xover (talk) 06:28, 16 June 2024 (UTC)
- Thanks. That's what I thought, but needed an experienced opinion. I will retype it when needed. — ineuw (talk) 06:33, 16 June 2024 (UTC)
I am unsure what I did, but the pagelist for the PDF is not displaying. I tried to display commons:File:Tropical Cyclone Report – Hurricane Katrina.pdf. WeatherWriter (talk) 16:07, 18 June 2024 (UTC)
- Fixed. In general, you can try purging the file page on enWS (e.g. File:Tropical Cyclone Report – Hurricane Katrina.pdf) by adding ?action=purge to the end of the URL, then purging the index page. —CalendulaAsteraceae (talk • contribs) 17:43, 18 June 2024 (UTC)
- I think it works better to purge it at Commons; also, ?action=purge only works (I think) if you're already in index.php, so it's simpler to use one of the gadgets that do that. — Alien333 (what I did & why I did it wrong) 07:12, 19 June 2024 (UTC)
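For anyone who prefers to build the purge link by hand, here is a minimal sketch that always produces the index.php form of the URL, which sidesteps the question of whether the shorter form works. The helper name purge_url is hypothetical, and the /w/index.php script path is an assumption based on the Commons link quoted earlier on this page.

```python
from urllib.parse import urlencode


def purge_url(title, host="en.wikisource.org"):
    """Build an index.php URL that asks the wiki to purge the given page."""
    # MediaWiki titles conventionally use underscores in place of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "action": "purge"})
    return f"https://{host}/w/index.php?{query}"


# Purge the file page on Commons first, then the index page on enWS.
print(purge_url("File:Example.pdf", host="commons.wikimedia.org"))
print(purge_url("Index:Example file.djvu"))
```

The colon in the namespace prefix comes out percent-encoded as %3A, which is an equivalent encoding of the same title. The file and index names here are placeholders.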
Scan resolution (question for the technical people)
I'm getting frustrated with the poor quality of the scan image when proofreading A Dictionary of Hymnology. Have a look at Page:Dictionary of Hymnology 1908.pdf/44—the fine print is barely legible, even though I have increased the "Scan resolution in edit mode" to 2000. When viewing the PDF directly, the print is perfectly crisp.
I am guessing that the Wikimedia software takes the scan image at its default resolution, heavily JPG-compresses it, then increases the resolution of the compressed image, rather than scaling up before converting and compressing. This results in high-fidelity images of JPEG artefacts instead of actually usable scan images. I also have found a related task phab:T38597, to replace JPG with PNG in these images, which would presumably mitigate this issue—but this ticket is ten years old and hasn't been touched for years.
Anyway, my question is this: is there any way to improve the scan image inside ProofreadPage? Or do I just have to open the PDF in a separate window (which is what I have been doing)? —Beleg Tâl (talk) 18:06, 2 July 2024 (UTC)
- Don't know much about it, but there was a discussion a few months ago about the same problem and there the answer given was to use DjVu, not PDF. — Alien333 (what I did & why I did it wrong) 18:36, 2 July 2024 (UTC)
- Lol thanks, should have searched the archives first :D —Beleg Tâl (talk) 18:49, 2 July 2024 (UTC)
- Taking a quick look at the code, the PdfHandler extension generates jpgs which are then retrieved by us. Which jpg is retrieved might vary but it doesn't regenerate the images at a higher resolution if the original conversion is a poor representation. MarkLSteadman (talk) 21:31, 2 July 2024 (UTC)
- It may be possible to regenerate the pdf outside and then upload it such that the conversion goes smoother. MarkLSteadman (talk) 21:36, 2 July 2024 (UTC)