Samwilson
mw.wikibase.getEntityIdForTitle
[OMG a nude page, let me help resolve that]
I just saw announced the lua extension mw.wikibase.getEntityIdForTitle and if I am not mistaken that could be a joyous little bundle of helpfulness for us.
Here I am thinking where we have an author page, and a related biographical page in main ns, and working out whether we can poke a wikipedia = parameter on the respective main ns page, or maybe automating a link; similarly I can see the potential for us to more readily get some bot action to better apply "main subject (P921)" at Wikidata for our biographical works. Am I reading the function properly? — billinghurst sDrewth 22:37, 16 April 2018 (UTC)
- @Billinghurst: Interesting! So you mean create a link from the NS0 page of e.g. a biography chapter to the Author NS of the bio's subject? If the bio has a P921, couldn't we link via that (i.e. bio page → sitelink → P921 → Qxx → sitelink → Author page)? I'm not quite getting when we'd need to do a page title look-up... or do you mean, as a means to find unlinked articles? That must be it. So we'd do a getEntityIdForTitle('NS0 Page Name') and see if it comes up with an instance of person, and if it does we'd add some thing to alert editors here to the fact? Sam Wilson 06:27, 17 April 2018 (UTC)
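The check Sam describes (look up the item for a page title, then see whether it is an instance of a person) can be sketched roughly as below. This is JavaScript against the Wikidata web API rather than the Lua function under discussion, so it only illustrates the logic. The `wbgetentities` parameters and the P31/Q5 test are standard Wikidata conventions; the function names and the sample page title are made up for illustration.

```javascript
// Build the query string for a Wikidata wbgetentities request that
// resolves a page title on a given site (e.g. "enwikisource") to its item.
// Function name is hypothetical.
function entityLookupQuery(site, title) {
  return new URLSearchParams({
    action: "wbgetentities",
    sites: site,
    titles: title,
    props: "claims",
    format: "json",
  }).toString();
}

// Return true if the entity JSON has an "instance of" (P31) claim
// whose value is "human" (Q5).
function isInstanceOfHuman(entity) {
  const claims = (entity && entity.claims && entity.claims.P31) || [];
  return claims.some(function (claim) {
    const value =
      claim.mainsnak && claim.mainsnak.datavalue && claim.mainsnak.datavalue.value;
    return Boolean(value) && value.id === "Q5";
  });
}

// Example with a hypothetical page title.
console.log(entityLookupQuery("enwikisource", "Some Biography"));
```

Pages whose item passes `isInstanceOfHuman` could then be flagged for editors, as suggested above; whether the main-namespace item or its P921 target is the right thing to test is a separate question.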
Books & Bytes - Issue 27
Books & Bytes
Issue 27, February – March 2018
- #1Lib1Ref
- New collections
- Alexander Street (expansion)
- Cambridge University Press (expansion)
- User Group
- Global branches update
- Wiki Indaba Wikipedia + Library Discussions
- Spotlight: Using librarianship to create a more equitable internet: LGBTQ+ advocacy as a wiki-librarian
- Bytes in brief
Arabic, Chinese and French versions of Books & Bytes are now available in meta!
Read the full newsletter
Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 14:49, 18 April 2018 (UTC)
Your feedback matters: Final reminder to take the global Wikimedia survey
Hello! This is a final reminder that the Wikimedia Foundation survey will close on 23 April, 2018 (07:00 UTC). The survey is available in various languages and will take between 20 and 40 minutes. Take the survey now.
If you already took the survey - thank you! We will not bother you again. We have designed the survey to make it impossible to identify which users have taken the survey, so we have to send reminders to everyone. To opt-out of future surveys, send an email through EmailUser feature to WMF Surveys. You can also send any questions you have to this user email. Learn more about this survey on the project page. This survey is hosted by a third-party service and governed by this Wikimedia Foundation privacy statement.
Unused files as a list?
Do you know a way to manipulate Special:UnusedFiles so I can get it as an easy list? There are a string of files there that I know that I can straight out delete, though how to get it as a list to easily manipulate in bite size chunks is just not obvious. It is not even obvious that you can pull it from the API, not that I can generate simple text lists from the API anyway — billinghurst sDrewth 04:06, 16 May 2018 (UTC)
- @Billinghurst: It doesn't look like it. That special page isn't transcludable even, and it's constructing the database query itself so I suspect the same query isn't done anywhere else (or we'd be reusing it). Also it's the only place mw:Manual:$wgCountCategorizedImagesAsUsed is used. What sort of list are you trying to build? It probably wouldn't be too hard to add transcluding support, if that'd help. Sam Wilson 04:38, 16 May 2018 (UTC)
- There are works there that have been completed where the original image has been cleaned/gleaned/screened and uploaded to Commons. So we have the residue images to cleanse, and getting these url by url is a PITA. Getting a list, checking the work completion, and zapping more collectively is bettererer. Noting that prefix lists are unreliable in case one/some aren't done. — billinghurst sDrewth 05:23, 16 May 2018 (UTC)
- Dropped the problem into phab:T194865 — billinghurst sDrewth 01:44, 17 May 2018 (UTC)
- Note that a File linked via {{raw image}} is still considered 'unused'. In pywikibot: python scripts/listpages.py -unusedfiles. — Mpaa (talk) 17:36, 19 May 2018 (UTC)
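For a plain list without pywikibot, one possible route is the MediaWiki API's `list=querypage` module. The sketch below assumes Special:UnusedFiles is exposed there under the name `Unusedimages` and that results come back under `query.querypage.results`; treat both as assumptions to verify against the live API. The function names are made up for illustration.

```javascript
// Build an API URL asking for one batch of the "unused files" query page.
// Assumes the page is registered with list=querypage as "Unusedimages".
function unusedFilesUrl(limit, offset) {
  const params = new URLSearchParams({
    action: "query",
    list: "querypage",
    qppage: "Unusedimages",
    qplimit: String(limit),
    format: "json",
  });
  if (offset) {
    params.set("qpoffset", String(offset));
  }
  return "https://en.wikisource.org/w/api.php?" + params.toString();
}

// Pull the page titles out of a parsed API response, tolerating
// a missing or empty result set.
function titlesFromQueryPage(response) {
  const page = response && response.query && response.query.querypage;
  const results = (page && page.results) || [];
  return results.map(function (row) {
    return row.title;
  });
}

// Print one title per line for a small hand-made sample response.
const sample = {
  query: { querypage: { name: "Unusedimages", results: [{ ns: 6, title: "File:Example.png" }] } },
};
console.log(titlesFromQueryPage(sample).join("\n"));
```

Fetching the URL in bite-size batches (raising the offset each time) gives chunks that can be pasted into a text file and checked against work completion before deleting.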
Ping
Hi. Just in case you have not been notified about this: https://phabricator.wikimedia.org/T194861 . It is happening quite often recently. Bye — Mpaa (talk) 20:35, 18 May 2018 (UTC)
Books & Bytes – Issue 28
Books & Bytes
Issue 28, April – May 2018
- #1Bib1Ref
- New partners
- User Group update
- Global branches update
- Wikipedia Library global coordinators' meeting
- Spotlight: What are the ten most cited sources on Wikipedia? Let's ask the data
- Bytes in brief
Arabic, Chinese, Hindi, Italian and French versions of Books & Bytes are now available in meta!
Read the full newsletter
Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 19:33, 20 June 2018 (UTC)
Meeting followup
Hi Sam, Thanks for being there today. Lots of stuff half heard, half understood, to try to follow up on. One thing you mentioned was some form of mapping using wikipedia when data have been uploaded to the commons. I was curious about this as I dislike my Rgooglemaps: they are too fuzzy. Nor am I mad about my Australian outline maps (produced using SAS), so another technique would be good.... MargaretRDonald (talk) 13:27, 27 June 2018 (UTC)
@MargaretRDonald: There's a new thing called Kartographer that can show data on maps pretty easily. For example, at right is the Cuscuta australis data we were looking at yesterday. The colours and styles and things can all be customised, and the data doesn't have to live in the wiki page (as I've done in this example). —Sam Wilson 01:25, 28 June 2018 (UTC)
- @Samwilson: Thanks for this. (Only just spotted...) MargaretRDonald (talk) 02:07, 6 July 2018 (UTC)
- @Samwilson: Sorry to be so thick. But here in your text you have listed all the co-ordinates... and of course, the map is embedded in the page.. Writing code to generate the mark-up looks a smidgin ugly. So I am not quite sure how this is easier, or conceptually better from a wikipedian point of view (?) MargaretRDonald (talk) 02:14, 6 July 2018 (UTC)
- No, the idea would be to include the coordinates (in KML format) in a template in the manner of e.g. wikipedia:Template:Attached KML/High Street, Fremantle. Then, to update the range map, only that template would need to be changed and the article map would update automatically from there. I'm not sure if it is easier, but it does make the map zoomable, and perhaps is quicker than creating separate raster map files and uploading them. Just an idea though! :) —Sam Wilson 06:50, 6 July 2018 (UTC)
- @Samwilson: Thanks for the explanation, Sam. MargaretRDonald (talk) 16:54, 20 January 2020 (UTC)
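On the point about generating the mark-up: the conversion from a list of coordinates to something Kartographer can draw is fairly mechanical. The sketch below assumes the data is a plain list of latitude/longitude pairs and that the `<mapframe>` tag accepts inline GeoJSON; the sample coordinates and function names are invented for illustration, not taken from the Cuscuta australis data.

```javascript
// Convert [latitude, longitude] pairs into a GeoJSON FeatureCollection.
// Note that GeoJSON stores coordinates as [longitude, latitude].
function pointsToGeoJson(points) {
  return {
    type: "FeatureCollection",
    features: points.map(function (pair) {
      return {
        type: "Feature",
        properties: {},
        geometry: { type: "Point", coordinates: [pair[1], pair[0]] },
      };
    }),
  };
}

// Wrap the GeoJSON in a <mapframe> tag (width and height in pixels).
function pointsToMapframe(points, width, height) {
  const body = JSON.stringify(pointsToGeoJson(points), null, 2);
  return '<mapframe width="' + width + '" height="' + height + '">\n' + body + "\n</mapframe>";
}

// Two made-up occurrence records.
console.log(pointsToMapframe([[-32.05, 115.75], [-27.47, 153.03]], 300, 300));
```

Keeping the generated data in a separate template or data page, as suggested above, means only that one page changes when the range data is updated.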
seeing other wikisources
@Samwilson: Hi, Sam. It would be very nice if one could see all the corresponding wikisource things on the left as one can in wikipedia or as one can in wikidata. I am constantly seeking other language sources for botanical stuff and would be nice to be able to navigate (relatively) easily to other language sources..... Any thoughts? MargaretRDonald (talk) 02:04, 6 July 2018 (UTC)
- @MargaretRDonald: Yes, this is a definitely wanted thing, and is being worked on as Phabricator:T180303. The trouble with Wikisource interlinking, compared to other projects, is that works in different languages don't get directly linked to the same Wikidata item, but rather each get their own (which has a 'edition or translation of' property that links to the unifying work-level item). —Sam Wilson 06:53, 6 July 2018 (UTC)
- @Samwilson: Hmmm. (I see) I look forward to all those clever persons making it happen sometime.... Cheers, MargaretRDonald (talk) 07:14, 6 July 2018 (UTC)
Living authors category again
Hi Sam. First I would like to thank you a lot for handling the floruit problem at the template {{Author}} and thus partly solving the Living people category.
There are also some authors who do not have the floruit property filled at Wikidata, because they are not known because of a one-date event, but who were known for a longer time. Such people can have Wikidata properties "work period (start)" and "work period (end)" instead. An example of this is Author:Mordach Mackenzie (Q56612310) whose birth and death dates are unknown and who is known for his work between 1746 and 1764. Do you think it would be possible that a) the author's page at Wikisource could take these dates from Wikidata and display them as "fl. 1746–1764" and b) remove the authors whose "work period (end)" was more than e.g. 90 or 110 years ago from the Living people category too?
I am writing you because you are the only one here I know that can handle such things (though I believe there are more people like that). However, it is not of the highest importance, so if you do not have enough time, it can wait. Thanks. --Jan Kameníček (talk) 11:30, 18 September 2018 (UTC)
- Yes, that sounds like a great idea! I did see your comment on that other page; sorry I didn't reply yet. I'm keen to help, not sure when I'll find time, but it's conceptually the same thing we're already doing but just with a different property, so it shouldn't be too hard. There are currently 7 failing tests that I want to fix up before embarking on any new features though, so I might try to do them first. Will keep you posted! Sam Wilson 03:15, 19 September 2018 (UTC)
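The two rules requested above can be sketched as follows. This is only an illustration of the logic in JavaScript, not the code behind {{Author}}; it assumes the "work period (start)" and "work period (end)" values have already been reduced to plain years, and the 110-year threshold is one of the two figures suggested in the request. Function names are hypothetical.

```javascript
// Format a floruit string from work-period start and end years.
// Either year may be null or undefined if Wikidata has no value.
function floruitLabel(startYear, endYear) {
  if (startYear == null && endYear == null) {
    return "";
  }
  if (startYear != null && endYear != null && startYear !== endYear) {
    return "fl. " + startYear + "\u2013" + endYear; // en dash between years
  }
  return "fl. " + (startYear != null ? startYear : endYear);
}

// Decide whether an author should be left out of the Living people
// category because their work period ended too long ago.
function workPeriodImpliesDeceased(endYear, currentYear, thresholdYears) {
  return endYear != null && currentYear - endYear > thresholdYears;
}

console.log(floruitLabel(1746, 1764)); // Mordach Mackenzie's work period
console.log(workPeriodImpliesDeceased(1764, 2018, 110));
```

As noted in the reply, this is conceptually the same check as the existing floruit handling, just driven by a different pair of properties.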
PageCleanUp feature request
Hi,
Just a note to make a record of our recent conversation about my feature request for your very useful PageCleanUp.js tool:
If a full stop (period) is followed by a lower-case letter:
- Some text. then some more
then it should probably be a comma:
- Some text, then some more
If a comma is followed by a capital letter:
- Some text, Different text
then (proper names notwithstanding) it should probably be a full stop:
- Some text. Different text
If this is not a major issue for most OCRd text, perhaps a separate script would be better. What do you think? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:31, 3 October 2018 (UTC)
- Also, perhaps the script could fix ligatures, like the "fi" and "fl" in "magnificent power of flight"? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:14, 5 October 2018 (UTC)
- @Pigsonthewing: dots and commas done, good idea! As for ligatures, Wikisource:Style_guide/Orthography#Ligatures suggests that we not use them as search engines struggle. I suspect that's wildly out of date. We do avoid e.g. the long 's', because it's "just" orthography and so not relevant to the text. Also, there are ligatures (e.g. st) that don't exist in many fonts at all. Sam Wilson 22:58, 9 October 2018 (UTC)
- Sorry if I wasn't clear; I meant the script could change from ligatures generated by OCR to regular letter pairs. Thanks for the punctuation feature. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 00:25, 10 October 2018 (UTC)
- @Pigsonthewing: Oh! Ha, yes I see now. Done! :) Sam Wilson 05:23, 10 October 2018 (UTC)
- That is going to save me a lot of dull drudgery. Thank you! Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:42, 10 October 2018 (UTC)
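The three fixes discussed in this thread can be illustrated with a few regular expressions. This is a minimal sketch and not the actual PageCleanUp.js code; the function name is made up, and the comma-to-full-stop rule will misfire on proper names and on "I", as the request itself acknowledges.

```javascript
// Map of OCR-produced ligature characters to ordinary letter sequences.
const LIGATURE_MAP = {
  "\uFB00": "ff",
  "\uFB01": "fi",
  "\uFB02": "fl",
  "\uFB03": "ffi",
  "\uFB04": "ffl",
};

function cleanOcrPunctuation(text) {
  return text
    // Expand ligatures such as U+FB01 (fi) and U+FB02 (fl).
    .replace(/[\uFB00-\uFB04]/g, function (ch) {
      return LIGATURE_MAP[ch];
    })
    // A full stop followed by spaces and a lower-case letter becomes a comma.
    .replace(/\.( +)(?=[a-z])/g, ",$1")
    // A comma followed by spaces and a capital letter becomes a full stop.
    .replace(/,( +)(?=[A-Z])/g, ".$1");
}

console.log(cleanOcrPunctuation("Some text. then some more"));
console.log(cleanOcrPunctuation("Some text, Different text"));
console.log(cleanOcrPunctuation("magni\uFB01cent power of \uFB02ight"));
```

The two punctuation rules cannot undo each other, since each one's output no longer matches the other's pattern, so the order in which they run does not matter.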
mediawiki-feeds 503?
Hi Sam, thanks for all the stuff you do!
Realized I wasn't subscribed to the Signpost anywhere and when I tried the RSS feed, Feedly said it couldn't reach, and clicking the feed got a 503 from Toolforge and it said you're the mediawiki-feeds maintainer.
Thought you might like to know. Hopefully it's just an old link or something else simple. John Abbe (talk) 17:32, 2 June 2019 (UTC)
- @John Abbe: Thanks for telling me about this! It made me realise that that tool isn't in my list of monitored tools, so I hadn't seen that it was down. That's fixed now, and so is the bug that was causing it to fail on the Signpost feed, and the tool is back online. See how it fares, and ping me with any dramas. :) Thanks! Sam Wilson 02:18, 3 June 2019 (UTC)
- Sweet! And thx for the quick fix. John Abbe (talk) 05:46, 5 June 2019 (UTC)
Requesting import of "Links count" gadget
Any chance you could help with: Wikisource:Scriptorium#Requesting import of "Links count" gadget, please? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:23, 20 June 2019 (UTC)
curly quotes script
Hi -- thanks for the very useful-looking script. How do I install it at my common.js? I tried adding importScript('User:Samwilson/CurlyQuotes.js');
but that didn’t work. Thanks for help Levana Taylor (talk) 16:05, 30 August 2019 (UTC)
- @Levana Taylor: Use this in your common.js:
mw.loader.load('//en.wikisource.org/w/index.php?title=User:Samwilson/CurlyQuotes.js&action=raw&ctype=text/javascript');
And let me know if you find any bugs with it! :-) It's adapted from https://github.com/gitenberg-dev/punctuation-cleanup —Sam Wilson 01:58, 31 August 2019 (UTC)
- Hmm… nice interface, mostly works, but I’ve already found some issues. Most notably, it’s not always correctly noticing paired apostrophes to leave straight: try it on this page, for example. Also it doesn't know to leave double-quotes alone if they’re inside angle brackets (I use <section begin="s1" /> for section breaks, for example). And, I’m sure this isn't the only case it doesn’t get right, but when you have italics inside double quotes ("Really?!") it doesn’t alter the double quotes. Here’s a thought for long-term development: since you’ll never be able to get it to work absolutely perfectly, the thing that’d make it really easy to check the results is if it highlights quotes in 3 different colors after finishing work, one for left, one for right, one for straight. Single and double quotes could be the same color: no need to have six! Levana Taylor (talk) 05:05, 31 August 2019 (UTC)
- Hey, sorry if that last comment was too negative! I’ve been using the script a lot and finding it extremely useful. I do have a list of stuff it isn't catching, though. Lemme know if you want it. Levana Taylor (talk) 04:53, 5 September 2019 (UTC)
- @Levana Taylor: Oh cool! Yes, please. I'm sure there are going to be a bunch of things we can't handle, but it'd be good to try. :) The quotes thing I'm looking at now, but I think the highlighting colour thing might be a bit harder. Or do you just mean in the preview, not the editing box? That might be easier. Anyway, thanks for the feedback! I'll try to improve the script. Sam Wilson 05:04, 5 September 2019 (UTC)
- @Levana Taylor: Have you found Wikisource:WikisourceMono by the way? That helps with more easily seeing the different characters while editing. Sam Wilson 05:55, 5 September 2019 (UTC)
- Yes, highlighting in preview would be helpful, even if it's not possible in the edit box. I find the font I'm using plenty readable, but the point is that you want the left-right pairs, or absence thereof, to jump out at you so your mind doesn't overlook them.
- Anyhow, I was wondering why you don’t have some simple rules like double quote at start of line is left, at end is right; single quote between two letters of the alphabet is right. Must be a reason I’m sure! As for suggestions:
- My biggest suggestion for improvement would be to drop all the things the script does with dashes and stick to just quotes. I never find the dashes useful and they are constantly messing things up, like pagenames that contain hyphens, and the <!-- comment markup (though I guess you must be finding the dash alterations useful since you put them in!)
- Bug noted: paired apostrophes before s and d are not being correctly interpreted
- The French d’ and l’ are major list items not yet being recognized. Then there’s ’twould - ’twill - ’twere - ’tisn’t - ’twasn’t (etc.) - ’midst - ’neath - ’bout - ’fraid - ’nother - ’uns - People in the novel I’m reading keep saying ’Gad! and ’Pon my honour! Levana Taylor (talk) 06:48, 5 September 2019 (UTC)
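The "simple rules" proposed above, plus the list of elided words, could look something like the sketch below. It is not the CurlyQuotes.js implementation; the function name and word list are illustrative, and it deliberately leaves every other quote straight so that the unhandled cases stay visible.

```javascript
// Words that begin with an apostrophe (elision), so the mark must be a
// right single quote even though it starts a word. "tisn" and "twasn"
// cover forms such as 'tisn't and 'twasn't.
const ELIDED_WORDS = [
  "twould", "twill", "twere", "tisn", "twasn", "tis", "twas",
  "midst", "neath", "bout", "fraid", "nother", "uns", "gad", "pon",
];

function curlSimpleCases(text) {
  const elided = new RegExp(
    "(^|[^A-Za-z])'(?=(?:" + ELIDED_WORDS.join("|") + ")(?![A-Za-z]))",
    "gim"
  );
  return text
    // Double quote at the start of a line is a left quote.
    .replace(/^"/gm, "\u201C")
    // Double quote at the end of a line is a right quote.
    .replace(/"$/gm, "\u201D")
    // Single quote between two letters is a right quote (don't, d'Artagnan).
    .replace(/([A-Za-z])'(?=[A-Za-z])/g, "$1\u2019")
    // Single quote before a known elided word is a right quote.
    .replace(elided, "$1\u2019");
}

console.log(curlSimpleCases("\"'Tisn't fair,\" he said.\n\"'Pon my honour!\""));
```

In the example the double quote after "fair," is in the middle of a line, so none of these rules touch it; cases like that still need the pairing logic, which is presumably why simple rules alone are not enough.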
Cloud Services and Toolforge
You wouldn't happen to be familiar with Cloud Services (WMCS, CVPS), Toolforge, and related infrastructure bits? I have some kinda hacky local tooling for working with DjVu files and Tesseract and was toying with the idea of trying to set up some related utilities of possibly general usefulness. But right now I'm banging my head against the wall of insufficient documentation for the stuff the WMF provides for hosting such things. In other words, I'm looking for someone familiar with the setup that's willing to answer dumb questions and provide some hand-holding. --Xover (talk) 15:28, 17 September 2019 (UTC)
- @Xover: We do have the shared account user:Wikisource-bot for bits, and there is a range of documentation at wikitech: and places. Mailing list is pretty good for support, and IRC can be useful (though you have to be in a good time zone). I am totally useless as a coder, though have managed to find handholders to allow me to bumble through. — billinghurst sDrewth 23:33, 17 September 2019 (UTC)
- @Xover: Yes sure, I'd be happy to help. I'm reasonably familiar with Toolforge. What issues are you having? Sam Wilson 03:37, 18 September 2019 (UTC)
- Well, mostly stupidity and documentation that seems to be written for a different audience than myself.
- My vague ideas to begin with are a replacement (possibly temporary) for Phe's OCR gadget, and possibly a tool that'll take images from some source (IA id, URL, zip file, etc.) and spit out a DjVu with a text layer. There's some related stuff that might be relevant, like an easy way to add and remove pages from a DjVu, or to redact pages from a DjVu (typically for copyright reasons). Not sure what all would make for tools that are 1) a reasonable amount of effort to get working and 2) of sufficiently general utility to be worth it. A short term alternative for the OCR gadget is the primary motivation as that seems to be pretty critical for several contributors.
- Right now I'm trying to figure out where and how it'd make sense to host something like that—Cloud VPS, Toolforge, or a third party general web host somewhere—and I'm just not finding the documentation that'll tell me what CVPS and Toolforge actually look like in terms of facilities, hosting environment, and so forth. As I said, the docs seem to be addressing different questions and for a different audience than me. So my first set of dumb questions / need for hand-holding is to figure out that.
- What I'd need is:
- A sufficiently Unix-y hosting environment. Fedora would be perfect, and RHEL or CentOS good second choices. Any good modern and not too esoteric Linux distro would probably be good enough, but experience tells me there are crucial differences in relevant sysadmin tools and package distribution/management systems and availability of packaged third-party software. Depending on what comes ready out of the box that may or may not be an issue.
- A non-ancient version of Perl 5, with a reasonable set of standard modules installed. Given the rate of change in perl-land, I don't imagine any OS the infrastructure team are willing to host would contain a version that's too old. I don't currently need a lot of esoteric perl modules, but by experience I expect to need to be able to install at least some. For example, I believe HTML::Parser was dropped from the core modules so that's something that would need to be available through some method.
- A not-too-esoteric CGI hosting environment. My experience is with Apache, possibly with mod_perl, but anything that supports Perl and can be tweaked for interactive performance would probably work.
- Tesseract 4.1 installed and functioning.
- GraphicsMagick in some recentish version. ImageMagick will probably do in a pinch, but my experience with GraphicsMagick is better.
- The ability to have such software updated in some reasonable timeframe. Whether that's done by the sysadmins as part of the platform, or whether that's something I'd sysadmin myself using the package tools, isn't all that important.
- I'm way past the age where I want to waste time compiling software from source, so I really really would prefer that can be handled through some kind of package system.
- I'd need a moderately large amount of disk space to play with, and moderately performant too (for OCR, disk IO quickly becomes a bottleneck). Several tens of gigs at least for temporary stuff: purged, depending on the case, on the timeframe of hours up to weeks. A gig or so per DjVu file, and room for at least 10 jobs' worth of files sitting around, would be the minimum reasonable. More is better.
- For the batch OCR stuff I'd eat all the CPU I could get too, but for the per-page stuff anything that isn't completely choked would probably work well enough.
- It would be a bonus if I had easy read-only access to files from Commons and local files on enWS (and possibly the other WSes if anyone should want it). A virtual filesystem or something, preferably more performant than having to download a copy of the file over HTTP. If any writing to a wiki is eventually needed that'd go through the API, most likely using OAuth, so read-only would be fine. Not a requirement by any means, but it'd make the DjVu manipulation stuff much more elegant and efficient if I ever get around to it.
- Network access to Commons and enWS for downloading files, and for talking to the API if it becomes relevant.
- General internet access to download stuff from IA, Hathi, etc. if that becomes relevant.
- No immediate need for access to database dumps or similar: everything I have in mind is just crunching File: files.
- No immediate need for database facilities: I might eventually need something to track batch jobs or whatever, but I'd get by with file-based solutions for a good long while before that became an issue.
- If there is a batch system with oodles of compute resources that responds to "run this code on that data over there and notify me with the results once you're done" that'd be neat, but I'm not entirely sure that'd be more performant than doing it on whatever is hosting me directly. Possibly it'd enable parallel execution of large batch OCR/DjVu jobs if the tool is used a lot, that would otherwise need to be serialised, but I'm not sure the volume would be there to make that worth the effort.
- If anyone actually started using a utility here I would want to add extra maintainers with OS-level access.
- For source control and such I'd probably use Github (Gerrit looks completely impenetrable to me), and issue tracking either there or Phabricator, so no special needs related to that.
- That's a rough braindump of the requirements. Given your knowledge of the facilities, what should I be looking at for hosting? By my guesses here, Toolforge might be both too constricting for the needs, and at the same time its advantages aren't all that relevant (I think I've grasped that Toolforge has DB dumps and similar already available, but as I don't need those…). If my vague understanding is anywhere near right, Toolforge is essentially a shared web host with some Wikimedia-specific facilities while Cloud VPS is just server hosting (that happens to be para-virtualised rather than bare-metal); but from there to the details that'd let me assess them against the requirements is a bit steep a climb right now. Anything I haven't thought of? Am I completely off my rocker? Should I go away and stop bothering you? :) Any help and hand-holding would be very much appreciated! --Xover (talk) 08:28, 18 September 2019 (UTC)
- @Xover: Yep, your understanding of the shared-hosting or VPS is right. You could certainly do all you need on a VPS, but I think it sounds like you'll be fine with a Toolforge tool (although I'm not 100% certain off the top of my head of version specifics etc.). I recommend creating a tool account and seeing how you go. Lots of people use Github, so no worries there. Probably the most confusing thing for new toolforge users is the cronjob setup: basically, your cronjobs don't actually run things themselves, they just add a job to the 'grid', where it runs. In practice this just means a command has to use jsub. Anyway, I recommend a) creating a tool, b) trying to run what you need, and c) asking me when you hit an issue. Sam Wilson 23:16, 25 September 2019 (UTC)
- Well, apparently Toolforge is a no-go. Is there any point requesting a dedicated VPS so I could do that stuff myself? I have no idea what the criteria are for getting one or whether the stuff I have in mind is even remotely what the CVPS infrastructure is intended for. Any suggestions or pointers would be much appreciated! --Xover (talk) 18:32, 6 January 2020 (UTC)
- @Xover: Yes, I think if you've demonstrated that your requirements aren't met by Toolforge, then you should be able to request a VPS. Then you'll be able to install whatever you need. (Not that I'm completely familiar with the whole process, but that's my understanding.) Sam Wilson 00:03, 7 January 2020 (UTC)
Little problem on nl-ws
Hi Sam,
may I ask you to take a look at some strange problem, that happens on nl-ws?
It does happen only in one book, for instance at this page: s:nl:Pagina:Heemskerck op Nova Zembla.djvu/101. As you can see the header does not outline correctly. I have been trying all kinds of things. Nothing seems to help. It only happens in this book. In all other books on nl-wiki where we use the RH-template, it works fine, see e.g. s:nl:Pagina:De voeding der planten (1886).djvu/46. Can you explain why this happens to this book?
Many greetings, and looking forward to your answer, --Dick Bos (talk) 10:47, 2 October 2019 (UTC)
- @Dick Bos: It looks to me like there are odd block-level elements being introduced where they shouldn't be. For example, the following part of s:nl:Sjabloon:RunningHeader (and the same applies to the right side):
Currently:
-->|{{#if:{{{1|{{{left|}}}}}} | <span style="float: left; display: block;"> {{{1|{{{left}}}}}} </span>}}<!--
Should be:
-->|{{#if:{{{1|{{{left|}}}}}} |<!-- --><span style="float: left; display: block;"><!-- -->{{{1|{{{left}}}}}}<!-- --></span>}}<!--
- And also that there isn't a default value for the centre component (in {{RunningHeader}} here, it's a ).
- Actually, that template could be rewritten with block-level components and using flexbox, but that's another story! :-)
- Sam Wilson 09:48, 5 October 2019 (UTC)
- You were right! I usually copy this kind of templates from en-ws (I really don't understand a word of the code, to be honest), and apparently there had been an update of the template. Now that I copied the newest version to nl-ws, it is running perfectly! Hurray.....
- We need someone with some technical knowledge to do this kind of maintenance work on the Dutch Wikisource! But unfortunately, activity on nl-ws is very low. Thanks for helping us! --Dick Bos (talk) 16:29, 7 October 2019 (UTC)
- @Dick Bos: Oh good, I'm glad it works. There isn't really a good system yet for keeping imported templates up to date. One possible way could be that we add some of this functionality to the new Wikisource extension, because then it'd be on all Wikisources. That's going to take some more work though. For now, it's export/import and keep an eye on things. Sam Wilson 03:55, 8 October 2019 (UTC)
- @Dick Bos: Can I recommend that you special:import templates (select "en" from dropdown) rather than copy and paste. 1) it brings a history and can actually bring other required components; 2) it allows, in future times, others to track and find what you were doing and reproduce at nl:special:log/import. — billinghurst sDrewth 23:33, 6 January 2020 (UTC)
Plain sister updates
Hi SW. [Happy early cricket season, hope your weather is better than mine at the moment] I am wondering whether you had been able to look at my thoughts on template talk:plain sister for an update to Module:Plain sister to automatically link articles to enWP biographies. I know that I lack the skills to make those changes, and wondered whether you had the skills for such a change, or whether we need to go searching outside. — billinghurst sDrewth 10:09, 9 November 2019 (UTC)
- @Billinghurst: I've replied over there. I've resurrected some work I did last year on that, and it's now functioning. See what you think. I'm happy to make the changes and monitor things closely. Sam Wilson 21:28, 10 November 2019 (UTC)
validated index count discrepancy
Hi. Hope all is well out west.
- Your tool says "This page presents the categorisation of the 3172 works on the EN Wikisource"
- Category:Index Validated says "... pages are in this category, out of 3,442 total"
Which is correct? What sort of discrepancies would we need to identify to resolve the 270 gap? I cannot work out what to do with a json list (my uselessness) to make any comparisons within AWB or petscan:.
Noting that when I compare Category:Index Validated with Category:Indexes validated by date (3443) there are some discrepancies to resolve between those two, so the numbers above will probably have bumped around a little by the time you see this. — billinghurst sDrewth 02:51, 6 January 2020 (UTC)
- @Billinghurst: Hello! That's interesting. :( My first thought is that the missing ones are not categorized (i.e. their index pages are in Category:Index Validated but their mainspace pages are not categorized). It could also be that the tool can't figure out what their mainspace pages are (it looks for links to a top-level mainspace page from the Index page, with a query a bit like this one).
—Sam Wilson 03:10, 6 January 2020 (UTC)
- Thanks. For the incompetent, would you be able to generate a wikipage or a petscan query (preferred as regeneratable), and I will take it from there to determine the issue. Some of this list will then be works not transcluded, and I will explore the remainder. I know that there are plenty without title links, it is one of Esme's traits. We should document that we wish for titles to be linked, as I don't always do it for other's works. — billinghurst sDrewth 03:30, 6 January 2020 (UTC)
- @Billinghurst: I've been trying, but have not yet figured out a simple way. Will keep looking at it! The ws-cat-browser is also due for a rejigging I think, because we can now determine validated mainspace pages via Category:Validated texts, so it no longer really needs to go via the index page at all. Although, maybe it's good to keep it as-is, for helping to find discrepancies like this. Sam Wilson 00:45, 7 January 2020 (UTC)
- A SPARQL query in Petscan doesn't function?
I would disagree that "category:validated texts" is a reasonable match. That category does not have a one-to-one relationship with Index: ns—DNB to works, of volumes of DNB to works are one to many. Plus, it is grossly underpopulated. There is no easy means to populate from this side the work side or the index side; even then the root page and the index: page are usually not one-to-one either.
Last time that I asked about flag addition I was told that there was no ready means to bot populate the flag via the available tools. blah blah blah blah... <sigh> — billinghurst sDrewth 04:30, 7 January 2020 (UTC)
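For the count comparison discussed in this thread, here is a minimal sketch in Python. It assumes the tool's JSON list has been saved locally as a flat array of index titles and that the category members have been exported as plain text, one title per line. Both file layouts are assumptions (the tool's real JSON shape is not shown in this thread), and all function names are hypothetical.

```python
import json


def normalise(title: str) -> str:
    """Treat underscores and spaces alike and trim surrounding whitespace."""
    return title.replace("_", " ").strip()


def compare_titles(tool_titles, category_titles):
    """Return (only_in_category, only_in_tool) as sorted lists of titles."""
    tool = {normalise(t) for t in tool_titles if t.strip()}
    cat = {normalise(t) for t in category_titles if t.strip()}
    return sorted(cat - tool), sorted(tool - cat)


def load_json_titles(path):
    # Assumption: the file is a flat JSON array of title strings.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_text_titles(path):
    # Assumption: one title per line, e.g. a plain-text category export.
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()
```

The first list returned would be the candidates for the gap: index pages in the category that the tool did not pick up.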
Index:East Anglia in the twentieth century.djvu extra image page, now text off by one
Hi SW. IA-upload bot generated the above work, and it seems to have inserted that random image page as the lead, the work shows the image on page 2 at IA. Now the text and scans are out by one. What is the best way to address/resolve? — billinghurst sDrewth 01:42, 20 January 2020 (UTC)
- @Billinghurst: This seems to be some discrepancy with the book viewer at IA, because the imagecount attribute in the work's metadata says 672, but the book viewer is only showing 670 (there are non-book scans at front and back). I can't find anything in the metadata that explains how book viewer is making this decision; I guess it's in there somewhere, and ia-upload could use the same logic to exclude these pages. I don't have time right now to dig into it though. :( It looks like there's only a few pages proofread so far, so I guess it's a matter of resolving it manually. Sam Wilson 03:12, 20 January 2020 (UTC)
- okay. FWIW The display for IA-upload bot, just showed the second scan page as the first page as the page to exclude. — billinghurst sDrewth 03:33, 20 January 2020 (UTC)
- @Billinghurst: oh, hmm yeah that's annoying. It's because it uses the bookreader thumbnails and numbering system to get that image. :( I'll open an issue. Sam Wilson 03:40, 20 January 2020 (UTC)
- Okay, what is your trick to get around 100MB? The PDF -> DJVU conversion pushed it over the upper size, though you clearly have a sneak means through. If you can apply the corrected file it is at toollabs:wikisource-bot/East Anglia in the twentieth century.djvu. — billinghurst sDrewth 12:10, 22 January 2020 (UTC)
- @Billinghurst: Done. It's the chunked upload protocol—which is what the UploadWizard uses behind the scenes—which allows up to 2GB per upload. Last safety valve is server-side upload which can be performed by some WMF staff (sysadmin, not dev, iirc) and bypasses all size restrictions, but that's most suitable for things like massive donations from some archive or library and not individual files. --Xover (talk) 12:29, 22 January 2020 (UTC)
- Culled the file, two times misaligned by different processes. Will await other fixes, it isn't an urgent work. — billinghurst sDrewth 07:36, 23 January 2020 (UTC)
- @Billinghurst: Want me to regenerate the file from the source scans? --Xover (talk) 08:13, 23 January 2020 (UTC)
- I don't mind either way, it was poked up for Charles, and it is not urgent. It can wait until there is a fix, or the need for one to demonstrate a fix. — billinghurst sDrewth 10:43, 23 January 2020 (UTC)
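As a rough illustration of the chunked upload idea mentioned above, here is a sketch in Python that only plans the chunks and builds the request parameters; it performs no network calls. The parameter names (`action=upload`, `stash`, `filename`, `filesize`, `offset`, `filekey`) are those of the MediaWiki upload API, but the helper names and the chunk size used in the example are hypothetical, and the CSRF token and the chunk data itself are deliberately left out.

```python
def plan_chunks(file_size: int, chunk_size: int):
    """Yield (offset, length) pairs that together cover file_size bytes."""
    if file_size < 0 or chunk_size <= 0:
        raise ValueError("file_size must be >= 0 and chunk_size must be > 0")
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        yield offset, length
        offset += length


def chunk_params(filename: str, file_size: int, offset: int, filekey=None):
    """Build the non-file parameters for one stashed chunk request.

    The first chunk has no filekey; later chunks pass the filekey that
    the server returned for the previous chunk.
    """
    params = {
        "action": "upload",
        "format": "json",
        "stash": 1,
        "filename": filename,
        "filesize": file_size,
        "offset": offset,
    }
    if filekey is not None:
        params["filekey"] = filekey
    return params
```

For example, `list(plan_chunks(250, 100))` gives three chunks starting at offsets 0, 100 and 200, the last one 50 bytes long. A real client would send each chunk in turn and then publish the stashed file in a final request.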
work written 1803, born 1824. Needs a fix, or we have a miracle! — billinghurst sDrewth 03:50, 17 April 2020 (UTC)
- @Billinghurst: Ha! Yes, oops. I've fixed it to be (as @Annalang13 correctly had it) his death date. Also added his birth date based on the fact that he turned 41 while down the hole. Sam Wilson 04:05, 17 April 2020 (UTC)
Using Google OCR for old English text
Hi. I'm running a project to upload 3,000 chapbooks from the National Library of Scotland's digitised collections and we're interested in using the Google OCR function instead of Tesseract because it identifies the long s letter (ſ) really well. I noticed you've been quite heavily involved in the discussion around Google OCR - even though it's discouraged to use Google OCR with English, do you think this would be an acceptable use? https://en.wikisource.org/wiki/Wikisource:WikiProject_NLS Gweduni (talk) 14:51, 4 May 2020 (UTC)
- @Gweduni: I think it definitely would be okay. The only reason it's at all discouraged (and it should be a less strong word, I think) is that we have a quota with Google. However, the quota is always renewed, and Google are (I think!) very happy for us to use their Cloud Vision API. We're also going to be doing some improvements soon (although I guess still a couple of months away) that will hopefully increase the quality of the text returned (Google gives us lots more structure about the OCR text than we're currently using, so we could do things with e.g. automatically improving punctuation, or even adding wiki templates where they're unambiguous). For updates, follow the phabricator:tag/wikisource_ocr project. Sam Wilson 22:32, 4 May 2020 (UTC)
- @Samwilson: Great, that's really good news. We'll move over to using Google OCR on our project from now on, and I'll have a look at the wikisource ocr project you mentioned. Excited to hear new developments are on the way! Gweduni (talk) 12:12, 5 May 2020 (UTC)
Long S
Hello,
I saw that you were involved in some of the "long s" discussion many years back. I've been trying to find good info here on how to approach long s in proofreading, but haven't found a clear guideline one way or the other. When I tried the template someone had created, it didn't seem to work, and also seems rather tedious unless I'm missing something (likely). Thanks for any clarification or advice you might have! Grillo7 (talk) 16:46, 3 June 2020 (UTC)
- @Grillo7: I think the basic guidance is not to use them at all, but of course if you're consistent within a work then it's fine. I used the {{ls}} template, but I think that's now set to only display the long S in the Page namespace (and a normal S in the mainspace). If you definitely want a long S in every situation, then you can just use the ſ character (probably copying and pasting it is the easiest way, or remembering its key shortcut). —Sam Wilson 23:10, 3 June 2020 (UTC)
Template:Person
Hi,
Please could you bring {{Person}} more into line with {{Author}}, in particular with regard to pulling in data (and images) from Wikidata? No doubt they can use the same Lua module. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:01, 19 October 2020 (UTC)
ia-upload front page trim — file path only error?
Is there an issue with getting the paths fixed up in ia-upload? I added a phab request a while ago as the trim first page functionality dies
An error occurred: Command not found: "djvm -d "/data/project/ia-upload/ia-upload/jobqueue/menkentandkenti00hutcgoog/menkentandkenti00hutcgoog.djvu" 1 2>&1"
and as it has worked previously, I am presuming that it is a path issue that can be resolved by declaring a path or putting in a symbolic redirect. It is only you and Tpt that have access and are still around. — billinghurst sDrewth 08:45, 4 November 2020 (UTC)
- Oh, you are hardly around. :-( — billinghurst sDrewth 08:51, 4 November 2020 (UTC)
- @Billinghurst: No no I'm here! :) Just not editing much at the moment. Getting stuck into wsexport stuff right now. I'll have a look at ia-upload and see if it's anything obvious. Sam Wilson 00:47, 5 November 2020 (UTC)
OCR improvement project..
Great to hear about this, and I had some suggestions for areas to look at.
Whilst OCRing some pre-19th-century works I was encountering long s (which was getting recognised as f or l), and old style ligatures like ct (recognised as d). Google's OCR seems to be better able to recognise this at present, which is a shame as Wikimedia native tools should be better able to cope with older items given that the scanned works being transcribed might be as well.
The other suggestion was concerning 'multi column' based text, and sidetitles/margin notes. On something like The Statutes at Large example page: Page:Pickering - The Statutes at Large - Vol 40, Part 1 (1795, 35 George III).pdf/42. There are margin notes and sidetitles which are read as part of the main run of text.
Ideally they should be more effectively partitioned, so that a transcriber doesn't have to descramble them when setting up the appropriate formatting. (Sidenotes/Sidetitle support is still an unsettled issue on English Wikisource, however.) ShakespeareFan00 (talk) 18:45, 4 December 2020 (UTC)
- Yes, I've been wondering about this sort of thing. Someone mentioned the possibility of having a sort of secondary system where you could select a rectangle in the image and hit an OCR button to get just that bit; that'd work as a last resort. I'm not sure what we can do for sidenotes, but it'll be investigated for sure. The ligatures and long s I guess we just have to live with, because this project (I think) isn't going to tackle OCR training data just yet. Still, easier access to different engines will be better than nothing, and the next step will be further tweaking of the in-house stuff I hope. :-) —Sam Wilson 11:05, 5 December 2020 (UTC)
Would you please consider updating Template:Person to leverage Module:Author for image and years of life from WD? Thanks. — billinghurst sDrewth 23:33, 5 December 2020 (UTC)
- @Billinghurst: I'll see what I can do. :) —Sam Wilson 06:34, 7 December 2020 (UTC)
- @Billinghurst: Okay, it's working now. Slightly hackish, but functional. See what you think. I've changed a few people portals as a test. —Sam Wilson 06:49, 7 December 2020 (UTC)
- er... #Template:Person? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:01, 7 December 2020 (UTC)
- @Pigsonthewing: Sorry Andy! I totally missed your message before. Also, I realise I've missed the image part of it; will sort that out. :) —Sam Wilson 22:37, 7 December 2020 (UTC)
Comment I have inserted image logic from {{author}} (well most of it) with adaptation for categories. Need to think about whether that image pull may be better in parent {{portal header}}. There is resulting tidying and documentation to be done, though I want to do some wider investigation prior to calling it a success. Probably also want to consider whether the image recovery logic may be better off in its own module, and/or better pulled from Module:WikidataIB. — billinghurst sDrewth 22:50, 11 December 2020 (UTC)
- @Billinghurst: Looks good! —Sam Wilson 09:15, 15 December 2020 (UTC)
indents? is there way?
https://en.wikisource.org/w/index.php?title=Page:Wongan_Way_by_Lilian_Wooster_Greaves,_1927.pdf/12&action=edit Jarrah Tree (talk) 04:21, 15 January 2021 (UTC)
- @JarrahTree: Do you mean the paragraph indents? There's no need; we leave them flush left. See Wikisource:Style guide#Formatting. Indents in the poetry are another matter, and should be faithfully reproduced with a leading colon (:). —Sam Wilson 04:25, 15 January 2021 (UTC)
- thanks for that... Jarrah Tree (talk) 04:40, 15 January 2021 (UTC)
Thanks
Thank you very much for your work on enhancing the WS-Export facilities in the Wikisource-software. I don't understand a word of all the technical things involved in it, but I found out that it's readily working on Dutch wikisource. I made a small announcement in the local "pub". I mentioned your name (to thank you). Greetings, --Dick Bos (talk) 16:47, 15 January 2021 (UTC)
- @Dick Bos: Oh thank you! That's very kind of you. :-) I hope we can carry on improving WS Export in the coming weeks! Do let us know of anything that could be improved. —Sam Wilson 10:38, 16 January 2021 (UTC)
phab:T230415 (OCR text layer paragraphs)
Hi! Sorry to poke, but could we get the revised patch for phab:T230415 reviewed at some point? I managed to actually set up an MW+PP environment myself to check and it appears to work. (Now if only PDF could do that in light of the million+ files from the IA, but I don't think the data is there). I don't really know who else to ping, but this is a daily pain point for WS and has been for a decade, so it'd be nice to get it fixed. Inductiveload—talk/contribs 18:53, 25 January 2021 (UTC)
- @Inductiveload: Sorry, I had a quick glance at this but haven't had time to delve any deeper. First thing I wondered is that it looks like it's the sort of thing that should have some tests to go with it. Not sure that's a blocker though. — Sam Wilson 23:25, 28 January 2021 (UTC)
- Good point: I have figured out the tests and updated the tested DjVu's text-layer with a paragraph and a column. Inductiveload—talk/contribs 10:55, 16 February 2021 (UTC)
translatewiki
I don't know any other way to contact you and you have stated that you are more active here, so I'd just like to alert you that Phabricator is misspelled on https://translatewiki.net/wiki/Wikimedia:Wsexport-issues/en --Sabelöga (talk) 16:28, 11 February 2021 (UTC)
- @Sabelöga: Oh, thanks! I've made a fix: https://github.com/wikimedia/ws-export/pull/329 — Sam Wilson 03:09, 12 February 2021 (UTC)
- Excellent! :) --Sabelöga (talk) 14:45, 12 February 2021 (UTC)
hmm
gap/brk seems an inneresting development... Jarrah Tree (talk) 11:57, 13 February 2021 (UTC)
- have found items in newspapers - adding links to wp en article Jarrah Tree (talk) 12:28, 13 February 2021 (UTC)
In our print copies, I would think that we would not want to export this template; it won't work in offline mode, so do you want to add the #noprint stuff? I am also wondering about how we handle it in mobile productions, as it is going to engulf the top of the work, and wondering whether we exclude it or should do something else with it in that mode. — billinghurst sDrewth 21:11, 23 February 2021 (UTC)
- @Billinghurst: I've added .ws-noexport to it, to avoid exporting it. I'm not sure about mobile; it needs some more thought I think. It seems there are issues with lots of inputboxes on mobile, that could perhaps be dealt with together. —Sam Wilson 10:47, 24 February 2021 (UTC)
IA-upload kills me with its additional pages from the jp2 zip/tar
Internet Archive identifier: b29008700 and these 00 pages.
Are we able to do anything to get the 0000 pages ignored so that we are not getting these pages included and then having the text offset? Working with PDFs is f'd as the text for me is always needing manipulating for whatever weird artefacts the text layers have. — billinghurst sDrewth 05:25, 6 April 2021 (UTC)
- FYI, this is (probably) the same as phab:T268246. Inductiveload—talk/contribs 06:33, 6 April 2021 (UTC)
- Yep, and #Index:East Anglia in the twentieth century.djvu extra image page, now text off by one. Not new, just killing me. — billinghurst sDrewth 07:04, 6 April 2021 (UTC)
WSexport
Hi (and @Inductiveload). Dictionary of Indian Biography has been set up with the lists of names on the subpages, and typically these subpages don't print out. Wrapping a <div class="ws-summary">...</div> around the list (Dictionary of Indian Biography/A) manages to get an export, compared to (Dictionary of Indian Biography/B). So looking at what may be the more reliable solution.
Putting all the names on the front page will be ugly. The base user isn't going to know to add the wrapper. We could build a faux list template for lists of names to wrap around (and it has to wrap), or we can just enforce the use of auxToC (quick special:diff/11237197), and I am not certain about a big green monster, though can cope.
Any other ideas? — billinghurst sDrewth 10:39, 28 April 2021 (UTC)
- Don't fuss it, I have just done it to the pages. — billinghurst sDrewth 13:39, 3 May 2021 (UTC)
- @Billinghurst: Yeah, sorry for not replying! I have actually been thinking about it. I do still want to look into adding the option to traverse all linked subpages, at whatever level, but for now it does seem the best solution is ws-summary everywhere (T253282). —Sam Wilson 22:55, 3 May 2021 (UTC)
- Not an issue, I just had plentiful of browser tabs still open as my memo for action. I was pushing sanity levels so moved to action as there was obviously no overt easy solution. — billinghurst sDrewth 03:18, 4 May 2021 (UTC)
Thought there was a fix to the false lead page issue from IA
File:Carwell.djvu has that ugly and disruptive faux lead ruler page
https://ia-upload.wmcloud.org/log/carwellorcrimeso00sher
[2021-09-18T01:17:54.615414+00:00] LOG.DEBUG: Converting /var/www/tool/jobqueue/carwellorcrimeso00sher/carwellorcrimeso00sher_jp2/carwellorcrimeso00sher_0000.jp2... [] []
[2021-09-18T01:17:54.615463+00:00] LOG.DEBUG: ...to /var/www/tool/jobqueue/carwellorcrimeso00sher/build/carwellorcrimeso00sher_p0.jpg [] []
[2021-09-18T01:17:57.486510+00:00] LOG.DEBUG: ...to /var/www/tool/jobqueue/carwellorcrimeso00sher/build/carwellorcrimeso00sher_p0.djvu [] []
[2021-09-18T01:17:57.664049+00:00] LOG.DEBUG: Converting /var/www/tool/jobqueue/carwellorcrimeso00sher/carwellorcrimeso00sher_jp2/carwellorcrimeso00sher_0001.jp2... [] []
If this is a regression, can we reapply the fix? If I am confused and it wasn't fixed, can we look for a means to resolve this issue, as it is problematic where we want to work on djvus rather than rubbishy pdfs. Thanks. — billinghurst sDrewth 10:24, 18 September 2021 (UTC)
- @Billinghurst No, I can't find anything about fixing that bug, sorry. It looks like T268246 IA Uploader fails to recognize the first page of a book is it, and T243163 'First page' thumbnail isn't always of the first page is possibly a duplicate. It looks like @Inductiveload has identified the root problem. I'm not sure I've got time at the moment to look at it, but I'll try! Sam Wilson 01:24, 20 September 2021 (UTC)
- Okay, thanks. — billinghurst sDrewth 11:05, 20 September 2021 (UTC)
Ya Git(hub admin)!
Hey,
Ankry and I have been doing some various maintenance on phetools lately (Ankry, in particular, just converted the proofreading graphs to look at proofreading status in page_props instead of catlinks!), and are at a point where we figured it was time to get things into VCS (since only Phe has access to the original repo). I started setting up an org. on Github for this, and any other repos we may need, and noticed you'd already set up a Wikisource org. Is this for WMF/CommTech stuff only, or could we reuse it for things like phetools too? I've also got that little toy OCR thing on WMCS that really should have source available somewhere (on general principle; it's still just a personal toy). I'm sure there are others floating around that it would make sense to gather. Xover (talk) 13:48, 30 October 2021 (UTC)
- No, it's not CommTech specific at all — it's for all Wikisourcerors! :) What are your usernames? I'll add you both. —Sam Wilson 14:30, 30 October 2021 (UTC)
- @Xover (sorry, forgot to ping). Also, there's now https://gitlab.wikimedia.org/ which I think at some soon point will be a better place for all this stuff, if it's not already. —Sam Wilson 14:32, 30 October 2021 (UTC)
- Bleh. I seem to always end up with lousy timing. From what reading I've done so far it's just too early to jump on the Gitlab. I'm going to dig a bit more—and maybe bug Bryan (for the Toolforge side) and Tyler (for the Gitlab side) about it—first, but based on first impressions we may have to set up on Github and then migrate later. Xover (talk) 20:40, 31 October 2021 (UTC)
- Yes, sounds sensible. I am looking forward to having everything on Gitlab! In a year or three. :-) What's your Github username? I'll add you to the Wikisource org. Sam Wilson 00:37, 1 November 2021 (UTC)
How we will see unregistered users
Hi!
You get this message because you are an admin on a Wikimedia wiki.
When someone edits a Wikimedia wiki without being logged in today, we show their IP address. As you may already know, we will not be able to do this in the future. This is a decision by the Wikimedia Foundation Legal department, because norms and regulations for privacy online have changed.
Instead of the IP we will show a masked identity. You as an admin will still be able to access the IP. There will also be a new user right for those who need to see the full IPs of unregistered users to fight vandalism, harassment and spam without being admins. Patrollers will also see part of the IP even without this user right. We are also working on better tools to help.
If you have not seen it before, you can read more on Meta. If you want to make sure you don’t miss technical changes on the Wikimedia wikis, you can subscribe to the weekly technical newsletter.
We have two suggested ways this identity could work. We would appreciate your feedback on which way you think would work best for you and your wiki, now and in the future. You can let us know on the talk page. You can write in your language. The suggestions were posted in October and we will decide after 17 January.
Thank you. /Johan (WMF)
18:14, 4 January 2022 (UTC)
Hello! Genealogy
Hi @Samwilson - just wanted to talk about the wiki genealogy project that has some interesting outlines on WikiMedia. I am not sure if you have created it/ involved still, but would be great to help start creating more detailed space for collaborative creation of general policy/ a specific proposal.
If you are involved, would be great to hear more from you. If not, will try elsewhere. Thanks! Jamzze (talk) 19:09, 2 August 2022 (UTC)
- @Jamzze: Yes, I'm definitely interested! I'll reply to your post on meta:Talk:Wikimedia genealogy project. :-) Sam Wilson 06:34, 3 August 2022 (UTC)
Finding non-scan-backed pages
You wouldn't happen to have any tips on how to identify (list) non-scan-backed mainspace pages? I can't think of an on-wiki way to do it, so I'm guessing it'd need to query the database... for a page property associated with Proofread Page maybe? Xover (talk) 06:43, 19 April 2023 (UTC)
- @Xover: Does Special:PagesWithoutScans work for your needs? It could do with some means of filtering, perhaps, or of exporting the whole list. I think it's built from a query of all mainspace pages that do not link (via the templatelinks table) to a Page NS page; I'm sure it could also be done in Quarry. Sam Wilson 07:22, 19 April 2023 (UTC)
- No, I'm looking for something where I can either exclude subpages or group and count them, etc. We have somewhere north of 200k non-scan-backed mainspace pages (roughly a third of our total ns:0 pages) so I'm inching towards some way to 1) analyse where and what they are, and 2) find some way to turn it into a structured maintenance backlog that we can start working on. Xover (talk) 10:49, 19 April 2023 (UTC)
- @Xover: That does sound like a good idea. Sam Wilson 02:40, 20 April 2023 (UTC)
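The "exclude subpages or group and count them" step could be sketched as below, in Python. This assumes the list of non-scan-backed mainspace titles has already been obtained somehow (for example from a database query) and only shows the grouping; the function names are hypothetical. It treats everything before the first slash as the work's root page, which will misgroup the occasional title that legitimately contains a slash.

```python
from collections import Counter


def root_title(title: str) -> str:
    """Return the part of a page title before the first slash."""
    return title.split("/", 1)[0]


def count_by_work(titles):
    """Count pages (root page plus subpages) per root title."""
    return Counter(root_title(t) for t in titles)


def roots_only(titles):
    """Drop subpages, keeping only titles that have no slash."""
    return [t for t in titles if "/" not in t]
```

Sorting the counter with `most_common()` would put the works with the most unbacked subpages at the top, which is one possible way to order a maintenance backlog.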
Possible ePub problem in Apple Books
I just downloaded ePubs of Thuvia, Maid of Mars and The Gentle Grafter, and in both of them Apple Books (latest-ish on both macOS and iOS) fails to display any of the images except the Wikisource scribe. All the images are present inside the ePub container, but the content images are wrapped in <figure>…</figure> whereas the Wikisource scribe in title.xhtml is a bare <img alt="" src="images/Accueil_scribe.png" />. (And, yes, removing the figure wrapper with a text editor fixes it, just to be clear.)
Any chance you could test this in your e-reader to see if this is just Apple Books being dumb?
I also imagine images have worked fine in ePub before (in Apple Books), so I'm wondering if either MW's HTML output, or whatever engine ws-export is using, have changed recently. You wouldn't happen to know of any such changes just off the top of your head? Xover (talk) 19:39, 1 September 2023 (UTC)
- @Xover: I'm not sure what the issue could be. Those epubs work fine for me, and I don't think anything has changed recently. The <figure>...</figure> has been there for ages, since we switched to Parsoid HTML (we have a test for it). The books validate fine with epubcheck (with some unrelated errors). Is there any sort of validation tool for Apple Books? We could add it to the CI for ws-export. Sam Wilson 03:58, 5 September 2023 (UTC)
- Darn, that's what I was afraid of. There's jack all technical info on Apple Books that is easily googleable (after the DOJ in their wisdom granted Amazon a monopsony, ebooks lost strategic focus at Apple and it shows). I'll dig around and see if I can come up with something. Xover (talk) 06:21, 5 September 2023 (UTC)
- Oh, hmm, it strikes me that someone at the WMF most likely has a contact at Apple since they show data from Wikipedia in Siri responses (and they might be interested in the Enterprise stuff). It might be worthwhile to ask around internally so that if we can't find anything we can ask them to take a look from their end. Xover (talk) 06:25, 5 September 2023 (UTC)
- @Xover: I'll see if I can find anyone. Is there a Phab task for this yet? Sam Wilson 01:51, 6 September 2023 (UTC)
- It's not the <figure>…</figure> element per se, it's the .mw-halign-center class it has which applies display: table; to it. If I manually change that in the included main.css to display: block; the image I tested displays fine in Apple Books. Xover (talk) 07:22, 6 September 2023 (UTC)
- Phab filed: T345723. Xover (talk) 11:13, 6 September 2023 (UTC)
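For anyone wanting to reproduce the manual workaround described here on a local copy of an ePub's main.css, this is a small sketch in Python. It is only a local test aid, not the fix tracked in the Phab task, and it assumes the rule is an ordinary flat CSS rule whose selector mentions .mw-halign-center; the actual rule text in main.css is not quoted in this thread, and the function name is hypothetical.

```python
import re

# Matches a flat CSS rule whose selector contains .mw-halign-center.
_RULE = re.compile(r"([^{}]*\.mw-halign-center[^{}]*)\{([^{}]*)\}")


def patch_halign_center(css: str) -> str:
    """Change 'display: table;' to 'display: block;' in matching rules only."""

    def fix(match):
        selector, body = match.group(1), match.group(2)
        body = re.sub(r"display\s*:\s*table\s*;", "display: block;", body)
        return selector + "{" + body + "}"

    return _RULE.sub(fix, css)
```

Rules for other selectors are left untouched, so the change stays limited to the class identified above.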
Google OCR gateway
cf. WS:S#Google OCR. Are you aware of any current problems with ocr.wmcloud.org, Google's upstream service, or their access to Commons (rate limiting, Commons being its usual charmingly slow self, etc.)? I haven't seen any obviously relevant tickets flow by, but I'm not sure I watch all relevant tags to pick up any filed here. I seem to recall that this was an acute issue at some point previous, and has cropped up intermittently since, but what Jan describes sounds fairly more persistent. Xover (talk) 06:55, 21 November 2023 (UTC)
- @Xover: I've been trying to figure it out, but it's not simple. The Google response doesn't seem to have much of a useful error code, so it's probably going to be a matter of looking at the actual (English) error message returned, and retrying by sending the image rather than the URL. Sam Wilson 06:49, 23 November 2023 (UTC)
- Are we sure this isn't Commons being flakey? Fetching thumbnails of multipage media files is really pretty atrociously bad, and after they stopped writing thumbnail cache across datacenters the performance will vary dramatically depending on whether you hit esams or codfw. See e.g. the generational saga at T328872. Auto-retrying with a file upload seems pretty hacky to me. Xover (talk) 09:27, 23 November 2023 (UTC)
- @Xover: Yeah I know. :( But maybe it's worth it? I'm not sure. Certainly, hitting a different data centre would explain why Google doesn't find the thumbnail that is already visible to the user (i.e. shouldn't need to be generated). (Not sure why I say 'certainly'; I don't feel like any of this is very certain!) [IP redacted] 10:18, 23 November 2023 (UTC)
- @Xover: A few more people are saying that things are pretty annoying with the "The URL does not appear to be accessible by us. Please double check or download the content and pass it in" error, and I'm wondering if we should make this change. It's hacky, but might tide us over for a while. What do you think? Sam Wilson 06:56, 30 November 2023 (UTC)
- Your call is going to be better than mine on that. I don't have the knowledge and information you do. But what I'm thinking is that we won't be able to actually fix this without specific info on where the failure is and what it is. If we hack around it we're just hiding the problem and making it harder to properly fix. Have you tried looking for the failures in WMF-side logs? We know (or can get) approximate time and filename of at least some of the failures, so it should be possible to go spelunking for related log entries generated by Google's request. If we see the requests it's most likely on our side; if we don't see them it's most likely on Google's side. No? Xover (talk) 07:15, 30 November 2023 (UTC)
- @Xover: True. But we're not making much progress at the moment anyway. And we'll still have Google try the first time (because it usually works), so the bad responses will still happen. Sam Wilson 08:03, 30 November 2023 (UTC)
- Oh, and I just realised that this discussion was mostly already had in 2021! Oops. https://phabricator.wikimedia.org/T296912#7584882 Sam Wilson 08:04, 30 November 2023 (UTC)
- As I said, your instincts are going to be better than mine on this. But if we're going to implement file upload, wouldn't it be better to just switch to that as the primary method? Trying URL, then attempting to detect all failure modes in order to retry as upload, sounds needlessly complicated. If we do it by upload we also have control of all the moving parts and can do more deterministic error detection. It's a lot less efficient (pass-by-reference vs. pass-by-value), roundtripping a big hunk of binary data through the user's web browser and traversing their internet link twice, but it seems like it would at least be simpler and more robust. When the pass-by-reference API is flakey that might make this tradeoff tilt in the other direction? Xover (talk) 08:30, 30 November 2023 (UTC)
- @Xover: Oh, nope, it all happens server-side: so we first send the URL, and then some small number of times Google sends us an error saying "can't see it", and then we send the image data. The two requests happen within a single request from the user's point of view. It's only that I'm hoping that the first request often succeeds, that I'm suggesting doing it this way. The image does need to be downloaded to the OCR server before being sent to Google; that's the slower bit (and which could conceivably also hit thumbor errors). Sam Wilson 08:44, 30 November 2023 (UTC)
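The server-side flow described above (send the URL first, and only on the "can't see it" error download the image and send its data) might be sketched roughly as follows. This is an illustration under assumptions, not the OCR tool's actual code: OcrError, ocr_with_fallback and the three callables are hypothetical names, and matching on the English error text is assumed only because the response reportedly carries no useful error code.

```python
from typing import Callable

# Error text quoted in the discussion; matching on it is an assumption.
URL_INACCESSIBLE_HINT = "The URL does not appear to be accessible by us"


class OcrError(Exception):
    """Hypothetical error type carrying the upstream error message."""


def ocr_with_fallback(
    image_url: str,
    ocr_by_url: Callable[[str], str],
    ocr_by_bytes: Callable[[bytes], str],
    fetch_image: Callable[[str], bytes],
) -> str:
    """Try pass-by-reference first; fall back to sending the image data."""
    try:
        return ocr_by_url(image_url)
    except OcrError as err:
        if URL_INACCESSIBLE_HINT not in str(err):
            raise  # a different failure: do not mask it with a retry
    # Download to the OCR server; this is the slower step and can fail too.
    image_data = fetch_image(image_url)
    return ocr_by_bytes(image_data)
```

Both attempts happen inside one call, so from the user's point of view it is still a single request; any error other than the inaccessible-URL one is re-raised so that it stays visible for debugging.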
epub download flakey
cf. WS:S#.epub download on Wikisource rarely works.
I think I've seen various flakiness being described in Phab updates scrolling past in my inbox, so this might just be the known problem and a bit of bad timing? Or does it sound like something different?
I don't download enough epubs to have any feel for the state of ws-export. Xover (talk) 06:24, 23 November 2023 (UTC)
QR codes
Hey, Sam! So, I was digging around your user pages, after watching videos from your YouTube channel for fun (thank you for making those by the way!!!).
And I happened to find out you had some involvement with QR codes on Wikipedia. I thought you might be interested to know that a while back, I was toying with the idea of dynamically generating QR codes in modern works (such as Don't Let Anyone Take It Away). For example, instead of just uploading a picture of a QR code (which could sometimes just be a scanned image of one from paper), we can use the approach of regenerating it ourselves, to look exactly like it did in the original. This would be similar to the way we handle music sheets, math equations, etc. and could prove to be a handy tool for some, and a more structured quality of the QR codes. I thought a template, probably supported by a Lua module, called {{QR code}} could do it.
The first issue I had, though, is in my lack of knowledge in this area. I was able to find out that to replicate a QR code exactly, you have to be able to know the data_capacity and the error_correction_level, possibly among other factors, and I'm not even sure how you'd find that out from just looking at it...
I wrote some notes about it at User:SnowyCinema/QR codes. Feel free to let me know what you think. Do you think it's possible to recreate QR codes in an SVG file in this way? SnowyCinema (talk) 06:48, 22 January 2024 (UTC)
- @SnowyCinema: That sounds interesting. I see you've already discovered the QRLite extension, that's what I use (on some non-Wikimedia wikis). But am I understanding what you want correctly: this is for works that contain QR codes, that it'd be good to have an automatic way to generate them? Makes sense to me! Rather than adding them as cropped bits of the scans which I guess is what currently happens? Like File:DOJ IER QR Code.jpg. I could imagine if we had something like QRLite then it'd just be a matter of adding e.g. {{#qrcode:http://www.justice.gov/ier}} instead. The toolforge:qrcode-generator tool is probably the best bet at the moment; it allows uploading to Commons. Have you seen the summary we've added at mw:QR Codes? Feel free to add anything that's missing there. Sam Wilson 08:38, 22 January 2024 (UTC)
- Yes, your understanding of my goal with this is perfect! Incidentally, the Toolforge link gave me an "Internal Server Error", and I do remember seeing the MediaWiki portal you've also linked me.
- QR codes that are generated automatically will always go to the same link, but usually end up looking different from one another. This is because there are different sizes and complexities to each one. For example, some people opt into using giant and complex QR codes, medium-sized ones, or tiny ones, based on their personal needs for presenting the code... And on a code level, you can specify how you want your QR code to look based on, at minimum, quantifiers called "data capacity" and "error correction level", some lower-level concepts that I can't say I really understand.
- As I was messing with it, I wrote some Python code that attempted to read a QR code from an image, which is something smartphones can do seamlessly—even this alone was unsuccessful on my part, but should work in theory. Reading the QR code from the scan would be the first step. Reading the URL from it should be fairly straightforward, but... Maybe there's other data? (I have no idea, I'm not the smartphone-savvy type.) Also, I have no idea how you can use code to return the exact size and complexity of the QR code needed to recreate it verbatim.
- Well, maybe it doesn't matter that much whether our QR code ends up looking exactly the same as the original, because the way the QR code looks is entirely generated by a computer algorithm with little to no human artistic input. But, I guess it'd be a nice touch to make it look exactly the same, since it's possible some future editors may remove uses of the template in favor of images because they may view it as incorrect. I'd love to hear your thoughts on this.
- If you can fill in any gaps about how QR codes work, I'm all ears. SnowyCinema (talk) 09:04, 22 January 2024 (UTC)
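On the question of how to read the size and error correction level off an existing code: in the QR standard, the symbol "version" (1 to 40) follows directly from the number of modules per side (21 for version 1, four more per version), and the error correction level sits in the first two bits of the 15-bit format information printed next to the finder patterns, after undoing a fixed XOR mask. The sketch below illustrates just that arithmetic. The function names are hypothetical, it assumes the module grid and format bits have already been read from the scan, and it attempts no BCH error correction on the format bits.

```python
from typing import Tuple

# Two-bit error-correction indicators used in the QR format information.
EC_LEVELS = {0b01: "L", 0b00: "M", 0b11: "Q", 0b10: "H"}

# Fixed XOR mask applied to the 15-bit format information.
FORMAT_MASK = 0b101010000010010


def version_from_modules(modules_per_side: int) -> int:
    """Return the QR version (1-40) for a symbol of the given side length."""
    if not 21 <= modules_per_side <= 177 or (modules_per_side - 17) % 4:
        raise ValueError("not a valid QR symbol size")
    return (modules_per_side - 17) // 4


def decode_format_info(bits: int) -> Tuple[str, int]:
    """Return (error correction level, mask pattern) from the masked bits."""
    unmasked = bits ^ FORMAT_MASK
    ec_bits = (unmasked >> 13) & 0b11        # top two of the 15 bits
    mask_pattern = (unmasked >> 10) & 0b111  # next three bits
    return EC_LEVELS[ec_bits], mask_pattern
```

An exact visual replica would presumably also need the same mask pattern (returned above) and the same data encoding mode, so the version and error correction level alone may not be enough.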
Request for help
Hi,
could you please take a look at Index:TheTreesOfGreatBritainAndIreland vol02.djvu. I think the file is corrupt. What is the best way to handle this? Make a new upload with the IA-uploadtool, or what? How can I arrange this? Many thanx in advance, --Dick Bos (talk) 19:39, 2 February 2024 (UTC)