mw.wikibase.getEntityIdForTitle

[OMG a nude page, let me help resolve that]

I just saw the Lua function mw.wikibase.getEntityIdForTitle announced and, if I am not mistaken, that could be a joyous little bundle of helpfulness for us.

Here I am thinking of cases where we have an author page and a related biographical page in main ns, and working out whether we can poke a wikipedia = parameter on the respective main ns page, or maybe automate a link; similarly, I can see the potential for us to more readily get some bot action to better apply "main subject (P921)" at Wikidata for our biographical works. Am I reading the function properly? — billinghurst sDrewth 22:37, 16 April 2018 (UTC)

@Billinghurst: Interesting! So you mean create a link from the NS0 page of e.g. a biography chapter to the Author NS of the bio's subject? If the bio has a P921, couldn't we link via that (i.e. bio page → sitelink → P921 → Qxx → sitelink → Author page)? I'm not quite getting when we'd need to do a page title look-up... or do you mean, as a means to find unlinked articles? That must be it. So we'd do a getEntityIdForTitle('NS0 Page Name') and see if it comes up with an instance of person, and if it does we'd add something to alert editors here to the fact? Sam Wilson 06:27, 17 April 2018 (UTC)
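
The look-up described above (page title → Wikidata item → "is it a person?") can be sketched outside Lua as well. The following is a minimal JavaScript sketch using the Wikidata web API module wbgetentities rather than mw.wikibase.getEntityIdForTitle itself. The site id "enwikisource", the helper names, and the idea of treating "instance of (P31) = human (Q5)" as the person test are assumptions for illustration, not something either commenter specified.

```javascript
// Sketch only: build a wbgetentities request for a Wikisource page title
// and check whether the returned entity is an instance of human (P31 = Q5).
// Helper names and the default site id are illustrative assumptions.

function buildEntityLookupUrl(title, site = 'enwikisource') {
  const params = new URLSearchParams({
    action: 'wbgetentities',
    sites: site,        // which wiki the title belongs to
    titles: title,      // the page title to resolve to an item
    props: 'claims',    // only the statements are needed for the P31 check
    format: 'json',
    origin: '*',        // allows anonymous cross-origin requests
  });
  return 'https://www.wikidata.org/w/api.php?' + params.toString();
}

// entity: one value from the "entities" object of a wbgetentities response.
function isInstanceOfHuman(entity) {
  const claims = (entity && entity.claims && entity.claims.P31) || [];
  return claims.some((claim) => {
    const snak = claim.mainsnak;
    return Boolean(
      snak && snak.datavalue && snak.datavalue.value &&
      snak.datavalue.value.id === 'Q5'
    );
  });
}
```

A caller would fetch the URL, take each value of the response's entities object, and flag the page for editors when isInstanceOfHuman returns true. Pages with no sitelink come back without claims and are simply reported as not matching.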

Books & Bytes - Issue 27

  The Wikipedia Library

Books & Bytes
Issue 27, February – March 2018

  • #1Lib1Ref
  • New collections
    • Alexander Street (expansion)
    • Cambridge University Press (expansion)
  • User Group
  • Global branches update
    • Wiki Indaba Wikipedia + Library Discussions
  • Spotlight: Using librarianship to create a more equitable internet: LGBTQ+ advocacy as a wiki-librarian
  • Bytes in brief

Arabic, Chinese and French versions of Books & Bytes are now available in meta!
Read the full newsletter

Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 14:49, 18 April 2018 (UTC)

Your feedback matters: Final reminder to take the global Wikimedia survey

WMF Surveys, 00:43, 20 April 2018 (UTC)

Unused files as a list?

Do you know a way to manipulate Special:UnusedFiles so I can get it as an easy list? There are a string of files there that I know that I can straight out delete, though how to get it as a list to easily manipulate in bite size chunks is just not obvious. It is not even obvious that you can pull it from the API, not that I can generate simple text lists from the API anyway — billinghurst sDrewth 04:06, 16 May 2018 (UTC)

@Billinghurst: It doesn't look like it. That special page isn't transcludable even, and it's constructing the database query itself so I suspect the same query isn't done anywhere else (or we'd be reusing it). Also it's the only place mw:Manual:$wgCountCategorizedImagesAsUsed is used. What sort of list are you trying to build? It probably wouldn't be too hard to add transcluding support, if that'd help. Sam Wilson 04:38, 16 May 2018 (UTC)
There are works there that have been completed where the original image has been cleaned/gleaned/screened and uploaded to Commons. So we have the residue images to cleanse, and getting these url by url is a PITA. Getting a list, checking the work completion, and zapping more collectively is bettererer. Noting that prefix lists are unreliable in case one/some aren't done. — billinghurst sDrewth 05:23, 16 May 2018 (UTC)
Dropped the problem into phab:T194865. billinghurst sDrewth 01:44, 17 May 2018 (UTC)
Note that a file linked via {{raw image}} is still considered 'unused'. In pywikibot: python scripts/listpages.py -unusedfiles. — Mpaa (talk) 17:36, 19 May 2018 (UTC)
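
For getting the report as plain text in bite-size chunks, one possible route is the generic querypage list module of the MediaWiki action API. The sketch below is hedged: it assumes that module exposes this report under the name Unusedimages and that results arrive under query.querypage.results with a title field. Both should be checked against the API help on the wiki before relying on them. The helper names are made up for illustration.

```javascript
// Sketch only: request the unused-files report via list=querypage and turn
// the response into small plain-text lists. The qppage value and the
// response shape are assumptions to verify against the API help page.

function buildUnusedFilesUrl(limit = 500, offset = 0) {
  const params = new URLSearchParams({
    action: 'query',
    list: 'querypage',
    qppage: 'Unusedimages', // assumed report name
    qplimit: String(limit),
    qpoffset: String(offset),
    format: 'json',
  });
  return 'https://en.wikisource.org/w/api.php?' + params.toString();
}

// Pull the page titles out of a parsed response, tolerating missing keys.
function extractTitles(response) {
  const querypage = response && response.query && response.query.querypage;
  const results = (querypage && querypage.results) || [];
  return results.map((row) => row.title);
}

// Split a list into chunks of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// One title per line, one text block per chunk.
function toTextBlocks(titles, size) {
  return chunk(titles, size).map((group) => group.join('\n'));
}
```

As noted above, files that are only linked through {{raw image}} would still appear in such a list, so each chunk needs checking against the work before anything is deleted.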

Ping

Hi. Just in case you have not been notified about this: https://phabricator.wikimedia.org/T194861 . It is happening quite often recently. Bye— Mpaa (talk) 20:35, 18 May 2018 (UTC)

Books & Bytes – Issue 28

  The Wikipedia Library

Books & Bytes
Issue 28, April – May 2018

  • #1Bib1Ref
  • New partners
  • User Group update
  • Global branches update
    • Wikipedia Library global coordinators' meeting
  • Spotlight: What are the ten most cited sources on Wikipedia? Let's ask the data
  • Bytes in brief

Arabic, Chinese, Hindi, Italian and French versions of Books & Bytes are now available in meta!
Read the full newsletter

Sent by MediaWiki message delivery on behalf of The Wikipedia Library team --MediaWiki message delivery (talk) 19:33, 20 June 2018 (UTC)

Meeting followup

Hi Sam, Thanks for being there today. Lots of stuff half heard, half understood, to try to follow up on. One thing you mentioned was some form of mapping using wikipedia when data have been uploaded to the commons. I was curious about this as I dislike my Rgooglemaps: they are too fuzzy. Nor am I mad about my Australian outline maps (produced using SAS), so another technique would be good.... MargaretRDonald (talk) 13:27, 27 June 2018 (UTC)

@MargaretRDonald: There's a new thing called Kartographer that can show data on maps pretty easily. For example, at right is the Cuscuta australis data we were looking at yesterday. The colours and styles and things can all be customised, and the data doesn't have to live in the wiki page (as I've done in this example). —Sam Wilson 01:25, 28 June 2018 (UTC)
@Samwilson: Thanks for this. (Only just spotted...) MargaretRDonald (talk) 02:07, 6 July 2018 (UTC)

@Samwilson: Sorry to be so thick. But here in your text you have listed all the co-ordinates... and of course, the map is embedded in the page.. Writing code to generate the mark-up looks a smidgin ugly. So I am not quite sure how this is easier, or conceptually better from a wikipedian point of view (?) MargaretRDonald (talk) 02:14, 6 July 2018 (UTC)
No, the idea would be to include the coordinates (in KML format) in a template in the manner of e.g. wikipedia:Template:Attached KML/High Street, Fremantle. Then, to update the range map, only that template would need to be changed and the article map would update automatically from there. I'm not sure if it is easier, but it does make the map zoomable, and perhaps is quicker than creating separate raster map files and uploading them. Just an idea though! :) —Sam Wilson 06:50, 6 July 2018 (UTC)
@Samwilson: Thanks for the explanation, Sam. MargaretRDonald (talk) 16:54, 20 January 2020 (UTC)
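
On the point about generating the mark-up by code: the following is a small JavaScript sketch of what that could look like, producing a Kartographer mapframe tag with inline GeoJSON from a list of coordinates. It is not the KML-template approach Sam describes, just an illustration of the inline form. The point labels and coordinates are hypothetical placeholders (not real occurrence records), and the helper names and default sizes are assumptions.

```javascript
// Sketch only: turn a list of labelled coordinates into GeoJSON and wrap it
// in a <mapframe> tag. Labels and coordinates used with it are hypothetical.

function toGeoJson(points) {
  return {
    type: 'FeatureCollection',
    features: points.map((p) => ({
      type: 'Feature',
      properties: { title: p.label },
      // GeoJSON order is [longitude, latitude]
      geometry: { type: 'Point', coordinates: [p.lon, p.lat] },
    })),
  };
}

function toMapframe(points, options = {}) {
  if (points.length === 0) {
    throw new Error('at least one point is needed to centre the map');
  }
  const { width = 300, height = 300, zoom = 4 } = options;
  // Centre the map on the mean of the points.
  const lat = points.reduce((sum, p) => sum + p.lat, 0) / points.length;
  const lon = points.reduce((sum, p) => sum + p.lon, 0) / points.length;
  const attrs =
    `width="${width}" height="${height}" zoom="${zoom}" ` +
    `latitude="${lat.toFixed(4)}" longitude="${lon.toFixed(4)}"`;
  const body = JSON.stringify(toGeoJson(points), null, 2);
  return `<mapframe ${attrs}>\n${body}\n</mapframe>`;
}
```

The trade-off raised in the thread still applies: generated inline mark-up has to be regenerated whenever the data changes, whereas keeping the coordinates in one template means only that template needs updating.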

seeing other wikisources

@Samwilson: Hi, Sam. It would be very nice if one could see all the corresponding wikisource things on the left as one can in wikipedia or as one can in wikidata. I am constantly seeking other language sources for botanical stuff and would be nice to be able to navigate (relatively) easily to other language sources..... Any thoughts? MargaretRDonald (talk) 02:04, 6 July 2018 (UTC)

@MargaretRDonald: Yes, this is a definitely wanted thing, and is being worked on as Phabricator:T180303. The trouble with Wikisource interlinking, compared to other projects, is that works in different languages don't get directly linked to the same Wikidata item, but rather each get their own (which has a 'edition or translation of' property that links to the unifying work-level item). —Sam Wilson 06:53, 6 July 2018 (UTC)
@Samwilson: Hmmm. (I see) I look forward to all those clever persons making it happen sometime.... Cheers, MargaretRDonald (talk) 07:14, 6 July 2018 (UTC)

Living authors category again

Hi Sam. First I would like to thank you a lot for handling the floruit problem at the template {{Author}} and thus solving partly the Living people category.

There are also some authors who do not have the floruit property filled at Wikidata, because they are known not for a single dated event but for activity over a longer period. Such people can have the Wikidata properties "work period (start)" and "work period (end)" instead. An example of this is Author:Mordach Mackenzie (Q56612310), whose birth and death dates are unknown and who is known for his work between 1746 and 1764. Do you think it would be possible that a) the author's page at Wikisource could take these dates from Wikidata and display them as "fl. 1746–1764" and b) authors whose "work period (end)" was more than e.g. 90 or 110 years ago could be removed from the Living people category too?

I am writing you because you are the only one here I know that can handle such things (though I believe there are more people like that). However, it is not of the highest importance, so if you do not have enough time, it can wait. Thanks. --Jan Kameníček (talk) 11:30, 18 September 2018 (UTC)

Yes, that sounds like a great idea! I did see your comment on that other page; sorry I didn't reply yet. I'm keen to help, not sure when I'll find time, but it's conceptually the same thing we're already doing but just with a different property, so it shouldn't be too hard. There are currently 7 failing tests that I want to fix up before embarking on any new features though, so I might try to do them first. Will keep you posted! Sam Wilson 03:15, 19 September 2018 (UTC)
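
The two requested behaviours can be sketched as a pair of small functions. This is a JavaScript illustration of the logic only, not the code of the {{Author}} module. It assumes the years have already been read from the "work period (start)" and "work period (end)" statements (P2031 and P2032 on Wikidata), and the 110-year default is just one of the two example cut-offs suggested above.

```javascript
// Sketch only: format a floruit range from work-period years and decide
// whether an author could still belong in a living-people category.
// Function names and the default cut-off are illustrative assumptions.

function formatFloruit(startYear, endYear) {
  const hasStart = startYear !== null && startYear !== undefined;
  const hasEnd = endYear !== null && endYear !== undefined;
  if (!hasStart && !hasEnd) return '';
  if (hasStart && hasEnd && startYear !== endYear) {
    return `fl. ${startYear}\u2013${endYear}`; // en dash between the years
  }
  return `fl. ${hasStart ? startYear : endYear}`;
}

// True when there is no evidence that the work period ended long enough ago.
function mayBeLiving(workPeriodEndYear, currentYear, cutoffYears = 110) {
  if (workPeriodEndYear === null || workPeriodEndYear === undefined) {
    return true; // nothing known, so do not remove from the category
  }
  return currentYear - workPeriodEndYear <= cutoffYears;
}
```

For the Mordach Mackenzie example, formatFloruit(1746, 1764) yields the "fl. 1746–1764" form asked for, and mayBeLiving(1764, 2018) is false, so the page would drop out of the category.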

PageCleanUp feature request

Hi,

Just a note to make a record of our recent conversation about my feature request for your very useful PageCleanUp.js tool:

If a full stop (period) is followed by a lower-case letter:

Some text. then some more

then it should probably be a comma:

Some text, then some more

If a comma is followed by a capital letter:

Some text, Different text

then (proper names notwithstanding) it should probably be a full stop:

Some text. Different text

If this is not a major issue for most OCRd text, perhaps a separate script would be better. What do you think? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:31, 3 October 2018 (UTC)

Also, perhaps the script could fix ligatures, like the "fi" and "fl" in "magnificent power of flight"? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 16:14, 5 October 2018 (UTC)
  • @Pigsonthewing: dots and commas done, good idea! As for ligatures, Wikisource:Style_guide/Orthography#Ligatures suggests that we not use them as search engines struggle. I suspect that's wildly out of date. We do avoid e.g. the long 's', because it's "just" orthography and so not relevant to the text. Also, there are ligatures (e.g. st) that don't exist in many fonts at all. Sam Wilson 22:58, 9 October 2018 (UTC)
Sorry if I wasn't clear; I meant the script could change from ligatures generated by OCR to regular letter pairs. Thanks for the punctuation feature. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 00:25, 10 October 2018 (UTC)
@Pigsonthewing: Oh! Ha, yes I see now. Done! :) Sam Wilson 05:23, 10 October 2018 (UTC)
That is going to save me a lot of dull drudgery. Thank you! Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:42, 10 October 2018 (UTC)
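
The rules discussed in this thread can be sketched as two regular-expression passes plus a ligature table. This is a hedged JavaScript illustration, not the actual PageCleanUp.js code. The function names are made up, and the ligature table covers only the Unicode "Latin small ligature" block U+FB00 to U+FB06.

```javascript
// Sketch only: the full stop/comma swaps and ligature expansion described
// above. Not the real PageCleanUp.js implementation.

const LIGATURES = {
  '\uFB00': 'ff',
  '\uFB01': 'fi',
  '\uFB02': 'fl',
  '\uFB03': 'ffi',
  '\uFB04': 'ffl',
  '\uFB05': 'st', // long-s + t ligature, written with a regular s
  '\uFB06': 'st',
};

// Replace OCR-generated ligature characters with regular letter pairs.
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => LIGATURES[ch]);
}

function fixStopsAndCommas(text) {
  return text
    // full stop followed by a lower-case letter: probably a comma
    .replace(/\. ([a-z])/g, ', $1')
    // comma followed by a capital letter: probably a full stop
    .replace(/, ([A-Z])/g, '. $1');
}
```

Both punctuation rules are heuristics. As the request itself notes, proper names after a comma will be miscorrected, and so will the pronoun "I" and abbreviations such as "e.g." followed by a lower-case word, so the result still needs a human check.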

mediawiki-feeds 503?

Hi Sam, thanks for all the stuff you do!

Realized I wasn't subscribed to the Signpost anywhere and when I tried the RSS feed, Feedly said it couldn't reach, and clicking the feed got a 503 from Toolforge and it said you're the mediawiki-feeds maintainer.

Thought you might like to know. Hopefully it's just an old link or something else simple. John Abbe (talk) 17:32, 2 June 2019 (UTC)

@John Abbe: Thanks for telling me about this! It made me realise that that tool isn't in my list of monitored tools, so I hadn't seen that it was down. That's fixed now, and so is the bug that was causing it to fail on the Signpost feed, and the tool is back online. See how it fares, and ping me with any dramas. :) Thanks! Sam Wilson 02:18, 3 June 2019 (UTC)
Sweet! And thx for the quick fix. John Abbe (talk) 05:46, 5 June 2019 (UTC)

Requesting import of "Links count" gadget

Any chance you could help with: Wikisource:Scriptorium#Requesting import of "Links count" gadget, please? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:23, 20 June 2019 (UTC)

curly quotes script

Hi -- thanks for the very useful-looking script. How do I install it at my common.js? I tried adding importScript('User:Samwilson/CurlyQuotes.js'); but that didn’t work. Thanks for help Levana Taylor (talk) 16:05, 30 August 2019 (UTC)

@Levana Taylor: Use this in your common.js:
mw.loader.load('//en.wikisource.org/w/index.php?title=User:Samwilson/CurlyQuotes.js&action=raw&ctype=text/javascript');
And let me know if you find any bugs with it! :-) It's adapted from https://github.com/gitenberg-dev/punctuation-cleanup Sam Wilson 01:58, 31 August 2019 (UTC)
Hmm… nice interface, mostly works, but I’ve already found some issues. Most notably, it’s not always correctly noticing paired apostrophes to leave straight: try it on this page, for example. Also it doesn't know to leave double-quotes alone if they’re inside angle brackets (I use <section begin="s1" /> for section breaks, for example). And, I’m sure this isn't the only case it doesn’t get right, but when you have italics inside double quotes ("Really?!") it doesn’t alter the double quotes. Here’s a thought for long-term development: since you’ll never be able to get it to work absolutely perfectly, the thing that’d make it really easy to check the results is if it highlights quotes in 3 different colors after finishing work, one for left, one for right, one for straight. Single and double quotes could be the same color: no need to have six! Levana Taylor (talk) 05:05, 31 August 2019 (UTC)
Hey, sorry if that last comment was too negative! I’ve been using the script a lot and finding it extremely useful. I do have a list of stuff it isn't catching, though. Lemme know if you want it. Levana Taylor (talk) 04:53, 5 September 2019 (UTC)
@Levana Taylor: Oh cool! Yes, please. I'm sure there are going to be a bunch of things we can't handle, but it'd be good to try. :) The quotes thing I'm looking at now, but I think the highlighting colour thing might be a bit harder. Or do you just mean in the preview, not the editing box? That might be easier. Anyway, thanks for the feedback! I'll try to improve the script. Sam Wilson 05:04, 5 September 2019 (UTC)
@Levana Taylor: Have you found Wikisource:WikisourceMono by the way? That helps with more easily seeing the different characters while editing. Sam Wilson 05:55, 5 September 2019 (UTC)
Yes, highlighting in preview would be helpful, even if it's not possible in the edit box. I find the font I'm using plenty readable, but the point is that you want the left-right pairs, or absence thereof, to jump out at you so your mind doesn't overlook them.
Anyhow, I was wondering why you don’t have some simple rules like double quote at start of line is left, at end is right; single quote between two letters of the alphabet is right. Must be a reason I’m sure! As for suggestions:
  1. My biggest suggestion for improvement would be to drop all the things the script does with dashes and stick to just quotes. I never find the dashes useful and they are constantly messing things up, like pagenames that contain hyphens, and the <!-- comment markup (though I guess you must be finding the dash alterations useful since you put them in!)
  2. Bug noted: paired apostrophes before s and d are not being correctly interpreted
  3. The French d’ and l’ are major list items not yet being recognized. Then there’s ’twould - ’twill - ’twere - ’tisn’t - ’twasn’t (etc.) - ’midst - ’neath - ’bout - ’fraid - ’nother - ’uns - People in the novel I’m reading keep saying ’Gad! and ’Pon my honour! Levana Taylor (talk) 06:48, 5 September 2019 (UTC)
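
The "simple rules" suggested in this thread can be sketched in a single pass. The JavaScript below is an illustration of those rules only, not the CurlyQuotes.js implementation. It leaves anything inside angle brackets and runs of two or more apostrophes (wiki italic and bold markup) untouched, curls an apostrophe between two letters or in front of the listed elisions, and curls a double quote only at the very start or end of a line. The word list is taken from the comment above; everything else, including the function name, is an assumption.

```javascript
// Sketch only: a rule-based pass over straight quotes. Double quotes in the
// middle of a line are deliberately left alone for a human to decide.

const QUOTE_PATTERN = new RegExp(
  [
    '<[^>]*>',   // tags and comments: protect attribute quotes
    "'{2,}",     // '' and ''' wiki markup: keep straight
    "'(?=(?:twould|twill|twere|tisn|twasn|midst|neath|bout|fraid|nother|uns|Gad|Pon)\\b)", // elisions
    "(?<=[A-Za-z])'(?=[A-Za-z])", // apostrophe between two letters
    '^"',        // double quote at start of line
    '"$',        // double quote at end of line
  ].join('|'),
  'gim'
);

function curlQuotes(text) {
  return text.replace(QUOTE_PATTERN, (match, offset, whole) => {
    if (match.length > 1) return match;    // protected tag or wiki markup
    if (match === "'") return '\u2019';    // apostrophe or elision mark
    const atLineStart = offset === 0 || whole[offset - 1] === '\n';
    return atLineStart ? '\u201C' : '\u201D';
  });
}
```

Because the between-letters rule handles d’ and l’ before a name as well as contractions, the French cases fall out without a separate list. An opening single quote that happens to precede one of the listed words would be curled the wrong way, which is one reason a preview highlight of left, right and straight quotes would still be useful.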

Cloud Services and Toolforge

You wouldn't happen to be familiar with Cloud Services (WMCS, CVPS), Toolforge, and related infrastructure bits? I have some kinda hacky local tooling for working with DjVu files and Tesseract and was toying with the idea of trying to set up some related utilities of possibly general usefulness. But right now I'm banging my head against the wall of insufficient documentation for the stuff the WMF provides for hosting such things. In other words, I'm looking for someone familiar with the setup that's willing to answer dumb questions and provide some hand-holding. --Xover (talk) 15:28, 17 September 2019 (UTC)

@Xover: We do have the shared account user:Wikisource-bot for bits, and there is a range of documentation at wikitech: and places. Mailing list is pretty good for support, and IRC can be useful (though you have to be in a good time zone). I am totally useless as a coder, though have managed to find handholders to allow me to bumble through. — billinghurst sDrewth 23:33, 17 September 2019 (UTC)
@Xover: Yes sure, I'd be happy to help. I'm reasonably familiar with Toolforge. What issues are you having? Sam Wilson 03:37, 18 September 2019 (UTC)
Well, mostly stupidity and documentation that seems to be written for a different audience than myself.
My vague ideas to begin with are a replacement (possibly temporary) for Phe's OCR gadget, and possibly a tool that'll take images from some source (IA id, URL, zip file, etc.) and spit out a DjVu with a text layer. There's some related stuff that might be relevant, like an easy way to add and remove pages from a DjVu, or to redact pages from a DjVu (typically for copyright reasons). Not sure what all would make for tools that are 1) a reasonable amount of effort to get working and 2) of sufficiently general utility to be worth it. A short term alternative for the OCR gadget is the primary motivation as that seems to be pretty critical for several contributors.
Right now I'm trying to figure out where and how it'd make sense to host something like that—Cloud VPS, Toolforge, or a third party general web host somewhere—and I'm just not finding the documentation that'll tell me what CVPS and Toolforge actually look like in terms of facilities, hosting environment, and so forth. As I said, the docs seem to be addressing different questions and for a different audience than me. So my first set of dumb questions / need for hand-holding is to figure out that.
What I'd need is:
  • A sufficiently Unix-y hosting environment. Fedora would be perfect, and RHEL or CentOS good second choices. Any good modern and not too esoteric Linux distro would probably be good enough, but experience tells me there are crucial differences in relevant sysadmin tools and package distribution/management systems and availability of packaged third-party software. Depending on what comes ready out of the box that may or may not be an issue.
  • A non-ancient version of Perl 5, with a reasonable set of standard modules installed. Given the rate of change in perl-land, I don't imagine any OS the infrastructure team are willing to host would contain a version that's too old. I don't currently need a lot of esoteric perl modules, but by experience I expect to need to be able to install at least some. For example, I believe HTML::Parser was dropped from the core modules so that's something that would need to be available through some method.
  • A not-too-esoteric CGI hosting environment. My experience is with Apache, possibly with mod_perl, but anything that supports Perl and can be tweaked for interactive performance would probably work.
  • Tesseract 4.1 installed and functioning.
  • GraphicsMagick in some recentish version. ImageMagick will probably do in a pinch, but my experience with GraphicsMagick is better.
  • The ability to have such software updated in some reasonable timeframe. Whether that's done by the sysadmins as part of the platform, or whether that's something I'd sysadmin myself using the package tools, isn't all that important.
  • I'm way past the age where I want to waste time compiling software from source, so I really really would prefer that can be handled through some kind of package system.
  • I'd need a moderately large amount of disk space to play with, and moderately performant too (for OCR, disk IO quickly becomes a bottleneck). Several tens of gigs at least for temporary stuff: purged, depending on the case, on the timeframe of hours up to weeks. A gig or so per DjVu file, and room for at least 10 jobs' worth of files sitting around, would be the minimum reasonable. More is better.
  • For the batch OCR stuff I'd eat all the CPU I could get too, but for the per-page stuff anything that isn't completely choked would probably work well enough.
  • It would be a bonus if I had easy read-only access to files from Commons and local files on enWS (and possibly the other WSes if anyone should want it). A virtual filesystem or something, preferably more performant than having to download a copy of the file over HTTP. If any writing to a wiki is eventually needed that'd go through the API, most likely using OAuth, so read-only would be fine. Not a requirement by any means, but it'd make the DjVu manipulation stuff much more elegant and efficient if I ever get around to it.
  • Network access to Commons and enWS for downloading files, and for talking to the API if it becomes relevant.
  • General internet access to download stuff from IA, Hathi, etc. if that becomes relevant.
  • No immediate need for access to database dumps or similar: everything I have in mind is just crunching File: files.
  • No immediate need for database facilities: I might eventually need something to track batch jobs or whatever, but I'd get by with file-based solutions for a good long while before that became an issue.
  • If there is a batch system with oodles of compute resources that responds to "run this code on that data over there and notify me with the results once you're done" that'd be neat, but I'm not entirely sure that'd be more performant than doing it on whatever is hosting me directly. Possibly it'd enable parallel execution of large batch OCR/DjVu jobs if the tool is used a lot, that would otherwise need to be serialised, but I'm not sure the volume would be there to make that worth the effort.
  • If anyone actually started using a utility here I would want to add extra maintainers with OS-level access.
  • For source control and such I'd probably use Github (Gerrit looks completely impenetrable to me), and issue tracking either there or Phabricator, so no special needs related to that.
That's a rough braindump of the requirements. Given your knowledge of the facilities, what should I be looking at for hosting? By my guesses here, Toolforge might be both too constricting for the needs, and at the same time its advantages aren't all that relevant (I think I've grasped that Toolforge has DB dumps and similar already available, but as I don't need those…). If my vague understanding is anywhere near right, Toolforge is essentially a shared web host with some Wikimedia-specific facilities while Cloud VPS is just server hosting (that happens to be para-virtualised rather than bare-metal); but from there to the details that'd let me assess them against the requirements is a bit steep a climb right now. Anything I haven't thought of? Am I completely off my rocker? Should I go away and stop bothering you? :) Any help and hand-holding would be very much appreciated! --Xover (talk) 08:28, 18 September 2019 (UTC)
@Xover: Yep, your understanding of the shared-hosting or VPS is right. You could certainly do all you need on a VPS, but I think it sounds like you'll be fine with a Toolforge tool (although I'm not 100% certain off the top of my head of version specifics etc.). I recommend creating a tool account and seeing how you go. Lots of people use Github, so no worries there. Probably the most confusing thing for new toolforge users is the cronjob setup: basically, your cronjobs don't actually run things themselves, they just add a job to the 'grid', where it runs. In practice this just means a command has to use jsub. Anyway, I recommend a) creating a tool, b) trying to run what you need, and c) ask me when you hit an issue. Sam Wilson 23:16, 25 September 2019 (UTC)
Well, apparently Toolforge is a no-go. Is there any point requesting a dedicated VPS so I could do that stuff myself? I have no idea what the criteria are for getting one or whether the stuff I have in mind is even remotely what the CVPS infrastructure is intended for. Any suggestions or pointers would be much appreciated! --Xover (talk) 18:32, 6 January 2020 (UTC)
@Xover: Yes, I think if you've demonstrated that your requirements aren't met by Toolforge, then you should be able to request a VPS. Then you'll be able to install whatever you need. (Not that I'm completely familiar with the whole process, but that's my understanding.) Sam Wilson 00:03, 7 January 2020 (UTC)

Little problem on nl-ws

Hi Sam,

may I ask you to take a look at some strange problem, that happens on nl-ws?

It does happen only in one book, for instance at this page: s:nl:Pagina:Heemskerck op Nova Zembla.djvu/101. As you can see the header does not outline correctly. I have been trying all kinds of things. Nothing seems to help. It only happens in this book. In all other books on nl-wiki where we use the RH-template, it works fine, see e.g. s:nl:Pagina:De voeding der planten (1886).djvu/46. Can you explain why this happens to this book?

Many greetings, and looking forward to your answer, --Dick Bos (talk) 10:47, 2 October 2019 (UTC)

@Dick Bos: It looks to me like there are odd block-level elements being introduced where they shouldn't be. For example, the following part of s:nl:Sjabloon:RunningHeader (and the same applies to the right side):
Currently:
    -->|{{#if:{{{1|{{{left|}}}}}} |
<span style="float: left; display: block;">
{{{1|{{{left}}}}}}
</span>}}<!--
Should be:
    -->|{{#if:{{{1|{{{left|}}}}}} |<!--
--><span style="float: left; display: block;"><!--
-->{{{1|{{{left}}}}}}<!--
--></span>}}<!--
And also that there isn't a default value for the centre component (in {{RunningHeader}} here, it's a &nbsp;).
Actually, that template could be rewritten with block-level components and using flexbox, but that's another story! :-)
Sam Wilson 09:48, 5 October 2019 (UTC)
You were right! I usually copy this kind of templates from en-ws (I really don't understand a word of the code, to be honest), and apparently there had been an update of the template. Now that I copied the newest version to nl-ws, it is running perfectly! Hurray.....
We need someone with some technical knowledge to do this kind of maintenance work on the Dutch Wikisource! But unfortunately, activity on nl-ws is very low. Thanks for helping us! --Dick Bos (talk) 16:29, 7 October 2019 (UTC)
@Dick Bos: Oh good, I'm glad it works. There isn't really a good system yet for keeping imported templates up to date. One possible way could be that we add some of this functionality to the new Wikisource extension, because then it'd be on all Wikisources. That's going to take some more work though. For now, it's export/import and keep an eye on things. Sam Wilson 03:55, 8 October 2019 (UTC)
@Dick Bos: Can I recommend that you special:import templates (select "en" from dropdown) rather than copy and paste. 1) it brings a history and can actually bring other required components; 2) it allows, in future times, others to track and find what you were doing and reproduce at nl:special:log/import. — billinghurst sDrewth 23:33, 6 January 2020 (UTC)

Plain sister updates

Hi SW. [Happy early cricket season, hope your weather is better than mine at the moment] I am wondering whether you had been able to look at my thoughts on template talk:plain sister for an update to Module:Plain sister to automatically link articles to enWP biographies. I know that I lack the skills to make those changes, and wondered whether you had the skills for such a change, or whether we are needing to go searching outside. — billinghurst sDrewth 10:09, 9 November 2019 (UTC)

@Billinghurst: I've replied over there. I've resurrected some work I did last year on that, and it's now functioning. See what you think. I'm happy to make the changes and monitor things closely. Sam Wilson 21:28, 10 November 2019 (UTC)

validated index count discrepancy

Hi. Hope all is well out west.

  • Your tool says "This page presents the categorisation of the 3172 works on the EN Wikisource"
  • Category:Index Validated says "... pages are in this category, out of 3,442 total"

Which is correct? What sort of discrepancies would we need to identify to resolve the 270 gap? I cannot work out what to do with a json list (my uselessness) to make any comparisons within AWB or petscan:.

Noting that when I compare Category:Index Validated with Category:Indexes validated by date (3443) that there are some discrepancies to resolve between those two, so the numbers above will probably have bumped around a little by the time you see this. — billinghurst sDrewth 02:51, 6 January 2020 (UTC)

  • @Billinghurst: Hello! That's interesting. :( My first thought is that the missing ones are not categorized (i.e. their index pages are in Category:Index Validated but their mainspace pages are not categorized). It could also be that the tool can't figure out what their mainspace pages are (it looks for links to a top-level mainspace page from the Index page, with a query a bit like this one).

    Sam Wilson 03:10, 6 January 2020 (UTC)

    Thanks. For the incompetent, would you be able to generate a wikipage or a petscan query (preferred as regeneratable), and I will take it from there to determine the issue. Some of this list will then be works not transcluded, and I will explore the remainder. I know that there are plenty without title links, it is one of Esme's traits. We should document that we wish for titles to be linked, as I don't always do it for others' works. — billinghurst sDrewth 03:30, 6 January 2020 (UTC)
  • @Billinghurst: I've been trying, but have not yet figured out a simple way. Will keep looking at it! The ws-cat-browser is also due for a rejigging I think, because we can now determine validated mainspace pages via Category:Validated texts, so it no longer really needs to go via the index page at all. Although, maybe it's good to keep it as-is, for helping to find discrepancies like this. Sam Wilson 00:45, 7 January 2020 (UTC)
    A SPARQL query in Petscan doesn't function?

    I would disagree that "category:validated texts" is a reasonable match. That category does not have a one-to-one relationship with Index: ns; DNB to works, or volumes of DNB to works, are one-to-many. Plus, it is grossly underpopulated. There is no easy means to populate it from either the work side or the index side; even then the root page and the index: page are usually not one-to-one either.

    Last time that I asked about flag addition I was told that there was no ready means to bot populate the flag via the available tools. blah blah blah blah... <sigh> — billinghurst sDrewth 04:30, 7 January 2020 (UTC)
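
For the comparison asked about at the start of this thread (tool list versus category members), a set difference is enough once both lists are available as arrays of page titles. The JavaScript below is a minimal sketch under that assumption. The format of the tool's JSON list is not described here, so the inputs are simply taken to be arrays of title strings, and the function name is made up.

```javascript
// Sketch only: report titles present in one list but not the other.
// Assumes both inputs are plain arrays of page-title strings.

function diffTitles(toolTitles, categoryTitles) {
  // Treat underscores and spaces as equivalent, as wiki titles do.
  const normalise = (title) => title.replace(/_/g, ' ').trim();
  const tool = new Set(toolTitles.map(normalise));
  const category = new Set(categoryTitles.map(normalise));
  return {
    onlyInCategory: [...category].filter((t) => !tool.has(t)).sort(),
    onlyInTool: [...tool].filter((t) => !category.has(t)).sort(),
  };
}
```

The onlyInCategory list would be the starting point for chasing the 270-odd gap: untranscluded works, index pages without a title link, and anything the tool could not map to a mainspace page.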

Index:East Anglia in the twentieth century.djvu extra image page, now text off by one

Hi SW. IA-upload bot generated the above work, and it seems to have inserted that random image page as the lead, the work shows the image on page 2 at IA. Now the text and scans are out by one. What is the best way to address/resolve? — billinghurst sDrewth 01:42, 20 January 2020 (UTC)

@Billinghurst: This seems to be some discrepancy with the book viewer at IA, because the imagecount attribute in the work's metadata says 672, but the book viewer is only showing 670 (there are non-book scans at front and back). I can't find anything in the metadata that explains how book viewer is making this decision; I guess it's in there somewhere, and ia-upload could use the same logic to exclude these pages. I don't have time right now to dig into it though. :( It looks like there's only a few pages proofread so far, so I guess it's a matter of resolving it manually. Sam Wilson 03:12, 20 January 2020 (UTC)
okay. FWIW The display for IA-upload bot, just showed the second scan page as the first page as the page to exclude. — billinghurst sDrewth 03:33, 20 January 2020 (UTC)
@Billinghurst: oh, hmm yeah that's annoying. It's because it uses the bookreader thumbnails and numbering system to get that image. :( I'll open an issue. Sam Wilson 03:40, 20 January 2020 (UTC)
Okay, what is your trick to get around 100MB? The PDF -> DJVU conversion pushed it over the upper size, though you clearly have a sneak means through. If you can apply the corrected file it is at toollabs:wikisource-bot/East Anglia in the twentieth century.djvu. — billinghurst sDrewth 12:10, 22 January 2020 (UTC)
@Billinghurst: Done. It's chunked upload protocol—which is what the UploadWizard uses behind the scenes—which allows up to 2GB per upload. Last safety valve is server-side upload which can be performed by some WMF staff (sysadmin, not dev, iirc) and bypasses all size restrictions, but that's most suitable for things like massive donations from some archive or library and not individual files. --Xover (talk) 12:29, 22 January 2020 (UTC)
Culled the file, two times misaligned by different processes. Will await other fixes, it isn't an urgent work. — billinghurst sDrewth 07:36, 23 January 2020 (UTC)
@Billinghurst: Want me to regenerate the file from the source scans? --Xover (talk) 08:13, 23 January 2020 (UTC)
I don't mind either way, it was poked up for Charles, and it is not urgent. It can wait until there is a fix, or the need for one to demonstrate a fix. — billinghurst sDrewth 10:43, 23 January 2020 (UTC)
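As a rough illustration of the chunked upload mentioned above, here is a minimal sketch of splitting a file into chunks and building the form fields for each one. It assumes the MediaWiki `action=upload` parameters `stash`, `filename`, `filesize`, `offset` and `filekey`; the chunk size and the helper names `iter_chunks` and `chunk_params` are hypothetical, and nothing here makes a network request or reflects how UploadWizard itself is written.

```python
from typing import Dict, Iterator, Optional, Tuple

CHUNK_SIZE = 5 * 1024 * 1024  # assumed chunk size, not taken from the discussion


def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, chunk) pairs that cover the payload in order."""
    for offset in range(0, len(data), chunk_size):
        yield offset, data[offset:offset + chunk_size]


def chunk_params(filename: str, filesize: int, offset: int,
                 filekey: Optional[str]) -> Dict[str, str]:
    """Build the form fields for one stashed chunk.

    The chunk bytes and a CSRF token would be sent alongside these fields;
    both are left out of this sketch.
    """
    params = {
        "action": "upload",
        "format": "json",
        "stash": "1",
        "filename": filename,
        "filesize": str(filesize),
        "offset": str(offset),
    }
    # The first chunk has no filekey yet; later chunks reuse the one returned
    # by the previous response.
    if filekey is not None:
        params["filekey"] = filekey
    return params
```

Each response would be expected to return a file key to pass with the next chunk, and a final request with that key would publish the stashed file.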

Author:George Spearing

work written 1803, born 1824. Needs a fix, or we have a miracle! — billinghurst sDrewth 03:50, 17 April 2020 (UTC)

  • @Billinghurst: Ha! Yes, oops. I've fixed it to be (as @Annalang13 correctly had it) his death date. Also added his birth date based on the fact that he turned 41 while down the hole. Sam Wilson 04:05, 17 April 2020 (UTC)

Using Google OCR for old English text

Hi. I'm running a project to upload 3,000 chapbooks from the National Library of Scotland's digitised collections and we're interested in using the Google OCR function instead of Tesseract because it identifies the long s letter (ſ) really well. I noticed you've been quite heavily involved in the discussion around Google OCR - even though it's discouraged to use Google OCR with English, do you think this would be an acceptable use? https://en.wikisource.org/wiki/Wikisource:WikiProject_NLS Gweduni (talk) 14:51, 4 May 2020 (UTC)

@Gweduni: I think it definitely would be okay. The only reason it's at all discouraged (and it should be a less strong word, I think) is that we have a quota with Google. However, the quota is always renewed, and Google are (I think!) very happy for us to use their Cloud Vision API. We're also going to be doing some improvements soon (although I guess still a couple of months away) that will hopefully increase the quality of the text returned (Google gives us lots more structure about the OCR text than we're currently using, so we could do things with e.g. automatically improving punctuation, or even adding wiki templates where they're unambiguous). For updates, follow the phabricator:tag/wikisource_ocr project. Sam Wilson 22:32, 4 May 2020 (UTC)
@Samwilson: Great, that's really good news. We'll move over to using Google OCR on our project from now on, and I'll have a look at the wikisource ocr project you mentioned. Excited to hear new developments are on the way! Gweduni (talk) 12:12, 5 May 2020 (UTC)

Long S

Hello,

I saw that you were involved in some of the "long s" discussion many years back. I've been trying to find good info here on how to approach long s in proofreading, but haven't found a clear guideline one way or the other. When I tried the template someone had created, it didn't seem to work, and also seems rather tedious unless I'm missing something (likely). Thanks for any clarification or advice you might have! Grillo7 (talk) 16:46, 3 June 2020 (UTC)

@Grillo7: I think the basic guidance is not to use them at all, but of course if you're consistent within a work then it's fine. I used the {{ls}} template, but I think that's now set to only display the long S in the Page namespace (and a normal S in the mainspace). If you definitely want a long S in every situation, then you can just use the ſ character (probably copying and pasting it is the easiest way, or remembering its key shortcut). —Sam Wilson 23:10, 3 June 2020 (UTC)
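The namespace-dependent behaviour described for {{ls}} can be pictured with a small sketch. This is not the template's actual implementation; the function names and the namespace strings are assumptions made for illustration only.

```python
LONG_S = "\u017f"  # the ſ character


def render_ls(namespace: str) -> str:
    """Return a long s in the Page namespace and a normal s elsewhere,
    mimicking the behaviour described above (hypothetical helper)."""
    return LONG_S if namespace == "Page" else "s"


def normalise_long_s(text: str) -> str:
    """Replace every literal long s with a normal s, e.g. for searching."""
    return text.replace(LONG_S, "s")
```

A literal ſ typed directly into the text is not affected by the namespace, which is why it shows as a long s everywhere.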

Template:Person

Hi,

Please could you bring {{Person}} more into line with {{Author}}, in particular with regard to pulling in data (and images) from Wikidata? No doubt they can use the same Lua module. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 10:01, 19 October 2020 (UTC)

ia-upload front page trim — file path only error?

Is there an issue with getting the paths fixed up in ia-upload? I added a phab request a while ago as the trim first page functionality dies

An error occurred: Command not found: "djvm -d "/data/project/ia-upload/ia-upload/jobqueue/menkentandkenti00hutcgoog/menkentandkenti00hutcgoog.djvu" 1 2>&1"

and as it has worked previously, I am presuming that it is a path issue that can be resolved by declaring a path or putting in a symbolic redirect. Only you and Tpt, of those with access, are still around. — billinghurst sDrewth 08:45, 4 November 2020 (UTC)

Oh, you are hardly around. :-( — billinghurst sDrewth 08:51, 4 November 2020 (UTC)
@Billinghurst: No no I'm here! :) Just not editing much at the moment. Getting stuck into wsexport stuff right now. I'll have a look at ia-upload and see if it's anything obvious. Sam Wilson 00:47, 5 November 2020 (UTC)
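A "Command not found" error of the kind quoted above usually means the binary is not on the search path of the process running it. A minimal sketch of how a tool might check for that before running the command follows; the helper name `resolve_tool` and the idea of extra fallback directories are assumptions, not how ia-upload works.

```python
import shutil
from typing import Optional, Sequence


def resolve_tool(name: str, extra_dirs: Sequence[str] = ()) -> Optional[str]:
    """Return the full path to an executable, or None if it cannot be found.

    The regular PATH is searched first, then each directory in extra_dirs
    (hypothetical fallback locations supplied by the caller).
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None
```

Calling `resolve_tool("djvm")` in the job's environment would show whether the binary is visible there; a `None` result points to a path problem rather than a fault in the command itself.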

OCR improvement project..

Great to hear about this, and I had some suggestions for areas to look at.

Whilst OCRing some pre-19th-century works I was encountering long s (which was getting recognised as f or l) and old-style ligatures like ct (recognised as d). Google's OCR seems to be better able to recognise these at present, which is a shame, as Wikimedia native tools should be better able to cope with older items given that the scanned works being transcribed may well be old too.

The other suggestion was concerning 'multi column' based text, and sidetitles/margin notes. On something like The Statutes at Large example page: Page:Pickering - The Statutes at Large - Vol 40, Part 1 (1795, 35 George III).pdf/42. There are margin notes and sidetitles which are read as part of the main run of text.

Ideally they should be more effectively partitioned, so that a transcriber doesn't have to descramble them when setting up the appropriate formatting. (Sidenotes/Sidetitle support is still an unsettled issue on English Wikisource, however.) ShakespeareFan00 (talk) 18:45, 4 December 2020 (UTC)

Yes, I've been wondering about this sort of thing. Someone mentioned the possibility of having a sort of secondary system where you could select a rectangle in the image and hit an OCR button to get just that bit; that'd work as a last resort. I'm not sure what we can do for sidenotes, but it'll be investigated for sure. The ligatures and long s I guess we just have to live with, because this project (I think) isn't going to tackle OCR training data just yet. Still, easier access to different engines will be better than nothing, and the next step will be further tweaking of the in-house stuff I hope. :-) —Sam Wilson 11:05, 5 December 2020 (UTC)

Template:Person <= Module:Author

Would you please consider updating Template:Person to leverage Module:Author for image and years of life from WD? Thanks. — billinghurst sDrewth 23:33, 5 December 2020 (UTC)

@Billinghurst: I'll see what I can do. :) —Sam Wilson 06:34, 7 December 2020 (UTC)
@Billinghurst: Okay, it's working now. Slightly hackish, but functional. See what you think. I've changed a few people portals as a test. —Sam Wilson 06:49, 7 December 2020 (UTC)
er... #Template:Person? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 20:01, 7 December 2020 (UTC)
@Pigsonthewing: Sorry Andy! I totally missed your message before. Also, I realise I've missed the image part of it; will sort that out. :) —Sam Wilson 22:37, 7 December 2020 (UTC)

  Comment I have inserted image logic from {{author}} (well most of it) with adaptation for categories. Need to think about whether that image pull may be better in parent {{portal header}}. There is resulting tidying and documentation to be done, though I want to do some wider investigation prior to calling it a success. Probably also want to consider whether the image recovery logic may be better off in its own module, and/or better pulled from Module:WikidataIB. — billinghurst sDrewth 22:50, 11 December 2020 (UTC)

@Billinghurst: Looks good! —Sam Wilson 09:15, 15 December 2020 (UTC)

indents? is there way?

https://en.wikisource.org/w/index.php?title=Page:Wongan_Way_by_Lilian_Wooster_Greaves,_1927.pdf/12&action=edit Jarrah Tree (talk) 04:21, 15 January 2021 (UTC)

@JarrahTree: Do you mean the paragraph indents? There's no need; we leave them flush left. See Wikisource:Style guide#Formatting. Indents in the poetry are another matter, and should be faithfully reproduced with a leading colon :. —Sam Wilson 04:25, 15 January 2021 (UTC)
thanks for that... Jarrah Tree (talk) 04:40, 15 January 2021 (UTC)

Thanks

Thank you very much for your work on enhancing the WS-Export facilities in the Wikisource-software. I don't understand a word of all the technical things involved in it, but I found out that it's readily working on Dutch wikisource. I made a small announcement in the local "pub". I mentioned your name (to thank you). Greetings, --Dick Bos (talk) 16:47, 15 January 2021 (UTC)

@Dick Bos: Oh thank you! That's very kind of you. :-) I hope we can carry on improving WS Export in the coming weeks! Do let us know of anything that could be improved. —Sam Wilson 10:38, 16 January 2021 (UTC)

phab:T230415 (OCR text layer paragraphs)

Hi! Sorry to poke, but could we get the revised patch for phab:T230415 reviewed at some point? I managed to actually set up an MW+PP environment myself to check and it appears to work. (Now if only PDF could do that in light of the million+ files from the IA, but I don't think the data is there). I don't really know who else to ping, but this is a daily pain point for WS and has been for a decade, so it'd be nice to get it fixed. Inductiveloadtalk/contribs 18:53, 25 January 2021 (UTC)

@Inductiveload: Sorry, I had a quick glance at this but haven't had time to delve any deeper. First thing I wondered is that it looks like it's the sort of thing that should have some tests to go with it. Not sure that's a blocker though. — Sam Wilson 23:25, 28 January 2021 (UTC)
Good point: I have figured out the tests and updated the tested DjVu's text-layer with a paragraph and a column. Inductiveloadtalk/contribs 10:55, 16 February 2021 (UTC)

translatewiki

I don't know any other way to contact you, and you have stated that you are more active here, so I'd just like to alert you that Phabricator is misspelled on https://translatewiki.net/wiki/Wikimedia:Wsexport-issues/en --Sabelöga (talk) 16:28, 11 February 2021 (UTC)

@Sabelöga: Oh, thanks! I've made a fix: https://github.com/wikimedia/ws-export/pull/329 Sam Wilson 03:09, 12 February 2021 (UTC)
Excellent! :) --Sabelöga (talk) 14:45, 12 February 2021 (UTC)

hmm

gap/brk seems an inneresting development... Jarrah Tree (talk) 11:57, 13 February 2021 (UTC)

have found items in newspapers - adding links to wp en article Jarrah Tree (talk) 12:28, 13 February 2021 (UTC)

{{engine}}

In our print copies, I would think that we would not want to export this template; it doesn't work in offline mode, so do you want to add the #noprint stuff? I am also wondering about how we handle it in mobile productions, as it is going to engulf the top of the work, and whether we exclude it or should do something else with it in that mode. — billinghurst sDrewth 21:11, 23 February 2021 (UTC)

@Billinghurst: I've added .ws-noexport to it, to avoid exporting it. I'm not sure about mobile; it needs some more thought I think. It seems there are issues with lots of inputboxes on mobile, that could perhaps be dealt with together. —Sam Wilson 10:47, 24 February 2021 (UTC)

IA-upload kills me with its additional pages from the jp2 zip/tar

Internet Archive identifier : b29008700 and these 00 pages.

Are we able to do anything to get the 0000 pages ignored so that we are not getting these pages included and then having the text offset? Working with PDFs is f'd as the text for me is always needing manipulating for whatever weird artefacts the text layers have. — billinghurst sDrewth 05:25, 6 April 2021 (UTC)

FYI, this is (probably) the same as phab:T268246. Inductiveloadtalk/contribs 06:33, 6 April 2021 (UTC)
Yep, and #Index:East Anglia in the twentieth century.djvu extra image page, now text off by one. Not new, just killing me. — billinghurst sDrewth 07:04, 6 April 2021 (UTC)
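The request above, to ignore the 0000 pages so that the text does not end up offset, could be sketched as a simple filename filter. This assumes the images in the archive are named with a trailing page number such as `b29008700_0000.jp2`; that naming pattern and the function name are assumptions, and this is not ia-upload's code nor the fix tracked in the Phabricator task.

```python
import re
from typing import Iterable, List

# Assumed naming pattern: <identifier>_<digits>.jp2
_PAGE_NUMBER = re.compile(r"_(\d+)\.jp2$", re.IGNORECASE)


def drop_zero_pages(names: Iterable[str]) -> List[str]:
    """Return the file names with any all-zero page number removed."""
    kept = []
    for name in names:
        match = _PAGE_NUMBER.search(name)
        if match and int(match.group(1)) == 0:
            continue  # skip the leading non-book scan
        kept.append(name)
    return kept
```

Whether the zero-numbered scan is always the one to drop would need checking against the item's metadata, since the earlier thread notes that the book viewer's own logic for excluding pages is unclear.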

WSexport

Hi (and @Inductiveload). Dictionary of Indian Biography has been set up with the lists of names on the subpages, and typically these subpages don't print out. Wrapping a <div class="ws-summary">...</div> around the list (Dictionary of Indian Biography/A) manages to get an export, compared to (Dictionary of Indian Biography/B). So I am looking at what may be the more reliable solution.

Putting all the names on the front page will be ugly. The base user isn't going to know to add the wrapper. We could build a faux list template for lists of names to wrap around (and it has to wrap), or we can just enforce the use of auxToC (quick special:diff/11237197); I am not certain about a big green monster, though I can cope.

Any other ideas? — billinghurst sDrewth 10:39, 28 April 2021 (UTC)

Don't fuss it, I have just done it to the pages. — billinghurst sDrewth 13:39, 3 May 2021 (UTC)
@Billinghurst: Yeah, sorry for not replying! I have actually been thinking about it. I do still want to look into adding the option to traverse all linked subpages, at whatever level, but for now it does seem the best solution is ws-summary everywhere (T253282). —Sam Wilson 22:55, 3 May 2021 (UTC)
Not an issue, I just had plenty of browser tabs still open as my memo for action. I was pushing sanity levels so moved to action as there was obviously no overt easy solution. — billinghurst sDrewth 03:18, 4 May 2021 (UTC)