Improving the proofreading rate

An increasing number of indexed works are not proofread and are effectively "dead": there were 2,200 last year and there are now 3,300. In the last year, nearly 1,400 new index files were added but only 106 were completely checked.

Using different techniques including match and split, the basic time for preparing a work can come down to a few tens of seconds per page, but proofreading a page still needs 2-5 minutes or more, depending on the size and difficulty of the page.
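To give a feel for what these per-page times mean for a whole work, here is a rough estimate in Python. The 2-5 minute range comes from the paragraph above; the 30 seconds per page for preparation is only an assumed stand-in for "a few tens of seconds", and the function names are illustrative.

```python
def proofreading_hours(pages, minutes_per_page=(2, 5)):
    """Return (low, high) hours needed to proofread `pages` pages."""
    low, high = minutes_per_page
    return pages * low / 60, pages * high / 60


def preparation_hours(pages, seconds_per_page=30):
    """Return hours needed to prepare `pages` pages (30 s/page is an assumption)."""
    return pages * seconds_per_page / 3600


low, high = proofreading_hours(500)
prep = preparation_hours(500)
print(f"500 pages: about {prep:.1f} h to prepare, {low:.1f} to {high:.1f} h to proofread")
```

On these figures a 500-page work takes several times longer to proofread than to prepare, which is why the backlog grows so quickly.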

Changing this situation needs a lot more proofreaders. At the moment the goal is not much more than one book a month. That needs to become hundreds of books if the index file system is not to become a disappointment.

Yet these index pages are effectively invisible to the newcomer. The only list presented is of those books that are "done". The category list is worse than useless: it's a turnoff.

So we need to set up a proofreading project.

A look at the current state of play shows over 1,700 single volume index files still to be completed, of which more than 1,300 have not yet been proofread:

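| State | Single volume: Total/New | Single volume: Done | Multi-volume: Total/New | Multi-volume: Done |
|---|---|---|---|---|
| Up to 2010[1] | 898 | 329 | 707 | 76 |
| 2010-11[2] | 512 | 42 | 519 | 11 |
| 2011-12[3] | 357 | 95 | 939 | 7 |
| Total left to do | 1,767 | | 2,165 | |
| Need proofreading | 1,318 | | 2,003 | |
| Need validation | 363 | | 48 | |
| Have other problems | 86 | | 113 | |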


These stats are broad-brush and mostly derived from Hesperion's lists. Multi-volume works, meaning those with 3 or more similar titles that differ mainly in numbering (e.g. 700 copies of the Scientific American), have been separated out. Since they now dominate in numbers, they need to be treated separately so that they do not overwhelm the other categories: there are only around 100 different titles for 2,000 index files. Their proofreading record is also noticeably worse than for single volume works. I've also excluded around 300 non-DJVU/PDF files from this analysis.
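How the separation was actually carried out is not recorded here. One possible heuristic, sketched below in Python, is to replace every run of digits in an index file name with a placeholder and treat any group of 3 or more matching names as a multi-volume work. The function names and the sample file names other than the Popular Science Monthly pattern are made up for illustration, and the sketch ignores Roman numerals and other numbering styles.

```python
import re
from collections import defaultdict


def title_key(name):
    """Reduce an index file name to a key that ignores numbering."""
    key = re.sub(r"\.(djvu|pdf)$", "", name, flags=re.IGNORECASE)
    key = re.sub(r"\d+", "#", key)  # "Volume 1" and "Volume 02" both become "Volume #"
    return key.strip().lower()


def split_multi_volume(names, threshold=3):
    """Return (multi, single): grouped multi-volume names and the remaining names."""
    groups = defaultdict(list)
    for name in names:
        groups[title_key(name)].append(name)
    multi = {key: files for key, files in groups.items() if len(files) >= threshold}
    single = [n for files in groups.values() if len(files) < threshold for n in files]
    return multi, single


# Illustrative names only; volumes 2 and 3 and the "Example" files are hypothetical.
names = [
    "Index:Popular Science Monthly Volume 1.djvu",
    "Index:Popular Science Monthly Volume 2.djvu",
    "Index:Popular Science Monthly Volume 3.djvu",
    "Index:Example Novel.djvu",
    "Index:Example Poems.djvu",
]
multi, single = split_multi_volume(names)
for key, files in multi.items():
    print(key, len(files))
print(single)
```

A heuristic like this would still need a manual check, since unrelated works can share a title pattern and some series number their volumes in words.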

To attract new proofreaders, we need clear lists of works that need proofreading and validation, kept up-to-date by at least semi-automatic methods. We need to cater for a wide variety of tastes: different people will be attracted to poetry, history, novels, science, law reports, government documents etc. Categorising the index files would help with this.

The main method of rewarding effort has been through marks placed on users' pages concerning "proofread of the month". Would that scale up by a factor of 10 or 100? Doubtful. Another approach would be to have a "progress page" where recent advances are recorded. This has to be able to deal with both short and long works, whether 5 or 500 pages. Milestones are important: confirmed transitions from "To be proofread" to "To be validated" to "Done".
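The milestones can be made precise with a small sketch. It assumes each page carries one of the usual page statuses ("without text", "not proofread", "problematic", "proofread", "validated") and that an index file moves on only when every page with text has reached the next status. That rule and the function names are illustrative assumptions, not settled policy.

```python
def index_stage(page_statuses):
    """Classify an index file from the statuses of its pages."""
    # Assumption: pages marked "without text" do not hold a work back.
    pages = [s for s in page_statuses if s != "without text"]
    if pages and all(s == "validated" for s in pages):
        return "Done"
    if pages and all(s in ("proofread", "validated") for s in pages):
        return "To be validated"
    return "To be proofread"


def milestone(before, after):
    """Describe a stage change between two snapshots, or return None if unchanged."""
    old, new = index_stage(before), index_stage(after)
    return None if old == new else f"{old} -> {new}"


# A hypothetical 5-page work with one page left to proofread.
short_work = ["proofread", "proofread", "not proofread", "without text", "proofread"]
finished = ["proofread" if s == "not proofread" else s for s in short_work]
print(milestone(short_work, finished))  # To be proofread -> To be validated
```

The same check applies unchanged to a 5-page pamphlet or a 500-page volume, since it looks only at whether every page has reached the required status.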

The top 10 in the multi-volume category, with the number of index files and total pages to be proofed for each, are:

| Index file name | Files | Av pages | Total pages |
|---|---|---|---|
| Index:United States Statutes at Large Volume 1.djvu | 199 | 1296 | 257,869 |
| Index:Notes and Queries - Series 1 - General Index.djvu | 152 | 570 | 86,625 |
| Index:Popular Science Monthly Volume 1.djvu | 92 | 806 | 74,159 |
| Index:All the Year Round - Series 1 - Volume 1.djvu | 59 | 644 | 37,969 |
| Index:Dictionary of National Biography volume 01.djvu | 63 | 464 | 29,228 |
| Index:Sacred Books of the East - Volume 1.djvu | 51 | 495 | 25,233 |
| Index:Title 3 CFR 1936-1938 Compilation.djvu | 38 | 581 | 22,065 |
| Index:Confederate Veteran volume 01.djvu | 31 | 597 | 18,506 |
| Index:Federal Cases, Volume 15.djvu | 14 | 1286 | 18,008 |
| Index:Philosophical Transactions - Volume 001.djvu | 39 | 433 | 16,870 |

If these are ever to be completed, a very different strategy is needed. Here is a summary of the size of the task for multi-volume files.

  [1] To 23 December
  [2] 23 December to 25 August
  [3] 25 August to 31 July

Proofreading single volume works

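| Pages (up to) | No. of files | Total pages | Not proofread |
|---|---|---|---|
| 50 | 197 | 4,087 | 3,177 |
| 100 | 94 | 6,992 | 4,918 |
| 150 | 67 | 8,370 | 6,301 |
| 200 | 83 | 14,498 | 11,266 |
| 250 | 85 | 19,123 | 14,726 |
| 300 | 108 | 29,911 | 22,325 |
| 350 | 110 | 35,742 | 27,007 |
| 400 | 102 | 38,331 | 32,174 |
| 450 | 107 | 45,304 | 38,420 |
| 500 | 84 | 39,795 | 34,357 |
| 550 | 78 | 40,928 | 35,736 |
| 600 | 53 | 30,439 | 25,614 |
| 650 | 44 | 27,589 | 23,357 |
| 700 | 29 | 19,725 | 17,656 |
| 750 | 20 | 14,526 | 12,472 |
| 800 | 16 | 12,448 | 11,563 |
| 850 | 8 | 6,564 | 6,022 |
| 900 | 12 | 10,388 | 8,891 |
| 950 | 3 | 2,784 | 2,753 |
| 1000 | 5 | 4,893 | 4,760 |
| 1050 | 6 | 6,139 | 5,820 |
| 1100 | 5 | 5,350 | 5,272 |
| 1150 | 6 | 6,703 | 5,699 |
| 1200 | 3 | 3,580 | 3,299 |
| 1250 | 3 | 3,668 | 3,475 |
| 1300 | 1 | 1,267 | 1,197 |
| 1350 | 1 | 1,337 | 1,337 |
| 1400 | 1 | 1,388 | 1,353 |
| 1600 | 1 | 1,572 | 1,154 |
| 2050 | 2 | 4,056 | 3,734 |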

There is a huge range of document sizes, as shown by the table above, which groups the files into 50-page bands:

Remembering that the amount of work is basically the number of pages, one can see that much of the work is in documents between 250 and 700 pages and these are very challenging tasks. We propose three sizes of task based on the number of pages to be proofed.

| Range (pages) | Files | Total pages | Not proofread | Percent done |
|---|---|---|---|---|
| Up to 50 | 197 | 4,087 | 3,177 | 22.3 |
| 51-500 | 840 | 238,066 | 191,494 | 19.6 |
| Over 500 | 297 | 205,344 | 181,164 | 11.8 |
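The three-way summary can be reproduced from the 50-page bands given earlier. The Python sketch below copies those figures and adds them up per range; "percent done" is taken to be the share of pages that are no longer in the not-proofread column, which matches the figures shown. The function and variable names are illustrative.

```python
# (upper bound of 50-page band, files, total pages, pages not proofread)
BANDS = [
    (50, 197, 4087, 3177), (100, 94, 6992, 4918), (150, 67, 8370, 6301),
    (200, 83, 14498, 11266), (250, 85, 19123, 14726), (300, 108, 29911, 22325),
    (350, 110, 35742, 27007), (400, 102, 38331, 32174), (450, 107, 45304, 38420),
    (500, 84, 39795, 34357), (550, 78, 40928, 35736), (600, 53, 30439, 25614),
    (650, 44, 27589, 23357), (700, 29, 19725, 17656), (750, 20, 14526, 12472),
    (800, 16, 12448, 11563), (850, 8, 6564, 6022), (900, 12, 10388, 8891),
    (950, 3, 2784, 2753), (1000, 5, 4893, 4760), (1050, 6, 6139, 5820),
    (1100, 5, 5350, 5272), (1150, 6, 6703, 5699), (1200, 3, 3580, 3299),
    (1250, 3, 3668, 3475), (1300, 1, 1267, 1197), (1350, 1, 1337, 1337),
    (1400, 1, 1388, 1353), (1600, 1, 1572, 1154), (2050, 2, 4056, 3734),
]


def summarise(bands, low, high):
    """Add up the bands whose upper bound lies in (low, high]."""
    files = total = todo = 0
    for upper, n_files, pages, not_proofed in bands:
        if low < upper <= high:
            files += n_files
            total += pages
            todo += not_proofed
    percent_done = round(100 * (total - todo) / total, 1)
    return files, total, todo, percent_done


RANGES = [("Up to 50", 0, 50), ("51-500", 50, 500), ("Over 500", 500, float("inf"))]
for label, low, high in RANGES:
    print(label, *summarise(BANDS, low, high))
```

Changing the boundaries in RANGES is an easy way to try other task sizes, for example splitting the 51-500 range in two.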

The first category currently includes quite a few works which need just a few pages tidied up to move on into the validation class. Here is an (almost) current listing. One thing that is obvious from these is that many simply require pictures to be extracted from the djvu files. But these shorter works may also appeal to people who like to see a task completed.

The intermediate length files are much more of a challenge. It's a much longer list and needs breaking up into smaller groups to be presentable. This is where categorising would be of most use.

The longer works need a much more sustained approach, or else a cooperative one. They are more similar to the multi-volume works.

To make this more attractive it would help to be able to break down these long lists by categories: novels, histories, legislation, science etc. This would require the index files to be categorised, which is not present policy but would have significant benefits in the long run. Not only would these be categorised much earlier, but transclusion could be used to copy the categories into the mainspace, including the individual chapters if this is thought suitable.

Validation

The list of works to be validated is currently much shorter than the list needing proofreading, for the not very good reason that few books have yet reached that stage. There aren't yet clear guidelines for validation. For example, if someone finds a problem which they can't fix and marks a page as problematic, many editors assume that the work needs to be moved back to the "To be proofread" stage. But is this necessarily sensible? "To be validated" does not imply that proofreading removed all problems, or there would be no need for a second stage. Maybe we need a second "blue" stage.

It is tempting to assume that validation can be given to newcomers as "there won't be so many problems". But in some ways it needs more experience to ensure that it keeps to desirable Wikisource guidelines. Much clearer guidelines are needed.

On the other hand, validation should be a more pleasant reading experience and we need to consider how we can encourage people who come to Wikisource purely to read a work to stay and contribute to the validation process. It could be an excellent way of recruiting proofreaders.

Reporting progress

The toolserver currently produces many useful statistics that not many people look at. But to encourage proofreading we do need some more specialised reports, such as:

  1. The number of pages advanced a stage (to proofed, validated)
  2. The number of index files advanced from "To be proofread" to "To be validated" and "Done".
  3. More focused reports on user activity.

The reports linked earlier have been produced by a mix of toolserver tools and manual collation but there's no reason why they can't all be produced on a regular basis automatically. We may need others.
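As a rough illustration, the first two reports in the list could be computed by comparing two snapshots of page statuses. The sketch below assumes a snapshot is a dictionary mapping an index file name to a list of per-page statuses. The snapshot format, the function names and the sample data are all hypothetical, and fetching the snapshots (from the toolserver or elsewhere) is left out.

```python
from collections import Counter

# Assumed ordering of page statuses; "problematic" is treated like "not proofread".
RANK = {"not proofread": 0, "problematic": 0, "proofread": 1, "validated": 2}
STAGES = ("To be proofread", "To be validated", "Done")


def stage(statuses):
    """Stage of an index file: set by its least advanced page that has text."""
    ranks = [RANK[s] for s in statuses if s in RANK]  # skips "without text"
    return STAGES[min(ranks, default=0)]


def progress_report(old, new):
    """Count pages that advanced and index files that changed stage."""
    pages, files = Counter(), Counter()
    for name, after in new.items():
        # A file missing from the old snapshot is treated as entirely not proofread.
        before = old.get(name, ["not proofread"] * len(after))
        for was, now in zip(before, after):
            if RANK.get(now, 0) > RANK.get(was, 0):
                pages[now] += 1
        if stage(before) != stage(after):  # records any change, including regressions
            files[f"{stage(before)} -> {stage(after)}"] += 1
    return pages, files


old = {"Index:Example A.djvu": ["proofread", "not proofread"],
       "Index:Example B.djvu": ["proofread", "proofread"]}
new = {"Index:Example A.djvu": ["proofread", "proofread"],
       "Index:Example B.djvu": ["validated", "validated"]}
pages, files = progress_report(old, new)
print(dict(pages))
print(dict(files))
```

Run on a regular schedule, a comparison of this kind would supply the figures for a progress page without manual collation.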

Multi-volume Works

There is already an established mechanism in WikiProject for handling big projects and one can identify projects for more than half a dozen of the bigger works in the list. It's noticeable that many of the other bigger works have had little done on them.

It would seem sensible to encourage the formation of new wikiprojects to coordinate the proofreading of these other big projects and to send interested people to the project. In reporting progress it would help to concentrate on the current most active volume(s).