User:Alex brolloBot/WYSIWYG djvu project

This page lists ideas, tests and Python scripts covering the following related topics:

  1. how to extract most of the data (text and formatting) from the mapped djvu text layer, as shared by Internet Archive;
  2. how to build a WYSIWYG editor for djvu text layer.

Software used

  1. Python;
  2. DjVuLibre (the djvused.exe file);
  3. Seamonkey Composer.


Summary

Here are the ideas for developing the two branches of the project (feel free to use them as you like, obviously!).

Subproject 1

There is a lot of data in the coordinates of the various levels of text segments, in particular in their x values (horizontal: spacing between words; position of the segment in the page; pattern of position of consecutive segments, such as paragraphs and lines). If "empty areas" are considered too, there are good prospects for automating the detection of centered text, right-aligned blocks, poems, page headers (to be placed in noinclude) and image positions. This subproject is presently sleeping.

24.01.11

The first tests of using coordinates as traces of formatting are running. So far, a script extracts the coordinates, and data derived from them, into a "list of dictionaries", one dictionary for each dsed code row, so that x1, x2, y1, y2 (the coordinates of the rectangle), length, height, left margin and right margin of the element are available. Now I'm thinking about how to use these data on the "line" subset; the script will classify lines as "normal", "indented", "short", "centered", "large font" or "special". For example: a short line followed by an indented line = new paragraph.
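The "list of dictionaries" described above could look roughly like the sketch below. It is not the original script: the function name `parse_dsed`, the dictionary keys and the sample data are hypothetical, margins are measured against the page rectangle, and the dsed text is assumed to hold one element per row with non-negative integer coordinates.

```python
import re

# One dsed row looks like: (word 689 4215 827 4287 "any")
ROW = re.compile(
    r'\((page|column|region|para|line|word|char)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)'
)

def parse_dsed(text):
    """Return one dictionary per dsed row, with coordinates and derived data."""
    rows = []
    page = None
    for raw in text.splitlines():
        m = ROW.search(raw)
        if m is None:
            continue  # not an element row
        kind = m.group(1)
        x1, y1, x2, y2 = (int(v) for v in m.groups()[1:])
        row = {
            "type": kind,
            "x1": x1, "y1": y1, "x2": x2, "y2": y2,
            "length": x2 - x1,
            "height": y2 - y1,
        }
        if kind == "page":
            page = row
        if page is not None:
            # Margins are taken relative to the enclosing page rectangle.
            row["left_margin"] = x1 - page["x1"]
            row["right_margin"] = page["x2"] - x2
        rows.append(row)
    return rows

# Invented sample data, only for illustration.
SAMPLE = '''(page 0 0 2000 3000
 (line 300 2500 1700 2560
  (word 300 2500 500 2560 "There")
  (word 520 2500 700 2560 "is"))
 (line 200 2400 900 2460
  (word 200 2400 400 2460 "text")))'''

lines = [r for r in parse_dsed(SAMPLE) if r["type"] == "line"]
```

Filtering on `"type"` gives the "line" subset mentioned above; the same list also holds the word rows.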

Consider that the IA djvu text layer has the "para" level, but the Any2djvu text layer doesn't, so an automatic suggestion of "new paragraph" is meaningful.
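A minimal sketch of the line classification and of the "short line followed by an indented line" rule could be the following. The function names, the `tolerance` value and the sample coordinates are assumptions made for illustration; "large font" and "special" are left out because they would need the line height and other data.

```python
def classify_lines(lines, tolerance=20):
    """Label each line by comparing its x1/x2 with the text block edges."""
    left = min(line["x1"] for line in lines)
    right = max(line["x2"] for line in lines)
    labels = []
    for line in lines:
        gap_left = line["x1"] - left
        gap_right = right - line["x2"]
        if (gap_left > tolerance and gap_right > tolerance
                and abs(gap_left - gap_right) <= tolerance):
            labels.append("centered")
        elif gap_left > tolerance:
            labels.append("indented")
        elif gap_right > tolerance:
            labels.append("short")
        else:
            labels.append("normal")
    return labels

def paragraph_starts(labels):
    """Indexes of lines that follow a short line and are themselves indented."""
    return [i for i in range(1, len(labels))
            if labels[i - 1] == "short" and labels[i] == "indented"]

# Invented sample: full line, short line, indented line, centered line.
sample_lines = [
    {"x1": 200, "x2": 1800},
    {"x1": 200, "x2": 900},
    {"x1": 300, "x2": 1800},
    {"x1": 800, "x2": 1200},
]
labels = classify_lines(sample_lines)
```

The text block edges are taken from the lines themselves, so a page made only of indented lines would be misread; a real script would probably need the page rectangle as well.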

Subproject 2

When single words are extracted together with their code row, each of them is matched with a "fingerprint": its coordinates. There are tricks to build html code where such a fingerprint is saved but hidden inside an html tag, e.g.:

    1. (word 689 4215 827 4287 "any")
      1. word: any, fingerprint: 689 4215 827 4287
      2. html code: <span title="word 689 4215 827 4287">any</span>
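The conversion shown in this example can be sketched as below. The function name `word_to_span` and the regular expression are hypothetical, not taken from djvu.py; dsed escape sequences inside the quoted word are not decoded here.

```python
import html
import re

# Matches a dsed word row such as: (word 689 4215 827 4287 "any")
WORD = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\)'
)

def word_to_span(row):
    """Turn one dsed word row into a span whose title holds the fingerprint."""
    m = WORD.search(row)
    if m is None:
        return None  # not a word row
    fingerprint = "word " + " ".join(m.group(1, 2, 3, 4))
    return '<span title="%s">%s</span>' % (fingerprint, html.escape(m.group(5)))

span = word_to_span('(word 689 4215 827 4287 "any")')
```

`html.escape` keeps characters such as `&` or `<` inside a word from breaking the page.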

The Python script, while building the html code, should save the fingerprint/old word pairs somewhere (a dictionary, a file...). An html page can be built as a sequence of these "fingerprinted word" codes. In Composer's Normal edit mode, an editor will only change the words, leaving their fingerprints untouched. Once the html code is saved, it's very simple to compare the current and old words one by one, to pick out the cases which don't match, and to edit the source dsed file, using the fingerprint to find the words to replace.
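Reading the fingerprint/word pairs back from the saved html page could be done with the standard `html.parser` module, as in this sketch. The class and function names are hypothetical, and it assumes that Composer leaves the `title` attribute of each span untouched and does not nest spans inside each other.

```python
from html.parser import HTMLParser

class FingerprintParser(HTMLParser):
    """Collect {fingerprint: word} from spans whose title starts with 'word '."""

    def __init__(self):
        super().__init__()
        self.words = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        title = dict(attrs).get("title") or ""
        if tag == "span" and title.startswith("word "):
            self._current = title
            self.words[title] = ""

    def handle_data(self, data):
        # Text outside fingerprinted spans (e.g. spaces) is ignored.
        if self._current is not None:
            self.words[self._current] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

def extract_words(page):
    parser = FingerprintParser()
    parser.feed(page)
    parser.close()
    return parser.words

# Invented sample page with two fingerprinted words.
page = ('<p><span title="word 689 4215 827 4287">many</span> '
        '<span title="word 850 4215 990 4287">w&amp;rds</span></p>')
words = extract_words(page)
```

The same function can be run on the page before and after editing, which gives two dictionaries with identical keys.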

So the steps of the algorithm are:

  1. to extract a dsed file from the djvu file, using djvused;
  2. to parse it, obtaining the list of fingerprint-word pairs, and to save it;
  3. to build the "fingerprinted html code";
  4. to edit it manually with Composer, in Normal edit mode, and to save the html file;
  5. to extract a new list of fingerprint-word pairs from the html code;
  6. to compare it with the previous list, and to get a list of modified words;
  7. to edit the dsed file, using this list of modified words;
  8. to update the djvu text layer.

All steps, except the human editing of the html page, should be automated.
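The comparison and the dsed update (steps 6 and 7) could be sketched as follows. The function names are hypothetical, the fingerprint-word lists are assumed to be dictionaries keyed by fingerprint, and each dsed word row is assumed to be written with single spaces and plain ASCII words.

```python
def modified_words(old, new):
    """Return {fingerprint: new word} for words whose text has changed."""
    return {fp: new[fp] for fp in old if fp in new and new[fp] != old[fp]}

def escape_dsed(word):
    """Escape backslashes and double quotes for a dsed quoted string."""
    return word.replace("\\", "\\\\").replace('"', '\\"')

def apply_changes(dsed_text, old, changes):
    """Rewrite the dsed rows whose fingerprint appears in changes."""
    for fingerprint, new_word in changes.items():
        old_row = '(%s "%s")' % (fingerprint, escape_dsed(old[fingerprint]))
        new_row = '(%s "%s")' % (fingerprint, escape_dsed(new_word))
        dsed_text = dsed_text.replace(old_row, new_row)
    return dsed_text

# Invented sample data.
dsed = ('(line 689 4215 990 4287\n'
        ' (word 689 4215 827 4287 "any")\n'
        ' (word 850 4215 990 4287 "wrods"))')
old = {"word 689 4215 827 4287": "any", "word 850 4215 990 4287": "wrods"}
new = {"word 689 4215 827 4287": "any", "word 850 4215 990 4287": "words"}

changes = modified_words(old, new)
fixed = apply_changes(dsed, old, changes)
```

Because the whole row, coordinates included, is matched, a word that occurs several times on the page is only replaced at the edited position.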

running algorithm

There's now a file djvu.py with some running functions. It's too rough to be published, but:

  1. it converts a bundled djvu into an indirect one, in a subfolder pag;
  2. it creates a list of the secondary djvu files in pag;
  3. it asks for the number of the page to edit;
    1. the user types the number of the page (or f to finish, or s to go to the next page)
  4. it extracts the text with the option-text parameter into a testo.dsed file;
  5. it converts the dsed into pagina.html as described above;
  6. it waits for input...
    1. the user edits pagina.html
    2. the user presses Enter at the Python raw_input prompt
  7. the words are parsed from the html
  8. they are used to edit testo.dsed
  9. testo.dsed is embedded into the current djvu page
  10. back to 3.
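The calls to the DjVuLibre tools behind steps 1, 4 and 9 might look like the sketch below. This is a guess, not the content of djvu.py: the helper names, the `index.djvu` file name and the choice of the djvused commands `print-txt` and `set-txt` are assumptions, and the code is written for Python 3 (the original used raw_input). The helpers only build argument lists, so they can be checked without DjVuLibre installed.

```python
import subprocess

def to_indirect_cmd(bundled, folder="pag", index="index.djvu"):
    # djvmcvt -i converts a bundled document into an indirect one.
    return ["djvmcvt", "-i", bundled, folder, index]

def print_text_cmd(page_file):
    # print-txt writes the text layer of the page file to standard output.
    return ["djvused", page_file, "-e", "print-txt"]

def set_text_cmd(page_file, dsed_file="testo.dsed"):
    # set-txt reads the text layer from dsed_file; -s saves the djvu file.
    return ["djvused", page_file, "-e", "set-txt " + dsed_file, "-s"]

def extract_text(page_file, dsed_file="testo.dsed"):
    """Run djvused and store its output in dsed_file (needs DjVuLibre)."""
    result = subprocess.run(print_text_cmd(page_file),
                            capture_output=True, check=True)
    with open(dsed_file, "wb") as out:
        out.write(result.stdout)

def embed_text(page_file, dsed_file="testo.dsed"):
    """Write the edited dsed back into the page file (needs DjVuLibre)."""
    subprocess.run(set_text_cmd(page_file, dsed_file), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names, although a dsed file name containing spaces would still need quoting inside the `set-txt` command.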

It would be great if this could be embedded into a GUI. But... I can't. I simply have no idea how to begin. :-(

21.01.11 The routines run. Development has been paused to concentrate on subproject 1.

21.01.11 To do: apply the same procedure to djvu.xml data (perhaps simpler to parse than dsed files), and test djvuparserxml.

This is an interesting screenshot of a djvu file, self-produced from a pdf file, as seen with djView.

 
Gray: image layer; black: text layer

The concept of "layers" is clearly visible; but, more interestingly, note the differences between the image layer (gray) and the text layer (black): the text layer has been edited, and wiki code (for bold characters, for a template) has been added. Now, this is the text obtained with djView by simply selecting the text layer and copying it here:

'''running algorithm'''
{{larger|There's}} now a file djvu.py

Outside pre tags, the same text is this: running algorithm There's now a file djvu.py

So this proves that not only the reviewed text, but wiki markup too, can be loaded into the djvu text layer.

Unexpected result

Simply by adding a few lines of code that call ddjvu.exe, I found that it's really simple to extract the segment of the tiff image containing exactly one given word of the page. So now I fully understand how reCAPTCHA works (and, potentially, I could build it from scratch).
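A possible way to get such a word image is to turn the fingerprint into a geometry string for the `-segment` option of ddjvu, as sketched here. The function names are hypothetical. Two points are assumptions to check against the ddjvu documentation: that the page is rendered at a scale where pixels match the text layer coordinates, and that the segment origin matches the bottom-left origin of the dsed coordinates.

```python
def word_geometry(x1, y1, x2, y2):
    """Build a WxH+X+Y geometry string from the word rectangle."""
    return "%dx%d+%d+%d" % (x2 - x1, y2 - y1, x1, y1)

def crop_word_cmd(djvu_file, page, fingerprint, out_file):
    """Build a ddjvu command line that renders only the word rectangle."""
    # The fingerprint has the form: word x1 y1 x2 y2
    _, x1, y1, x2, y2 = fingerprint.split()
    geometry = word_geometry(int(x1), int(y1), int(x2), int(y2))
    return ["ddjvu", "-format=tiff", "-page=%d" % page,
            "-segment=" + geometry, djvu_file, out_file]

cmd = crop_word_cmd("book.djvu", 1, "word 689 4215 827 4287", "any.tif")
```

The resulting list can be passed to `subprocess.run` once the two assumptions above have been verified on a real page.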