Reading a scanned book

I have recently been working on extracting data on women composers from the various sources listed in a previous article. The first source on that list is a scanned copy of a French translation of a book – Les femmes compositeurs de musique – compiled in 1910 by Otto Ebel, which is available online. Although I’ve not had great success in the past in extracting usable data from scanned books, this appears to be a reasonably tidy scan, and Ebel looks like a useful source on women composers, so I thought I would give it a go.

The scan was made from a physical copy of the book in Wellesley College Library, scanned page by page. The image of each page was passed through Optical Character Recognition (OCR) software, which tries to work out where the text is, how it flows across the page (e.g. in columns), and then – letter by letter – what it says. The output is stored as a plain text file, which can then be searched by a computer, or used in conjunction with the page images to produce the virtual book that appears on the web page. OCR software is pretty good these days (although some older scans are less successful). As long as the typeface is big and clear enough, there are not too many dirty marks or odd characters (accents, changes of font, etc.), and the software correctly identifies the column structure of each page, the resulting text file can be quite accurate.

Here is the text that the OCR software produced from part of page 2:

"d'un grand nombre de cantates et de chœurs, de plu-"
"sieurs concertos et sonates pour piano."
"Ahlefeldt (Comtesse de). — Auteur allemand. Née au"
"milieu du xvm e siècle. Excellente pianiste. Parmi"
"ses compositions variées, un grand ballet « Télémaque"
"et Calypso » fut édité en 1794, à Leipzig."
"Aleotti (Victoria). — Née à Argenta, en Italie, vers 1560."
"Eut pour père Giovanni Batist Aleotti, célèbre archi-"


I wrote a computer program in R to extract the data from Ebel. I won’t go through the program itself, but here is the procedure…

Stage 1 was to download the text file. On the page there is a list of different formats available, including one entitled ‘Full text’ – this is the plain text as recognised by the OCR software, located at this URL:

If you download this file and look at it, there is some formatting at the start and end, but the text itself is readable and makes sense – it is a reasonably good scan.[1]

The next stage was to trim any whitespace from the ends of each line, and to find the start of the entry for the first composer by searching for the string "Aarup". I deleted everything before this, and everything after the last word of the final entry. I also removed all of the lines containing the page headings, by searching for the string "COMPOSIT", and any lines containing only one or two characters – these were mainly the headings of new letter sections (A, B, etc).
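The clean-up just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R program used for the article: the function name `clean_lines` and the `end_marker` parameter are hypothetical, and `end_marker` stands in for the last word of the final entry, which is not named here.

```python
def clean_lines(raw_text, first_entry="Aarup", end_marker=None,
                heading_marker="COMPOSIT"):
    """Trim, crop and filter OCR text along the lines described above.

    end_marker is a hypothetical parameter standing in for "the last word
    of the final entry", which the article does not name.
    """
    # Trim whitespace from both ends of every line.
    lines = [line.strip() for line in raw_text.splitlines()]

    # Delete everything before the first line mentioning the first composer.
    start = next(i for i, line in enumerate(lines) if first_entry in line)
    lines = lines[start:]

    # Optionally delete everything after the last line containing end_marker.
    if end_marker is not None:
        end = max(i for i, line in enumerate(lines) if end_marker in line)
        lines = lines[:end + 1]

    kept = []
    for line in lines:
        if heading_marker in line:
            continue  # page heading
        if 1 <= len(line) <= 2:
            continue  # section letter or stray mark; blank lines are kept
        kept.append(line)
    return kept
```

Blank lines are deliberately kept, because the next step relies on them to separate paragraphs. Removing a heading can leave two or three blank lines in a row, which is one reason the paragraph numbers are not strictly consecutive.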

At this stage the file consists of the material we want, split into lines, with a blank line between each paragraph. The next stage is to combine these blocks of lines to get one line per paragraph. To do this, set a paragraph number for each line by counting the number of blank lines. Then delete the blank lines, and join together all of the lines sharing a paragraph number (with a space between them). To catch hyphenated words that are split between lines, delete all occurrences of the string "- ".
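As a rough Python sketch of this step (again an illustration, not the original R code), each non-blank line is tagged with the number of blank lines seen so far, and lines sharing a tag are joined. The function name `join_paragraphs` is an assumption made for the example.

```python
from itertools import groupby


def join_paragraphs(lines):
    """Join consecutive non-blank lines into one string per paragraph."""
    numbered = []
    paragraph = 0
    for line in lines:
        if line == "":
            paragraph += 1          # a blank line starts a new paragraph
        else:
            numbered.append((paragraph, line))

    paragraphs = []
    for _, group in groupby(numbered, key=lambda pair: pair[0]):
        text = " ".join(line for _, line in group)
        # Re-join words hyphenated across a line break ("plu- sieurs").
        paragraphs.append(text.replace("- ", ""))
    return paragraphs


lines = [
    "d'un grand nombre de cantates et de chœurs, de plu-",
    "sieurs concertos et sonates pour piano.",
    "",
    "Ahlefeldt (Comtesse de). — Auteur allemand. Née au",
    "milieu du xvm e siècle.",
]
paragraphs = join_paragraphs(lines)
```

The long dash after each name is a different character from the hyphen, so it survives the replacement. The replacement would, however, also remove any genuine hyphen that happens to be followed by a space, which is a limitation of this simple approach.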

Now it gets a bit messy. Although we have one line per paragraph, we don’t yet have one line per composer, since a composer entry might consist of several paragraphs. The approach I took was to make a list of the paragraph numbers starting with each letter of the alphabet – so a set of paragraphs starting with A, a set starting with B, etc. By looking at this list, it was possible to identify where the alphabetical sections began, as each set consisted of a run of roughly consecutive paragraph numbers (the composers starting with that letter), with a few stray paragraph numbers before and after. Thus A started on line 1, B appeared roughly consecutively from line 44, C from line 227, etc. It was then straightforward to mark each block of paragraphs with the appropriate section letter. The composer entries could then be identified as those paragraphs that start with the same letter as the section in which they fall AND that contain a "(", a ")" or the long dash "—" used for delimiting the composer’s forenames.[2] The paragraphs belonging to each composer entry could then be joined together in the same way as the lines were earlier (joining them with " #" to indicate the original paragraph breaks).[3]
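The entry-detection rule can be sketched in Python as follows. This is a minimal illustration rather than the original R code: the function names are hypothetical, the section boundaries are assumed to have been found by inspection and supplied by hand, and the indices are 0-based here, unlike the 1-based line numbers quoted above.

```python
import bisect


def section_for(index, section_starts):
    """Return the section letter for a paragraph index.

    section_starts is a sorted list of (first_index, letter) pairs, assumed
    to have been worked out by inspection; the first pair must start at 0.
    """
    firsts = [first for first, _ in section_starts]
    position = bisect.bisect_right(firsts, index) - 1
    return section_starts[position][1]


def is_entry_start(paragraph, letter):
    """True if the paragraph looks like the start of a composer entry."""
    has_delimiter = any(mark in paragraph for mark in "()—")
    return paragraph.startswith(letter) and has_delimiter


def join_entries(paragraphs, section_starts):
    """Attach continuation paragraphs to the preceding entry with ' #'."""
    entries = []
    for index, paragraph in enumerate(paragraphs):
        letter = section_for(index, section_starts)
        if is_entry_start(paragraph, letter) or not entries:
            entries.append(paragraph)
        else:
            entries[-1] += " #" + paragraph
    return entries


paragraphs = [
    "Adelung (Olga). — Auteur allemand et zithérisle. A publié des "
    "compositions pour son instrument 1 .",
    "1. Zither. Harpe horizontale en usage dans l'Allemagne du Sud et "
    "l'Autriche.",
]
entries = join_entries(paragraphs, [(0, "A")])
```

In this example the footnote paragraph does not start with the section letter, so it is appended to the Adelung entry after a " #" marker.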

After this stage, the file consists of one line per composer, like this…

"Aarup (Caïa). — Auteur suédois contemporain; habite l'Amérique. A composé : Life, In explanation,
     To be alone, At dawn, The summer wind, etc., et divers morceaux pour piano." 
"Abrams Ilarriet). — Auteur anglais et cantatrice. Née en 1760; morte vers 1825. Elève du célèbre 
     professeur Arne. Son œuvre consiste en une série d' « Airs Ecossais, » écrits pour trois voix,
     publiés en 1790, et un « Recueil de chants, » édité en 1787. Des chansons et romances ont été
     aussi publiées à la même époque." 
"Abbott (Jane Bingham). — Auteur des romances : Just for to day, My soûl, what hast thou doue, etc." 
"Adams (Mrs Crosby). — Auteur américain contemporain. Musique de piano. Five tone skelches, 
     Barcarolle, Tone picture, etc." 
"Adelung (Olga). — Auteur allemand et zithérisle. A publié des compositions pour son instrument 1 .
     #1. Zither. Harpe horizontale en usage dans l'Allemagne du Sud et l'Autriche." 
"Agnesi (Mario-Thérèse). — Compositeur italien. Née à Milan en 1724, où elle mourut vers 1780. 
     Pianiste de premier ordre et compositeur dramatique. Ses opéras « Sophonishe, » « Insubria 
     consolala, » « Cyrus en Arménie, » et « Nitocris » ont obtenu un grand succès dans plusieurs
     villes d'Italie. Agnesi est aussi Fauteur d'un grand nombre de cantates et de chœurs, de 
     plusieurs concertos et sonates pour piano."
"Ahlefeldt (Comtesse de). — Auteur allemand. Née au milieu du xvm e siècle. Excellente pianiste. 
     Parmi ses compositions variées, un grand ballet « Télémaque et Calypso » fut édité en 1794, 
     à Leipzig." 
"Aleotti (Victoria). — Née à Argenta, en Italie, vers 1560. Eut pour père Giovanni Batist Aleotti, 
     célèbre architecte. Ses compositions consistent en un grand nombre de madrigaux et chants 
     sacrés, dont un recueil comprenant 21 de ses meilleures compositions, fut publié par son père,
     à Venise, 1593. Le titre de cette collection est « Ghirlanda dei madrigali a 4 voci. »" 
"Alexandra-Josephowna (Grande Duchesse de Russie). — Auteur de « Psaumes pour soli avec chœurs et 
     orchestre » (exécutés à Saint-Pétersbourg, 1886) et autre musique d'église. A composé un grand
     nombre de morceaux pour piano à quatre mains (Boléro, Défilé-Marche, etc.)."


Each entry starts with the composer’s surname, followed by forenames in parentheses and then a long dash. The full name could thus be easily extracted by looking for the text between the start of each line and the first ")" or long dash "—" (the latter to allow for a few entries where the OCR had not correctly identified the ")"). Also calculated at this stage was a column indicating the length of each line beyond the name and dates, which is a measure of how much we know about each composer.
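A possible Python version of the name extraction is sketched below. It is not the original R code, and the function names are hypothetical. The length measure is an assumption as well: it is taken here to be the length of the entry minus the length of the name, which is one reading of the description above.

```python
import re


def extract_name(entry):
    """Return the text up to the first ')' or, failing that, the long dash."""
    match = re.match(r"(.*?)(\)|—)", entry)
    if match is None:
        return entry.strip()          # no delimiter found at all
    if match.group(2) == ")":
        return match.group(1) + ")"   # keep the closing parenthesis
    # Fallback on the long dash: drop the trailing space and full stop.
    name = match.group(1).rstrip()
    if name.endswith("."):
        name = name[:-1]
    return name


def extra_length(entry, name):
    """Assumed measure: characters in the entry beyond the name itself."""
    return len(entry) - len(name)


entry = (
    "Ahlefeldt (Comtesse de). — Auteur allemand. Née au milieu du xvm e "
    "siècle. Excellente pianiste. Parmi ses compositions variées, un grand "
    "ballet « Télémaque et Calypso » fut édité en 1794, à Leipzig."
)
name = extract_name(entry)
length = extra_length(entry, name)
```

For the Ahlefeldt entry this gives the name "Ahlefeldt (Comtesse de)" and a length of 175, which agrees with the record shown further down, although that agreement does not confirm how the original figure was calculated.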

After that it was straightforward to give each entry a unique reference number, and to extract the birth year (likely to be the first date mentioned, i.e. the first occurrence in the text of 1 followed by three digits). The surname was isolated by looking for the string between the start of the line and the first character that is not a letter, a space, a hyphen or an apostrophe (to catch the various d' names that appear in a French biographical dictionary), and the forename was then the part of the name after the surname, stripping out any remaining parentheses or long dashes.
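These three extractions might look like this in Python. As before, this is a hedged sketch with hypothetical function names rather than the original R code; the reference format (EB followed by a five-digit number) is inferred from the example record below.

```python
import re


def make_ref(number):
    """Build a reference such as 'EB00007' from a running entry number."""
    return f"EB{number:05d}"


def first_year(text):
    """Return the first '1' followed by three digits, or None if absent."""
    match = re.search(r"1\d{3}", text)
    return int(match.group()) if match else None


def split_name(name):
    """Split a full name into (surname, forename).

    The surname runs up to the first character that is not a letter,
    space, hyphen or apostrophe; the rest, minus parentheses and long
    dashes, is the forename.
    """
    match = re.match(r"(?:[^\W\d_]|[ '’-])*", name)
    surname = match.group().strip()
    forename = re.sub(r"[()—]", "", name[match.end():]).strip()
    return surname, forename


text = ("Ahlefeldt (Comtesse de). — Auteur allemand. Née au milieu du "
        "xvm e siècle. Un grand ballet fut édité en 1794, à Leipzig.")
year = first_year(text)                                   # -> 1794
surname, forename = split_name("Ahlefeldt (Comtesse de)")
ref = make_ref(7)                                         # -> "EB00007"
```

Note that `first_year` simply returns the first matching number, whether or not it is actually a birth year; here it picks up 1794, the publication date of the ballet.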

The resulting file contained 773 composers, each in the following format…

Ebel = "Ahlefeldt (Comtesse de). — Auteur allemand. Née au milieu du xvm e siècle. Excellente
        pianiste. Parmi ses compositions variées, un grand ballet « Télémaque et Calypso » fut
        édité en 1794, à Leipzig." 
Name      = "Ahlefeldt (Comtesse de)"
Length    = 175
Ref       = "EB00007"
BirthYear = 1794
Surname   = "Ahlefeldt"
Forename  = "Comtesse de"


So we have retained the full text of each entry and extracted some key information which can be used for further analysis. However, examination of this example (and others) reveals a few potential difficulties. Whilst the OCR has identified most of the text pretty accurately, there are a few errors. Harriet Abrams, for example, appears as "Abrams Ilarriet)", with a misread forename and no opening parenthesis. Also, the way of finding birth dates is rather hit and miss. If a specific birth year is mentioned, then it will tend to be the first string of four digits, but if it is not mentioned in that format, then the date, if one is found at all, will be something else. In this case, Ahlefeldt is described as ‘born in the middle of the 15th century’ – another OCR error, as she was actually born in 1755, so "xvm" should have been "xviii" – and the first date mentioned, 1794, refers instead to an edition of her ballet.

Of the 773 names, just under half have a date, many of which will be birth years. These range from 1540 to 1909 – the latter is unlikely to be a birth year as Ebel was only published in 1910! Most entries are between 100 and 400 characters long (the mean is 318), but Clara Schumann has the longest entry with an impressive 6,100 characters.

So, not perfect, but certainly usable, and much easier than typing in data from a paper copy of the book! The next stage involves comparing Ebel with a number of other sources, which is an opportunity to identify and correct some of the errors. That will be the subject of another article…

Cite this article as: Gustar, A.J. 'Reading a scanned book' in Statistics in Historical Musicology, 26th February 2018.
  1. Fortunately, Ebel has only one column of text per page. Some two-column sources I have seen have some pages correct and others where the columns have not been detected, so each line reads straight across the page, ignoring the fact that it is supposed to be two columns. This is almost impossible to work with whilst remaining sane.
  2. There was one exception that had to be recoded manually – the section for L begins with an entry "A.L" – initials used by Amelia Lehmann.
  3. There were, surprisingly, only three rogue entries where this procedure did not work. Lines 233, 446 and 688 were each joined with the line before.
