A few weeks ago I noticed that those nice people at JSTOR have a scheme whereby researchers can apply to access large chunks of their data in order to carry out quantitative research projects.1 I sent off an application to see if they would let me have all copies of The Musical Times and Singing Class Circular in order to carry out a statistical analysis of the text (a technique which I will cover at some point in a future article). Lo and behold, after a couple of emails and a few days, I received a link from JSTOR to download the data I had asked for.
The data consisted of an index file and about 20,000 plain text files – one for each item in the 728 monthly editions of the Musical Times from June 1844 to December 1903. The text files had been produced by scanning the magazine to create images of each page, which were passed through a computer program to ‘read’ the text using ‘Optical Character Recognition’ (OCR).
My aim was to analyse these text files to see if there was anything interesting about the language or the topics discussed – which things were most covered? how did they change over time? – that sort of thing. However, it was not quite that straightforward, for three main reasons.
The first problem was that the articles (reviews, news items, correspondence, pieces of sheet music, advertisements, etc) were identified individually, but the text files consisted of the whole pages containing the article. So a page containing several articles would be repeated in several text files, and within each file, it was not always clear where the individual articles began and ended – the only way of determining this was to search for the text of the article titles. As will become clear below, this was not a foolproof method.
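To give an idea of what searching for a title involves when the OCR is unreliable, here is a minimal Python sketch. It is not the code I actually used: the function name `find_title`, the sliding-window approach and the 0.7 threshold are all assumptions made for illustration. It slides a window of words across the page text and scores each window against the title with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def find_title(page_text, title, threshold=0.7):
    """Return (word_index, score) for the window of words that best matches
    the title, or (None, score) if no window reaches the threshold.

    Hypothetical helper: the threshold is a guess and would need tuning.
    """
    words = page_text.split()
    window_size = len(title.split())
    target = title.lower()
    best_index, best_score = None, 0.0

    for i in range(len(words) - window_size + 1):
        window = " ".join(words[i:i + window_size]).lower()
        score = SequenceMatcher(None, window, target).ratio()
        if score > best_score:
            best_index, best_score = i, score

    if best_score < threshold:
        return None, best_score
    return best_index, best_score


# An exact search for "Queens of Song" would miss the OCR'd heading entirely.
page = "end of previous item. QUeenS OS SOn9. BY ELLEN"
index, score = find_title(page, "Queens of Song")
```

Approximate matching copes with a mangled heading like this one, but a short title, or one that also appears in the body of another article on the same page, can still match in the wrong place, which is why the method is not foolproof.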
Secondly, the text files did not always follow the order of text on the page. Most pages consisted of two columns of text, and sometimes the associated text file would have the text of the right-hand column followed by that of the left-hand column. This made it impossible to confidently join together the text of articles that were spread over several pages.
The third and most serious problem was that the quality of the OCR was poor. Very poor. The text files simply did not accurately reflect the text.
Here is a typical example, using page 225 from the Musical Times of 1 February 1864. The actual page can be found on JSTOR and consists of two columns of text.2 Here is the beginning of the article ‘Queens of Song’ starting at the top of the right column…
I can read this quite clearly. JSTOR’s OCR computer, however, rather struggled with it. This is what appears in the relevant text file…
QUeenS OS SOn9. BY ELLEN GREATHORNE CLAYTON. In Two Yols. London: Srnith, Elder, and Co. IN these twto attractive solumes, Miss Clayton has con- tl ived to interest all who love to traee the origin and various foltunes of tlle many p ime donne who have held smay- over the public mind, flom the time when Katherille Tofts and hIargalita De L’Epine reigned in r;valrv at the com- mencement of the eighteenth century, to the present dar. The introduction of the Italian Opera into EIlgland max he said to hasse founded the d) na6ty of the *’ Queens of Song; ” for before that time thc seattered songs and choluses in the dlamas that wele acted did not afild suicient display to any one singer to create a nwarked impression on the public, and we know little even of those 57ocalists who sang in the operas of Purcell. The weak imitations of Italian opera-the earliestofwhich was Arsinoe Queen of cyprus, ” set,t as it was then called, by Master Claytc)n-paved the way in England
Trying to apply any sort of textual analysis to this is clearly going to be difficult, and probably pointless. There is a certain amount of cleaning that can be done – removing the end-of-line hyphens, replacing some of the more obvious errors (such as 0/O, 1/i, 5/S, etc), even passing it through a spell-checker – but it does not really make much of an improvement where the quality of the scan is this poor.
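For the curious, the simpler cleaning steps might look something like the following in Python. This is only a sketch under my own assumptions, not a tested pipeline: the function name `clean_ocr_text` and the exact rules are invented for illustration, and the digit substitutions only fire when the digit sits between two letters, so that dates such as 1864 are left alone.

```python
import re


def clean_ocr_text(text):
    """Apply a few conservative clean-up rules to raw OCR output.

    Hypothetical helper: the rules are illustrative, not exhaustive.
    """
    # Rejoin words split by an end-of-line hyphen ("com- mencement").
    text = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", text)

    # Replace digits that are probably misread letters, but only when
    # they are surrounded by letters on both sides.
    for digit, letter in (("0", "o"), ("1", "i"), ("5", "s")):
        text = re.sub(rf"(?<=[A-Za-z]){digit}(?=[A-Za-z])", letter, text)

    # Collapse runs of whitespace left over from the original line breaks.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_ocr_text("at the com- mencement of the c0ncert in 1864")
```

Rules like these have side effects. The hyphen rule, for instance, would turn ‘smay- over’ in the extract above into ‘smayover’, and it would also merge any genuinely hyphenated word that happened to be broken at the end of a line. None of them can rescue something like ‘d) na6ty’.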
Admittedly, not all pages were quite as bad as this, but there does not appear to be much of a pattern to them, so it is hard to just avoid the really bad ones. To indicate the scale of the problem, about 25% of all ‘words’ of six or more characters contained in the text files were not recognised by an English spellchecker program.
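A rough version of that measurement can be sketched in a few lines of Python. The details here are assumptions rather than a record of what I ran: `known_words` stands in for whatever word list or spellchecker is available, and the function name `unrecognised_rate` is made up. ‘Words’ are simply whitespace-separated tokens with the surrounding punctuation stripped.

```python
import string

# ASCII punctuation plus the curly quotes that appear in the text files.
STRIP_CHARS = string.punctuation + "‘’“”"


def unrecognised_rate(text, known_words, min_length=6):
    """Fraction of tokens of at least min_length characters that are
    not found in known_words (a set of lower-case dictionary words)."""
    tokens = (token.strip(STRIP_CHARS) for token in text.split())
    long_tokens = [t.lower() for t in tokens if len(t) >= min_length]
    if not long_tokens:
        return 0.0
    misses = sum(1 for t in long_tokens if t not in known_words)
    return misses / len(long_tokens)


# Tiny stand-in dictionary; a real run would load a full English word list.
known_words = {"attractive", "volumes", "fortunes", "public"}
rate = unrecognised_rate("attractive solumes, foltunes public", known_words)
```

Restricting the count to longer tokens helps because short OCR errors often happen to be valid words, whereas a garbled long word rarely is. Proper names and musical terms will inflate the figure, so it is best read as an indication of scale rather than a precise error rate.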
I don’t know when JSTOR digitised The Musical Times, but perhaps it should have another go. Today’s state of the art in OCR, as exemplified by Google Docs, makes a much better job of the same scanned image…3
Queens of Song. By ELLEN CREATHORNE CLAYTOX. In Two Vols. London: Smith, Elder, and Co. In these two attractive volumes, Miss Clayton has contrived to interest all who love to trace the origin and various fortunes of the many prime donne who have held sway over the public mind, from the time when Katherine Tofts and Margarita De L’Epine reigned in rivalry at the commencement of the eighteenth century, to the present day. The introduction of the Italian Opera into England may be said to have founded the dynasty of the “Queens of Song;” for before that time the scattered songs and choruses in the dramas that were acted did not afford sufficient display to any one singer to create a marked impression on the public, and we know little even of those vocalists who sang in the operas of Purcell. The weak imitations of Italian opera—the earliest of which was Arsinoe, Queen of Cyprus, “set,” as it was then called, by Master Clayton-paved the way in England
There are still one or two small errors here, but this version of the text is pretty accurate and would certainly be usable for quantitative analysis.
JSTOR is by no means alone in this – I have come across many examples of digitised historical documents with the same problem, sometimes much worse. It would be great if these documents could be reprocessed now that OCR software has improved so much. It would make them so much more useful, not just for quantitative analysis, but also for searching and indexing, so that the information in them is more readily accessible. In most cases the scans themselves are fine (although there are a few exceptions) – it is just the OCR that needs redoing from the existing images.
So I have put the Musical Times data to one side for now. It is rare that I am defeated by messy data, but in this case it is just too hard to make sense of.
1. JSTOR’s ‘data for research’ (DfR) page is at https://www.jstor.org/dfr/
2. This particular page is also freely available on archive.org here.
3. Simply take the image above as a jpg or similar file, upload it to Google Drive, and then open it for editing in Google Docs. It will be automatically passed through Google’s OCR engine.