Collecting Data

Finding a great dataset is all very well, but the next step is working out how to get the data onto your computer so that you can start playing with it. Datasets come in many forms, and there are different ways of collecting the data. In this article I will use some examples from the list of datasets in this previous article on women composers.

There are three main approaches to collecting data: read it and type it in, download it, or ‘scrape’ it. Here are some examples of each.

Read and Type In

Many datasets only exist as books, or perhaps online as scanned images of books. Or maybe they are websites (such as discussion forums or online encyclopedias) in which the information you want is contained in free-format prose that cannot be processed automatically. In such cases, the only practical option may be simply to read the data yourself and type the relevant details into a suitable document or spreadsheet.

For example, Otto Ebel’s 1910 book Les Femmes Compositeurs de Musique is available as a scanned book at archive.org. The several hundred biographical entries are written in free-format prose. To collect, for example, composers’ dates, places of birth, or mentions of works, there is little alternative but to actually read the text and type the details into a spreadsheet.

For many books on archive.org and other similar websites, a text file is also available.1 This is created by a computer attempting to read the scanned image of the book, a process known as ‘Optical Character Recognition’ (OCR). The text file for Les Femmes Compositeurs is of unusually good quality, so in this case it might be possible (depending on your programming skills) to extract, for example, names and dates semi-automatically.2 In many cases, however, OCR does not do such a good job, and the text file is of limited use.3 Manual checking is always sensible in such cases.
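To illustrate the sort of semi-automatic extraction suggested in footnote 2, here is a short R sketch. The sample entries are invented for illustration, and the two heuristics (a capitalised surname at the start of an entry, and the first four-digit string beginning with ‘1’ taken as a year of birth) would need tuning against the real OCR text:

```r
#two invented biographical entries, imitating the style of an OCR'd dictionary
entries <- c(
  "ABRAMS, Harriet. English singer and composer, born in 1760.",
  "ADAIEWSKY, Ella. Pianist and composer, born at St Petersburg in 1846."
)

#take the capitalised word at the start of each entry as the surname
surnames <- sub("^([A-Z]+),.*", "\\1", entries)

#take the first four-digit string beginning with '1' as the year of birth
years <- regmatches(entries, regexpr("1[0-9]{3}", entries))

data.frame(surname = surnames, born = years)
```

Even with a clean text file, heuristics like these will miss awkward cases, so the results should be spot-checked by hand.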

The main problem with reading and typing in data is the amount of time and effort required. Even for a small book, it can be impractical to cover every entry. So it is usually the case that only a sample of data can be analysed, rather than the entire dataset. Working with samples is a perfectly good way of analysing data, but it does require an additional step – choosing which entries to include in the sample so that it is representative of the dataset as a whole. This will be covered in a future article.
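As a minimal sketch of the sampling step (the details of choosing a representative sample are left for that future article), here is how a simple random sample of entries might be drawn in R. The dataset size and sample size are invented for illustration:

```r
#suppose the book has 800 biographical entries, and we can only type in 50
#setting the seed makes the random sample reproducible
set.seed(42)
sampled.entries <- sample(1:800, 50)

#these are the 50 entry numbers to look up and type in
length(sampled.entries)
```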


Download

There are a handful of datasets that are already in a usable form and can simply be downloaded as a ready-to-use spreadsheet or CSV (comma separated values) file. A good example is the Women’s Song Database which can be downloaded as an Excel spreadsheet containing neatly formatted information on over 19,000 songs. Not only is this data easily acquired, but it requires very little ‘wrangling’ (the process of getting data into a form suitable for further analysis).4
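Loading such a file into R is a one-liner. An Excel spreadsheet can be read with the (real) readxl package, e.g. readxl::read_excel("songs.xlsx") – that filename is invented. The CSV case is sketched below, writing a tiny invented file and reading it back so that the example is self-contained:

```r
#write a small CSV file to a temporary location (the contents are invented)
csvfile <- tempfile(fileext = ".csv")
writeLines(c("composer,title,year",
             "\"Abrams, Harriet\",Crazy Jane,1799"), csvfile)

#read it back in as a data frame, ready for analysis
songs <- read.csv(csvfile)
nrow(songs)
songs$year
```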

Other datasets are a little messier. The downloaded list of Women’s Orchestral Works, for example, arrives as a PDF file. All of the data is there and in a reasonably standard and consistent format, but a bit of extra wrangling is required in order to get a usable dataset – extracting the text from the PDF file, and then parsing each entry into its component parts (composer name, dates, work name, orchestral forces, etc).
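The two wrangling steps can be sketched in R. Extracting the text is handled by the (real) pdftools package, e.g. text <- pdftools::pdf_text("works.pdf") – that filename is invented. Parsing each entry is then a matter of pattern matching; the entry format below is invented for illustration and would need adjusting to the real file’s layout:

```r
#an invented entry, imitating a 'composer (dates): work, op. number - forces' layout
entry <- "Beach, Amy (1867-1944): Gaelic Symphony, op. 32 - full orchestra"

#split the entry into its component parts with a regular expression
parts <- regmatches(entry,
  regexec("^(.+?) \\((\\d{4})-(\\d{4})\\): (.+?), op\\. (\\d+) - (.+)$", entry))[[1]]

composer <- parts[2]   #"Beach, Amy"
born     <- parts[3]   #"1867"
work     <- parts[5]   #"Gaelic Symphony"
forces   <- parts[7]   #"full orchestra"
```

In practice a single regular expression rarely covers every entry, so it pays to check for entries that fail to match and handle them separately.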

Also worth mentioning is the increasing number of online databases which have an ‘Application Programming Interface’ (API) – essentially a dedicated method for other computers (rather than humans) to access them directly.5 Although using an API often requires some technical knowledge, this can be a quick and versatile way of downloading data in a directly usable form.
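APIs typically return their results as JSON, which the (real) jsonlite package converts straight into R data structures. A live request would look something like jsonlite::fromJSON("https://api.example.org/composers") – that URL is invented. Here a literal JSON response is parsed instead, so the sketch runs without a network connection:

```r
library(jsonlite)

#an invented JSON response of the kind an API might return
response <- '[{"name": "Smyth, Ethel", "born": 1858},
              {"name": "Chaminade, Cecile", "born": 1857}]'

#fromJSON converts an array of records into a data frame
composers <- fromJSON(response)
composers$name[1]
```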


Scrape

For datasets that don’t have to be read and typed in, and that can’t be accessed by a simple download or API, data generally needs to be collected by a process known as ‘scraping’. Scraping is really a whole set of tools and techniques for accessing online data and gathering the information of interest. Although there are a few general types, described below, scraping has to be fine-tuned to the characteristics of each particular website.

Apart from in the simplest cases, scraping generally requires some programming. The simplest cases are those where the entire dataset is contained on a small number of webpages in a form that can simply be copied and pasted into a spreadsheet. The Kapralova list of women composers, for example, is contained on just two pages (A-K and L-Z), which could easily be copied and pasted into a spreadsheet ready for a bit of tidying up.

This is not the place to go into the nitty-gritty of writing programs to scrape data from websites, although this is a topic that will be covered in more detail in future articles. However, as an example, a relatively straightforward case is the website Archiv Frau und Musik. Each letter of the alphabet has its own page, with a sensibly named URL for each one. It is straightforward to write a program to loop through the letters of the alphabet, construct the appropriate URL, download the page, extract the list of names and dates, and save it all as a single list. Here is a short R program that does just that:

#load the rvest package (for '(ha)rvesting' webpages)
library(rvest)

#set the base URL
baseURL <- "http://www.archiv-frau-musik.de/Komponistinnen"

#initialise a results variable
results <- character(0)

#start a loop for variable 'let' to run through the letters a to z
for(let in letters){
  
  #set the URL for each letter, by pasting onto the base URL 
  URL <- paste0(baseURL, let, ".htm")
  
  #read the page with that URL
  page <- read_html(URL)
  
  #extract the part of the page with the list of names
  names.list <- html_nodes(page, "#ldheLabel3 font")
  
  #extract the text from this part of the page
  names.text <- html_text(names.list)
  
  #names.text is returned as a single long string, with the names separated by groups of spaces
  #the next statement splits this up, and trims whitespace off the end of the resulting pieces
  names <- trimws(strsplit(names.text, "      ")[[1]])
  
  #add this onto the end of 'results' (apart from the first element of 'names', which is blank)
  results <- c(results, names[-1])  

  #it is good practice to 'throttle' this activity to avoid overloading websites 
  Sys.sleep(1) #one second delay between letters 
}

#print first 10 results (there are 1593 in total)
head(results, 10)

[1] "Abe, Kyoko (*1950)"                           
[2] "Aboulker, Isabelle (*1938)"                   
[3] "Abrams, Harriet (1760-1822)"                  
[4] "Abrao, Sandra Maria (*1949)"                  
[5] "Adaïewsky, Ella (1846-1926)"                  
[6] "Adair, Yvonne  (1897-)"                       
[7] "Agnesi Pinottini, Maria Teresa d' (1720-1795)"
[8] "Agudelo, Graciela (*1945)"                    
[9] "Ahrens, Sieglinde (*1936)"                    
[10] "Alabert, Àngels (*1937)"


Often a two-stage approach is required. There may be an index page from which the first step is to harvest the links to the pages of interest. The IMSLP page on women composers is a good example, where several index pages must first be scraped for the links to each composer’s page. Other sites, such as Women of Note, have more complicated structures, and (in this case) a masked URL – the links actually point to an address on oboeclassics.com, even though the browser’s address bar always remains on womenofnote.co.uk. It can take a lot of ingenuity and trial and error to successfully scrape data from some websites!
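The first stage – harvesting the links from an index page – can be sketched with rvest as well. A real index page would be fetched with read_html(URL); here a fragment of HTML is parsed directly so the example is self-contained, and the composer pages are invented:

```r
library(rvest)

#a small invented index page, parsed directly from a string
index <- read_html('<ul>
  <li><a href="/composer/abrams">Abrams, Harriet</a></li>
  <li><a href="/composer/adaiewsky">Adaiewsky, Ella</a></li>
</ul>')

#extract the 'href' attribute of every link on the page
#these URLs would then be looped over in the second stage, page by page
links <- html_attr(html_nodes(index, "a"), "href")
links
```

In the second stage each harvested URL is downloaded and scraped in turn, much as in the letter-by-letter loop earlier.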

The most difficult sites are those requiring the user to log in (such as Oxford Music Online), or those that use ‘dynamically generated’ content. These are designed for real people operating normal web browsers, and often do not respond well to simple automated scripts such as the example above. The ‘sledgehammer’ solution in such cases is to use something like the R package RSelenium, which automates the clicks and other operations that a human would perform with a browser. This approach can be messy and is usually slower than other methods, but it is very powerful.


The approach to gathering data depends on, and can affect, the type of analysis. Data from books or other free-format prose can often only realistically be analysed by an approach based on sampling. However if a whole dataset can be gathered, then ‘big data’ techniques can be used. Whilst simple methods can be used to collect well-organised data, such as lists of names and dates, an investigation that requires more complex information may require quite advanced programming skills.6 However, most datasets can be cracked with some ingenuity, patience and experimentation, and there is much that can be achieved with even relatively simple data.

Cite this article as: Gustar, A.J. 'Collecting Data' in Statistics in Historical Musicology, 21st August 2017, https://www.musichistorystats.com/collecting-data/.


  1. Look at the ‘FULL TEXT’ link, under ‘download options’, for example.
  2. For example, names can be identified by searching for capitalised words at the start of a paragraph. Years of birth can often be identified as the first four-digit string beginning with ‘1’.
  3. OCR does not cope well with poor quality scans (dirty pages, very small print etc), irregular spacing, or text in several different languages or with a lot of accented or unusual characters. It can sometimes also fail to parse text correctly when it is laid out in columns or tables.
  4. Data wrangling will be a topic of future articles.
  5. There aren’t any examples among the women composers datasets, but the Listening Experience Database is a good and sophisticated example, with a detailed explanation of how it works.
  6. Automatically extracting information from free-format text, for example, or handling complex data structures such as XML or the ‘semantic web’.
