Deduplication is an important, though often messy and time-consuming, part of many statistical investigations. It is usually required when data comes from several different sources, to identify all of the records that actually refer to the same thing. For example, I have recently been deduplicating the names appearing in the ‘women composers’ sources listed in this previous article. Deduplication may also be needed where several publications of the same work are described in different ways in a library catalogue.
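As a rough illustration of the kind of matching involved, here is a minimal Python sketch that flags likely duplicate names. It is only one possible approach, not the method used for the sources above: the function names `normalise` and `likely_duplicates`, the 0.85 threshold, and the short list of names are all assumptions made for the example.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a name, strip accents and punctuation, and sort its parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks left behind by decomposition (e.g. the accent on 'é').
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace anything that is not a letter or digit with a space.
    cleaned = "".join(c if c.isalnum() else " " for c in unaccented.lower())
    # Sorting the parts makes 'Surname, Forename' match 'Forename Surname'.
    return " ".join(sorted(cleaned.split()))


def likely_duplicates(names, threshold=0.85):
    """Return pairs of names whose normalised forms are at least `threshold` similar."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = SequenceMatcher(None, normalise(first), normalise(second)).ratio()
            if score >= threshold:
                pairs.append((first, second))
    return pairs


# Illustrative names only; these are not taken from the sources discussed.
names = ["Chaminade, Cécile", "Cecile Chaminade", "Smyth, Ethel", "Clara Schumann"]
duplicates = likely_duplicates(names)
print(duplicates)
```

A threshold below 1.0 also catches small spelling variations, at the cost of some false matches, so in practice the flagged pairs would still need checking by eye.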
I have recently been working on extracting data on women composers from the various sources listed in this previous article. The first source on that list is a scanned copy of a French translation of a book – Les femmes compositeurs de musique – compiled in 1910 by Otto Ebel. It is available at archive.org here. Although I’ve not had great success in the past in extracting usable data from scanned books, this appears to be a reasonably tidy scan of Ebel, which looks like a useful source on women composers, so I thought I would give it a go.
Triangulation is a research technique that involves looking at the same thing from two different perspectives. In surveying, it enables positions and distances to be calculated by measuring angles from two locations. In the social sciences, it can increase the reliability of conclusions if they are found by two (or more) different methods. And in statistical historical musicology, looking for the same works or composers in two or more datasets can tell us a lot about the characteristics of the datasets, and about the works’ patterns of survival or dissemination.
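In its simplest form, triangulating two datasets comes down to comparing sets of names. The sketch below is a minimal Python illustration under that assumption; the function name `triangulate` and the two small example sets are hypothetical, and real data would need deduplicating first.

```python
def triangulate(dataset_a, dataset_b):
    """Split the names in two datasets into shared and unshared groups."""
    a, b = set(dataset_a), set(dataset_b)
    return {
        "both": a & b,    # names found in both datasets
        "only_a": a - b,  # names found only in the first
        "only_b": b - a,  # names found only in the second
    }


# Illustrative names only; these are not real dataset contents.
source_a = {"Clara Schumann", "Ethel Smyth", "Cécile Chaminade"}
source_b = {"Ethel Smyth", "Fanny Hensel"}

result = triangulate(source_a, source_b)
# Share of the first dataset that also appears in the second.
overlap_share = len(result["both"]) / len(source_a)
print(result, overlap_share)
```

The sizes of the three groups are what carry the information: a small overlap between two supposedly comprehensive sources suggests that each is missing a good deal.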
I have recently been trying to collect data from the Listening Experience Database (LED) in order to put together a proposal for a conference paper. The LED is a nicely constructed database using linked open data and a structure based on something called the ‘Semantic Web’. Unlike traditional databases, which have a hierarchical ‘tree’ structure, the Semantic Web concept is a true ‘network’, where anything can be linked to anything else. The LED, for example, includes links to data in a number of other databases. Have a look at the LED and follow a few links and you will see what this means – a very rich and flexible means of linking data together.
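To make the ‘network’ idea a little more concrete, here is a minimal Python sketch of data stored as subject, predicate, object ‘triples’, which is the basic building block of linked data. Every identifier in it (such as `experience:1` or `mentionsWork`) is invented for the example and is not the LED's actual vocabulary.

```python
# Each triple says: subject --predicate--> object.
# All identifiers here are hypothetical, for illustration only.
triples = [
    ("experience:1", "hasListener", "person:1"),
    ("experience:1", "mentionsWork", "work:1"),
    ("work:1", "composedBy", "person:2"),
    ("person:2", "sameAs", "external:composer-record"),
]


def links_from(node, triples):
    """Return the (predicate, object) pairs that start at `node`."""
    return [(p, o) for s, p, o in triples if s == node]


def reachable(start, triples):
    """Follow links outwards from `start` and return every node that can be reached."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(o for _, o in links_from(node, triples))
    return seen


print(links_from("work:1", triples))
print(sorted(reachable("experience:1", triples)))
```

Because any node can be the subject or object of any triple, there is no fixed hierarchy: following links from one record can lead out to records held in an entirely different database.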
Finding a great dataset is all very well, but the next step is working out how to get the data onto your computer so that you can start playing with it. Datasets come in many forms, and there are different ways of collecting the data. In this article I will use some examples from the list of datasets in this previous article on women composers.
There are three main approaches to collecting data: read it and type it in, download it, or ‘scrape’ it.
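Of the three, scraping is the least self-explanatory, so here is a minimal Python sketch of the idea using only the standard library. It pulls the rows out of an HTML table held in a string; the class name `TableCellParser` and the tiny example page are assumptions for illustration, and a real page would first have to be fetched and would almost certainly be messier.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows, each a list of cell strings
        self._row = None       # the row currently being read, if any
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._in_cell:
            self._row[-1] += data


# A stand-in for a downloaded web page.
page = """
<table>
  <tr><th>Name</th><th>Born</th></tr>
  <tr><td>Clara Schumann</td><td>1819</td></tr>
  <tr><td>Ethel Smyth</td><td>1858</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(page)
print(parser.rows)
```

The resulting list of rows can then be written to a spreadsheet or CSV file, which is the point at which scraped data becomes as usable as downloaded data.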