The limitations of musical datasets

The value of statistical techniques in historical musicology depends on the quality of the available data. The extent and diversity of the available sources are considerable, but they can only ever illuminate a small proportion of the musical world.

A historical musical dataset can be thought of as a snapshot of part of the entirety of musical activity. Although we may be tempted to extrapolate our conclusions beyond the scope of the data, there are fundamental reasons why such extrapolations can only ever be valid within narrow limits. 

One reason for this relates to the definition of a musical work. To be included in a dataset, a piece of music has to be identifiable, distinct from other pieces of music, and reproducible (usually as a score or recording). Any musical creation can be considered a ‘work’, but this can be a poorly defined term in some musical cultures and genres. The identity of individual works is not necessarily stable and well-defined. How do we know if two performances (particularly in genres such as jazz, folk and many non-Western musics, which incorporate elements of improvisation) are of the same work or of different works? What about the various derivatives – arrangements, fantasies, cover versions, improvisations, tributes and variations – which can be considered as either new or existing works depending on the context?

Copyright law aims (not always successfully) to define a musical work in legal terms, usually based on a definitive notated version. But it is not always clear what constitutes the essential identity of a work. It is not necessarily the ‘tune’ and the ‘words’ – improvised elements, chord sequences, structure, instrumentation, performance practice and context can be at least as important to a work’s identity.

Further confusion can result with hierarchies of works – movements within symphonies, piano pieces within suites or sets, arias within acts within operas within cycles, etc. Any or all of these combinations might be listed as ‘works’ within the same dataset, depending on how publishers or record companies choose to present the material.

A second issue is to do with classification. Most datasets put works, implicitly or explicitly, into different categories. These groupings may be objective – such as the performing forces (piano, wind band, choir, etc) – but often use more subjective descriptions such as form (symphony, minimalist, etc), context (operetta, ‘muzak’), value judgement (light music), genre (nocturne, hip-hop, blues), period (baroque, romantic) or region (‘Western music’, ‘world music’, etc). Such classifications are poorly defined, often overlap, and may be applied inconsistently by different individuals or groups. Historical and modern datasets are thus presented to us through a variety of filters, varying by period, region and other cultural factors, which can be difficult to define and compare.

A third problem is to do with the question of survival. Music is a transient process, and for a work to ‘survive’ after the sound has faded away, it must continue to exist in some form – usually as a recording or a notated score. Precise musical notation is (with few exceptions) peculiar to Western music, and has existed for less than 1,000 years. Many musical cultures have no precise notation, and there is a great deal of music (including much Western music) that is largely improvised or based on patterns and structures that are only partially notated or are handed down aurally. Recording, of course, does not require precise notation, although the proportion of informal, improvised and unnotated music that is actually recorded is extremely small.

One measure of a work’s ‘survival’ may be its inclusion in a dataset – as proof of its existence. Whilst the non-appearance in subsequent datasets of a published work may be seen as a failure to survive, the same conclusion cannot be drawn about all the improvised, non-notated, unpublished, and aurally-transmitted music that does not, indeed cannot, appear in these datasets. Historical evidence is stacked in favour of the small proportion of music that can be catalogued.

Even well-defined works that have survived might never receive any attention, and thus ‘exist’ only as a listing in a historical document. Musicologists, performers, audiences, concert promoters and others tend to focus on a small number of ‘great works’ by a handful of ‘great composers’ (with similar trends in jazz, popular music, and other genres).

The same is true to some extent of those individuals and institutions who have collected and catalogued music in its various forms. There are different degrees of interest in music from different periods or regions, of particular genres or for different combinations of performers. The compilers of musical datasets are more likely to include works and composers that are closer to home – sharing a country, period, language or culture, for example – than those that are more remote, harder to find, and less familiar.

So the proportion of the totality of musical activity that can be explored through the surviving datasets is rather small – we cannot look far beyond that portion of Western music from the last half-millennium that has been either written down or recorded. From a historical and global perspective this is an unquantifiable but undoubtedly small proportion of the musical world. It is nevertheless a large and significant body of work, from which much can be learned by the use of quantitative techniques. The same limitations apply to qualitative research techniques, so a statistical view of music history is no less representative than one based entirely on qualitative research. Indeed, quantitative methods can give a more balanced voice to the huge numbers of minor works and little-known composers that are mostly ignored in qualitative research.

Cite this article as: Gustar, A.J. 'The limitations of musical datasets' in Statistics in Historical Musicology, 28th May 2018,
