This is the first of a series of articles about analysing text data. The statistical music historian might be interested in many sorts of text – from lists and catalogues through to complex ‘free format’ writing in tweets, record reviews, composer biographies, or encyclopedias. For these articles I will consider a dataset of song lyrics, taken from the LyricWiki website.
The statistical analysis of text is not always straightforward, and the aim of this series of articles is to illustrate some of the techniques, and to discuss their limitations and interpretation. We start with the simple question of counting words.
The dataset is a random sample of songs gathered by a short R program which made repeated use of LyricWiki’s helpful ‘random song’ feature. The data includes the name of the artist (and, inconsistently, the songwriter), the song title, the language, the year (usually the year of issue of the single or album) and the lyrics. LyricWiki includes an eclectic mixture of songs ranging from the famous to the obscure, which have been contributed by users. They are mainly in English but a few are in other languages. After removing non-English songs, those without a year, those with lyrics shorter than 15 characters (mainly instrumentals), and any falling outside the period 1950-2019, the sample included almost 64,000 songs by around 22,000 different artists. As the following chart shows, just 425 (0.6%) of the songs were from the 1950s, but over 40% were from the 2000s.
This uneven distribution means that any overall results will be biased towards later periods. To reduce this problem, the analysis will often be broken down by decade – which also reveals trends over time.
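Deriving a decade label from the year is a matter of simple integer arithmetic. A minimal sketch in R (the data frame and its column names are illustrative, not the actual dataset):

```r
# Toy data standing in for the real song dataset.
songs <- data.frame(title = c("A", "B", "C"),
                    year  = c(1957, 1968, 2004))

# Integer division by 10, then multiply back up:
# 1957 -> 1950, 1968 -> 1960, 2004 -> 2000.
songs$decade <- (songs$year %/% 10) * 10
```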
To count the words, we need to take the long strings of lyrics and split them up into words.1 Words are mainly separated by spaces and/or punctuation, but we also need to consider what to do with hyphenated and apostrophised words (“I’m”, “they’re”, etc) and other odd bits of text such as abbreviations or those containing numbers (“5th”) or symbols. A common convention is just to separate at spaces and punctuation, and assume that any non-standard words are probably rare enough to ignore.
It is also common to convert all words to lower case so that “song”, “Song” and “SONG” are all counted as one word. This might not always be appropriate if, for example, we are particularly interested in proper names, or the words that start lines or sentences.
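A tokeniser along these lines can be sketched in base R: convert to lower case, then split on any run of characters that is not a letter, digit or apostrophe (keeping apostrophes so that “I’m” stays one token). The exact regex is a design choice for illustration, not necessarily the one used for the article’s figures:

```r
# Split a lyric string into lower-case word tokens.
# Apostrophes are kept inside words; everything else
# non-alphanumeric acts as a separator.
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  tokens[tokens != ""]  # drop empty strings left by leading separators
}

tokenize("I'm singing, yeah - SINGING a song!")
# c("i'm", "singing", "yeah", "singing", "a", "song")
```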
Any word count is likely to find that so-called ‘stopwords’ are unhelpfully predominant. These are common words (“and”, “the”, “of”, etc) that tend to be mainly structural rather than conveying much meaning. There are several standard lists of stopwords for various languages, and it is straightforward to filter them out. It is, however, worth looking at the list of stopwords being removed to check whether it includes terms you might actually be interested in. Words such as “he” and “she”, for example – often regarded as stopwords – might be important in a study looking at the prevalence of gender references.2
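Filtering stopwords is just set membership. A sketch with a tiny hand-made list (in practice you would use a standard list such as those shipped with the tidytext package):

```r
# A tiny illustrative stopword list -- far shorter than any real one.
stopwords <- c("the", "and", "of", "a", "i", "you", "to", "in", "it")

tokens <- c("i", "love", "you", "and", "the", "music", "of", "life")
kept   <- tokens[!tokens %in% stopwords]
# c("love", "music", "life")
```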
A common way of conveying the result of a word count is with a ‘word cloud’. The following was generated by the wordcloud2 package in R, having removed stopwords.3
This is actually the top 200 words appearing in the lyrics – it being sensible to limit the number for practical reasons. The frequency of each word is reflected in both its size and its colour.
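The wordcloud2 call itself is short: it takes a data frame whose first two columns are the word and its frequency. A sketch with made-up counts (the real analysis uses the counts from the full lyrics dataset):

```r
# word_counts: one row per word, ordered by frequency.
# The counts below are invented for illustration only.
word_counts <- data.frame(word = c("love", "time", "yeah", "baby", "life"),
                          freq = c(500, 350, 300, 280, 250))

# Guarded so the sketch still runs where the package is absent.
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  # Cap at the top 200 words, as in the figure above.
  wordcloud2::wordcloud2(head(word_counts, 200))
}
```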
Word clouds can be an effective way of quickly conveying the most common words in a piece of text. ‘Love, time, yeah, baby, life, gonna, feel’ does not seem an unreasonable list. It is hard, however, to get precise information from word clouds,4 so they should only really be used to give a quick overview.
A better, if less pretty, way of presenting this information is with more traditional graphs such as bar charts. Here, for example, are the top ten words in each decade:
This shows the top 10 words (minus stopwords) in each decade and overall. The sixteen words that have made the top 10 appear in the same order in each chart, to make it easier to compare them over time. We see that “love” has been the top word throughout, but “time” has only been in second place since the 1970s: in the 1950s and 60s the second word was “baby”. “Yeah” was not in the 1950s top 10, but has become more common over time, to stand at third in the 2010s and overall. Other words are specific to individual decades – “rock” and “ya” in the 1950s, or “world” in the 2000s.
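A faceted bar chart of this kind can be built with dplyr and ggplot2. A sketch with toy data (the counts and the two-decade subset are invented; note that reorder() gives one shared ordering across facets, which is the simple version of the per-chart ordering described above):

```r
library(dplyr)
library(ggplot2)

# Toy per-decade word counts standing in for the real data.
counts <- data.frame(
  decade = rep(c(1950, 1960), each = 3),
  word   = c("love", "baby", "time", "love", "time", "yeah"),
  n      = c(90, 70, 50, 120, 80, 60)
)

# Keep the ten most frequent words within each decade.
top_words <- counts %>%
  group_by(decade) %>%
  slice_max(n, n = 10)

ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  facet_wrap(~ decade, scales = "free_y") +
  labs(x = "Count", y = NULL)
```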
There is a lot of overlap between the top words in each decade, although the order changes a little. The charts are dominated by ‘love, time, yeah, baby, life’, which are common terms in all periods. How can we find the terms that are the most ‘characteristic’ of particular periods, rather than simply the most frequent?
A common approach to this question calculates the ‘tf-idf’, which stands for ‘term frequency – inverse document frequency’. It takes the frequency of each word (the ‘term frequency’) and multiplies that by minus the logarithm of the proportion of texts (‘documents’) in which the word occurs. If we define a ‘document’ to be a decade-worth of song lyrics, then words such as “love” that appear in every decade will have their term frequencies multiplied by the log of one (i.e. 100%), which is zero, so they will in effect be ignored. Words that only occur in one of the seven decades will have their frequencies multiplied by -log(1/7), or 1.95.
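The ‘idf’ part of the calculation can be checked by hand. With N documents and a word appearing in df of them, idf = −log(df/N); the two figures quoted above fall out directly (tidytext’s bind_tf_idf() function does the full tf-idf calculation on a tidy word-count table):

```r
# idf for a word appearing in df of the N 'documents' (here, decades).
idf <- function(df, N) -log(df / N)

idf(7, 7)  # "love", in all seven decades: -log(1) = 0
idf(1, 7)  # a word unique to one decade: -log(1/7) = log(7), about 1.95
```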
In this case, ‘tf-idf’ does not give particularly meaningful results – largely because each decade-long ‘document’ of lyrics is very diverse and contains all of the most interesting words. With only seven decades, there are not enough ‘documents’ for the ‘idf’ part of the calculation to do a good job. The analysis by decade, even restricting it to words appearing at least 50 times, gives, for example, “blues”, “till” and “war” as the most characteristic words of the 1950s; “dub”, “ole” and “people” for the 1970s; and various obscenities for the 1990s, 2000s and 2010s.
In other circumstances, tf-idf can give more meaningful insights. Using the lyrics of the 26 songs by The Beatles that appear in the sample, a tf-idf analysis gives the top three characteristic words of various songs as (“remember”, “girl”, “love”),5 (“lied”, “cried”, “listen”),6 (“people”, “ah”, “lonely”),7 and (“day’s”, “feel”, “home”).8 The reason it works better in this case is that the lyrics of individual songs are quite homogeneous, and there is a decent number of ‘documents’, so the ‘idf’ term ranges from 0.86 (for “love” which appears in 11 of the 26 songs) up to 3.26 for words such as “strawberry”, “polythene” and “walrus” that each only appear in one song.
In my experience, tf-idf is a rather hit-and-miss technique which always requires a common sense check. It works best if there is a reasonable number of ‘documents’, and if they are each homogeneous in terms of their language and subject matter.
A bit of common sense, of course, is required in any statistical analysis – especially if it involves complex, rich, multi-layered data, which is inevitably the case with texts.
- There are many programs and packages to do this. For much of this analysis I have used the excellent tidytext package in R.
- There are many stopwords lists available online. tidytext comes with three standard English lists. The shortest, ‘snowball’, has 174 words, but the combined list (which I have used here) contains 728 words, including pronouns, small numbers, conjunctions and quite a few words that might be useful in some situations.
- There are several programs and packages that generate wordclouds. I think this is one of the prettier ones for R, although it can be slightly tricky to use. There is a lot of scope for playing with colours, fonts, shapes, layouts, etc. Here is an online version that just requires a list of words… https://www.wordclouds.com/
- For example, how do you compare long words against short ones, or the relative size of two words appearing at different angles on opposite sides of the cloud?
- ‘Things we said today’
- ‘Tell me why’
- ‘Eleanor Rigby’
- ‘A hard day’s night’