In a later post in this series of articles analysing a dataset of song lyrics, I will consider the more general question of identifying parts-of-speech (nouns, verbs, adjectives, etc.), which can greatly expand what can be learned from statistical text analysis. However, in this article, I will focus on a particular part of speech: proper nouns.
Unfortunately, identifying proper nouns in song lyrics is not as simple as just looking for words starting with a capital letter. There are four main difficulties. The first is that some proper nouns also appear as ordinary non-capitalised words: “Carol”, “Rick”, or “China”, for example.
Secondly, some often-capitalised words are not actually proper nouns. “I” is an obvious example, but in the song lyrics dataset, words such as “Aah”, “Hey”, and “Whoops” are also commonly capitalised.
A third difficulty is that capital letters are used for other purposes, such as the beginning of sentences or, especially in song lyrics, at the start of lines or verses. One way around this is to look for words that are almost always capitalised wherever they appear.
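As a rough illustration of the "almost always capitalised" idea, the sketch below computes, for each word, the share of its appearances that start with a capital letter. The function name `capitalisation_ratios`, the `min_count` parameter and the sample lines are all invented for this example; they are not taken from the analysis itself.

```python
import re
from collections import Counter

def capitalisation_ratios(lyrics, min_count=2):
    """Return {word: share of its appearances that start with a capital}."""
    capitalised = Counter()
    total = Counter()
    for word in re.findall(r"[A-Za-z']+", lyrics):
        key = word.lower()
        total[key] += 1
        if word[0].isupper():
            capitalised[key] += 1
    # Words seen fewer than min_count times are left out as too rare to judge.
    return {w: capitalised[w] / n for w, n in total.items() if n >= min_count}

# Invented sample lines, not from the dataset.
sample = (
    "Mary went down to the river\n"
    "Down by the river Mary cried\n"
    "And the river kept on running\n"
    "Hey Mary, hey"
)
ratios = capitalisation_ratios(sample)
```

Here "mary" is capitalised every time, whereas "down" and "hey" are capitalised only when they begin a line, so a high threshold on the ratio separates them.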
The final difficulty is that some proper nouns are not single words – “New York”, “Good Lord”, “Eleanor Rigby”. Some of these contain ordinary words such as “new” and “good” that are not generally capitalised.
So, how can we extract the proper nouns from song lyrics? After some trial and error, this is the approach I took…
1. Split the lyrics into separate words.
2. Identify groups of consecutive words all starting with a capital letter (“New”, “York”).
3. Join these capitalised groups back together so that they can be treated as one (“New York”).
4. For each word or group of words, count the total number of appearances, both capitalised and ignoring capitalisation.
5. Ignore any words or groups that are not capitalised at least 80% of the time, or that appear fewer than five times in the whole dataset.
6. Ignore any groups with three or more words (none of these seemed to be genuine proper nouns, or the third word was redundant, as in “New York City”).
7. Remove stopwords, and groups containing a stopword.[1]
8. Remove words containing numbers or other unusual characters, or with fewer than three characters.
9. Collapse word groups consisting of a repeated word (“Baby Baby”) into a single word.
10. Delete “’s” (short for “… is”) from the end of any words.
11. Recalculate (as in step 4 above) the number of capitalised and total appearances of each word or group.
12. Ignore any words or groups that appear fewer than five times, or which are capitalised less than 80% of the time.
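The core of the procedure above (grouping runs of capitalised words, counting capitalised versus total appearances, and applying the thresholds) can be sketched roughly as follows. This is a simplified single pass under my own assumptions, not the code used for the analysis: the function names, the tiny `STOPWORDS` set and the sample lines are invented, total appearances are counted by matching the lower-cased term within each line, and the steps that collapse repeated words and strip “’s” are left out.

```python
import re
from collections import Counter

# Tiny stand-in stopword list (an assumption); the article used the Snowball list.
STOPWORDS = {"i", "the", "a", "and", "in", "to", "oh"}

def tokenise(line):
    """Split a line into words, dropping common punctuation."""
    return re.findall(r'[^\s,.!?;:"()]+', line)

def capitalised_groups(line):
    """Join runs of consecutive capitalised words into single terms."""
    terms, run = [], []
    for word in tokenise(line):
        if word[0].isupper():
            run.append(word)
            continue
        if run:
            terms.append(" ".join(run))
            run = []
        terms.append(word)
    if run:
        terms.append(" ".join(run))
    return terms

def count_terms(lines):
    """Count capitalised appearances and appearances ignoring capitalisation."""
    capitalised, total = Counter(), Counter()
    for line in lines:
        capitalised.update(t for t in capitalised_groups(line) if t[0].isupper())
    for term in capitalised:
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        total[term] = sum(
            len(re.findall(pattern, " ".join(tokenise(line)).lower()))
            for line in lines
        )
    return capitalised, total

def likely_proper_nouns(lines, min_count=5, min_share=0.8, max_words=2):
    """Apply the length, stopword, character and threshold filters."""
    capitalised, total = count_terms(lines)
    keep = {}
    for term, n_cap in capitalised.items():
        words = term.split()
        if len(words) > max_words:
            continue
        if any(w.lower() in STOPWORDS for w in words):
            continue
        if len(term) < 3 or not term.replace(" ", "").isalpha():
            continue
        if total[term] < min_count or n_cap / total[term] < min_share:
            continue
        keep[term] = (n_cap, total[term])
    return keep

# Invented sample lines; min_count is lowered because the sample is tiny.
lines = [
    "I left my heart in New York",
    "New York is calling me home",
    "Oh the new moon over New York",
    "Mary said the city never sleeps",
    "Nobody sleeps like Mary in New York",
    "Hey hey, what a good day, Mary",
]
result = likely_proper_nouns(lines, min_count=2)
```

On this sample only “New York” and “Mary” survive: “Hey” is capitalised just half the time, “Nobody” is too rare, and “I” and “Oh” are removed as stopwords. One side effect of the grouping is that a proper noun directly after a capitalised line-opening word is absorbed into a longer group, which lowers its own capitalised count.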
At the end of this process we have a list of words or word groups that are likely to be proper nouns. We can then see how often each capitalised word or group of words appears in the dataset. I did this by counting the number of songs, by decade, in which each term appears.[2]
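Counting songs rather than individual appearances might look something like the sketch below. It assumes each song is available as a `(year, lyrics)` pair, which is a guess at the data layout, and the function name `songs_per_decade` and the sample songs are invented for illustration.

```python
import re
from collections import defaultdict

def songs_per_decade(songs, terms):
    """Return {term: {decade: number of songs containing the term}}."""
    counts = {term: defaultdict(int) for term in terms}
    for year, lyrics in songs:
        decade = (year // 10) * 10
        for term in terms:
            # Case-sensitive whole-word match; each song counts at most once,
            # however many times the term is repeated within it.
            if re.search(r"\b" + re.escape(term) + r"\b", lyrics):
                counts[term][decade] += 1
    return {term: dict(by_decade) for term, by_decade in counts.items()}

# Invented sample songs, not from the dataset.
songs = [
    (1964, "Mary, Mary, where are you going"),
    (1968, "John met Mary in New York"),
    (1977, "New York, New York, New York"),
    (1979, "Maryland skies are blue"),
]
by_decade = songs_per_decade(songs, ["Mary", "New York"])
```

The 1977 song adds only one to the “New York” count despite repeating the name three times, and “Maryland” does not count as a mention of “Mary”.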
As well as names and places, the list includes “Christmas”; titles (“Lord” was the most common capitalised word overall); months (“December” is the only one with a significant number of mentions); weekdays (“Sunday” was most common in every decade, then, in various permutations, “Saturday”, “Friday” and “Monday”); a handful of brands (“Cadillac” and “Gucci”); nationality adjectives (“American”); and the odd rogue word such as “Everytime”, which presumably usually appears at the beginning of a sentence.
It is also worth noting that we have filtered the data down a lot, so the number of songs mentioning even quite common terms can be rather small – especially for 1950s songs, which are not well represented in our dataset. Conclusions drawn from such small counts should therefore be treated with some caution.
The following chart shows the top names by decade, among those appearing in at least five songs (which is why only two are listed in the 1950s). “Jesus” briefly fell behind “Mary” and “John” in the 1960s. “Johnny” overtook “John” in the 1980s and then disappeared. “Satan” emerged in the 1980s, but has since been in decline.
The limited range of first names used in songs is rather shocking. Women are mainly called Mary (in a total of 388 songs in the dataset), and men are John (408), Joe (250) or Johnny (244). Further down the list are Billy (163), Michael (154), Jimmy (144), and a few others, but it is a long way before we get to another female name: Jane (112 songs).
The places mentioned in songs are predominantly American, as the chart below shows, with London the only non-US place to make the top five in any decade. New York has been at the top since the 1970s. Texas has been in decline, whilst Hollywood has grown. Note the absence of the 1950s from this chart – no places were mentioned five or more times among the 1950s songs in the dataset.
New York is mentioned in 506 songs, with the most common non-US city, London, in 208. Paris is in 149 songs, and Rome in 123. Among countries, America is in 296 songs, Mexico in 157, England in 124 and France in 98.
After the elaborate process to capture multi-word proper nouns, we did successfully find “New York”, “New Orleans”, “Santa Claus”, “Christmas Eve” and “Los Angeles” in the top 150 capitalised terms overall.
Names, places and other proper nouns are a special class of words that can be identified reasonably easily in a text. Finding other word classes (verbs, adjectives, etc.) is much more difficult, and will be the subject of a later article in this series.
Now I’m off to write a sure-fire best-seller about Mary and John’s trip to New York on a Sunday. In December. Maybe in a Cadillac…
[1] I used the relatively short ‘Snowball’ stopword list, in order to retain terms like “new” and “good”. See this article for further discussion of stopwords.
[2] I could have counted the number of appearances of each term, but that would be biased in favour of the most repetitive songs. See this article for more on repetition.