This is the sixth in a series of articles looking at different ways of analysing a dataset of song lyrics. In this article we will be venturing into hyperspace to explore the differences and similarities between artists, in terms of the words they use in their songs.
To keep things manageable, I will just work with the songs by the 44 artists that are compared towards the end of this previous article. I will only consider the 50 most common words among the 1,150 songs by these artists, after removing stopwords (see here for details). The top 50 words are (in descending order):
love, baby, yeah, time, heart, gonna, day, life, night, world, feel, girl, hear, home, wanna, ooh, hey, mind, eyes, da, ah, call, light, people, dream, sweet, la, till, stop, head, hold, tonight, leave, play, blue, mine, lonely, town, sun, true, hand, left, live, money, happy, hard, inside, morning, gotta, remember
For each artist, we combine all of their songs, ignore words other than these 50, and calculate a list of 50 numbers, representing the proportion accounted for by each word.
For example, among ABBA songs, there are 114 occurrences of words from the list above. “Love” appears twice, “baby” three times, “yeah” zero times, “time” six times, and so on. Dividing these by 114 gives a string of 50 numbers, starting
0.0175, 0.0263, 0.0000, 0.0526, .... These 50 numbers can be regarded as defining a point for ABBA songs in 50-dimensional space (in the same way that two numbers, latitude and longitude, define a point on the 2-dimensional surface of the Earth).
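As a minimal sketch of this calculation (in Python rather than the R used for this article, and using only the four ABBA counts quoted above; the remaining 46 counts are not reproduced here), the proportions could be computed like this:

```python
# Counts quoted in the text for the first four of the top 50 words in ABBA songs.
abba_counts = {"love": 2, "baby": 3, "yeah": 0, "time": 6}

# Total occurrences of all 50 listed words in ABBA songs, as stated in the text.
total_top50_occurrences = 114

# Each proportion is the word's count divided by the total for the artist.
proportions = {
    word: count / total_top50_occurrences
    for word, count in abba_counts.items()
}

for word, proportion in proportions.items():
    print(f"{word}: {proportion:.4f}")
# love: 0.0175
# baby: 0.0263
# yeah: 0.0000
# time: 0.0526
```

With all 50 counts included, the proportions for an artist would sum to 1.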
So we now have 44 points (one for each artist) in 50-dimensional space. The “axes” we have used to define these points are the relative frequencies of the top 50 words, but these are not necessarily the most useful coordinates to choose. We are free to choose any set of axes to specify the points – in particular we can take our starting axes and move or rotate them in any direction. Changing the axes does not change the 44 points, just the way that we represent them (in the same way, we might represent a point on Earth as lat/long coordinates, an Ordnance Survey grid reference, or a ‘google tiles’ location).
Principal Component Analysis (PCA) is a mathematical technique to find a set of axes in hyperspace such that the first axis (or component) is the direction that represents the greatest variability in the data, the second component (at right angles to the first) reflects the largest variation after removing the first component, the third (at right angles to the first and second) represents the next greatest source of variability, etc. This is often done to reduce the number of variables in a complex statistical analysis: rather than 50 dimensions, we might find that, say, the first five principal components account for 95% of the variability in the data, in which case we could choose to simply use those components and work in five dimensions rather than fifty (accepting the 5% residual variation as a reasonable price to pay for this simpler model).
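To make the idea concrete, here is a rough sketch of PCA using a singular value decomposition. It is written in Python with NumPy, not the R code behind this article, and it runs on a small synthetic matrix rather than the real lyrics data. The function name `pca` and the choice to centre the columns without rescaling them are assumptions made for illustration only.

```python
import numpy as np

def pca(scores):
    """PCA of a matrix with one row per point and one column per original axis.

    Returns the new axes (one per row), the share of variance each explains,
    and the coordinates of the points on the new axes.
    """
    # Move the origin to the centre of the points.
    centred = scores - scores.mean(axis=0)
    # The rows of vt are mutually perpendicular directions, ordered from
    # most to least variability.
    u, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    variance = singular_values ** 2 / (scores.shape[0] - 1)
    explained_ratio = variance / variance.sum()
    # Coordinates of each point measured along the new axes.
    projected = centred @ vt.T
    return vt, explained_ratio, projected

# Synthetic stand-in data (NOT the lyrics data): 6 "artists" by 4 "words",
# with each row rescaled to sum to 1 like the word proportions.
rng = np.random.default_rng(0)
raw = rng.random((6, 4))
scores = raw / raw.sum(axis=1, keepdims=True)

components, explained_ratio, projected = pca(scores)
print(np.round(explained_ratio, 3))  # shares of variance, largest first
```

Keeping only the first few columns of `projected` is what it means to work in fewer dimensions.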
Imagining all this going on in 50 dimensions can make your head hurt, unless you are a mathematician, so let’s continue with the analysis and see what happens. [1]
The chart below plots the artists’ points in hyperspace against the first two principal components. These respectively account for 10% and 8% of the variation between the artists, in terms of their use of the top 50 words.
There are some interesting differences and similarities here, although it is not really clear on what basis we are making the comparison. To see that, we need to look at how each component is defined as a combination of the scores for individual words. This is illustrated in the following chart:
This chart shows the make-up of the first two principal components in terms of the top 50 words. Thus the first component (on the horizontal axis) is calculated as -0.59 times the score for “stop”, plus -0.52 times “feel”, … [2], plus 0.54 times “morning”, plus 0.57 times “night”. The second component (on the vertical axis) is -0.69 times “love”, plus -0.47 times “tonight”, … [3], plus 0.53 times “play”.
Linking this back to the previous chart, we conclude, for example, that Blondie tends to use more “love / tonight / lonely / baby” words, and fewer “play / hard / money / mind” words, whereas the opposite is true of Madness and the other artists at the top of the chart.
Sometimes it is possible to discern some sort of spectrum of meaning in these components, between the words that appear with negative and positive weights. In this case, it is hard to describe such a spectrum – the first component perhaps has hard descriptive words on the right (especially to do with times of day), but less meaningful ‘filler’ words on the left, such as “hey”, “la”, and “yeah”. The second component perhaps has words at the top more associated with day-to-day life, with those at the bottom more to do with moods and feelings. But these are just my impressions, and there are plenty of exceptions and room for disagreement.
If we look at the third and fourth components (explaining 8% and 6% of the variation respectively), things do not get any clearer:
The definitions of the third and fourth components are as follows:
Each of these dimensions has a significant artist outlier (Madness and ABBA respectively). So we might conclude that ABBA’s lyrics are unusually high on words like “happy” and “world”, at the expense of words like “blue” and “light”, and that Madness is high on “call” and “remember” and low on “true” and “heart”. We could conclude that, but it does not really get us very far.
Although they have their uses, PCA techniques can be hard to interpret meaningfully with text data. The best we can hope for, perhaps, is that they might suggest unexpected differences and similarities that can then be investigated further.
In some PCA applications, most of the variability can be explained by just a few principal components. In this case, however, the first component only explains 10%, the first four account for less than one third, and it takes 22 components to explain 90% of the variation. [4] It is therefore not surprising that we do not get very clear answers by just considering the first four components.
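The counting involved here can be sketched as follows. This is again Python with NumPy and random stand-in data, so it will not reproduce the figure of 22; the helper names `explained_variance_ratio` and `components_needed` are hypothetical. It also checks the point that 44 artists yield at most 43 components carrying any variation.

```python
import numpy as np

def explained_variance_ratio(scores):
    """Share of total variance carried by each principal component."""
    centred = scores - scores.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()

def components_needed(ratio, threshold):
    """Smallest number of leading components whose combined share reaches threshold."""
    cumulative = np.cumsum(ratio)
    # searchsorted gives the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold)) + 1

# Random stand-in for the 44 artists by 50 words matrix (NOT the real data).
rng = np.random.default_rng(0)
scores = rng.random((44, 50))

ratio = explained_variance_ratio(scores)

# Centring 44 points leaves at most 43 independent directions of variation.
usable_components = int(np.linalg.matrix_rank(scores - scores.mean(axis=0)))
print(usable_components)  # 43

print(components_needed(ratio, 0.90))  # depends on the data
```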
We can measure the closeness of artists or words by calculating the correlation between their scores. [5] The charts below turn these scores into dendrograms (as in this previous post), where the more correlated words or artists are, the further to the left they are linked.
So, for example, the Rolling Stones and Aretha Franklin tend to use similar patterns of words in their songs, as do Elton John / Bee Gees, and Snoop Dogg / Busta Rhymes, although there is little similarity between these three pairs (i.e. they are not closely connected to each other on the dendrogram). There are a few unsurprising links, and one or two interesting ones (Bowie and Dylan, for example). ABBA and Sparks do not connect to anybody else (other than on the negative side of the dashed zero line). Although the charts above showing these artists on the first four principal components are not inconsistent with these patterns, they do not tell the story with the same clarity.
It is similar with the top 50 words dendrogram: a few obvious links, some more interesting ones, and consistency (if in places tenuous) with the previous charts.
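The correlation measure behind both dendrograms can be sketched like this, in Python with NumPy and on made-up scores. The artist and word labels are placeholders and `most_similar_pair` is a hypothetical helper. Grouping the results into a dendrogram would need a further hierarchical clustering step, which is not shown because the article does not say which linkage method was used.

```python
import numpy as np

# Placeholder labels and random stand-in scores (NOT the real data):
# rows correspond to artists, columns to words.
artists = ["artist_a", "artist_b", "artist_c", "artist_d", "artist_e"]
words = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8"]
rng = np.random.default_rng(1)
scores = rng.random((len(artists), len(words)))

# Correlations between rows measure similarity between artists.
artist_corr = np.corrcoef(scores)
# Correlations between columns measure similarity between words.
word_corr = np.corrcoef(scores, rowvar=False)

def most_similar_pair(corr, labels):
    """Return the two distinct labels with the highest correlation, and that value."""
    masked = corr.copy()
    np.fill_diagonal(masked, -np.inf)  # ignore each item's correlation with itself
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return labels[i], labels[j], float(corr[i, j])

print(most_similar_pair(artist_corr, artists))
print(most_similar_pair(word_corr, words))
```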
Of course, this analysis has only looked at a limited number of artists and just the top 50 words. Maybe it would be more meaningful to compare songs rather than artists, or to use a longer or shorter list of words. Or maybe not! It is often worth trying different analytical techniques just to see whether they throw up any interesting patterns. In this case, it is arguable whether we have learnt much, but that in itself is interesting – songs are indeed as diverse and multidimensional as we always suspected, and do not readily fall into well-defined clusters (despite what the ‘genre’ labels in record shops and automated playlists would have us believe).
- There are several ways of doing PCA in R. For this article I have used the
- …plus 46 more terms…
- …plus 46 more terms…
- As an aside, a total of 43 components emerge from this PCA. That is because we do not need the full 50 dimensions to describe just 44 points (in the same way that you only need two coordinates to describe a point on a piece of paper, even though it exists in three-dimensional space). One further dimension is lost because variability is measured relative to the centre of the points, which leaves only 43 directions in which 44 points can vary. In the same way, only two directions are needed to measure the spread of three points on a sheet of paper.
- If we have a matrix of scores, as defined above, with rows corresponding to artists, and columns to words, then we can calculate correlations between rows to measure similarity between artists, and correlations between columns to measure similarity between words.