In this article about the Eighteenth Century London Concerts dataset, I will look in more detail at the prices of concert tickets.
The published data has multiple prices and categories listed in single cells of the spreadsheet. This needs to be parsed before it can be used for statistical analysis.
In the first article in this series, we considered several options for doing this, including converting it to a “long” format (with one ticket price per row, cross-referenced to the concert), or extracting summary ticket data and including this in new columns (with one row per concert). I have chosen the latter option here, as there are some complications in parsing the data which mean that summary data is fine, and losing some of the detail is not a problem. Retaining the one-row-per-concert format makes it easier to analyse the data alongside other variables such as location and date.
After a few intermediate steps, I derived TopSingleTicket and bottomSingleTicket prices for each concert, being the top and bottom prices for one person (perhaps as part of a larger group). These were based on the prices and ticket multiples information in the Price column, and sometimes in the Notes field as well.1 This involved some manual checking and highlighted a few anomalies, such as tickets valid for multiple concerts in a series, or covering different combinations of people, expressed in a variety of ways.2
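As a rough illustration of this kind of parsing (not the procedure actually used for the dataset), the sketch below converts price strings to shillings and picks out the highest and lowest. The input format (prices such as "10s 6d", separated by semicolons) and the helper name to_shillings are assumptions made for the example. The real Price column is messier and also involves ticket multiples, which are ignored here.

```r
# Hypothetical helper: convert one price such as "10s 6d", "5s" or "6d" to shillings.
to_shillings <- function(price) {
  # Digits immediately followed by "s" (shillings) or "d" (pence), if present.
  s <- regmatches(price, regexpr("[0-9]+(?=s)", price, perl = TRUE))
  d <- regmatches(price, regexpr("[0-9]+(?=d)", price, perl = TRUE))
  shillings <- if (length(s) == 1) as.numeric(s) else 0
  pence     <- if (length(d) == 1) as.numeric(d) else 0
  shillings + pence / 12          # 12 pence to the shilling
}

# Made-up cell contents: several prices in one cell, separated by ";".
cell   <- "10s 6d; 5s; 2s 6d"
prices <- vapply(trimws(strsplit(cell, ";")[[1]]), to_shillings, numeric(1))

top    <- max(prices)   # highest single price in the cell
bottom <- min(prices)   # lowest single price in the cell
```

Anything that does not match the assumed pattern would come out as zero, so a real version would need to flag such cells for the kind of manual checking described above.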
The following chart shows the distribution of top and bottom ticket prices. They tend to fall on whole- or half-shilling points. The large central peak is at 10½s (half a guinea), with another small peak at 21s. There are a handful of higher values, but I suspect most of these are erroneous (probably prices for series of concerts, but not specified as such).
I have also grouped the prices into bands – “0-2” (2 shillings or less), “2-5”, “5-11”, and “over11”. The following table shows the number of concerts in each band, by venue type (excluding concerts with no price information).
[Table: number of concerts by venue type and price band, for Bottom Price (shillings) and Top Price (shillings)]
We can see, for example, that Garden concerts were cheap, with a flat price for everyone, mostly under 2s, and very rarely higher than 5s. Taverns were also inexpensive and, like Gardens, mostly had a single price band (the Top and Bottom sides of the table have similar figures). There were relatively few concerts in Churches, but they were the most common venues in the highest price band (24 of the 26 were at Westminster Abbey). Theatres were the only venues (apart from one or two Halls) that offered a wide variety of prices (for Stalls, Gallery, Boxes, etc), allowing them to cater for both ends of the market, with prices typically ranging from 2s or less, up to half a guinea.
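The banding and cross-tabulation described above can be sketched in R with cut() and table(). The data here are made up, and I have assumed that each band includes its upper limit (so a price of exactly 2s falls in "0-2", in line with "2 shillings or less").

```r
# Made-up example data; the column names follow those used in this article.
concerts <- data.frame(
  PlaceType       = c("Garden", "Garden", "Tavern", "Theatre", "Church", "Hall"),
  TopSingleTicket = c(1, 2.5, 5, 10.5, 21, 10.5)
)

# Right-closed intervals: (0,2], (2,5], (5,11], (11,Inf).
concerts$TopBand <- cut(
  concerts$TopSingleTicket,
  breaks = c(0, 2, 5, 11, Inf),
  labels = c("0-2", "2-5", "5-11", "over11"),
  include.lowest = TRUE
)

# Concerts per band by venue type; table() leaves out missing prices by default.
band_table <- table(concerts$PlaceType, concerts$TopBand)
print(band_table)
```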
The following chart shows average prices broken down by decade. Each venue type is shown as a coloured “ribbon” indicating the average top and bottom prices for each decade. The circular dots marking the edges of the ribbons are proportional in area to the number of concerts in each category – an indicator of how much faith we should give the averages. So Halls (in green) and Theatres (magenta) have plenty of concerts in each decade, whereas Churches (red) and Taverns (blue) have fewer, so the averages for these should be treated with more caution. In fact, all points have at least 16 concerts, apart from Houses (cyan) in the 1750s and 1790s, which each have just two.
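The figures behind a chart like this, average top and bottom prices plus a concert count for each decade and venue type, could be put together along the following lines. This is a minimal sketch on made-up data, not the code used for the chart.

```r
# Made-up example data; the column names follow those used in this article.
concerts <- data.frame(
  Year               = c(1752, 1758, 1763, 1767, 1751, 1769),
  PlaceType          = c("Hall", "Hall", "Hall", "Hall", "Theatre", "Theatre"),
  TopSingleTicket    = c(10.5, 10.5, 10.5, 5, 10.5, 5),
  bottomSingleTicket = c(5, 10.5, 10.5, 5, 2, 1)
)

# Decade as a number, e.g. 1758 -> 1750.
concerts$Decade <- 10 * (concerts$Year %/% 10)

# Average top and bottom price for each decade and venue type.
avg <- aggregate(cbind(TopSingleTicket, bottomSingleTicket) ~ Decade + PlaceType,
                 data = concerts, FUN = mean)

# Number of concerts behind each average (the size of the dots).
n <- aggregate(TopSingleTicket ~ Decade + PlaceType, data = concerts, FUN = length)
names(n)[3] <- "Concerts"

summary_by_decade <- merge(avg, n)
print(summary_by_decade)
```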
This demonstrates the wide price differentials that Theatres could offer. Both the top and bottom prices for concerts at Theatres fell over the half-century. This is in contrast to the other types of venue, where prices tended to rise or stay about the same (bearing in mind the small number of concerts in some of these averages).
So ticket prices seem to depend on venue type and, perhaps, time. However, there are lots of other factors that might affect prices, and it is hard to unpick quite what is going on.
One way of examining the different factors influencing ticket prices is to fit a linear model to the data. That is, we try to fit a formula of the form
Price = A + B*(Year-1750) + C*(PlaceType) + D*(StartTime) + ...
I have added StartTime as a possible factor, and the ... indicates other variables that we might include. The coefficients A, B, C, D etc give us an indication of how these factors affected prices. In the formula, terms like C*(PlaceType) do not make literal sense, since PlaceType is a categorical variable ("Church", "Tavern", etc) rather than a number. We need to treat PlaceType as six different variables (Church, Tavern, etc), each of which can be 0 (FALSE) or 1 (TRUE). These are then numerical variables, and we can find separate coefficients C_Church, C_Tavern, etc for them.
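To see how R turns a categorical variable into 0/1 columns, model.matrix() can be applied to a small made-up factor. With R's default settings the first category gets no column of its own. It becomes the baseline, absorbed into the constant term, so a variable with six categories produces five 0/1 columns.

```r
# Made-up venue types, one per concert.
PlaceType <- factor(c("Church", "Tavern", "Garden", "Tavern"))

# model.matrix() shows the columns that lm() would build from the factor.
dummies <- model.matrix(~ PlaceType)
print(dummies)
```

Each row has a 1 in the intercept column and at most one further 1, marking its venue type. A Church concert has zeros in every PlaceType column.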
The mathematics of fitting a linear model is straightforward if a little messy. It is very easy in R: you can just use the built-in lm function. Let’s fit a linear model to the top ticket prices. Here is the summary output…
Call:
lm(formula = TopSingleTicket ~ I(Year - 1750) + PlaceType + Weekday +
    StartTime, data = ConcCal)
Residuals:
Min 1Q Median 3Q Max
-8.583 -1.666 0.301 1.621 54.070
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.538301 0.396832 29.076 < 2e-16 ***
I(Year - 1750) -0.027514 0.005529 -4.976 6.86e-07 ***
PlaceTypeGarden -8.074205 0.464360 -17.388 < 2e-16 ***
PlaceTypeHall -1.569962 0.427620 -3.671 0.000246 ***
PlaceTypeHouse -1.277468 0.532805 -2.398 0.016564 *
PlaceTypeTavern -4.610204 0.491510 -9.380 < 2e-16 ***
PlaceTypeTheatre -1.173858 0.453909 -2.586 0.009755 **
WeekdayTue -1.020369 0.254254 -4.013 6.14e-05 ***
WeekdayWed -0.846868 0.194715 -4.349 1.41e-05 ***
WeekdayThu -0.606057 0.207291 -2.924 0.003485 **
WeekdayFri -0.340123 0.201690 -1.686 0.091832 .
WeekdaySat 0.256817 0.345830 0.743 0.457777
StartTimeEarlyEve -0.595058 0.361949 -1.644 0.100276
StartTimeEvening -0.900567 0.324858 -2.772 0.005603 **
StartTimeLateEve 0.979616 0.347626 2.818 0.004865 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.195 on 2923 degrees of freedom
(1063 observations deleted due to missingness)
Multiple R-squared: 0.3503, Adjusted R-squared: 0.3472
F-statistic: 112.6 on 14 and 2923 DF, p-value: < 2.2e-16
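Output of this kind comes from calling summary() on a fitted lm object. The sketch below uses the same formula, but on a small made-up data frame (the real ConcCal data are not reproduced here, and the invented prices are only there to make the example run), so its numbers will not match those above.

```r
set.seed(1)                       # reproducible made-up data
n <- 200
ConcCal <- data.frame(
  Year      = sample(1750:1799, n, replace = TRUE),
  PlaceType = factor(sample(c("Church", "Garden", "Hall"), n, replace = TRUE)),
  Weekday   = factor(sample(c("Mon", "Tue", "Wed"), n, replace = TRUE)),
  StartTime = factor(sample(c("Daytime", "Evening"), n, replace = TRUE))
)

# Invented prices: a base price, a small yearly fall, cheaper Gardens, plus noise.
ConcCal$TopSingleTicket <- 11 - 0.03 * (ConcCal$Year - 1750) -
  8 * (ConcCal$PlaceType == "Garden") + rnorm(n, sd = 2)

model <- lm(TopSingleTicket ~ I(Year - 1750) + PlaceType + Weekday + StartTime,
            data = ConcCal)
summary(model)                    # prints the Call, Residuals and Coefficients blocks
```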
The first item “Call” is a reminder of what the lm function was asked to do. In this case it used data from a dataset called ConcCal to fit a formula for TopSingleTicket as a linear combination of Year, PlaceType, Weekday and StartTime, bearing in mind that the last three of these are categorical variables, so each option for each variable will have its own coefficient.
The next item “Residuals” gives the distribution of the differences between the actual values and those predicted by the linear model. This indicates how well the model fits the data. The first and third quartiles are about 1.6 shillings out, and the extremes are very wide, so this is a warning that the fit is not going to explain everything.
Jumping to the final block, we get some more information about the model. In particular, the “Multiple R-squared” value is the proportion of variability that is explained by the model. In this case, it is just 35% – another warning not to take the results too seriously. Also of interest is the fact that 1,063 concerts were deleted due to having missing values in one of the variables. So over 25% of the data is effectively being ignored.
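The share of concerts left out can be worked out from the figures in the summary itself: the residual degrees of freedom plus the number of fitted coefficients gives the number of concerts actually used.

```r
residual_df  <- 2923   # "on 2923 degrees of freedom"
coefficients <- 15     # the intercept plus 14 other rows in the table
dropped      <- 1063   # "observations deleted due to missingness"

used  <- residual_df + coefficients   # concerts actually used in the fit
total <- used + dropped               # all concerts in the dataset
share_dropped <- dropped / total

round(100 * share_dropped, 1)         # percentage of concerts ignored
```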
The “Coefficients” section gives the parameters of the linear model itself. For each variable there are four columns, the first being the estimated value of the coefficient, and the others indicating how statistically significant it is. The final column, with dots and stars, is a summary: the more stars, the more significant the result. With no stars, there is little evidence that the coefficient is different from zero (i.e. little evidence that the variable has any effect).
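The four columns are related in a simple way, which can be checked by hand for one row. The sketch below takes the I(Year - 1750) row from the table above: the t value is the estimate divided by its standard error, and the final figure is the two-sided tail probability of that t value. Because the inputs are rounded, the results will only agree with the table to a few significant figures.

```r
estimate  <- -0.027514   # "Estimate" for I(Year - 1750)
std_error <-  0.005529   # "Std. Error"
df        <-  2923       # residual degrees of freedom

t_value <- estimate / std_error          # the "t value" column
p_value <- 2 * pt(-abs(t_value), df)     # the two-sided "Pr(>|t|)" column

c(t = t_value, p = p_value)
```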
The first value “Intercept” is the constant term in the model (A in our formula above). It can be interpreted as the ticket price for (Year - 1750) = 0, and the first value of each categorical variable. For ordered variables (such as weekdays) these are in order, but otherwise categories are ordered alphabetically.3 So the intercept of 11.54 is the expected ticket price (in shillings) of a concert in 1750, at a Church, on a Monday, during the Daytime.
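Which category ends up as the baseline depends on the order of the factor's levels. The short sketch below, on made-up values, shows the alphabetical default, an explicit calendar order, and relevel() for choosing a different baseline. Note that a plain factor with explicit levels is enough for this. A factor created with ordered = TRUE is, by default, given polynomial contrasts by lm, which would produce differently named coefficients from the WeekdayTue, WeekdayWed rows shown above.

```r
days <- c("Wed", "Mon", "Fri", "Tue")

# Default: levels are sorted alphabetically, so "Fri" would be the baseline.
levels(factor(days))

# Explicit levels put the days in calendar order, with Monday as the baseline.
weekday <- factor(days, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
levels(weekday)

# relevel() moves a chosen category to the front, making it the baseline.
levels(relevel(weekday, ref = "Sat"))
```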
The next value “I(Year-1750)” is the average yearly change in price.4 The coefficient of -0.0275 means that, over the 50 year period, expected top ticket prices fell by 50*0.0275=1.375, or about 1.4 shillings.
The following rows give the coefficients for the various categorical variables, relative to the “Intercept” values (Church/Monday/Daytime). So, for example, concerts in Gardens are over 8s cheaper than in Churches; concerts on Tuesdays are a shilling cheaper than on Mondays; and Late Evening (i.e. after 8pm) concerts are a shilling more expensive than Daytime concerts. The coefficients without stars in the final column are not significantly different from the base values, so, for example, we can’t be confident that Saturday ticket prices differ from those on a Monday.
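As a worked example of how the coefficients combine, here is the model's expected top price for one hypothetical concert: at a Tavern, on a Tuesday Evening, in 1780. The coefficients are copied from the table above, and the choice of concert is mine.

```r
# Coefficients copied from the summary table (in shillings).
intercept <- 11.538301   # Church, Monday, Daytime, 1750
per_year  <- -0.027514   # I(Year - 1750)
tavern    <- -4.610204   # PlaceTypeTavern
tuesday   <- -1.020369   # WeekdayTue
evening   <- -0.900567   # StartTimeEvening

# Start from the baseline and add the adjustment for each characteristic.
expected <- intercept + per_year * (1780 - 1750) + tavern + tuesday + evening
round(expected, 2)
```

This comes to a little over 4 shillings, against about 11.5 shillings for the baseline concert.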
Looking at the table of coefficients as a whole, we can conclude that
- top prices fell by about 1.4s over the half-century
- Gardens were, on average, about 3.5s cheaper than Taverns, which were about 3s cheaper than Halls, Houses and Theatres, which were 1-1.5s cheaper than Churches.
- Saturday and Monday were the most expensive days, and Tuesday and Wednesday the cheapest.
- Late Evening (after 8pm) concerts were the most expensive, and Evening (7pm-8pm) the cheapest.
If we run a similar model with bottom ticket prices we see a similar pattern, except that Theatres, as we have seen, offered low prices (in between Gardens and Taverns). The annual change in bottom prices is smaller, equating to a drop of about half a shilling over the 50-year period.
So, do we now understand the factors determining ticket prices? Partly, perhaps. But this model only explains 35% of the variation, and ignores 25% of the data. It also only considers the factors that we chose to include. We could just have looked at “Year” as a variable, or we could have included others such as concert type, the size of the musical forces, the month of the year, whether it included vocal music, or the venue’s distance from the centre of London. Any of these might affect ticket prices at least as much as those we considered above.
And adding in extra variables can affect the coefficients of the others. For example, if we add Concert Type as a variable, Venue Type becomes less significant. Garden venues lose their statistical significance, but the coefficient for the “Garden Series” concert type appears as significant.
Another problem with adding variables is that more concerts get ignored due to missing data. The model below includes DistFromCentre (in km from Leicester Square), ConcertType (called Type in the formula), Month, GenreSize (small/medium/large forces), and GenreVocal (singers or not), and it now explains over 57% of the variation, but ignores almost 40% of concerts. The “Intercept” in this case refers to the following values of the categorical variables: Church / CB (benefit concert) / Monday / Daytime / January / L (large forces) / FALSE (no voices).
Call:
lm(formula = TopSingleTicket ~ I(Year - 1750) + DistFromCentre +
    PlaceType + Type + Weekday + StartTime + Month + GenreSize +
    GenreVocal, data = ConcCal)
Residuals:
Min 1Q Median 3Q Max
-6.2057 -1.1292 0.3297 1.4903 15.7267
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.941762 0.497281 19.992 < 2e-16 ***
I(Year - 1750) -0.047538 0.004589 -10.359 < 2e-16 ***
DistFromCentre -0.184582 0.063635 -2.901 0.00376 **
PlaceTypeGarden -3.292041 2.406590 -1.368 0.17146
PlaceTypeHall 0.151957 0.396318 0.383 0.70144
PlaceTypeHouse 1.687725 0.584074 2.890 0.00389 **
PlaceTypeTavern -2.240284 0.449739 -4.981 6.77e-07 ***
PlaceTypeTheatre 0.003689 0.419244 0.009 0.99298
TypeCS -0.991361 0.176621 -5.613 2.22e-08 ***
TypeGB -1.223688 2.369517 -0.516 0.60560
TypeGS -3.598322 2.346960 -1.533 0.12536
TypeMI -2.290925 0.481918 -4.754 2.11e-06 ***
TypeOB -0.320590 0.250162 -1.282 0.20013
TypeOS 0.479985 0.236448 2.030 0.04247 *
TypeRM -2.998489 0.402407 -7.451 1.28e-13 ***
TypeSOC -4.150005 2.286002 -1.815 0.06959 .
WeekdayTue -0.820406 0.200182 -4.098 4.30e-05 ***
WeekdayWed -0.315386 0.165298 -1.908 0.05651 .
WeekdayThu -0.290850 0.160775 -1.809 0.07057 .
WeekdayFri -0.164692 0.162431 -1.014 0.31072
WeekdaySat 0.425065 0.274138 1.551 0.12114
StartTimeEarlyEve -1.908241 0.347916 -5.485 4.57e-08 ***
StartTimeEvening -1.734394 0.323825 -5.356 9.32e-08 ***
StartTimeLateEve 0.441283 0.356398 1.238 0.21577
MonthFeb 0.679255 0.290619 2.337 0.01951 *
MonthMar 0.623496 0.284292 2.193 0.02839 *
MonthApr 0.896132 0.287514 3.117 0.00185 **
MonthMay 1.485169 0.303660 4.891 1.07e-06 ***
MonthJun 1.638891 0.355560 4.609 4.25e-06 ***
MonthJul 0.124481 0.447573 0.278 0.78094
MonthAug 0.445536 0.425845 1.046 0.29556
MonthSep 0.133247 0.542066 0.246 0.80585
MonthOct -0.321810 0.979900 -0.328 0.74263
MonthNov 0.333242 0.743118 0.448 0.65388
MonthDec -1.234283 0.519775 -2.375 0.01764 *
GenreSizeM -0.883656 0.314069 -2.814 0.00494 **
GenreSizeS -1.320767 0.233800 -5.649 1.80e-08 ***
GenreVocalTRUE 0.991685 0.155974 6.358 2.44e-10 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.269 on 2395 degrees of freedom
(1568 observations deleted due to missingness)
Multiple R-squared: 0.5778, Adjusted R-squared: 0.5713
F-statistic: 88.6 on 37 and 2395 DF, p-value: < 2.2e-16
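The loss of concerts as variables are added follows from the way lm handles missing values: by default it drops any row with a missing value in any variable used by the model. A small made-up example shows how quickly the usable rows shrink.

```r
# Made-up data with scattered missing values (NA).
concerts <- data.frame(
  TopSingleTicket = c(10.5, 5, NA, 2, 10.5, 21),
  PlaceType       = c("Hall", NA, "Garden", "Garden", "Theatre", "Church"),
  StartTime       = c("Evening", "Evening", "Daytime", NA, "Evening", "Daytime"),
  Month           = c("Jan", "Feb", "Mar", "Apr", NA, "May")
)

# Rows usable by a model with few variables, and by one using every column.
small_model_rows <- sum(complete.cases(concerts[, c("TopSingleTicket", "PlaceType")]))
large_model_rows <- sum(complete.cases(concerts))

c(small = small_model_rows, large = large_model_rows)
```

Here four of the six rows are usable with two variables, but only two once every column is required.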
Adding the extra variables has substantially changed the model, especially the significance levels. For example, neither Garden venues nor Garden-related concert types (GB and GS) now have coefficients with any great statistical significance. Perhaps the model has found other variables that capture the pricing of garden concerts more accurately. Or perhaps it has ‘traded out’ garden concerts for a better fit on something else. It is hard to say without quite a lot of extra analysis.
The new model has also increased our estimate of the fall in top prices to around 2.4s over the half-century (a shilling more than in our first model).
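The two estimates of the fall over the half-century can be set side by side, together with a rough margin of error based on the standard errors in the two tables. The margin uses a normal approximation (1.96 standard errors), which is my addition rather than part of the analysis above, and the two models were fitted to different subsets of concerts, so this is only a rough guide.

```r
years <- 50
est <- c(first = -0.027514, expanded = -0.047538)   # yearly change, from the tables
se  <- c(first =  0.005529, expanded =  0.004589)   # their standard errors

change <- years * est            # total change over the half-century (shillings)
margin <- years * 1.96 * se      # rough 95% margin (normal approximation)

round(rbind(lower = change - margin, change = change, upper = change + margin), 2)
```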
As we might hope, the expanded model reveals some new patterns, such as that April/May/June is the most expensive season, with December the cheapest, or that larger musical forces and the inclusion of voices tend to correspond to higher ticket prices. Prices also fall by 0.2s for each km of distance from Leicester Square – which could perhaps be one of the alternative indicators of Garden concerts.
Some of the difficulties with this sort of modelling can be managed, at least to some extent. There are ways, for example, of working with partial data rather than simply ignoring incomplete records. There are other types of model – non-linear models, or more general linear models with “cross-terms” (such as separate “Year” coefficients for each Venue type) – which might work better. And there are other approaches, such as Random Forests, Neural Networks and a host of others.5
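As an illustration of the “cross-terms” idea, the sketch below fits a separate Year coefficient for each venue type by using * in the formula. The data are invented (prices flat at Halls and falling at Theatres) purely to show the mechanics, and do not come from the concert dataset.

```r
set.seed(2)                       # reproducible made-up data
n <- 300
concerts <- data.frame(
  Year      = sample(1750:1799, n, replace = TRUE),
  PlaceType = factor(sample(c("Hall", "Theatre"), n, replace = TRUE))
)

# Invented prices: flat at Halls, falling by 0.1s a year at Theatres, plus noise.
concerts$TopSingleTicket <- 10 -
  0.1 * (concerts$Year - 1750) * (concerts$PlaceType == "Theatre") +
  rnorm(n, sd = 1)

# "*" asks for both main effects and their interaction (a slope per venue type).
model <- lm(TopSingleTicket ~ I(Year - 1750) * PlaceType, data = concerts)
round(coef(model), 3)
```

The coefficient whose name contains a colon is the interaction: the difference between the yearly change at Theatres and the yearly change at the baseline venue type (Halls).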
The problem remains, however, that all such models are complex, hard to interpret, and often quite sensitive to small changes in the structure or assumptions. In most cases, they are most useful for highlighting possible patterns of interest, and raising questions and topics for closer investigation.
- Obviously these two values are equal for concerts with just one ticket price.
- It was not uncommon, for example, to list a single price to cover either two gentlemen, or one gentleman and two ladies.
- Actually, categorical variables are only kept in a meaningful order if their factor levels are set explicitly (for example as “ordered factors”); otherwise the levels are sorted alphabetically.
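- The I(...) notation here is a way of avoiding the different conventions of R’s symbolic formula notation, as used in the model specification of the lm function. Inside I(...), the minus sign is treated as ordinary subtraction rather than as a formula operator.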
- Random Forests are discussed further, in a different context, in this previous article.