Often in statistical analysis we need to select things at random. For example, if it is impractical to work with a complete dataset, the only option might be to use a random sample. The science of statistics tells us how to analyse a sample in order to reach conclusions about the entire dataset, and gives us ways to calculate margins of error based on the size of the sample. But I digress.
So, how might we pick a random composer?
Firstly, we should clarify what we mean by ‘random’. For sampling, the usual approach is to select elements in such a way that each is equally likely to be chosen. This helps to ensure that the sample is representative of the dataset as a whole. So we will try to find a way of ensuring that each composer has the same chance of being picked.
This raises another question. What is the population of composers that we are selecting from? Do we want one of the names that appears on a list of ‘top fifty composers’, or should we include anybody who has ever jotted down a tune? Do we want to restrict it in some way, perhaps by date or nationality? In practice, we will usually be sampling from a particular dataset, so we will take that to be the population we are selecting from.
So we have a dataset of composers, from which we want to pick a name at random. How to do so will depend on the dataset. Here are a few examples…
- a list of composers, one entry per line
- a biographical dictionary of composers
- a catalogue of published or recorded music, ordered alphabetically by composer
- a chronology of concerts, with details of works and their composers, ordered by date
- an online library catalogue of printed music
The first case is straightforward. If we know how long the list is (either the total number of names, or just the number of pages it occupies), then we can choose a composer by generating a random number or two.[1] If there are 1,000 names on the list, a random number between 1 and 1,000 will give us the number of a composer according to the criteria above. If there are 20 pages, with 50 composers per page, then a random number between 1 and 20 will give us a page, and another number between 1 and 50 will identify a composer.[2]
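As a minimal sketch of the two approaches just described, here is how the random numbers might be generated in Python. The function names are purely illustrative, and the page-based version assumes that every page holds the same number of names (otherwise composers on shorter pages would be favoured).

```python
import random

def pick_from_list(names, rng=random):
    """Pick one name so that every entry on the list is equally likely."""
    # randint includes both ends, matching "a random number between 1 and N".
    position = rng.randint(1, len(names))
    return names[position - 1]

def pick_by_page(pages, per_page, rng=random):
    """Return a (page, position) pair for a list with equal-sized pages."""
    page = rng.randint(1, pages)
    position = rng.randint(1, per_page)
    return page, position
```

For the figures in the text, `pick_by_page(20, 50)` gives a page between 1 and 20 and a position between 1 and 50.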
The second case, a biographical dictionary, also looks straightforward. We could pick a page at random, and then (if there is more than one composer listed on that page) the name corresponding to a random line on that page. The subtle problem with this approach is known as ‘length bias’. Famous composers will tend to take up more space in a biographical dictionary than lesser known composers. If the biography of Mozart takes up fifty pages, but the entry for Jane Bloggs only occupies half a page, then this procedure will pick 100 Mozarts for every Bloggs, which violates the ‘equally likely’ rule.
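To see length bias in action, the following sketch simulates the "random page and position" procedure on a toy dictionary containing only the two entries from the example. The entry lengths come from the text; the function names and the idea of treating the book as one continuous run of pages are simplifying assumptions.

```python
import random
from collections import Counter

# Entry lengths in pages, taken from the Mozart and Bloggs example.
ENTRY_PAGES = {"Mozart": 50.0, "Bloggs": 0.5}

def pick_by_position(entry_pages, rng=random):
    """Pick the composer whose entry covers a uniformly random point in the book."""
    total = sum(entry_pages.values())
    point = rng.uniform(0, total)
    running = 0.0
    for name, pages in entry_pages.items():
        running += pages
        if point < running:
            return name
    return name  # safety net for floating-point rounding at the very end

def tally(entry_pages, draws, rng=random):
    """Count how often each composer is picked over a number of draws."""
    return Counter(pick_by_position(entry_pages, rng) for _ in range(draws))
```

Calling `tally(ENTRY_PAGES, 10000)` should produce something in the region of 100 Mozarts for every Bloggs, although the exact counts will vary from run to run.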
There is a simple way around this problem, or at least a way to alleviate its impact. Pick a random page and position as above, identify the composer occupying that space, and then choose the nth composer after that one, where n is a random number between, say, 3 and 10. On the assumption that the length of each composer’s entry is independent of those nearby, this will result in a choice that is not biased towards those with the longest entries.[3] However, it does favour the composers three to ten places after Mozart, relative to those following Bloggs, so, strictly speaking, the ‘equally likely’ rule is still violated. Nevertheless, if we draw a sample using this procedure (ignoring any repeated names), we will end up with a sample that is not affected by length bias, i.e. ‘long’ and ‘short’ composers will be represented according to the proportions in the population as a whole.[4]
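The "nth successor" procedure might be sketched as follows. This assumes the dictionary has been reduced to two parallel lists (names in dictionary order, and the length of each entry), which is a hypothetical representation chosen for illustration. The modulo step handles the need to cycle back to the beginning when the count runs off the end of the source.

```python
import random

def pick_nth_successor(names, lengths, low=3, high=10, rng=random):
    """Pick a length-biased starting entry, then step n entries further on."""
    # Step 1: a random page and position lands on an entry with
    # probability proportional to its length.
    start = rng.choices(range(len(names)), weights=lengths, k=1)[0]
    # Step 2: move on by a random n between low and high (inclusive),
    # wrapping round to the start of the list if necessary.
    n = rng.randint(low, high)
    return names[(start + n) % len(names)]
```

The defaults of 3 and 10 follow the range suggested in the text; other ranges would work provided they skip the immediate neighbours.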
The third example, a catalogue of recordings or publications, is much the same as the biographical dictionary, in that it will be prone to length bias. In the case of a record guide or catalogue, this bias can be extreme, as the big names not only tend to have more works, but these works have more recordings, the recordings have more reissues and varieties of format, and more tends to be written about each of them. It would not be unusual for a composer such as Mozart to occupy 5% or more of the pages in a record guide.
At least with the previous two cases it is possible to identify the length bias and to take account of it. In the fourth case – a source such as a listing of concerts ordered by date – it is much harder to allow for. Whilst it is straightforward to pick a date or a concert at random, and then a composer from that concert, there is no obvious way to tell the extent of any length bias. We may know that Mozart appears in a lot of concerts, and Bloggs appears very rarely, but because each composer’s entries are spread out essentially at random, it is not at all clear what the effect might be.
One way to reduce this problem would be a two-stage process. In Stage 1, we use the procedure above (random date/concert, random composer from that concert) to select, say, 100 composers.[5] If there is length bias, this list might contain Mozart ten times, Bloggs once (if we are lucky), and other composers a varying number of times. Turn this into a list of unique names, sorted according to the number of appearances – (Mozart, 10), …, (Bloggs, 1). There might be, say, 60 unique names, and the total number of appearances will add up to 100.
We know that Mozart was very likely to find his way onto the list in Stage 1, and his chance of being picked in any single Stage 1 draw will be roughly equal to the proportion of draws in which he was selected. So we can apply negative selection to reduce his chance of getting through Stage 2. To each composer, assign a weighting of 1/n, where n is the number of appearances in Stage 1 – so 1/10 for Mozart, 1/1 for Bloggs, etc. Then find the cumulative sum of these weightings, and select a composer by picking a random number between 0 and the total of the 1/n terms.
So if Mozart is first on the list, we will select him if the random number falls between 0 and 0.1. If Bloggs is last on the list, a number between (say) 43.2 and 44.2 will select her, where 44.2 is the total of all the weightings. With this method we can choose a random composer in a way in which the chance of getting through the first stage, multiplied by the chance of getting through the second, is (roughly) constant.
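Stage 2 of this process can be sketched in a few lines of Python. The input is assumed to be the raw list of names drawn in Stage 1, repeats included; the function name is illustrative, and the sketch does not handle an empty list.

```python
import random
from collections import Counter

def two_stage_pick(stage1_draws, rng=random):
    """Pick one composer from the Stage 1 draws, weighting each name by 1/n."""
    counts = Counter(stage1_draws)          # n = appearances per composer
    ordered = counts.most_common()          # [(name, n), ...], most frequent first
    total = sum(1 / n for _, n in ordered)  # total of all the 1/n weightings
    target = rng.uniform(0, total)          # random number between 0 and the total
    cumulative = 0.0
    for name, n in ordered:
        cumulative += 1 / n
        if target < cumulative:
            return name
    return ordered[-1][0]  # safety net for floating-point rounding
```

With Mozart appearing ten times and Bloggs once, Mozart occupies the interval from 0 to 0.1 and Bloggs an interval of width 1, so Bloggs is ten times as likely as Mozart to get through this stage.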
Drawing a random composer from the fifth case – an online catalogue or database – can be anywhere between simple and impossible. Two main difficulties are likely to arise – working out how big the database is, and finding a way of accessing a random element. Unless both of these problems can be solved, it might not be possible to pick a random composer according to the ‘equally likely’ criterion.[6]
With an online library catalogue, it is sometimes impossible to browse all of the records, although it is usually easy to generate a list of records meeting certain criteria – by searching by publication date or genre, for example. In this case, the output of such a search can be used to select a random composer (subject to the search criteria), perhaps allowing for length bias using one of the techniques above. Some catalogues will even allow you to save or download the list of results so that you can process them in a spreadsheet.
It is often worth examining the URL (the ‘http://…’ internet address) of database records. The outputs of searches, for example, may contain a parameter along the lines of ‘page=n’, which you can use to jump to particular pages of results. Sometimes individual records have numerical codes, and it might be possible (by trial and error) to estimate how many records there are and to select one at random.
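As a rough illustration of the ‘page=n’ idea, the sketch below rewrites a search URL so that it points at a randomly chosen page of results. The parameter name, the number of pages and the example address are all hypothetical; a real catalogue may use a different parameter, or none at all. It only builds the address and does not fetch anything.

```python
import random
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def random_results_page(url, last_page, rng=random, param="page"):
    """Return the URL with its page parameter set to a random page number."""
    parts = urlparse(url)
    query = parse_qs(parts.query)  # e.g. {"q": ["sonata"], "page": ["1"]}
    query[param] = [str(rng.randint(1, last_page))]
    # doseq=True writes each list value back out as a normal key=value pair.
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# Hypothetical example address, not a real catalogue.
example = random_results_page(
    "https://catalogue.example.org/search?q=sonata&page=1", last_page=40
)
```

Picking a random record from the chosen page then gives an approximately uniform choice, provided each page shows the same number of results (the final page is often shorter).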
In a few cases you might be able to download the entire dataset, which makes the job of sampling much easier. You can do this with the music collections of the British Library, for example.[7] One or two websites, such as IMSLP, even have a ‘random page’ feature – although to get a random composer you will still have to adjust for length bias, as most of the pages correspond to individual works rather than composers (or you could keep clicking until a composer page appears).
So the next time you need to pick a composer at random, don’t just guess. Find a suitable dataset and do it properly – you will probably discover somebody new and interesting, and you will get a much better understanding of your source.
1. Random numbers here refer to ‘uniform’ random numbers (i.e. all values equally likely) which are easily generated by computer programs, spreadsheets, and some calculators.
2. An example of this sort of list, which can be useful in some situations, is the index or contents list of a larger and more complex dataset.
3. Actually, there is some correlation in the case of families of composers, which is why it is sensible to avoid n=1 or 2.
4. Note that it will also be necessary, for composers towards the end of the source, to cycle back to the beginning when counting the nth successor.
5. The more the better, although with some sources, this can be a time-consuming process.
6. An example, as far as I can tell, of an impossible-to-sample database is the iTunes store.
7. See the link on this page.