Random sampling with Python

I need to run some tests at work. This isn’t the kind of testing programmers do; I’m testing actual physical devices that will be pulled or crushed or heated to destruction. I have about 100 of the devices and need to get a random sample for testing. I decided to use Python to help me do the sampling.

Define the devices

The devices have already been given ID numbers from 1 to 100. Device 29 was removed from the population in an earlier test, so I generate the list of available devices like this

>>> population = range(1,101)
>>> del population[28]
>>> population
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37,
38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

The >>> thing at the beginning of each line is Python’s interactive input prompt. I’m doing this interactively because it’s too simple to bother writing a script. The third command just printed out the list so I could check my work.

How to sample?

Python’s random library has the functions needed to get a random sample from this population. Scrolling through the docs, I come upon the sample function:

random.sample(population, k)
Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.

This sounds perfect, except I don’t know what k is. It’s not that I can’t calculate the proper sample size for a given test, it’s that I don’t yet know how many different types of test will have to be run. As is often the case, the nature and number of tests will depend on the early results.

I could, of course, use random.sample(population, k) several times, deleting the samples it chooses from population each time. But I can’t do that interactively, as each test will take days or weeks to run, and it’s not practical to have a Python session running that long. I want something I can generate now and use over the weeks to come.

Shuffle the deck

The shuffle function comes to my rescue:

random.shuffle(x[, random])
Shuffle the sequence x in place. The optional argument random is a 0-argument function returning a random float in [0.0, 1.0); by default, this is the function random().

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

I treat the population list as if it were a pack of playing cards, shuffling it and dealing out the device numbers one by one.

>>> import random
>>> random.shuffle(population)
>>> for i in range(15):
...   for j in range(7):
...     print "%3d" % population[i*7+j],
...   print
... 
 65  53  74  34  35  64  96
 10  11  88  89  87   3  66
 80  76   6   7  48  18  58
 82  94  38  62  67  93  72
 23 100  78  17  68  39  75
 19  91  73  44  71  99  84
 63  85  24  95  31  13  98
 14  97  51  20  28  40  46
 61  37  50  47  56   9  49
 36  52  79  32  42  22  77
  5  12  60  15   2  90   4
 59  55  69  83  57  27  25
 70  41  30  16  33   8   1
 45  21  43  26  86  81  92
 54
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
IndexError: list index out of range

Don’t worry about the error message—that just means the nested loop tried to go beyond the end of the list; there aren’t enough numbers to fill the 15th line.

I select and copy the shuffled list from the Terminal, paste it into a Pages document, and fool around with the font size until I have a full page of numbers to print out.

I choose the devices to test from this list in this order, crossing them off as I go. The list becomes part of the tests’ written documentation.

Is the deck really shuffled?

Do I need to worry about the second paragraph in the shuffle description? Are there permutations of the population list that shuffle will never generate because the period of the random number generator is too small? It’s easy to check.

According to the random documentation,

Python uses the Mersenne Twister as the core generator. It produces 53-bit precision floats and has a period of 2**19937-1.

For those of us who don’t think in binary,

219937=4.3×1060012^{19937} = 4.3 \times 10^{6001}

which is an awfully long period. Since there are only

99!=9.3×1015599! = 9.3 \times 10^{155}

permutations of the list, there’s no need to worry.

Reflection

This isn’t a complicated operation, but I find it quite useful, and it fits in with a recurring theme of this blog: even a little scripting can get a lot done. Strictly speaking, I didn’t even write a script here—just ran through a few interactive commands—but the commands were clearly informed by scripting experience.

Perhaps a little too informed. My first (wrong) thought was to write a function that would use sample to get a list of samples and then use the index function to remove those samples from population. I did this because I was thinking about programming instead of about the problem I wanted to solve. Once I got out of that mindset and realized that the most useful thing would be a printed list of the IDs in the (shuffled) order in which they were to be tested, the solution became obvious.

If I were more clever, I probably could do this in a spreadsheet and have the output immediately ready for printing rather than screwing around in Pages. But I don’t think in spreadsheet terms, so it’s easier for me to use Python.