Sampling

Recently, I’ve had to inspect, measure, or in some way test sets of devices taken randomly from some larger population. I’ve been using Python and its random library to make the choices for me.[1] The library makes these scripts very easy to write.

Say I have a hundred devices, each identified by a unique serial number, and I want to run a test on ten of them. This is how I choose the ten:

python:
 1:  #!/usr/bin/python
 2:  
 3:  from random import sample, seed
 4:  
 5:  # Setting the seed like this will give the same set of samples every
 6:  # time the script is run. Omitting this line will give a different
 7:  # set every time the script is run.
 8:  seed(1)
 9:  
10:  # This is where I'd enter the real serial numbers. For illustration,
11:  # I'm just using 300-399 with some extra leading zeros.
12:  devices = ["%05d" % x for x in range(300, 400)]
13:  
14:  testUnits = sample(devices, 10)
15:  
16:  print('\n'.join(testUnits))

The most difficult part is entering the list of serial numbers. I’m usually given the list in a spreadsheet, so I copy it out of there, paste it into my script, and use a regex find/replace to turn it from a column of numbers into a comma-separated Python list of strings. I don’t want to waste your time with stuff like that here, so I just set the list of serial numbers to 00300 to 00399 in Line 12.
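
The find/replace step could also be left to Python itself. What follows is only a minimal sketch, assuming the spreadsheet column is pasted into a triple-quoted string; the name `pasted` and the four serial numbers in it are made up for illustration.

```python
# Hypothetical column of serial numbers, pasted straight from a spreadsheet.
pasted = """
00300
00301
00302
00303
"""

# split() with no arguments breaks on any run of whitespace, so blank
# lines and trailing spaces are dropped. Each serial number stays a
# string, which keeps its leading zeros.
devices = pasted.split()

print(devices)
```

The resulting `devices` list can be handed to sample exactly like the list built in Line 12.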

I don’t usually set the seed (Line 8), but in more complicated scripts that rely on random number generation or sampling, setting the seed can be very helpful for debugging because it generates the same set every time the script is run. You can keep the seed line in the script until you know it’s running correctly, then comment it out (or change its argument) to generate a new set of values. In Python 2, the argument to seed can be any hashable object; current Python 3 releases accept only None, an int, a float, a string, bytes, or a bytearray. Numbers are probably the most commonly used seeds, but you can also use strings:

python:
 8:  seed('corn')
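
To see the repeatability for yourself, a short check along these lines should do. It is a sketch rather than part of the script above, and the names `first` and `second` are just for illustration.

```python
from random import sample, seed

devices = ["%05d" % x for x in range(300, 400)]

seed('corn')
first = sample(devices, 10)

# Re-seeding with the same value restarts the generator from the same
# state, so the second draw repeats the first.
seed('corn')
second = sample(devices, 10)

print(first == second)
```

This prints True. Removing the second seed call lets the generator carry on from where the first draw left it, so the two lists will almost certainly differ.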

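The key line is Line 14, which uses the aptly named sample function to draw a random subset of items from the given list. The draw is made without replacement, so no serial number can appear twice. With the seed set to 1, the output is (the exact set may differ in other Python versions)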

00313
00384
00376
00325
00349
00344
00365
00378
00309
00302

As a practical matter, I sometimes generate a few more samples than I plan to test. I do this only if I suspect that some of the devices I’ve been given can’t be tested for one reason or another. If a device in the list turns out to be untestable, it’s nice to have an extra serial number or two to use as a random replacement.
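
One way the extras could be handled is to draw everything in a single call and split the result. This is a sketch under assumptions: the count of two spares and the names `n_test`, `n_spare`, `picked`, and `spares` are hypothetical.

```python
from random import sample

devices = ["%05d" % x for x in range(300, 400)]

n_test = 10    # units planned for testing
n_spare = 2    # hypothetical number of replacements to hold in reserve

# sample returns its picks in selection order, so any slice of the
# result is itself a valid random sample of the population.
picked = sample(devices, n_test + n_spare)
test_units = picked[:n_test]
spares = picked[n_test:]

print('\n'.join(test_units))
print('Spares: ' + ', '.join(spares))
```

Drawing the spares in the same call guarantees they don’t duplicate any of the primary units, which two separate calls to sample would not.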


  1. Yes, this is related to the confidence limits calculations I did in the SciPy v. Octave post of a few days ago. But random is part of the Python Standard Library, so there’s no need for SciPy.


4 Responses to “Sampling”

  1. idiot commenter says:

    First world problem!

  2. Carl says:

    Python’s shuffle function is also useful. As Wikipedia notes, you’ve gotta use Fisher-Yates for shuffling if you want good results:

    “Other common ad-hoc algorithms try to emulate manual shuffling with poor success: the permutations they produce are usually far from random and the running times are poor.”

  3. Barron Bichon says:

    “I’m usually given the list in a spreadsheet, so I copy it out of there, paste it into my script, and use a regex find/replace to turn it from a column of numbers into a comma-separated Python list of strings.”

    I often need to bring data from a spreadsheet into Python, too, but I take a different (though I don’t claim better) approach. I copy it out of the sheet to a plain text file, then read that into Python using numpy.loadtxt. An upside to this is that it can be quicker/easier to switch between data sets if they are in separate files (or columns in the file) by passing the file name (or column number) as a command line argument to the Python script.

  4. Dr. Drang says:

    Carl,
    I use shuffle, too, and was going to include a paragraph or two about it. Decided sample was probably more useful in general.

    Barron,
    It’s nice to have a NumPy mentor in the comments. I’ll keep loadtxt in mind the next time I need to read data from a text file. Much better than writing yet another ad hoc solution.

    You’ve also given me an idea for a TextMate command that takes a column of data on the clipboard and uses pbpaste and sed to turn it into a Python list. Would save me a lot of time in the long run, because I need to do that conversion fairly often.