How many samples?

In my Monte Carlo solution of the Two Child Problem, I did 10,000 simulations using Python’s random module. Was that enough? If I didn’t already know the answers (the usual case when doing Monte Carlo simulation), how could I estimate the accuracy of the results? Not surprisingly, the answers to these questions can be found in the field of statistics.

When the process being simulated consists of a series of independent trials, each trial having one of two results (which we’ll call passes and fails, although any two mutually exclusive results could be used), it’s said to be a Bernoulli sequence, and the count of passes follows a binomial distribution

\[P(x \;\textrm{passes in}\; n \;\textrm{trials}) = \frac{n!}{x! (n-x)!}\: p^x \: (1-p)^{n-x}\]

where \(p\) is the probability of passing a single trial.¹ You may recognize the term with the factorials as the binomial coefficient.

An estimate of \(p\) is easily calculated:

\[\hat{p} = \frac{x}{n}\]

where we put the hat over the \(p\) to distinguish the estimate from the true (albeit typically unknown) value.

Because we could get a different value of \(x\) every time we run a set of \(n\) trials, \(\hat{p}\) is a random variable. If the number of trials is large enough, \(\hat{p}\) will be approximately normally distributed with a mean of

\[\mu_{\hat{p}} = p\]

and a standard deviation of

\[\sigma_{\hat{p}} = \sqrt{\frac{p (1-p)}{n}}\]

That the mean of \(\hat{p}\) is equal to \(p\) signifies that \(\hat{p}\) is an unbiased estimate of \(p\); and because its standard deviation shrinks as \(n\) grows, \(\hat{p}\) will tend toward \(p\) as we increase \(n\).
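As a sanity check, here’s a short simulation sketch, in the spirit of the post’s use of Python’s random module, that compares the empirical mean and standard deviation of \(\hat{p}\) against the formulas above. The values of \(p\), \(n\), and the number of runs are illustrative choices, not from the original simulation.

```python
import math
import random
import statistics

random.seed(42)  # for reproducibility

p = 1 / 3        # true pass probability (illustrative)
n = 1000         # trials per simulation run
runs = 2000      # number of independent estimates of p-hat

# Each run: count passes in n Bernoulli trials, then estimate p.
estimates = []
for _ in range(runs):
    passes = sum(1 for _ in range(n) if random.random() < p)
    estimates.append(passes / n)

theoretical_sd = math.sqrt(p * (1 - p) / n)
print(f"mean of p-hat: {statistics.mean(estimates):.4f}  (p = {p:.4f})")
print(f"sd of p-hat:   {statistics.stdev(estimates):.4f}  "
      f"(theory = {theoretical_sd:.4f})")
```

The empirical mean lands very close to \(p\), and the empirical standard deviation close to \(\sqrt{p(1-p)/n}\).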

Because we know the distribution of \(\hat{p}\) and its parameters, we can calculate the probability that \(\hat{p}\) will fall within some neighborhood, \(\pm \delta\), around \(p\),

\[P(p - \delta \le \hat{p} \le p + \delta) = \Phi\left(\frac{\delta}{\sigma_{\hat{p}}}\right) - \Phi\left(-\frac{\delta}{\sigma_{\hat{p}}}\right) = 1 - 2\Phi\left(-\frac{\delta}{\sigma_{\hat{p}}}\right)\]

where \(\Phi\) is the standard normal cumulative distribution function,

\[\Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}} e^{-\zeta^2/2} d\zeta\]

The \(\Phi\) function can’t be integrated analytically, but many numerical analysis programs include a function for it. Octave has a function called normcdf that does the trick. Back in the not-so-good old days, you’d have to look it up in a table; every statistics textbook had a table of \(\Phi\) in its appendix.
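In Python, the standard library’s statistics.NormalDist (Python 3.8+) plays the same role as Octave’s normcdf. Here’s a sketch of the coverage calculation above, with illustrative values of \(p\), \(n\), and \(\delta\):

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF, like Octave's normcdf

# Illustrative values (not from the post): true p, sample size, half-width.
p, n, delta = 1 / 3, 10_000, 0.01
sigma = math.sqrt(p * (1 - p) / n)

# P(p - delta <= p-hat <= p + delta) = 1 - 2*Phi(-delta/sigma)
coverage = 1 - 2 * Phi(-delta / sigma)
print(f"P(|p-hat - p| <= {delta}) = {coverage:.4f}")
```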

This is all very nice, but you can only do these calculations if you know \(p\), and the typical reason you’re doing the Monte Carlo simulation is that you don’t know \(p\). What you have is a particular value of \(\hat{p}\) and you want to know how good it is.

Enter the concept of the confidence interval. By flipping the terms around and using \(\hat{p}\) to estimate the standard deviation,

\[\hat{\sigma}_{\hat{p}} = \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}\]

we can say with a confidence² of

\[1 - 2\:\Phi\left(-\frac{\delta}{\hat{\sigma}_{\hat{p}}}\right)\]

that the true value of \(p\) is in the interval

\[\hat{p} \pm \delta\]

A more convenient way to work is to introduce a new variable,

\[\alpha = \Phi\left(-\frac{\delta}{\hat{\sigma}_{\hat{p}}}\right)\]

and choose a confidence level of \(1 - 2\alpha\). Then

\[\delta = -\Phi^{-1}(\alpha)\:\hat{\sigma}_{\hat{p}} = \Phi^{-1}(1-\alpha)\:\hat{\sigma}_{\hat{p}}\]

where \(\Phi^{-1}\) is the inverse of the standard normal CDF (Octave function norminv).

The confidence interval at this level is

\[\hat{p} \pm \Phi^{-1}(1-\alpha)\:\hat{\sigma}_{\hat{p}}\]

or, expanding out the \(\hat{\sigma}_{\hat{p}}\),

\[\hat{p} \pm \Phi^{-1}(1-\alpha) \: \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}\]

That was a longish derivation, but the result is simple to use. Let’s say we’ve done a Monte Carlo simulation and we want to get the 95% confidence interval for \(p\). Here’s what we do:

  1. Note that for 95% confidence, \(\alpha = 0.025\).
  2. Use whatever tools are available to determine that \(\Phi^{-1}(0.975) = 1.96\).
  3. Take the values of \(n\) and \(\hat{p}\) from the simulation and calculate the interval

    \[\hat{p} \pm 1.96\sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}\]
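The steps above are easy to wrap in a small function. This sketch uses statistics.NormalDist from the Python standard library (the function name and signature are mine, not from the post), and checks it against the pass count from the Two Child simulation:

```python
import math
from statistics import NormalDist

def confidence_interval(x, n, confidence=0.95):
    """Return (p_hat, delta) so the interval is p_hat +/- delta."""
    p_hat = x / n
    alpha = (1 - confidence) / 2
    z = NormalDist().inv_cdf(1 - alpha)   # e.g. 1.96 for 95% confidence
    delta = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, delta

# 2520 passes in 7535 trials, from the simulation discussed below
p_hat, delta = confidence_interval(2520, 7535)
print(f"{p_hat:.3f} +/- {delta:.3f}")
```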

Now let’s say we’re planning a simulation and want to choose a value of \(n\) to get a decent estimate of \(p\).

  1. For 95% confidence, use \(\Phi^{-1}(0.975) = 1.96\), as before.
  2. Get a preliminary value of \(\hat{p}\), perhaps by doing a small Monte Carlo run.
  3. Decide how tight we want the confidence interval, \(\pm \delta\), to be.
  4. Calculate \(n\) via

    \[n = \hat{p} (1 - \hat{p}) \left(\frac{1.96}{\delta}\right)^2\]
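This planning step can be sketched the same way; the function name is mine, and the preliminary \(\hat{p}\) and target \(\delta\) here are illustrative:

```python
import math
from statistics import NormalDist

def samples_needed(p_hat, delta, confidence=0.95):
    """Trials needed for an interval of p_hat +/- delta at this confidence."""
    alpha = (1 - confidence) / 2
    z = NormalDist().inv_cdf(1 - alpha)
    # n = p_hat * (1 - p_hat) * (z / delta)^2, rounded up to a whole trial
    return math.ceil(p_hat * (1 - p_hat) * (z / delta) ** 2)

# A preliminary estimate of 0.334 and a +/- 0.01 target at 95% confidence:
print(samples_needed(0.334, 0.01))
```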

Note that if we want a very tight confidence interval, \(\delta\) will be small and \(n\) will have to be large. This makes sense: to get a close estimate, we need a lot of samples.

Although it’s a common confidence level, there’s nothing magical about 95%. We could just as easily use another value. Here are the \(\Phi^{-1}\) values for a few others.

| Confidence | \(\alpha\) | \(\Phi^{-1}(1-\alpha)\) |
|:----------:|:----------:|:-----------------------:|
| 99%        | 0.005      | 2.576                   |
| 95%        | 0.025      | 1.960                   |
| 90%        | 0.050      | 1.645                   |
| 75%        | 0.125      | 1.150                   |
| 50%        | 0.250      | 0.675                   |
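These multipliers are easy to regenerate; here’s a sketch using statistics.NormalDist’s inv_cdf, which plays the role of Octave’s norminv:

```python
from statistics import NormalDist

inv_Phi = NormalDist().inv_cdf  # inverse standard normal CDF

# Reproduce the table: confidence level -> alpha -> multiplier
for confidence in (0.99, 0.95, 0.90, 0.75, 0.50):
    alpha = (1 - confidence) / 2
    print(f"{confidence:.0%}  alpha = {alpha:.3f}  "
          f"multiplier = {inv_Phi(1 - alpha):.3f}")
```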

As expected, higher confidence comes at a cost: for a fixed number of samples the interval is wider, and for a fixed interval width we need more samples.

Let’s calculate the confidence interval for the Monte Carlo simulation of the first example of the Two Child Problem. The result was

If we restrict ourselves to families that have at least one son,
the probability of having two sons is 2520/7535 = 0.334

So \(n\) is 7535,³ and \(\hat{p}\) is 0.334. The 99% confidence interval is

\[0.334 \pm 2.576 \sqrt{\frac{0.334 \cdot 0.666}{7535}} = 0.334 \pm 0.014\]

So if we didn’t already know from earlier work that the true probability was \(1/3\), we could use this result to say with 99% confidence that the true probability was between 0.320 and 0.348.
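The arithmetic in this example can be checked in a couple of lines; this sketch again uses the standard library’s NormalDist rather than Octave:

```python
import math
from statistics import NormalDist

# Check the 99% interval for 0.334 from 7535 trials.
n, p_hat = 7535, 0.334
z = NormalDist().inv_cdf(0.995)            # ~2.576
delta = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat} +/- {delta:.3f}")          # 0.334 +/- 0.014
```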

I should mention that the formulas in this post aren’t restricted to Monte Carlo simulation. They’re valid any time you’re sampling from a pass/fail type of population.

  1. The count of fails also follows a binomial distribution; just substitute \(1-p\) for \(p\) and \(n-x\) for \(x\). 

  2. We use the word confidence instead of probability because we’re really not calculating a probability here. The true value of \(p\) isn’t a random variable; it’s a number that we just don’t happen to know. 

  3. Why isn’t it 10,000? In this simulation we filtered out about a quarter of the sample families because they didn’t have sons, leaving 7535.