As part of a job application process I was asked to write a short explainer article on something technical, and I decided to write a little bit about how testing for randomness works. The brief was to write something playful and friendly, hence the weird framing of this piece. I didn’t get the job, but I did learn a bit about randomness Since I’ve written the damn thing, I thought I’d share it, and International RNG Day seems like the perfect opportunity to do so!

Congratulations on your new job at the Digital Dice Box Factory. You’re in charge of Quality Assurance for their marquee product: the Digital Dice Box (DDB). The DDB is a machine that simulates rolling a fair six-sided die. It’s a box, and you can press a button and bing: a number appears. You are in charge of testing whether the numbers that come out are random. Er, OK. Well, what does it really mean to be random? This is a pretty hard, and surprisingly deep question, but we’re not going to be really diving into it here. We’re just going to dipping our toes in the water by asking: how do you test whether a sequence of numbers is random?

Alright. Time to test. Press the button. Bing: 6. Is this 6 random? Does this question even make sense? Not really. A single outcome can’t really be tested for randomness, only sequences of outcomes can be. Let’s get some of our helper elves to press the button a bunch of times and note down the numbers.

```
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
```

Now, this *could* be the output from a random device:
it’s a possible sequence of outcomes.
You might be tempted to say that this outcome is
an unlikely one,
but we have to be careful.
Every sequence of 48 rolls
has the same probability
(if the die is fair).
It’s one sixth to the power of 48.
But a sequence of 48 6s in a row should still be giving us “not random” vibes.

The goal of the DDB project is to produce a device whose output is indistinguishable from the outcome of rolling a fair die. Our goal, as QA testers, is to find some property that will distinguish dice-roll output from DDB output. Once we can no longer find a way of distinguishing DDB output from dice rolls, the DDB project has succeeded. Our immediate goal is to find some kind of test property that will distinguish real randomiser output from a sequence of all 6s. Let’s try using the average of the sequence as a “test” property. The average of a sequence of 48 6s is 6.00. Now we need to work out what are typical values for this test property for the real randomiser. We call over two elves (called Monty and Carlo) and ask them to roll 48 fair dice and calculate the average value. Then we tell them to repeat this process lots of times until they have a good idea of the range of typical values for the average of 48 fair dice rolls (see Figure 1). If 6.00 is far outside that range, then we can safely say that DDB v1.0 is not working as intended.

As we wait for the elves to do their thing, we contemplate the basic structure of the testing problem. We collect a sequence of numbers from the candidate randomiser that we’re testing, and we use that sequence of numbers to calculate a value: a test property. We have some ideal randomising device we want our dice box to replicate: the ideal fair die. We can generate samples from the real randomiser and calculate the same test property for them, and then check whether the value from the device we’re testing is within the typical range of values for the real randomiser. If not, then we can reject the device we’re testing. It turns out that the typical values are clustering around 3.50, and 6.0 does look quite far from the range we’re seeing. We write up our report, rejecting DDB v1.0.

As we wait for production to deliver the next iteration of the DDB,
we think about typical values some more.
In fact, we often don’t need to resort to using the Monty and Carlo dice-rolling method:
we can often calculate the probability distribution of the test property for the real randomiser just
using probability theory.
We can calculate, for example, that 3.50 is the *expected value*
of the distribution of outcomes,
and we can say how far from that value a sample average can be
and still be considered typical.
(The actual numbers depend on how long the sequence is).

Oh, they’re here with version 2.0 of the DDB. We hand the box to our helper elves, and ask them to collect a sufficiently large sample of output from the box, and then to calculate the average value. How large is sufficiently large? We’re going to swerve this question for now, we’ll come back to it in a later post.

Alright, the elves have finished with their calculations: the sample average is 3.50. Excellent! Right? I mean… It’s suspiciously perfect isn’t it? It is exactly in the middle of the range of typical values… Hm. Let’s take a look at the raw data.

```
1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6
1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6
1 6 1 6 1 6 1 6 1 6 1 6 1 6 1 6
```

Oh no! We’re getting the average right, but this is doesn’t look random… We’re going to need to find another property that will distinguish this sequence from a properly random one. One problem with this alternating 1s and 6s sequence is that there are no 2s, 3s, 4s, or 5s. While the average value is what you’d expect, the distribution of outcomes is not. In a sequence from a real randomising device, you’d expect there to be roughly as many 1s as there are 2s and so on.

We need to turn this hunch into a clearly specified test property we can test against. We don’t really have the space to go into the details here, but essentially what you do is count up how many of each outcome you get, and calculate a number that captures how far that outcome is from what you’d expect to see in the sample. Then you add up those numbers, and that gives you what’s called the “chi squared” test. From there, the procedure is much as before: we have a method for turning a sequence of observed outcomes into a number (the average value or the chi squared value). We then ask Monty and Carlo to do the same procedure lots of times using real dice rolls (see Figure 2). We then check how far the value we get for the DDB v2.0 is from the range of typical values that Monty and Carlo produce.

A distribution of 24 1s and 24 6s gets a chi squared value which is quite far outside the range we see from similar sized sequences from a real randomiser. It’s safe to conclude the DDB is still flawed: you write up your report and send it off.

While you’re waiting for the next version of the dice box to arrive, you spend some time looking up what statisticians call the various concepts we’ve been using. The thing we’ve been calling the “test property” that you calculate for the sequence (e.g. the mean or the chi squared value) is what statisticians call a “test statistic”. Statisticians also spend a lot of time thinking about when a value of a test statistic is typical and when a value is sufficiently far from typical to be good evidence against the device being random. As your eyes glaze over, and you start to feel drowsy, a delivery elf arrives with Version 3.1 of the dice box. Phew! No more stats, back to work.

You send the elves off to calculate the mean and the chi squared value for a sequence of numbers generated by the new dice box. Everything looks good! In fact, maybe… everything looks too good? The mean is precisely 3.50. The chi squared value is precisely what you’d get if your sample of 48 dice rolls had 8 of each number… Hmmm… Let’s have a look at the raw data.

```
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4
5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2
3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
```

Oh come on! Alright fine. You get the right mean, you get the right distribution of values, but this sequence clearly isn’t random, is it? Time to come up with another property that a real random sequence would be likely to have that this sequence does not.

Before reading any further, have a think about what properties might work. There’s no one right answer, I’ll mention a couple of examples that would do the job.

Here’s one approach.
Notice that every number 1 is followed by a number 2.
That’s unlikely for a real randomiser.
For a real random sequence,
a 1 is just as likely to be followed by *any* possible outcome.
So let’s look at *pairs* of outcomes from the sequence.
Take the first and second numbers as a pair,
then the third and fourth, then the fifth and sixth and so on.
We treat the sequence of numbers as if it’s delivering us random *pairs*
of dice rolls.
And then we can look at the statistics we’d expect a sequence of pairs of rolls to have.
We can figure out the distribution such a sequence of pairs would have if
it had been generated by a real randomiser.
There are $6\times 6 = 36$ possible outcome pairs,
and each outcome pair is equally likely if they’re being generated by a real randomiser.
Compare this with what we get from the sequence above.
The only pairs that occur in that sequence are $(1,2), (3,4)$ and $(5,6)$.

Now we need to calculate some property of the sequence of pairs that will serve as our test. But this is basically the same as a problem we’ve already solved! We have a distribution of outcomes from a sequence, and we want to compare that to what outcomes are likely if the sequence had been generated by a particular kind of randomiser. This is just the situation we were in when we deployed the chi squared test. Now we have 36 possible outcomes rather than 6, but that’s a difference the chi squared test can accommodate. And again, the value for the chi squared test for the sequence of pairs is very far from what you’d expect for a sequence generated by a real randomiser. So this sequence is unlikely to have been generated by a real random process.

Notice what we did here: we use the sequence of outcomes to generate a different sequence, and calculate some property of that new sequence. We use the sequence of outcomes to generate a sequence of pairs of outcomes. And then we find and evaluate a test statistic on that new sequence of pairs. That’s a technique that gets used a lot.

Let’s look at another way we could do something similar to produce other tests that would reject the sequence from DDB v3.1. One thing we could do is start the same way, by producing a sequence of pairs, but then instead of considering pairs, we consider the sequence of the sums of the pairs. So the pair $(1,2)$ becomes 3, the pair $(3,4)$ becomes 7 and the pair $(5,6)$ becomes 11. And so on for all the other possible pairs. The generated sequence above has an equal distribution of 3s, 7s and 11s, whereas a sequence of sums of pairs generated by a real randomiser would contain values ranging from 2 to 12, and would contain more 7s and fewer 3s, for example. We can also use the chi square test here: it can compare a distribution of values to an expected distribution even when not all values have the same probability.

How does testing for randomness work in practice?
You don’t need to generate test statistics and calculate likely values yourself,
there are a number of different sets of tools that you can use to automate this process.
These will typically consist of a few dozen test statistics,
carefully chosen to pinpoint known weakness in certain kinds of random number generator.
See, for example, the `diehard`

/`dieharder`

tests, or `TestU01`

.
But these suites of tools are working in pretty much the way we outlined above.
Generate some kind of test statistic for the sequence,
and compare it to the likely values of that statistic for the randomiser.

That’s what I wrote. I was quite pleased with it. I have to go to work now but I wanted to get this up in time for RNG day. I will come back and add some links to the text at some point probably this weekend.