The complete question was: The other day I was trying to explain the difference between “impossible” and “with probability zero” to a friend. I remarked “if you pick an integer, and I pick an integer, the mathematical probability that we picked the same number is zero, but it’s certainly not impossible.” A seemingly harmless statement, but only later did I think, what does it even mean to “pick ANY integer with EQUAL probability?” Is there any meaning to a random variable that can take on an infinite number of values with equal probability? It would be like a uniform distribution that has one or both bounds stretched to infinity. Does this kind of object have a name? What kind of mathematical properties would this even have? Oh dear, this is blowing my mind just thinking about it. Please help!
Mathematician: Developing a consistent theory of probability for sets with an infinite number of elements (like the set of all integers, the set of all real numbers, or the set of real numbers between 0 and 1) requires dealing with a handful of subtle and tricky problems. In fact, to determine whether it is possible to choose one object from an infinite number of objects such that each object has the same probability of being chosen, we must delve into what we really mean by “possible”. Various interpretations of this question are:
1. Can we work with probabilities (as they are defined in a formal, mathematical way) on infinite sets of items, and if so, can we assign equal probabilities to each item?
2. Can a (formal, mathematical) procedure be devised to sample uniformly at random from an infinite set?
3. Can an algorithm be designed that can carry out this sampling procedure on a real computer such that the algorithm will terminate in a finite amount of time?
Each of these questions, which I’ll address one by one, raises interesting considerations.
Can a formal theory of probability be developed for infinite sets of items? Absolutely. For example, the Gaussian distribution (often known as the bell curve or normal distribution) is defined on the set of real numbers (so when you draw a sample according to this distribution, you get some real number). Strictly speaking, it assigns a probability of zero to each of the individual real numbers, but assigns a non-zero probability to subsets of real numbers. For example, if we sample from a Gaussian distribution, there is a formula that can tell us how likely the number is to be less than any particular value X (so the set of all numbers less than X is assigned a positive probability). Why does the gaussian distribution not assign non-zero probabilities to the actual numbers themselves? The reason (loosely speaking) is because the probability of getting ANY number when we sample from a distribution must be 1 (which just means that some number must always occur). On the other hand, the set of real numbers contains so many numbers that, if each of them had a non-zero probability of occurring, it would not be possible for the total probability (which is, essentially, just the sum or integral of all of the numbers’ probabilities) to be 1. Another, more intuitive way to think about this is to consider a dart board. If we throw a dart at the board and are willing to assume that matter is infinitely divisible (sorry, Physicist) then we will always hit some point on the board (assuming our aim isn’t too terrible). But, at the same time, the chance of hitting any particular point is negligibly small (zero in fact) since there are so many possible points. So clearly, to describe the probability of hitting this board’s points, it is not sufficient to consider only the probability of each individual point being hit, but rather we have to consider how likely we are to hit various regions of the board, such as the region constituting the bullseye. Even though each particular point has a zero probability of being hit, some point is always hit in practice, and the set of points that make up the bullseye together have a positive probability of being hit with each throw.
Okay, so we can define probabilities on an infinite set, though as the Gaussian distribution case shows, we may have to actually let the probabilities be assigned to subsets of our original set, rather than to every object in the set itself. But can we do this in such a way that every object in our set is sampled with equal likelihood? The answer is again yes, though with some caveats. For instance, if we limit ourselves to the real numbers that are between 0 and 1, we can assign a uniform distribution to these numbers which will give them each an equal probability. The uniform distribution basically says that the probability that a given sampled number will be between a and b (for 0<a<b<1) is equal to b-a. This fact implies that all numbers are equally likely to be sampled from this distribution.
Fine, but can we define a probability distribution on the set of integers (rather than the real numbers between 0 and 1) such that they each occur with equal probability (i.e. does a uniform distribution on the integers exist)? The answer, sadly, is no. A probability mass function (which is the kind of probability distribution we need in this case) is defined to be a positive function that has a sum of values equal to 1. But any positive function that assigns an equal value to each integer must have probabilities that sum to either infinity or zero, so the desired distribution is impossible to construct. As a technical side note though, people sometimes try to get around this issue in Bayesian analysis by applying what are known as improper priors. Attempting to define a uniform distribution on the full set of real numbers also fails, for a very similar reason that it doesn’t work on the integers (it can not be the case that each real number (or equally sized interval of real numbers) has the same probability and the probability density function integrates to 1).
On to our second question of whether is it possible to come up with a formal mathematical procedure for sampling from infinite sets. The answer is yes, if we have an unlimited amount of time to spare. For real numbers between 0 and 1, we can use the following sampling procedure:
i. Start with the number 0.0, and set n=1
ii. Set the nth digit of our number after the decimal point to a random number from 0 to 9.
iii. Increase n by 1 and return to step ii.
If this procedure were iterated forever, it would produce a single random number between 0 and 1, and all real numbers between 0 and 1 are equally likely to be generated. Essentially we are just constructing a number by choosing each of it’s decimal digits randomly.
But what if we wanted to carry out this procedure for the set of all integers? Here things get stranger. The natural choice is the following procedure:
i. Start with the number 0.0, and set n=1
ii. Set the nth digit of our number BEFORE the decimal point to a random number from 0 to 9.
iii. Increase n by 1 and return to step ii.
If this procedure were carried out forever, it would seem as though it would produce an integer, and that this integer would have an equal probability of being any integer. This is true, in the sense that all integers that the algorithm produces have an equal likelihood of being produced (i.e. when it DOES produce integers those integers are each equally likely). But the algorithm does not actually do what we would like. We begin to see the problem when we pick any integer X, and ask the question, “what is the probability that this procedure would produce a number that is greater than X?”. The answer, is that there is a probability of 1 (i.e. a 100% chance), NO MATTER WHAT X IS. This makes sense, given that the integers stretch off to infinity, and that the number of integers close to 0 will always be dwarfed by the number of them far from it. But how can this procedure (when run forever) produce one, single integer, while at the same time having a probability of 1 of producing a number bigger than any particular number we choose? Well, each integer can be thought of as having an infinite sequence of zero digits to the left of its first non-zero digit. This algorithm will have a probability of 0 of producing a number with an infinite successive sequence of zeros, and therefore will have a zero probability of producing an integer! In other words, there is a probability of 1 that each number it produces will be (in a certain sense) “infinite”, so it does not serve the purpose that we hoped.
This brings us to our final question, regarding whether a terminating algorithm (that can be run on a real computer) can be created that will sample uniformly at random from an infinite set of numbers. The answer to this question is no, but with the footnote that this does not matter too much in practice (for reasons to be discussed). One way to understand why this is impossible is to consider how many bits it would take to transmit the number produced by such a sampling procedure. We would have to transmit some kind of code (agreed upon in advance) that represents the number we got from sampling, but since there are an infinite number of possible outcomes and since we have no knowledge (in advance) of what number will occur, we would need to have an infinite number of codes to represent all the outcomes. Hence, if we think of the codes as numbers, and choose some number N, then some of the codes must contain more than N digits (since N digits is only enough to describe a finite number of different codes). But since this holds for all N no matter how large, this means that, on average, it would take literally forever to transmit one of these codes. But, if the number cannot be transmitted, no computer could ever make a copy of it, which implies that no computer could ever generate such a number. What this confusing, convoluted argument is getting at is the fact that a uniform distribution on an infinite set of items (if it existed) would have an infinite entropy, so the numbers sampled from such a procedure could never be transmitted or stored (as doing so, on average, would require an infinite number of bits) so there is no way that such an algorithm could be used in real life. One way to see that the entropy of such a distribution would be infinite is to note that if we define a uniform distribution on n items, that as we let n go to infinity the entropy of the distribution approaches infinity.
Despite these problems, some infinite sets have a nice property that the procedure of sampling (with equal probability) from them can be nicely approximated on a computer. For example, if we want to sample a real number between 0 and 1, we can approximate this procedure by limiting ourselves only to numbers with at most 40 digits after the decimal point, and then sampling uniformly at random from this restricted set. While this procedure is not perfect, it will produce numbers that (for most purposes) look like those we would get if we truly sampled from all real numbers between 0 and 1. On the other hand, sampling uniformly at random from the set of all integers cannot be approximated in any nice way (and hence, is in some sense an inaccessible procedure). The problem here is that, as noted, if you fix any number X that you like, 100% of integers are greater than that X, no matter what X is. Since real computers are limited in the size of the numbers they can store, any attempt to approximate the procedure of sampling from all integers will be limited to sampling from integers less than some number X, despite the fact that 100% of integers truly are above X. If we try to sample uniformly at random from the set of all integers (or the set of real numbers, for that matter) we are doomed to complete failure.