More on the evidence of a single coin toss

Posted on December 20, 2010

Tags: probability, odds, coin-toss-problem, evidence, reasoning, bayesian

Recently, I asked how much evidence was contained in a single coin toss:

After seeing the outcome of this single coin toss [which came up heads], how much more should you believe my claim that the coin always comes up heads, compared to what you believed before the coin toss?

Many people submitted answers here on the blog and also on Hacker News, where the question led to an interesting discussion. Before I get to my answer, however, let’s talk about the question.

I like this question because it’s simple yet offers ample opportunity to explore something valuable but often unappreciated: weak evidence. Here we have the evidence of a single coin toss that comes up heads. That’s not much to go on. But it is something, and we would be wrong to ignore it.

Nevertheless, I’ve witnessed many experts ignore weak evidence, doctors in particular. The problem with ignoring weak evidence is that it’s abundant. Think of it as “long-tail” evidence: there’s so much of it that even if each piece is worth only a tiny bit, as a whole it’s worth a ton. So, if you don’t know how to mine it, you’re leaving a ton of potential knowledge buried within that long tail.

Interpreting evidence (weak or otherwise)

So, let’s talk about the evidence of our coin toss. My question was how much your prior beliefs about my claim (that the coin always comes up heads) should be swayed by the outcome of that single coin toss. I’m not asking about the coin, but about your state of knowledge about the coin and, more specifically, how that state should change in light of the coin toss.

There are many ways to approach the question, but to start, let’s define some notation. We’ll let P(X) denote our degree of belief in the proposition X, some statement that can be either true or false. Let P(X) = 0 represent our absolute conviction that X is false, and P(X) = 1 our absolute conviction that X is true. Values in between represent intermediate degrees of belief. A value of P(X) = 1/2, right in the middle, means that we have no reason to believe that X is more likely to be true than false. If we know nothing about X, then, our default value for P(X) must be 1/2.

Let’s be clear that X is either true or false, regardless of what we think. Our X represents some real property of the universe, and the universe doesn’t alter itself just because our thoughts about it change or because we do a mathematical calculation that we think describes it in some way. That’s why we write P(X): the P notation represents that we’re not talking about X itself but rather our belief in X. The P(·) can be read as “the probability of” (or “the plausibility of”), so P(X) represents “the probability of X.”

Instead of some placeholder X, let’s define some real propositions that relate to our coin toss:

S: the coin is a special coin that always comes up heads when tossed
H: we observe the coin to come up heads in a coin toss
T: we observe the coin to come up tails in a coin toss
K: our prior knowledge about the coin, the universe, and everything

That last one, K, is important. It’s a massive proposition, the logical conjunction of many smaller propositions that represent everything we already know – that the Earth is approximately spherical, that gravity pulls things toward one another, that the author of this blog post is exceedingly handsome, and so on. This massive proposition is often left out of probability calculations with the understanding that it’s implied, but I’m going to include it because it makes our assumptions more explicit.

Now, the probabilities we’re interested in:

P(S|K): our belief that the coin is a special heads-always coin, in light of our prior knowledge
P(S|H∧K): our belief that the coin is a special heads-always coin, in light of our prior knowledge and the knowledge that we observed the coin to come up heads in a coin toss

I’ve introduced some new notation. The vertical bar (|) is read as “given” and can be interpreted to mean “in light of the following.” The ∧ operator is new, too. It represents logical conjunction and can be read as “and.” For instance, A∧B represents the proposition that both propositions A and B are true; and P(S|H∧K) represents the probability that S is true, in light of both H and K being true.

The first probability, P(S|K), is sometimes called our prior probability because it represents how much we believe S before considering new evidence, when we have only our prior knowledge K to go on. The second, P(S|H∧K), is sometimes called our posterior probability because it represents how much we believe S after considering the new evidence H, too.

Now, how do we update our prior beliefs about the coin to arrive at our posterior beliefs, in light of having witnessed the coin toss come up heads? Let’s think about this updating process for a moment.

Our new beliefs about the plausibility of some proposition X, in light of new evidence E, ought to be the same as our prior beliefs about X, but adjusted to account for observing the new evidence. The adjustment factor, according to Bayes’ rule (and justified by Cox’s theorem), is given by a quotient: the plausibility of observing the new evidence, given that X is true, divided by the plausibility of observing the new evidence in any case. (And, of course, all of these adjustments occur in light of our prior knowledge K about the universe in general.)

As a pseudo-English equation, Bayes’ rule is surprisingly intuitive:

(new plausibility) = (old plausibility) × (evidence adjustment),

or, equivalently, using our probability notation:

P(X|E∧K) = P(X|K) × [ P(E|X∧K) / P(E|K) ].

The evidence adjustment itself may not seem so intuitive, but it does make sense. It is the quotient of two plausibilities: that of observing the evidence E given that the proposition X is true, and that of observing E regardless. You can think of the adjustment as quantifying how well the proposition uniquely explains the evidence.

For example, if the proposition being true is the only reasonable explanation for the evidence, observing the evidence ought to provide strong support for the proposition. If rain is the only way that every house in the neighborhood gets wet at the same time, knowing that every house in the neighborhood is currently getting wet provides strong support to the proposition that it is raining. On the other hand, knowing that somebody is carrying an umbrella provides weaker support because things besides rain can also explain that evidence: the anticipation of rain, for one.
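The update rule can be sketched in a few lines of Python. The function name `bayes_update` and all of the rain and umbrella numbers are invented for illustration only; they are assumptions, not measurements.

```python
from fractions import Fraction

def bayes_update(prior, likelihood_if_true, likelihood_overall):
    """Return P(X|E∧K) = P(X|K) × P(E|X∧K) / P(E|K)."""
    return prior * likelihood_if_true / likelihood_overall

# Hypothetical numbers, chosen only to contrast strong and weak evidence.
prior_rain = Fraction(1, 5)  # P(rain|K)

# Every house getting wet at once is explained almost only by rain,
# so P(E|K) is barely larger than P(E|rain∧K) × P(rain|K).
after_wet_houses = bayes_update(prior_rain, Fraction(1), Fraction(1, 4))

# An umbrella is also explained by the mere anticipation of rain,
# so P(E|K) is large and the adjustment is small.
after_umbrella = bayes_update(prior_rain, Fraction(9, 10), Fraction(3, 5))

print(after_wet_houses)  # 4/5
print(after_umbrella)    # 3/10
```

With these made-up inputs, the wet houses multiply the prior by 4, while the umbrella multiplies it by only 3/2.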

Getting back to my original question, I asked how much more you should believe my claim S (that the coin always comes up heads) after observing the evidence H (that the coin did come up heads when you tossed it). That is, I’m asking you to characterize the new plausibility in light of the old. The relative change between the two is given as follows:

[ (new plausibility) / (old plausibility) ] – 1

This quantity, we can see from Bayes’ rule, is merely our evidence adjustment less one. But to calculate this value, we’ll first need the probabilities the calculation is likely to require. Let’s see, what do we already know?

Representing our knowledge

First, our prior knowledge K informs us that a coin toss is understood to have only two potential outcomes: heads and tails. A coin toss is considered invalid, for example, if the coin stands on edge or is tossed into a chasm. Therefore, a coin toss must result in heads or tails:

P(H∨T|K) = 1,

and getting tails is the same as not getting heads:

P(T|K) = P(¬H|K).

More notation: we use ¬ to denote “not” and ∨ to denote logical disjunction, read “or.”

Next, we know that if the coin is special, it will come up heads when tossed:

P(H|S∧K) = 1.

But what if the coin is not special? In that case, do we have any reason to believe it is more likely to come up heads than tails, or vice versa? No. So, we must consider each proposition equally likely:

P(H|¬S∧K) = P(¬H|¬S∧K).

Further, because there are no other possibilities – the coin must come up heads or tails – their total probability must be one:

P(H|¬S∧K) + P(¬H|¬S∧K) = 1.

If the two probabilities are equal and must sum to one, each must be one half:

P(H|¬S∧K) = P(¬H|¬S∧K) = 1/2.

At this point, you may be tempted to object that our beliefs, being overly subjective, have led us to an unjustified conclusion. Even if the coin isn’t special, how can we say it has an even chance of coming up heads (or tails), in other words, that it’s fair? What justifies this claim?

In truth, we can’t justify it. But we didn’t make it, either.

Remember, we are not making any claims about the coin. Our equations make claims only about our knowledge of the coin. If the coin isn’t special, maybe it is still biased somehow. Even so, we have no reason to believe it is more likely to be biased one way or the other. Therefore, by symmetry, we can assign only one degree of belief to either proposition H or ¬H, and that is 1/2.
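One way to read the symmetry argument, sketched in Python: whatever we believe about a possible bias, if that belief treats heads and tails alike, the resulting degree of belief in H is 1/2. The helper `prob_heads` and both example beliefs are hypothetical, made up for this sketch.

```python
from fractions import Fraction

def prob_heads(belief_about_bias):
    """Average the heads-probability over a belief {bias: weight}."""
    assert sum(belief_about_bias.values()) == 1
    return sum(bias * weight for bias, weight in belief_about_bias.items())

# Hypothetical belief: the coin is heavily biased, direction unknown.
lopsided = {Fraction(1, 10): Fraction(1, 2), Fraction(9, 10): Fraction(1, 2)}

# Hypothetical belief: probably fair, possibly biased either way.
mostly_fair = {
    Fraction(1, 4): Fraction(1, 4),
    Fraction(1, 2): Fraction(1, 2),
    Fraction(3, 4): Fraction(1, 4),
}

print(prob_heads(lopsided))     # 1/2
print(prob_heads(mostly_fair))  # 1/2
```

Neither belief claims the coin is fair, yet both yield the same 1/2 for a single toss.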

The evidence-adjustment factor

With our prior beliefs represented as probability equations, let’s get back to computing that evidence adjustment.

(evidence adjustment) = P(H|S∧K) / P(H|K)

The numerator on the right-hand side we already know: P(H|S∧K) = 1.

The denominator, P(H|K), we do not. We must find some way to break it into terms that we do know.

The nice thing about propositions, like H, is that we can use Boolean logic to manipulate them. So, let’s break H into pieces that are more likely to be useful:

H = H∧(S ∨ ¬S) = (H∧S) ∨ (H∧¬S).

What I did was split the proposition that the coin comes up heads into a disjunction of two mutually exclusive propositions: that the coin comes up heads and is special, or that the coin comes up heads and is not special. The first term of the disjunction, however, can be simplified: if a coin is special, our prior knowledge already tells us that it must come up heads, so H adds nothing to S and H∧S reduces to S. Now we have,

H = S ∨ (H∧¬S), given K,

and, therefore,

P(H|K) = P((S ∨ (H∧¬S))|K).

We can break up the disjunction on the right-hand side using the sum rule for probabilities, which is given as:

P(A∨B) = P(A) + P(B) – P(A∧B).

Since our disjunction is of mutually exclusive propositions, the final term of the sum-rule expansion drops out; therefore,

P(H|K) = P(S|K) + P(H∧¬S|K).

Now let’s crack that new final term, P(H∧¬S|K). To do so, we’ll use the product rule for probabilities:

P(A∧B) = P(A|B) P(B) = P(B|A) P(A).

So:

P(H∧¬S|K) = P(H|¬S∧K) P(¬S|K).

And, already knowing that P(H|¬S∧K) = 1/2, we can simplify the right-hand side:

P(H∧¬S|K) = P(¬S|K)/2.

And substituting this reduction back into the equation for P(H|K) gives,

P(H|K) = P(S|K) + P(¬S|K)/2.

We can further simplify the equation by noting that P(S|K) + P(¬S|K) must equal 1 and, therefore, that the ¬S term can be rewritten in terms of S to give,

P(H|K) = P(S|K) + (1 – P(S|K))/2 = (1 + P(S|K)) / 2.

Now, to bring it all home, let’s plug these values into our evidence-adjustment formula:

(evidence adjustment)
= P(H|S∧K) / P(H|K)
= 1 / P(H|K)
= 1 / [(1 + P(S|K)) / 2]
= 2 / (1 + P(S|K)).
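As a quick check on the algebra, the following sketch computes P(H|K) from the sum and product rules and compares both results against their closed forms. The function names and the sample priors are assumptions made for this example.

```python
from fractions import Fraction

def prob_heads(p_special):
    """P(H|K) = P(S|K) + P(H|¬S∧K) × P(¬S|K), with P(H|¬S∧K) = 1/2."""
    return p_special + Fraction(1, 2) * (1 - p_special)

def evidence_adjustment(p_special):
    """P(H|S∧K) / P(H|K), with P(H|S∧K) = 1."""
    return 1 / prob_heads(p_special)

# Arbitrary sample priors; exact fractions avoid rounding noise.
for p in [Fraction(1, 100), Fraction(1, 4), Fraction(1, 2), Fraction(9, 10)]:
    assert prob_heads(p) == (1 + p) / 2
    assert evidence_adjustment(p) == 2 / (1 + p)
    print(p, prob_heads(p), evidence_adjustment(p))
```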

And that’s our evidence-adjustment factor. Now, what does it do?

Adjusting our beliefs in light of the new evidence

To better understand what the evidence adjustment does, let’s recall the original belief-adjustment equation:

(new plausibility) = (old plausibility) × (evidence adjustment)

So the adjustment factor nudges our initial degree of belief, whatever it may be, one way or the other, depending on the evidence. To see the effect of this nudge for various initial degrees of belief, consider the following plot:

Looking at the plot, let’s see if the nudge agrees with our intuition. First, if we were absolutely convinced that the coin is (or is not) special, no amount of evidence should sway our beliefs. Looking at the plot, we see that when our prior probability is 0 or 1, so is our adjusted (posterior) probability, exactly what we expected.

But what if our initial knowledge is complete ignorance about the coin being special? In that case, upon seeing the coin toss, our prior probability of 1/2 gets nudged to the posterior probability of 2/3 – toward the belief that the coin is indeed special. Again, it’s what we would expect.
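The mapping from prior to posterior can also be tabulated directly. This is a minimal sketch; the function name `posterior` and the particular priors listed are choices made for this example.

```python
from fractions import Fraction

def posterior(prior):
    """P(S|H∧K) = P(S|K) × 2 / (1 + P(S|K))."""
    return prior * 2 / (1 + prior)

priors = [Fraction(0), Fraction(1, 10), Fraction(1, 2), Fraction(9, 10), Fraction(1)]
for prior in priors:
    print(f"prior {str(prior):>5}  ->  posterior {posterior(prior)}")
```

The endpoints stay fixed (0 maps to 0 and 1 maps to 1), and a prior of 1/2 maps to 2/3.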

In fact, the evidence adjustment is always going to push us toward confirming the belief that the coin is special because the evidence supports that belief. The force of that push, however, depends on how surprising we find the evidence, that is, how much it challenges our prior beliefs. The following plot shows this relationship:

Note that the evidence provides the strongest push – a factor approaching 2 – when our prior knowledge makes us doubt most strongly that the coin is special. At the other extreme, when we are already convinced that the coin is special, observing that the coin comes up heads when tossed isn’t surprising at all, and correspondingly the evidential push of that observation is nothing: an adjustment factor of unity.
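The link between surprise and push can be made concrete: the adjustment factor is the reciprocal of P(H|K), the plausibility we gave to seeing heads beforehand. The sketch below uses sample priors that are assumptions chosen to span the range.

```python
from fractions import Fraction

def prob_heads(prior):
    """P(H|K) = (1 + P(S|K)) / 2: how expected a head is beforehand."""
    return (1 + prior) / 2

priors = [Fraction(1, 1000), Fraction(1, 10), Fraction(1, 2), Fraction(9, 10), Fraction(1)]
factors = [1 / prob_heads(prior) for prior in priors]

for prior, factor in zip(priors, factors):
    print(
        f"prior {float(prior):.3f}  "
        f"P(H|K) {float(prob_heads(prior)):.4f}  "
        f"factor {float(factor):.4f}"
    )

# The less expected the head, the larger the push; it never reaches 2.
assert all(a > b for a, b in zip(factors, factors[1:]))
assert all(1 <= factor < 2 for factor in factors)
```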

Answering the original question

Finally, with our evidence adjustment well in hand, we can answer the original question: After seeing the outcome of this single coin toss, how much more should you believe my claim that the coin always comes up heads, compared to what you believed before the coin toss?

The answer, we reasoned earlier, is the evidence adjustment less one:

(relative plausibility increase)
= [evidence adjustment] – 1
= [2 / (1 + P(S|K))] – 1
= [2 / (1 + (prior plausibility))] – 1
= (1 – (prior plausibility)) / (1 + (prior plausibility)).

So, if we let p represent our prior degree of belief that the coin is a special, heads-always coin, we should be

100% × (1 – p) / (1 + p)

more confident in our belief after seeing the coin come up heads when tossed.
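The final formula is easy to evaluate for a few priors. In this sketch the function name `relative_increase` and the sample values of p are assumptions for illustration.

```python
from fractions import Fraction

def relative_increase(p):
    """(new plausibility) / (old plausibility) - 1 = (1 - p) / (1 + p)."""
    return (1 - p) / (1 + p)

for p in [Fraction(1, 100), Fraction(1, 2), Fraction(99, 100)]:
    percent = float(100 * relative_increase(p))
    print(f"prior {float(p):.2f}: {percent:.1f}% more confident")
```

For a prior of 1/2, the relative increase is 1/3, or about 33%, which matches the move from 1/2 to 2/3.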

And that’s the answer.
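As a sanity check, the result can be compared against a rough simulation. This sketch models a non-special coin as landing heads half the time, which is a stand-in for our 1/2 degree of belief rather than a claim about any physical coin. The function name, trial count, and seed are arbitrary choices.

```python
import random

def simulate_posterior(prior, trials=200_000, seed=0):
    """Estimate P(S|H∧K) by simulating many coins, one toss each."""
    rng = random.Random(seed)
    heads = 0
    special_and_heads = 0
    for _ in range(trials):
        is_special = rng.random() < prior
        # A special coin always lands heads; otherwise heads with belief 1/2.
        came_up_heads = is_special or rng.random() < 0.5
        if came_up_heads:
            heads += 1
            special_and_heads += is_special
    return special_and_heads / heads

prior = 0.5
estimate = simulate_posterior(prior)
exact = 2 * prior / (1 + prior)
print(f"simulated {estimate:.3f}, exact {exact:.3f}")
```

Among the simulated tosses that came up heads, the fraction involving a special coin should land close to 2/3 when the prior is 1/2.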

But there are other ways of arriving at it. One of the more convenient is to use odds instead of probabilities. But let’s save that discussion for next time.

Update: Here’s the promised discussion: Odds and the evidence of a single coin toss.