An odd post by “news” at UD raises yet again the issue of Fisherian p values – and reveals yet again that many ID proponents don’t understand them.
She (I assume it is Denyse) writes:
Further to “Everyone seems to know now that there’s a problem in science research today” and “At a British Journal of Medicine blog, a former editor says, medical research is still a scandal,” Ronald Fisher’s p-value measure, a staple of research, is coming under serious scrutiny.
Many will remember Ronald Fisher (1890–1962) as the early twentieth century Darwinian who reconciled Darwinism with Mendelian genetics, hailed by Richard Dawkins as the greatest biologist since Darwin. His original idea of p-values (a measure of whether an observed result can be attributed to chance) was reasonable enough, but over time the dead hand got hold of it:
Many at UD may also “remember” Ronald Fisher as the early twentieth century statistician who inspired William Dembski’s eleP(T|H)ant.
In Specification: The Pattern that Signifies Intelligence, Dembski writes:
In Fisher’s approach to testing the statistical significance of hypotheses, one is justified in rejecting (or eliminating) a chance hypothesis provided that a sample falls within a prespecified rejection region (also known as a critical region).6 For example, suppose one’s chance hypothesis is that a coin is fair. To test whether the coin is biased in favor of heads, and thus not fair, one can set a rejection region of ten heads in a row and then flip the coin ten times. In Fisher’s approach, if the coin lands ten heads in a row, then one is justified in rejecting the chance hypothesis.
Fisher’s approach to hypothesis testing is the one most widely used in the applied statistics literature and the first one taught in introductory statistics courses. Nevertheless, in its original formulation, Fisher’s approach is problematic: for a rejection region to warrant rejecting a chance hypothesis, the rejection region must have sufficiently small probability. But how small is small enough? Given a chance hypothesis and a rejection region, how small does the probability of the rejection region have to be so that if a sample falls within it, then the chance hypothesis can legitimately be rejected? Fisher never answered this question. The problem here is to justify what is called a significance level such that whenever the sample falls within the rejection region and the probability of the rejection region given the chance hypothesis is less than the significance level, then the chance hypothesis can be legitimately rejected.
More formally, the problem is to justify a significance level α (always a positive real number less than one) such that whenever the sample (an event we will call E) falls within the rejection region (call it T) and the probability of the rejection region given the chance hypothesis (call it H) is less than α (i.e., P(T|H) < α), then the chance hypothesis H can be rejected as the explanation of the sample. In the applied statistics literature, it is common to see significance levels of .05 and .01. The problem to date has been that any such proposed significance levels have seemed arbitrary, lacking “a rational foundation.”7
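To put numbers on Dembski’s coin example, here is a minimal sketch in Python (my illustration, not anything from the paper) of the Fisherian calculation: the p value is simply the probability of the rejection region under the chance hypothesis.

```python
# Dembski's coin example, worked through in Fisher's terms.
# Chance hypothesis H: the coin is fair. Rejection region T: ten heads in ten flips.

p_heads = 0.5                  # probability of heads under H
n_flips = 10

p_value = p_heads ** n_flips   # P(T|H) = P(ten heads in a row | fair coin)
print(p_value)                 # 0.0009765625, just under 1 in 1000
```

At the .05 or .01 significance levels just mentioned, ten heads in a row would therefore warrant rejecting the fair-coin hypothesis.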
The only problem Dembski sees in Fisher’s approach, which he otherwise adopts, is that of coming up with a sufficiently conservative rejection threshold that would enable us to reject a hypothesis as “impossible”, and his somewhat bizarre solution was an arbitrarily conservative threshold based on his misunderstanding of a paper by Seth Lloyd. Most scientists are perfectly happy with something much less stringent: 5, or even 3 sigma, rather than Dembski’s 23.
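For readers who don’t think in sigmas, here is a quick sketch (using scipy’s normal tail function; the numbers are purely for orientation) of what those thresholds mean as one-sided p values.

```python
from scipy.stats import norm

# One-sided tail probability corresponding to each sigma threshold.
for sigma in (3, 5):
    print(f"{sigma} sigma: p ~ {norm.sf(sigma):.2e}")

# 3 sigma: p ~ 1.35e-03
# 5 sigma: p ~ 2.87e-07
```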
But the problem of the appropriate alpha criterion is not the problem articulated in the Nature article by Regina Nuzzo that Denyse cites, and Dembski’s “solution” would not solve it. What Nuzzo draws attention to is that the p value that is the output of Fisherian null hypothesis testing is not the probability we want to know. Jacob Cohen’s classic article, The Earth is round (p < .05), is 20 years old this year, and IMO should be required reading for all ID proponents (and indeed for anyone who ever uses null hypothesis testing), so Denyse is late to the party.
But Cohen’s point is worth repeating. Nuzzo writes:
One result is an abundance of confusion about what the P value means4. Consider Motyl’s study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.
Exactly. This is precisely the problem that underlies the entire ID project. Both the concept of CSI (and friends) and the concept of Irreducible Complexity depend on the principle that phenomenon X can be demonstrated to be improbable under some null, i.e. to have a low Fisherian p value. But what we want to know, as Cohen points out, is not how probable our observations (or “Target”, as Dembski calls them) are if a hypothesis is true, but how probable it is that the hypothesis is true. And Fisherian hypothesis testing simply does not give you that probability. And worse: a Fisherian p value is only as good as the definition of your null. It only tells you that your observations are unlikely under the null that you tested, which brings us back to the good old eleP(T|H)ant in the room.
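Here is a minimal sketch of that distinction, with purely illustrative numbers (a 1-in-20 prior that a real effect exists, 80% power, the usual alpha of .05): the probability that a “significant” result is a false alarm is nowhere near the p value.

```python
# How likely is a "significant" result (p < alpha) to be a false alarm?
# Illustrative assumptions only, not measured quantities:
prior_real = 0.05   # prior probability that a real effect exists
power      = 0.80   # P(significant | real effect)
alpha      = 0.05   # P(significant | no real effect) -- the Fisherian criterion

p_sig_and_real  = prior_real * power          # significant and real
p_sig_and_noise = (1 - prior_real) * alpha    # significant and spurious

# Bayes' theorem: P(no real effect | significant result)
false_alarm = p_sig_and_noise / (p_sig_and_noise + p_sig_and_real)
print(f"P(false alarm | p < {alpha}) = {false_alarm:.2f}")   # about 0.54
```

Change the prior and the answer changes; the p value, which knows nothing about the prior, stays at .05. That is Cohen’s point, and Nuzzo’s.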
The infinitesimally tiny p values that pass the stringent Dembskian alpha criterion of 23 sigma (or 500 bits, or whatever) as a result of CSI (or other alphabet soup) computations are Fisherian p values. Those p values are not the impossibly low probability that “evolution did it”, nor is 1-P the unfeasibly high probability that the pattern in question was intelligently designed. P is simply the probability that the pattern was the result of the process specified in H, where, to quote Dembski, “H, here, is the relevant chance hypothesis that takes into account Darwinian and other material mechanisms”.
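For concreteness, here is the Fisherian calculation behind that 500-bit figure, sketched for 500 coins all landing heads (assuming independent fair tosses, which is the only chance hypothesis this number speaks to):

```python
# The p value behind the "500 bits" figure: 500 coins all heads, under
# H = "each coin was independently tossed fair".
p_value = 0.5 ** 500
print(f"P(500 heads | fair tossing) = {p_value:.2e}")   # about 3.05e-151

# This is P(T|H) for that single chance hypothesis. It is not the probability
# that "evolution did it", and 1 - p_value is not the probability of design;
# any other chance hypothesis H needs its own P(T|H).
```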
Which is why, Barry, on seeing a table laid with 500 coins, all heads up, we do not conclude that they were intelligently laid. What we do is reject the hypothesis that they got there by a process of fair coin-tossing. And this is not a minor quibble. It is precisely the point of the Nature article that your newsdesk has commended to you.