Astronomer Carl Sagan famously said “Extraordinary claims require extraordinary evidence.” I think it’s useful to extend this to the distinctly less elegant “surprising findings are less likely to be true, and thus require a higher standard of evidence.”
I started thinking more about what influences the reliability of a scientific result when analyses of my post-doc data weren’t lining up with published findings from other studies of the same species. When I encountered this problem with reproducibility, the causes I first focused on were old standbys like multiple tests of the same hypothesis driving up type I error and the flexibility to interpret an array of different results as support for a hypothesis. What I wasn’t thinking about was low prior probability – if we test an unlikely hypothesis, support for that hypothesis (e.g., a statistically significant result) is more likely to be a false positive than if we’re testing a likely hypothesis. Put another way, a hypothesis that would be surprising if true is, in fact, less likely to be true if it contradicts well-supported prior empirical understanding or if it is just one of many plausible but previously unsupported alternate hypotheses. Arguments that I’ve heard against taking prior probability into account are that it isn’t ‘fair’ to impose different standards of evidence on different hypotheses, and that it introduces bias. I think the risk of bias is real (we probably overestimate the probability of our own hypotheses being true), but I think the argument about fairness is misleading. Let’s consider an example where we have a pretty good idea of prior probability.
A couple of months ago, I saw some photos on Twitter of a cat in the genus Lynx running down the street in Moscow, Idaho, a university town not that far from where I live in nearby Washington State. The tweet asked ‘bobcat or lynx?’ Bobcats (Lynx rufus) are fairly common in this part of North America, but Canada lynx (Lynx canadensis) are extremely rare, and it was exciting to contemplate the possibility that there was a Canada lynx right in the midst of Moscow. Most of the folks who replied to the tweet looked at the photos, decided that it looked a bit more like a lynx (spots somewhat indistinct, hind legs possibly longer, tail tip possibly lacking white tuft) and thus voted ‘lynx’. But these folks seemed to be ignoring prior probability. Here in Washington State, there are probably on the order of 1000 bobcats for every lynx, so let’s assume that a few miles across the border in Idaho it is still approximately 1000 times more likely that a bobcat is running down the streets of Moscow than a lynx. If the people weighing in on Twitter are good at distinguishing bobcats from lynx and only mistakenly call a bobcat a lynx 1 in 100 times, that means out of 1000 bobcat photos they might see, they’re going to mistakenly call 10 of those bobcats ‘lynx’. However, if bobcats and lynx are photographed in proportion to their abundance, they are going to see only one lynx photo for those 1000 bobcat photos. Thus they’re going to make 10 times as many false positive ‘lynx’ identifications as actual lynx identifications (I’m ignoring the possibility of a false negative, but the results are about the same assuming people are good at identifying lynx). To make the same number of false positive ‘lynx’ identifications as actual lynx identifications, they’d have to mistakenly call a bobcat a ‘lynx’ only 1 in 1000 times. In other words, even if they were 99.9% reliable at recognizing bobcats, they’d still have only a 50:50 chance of being correct when their call was ‘it’s a lynx’.
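For anyone who wants to play with these numbers, here is a minimal sketch of the same arithmetic in Python. The function name and the sensitivity parameter are illustrative choices, not part of any library, and the sketch assumes (as the example above does) that a real lynx is never mistaken for a bobcat and that photos arrive in proportion to abundance.

```python
def prob_correct_when_called_lynx(prior, false_positive_rate, sensitivity=1.0):
    """Probability that a 'lynx' call is actually a lynx.

    prior: share of all photos that really show a lynx
    false_positive_rate: share of bobcat photos mistakenly called 'lynx'
    sensitivity: share of lynx photos correctly called 'lynx'
                 (assumed to be 1.0, i.e. false negatives are ignored)
    """
    true_calls = prior * sensitivity
    false_calls = (1.0 - prior) * false_positive_rate
    return true_calls / (true_calls + false_calls)


# Assumption from the example: one lynx photo for every 1000 bobcat photos.
lynx_prior = 1 / 1001

# Compare a 1-in-100 error rate with a 1-in-1000 error rate.
for error_rate in (0.01, 0.001):
    p = prob_correct_when_called_lynx(lynx_prior, error_rate)
    print(f"bobcat miscalled 'lynx' at rate {error_rate}: "
          f"P(really a lynx | called 'lynx') = {p:.3f}")
```

With a 1-in-100 error rate, a ‘lynx’ call is right about 1 time in 11; with a 1-in-1000 error rate it is right about half the time. Changing lynx_prior shows how quickly the answer shifts as the rare outcome becomes more or less rare.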
So, when someone comes to me in Washington State and says “I saw a bobcat”, I’m more inclined to believe them than if someone comes to me and says “I saw a lynx.” I have different standards of evidence for these two outcomes, and that’s the way it should be. If I didn’t, I’d think there were far more lynx than there actually are. This mistake is what caused the California Department of Fish and Wildlife to conclude they had a healthy population of wolverines when in fact they had none.
So, if we have a notion of prior probability, it’s appropriate to adjust our confidence in results accordingly when we’re evaluating evidence, for instance, while reading scientific papers. If we encounter a finding that makes us say ‘that’s surprising’, then we ought to demand more evidence than if we encounter a finding that makes us say ‘that’s what we should have expected based on all this other stuff we’ve seen already’.
The problem, of course, is that unlike the lynx and bobcat situation, we rarely have a precise prior probability. We don’t even know what the typical range of prior probabilities is in ecology and evolutionary biology research, and since we have reasons to believe that there’s quite a bit of bias in the literature, we can’t easily figure this out. Further, I suspect that prior probabilities vary among sub-fields. My feeling is that in behavioral ecology, my primary subfield, we probably test hypotheses with lower prior probabilities than some other subfields of ecology or evolutionary biology. Where does that feeling come from? I like the ‘clickable headline’ test. The more ‘clickable’ the idea, the more likely it is to be surprising, and things that surprise us are presumably (on average) ideas that contradict existing evidence or understanding. This sort of clickable headline seems common (not ubiquitous, just common) in behavioral ecology, but I’m quick to admit that this is just an untested opinion. I’d like to see data. And regardless, it’s not wrong to test clickable hypotheses. If we didn’t test surprising hypotheses, we wouldn’t push the boundaries of our knowledge. However, I think we’d be better off if we treated clickable findings more cautiously. As authors, we can acknowledge the need for more evidence when we have a surprising finding, and as reviewers we can ask authors to provide these caveats. Further, before we re-direct our research program or advocate for major policy change in response to some new surprising result, we should accumulate particularly robust evidence, including one or more pre-registered replication studies (ideally registered replication reports, which are available to ecologists and evolutionary biologists at Royal Society Open Science).
p.s. Talking about prior probabilities means I should probably be explicitly discussing Bayes. I haven’t done so mostly because I’m not very knowledgeable about Bayesian statistics, but also because I think that just taking a step towards awareness of this issue can improve our inferential practices even if we’re using frequentist tools.