Open Science – what’s the way (for Australia)? – Notes from a panel discussion

by Malgorzata (Losia) Lagisz, i-deel, EERC, UNSW, Australia

We have all heard about Open Science, and particularly about Plan S, which was announced in Europe last year (read more here). On 14th February 2019, I had an opportunity to be a panellist in a discussion on what it all could mean for Australia. The panel discussion was organised by Springer Nature as part of the ALIA conference, which is the main meeting for librarians and information specialists in Australia and New Zealand (I realised these are mostly lovely middle-aged ladies, although they said more men are starting to join the profession with the new technologies, closing the “gender gap”…)
 
The discussion panel itself was made up of different stakeholders and actors in scholarly communication, including: the Director of Policy and Integrity at the ARC, the Institutional Engagement Manager and the Head of Data Publishing from Springer Nature, an Associate Librarian from Scholarly Information Services & Campus Libraries at VU, and me as “the centrally important view of the researcher” (that’s at least how my presence was justified…)
 
After a short introduction we had to answer three pre-determined questions:

  1. “Is Plan S the right plan for ANZ in the short term? And long term?”
  2. “What is the role of institutional repositories in a scholarly publishing system that is moving towards gold open access as the preferred model for funders and authors?”
  3. “Outline an example of an open scholarship or open data initiative and why this underlines the benefits of open research?”


We did not reach a strong conclusion on any of these questions, but there were a few emerging insights (at least for me):

  • “Open access” is a general term encompassing publications that are freely available (whether permanently archived in a public repository or published with more traditional publishers) and licensed in a way that allows broad use and reuse. While Plan S requires all publicly funded research outcomes to be immediately made available to the public and for the benefit of the public, it also places restrictions on where and how these outcomes have to be published.
  • Depositing in institutional repositories is currently mandatory in most Australian research institutions and is also the preferred option for freely sharing research outputs. Plan S acknowledges the role of repositories for archiving research but does not see them as the main publishing venue. Importantly, institutional repositories and many other free non-profit repositories (like preprint servers) are not likely to be compliant with Plan S requirements.
  • Another concern is related to academic freedom – the right of academics to decide where to publish their findings – and the impacts on the researchers themselves. Especially in the early stages of Plan S, there will not be many journals fully compliant with the requirements, and thus the choice of publishing venues will be reduced. This may mean having to publish in less reputable or less impactful journals than the researchers would otherwise submit their work to. Such restrictions may strain international collaborations and also affect the career prospects of researchers.
  • Plan S would also negatively affect research societies and their role in fostering good quality research and research careers. Many societies earn the bulk of their income from the subscription-based journals they publish. Flipping to a full open access publishing model requires significant financial resources, and many societies will not be able to afford this. If they don’t flip, they may lose submissions, reputation and income (for more details read this opinion).
  • It is not clear whether under Plan S there will be savings in the overall research costs. If not-for-profit publishers and repositories get marginalised, the overall bill might actually be higher, with the costs shifting from reading/access fees to publishing fees (and it is not quite clear how the latter will be covered).
  • Finally, the change to Open Access and Open Science should not be rushed. Taking time will allow publishers and researchers to figure out the safest transition path. Changing the mindsets of academics via education, not enforcement, will be an important factor. It will also be easier with a new generation of young scientists joining academia, more ready to embrace open science.

EcoEvoRxiv launched!

I am very excited to announce the launch of EcoEvoRxiv – a preprint server where ecologists and evolutionary biologists can upload their forthcoming papers. I am aware that many ecologists and evolutionary biologists already use the preprint server bioRxiv, and that’s great! I have used bioRxiv several times myself. EcoEvoRxiv is a more targeted server, and it is convenient because a preprint at EcoEvoRxiv can be seamlessly integrated with a project that uses the services of the Open Science Framework (OSF). My group, like others, uses OSF for project management, so this is a great feature of EcoEvoRxiv.

There are several reasons I have taken on the challenge of kickstarting and leading EcoEvoRxiv with my colleagues, other than the reasons I already mentioned.

1. Having preprints online and citable is especially wonderful for my students and postdocs (and any other young scientists out there). This is because their potential employers can immediately read their work online. Last year, I did a reference for the Human Frontier Science Program (HFSP), and they asked whether the candidate had preprints in addition to published papers (a very nice change).

2. It is part of the Transparency in Ecology and Evolution (TEE) movement, so I’ve got a lot of support from Fiona and Tim (the co-founders of TEE). We believe that EcoEvoRxiv will not only raise the awareness of preprint servers (including bioRxiv) but also of other transparency activities as part of the credibility revolution.

3. The biggest reason is probably that I just cannot say NO when I get asked by people (but in 2019, I will be saying a record number of NOs – I am making a tally chart so that I can report to my mother, who skypes me from Japan regularly, at the end of the year). Nonetheless I am very glad to say YES to EcoEvoRxiv.

We hope EcoEvoRxiv will encourage more ecologists and evolutionary biologists to put their preprints online. We have more information at a dedicated information website (ecoevorxiv.com). As you will find out, we have a wonderful team of committee members and ambassadors from 11 different countries, helping me to launch EcoEvoRxiv. EcoEvoRxiv wants your preprints (and also postprints)!

Here I would like to acknowledge people from the Center for Open Science (COS; especially Rusty, Rebecca, David and Matt – thank you) for their support in launching EcoEvoRxiv.

Join the Credibility Revolution!

Last week (14-15 Nov), I went to Melbourne for a workshop (“From Replication Crisis to Credibility Revolution”). The workshop was hosted by my collaborator and “credibility revolutionary” Fiona Fidler.

I suspect many workshops and mini-conferences of this nature are popping up all over the world, as many researchers are very much aware of the “reproducibility crisis”. But what was unique about this one was its interdisciplinary nature; we had philosophers, psychologists, computer scientists, lawyers, pharmacologists, oncologists, statisticians, ecologists and evolutionary biologists (like myself).

I really like the idea of calling the “reproducibility crisis” the “credibility revolution” (hence the title). A speaker at the workshop, Glenn Begley, wants to call it an “innovation opportunity” (he wrote this seminal comment for Nature). What a cool idea! And these re-namings make things a lot more positive than the doom-and-gloom feel of “replicability crisis”. Indeed, there are a lot of positive movements toward Open Science and Transparency happening to remedy the current ‘questionable’ practices.

Although I live in Sydney, I was also in Melbourne early last month (4-5 Oct) for a small conference. This is because Tom Stanley invited me over, as an evolutionary biologist, to give a talk on meta-analysis to a bunch of economists who love meta-analyses. To my surprise, I had a lot of fun chatting with meta-enthusiastic economists.

Tom is not only an economist but also a credibility revolutionary, like Fiona. He has asked me to invite ecologists and evolutionary biologists to comment on his blog about the credibility revolution. It is an excellent read. And if you can comment and join the conversation, Tom will appreciate it a lot, and it will get the conversation going. Disciplines need to unite to make this revolution successful, or to make the most of this innovation opportunity. So join the credibility revolution! (Meanwhile, I am now off to Japan to talk more about meta-analysis, and sample nice food – I will join the revolution once I am back.)

Prior probability and reproducibility

Astronomer Carl Sagan famously said “Extraordinary claims require extraordinary evidence.” I think it’s useful to extend this to the distinctly less elegant “surprising findings are less likely to be true, and thus require a higher standard of evidence.”

I started thinking more about what influences the reliability of a scientific result when analyses of my post-doc data weren’t lining up with published findings from other studies of the same species. When I encountered this problem with reproducibility, the causes I first focused on were old standbys like multiple tests of the same hypothesis driving up type I error and the flexibility to interpret an array of different results as support for a hypothesis.  What I wasn’t thinking about was low prior probability – if we test an unlikely hypothesis, support for that hypothesis (e.g., a statistically significant result) is more likely to be a false positive than if we’re testing a likely hypothesis. Put another way, a hypothesis that would be surprising if true is, in fact, less likely to be true if it contradicts well-supported prior empirical understanding or if it is just one of many plausible but previously unsupported alternate hypotheses. Arguments that I’ve heard against taking prior probability into account are that it isn’t ‘fair’ to impose different standards of evidence on different hypotheses, and that it introduces bias. I think the risk of bias is real (we probably overestimate the probability of our own hypotheses being true), but I think the argument about fairness is misleading. Let’s consider an example where we have a pretty good idea of prior probability.

A couple of months ago, I saw some photos on Twitter of a cat in the genus Lynx running down the street in Moscow, Idaho, a university town not that far from where I live in nearby Washington State. The tweet asked ‘bobcat or lynx?’ Bobcats (Lynx rufus) are fairly common in this part of North America, but Canada lynx (Lynx canadensis) are extremely rare, and it was exciting to contemplate the possibility that there was a Canada lynx right in the midst of Moscow. Most of the folks who replied to the tweet looked at the photos, decided that it looked a bit more like a lynx (spots somewhat indistinct, hind legs possibly longer, tail tip possibly lacking white tuft) and thus voted ‘lynx’. But these folks seemed to be ignoring prior probability. Here in Washington State, there are probably something on the order of 1000 bobcats for every lynx, so let’s assume that a few miles across the border in Idaho it is still approximately 1000 times more likely that a bobcat is running down the streets of Moscow than a lynx. If the people weighing in on Twitter are good at distinguishing bobcats from lynx and only mistakenly call a bobcat a lynx 1 in 100 times, that means out of 1000 bobcat photos they might see, they’re going to mistakenly call 10 of those bobcats ‘lynx’. However, if bobcats and lynx are photographed in proportion to their abundance, they are going to see only one lynx photo for those 1000 bobcat photos. Thus they’re going to make 10 times as many false positive ‘lynx’ identifications as actual lynx identifications (I’m ignoring the possibility of a false negative, but the results are about the same assuming people are good at identifying lynx). To make the same number of false positive ‘lynx’ identifications as actual lynx identifications, they’d have to mistakenly call a bobcat a ‘lynx’ only 1 in 1000 times. In other words, if they were 99.9% reliable, they’d still have only a 50:50 chance of being correct when their call was ‘it’s a lynx’. So, when someone comes to me in Washington State and says “I saw a bobcat”, I’m more inclined to believe them than if someone comes to me and says “I saw a lynx.” I have different standards of evidence for these two outcomes, and that’s the way it should be. If I didn’t, I’d think there were far more lynx than there actually are. This mistake is what caused the California Department of Fish and Wildlife to conclude they had a healthy population of wolverines when in fact they had none.
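For readers who like to see the arithmetic spelled out, here is a minimal back-of-the-envelope sketch of the Bayesian logic behind this example, using the illustrative numbers from the paragraph above (a 1000:1 bobcat:lynx prior and 99% observer accuracy), not real survey data:

```python
# A back-of-the-envelope Bayes calculation with the illustrative numbers above
# (1000:1 bobcat:lynx prior, 99% observer accuracy); not real survey data.
prior_lynx = 1 / 1001            # roughly 1 lynx photographed per 1000 bobcats
prior_bobcat = 1 - prior_lynx
p_call_lynx_given_lynx = 0.99    # observers usually recognise a real lynx
p_call_lynx_given_bobcat = 0.01  # ...but mislabel 1 in 100 bobcats as 'lynx'

p_call_lynx = (p_call_lynx_given_lynx * prior_lynx
               + p_call_lynx_given_bobcat * prior_bobcat)
p_lynx_given_call = p_call_lynx_given_lynx * prior_lynx / p_call_lynx
print(f"P(actually a lynx | called 'lynx') = {p_lynx_given_call:.2f}")  # about 0.09
```

With a 1-in-100 error rate only about 9% of ‘lynx’ calls would be correct; rerun it with a 1-in-1000 error rate and the answer rises to roughly 50%, matching the 50:50 figure described above.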

So, if we have a notion of prior probability, it’s appropriate to adjust our confidence in results accordingly when we’re evaluating evidence, for instance, while reading scientific papers. If we encounter a finding that makes us say ‘that’s surprising’, then we ought to demand more evidence than if we encounter a finding that makes us say ‘that’s what we should have expected based on all this other stuff we’ve seen already’.

The problem, of course, is that unlike the lynx and bobcat situation, we rarely have a precise prior probability. We don’t even know what the typical range of prior probabilities is in ecology and evolutionary biology research, and since we have reasons to believe that there’s quite a bit of bias in the literature, we can’t easily figure this out. Further, I suspect that prior probabilities vary among sub-fields. My feeling is that in behavioral ecology, my primary subfield, we probably test hypotheses with lower prior probabilities than some other subfields of ecology or evolutionary biology. Where does that feeling come from? I like the ‘clickable headline’ test. The more ‘clickable’ the idea, the more likely it is to be surprising, and things that surprise us are presumably (on average) ideas that contradict existing evidence or understanding. This sort of clickable headline seems common (not ubiquitous, just common) in behavioral ecology, but I’m quick to admit that this is just an untested opinion. I’d like to see data. And regardless, it’s not wrong to test clickable hypotheses. If we didn’t test surprising hypotheses, we wouldn’t push the boundaries of our knowledge. However, I think we’d be better off if we treated clickable findings more cautiously. As authors, we can acknowledge the need for more evidence when we have a surprising finding, and as reviewers we can ask authors to provide these caveats. Further, before we re-direct our research program or advocate for major policy change in response to some new surprising result, we should accumulate particularly robust evidence, including one or more pre-registered replication studies (ideally registered replication reports, which are available to ecologists and evolutionary biologists at Royal Society Open Science).

p.s. Talking about prior probabilities means I should probably be explicitly discussing Bayes. I haven’t done so mostly because I’m not very knowledgeable about Bayesian statistics, but also because I think that just taking a step towards awareness of this issue can improve our inferential practices even if we’re using frequentist tools.

An iconic finding in behavioral ecology fails to reproduce

Just how reproducible are studies in ecology and evolutionary biology? We don’t know precisely, but a new case study in the journal Evolution shows that even textbook knowledge can be unreliable. Daiping Wang, Wolfgang Forstmeier, and co-authors have convinced me of the unreliability of an iconic finding in behavioral ecology, and I hope their results bring our field one step closer to a systematic assessment of reproducibility.

When I was doing my PhD, one of the hottest topics in behavioral ecology was the evolutionary origin of sexual ornaments. A tantalizing clue was the existence of latent female preferences – preferences that females would express if a mutation came along that produced the right male proto-ornament. One of the first hints of latent preferences was detected by Nancy Burley in female zebra finches by fitting male finches with leg bands of different colors. It turned out that a red band was attractive, a green band unattractive. Multiple studies appeared to support the original finding, and the story entered textbooks.

But now it’s non-reproducible textbook knowledge. Wang et al. report on multiple robust replication attempts that failed to reproduce this effect. So where does this leave us? It could be that the original effect was real, but contingent on some as-yet-undiscovered moderator variable. That hypothesis can never be disproven, but if someone wants to make that argument, it’s on them to identify the mysterious moderator and show how the color leg band effect can be reproduced. Until then, I’m adding the color band attractiveness effect to the list of things I learned in graduate school that were wrong.

By the way, in this case, ‘not reproducible’ means an average effect size that approximates zero. This is not just a case of one study crossing a significance threshold and another failing to cross the threshold. The sum of these replications looks exactly like the true absence of an effect.

It’s also worth noting that the distribution of published results from the lab that originally discovered the color band effect follows the pattern expected from various common research practices that unintentionally increase the publication of false positives and inflated effect sizes. I don’t mention this as an accusation, but rather as a reminder to the community that if we don’t take deliberate steps to minimize bias, it’s likely to creep in and reduce our reproducibility.

A conversation: Where do ecology and evolution stand in the broader ‘reproducibility crisis’ of science?

In this post, I float some ideas that I’ve had about the ‘reproducibility crisis’ as it is emerging in ecology and evolutionary biology, and how this emergence may or may not differ from what is happening in other disciplines, in particular psychology. Two other experts on this topic (Fiona Fidler and David Mellor) respond to my ideas, and propose some different ideas as well. This process has led me to reject some of the ideas I proposed, and has led me to what I think is a better understanding of the similarities (and differences) among disciplines.

Here’s why my co-authors are experts on this topic (more so than I am):

Fiona’s PhD thesis was about explaining disciplinary differences between psychology, ecology and medicine in their responses to criticism of null hypothesis significance testing, and she’s been interacting with researchers from multiple disciplines for 20 years. She often works closely with ecologists, but she has the benefit of an outsider’s perspective.

David has a PhD in behavioral ecology, and now works at the Center for Open Science, interacting on a daily basis with researchers from a wide range of disciplines as journals and other institutions adopt transparency standards.

TP: Several years ago, Shinichi Nakagawa and I wrote a short opinion piece arguing that ecology and evolutionary biology should look to other disciplines for ideas to reduce bias and improve the reliability of our published literature. We had become convinced that bias was common in the literature. Evidence of bias was stacking up in other disciplines as well, and the risk factors in those disciplines seemed widespread in ecology and evolution. People in those other disciplines were responding with action. In psychology, these actions included new editorial policies in journals, and major efforts to directly assess reproducibility with large-scale replications of published studies. Shinichi and I were hoping to see something similar happen in ecology and evolutionary biology.

To an important extent ecologists and evolutionary biologists have begun to realize there is a problem, and they have started taking action. Back in 2010, several journals announced they would require authors to publicly archive the data behind their reported results. This wasn’t a direct response to concerns about bias, but it was an important step towards ecologists and evolutionary biologists accepting the importance of transparency. In 2015 representatives from about 30 major journals in ecology and evolutionary biology joined advocates for increased transparency to discuss strategies for reducing bias. From this workshop emerged a consensus  that the recently-introduced TOP (Transparency and Openness Promotion) guidelines would be a practical way to help eco-evo journals implement transparency standards. Another outcome was TTEE (Tools for Transparency in Ecology and Evolution), which were designed to help journals in ecology and evolutionary biology implement TOP guidelines. A number of journals published editorials stating their commitment to TOP. Many of these journals have now also updated their editorial policies and instructions to authors to match their stated commitments to transparency. A few pioneering journals, such as Conservation Biology, have instituted more dramatic changes to ensure, to the extent possible, that authors are fully transparent regarding their reporting. A handful of other papers have also been published, reviewing evidence of bias or making recommendations for individual or institutional action.

Despite this long list of steps towards transparency, it seems to me that the groundswell seen in psychology has not yet transpired in ecology and evolution. For instance, only one ecology or evolution journal (BMC Ecology) has yet adopted registered reports (the most rigorous way to reduce bias on the part of both authors and journals), and there has been only one attempt to pursue a major multi-study replication effort, which has not yet gained major funding.

FF: At this point I feel the need to add that what Tim wrote above does describe an incredible amount of action in a short time in the disciplines of ecology and evolution. It might be harder to see the change when you’ve made it yourself 🙂

TP: I agree that there have been important changes, but it seems to me that many ecologists and evolutionary biologists remain unconvinced or unaware of the types of problems that led Shinichi and me to try to kick-start this movement in the first place. A few months ago the Dynamic Ecology blog conducted an informal survey asking “What kind of scientific crisis is the field of ecology having?” Only about a quarter of those voting were convinced that ecology was having a crisis, and only about 40% of respondents thought a reproducibility crisis was the sort of crisis ecology was having or was most likely to have in the future. So, ecologists (at least those who fill out surveys on the Dynamic Ecology blog) aren’t convinced there is a crisis, and even if there is a crisis, they’re not convinced that it’s in the form of the ‘reproducibility crisis’ discussed so much recently in psychology, medicine, economics, and some other disciplines. Of course not everyone in psychology thinks there’s a crisis either, but my sense is that the notion of a crisis is much more widely accepted there.

So why aren’t ecologists and evolutionary biologists more concerned? We’ve got the risk factors for a reproducibility crisis in abundance. What’s different about perceptions in ecology and evolutionary biology? I don’t claim to know, but I entertain several hypotheses below.

It seems highly plausible to me that many in ecology and evolution have simply not seen or appreciated the evidence needed to convince them that there is a problem. In psychology, one of the catalysts of the ‘crisis’ was the publication of an article in a respected journal claiming to have evidence that people could see into the future. The unintended outcome of this article, the conclusions of which were largely rejected by the field, was that many researchers in psychology realized that false results could emerge from standard research practices, and this was unsettling to many. In ecology and evolution, we haven’t experienced this sort of wake-up call.

DM: I think that was a huge wake-up call, that something so unlikely could be presented with the same standard techniques that every study used. In eco/evo, the inherent plausibility (dare I say, our priors) may be more difficult to judge, so a wild claim presented with flimsy evidence is not as easily spotted as being so wild.

However, I think a major underlying cause is the lack of value given to direct replication studies. Direct replications are the sad workhorse of science: they’re the best way to judge the credibility of a finding but virtually no credit is given for conducting them (and good luck trying to get one funded!). I think that a subset of psychological research was fairly easy to replicate using inexpensive study designs (e.g. undergraduate or online research participants), and so some wild findings were somewhat easy to check with new data collection.

In ecology, there are certainly some datasets that can be fairly easily re-collected, but maybe not as many. Furthermore, I sense that ecologists have an easier time attributing a “failure to replicate” to either 1) as of yet unknown moderating variables or 2) simple environmental change (in field studies). So the skepticism may be less sharp on published claims.

FF: At the moment, my research group is analysing data from a survey we did of over 400 ecology and evolution researchers, asking what they think about the role of replication in science. So far our results suggest that the vast majority of researchers think replication is very important. We’ve been a bit surprised by the results. We were expecting many more researchers to be dismissive of direct replication in particular, or to argue that it wasn’t possible or applicable in ecology. But in our survey sample, that wasn’t a mainstream view. Of course, it’s hard to reconcile this with the virtual non-existence of direct replication in the literature. We can really only explain the discrepancy by appealing to institutional (e.g., editorial and grant policies) and cultural norms (e.g., what we believe gets us promoted). In ecology, neither has been broken to the extent that they have in psychology, despite individual researchers having sound intuitions about the importance of replication.

TP: Another possibility to explain why so many ecologists and evolutionary biologists remain unconvinced that there is a replication crisis is that bias may actually be less widespread in ecology and evolutionary biology than in psychology. Let me be clear. The evidence that bias is a serious problem in ecology and evolutionary biology is compelling. However, this bias may be less intense on average than in psychology, and it may be that bias varies more among sub-disciplines within eco-evo, so there may be some ecologists and evolutionary biologists who can, with good reason, be confident in the conclusions drawn in their subdiscipline.

FF: Hmm, I think it’s more likely that psychologists are simply more accepting that bias is a real thing that’s everywhere, because they are psychologists and many study bias as their day job.

TP: OK, I buy that psychologists may be more open to the existence of bias because it’s one of the things psychologists study. However, I’d like to at least consider some possibilities of differences in bias and some other differences in perception of bias.

For instance, maybe in subdisciplines where researchers begin with strong a priori hypotheses, they are more likely to use their ‘researcher degrees of freedom’ to explore their data until they find patterns consistent with their hypothesis. This is a seriously ironic possibility, but one I’ve warmed to. The relevant flip side to this is that many researchers in ecology and evolution (though I think more often in ecology) often conduct exploratory studies where they have no reason to expect or hope for one result over another, and readily acknowledge the absence of strong a priori hypotheses. This could lead to less bias in reporting, therefore greater reliability of literature, and more of a sense that the literature is reliable. I should point out, though, that bias can still emerge in the absence of a priori hypotheses if researchers are not transparent about the full set of analyses they conduct, and I know this happens at least some of the time.

FF:  So there are two claims. First, that if you have strong a priori hypotheses you might be more likely to use researcher degrees of freedom. This certainly seems plausible. You really want your hypotheses to be true, so you’re more inclined to make it so. Second, researchers in ecology and evolution are less likely to have strong a priori hypotheses than researchers in psychology. The latter is a disciplinary difference I just don’t see, but it’s an empirical question. It’s a great sociology of science question.

TP: Well, I like empirical questions, and I’d certainly like to know the answer to that one.

Moving on to throw out yet another hypothesis, it is my relatively uninformed perception that there is probably much more heterogeneity in methods across ecology and evolutionary biology than across psychology. If some methods present fewer ‘researcher degrees of freedom’, then bias may be less likely in some cases.

FF: This reminds me of older attempts to demonstrate grand differences between the disciplines. For example, there’s a common perception that the difference between hard and soft sciences is that physics etc. are more cumulative than psychology and behavioural sciences. But attempts to pin this down, like this one from Larry Hedges, show there are more similarities than differences. I’m generally pretty skeptical about attributing differences in research practice to inherent properties of what we study. They usually turn out to be explained by more mundane institutional and social factors.

TP: Well, this next idea is subject to the same critique, but I’ll present it anyway. Statistical methods may be much more heterogeneous across sub-disciplines, and even across studies within subdisciplines of ecology and evolution. This could mean that some researchers are conducting analyses in ways that are actually less susceptible to bias. It could also mean that researchers fail to recognize the risks of bias in whatever method they are using because they focus on the differences between their method and other more widespread methods. In other words, many ecologists and evolutionary biologists may believe that they are not at risk of bias, even if they are.

FF: If you look at very particular sub-fields you may well find differences, but my bet is these can be explained by the cultural norms of a small group of individuals (e.g., the practices in particular labs that have a shared academic lineage).

TP: There certainly are some sub-disciplines where a given stats practice has become the norm, such as demographers studying patterns of survival by comparing and averaging candidate models using AIC and the ‘information theoretic’ approach. I’m not prepared to say how common this sort of sub-field standardization is, however.

Again, on to another hypothesis. Some ecologists and evolutionary biologists test hypotheses that are likely to be true, and some test hypotheses that are unlikely to be true. It is not widely recognized, but it is easily shown that testing unlikely hypotheses leads to a much higher proportion of observed relationships being due to chance (when real signal is rare, most patterns are just due to noise). It may be that unlikely hypotheses are more common in psychology, and thus their false positive rate is higher on average than what we experience in ecology and evolutionary biology. I strongly suspect that the likelihood of hypotheses varies a good bit across ecology and evolutionary biology, but certainly if you’re in a subdiscipline that mostly tests likely hypotheses, it would be reasonable to have more confidence in that published literature.

FF: I don’t really know what to say about this. It could be that better researchers test more hypotheses that are likely. Or maybe not. Maybe crummy researchers do, because they just go for low-hanging fruit. I concede that the a priori likelihood of a hypothesis being true would definitely be correlated with something, but not that it would be a property of a discipline.

TP: Well, I’m not quite done with my ‘property of a discipline’ hypotheses, so here’s another. In some subfields of psychology, conducting a publishable study requires substantially less work than in many subfields of ecology and evolutionary biology. For instance, as David mentioned earlier, papers in psychology are sometimes based on answers to a few hundred surveys administered to undergraduate students (a resource that’s not in short supply in a university). If studies are easy to come by, then opting not to publish (leaving a result in the proverbial file drawer) is much cheaper. In eco/evo, gathering a comparable amount of data might take years and lots of money, so it’s not so easy to just abandon an ‘uninteresting’ result and go out and gather new data instead.

FF: It’s not clear to me how big the file drawer problem is in any discipline. To be clear, I’m not saying publication bias isn’t a problem. We know it is. But are whole studies really left in file drawers, or are they cherry-picked and p-hacked back into the literature? There is a little less publication bias in ecology (~74% of papers publish ‘positive’ results compared to psychology’s ~92%) but there is probably also slightly lower statistical power. Tim’s explanation is not implausible, but I doubt we currently have enough evidence to say either way.

TP: As David mentioned briefly above, in ecology and evolutionary biology, dramatic differences among study systems (different species, different ecosystems, even stochastic or directional change over time in the ‘same’ system) make it easy to believe that differences in results among studies are due to meaningful biological differences among these studies. It seems that we do not take the inevitability of sampling error seriously, and thus rarely seriously consider the fact that many reported findings will be wrong (even WITHOUT the bias that we know is there and that should be elevating the rate of incorrect findings).

DM: This is related to the fact that in ecology and evolutionary biology, there’s no culture of direct replication. If most studies are conducted just once, there’s no reliable way to assess their credibility. If a study is replicated, it’s usually couched as a conceptual replication with known differences in the study. That new twist is the intellectual progeny of the author. If the results aren’t the same as the original, chalk it up to whatever those differences were. However, direct replications, where the expectation is for similar results, are the best way to assess credibility empirically.

This lack of direct replication has led to plausible deniability that there is any problem. And since there is no perceived problem, there is no need to empirically look for a problem (only a real troublemaker would do that!).

TP: We are clearly in agreement here, David. Now we just need to figure out how to establish some better institutional incentives for replication.

While we’re planning that, I’ll throw out my last hypothesis, which if right, would mean that all my other hypotheses were largely unnecessary. Psychology is a much larger discipline than ecology and evolutionary biology. Because of this, it may be that the number of people actively working to promote transparency in psychology is larger overall, but is a similar proportion to the number working in ecology and evolutionary biology.

FF: This seems very likely to me, and also something we should calculate sometime.

What I found in my PhD research on attempts to reform statistical practices through the 1970s-2000s (i.e., to get rid of Null Hypothesis Significance Testing) was that medicine banned it (and it snuck back in), psychology showed some progress, and ecology was behind at that time. But almost all disciplinary differences turn out to be institutional and social/cultural, rather than an inherent property of studying that particular science.

This scientific reform about reproducibility differs from the NHST one because the main players are much more aware of best behaviour-change practices. The NHST reform was led by cranky old men (almost exclusively!) writing cranky articles that often insulted researchers’ intelligence and motives. This new reform has by and large been led by people who know how to motivate change. (There are some early exceptions here.) Psychologists should be ahead of this game, given their core business.

DM: I think psychologists are certainly aware of bias, but ecologists are too. I suspect that a missing element is one of those outstanding claims that deserves to be checked. Results that seem “too good to be true” probably are, and identifying those will likely be the first step to assessing credibility of a field’s body of work through direct replication.

TP: Thanks to Fiona and David for engaging in this discussion. Here are some brief take-homes:

  1. It may be that psychologists are NOT considerably more concerned about the replication crisis than are ecologists and evolutionary biologists. Instead, it may be that the much larger number of psychology researchers means there are more concerned psychologists in absolute numbers only, but similar numbers proportionally.

  2. To the extent that psychologists may have greater levels of concern about reproducibility, much of this may be attributable to a single major event in psychology in which a result widely believed to be false was derived through common research practices and published in a respectable journal. It may also be that psychologists tend to be more comfortable with the idea that they have biases that could influence their research.

  3. Ecologists may recognize the value of replication, but their use of replication to assess the validity of earlier conclusions is too rare to have led them to see low rates of replicability.

  4. Some of the other ideas we discussed above may be worth empirical exploration, but we should be aware that hypotheses rooted in fundamental differences between disciplines have often not been strongly supported in the past.

guest post: Reproducibility Project: Ecology and Evolutionary Biology

Written by: Hannah Fraser

The problem

As you probably already know, researchers in some fields are finding that it’s often not possible to reproduce others’ findings. Fields like psychology and cancer biology have undertaken large-scale coordinated projects aimed at determining how reproducible their research is. There has been no such attempt in ecology and evolutionary biology.

A starting point

Earlier this year Bruna, Chazdon, Errington and Nosek wrote an article citing the need to start this process by reproducing foundational studies. This echoes early research undertaken in psychology and cancer biology reproducibility projects attempting to reproduce the fields’ most influential findings. Bruna et al’s focus was on tropical biology but I say why not the whole of ecology and evolutionary biology!

There are many obstacles to this process, most notably obtaining funding and buy-in from researchers, but it is hard to obtain either of these things without a clear plan of attack. First off, we need to decide on which ‘influential’ findings we will try to replicate and how we are going to replicate them.

Deciding on what qualifies as an influential finding is tricky and can be controversial. In good news, this year an article came out that has the potential to (either directly or indirectly) answer this question for us. Courchamp and Bradshaw (2017)’s “100 articles every ecologist should read” provides a neat list of candidate influential articles/findings. There are some issues with biases in the list which may make it unsuitable for our purposes but at least one list is currently being compiled with the express purpose of redressing these biases. Once this is released it should be easy to use some combination of the two lists to identify – and try to replicate – influential findings.

What is unique about ecology and evolutionary biology?

In psychology and cancer biology, where reproducibility has been scrutinised, research is primarily conducted indoors and is based on experiments. Work in ecology and evolutionary biology is different in two ways: 1) it is often conducted outdoors, and 2) a substantial portion is observational.

Ecology and evolutionary biology are outdoor activities

Conducting research outdoors means that results are influenced by environmental conditions. Environmental conditions fluctuate through time, influencing the likelihood of reproducing a finding in different years. Further, climate change is causing directional changes in environmental conditions, which may mean that a finding from 20 years ago might not be reproducible this year. I’ve talked to a lot of ecologists about this troublesome variation and have been really interested to find two competing interpretations:

1) trying to reproduce findings is futile because you would never know whether any differences reflected the reliability of the original result or arose purely from changes in environmental conditions

2) trying to reproduce findings is vital because there is so much environmental variation that findings might not generalise beyond the exact instance in space and time in which the data were collected – and if this is true the findings are not very useful.

Ecology and evolutionary biology use observation

Although some studies in ecology and evolutionary biology involve experimentation, many are based on observation. This adds even more variation and can limit and bias how sites/species are sampled. For example, in a study on the impacts of fire, ‘burnt’ sites are likely to be clustered together in space and share similar characteristics that made them more susceptible to burning than the ‘unburnt’ sites, biasing the sample of sites. Also, the intensity of the fire may have differed even within a single fire, introducing uncontrolled variation. In some ways, the reliance on observational data is one of the greatest limitations in ecology and evolutionary biology. However, I think it is actually a huge asset because it could make it more feasible to attempt reproducing findings.

Previous reproducibility projects in experimental fields have either focussed on a) collecting and analysing the data exactly according to the methods of the original study, or b) using the data collected for the original analysis and re-running the original analysis. While ‘b’ is quite possible in ecology and evolutionary biology, this kind of test can only tell you whether the analyses are reproducible… not the pattern itself. Collecting the new data required for ‘a’ is expensive and labour intensive. Given limited funding and publishing opportunities for these ‘less novel’ studies, it seems unlikely that many researchers will be willing or able to collect new data to test whether a finding can be reproduced. In an experimental context, examining reproducibility is tied to these two options. However, in observational studies there is no need to reproduce an intervention, so only the measurements and the context of the study need to be replicated. Therefore, it should be possible to use data collected for other studies to evaluate how reproducible a particular finding is.

Even better, many measurements are standard and have already been collected in similar contexts by different researchers. For example, when writing the literature review for my PhD, I collated 7 Australian studies that looked at the relationship between the number of woodland birds and tree cover, collected bird data using 2 ha, 20-minute bird counts, and recorded the size of the patches of vegetation. It should be possible to use the data from any one of these studies to test whether the findings of another study are reproducible.

Matching the context of the study is a bit more tricky. Different inferences can be made from attempts to reproduce findings in studies with closely matching contexts than from those conducted in distinctly different contexts. For example, you might interpret failure to reproduce a finding differently if it was in a very similar context (e.g. same species in the same geographic and climatic region) than if the context was more different (e.g. sister species in a different country with the same climatic conditions). In order to test the reliability of a finding, you should match the context closely. In order to test the generalisability of a finding, you should match the context less closely. However, determining what matches a study’s context is difficult. Do you try to match the conditions where the data were collected or the conditions that the article specifies it should generalise to? My feeling is that trying to replicate the latter is more relevant but potentially problematic.

In a perfect world, all articles would provide a considered statement about which conditions they would expect their results to generalise to (Simons et al 2017). Unfortunately, many articles overgeneralise to increase their probability of publication which may mean that findings appear less reproducible than they would have if they’d been more realistic about their generalisability.

Where to from here?

This brings me to my grand plan!

I intend to wait a few months to allow the competing list (or possibly lists) of influential ecological articles to be completed and published.

I’ll augment these lists with information on the studies’ data requirements and (where possible) statements from the articles about the generalisability of their findings. I’ll share this list with you all via a blog (and a page that I will eventually create on the Open Science Framework).

Once that’s done I will call for people to check through their datasets to see whether they have any data that could be used to test whether the findings of these articles can be reproduced. I’m hoping that we can all work together to arrange reproducing these findings (regardless of whether you have data and/or the time and inclination to re-analyse things).

My dream is to have the reproducibility of each finding/article tested across a range of datasets so that we can 1) calculate the overall reproducibility of these influential findings, 2) combine them using meta-analytic techniques to understand the overall effect, and 3) try to understand why they may or may not have been reproduced when using different datasets. Anyway, I’m very excited about this! Watch this space for further updates and feel free to contact me directly if you have suggestions or would like to be involved. My email is hannahsfraser@gmail.com.

Why ‘MORE’ published research findings are false

In a classic article titled “Why most published research findings are false”, John Ioannidis explains 5 main reasons for just that. These reasons are largely related to large ‘false positive reporting probabilities’ (FPRP) in most studies, and ‘researcher degrees of freedom’, facilitating practices such as ‘p-hacking’. If you aren’t familiar with these terms (FPRP, researcher degrees of freedom, and p-hacking), please read Tim Parker and his colleagues’ paper.

I would like to add one more important reason why research findings are often wrong (thus, the title of this blog). Many researchers simply get their stats wrong. This point has been talked about less in the current discussion of the ‘reproducibility crisis’. There are many ways to get stats wrong, but I will discuss a few examples here.

In our lab’s recent article, we explore one way that biologists, especially when statistically accounting for body size, can produce unreliable results. Problems arise when a researcher divides a non-size trait measurement by size (e.g., food intake by weight), and uses this derived variable in a statistical model (a worryingly common practice!). Traits are usually allometrically related to each other, meaning, for example, food intake will not increase linearly with weight. In fact, food intake increases more slowly than weight. The consequence of using the derived variable is that we may find statistically significant results where no actual effect exists (see Fig 1). An easy solution for this issue is to log-transform and fit the trait of interest as a response with size as a predictor (i.e., allometrically related traits are log-linear to each other).
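To make the problem concrete, here is a minimal simulation sketch (this is not the analysis from our article; the allometric exponent, sample sizes and noise levels are invented for illustration). Two groups follow exactly the same allometric relationship between food intake and weight, yet a t-test on the ratio intake/weight tends to ‘detect’ a group difference, while a log-log model with weight as a predictor does not.

```python
# A minimal simulation sketch with invented parameter values: both groups share
# the SAME allometric rule, intake ~ weight^0.75, so there is no group effect on
# intake beyond body size.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
n = 100
weight = np.concatenate([
    rng.lognormal(mean=np.log(20), sigma=0.2, size=n),  # group A: lighter animals
    rng.lognormal(mean=np.log(30), sigma=0.2, size=n),  # group B: heavier animals
])
group = np.repeat(["A", "B"], n)
intake = 2.0 * weight**0.75 * rng.lognormal(0.0, 0.1, size=2 * n)

df = pd.DataFrame({"group": group, "weight": weight, "intake": intake})

# 1) Problematic approach: compare the derived ratio intake / weight.
ratio = df["intake"] / df["weight"]
t, p_ratio = stats.ttest_ind(ratio[df["group"] == "A"], ratio[df["group"] == "B"])
print(f"ratio approach, p = {p_ratio:.3g}")  # typically 'significant' despite no effect

# 2) Safer approach: log-log model with size as a predictor.
df["log_intake"], df["log_weight"] = np.log(df["intake"]), np.log(df["weight"])
fit = smf.ols("log_intake ~ log_weight + group", data=df).fit()
print(f"log-log model, p(group) = {fit.pvalues['group[T.B]']:.3g}")  # typically non-significant
```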

But surprisingly, even this solution can lead to wrong conclusions. We discussed a situation where an experimental treatment affects both a trait of interest (food intake) and size (weight). In such a situation, size is known as an intermediate outcome, and fitting size as a predictor could result in wrongly estimating an experimental effect. I have made similar mistakes because it’s difficult to know when and how to control for size. It depends on both your question and the nature of the relationships between the trait of interest and size. For example, if the experiment affected both body size and appetite, and body size also influences appetite, then you do not want to control for body size. This is because the effect of body size on appetite is due to the experimental effect (complicated!).
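Here is an equally minimal sketch of that intermediate-outcome trap, again with made-up numbers rather than anything from our article: the treatment increases weight, weight drives intake, and there is no direct effect of treatment on intake. The unadjusted model recovers the total treatment effect (about 2.5 units), while the weight-adjusted model reports an ‘effect’ near zero and would suggest the treatment did nothing.

```python
# A toy sketch of the intermediate-outcome problem (made-up numbers): the
# treatment raises body weight by 5 units, and intake depends on weight only,
# so the total treatment effect on intake is about 0.5 * 5 = 2.5.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
treatment = rng.integers(0, 2, size=n)                  # 0 = control, 1 = treated
weight = 20 + 5 * treatment + rng.normal(0, 2, size=n)  # treatment increases weight
intake = 1 + 0.5 * weight + rng.normal(0, 1, size=n)    # intake driven by weight

df = pd.DataFrame({"treatment": treatment, "weight": weight, "intake": intake})

total = smf.ols("intake ~ treatment", data=df).fit()              # total effect, ~2.5
adjusted = smf.ols("intake ~ treatment + weight", data=df).fit()  # 'direct' effect, ~0
print("unadjusted estimate of treatment effect:", round(total.params["treatment"], 2))
print("weight-adjusted estimate:               ", round(adjusted.params["treatment"], 2))
```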

Although I said that ‘getting stats wrong’ is less talked about, there are exceptions. For instance, the pitfalls of pseudoreplication (statistical non-independence) have been known for many years, but researchers continue to overlook this problem. Recently my good friend Wolfgang Forstmeier and his colleagues devoted part of a paper on avoiding false positives (Type I error) to explaining the importance of accounting for pseudo-replication in statistical models. If you work in a probabilistic discipline, this article is a must-read! As you will find out, not all pseudoreplication is obvious. Modelling pseudoreplication properly can reduce Type I error dramatically (BTW, we recently wrote about statistical non-independence and the importance of sensitivity analysis).
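And here is a small simulation sketch of pseudoreplication, with hypothetical numbers: the treatment is applied to whole tanks, fish within a tank share a tank effect, and there is no true treatment effect. Treating each fish as an independent replicate inflates the Type I error rate far above the nominal 5%; analysing tank means (one simple remedy; a mixed model with a tank random effect is another) brings it back down.

```python
# A small pseudoreplication simulation (hypothetical numbers): tank-level
# treatment, shared tank effects, and NO true treatment effect, so every
# 'significant' result counted below is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_tanks, fish_per_tank, n_sims = 10, 20, 2000
false_pos_naive = false_pos_means = 0

for _ in range(n_sims):
    tank_effect = rng.normal(0, 1, size=n_tanks)   # shared tank-level noise
    treatment = np.repeat([0, 1], n_tanks // 2)    # treatment applied per tank
    y = (np.repeat(tank_effect, fish_per_tank)
         + rng.normal(0, 1, size=n_tanks * fish_per_tank))
    t_fish = np.repeat(treatment, fish_per_tank)

    # (a) Naive: every fish treated as an independent replicate.
    p_naive = stats.ttest_ind(y[t_fish == 0], y[t_fish == 1]).pvalue
    # (b) Tank means: the tank is the true unit of replication.
    tank_means = y.reshape(n_tanks, fish_per_tank).mean(axis=1)
    p_means = stats.ttest_ind(tank_means[treatment == 0],
                              tank_means[treatment == 1]).pvalue

    false_pos_naive += p_naive < 0.05
    false_pos_means += p_means < 0.05

print(f"Type I error rate, fish as replicates: {false_pos_naive / n_sims:.2f}")  # far above 0.05
print(f"Type I error rate, tank means:         {false_pos_means / n_sims:.2f}")  # close to 0.05
```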

What can we do to prevent these stats mistakes? Doing stats right is more difficult than I used to think. When I was a PhD student, I thought I was good at statistical modelling, but I made many statistical mistakes, including the ones mentioned above. Statistical modelling is difficult because we need to understand both statistics and the biological properties of the system under investigation. A statistician can help with the former but not the latter. If statisticians can’t recognize all our potential mistakes, I think this means that we as biologists should become better statisticians.

Luckily, we have some amazing resources. I would recommend all graduate students read Gelman and Hill’s book. Also, you will learn a lot from Andrew Gelman’s blog, where he often talks about common statistical mistakes and scientific reproducibility. Although no match to Gelman and Hill, I am doing my part to educate biology students about stats by writing a new kind of stats book, which is based on conversations, i.e. a play!

I would like to thank Tim Parker for detailed comments on an earlier version of this blog.

Replication: step 1 in PhD research

Here are a few statements that won’t surprise anyone who knows me. I think replication has the potential to be really useful. I think we don’t do nearly enough of it and I think our understanding of the world suffers from this rarity. In this post I try to make the case for the utility of replication based on an anecdote from my own scientific past.

A couple of years ago Shinichi Nakagawa and I wrote a short opinion piece about replication in ecology and evolutionary biology. We talked about why we think replication is important and how we can interpret results from different sorts of replications, and we also discussed a few ideas for how replication might become more common. One of those ideas was for supervisors to expect graduate students to replicate part of the previously published work that inspired their project. When Shinichi and I were writing that piece, I didn’t take the time to investigate the extent to which this already happens, or even to think of examples of it happening.

Then out of the blue the other day, it occurred to me that I’d seen this happen up-close with one of my own findings. First some background. Bear with me and I’ll try to be brief. When I was a naïve master’s student (with a hands-off adviser who had at least one foot in retirement), I decided to test Tom Martin’s ideas about nest predators shaping bird species co-existence, but in a new study system: the shrub nesting bird community at Konza Prairie in Kansas (by the way, this anecdote is NOT about my choice to do a conceptual replication for my MSc work). Anyway, I was gathering all the data myself, trying to find as many nests as I could from multiple species, monitoring those nests to determine predation outcomes, and measuring vegetation around each nest. I bit off more than I could chew, but I wanted to be done in one field season. I was in a hurry for some reason – not a recipe for sufficient statistical power. Instead, it was a recipe for an ambiguous test of the hypothesis since I didn’t find many nests for most bird species. I did, however, find a decent number of nests of one species: Bell’s vireo. Among the more than 60 vireo nests I found, I noticed something striking – brood parasitic cowbirds laid eggs in many of them, and if a cowbird egg hatched in a vireo nest, all vireo chicks were outcompeted and died. What was really interesting is that vireos abandoned many parasitized nests before cowbird eggs hatched and these vireos appeared to re-nest up to seven times in a season. I first thought this was evidence of an adaptation in Bell’s vireos to avoid parasitism by cowbirds via re-nesting (that’s another story), but I ended up publishing a paper that pointed out that the number of vireo eggs in the nest (rather than the number of cowbird eggs) was the best predictor of vireo nest abandonment. Thus it seemed like a response to egg loss (cowbirds remove host eggs) by Bell’s vireos might explain their nest abandonment and therefore how they could persist despite high brood parasitism. Now on to the heart of the story.

Several years later, after doing a PhD elsewhere, I found myself back in Kansas. A new K-State PhD student (Karl Kosciuch – who was one of Brett Sandercock’s first students) arrived and was excited about the Bell’s vireo–cowbird results I had reported. Looking back on it, this is a textbook case of how exploratory work and replication should go together. I found a result I wasn’t looking for. Someone else came along and thought it was interesting and wanted to build on it but decided to replicate it first. Karl did several things for his PhD, but one of them was simply to replicate my observational data set with an even bigger sample. He found the same pattern, thus dramatically strengthening our understanding of this system, and strongly justifying follow-up experiments. I actually joined Karl for one of these experiments, and it was very satisfying behavioral ecology. It turned out that it really is the loss of their own eggs that induces Bell’s vireos to abandon and that cowbird eggs do not induce nest abandonment on their own.

This study had a happy ending for all involved, but what if Karl’s replication of my correlative study had failed to support my result? Well, for one it hopefully would have saved Karl the trouble of pursuing an experiment based on a pattern that wasn’t robust. Such an experiment would presumably have failed to produce a compelling result, and then would have left Karl wondering why. Were the experimental manipulations flawed? Was his sample size too small? Was there some unknown environmental moderator variable? Further, although the population of Bell’s vireo we studied is not endangered, the sub-species in Southern California is, and one of the primary threats to that endangered population has been cowbird parasitism. My result had been discussed as evidence that Bell’s vireo populations might be able to evolve nest abandonment as an adaptive response to cowbird parasitism. If no replication had been conducted and only an unconvincing experiment had been produced, this flawed hypothesis might have persisted with harmful outcomes to management practices of Bell’s vireo in California.

I think there’s a clear take-home message here. Students benefit from replicating previously published studies that serve as the basis for their thesis research. Of course, it’s not just students who can benefit here – anyone who replicates foundational work reduces the risk of building on an unreliable foundation. And, what’s more, we all benefit when we can better distinguish reliable, repeatable results from those that are not.

I’m curious to hear about other replications of previously published results that were conducted as part of the process of building on those results.

Is overstatement of generality an Open Science issue?

I teach an undergraduate class in ecology, and every week or two I have the students read a paper from the primary literature. I want them to learn to extract important information and to critically evaluate that information. This involves distinguishing evidence from inference and identifying the assumptions that link the two. I’m just scratching the surface of this process here, but the detail I want to emphasize in this post is that I ask the students to describe the scope of the inference. What was the sampled population? What conclusions are reasonable based on this sampling design? This may seem straightforward, but students find it difficult, at least in part because the authors of the papers rarely come right out and acknowledge limitations on the scope of their inference. Authors expend considerable ink arguing that their findings have broad implications, but in so doing they often cross the line between inference and hypothesis with nary a word.

This doesn’t just make life difficult for undergraduates. If we’re honest with ourselves, we should admit that it’s sloppy writing and, by extension, sloppy science. That said, I’m certainly guilty of this sloppiness, and part of the reason is that I face incentives to promote the relevance of my work. We’re in the business of selling our papers (for impact factors, for grant money, etc.). Is this sloppiness a trivial by-product of the business of selling papers, or a real problem? I think it leans towards the latter. Having to train students to filter out the hype is a bad sign. And, more to the point of this post, it turns out that our failure to constrain inferences may hinder the interpretation of evidence that accumulates across studies.

For years, my efforts to encourage recognition of constraints on inference were limited to my interactions with students in my class. That changed recently when I heard about a movement to promote the inclusion of ‘Constraints on Generality’ (COG) statements in research papers. My colleagues Fiona Fidler and Hannah Fraser made the jaunt from Melbourne over to the US to attend ESA in August (to join me in promoting and exploring replication in ecology), but they first flew to Virginia for the 2nd annual SIPS (Society for the Improvement of Psychological Science) conference, where they heard about COG statements (there’s now a published paper on the topic by Daniel Simons, Yuichi Shoda, and Stephen Lindsay). In psychology there’s a lot of reflection and deliberation about reducing bias and improving empirical progress, and the SIPS conference is a great place to feel that energy and to learn about new ideas. The idea for a paper on COG statements apparently emerged from the first SIPS meeting, and the COG statement pre-print got a lot of attention at the second meeting this year.

It’s easy to see the appeal of a COG statement from the standpoint of clarity, but there’s more to it than clarity. One of the justifications for COG statements comes from a desire to more readily interpret replication studies. A perennial problem with replications is that if the new study appears to contradict the earlier study, the authors of the earlier study can point to the differences between the two studies and argue that the second study was not a valid test of the conclusions of the original. This objection may seem fair: whenever conditions differ between two studies (and conditions ALWAYS differ to some extent), we can’t rule out the possibility that the discrepancy in results stems from those differences in conditions. However, we’re typically interested in a result only if it generalizes beyond the narrow set of conditions found in a single study. In a COG statement, the authors state the set of conditions under which they expect their finding to apply. The COG statement then sets a target for replication. With this target set, we can ask: What replications are needed to assess the validity of the inference within the stated COG? What work would be needed to expand the boundaries of the stated COG? As evidence accumulates, we can then start to restrict or expand the originally stated generality.

When writing a COG statement, authors will face conflicting incentives. They will still want to sell the generality of their work, but if they overstate that generality, they increase the chance of being contradicted by later replication. That said, it’s important to note that a COG statement doesn’t simply reflect the whims of the authors. Authors need to justify their COG with explicit reference to their sampling design and to existing theoretical and experimental understanding. A COG statement should be plausible to experts in the field.

I started this post by discussing the scope of inference that’s reasonable from a given study. Although this is clearly related to constraints on generality, a COG statement could be broader than a statement about the scope of inference; certainly, as presented by Simons et al., COG statements will typically extend the claimed generality beyond the sampled population. I haven’t yet resolved my thinking on this difference, but right now I’m leaning towards the notion that we should include both a scope-of-inference statement and a constraints-on-generality statement in our papers, and that they should be explicitly linked. We could state the scope of our inference as imposed by our study design (locations, study taxa, conditions, etc.), and then argue for a broader COG based on additional lines of evidence. These additional lines of evidence might be effects reported by other studies of the same topic, or qualitatively different forms of evidence, for instance our knowledge of the biological mechanisms involved. Regardless, more explicit acknowledgement of the constraints on our inferences would clearly make our publications more scientific. I’d love to have some conversations on this topic. Please share comments below.

Before signing off, I want to briefly mention practical issues related to the adoption of COG (and/or scope-of-inference) statements. Because scientists face incentives to generalize, a force beyond the good intentions of individual scientists may be required for this practice to spread. That force could be journal requirements. However, many journals also face incentives to promote over-generalization from study results. That said, there are far fewer journals than there are scientists, so it might be within the realm of possibility to convince editors, in the name of scientific quality, to require COG statements. I can think of roles that funders could play here too, but these would be less direct and maybe less effective than journal requirements. I’m curious what other ideas folks have for promoting COG / scope-of-inference statements. Please share your thoughts!