Estimating the reproducibility of psychological science

http://www.sciencemag.org/content/349/6251/aac4716

We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available.

Ninety-seven percent of original studies had significant results (P < .05). Thirty-six percent of replications had significant results.

No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility. Nonetheless, collectively these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings despite using materials provided by the original authors, review in advance for methodological fidelity, and high statistical power to detect the original effect sizes. Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than variation in the characteristics of the teams conducting the research (such as experience and expertise).

44 thoughts on “Estimating the reproducibility of psychological science”

  1. I am not at all surprised at this. There are good reasons why psychology is considered a “soft science.” There is good research done in psychology. But there’s also dubious work.

    There is poor work done in the hard sciences, too, but it is not as common.

    I would guess that it is harder to publish a “failed to replicate” study than to publish the original study. And maybe that’s a problem.

  2. Neil Rickert,

    You think the poor work done in hard science is not as common??

    “The study found that in a sample of 55 large trials testing heart-disease treatments, 57% of those published before 2000 reported positive effects from the treatments. But that figure plunged to just 8% in studies that were conducted after 2000.”

    http://www.nature.com/news/registered-clinical-trials-make-positive-findings-vanish-1.18181

    And you think the bad work done in psychology is even worse than this? Wow. How awful must it be?

    No wonder you guys call yourselves skeptics.

  3. petrushka,

    “However, he says, this means that at least half of older, published clinical trials could be false positives. “Loose scientific methods are leading to a massive false positive bias in the literature,” he writes.”

    Well, the studies were funded by science organizations, so I agree with you, they inherently have a bias.

  4. phoodoo: Well, the studies were funded by science organizations, so I agree with you, they inherently have a bias.

    Science organizations? Name one.

  5. Phoodoo, on what basis do you trust the folks who failed to replicate the results?

  6. phoodoo:
    petrushka,
    What makes you think I trust them? I am not a skeptic!

    I understood your post to imply that one set of studies refuted the findings of another set. Sorry if I misunderstood.

    But replicating studies and getting negative results is good science. Replication is the way it’s supposed to go, particularly in medicine and mental health.

  7. There are good reasons why psychology is considered a “soft science.”

    You can add to that social “science”. It’s mushy too.

  8. phoodoo: You think the poor work done in hard science is not as common??

    I don’t consider medical science to be a hard science.

    As with the social sciences, the methodology is largely statistical testing. And, too often, the researchers don’t adequately understand the statistical tools that they are using.

  9. Medical science is mostly engineering. Mostly working on products and services to sell.

    Usually drawing on biological science done without regard to immediate financial rewards. There are gray areas.

    My quick and dirty cui bono method is to ask whether research is aiming toward a cure or prevention, or whether it is aimed at a palliative that will generate long term use.

  10. Neil Rickert: Some of it is. But some of it isn’t much different from voodoo.

    I’m a cynic. I suspect most bad clinical studies are driven by the desire to market a product. I know of a specific product that was found not useful in clinical trials. The financial implications were enormous.

    One doesn’t have to be actively dishonest to engage in this. If you are a big company, you have lots of products in the hopper. Some studies will have positive results simply by the laws of statistics. And some will fail, even though they might be effective for some people.

  11. A couple of points here:

    1) Failure to replicate (the Open Science Collaboration paper cited in the OP) does not mean that the original study reported an incorrect conclusion. It does mean that the original study was ‘borderline’ and may be incorrect (a toy illustration of the distinction follows this comment).

    2) The drop in positive findings after mandatory pre-registration at ClinicalTrials.gov (the Kaplan and Irvin study that phoodoo doesn’t understand) is probably a combination of improved cardiovascular care, viz:

    However, we do recognize that the quality of background cardiovascular care continues to improve, making it increasingly difficult to demonstrate the incremental value of new treatments. The improvement in usual cardiovascular care could serve as alternative explanation for the trend toward null results in recent years.

    and the elimination of post hoc cherry-picking, viz:

    In addition, prior to 2000 and the implementation of Clinicaltrials.gov, investigators had the opportunity to change the p level or the directionality of their hypothesis post hoc. Further, they could create composite variables by adding variables together in a way that favored their hypothesis. Preregistration in ClinicalTrials.gov essentially eliminated this possibility.

    Happily, the FDA and the EMA do not tolerate any post hoc cherry-picking. There’s nothing wrong with only 8% of trials reporting positive results (unless your pension plan is heavily invested in biopharma, that is…)

    Take-home: there’s a fair amount of iffy research done in psychology. If this surprises you, you need to get out more. At least it isn’t as bad as social sciences [waves cheerily at Lithuania]…
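
    A toy illustration of point 1), with invented numbers (they come from neither paper): a replication that fails to reach p < .05 usually just produces an estimate whose confidence interval contains both zero and the original effect – weak evidence either way, not a refutation. A minimal sketch in Python:

      from scipy import stats

      def ci95(mean_diff, se, df):
          # Two-sided 95% confidence interval for a mean difference.
          half = stats.t.ppf(0.975, df) * se
          return round(mean_diff - half, 2), round(mean_diff + half, 2)

      # Hypothetical original study: effect 0.45, SE 0.20, n = 60 (p ~ .03).
      print("original    95% CI:", ci95(0.45, 0.20, 58))
      # Hypothetical replication: smaller estimate, similar SE -- not significant.
      print("replication 95% CI:", ci95(0.20, 0.18, 58))
      # The replication interval (about -0.16 to 0.56) contains 0 AND 0.45:
      # it fails to reach significance without contradicting the original.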

  12. Thank God for operational definitions and objective subjects! Where would empirical studies be without them?

  13. It’s relevant to origin science. Human incompetence is in all these things all the time.
    Evolutionary biology is a case in point of dumb ideas prevailing without scientific foundations to hold them up. It’s the incompetence of those who get paid to investigate bio origins.
    In actual tests it just comes out when someone retests.
    There are no tests in evo bio to retest.

  14. Neil Rickert:
    I am not at all surprised at this. There are good reasons why psychology is considered a “soft science.” There is good research done in psychology. But there’s also dubious work.

    There is poor work done in the hard sciences, too, but it is not as common.

    I would guess that it is harder to publish a “failed to replicate” study than to publish the original study. And maybe that’s a problem.

    I bridle, I have to say, at the term “soft science”. It’s only “soft” in the sense that people are way more complicated than most other phenomena in nature, and the way they behave is more complicated still. So there’s always a huge amount of unexplained variance.

    That doesn’t excuse poor science. It is certainly a reason to distrust null hypothesis testing in psychology.

    stcordova: You can add to that social “science”. It’s mushy too.

    Only in the sense that meteorology is mushy. Huge number of relevant variables and we can never measure more than a few of them, plus they interact so it’s deeply non-linear.

  16. One problem is of course publication bias. You’d expect a very high percentage of papers to report “statistically significant” results because it’s hard to get non-significant results published – not merely because of bias against non-significant results because they are not exciting, but because if you want to establish that two things are NOT related, say, you need a far larger and more expensive study than if you want to establish that they ARE. This is a fundamental problem with the concept of null hypothesis testing, and why it was so easy for Wakefield to publish a paper suggesting a “significant” association between MMR and autism, but it took a vast population study of a few million (IIRC) to establish that there is no relationship of any size worth worrying about.

    A second point is that even if an effect is real, and detected in one study as a “significant” finding with p < .05, you need a LARGER sample to have even an 80% chance of replication. And many replications are not sized to do this. Plus, almost all first studies overstate the effect size (because they reach “significance” at least partly due to noise-assisted effects). So even if you power your replication to have a 95% chance of detecting the first effect, you may miss it, because the real effect size is smaller than the one the first study reported. (The simulation after this comment puts some toy numbers on this.)

    Which is why we need to do things like funnel-plots and explore “file-drawer” studies – and insist on pre-trial registration for intervention studies.
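
    To put some simulated numbers on the point above about overstated effect sizes and replication power (a sketch only – the true effect, sample sizes and the crude “publish if p < .05” filter below are all invented for illustration): original studies that clear the significance bar with small samples overstate the effect, so a replication powered at 95% for the published effect size can have much lower power against the real one.

      import numpy as np
      from scipy import stats
      from statsmodels.stats.power import TTestIndPower

      rng = np.random.default_rng(0)
      true_d, n_orig = 0.3, 30                    # small true effect, modest original samples

      # Simulate "original" two-group studies; keep only the significant ones
      # (a crude model of the publication filter).
      published = []
      for _ in range(20000):
          a = rng.normal(true_d, 1, n_orig)       # treatment group
          b = rng.normal(0.0, 1, n_orig)          # control group
          t, p = stats.ttest_ind(a, b)
          if p < 0.05 and t > 0:
              published.append(a.mean() - b.mean())   # ~ Cohen's d, since SDs are 1

      d_pub = float(np.mean(published))
      print(f"true d = {true_d}, mean published d = {d_pub:.2f}")   # noticeably inflated

      # Size a replication for 95% power at the PUBLISHED effect, then ask what
      # power that sample really has against the TRUE effect.
      calc = TTestIndPower()
      n_rep = calc.solve_power(effect_size=d_pub, power=0.95, alpha=0.05)
      real_power = calc.solve_power(effect_size=true_d, nobs1=n_rep, alpha=0.05)
      print(f"replication n per group: {n_rep:.0f}, actual power: {real_power:.2f}")

    In runs like this the “95%-powered” replication ends up with well under 95% power against the true effect, which is exactly the overstated-effect-size problem described above.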

  17. This is all part of an ongoing critique of statistical studies. From my point of view it is not a problem unless we are using an iffy study to justify some draconian policy, or to sell some expensive medicine.

    It’s not a problem if the results are part of incremental research. Bad results will wash out.

  18. petrushka: Bad results will wash out.

    Sometimes, but not always. My big beef is with fMRI studies that use “cluster significance” to report their findings, then also report “peak voxels” within the cluster. Then the next people come along and use those “peak voxels” as the “ROI” (region of interest) for their own study. Then someone replicates that too, and before you know it people are saying things like “we know that [some particular region] is responsible for X”, and it’s simply not true. It just happens that the region is active during the process along with loads of other regions, and it may not be specific to X at all.
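
    One piece of this – why statistics read off data-selected “peak voxels” overstate the effect – can be shown with a toy simulation on pure noise (everything below is invented; it is a sketch of the circularity problem, not of any real fMRI pipeline):

      import numpy as np

      rng = np.random.default_rng(1)
      n_subjects, n_voxels = 20, 10000
      data = rng.normal(0, 1, (n_subjects, n_voxels))   # no true effect anywhere

      # "Study 1": pick the 10 voxels with the largest group-level t-values.
      t = data.mean(axis=0) / (data.std(axis=0, ddof=1) / np.sqrt(n_subjects))
      peak = np.argsort(t)[-10:]

      # Circular: measure the effect in those voxels with the SAME data.
      print("peak-voxel effect, same data:       ", round(float(data[:, peak].mean()), 2))

      # Non-circular: measure the same voxels in an independent data set.
      fresh = rng.normal(0, 1, (n_subjects, n_voxels))
      print("peak-voxel effect, independent data:", round(float(fresh[:, peak].mean()), 2))

    The first number comes out well above zero even though nothing is “really” active, which is why an ROI inherited from reported peak voxels starts life with a selection bias built in unless it has been checked on independent data.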

  19. Elizabeth: Which is why we need to do things like funnel-plots and explore “file-drawer” studies – and insist on pre-trial registration for intervention studies.

    All pretrial registration will do is ensure that secret pre-pre-trial research is done. You can’t fix bias with rules.

    peace

  20. Bad results will eventually wash out because they are blind alleys. Whether this happens fast enough is a matter of opinion. Sorry about the mixed metaphor.

  21. petrushka: Why is that a problem?

    It’s not if all you are interested in is personal utility. I set the bar a little higher.

    peace

  22. petrushka:
    Why is that a problem?

    To participate in a quip that parallels “the file drawer problem,” it has to be a problem too.

    Otherwise, as my non-existence for 13.82 billion years before now doesn’t seem to have been objectionable, I don’t expect it to be a problem at all once that state of affairs resumes.

  23. fifthmonarchyman: All pretrial registration will do is insure that secret pre-pre-trial research is done. You can’t fix bias with rules.

    peace

    That’s not the point. They can do as many secret “pre-trials” as they like (though it would be a waste of funding and difficult to keep secret). They remain secret – i.e. out of the public domain. The point is that all published trials should be pre-registered. If journals refused to publish trial results that had not been pre-registered, then there’d be a huge incentive to preregister (as there is not at present). Then, even if there’s publication bias in favour of significant results, you can get at the entire population of trials that were potentially publishable.

    That’s the point of PRE registration. It means the company with profits to gain from a successful trial has to take a risk by announcing the trial before they know the results. Even if it isn’t published, we know it’s there.

  24. Elizabeth: That’s not the point. They can do as many secret “pre-trials” as they like (though it would be a waste of funding and difficult to keep secret). They remain secret – i.e. out of the public domain.

    The problem is what is discovered (or not) in the secret “pre-trials” influences what is investigated in the published “pre-trials”.

    There is no way to get away from personal bias.

    We all start with a bias; evidence that tends to confirm it gets our attention, and evidence that does not tends to slip our notice.

    peace

  25. walto:
    Reciprocating Bill,

    RB, dunno if you saw it, but I put a question to you on the Ashley Madison thread that I’m curious to get your response on. Thx.

    No, I hadn’t been back there. My answer is, “Every Day.” Wait, that’s the Sixth Sense. My answer is,

    “Actually, I don’t. That’s a very good question that awakens me from the slumber of repeating old chestnuts on the topic. I’ll think about it, and maybe investigate.”

  26. fifthmonarchyman: The problem is what is discovered (or not) in the secret “pre-trials” influences what is investigated in the published “pre-trials”.

    Yes indeed. But you are missing the point.

    Let me take a live example: right now I am involved in developing a non-pharmacological treatment for ADHD. I haven’t registered a trial, because it isn’t a clinical trial yet – it’s at the proof-of-concept stage. If I get evidence that my intervention works, I will do some open-label trials (it’s difficult to find a good control for a non-pharmacological trial), possibly “secret”, possibly not. And if it looks like rubbish, it ends there.

    If it DOESN’T look like rubbish – if I get promising results – then at that point I register a trial, BEFORE starting the trial. In other words, I take a risk. And if I get a significant result from that, then I might do another, again registered BEFORE the trial starts. Or someone else might.

    The fact that there have been previous non-registered, even “secret” trials (right now, we don’t have anything worth publishing) doesn’t affect the public record. Anyone who wants to know whether my intervention *really* works ignores all the non-registered trials and only looks at the registered trials. The fact that I have only proceeded to registered-trial stage AFTER convincing myself that there is something in it doesn’t “bias” the outcome. It simply biases (in a good way) my choice of what to trial. I won’t bother trialling something for which I have no good evidence for efficacy. I will bother trialling something for which I already do have some evidence. Indeed, if I want funding for a proper trial I will need to demonstrate that evidence.

    There is no way to get away from personal bias.

    No, but there are effective ways of minimising it, and of alerting readers to possible sources of bias. All of which is standard science methodology, not always adhered to. But if it isn’t, then caveat emptor. There are a lot of bad studies and less-than-rigorous journal editors out there.

    Fortunately there are methods of determining potential sources of bias, and identifying whether steps have been taken to minimise these.

    We all start with a bias; evidence that tends to confirm it gets our attention, and evidence that does not tends to slip our notice.

    Of course. That’s why we need the complex methodology we use – including null hypothesis testing, double-blind trials, inter-rater reliability measures, funnel plots, file-drawer searches, pre-trial registration, etc.
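
    For anyone unfamiliar with funnel plots, here is a rough sketch of what they are meant to expose (all numbers are simulated, and the “publish only if significant” filter is deliberately crude): small studies are noisy, so if only significant results get published, the published small studies are the ones that happened to overshoot, and the funnel goes lopsided.

      import numpy as np
      import matplotlib.pyplot as plt
      from scipy import stats

      rng = np.random.default_rng(2)
      true_d = 0.2
      effects, ses, pub = [], [], []
      for _ in range(500):
          n = int(rng.integers(10, 200))          # per-group sample size
          a = rng.normal(true_d, 1, n)
          b = rng.normal(0.0, 1, n)
          t, p = stats.ttest_ind(a, b)
          effects.append(a.mean() - b.mean())     # observed effect (SDs are 1)
          ses.append(np.sqrt(2 / n))              # approximate standard error
          pub.append(bool(p < 0.05 and t > 0))    # crude publication filter

      effects, ses, pub = np.array(effects), np.array(ses), np.array(pub)
      plt.scatter(effects[pub], 1 / ses[pub], s=10, label="published")
      plt.scatter(effects[~pub], 1 / ses[~pub], s=10, alpha=0.3, label="file drawer")
      plt.axvline(true_d, linestyle="--")         # the true effect
      plt.xlabel("observed effect"); plt.ylabel("precision (1 / SE)")
      plt.legend(); plt.show()

    The published low-precision points all sit well to the right of the dashed line, so pooling the published studies alone overestimates the effect – which is exactly the asymmetry a funnel plot (or a file-drawer search, or pre-registration) is designed to catch.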

  27. Elizabeth: Of course. That’s why we need the complex methodology we use – including null hypothesis testing, double-blind trials, inter-rater reliability measures, funnel plots, file-drawer searches, pre-trial registration, etc.

    I’m not disagreeing that these things are good. I’m saying that they are not sufficient.

    I think what is warranted is a little skeptical realism. Science is a fallible methodology. That does not mean we should abandon it or that we should cease trying to improve it. We just need to understand that our efforts will always fall a little short.

    It just means that the results of experiments that involve humans as subjects or observers should be viewed with at least a little suspicion.

    peace

  28. fifthmonarchyman: I think what is warranted is a little skeptical realism. Science is a fallible methodology. That does not mean we should abandon it or that we should cease trying to improve it. We just need to understand that our efforts will always fall a little short.

    And we do. Which is why the methodology includes meta-methodologies, such as the study referenced in the OP.

    fifthmonarchyman: It just means that the results of experiments that involve humans as subjects or observers should be viewed with at least a little suspicion.

    All scientific studies involve humans as observers. The issue is not specific to psychology.
