Examining whether science self-corrects using citations of replication studies

As scientists, we often hope that science self-corrects. But several researchers have suggested that the self-corrective nature of science is a myth (see e.g., Estes, 2012; Stroebe et al., 2012). If science is self-correcting, we should expect that, when a large replication study finds a result that is different from a smaller original study, the number of citations to the replication study ought to exceed, or at least be similar to, the number of citations to the original study. In this blog post, I examine this question in six “correction” studies in which I’ve been involved.[1] This exercise is intended to provide yet another anecdote to generate a discussion about how we, as a discipline, approach self-correction and is by no means intended as a general conclusion about the field.

Sex differences in distress from infidelity in early adulthood and later life
In a 2004 article, Shackelford and colleagues reported that men, compared to women, are more distressed by sexual than by emotional infidelity (total N = 446). The idea was that this effect generalizes from young adulthood to later adulthood, which was taken as evidence for an evolutionary perspective. In our pre-registered replication studies (total N = 1,952), we did find the effect for people in early adulthood, but not in later adulthood. We also found that the disappearance of the effect was likely due to the sociosexual orientation of the older adults we sampled (in the Netherlands, as opposed to the United States). In other words, the basic original effect seemed present, but the original conclusion was not supported.
How did the studies fare in terms of citations?
Original study (since 2014): 56 citations
Replication study (since 2014): 23 citations
Conclusion: little to no correction done (although perhaps it was not a conclusive non-replication given the ambiguity of the theoretical interpretation)

Does recalling moral behavior change the perception of brightness?
Banerjee, Chatterjee, and Sinha (2012; total N = 114) reported that recalling unethical behavior led participants to see the room as darker and to desire more light-emitting products (e.g., a flashlight) compared to recalling ethical behavior. In our pre-registered replication study (N = 1,178) we did not find the same effects.
How did the studies fare in terms of citations?

Original study (since 2014): 142 citations
Replication study (since 2014): 24 citations
Conclusion: correction clearly failed.

Physical warmth and perceptual focus: A replication of IJzerman and Semin (2009)
This replication is clearly suboptimal, as it was a self-replication. The study was conducted at the beginning of the replication crisis, when we wanted to self-replicate some of our own work. In the original study (N = 39), we found that people in a warm condition focus more on perceptual relationships than on individual properties. In a higher-powered replication study (N = 128), we found the same effect (with a slightly different method to better avoid experimenter effects).
How did the studies fare in terms of citations?
Original study (since 2014): 323
Replication study (since 2014): 26
Conclusion: no correction needed (yet; but please someone other than us replicate this and the other studies, as these 2009 studies were all underpowered).

Perceptual effects of linguistic category priming
This was a particularly interesting case, as the paper was published after the first author, Diederik Stapel, was caught fabricating data. All but one of the 12 studies were conducted before he was caught (but we could never publish them, given the nature of the field at the time). In the original (now retracted) article, Stapel and Semin reported that priming abstract linguistic categories (adjectives) led to more global perceptual processing, whereas priming concrete linguistic categories (verbs) led to more local perceptual processing. In our replication, we could not find the same effect.[2]
How did the studies fare in terms of citations?
Original study (since 2015): 12
Replication study (since 2015): 3
Conclusion: correction failed (although citations slowed down significantly and some of the newer citations concerned Stapel’s fraud).

Does distance from the equator predict self-control?
This is somewhat of an outlier in this list, as it is an empirical test of a hypothesis from a theoretical article. The article proposed that people who live further from the equator have poorer self-control, and the authors suggested that this hypothesis be tested via data-driven methods. We were lucky enough to have a dataset (N = 1,537) to test it, and we took up the authors’ suggestion by using machine learning. In our commentary article, we were unable to find the effect (equator distance was a slightly less important predictor of self-control than whether people spoke Serbian).
How did the studies fare in terms of citations?
Original article (since 2015): 57
Empirical test of the hypothesis (since 2015): 3
Conclusion: correction clearly failed (in fact, the original first author published a very similar article in 2018 and cited the article 6 times).

A demonstration of the Collaborative Replications and Education Project: replication attempts of the red-romance effect
Elliot et al. (2010; N = 33) reported that women were more attracted to men whose photograph was presented with a red (vs. grey) border. Via one of my favorite initiatives I have been involved in, the Collaborative Replications and Education Project, 9 student teams tried to replicate this finding in pre-registered replications and were unable to find the same effect (despite very strict quality control and a larger sample, total N = 640).
How did the studies fare in terms of citations?

Original study (since 2019): 17
Replication study (since 2019): 8
Conclusion: correction failed.

Social Value Orientation and attachment
Van Lange et al. (1996; N Study 1 = 573; N Study 2 = 136) reported two findings suggesting that people who are more secure in their attachment give more to other (fictitious) people in a coin distribution game. These original studies suffered from some problems: first, the reliabilities of the measurement instruments ranged between alpha = 0.46 and 0.68. Second, the somewhat more reliable scales (alpha = 0.66 and 0.68) produced only marginal differences in a sample of 573 participants, when controlling for gender and after dropping items from the attachment scale (in addition, there were problems with the Dutch translation of one of the measures). In our replication study (N = 768), conducted with better measurement instruments and in the same country, we did not find the same effects.
How did the studies fare in terms of citations?

Original study (since 2019): 110
Replication study (since 2019): 8
Conclusion: correction clearly failed (this one is perhaps a bit more troubling, as the replication covered 2 of the 4 studies, and the ManyLabs2 researchers were also unable to replicate Study 3. Again, the first author was responsible for 4 of the citations).

[EDIT March 8 2020]: Eiko Fried suggested I should plot the citations by year. If you want to download the data and R code, you can download them here.

A couple of observations:

  • 2020, of course, is not yet complete. I therefore left it out of the graph, as including it could be misleading.
  • When plotting per year, it became apparent that 2016 for Banerjee et al. showed the “BBS effect”: the article was cited in a target article and received many ghost citations in Google Scholar from the published commentaries (which did not actually cite the article), so the 2016 count is inflated. This does not take away from the overall conclusion.
  • Overall, there seemed to be no decline in citations.

Overall conclusion
Total for original studies (excluding 1 and 3): 338
Total for replication studies (excluding 1 and 3): 46
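The totals above can be reproduced from the per-study counts reported earlier in this post. A minimal Python sketch (the dictionary labels are my own shorthand for the studies, not official names):

```python
# Citation counts copied from the post, excluding studies 1 (Shackelford)
# and 3 (the self-replication), as in the totals above.
original = {
    "Banerjee et al. (2012)": 142,
    "Stapel & Semin": 12,
    "equator / self-control": 57,
    "Elliot et al. (2010)": 17,
    "Van Lange et al. (1996)": 110,
}
replication = {
    "Banerjee replication": 24,
    "Stapel & Semin replication": 3,
    "equator commentary": 3,
    "CREP red-romance": 8,
    "Van Lange replication": 8,
}

total_original = sum(original.values())
total_replication = sum(replication.values())
print(total_original, total_replication)  # 338 46
print(round(total_original / total_replication, 1))  # 7.3
```

In other words, across these five cases the original studies were cited roughly seven times as often as the replications.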
For the current set of studies, we clearly fail at correcting our beloved science. I suspect the same is true for other replication studies. I would love to hear about the experiences of other replication authors, and I think it is time to start a discussion about how we can change these Questionable Citation Practices.

[1] I did not include any of the ManyLabs replication studies because they were so qualitatively different from the rest.

[2] Note: technically, this study was not a replication, as the original studies were never actually conducted. After Stapel was caught, the manuscript was first submitted to the journal that had originally published the effects. Their response was that they would not accept replications. When we pointed out that these were not replications, the manuscript was rejected because we had found null effects. Thankfully, the times are clearly changing now.
