8 M3U: Research Paradigms and Reproducibility

This chapter draws on material from:

Changes to the source material include the removal of original material, the addition of new material, the combining of sources, the editing of original material for a different audience, and the addition of first-person language from the current author.

The resulting content is licensed under CC BY-NC-SA 4.0.

8.1 Introduction

One goal of this reading is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their data science analyses reproducible from the very beginning. Reproducibility means different things in different scientific fields, but all of those meanings share a common goal: Is research conducted in a way that another researcher could follow the steps and get similar results? The importance placed on reproducibility comes in part from a growing belief that a substantial portion of published research findings may actually report false positives—or, at the very least, overestimate the importance of a study’s findings (Simmons et al., 2011). Ioannidis (2005) went so far as to title an article “Why Most Published Research Findings Are False,” describing several problems in medical research that inflate false positive rates.

If you can reproduce someone else’s research and get the same results, that suggests that the findings of the original study are on solid ground. However, if you try to follow their steps and get different results, there may be reason to question what the original researchers had to say. This is pretty straightforward, but reproducing someone else’s work isn’t as easy as it might seem. Most researchers in training eventually realize that the short summaries of research methods in journal articles gloss over much of the complexity and messiness in research—especially research that involves humans. These summaries spare readers from trivial details, but they can also misrepresent important elements of the research process. However, if our goal is to provide practitioners, policymakers, and other researchers with data, theory, and explanations that lead to practical improvements, we need to provide them with a way to judge the quality and contextual relevance of research. That judgment depends in turn on how researchers choose to share the methods and processes behind their work.

In data science, there is a particular emphasis on what is known as computational reproducibility. This refers to being able to pass all of one’s data analysis, datasets, and conclusions to someone else and have them get exactly the same results on their machine. This has obvious benefits for scientific concerns about reproducibility! In fact, one of my friends in graduate school had a job doing reproducibility checks for a political science journal. Anyone submitting research to be published in the journal also had to submit all of their data analysis, datasets, and conclusions. The paper itself would go through traditional peer review, but my friend would also review all of the submitted materials to make sure that there weren’t any discrepancies or errors in the data, analysis, and results described in the paper.
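To make “exactly the same results” a bit more concrete, here is a minimal sketch of two habits that support computational reproducibility in R. The seed value and the simulation are invented for illustration; the point is simply that fixing the random seed and recording the computing environment let someone else rerun a script and match its output.

# Fix the random number generator's seed so that anyone rerunning this
# script gets the same simulated values, not just similar ones.
set.seed(2024)
sample_means <- replicate(1000, mean(rnorm(30)))
quantile(sample_means, probs = c(0.025, 0.975))

# Record the R version and loaded packages so that a reviewer can
# recreate (or at least diagnose differences in) the environment.
sessionInfo()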

However, there are also practical and personal benefits to computational reproducibility. In a traditional scientific workflow, you might do your statistical work and data visualization in RStudio and then write up your results in a word processor, pasting in helpful plots that you’d generated in RStudio. This isn’t a terrible demand on your time, but what if you discover a mistake partway through? Or what if someone gives you good advice on a late draft of your paper that leads you to make some changes to your data analysis? In a traditional workflow, you’d need to step through the entire process again: redo the analyses, redo the data visualization, rewrite the paper, repaste the plots. This is error-prone and a frustrating use of time. In contrast, with the right software—and, more importantly, the right mindset—there are ways to include analysis and writing in the same document. Not only does this allow a colleague to easily reproduce all of the steps you’ve taken, but it might actually save you from having to redo work! This lets data scientists spend more time interpreting results and considering assumptions instead of restarting an analysis from scratch or following a list of steps that may be different from machine to machine.
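As an illustration, here is a minimal sketch of what such a combined analysis-and-writing document might look like. I’m using R Markdown syntax here because it pairs naturally with RStudio, but that choice, the file name scores.csv, and the variables are my own invented example rather than anything from a real project. When the document is knit, the code runs and the output is regenerated in place, so fixing a mistake in the data or analysis only requires knitting again rather than repasting plots by hand.

---
title: "Study write-up"
output: html_document
---

Scores were higher after the intervention, as the summary below shows.

```{r}
# Read the (invented) dataset and compute the summary and plot that the
# surrounding prose refers to; re-knitting updates text and output together.
scores <- read.csv("scores.csv")
summary(scores$post - scores$pre)
plot(scores$pre, scores$post)
```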

Really committing to reproducibility means building new habits; it takes practice and will be difficult at times. Hopefully, though, you’ll see just why it is so important to keep track of your code and document it well, both to help your future self and to help any potential collaborators.

8.2 Reproducibility and Paradigms

I should make it clear from the beginning that I am an imperfect example of reproducibility. Part of this is because this is a difficult process—I’m still developing necessary habits and trying to get rid of bad habits. However, I also have philosophical differences with reproducibility as I’ve described it here. This isn’t to say that I disagree with the importance of reproducibility itself—I think it’s a fine idea, and I’m even going to enforce a certain level of it in this class! However, the reasons that reproducibility is important are often tied to basic assumptions that data scientists typically make about how the world works. While I draw heavily from data science techniques in my research, I also draw from research traditions that do not share the same assumptions as data science.

8.2.1 An Example Paradigm

To explain more what I mean by this, let’s think about some of the assumptions that are implicitly present in my description of reproducibility in the previous section:

Everyone ought to get the same results every time: As described above, reproducibility is related to a belief that different scientists should get similar results if individually performing the same analysis. If two scientists don’t come to the same answer when asking similar questions, that’s seen as a problem. People who work separately ought to come to the same conclusions.

Research should be a series of nice, neat decisions: The basic idea of reproducibility is that it should be possible to provide a clear checklist that describes the steps that scientists took when carrying out an analysis. This only works if individual decisions can be clearly distinguished from each other and then communicated to another in a methodical way.

Research is meant to inform practitioners and policymakers: Part of the reasoning behind making a research study reproducible is giving people confidence in the study’s findings; part of the reasoning behind giving people confidence in the study’s findings is so that other people can act on those findings. This perspective sees research as useful when it translates into specific actions that other stakeholders can take.

Each of these three assumptions about research can be understood as connecting to a deeper assumption about how the world works and the resulting role of research: The world—including the things and people in it—works in consistent, predictable ways; the point of research (including data science) is to figure out how the world works so that we can get the outcomes we want. In research settings, these collections of assumptions (and the underlying worldview) are often referred to as a paradigm.

These aren’t terrible assumptions, and this isn’t a bad paradigm! They’ve driven most scientific progress and a lot of effective decision-making over the past couple of centuries. However, they’re not the only way of thinking about the world. Let me provide a recent example from my data science-adjacent research that rejects these assumptions in favor of a different set of assumptions.

8.2.2 Another Example Paradigm

In mid-2019, I used R (and other software) to collect screenshots of about 1,400 tweets associated with a specific Twitter hashtag. The organizers of the hashtag claim that people who use the hashtag are simply members of a particular church and that their goal is to defend that church against its critics. However, outside observers have argued that participants in the hashtag are heavily influenced by toxic, far-right online communities and believe that these tweets are more motivated by White nationalism than religious beliefs. Over the course of a couple of years, I worked with a colleague to carefully study each of these tweets so that we could come to some conclusions about what’s happening in this hashtag. Here are some of the assumptions that we had about our research:

Different people might come to different conclusions: My co-researcher and I understood while studying these tweets that we might understand them differently. I grew up in the church that this hashtag is associated with, so I pick up on a lot of subtle references that my colleague doesn’t. However, as a woman, she picks up on subtle misogyny that sometimes sails right over my (masculine) head. We have different lived experiences, and we know that if we analyzed this data separately, we would probably come to different conclusions—that’s exactly why we’re working together.

Research involves complex, messy decisions: What’s the point at which two researchers become confident that a 280-character post on Twitter includes a reference to White nationalism? The group that my colleague and I are studying is skilled at ambiguity and subtle references, so it’s not often the case that they clearly embrace extreme political views. It would be a mistake to ignore these views just because they’re not always easy to pin down, but it would also be irresponsible to exaggerate our conclusions. There’s no clear checklist for determining whether a tweet includes references to White nationalism—instead, we have to engage in lengthy discussions and make difficult calls.

The goal of research is to develop understanding: I hope that our research makes some kind of difference in the world, and I know that my colleague does too. However, after reviewing our data and our conclusions, it’s not as if we have a list of specific actions that someone can take in response to our research. Who is that “someone” anyway? Twitter users? The church that this group claims to be loyal to? The group itself? Any of those populations could hypothetically learn something from what we have to say, but it would be a stretch to say that there’s a clear and immediate takeaway from our research. Our goal is simply to document our findings in the hope of increasing readers’ understanding of this kind of phenomenon.

Each of our three assumptions about research can be understood as connecting to a paradigm we share: The world—including the things and people in it—is rich, complex, and difficult to perfectly summarize; the point of research is to try to capture some of that richness so that we can better appreciate details and context.

Some scientists are hostile to the idea of any paradigm except the first one and would be dismissive of the kind of research that I’ve just described. However, I’m confident that many data scientists acknowledge the existence and value of different research paradigms. Their response might be that the first paradigm works particularly well for data science, but that there’s no harm in using the second paradigm in other kinds of research. In fact, they might suggest, even if I did use R for that research project, it doesn’t really count as data science (to be honest, I’d agree with this evaluation, but remember what we’ve already read about what gets to count as data science!).

8.2.3 Why Do Paradigms Matter?

In that case, what’s the point of making a big deal about paradigms here? If reproducibility is based on a paradigm about the predictability of the world, and if data science embraces that paradigm, why emphasize the existence of other paradigms? That’s a good question, but there’s an answer that’s just as good. In short, while it’s very useful to assume that the world is neatly ordered and predictable (remember all of those scientific and technological developments over the past couple of centuries?), it’s also possible to overemphasize that assumption—and overemphasis carries with it important ethical consequences.

Let’s consider a concrete example related to data science. In recent years, there’s been considerable interest in sentencing algorithms. The reasoning goes something like this: human judges are fallible and biased and might hand out different sentences for the same crime based not on an objective interpretation of the law but on their subjective responses to the accused. Wouldn’t it be great if we could come up with a computer program that would take in the facts of a case and information about the criminal and spit out an appropriate, fair sentence that a potentially biased judge couldn’t interfere with? The influence of the first paradigm ought to be clear here: Decisions ought to be consistent, so let’s boil sentencing down to a neat, checklist-like algorithm that uses data to make a recommendation and make a practical difference.

In theory, this sounds great—an application of data science that could make a real difference in the world. However, there are plenty of critics of this approach (including me). No one disputes that human judges are fallible and biased, and it would be fantastic to come up with a way of sentencing criminals that wouldn’t be as influenced by personal prejudices. The issue isn’t with the intent; it’s with the underlying paradigm. Critics argue that in valuing consistency, predictability, and clear-cut recommendations, sentencing algorithms and their designers overlook the ways in which the world doesn’t actually work like that!

For example, a sentencing algorithm would likely be trained on sentences handed down by human judges and would use those as the basis for its supposedly objective decisions. However, as we’ve already established, human judges are fallible and biased; we might (and almost certainly would!) find that judges give harsher sentences to criminals of color than to White criminals. A sentencing algorithm that used race as a factor in determining sentences would therefore treat “non-White” as a statistically valid reason to recommend a harsher sentence, simply because that’s the pattern it was trained on. Rather than replace human bias, the algorithm would learn from it—and because people tend to believe that computers are more objective than humans, that bias would be harder to argue against! In a sense, this is a worse outcome, because it inherits the bias but disguises it as an objective measure.
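To see how an algorithm can inherit bias mechanically, here is a small simulated sketch in R. The data are entirely invented (this is not real sentencing data or any deployed system): we generate sentences in which judges add extra months for defendants of color, then fit a simple model to those sentences. The model recovers the racial penalty as if it were just another legitimate predictor, which is exactly the problem critics point to.

set.seed(42)
n <- 5000
severity  <- runif(n, min = 1, max = 10)      # invented measure of offense severity
non_white <- rbinom(n, size = 1, prob = 0.4)  # invented indicator of defendant race

# Simulated *biased* human sentences (in months): 6 months per severity
# point, plus 8 extra months for defendants of color, plus noise.
sentence <- 6 * severity + 8 * non_white + rnorm(n, sd = 5)

# An "algorithm" trained on those sentences learns the bias and would
# reproduce it in its recommendations.
model <- lm(sentence ~ severity + non_white)
coef(model)["non_white"]  # estimated close to 8: the learned racial penalty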

This criticism implicitly adopts the second paradigm that I’ve described in this section: The process of sentencing a criminal is too rich and complex to distill down to a computational decision. More importantly, it points out a weakness of the first paradigm: A belief that the world is consistent and predictable is actually sometimes just a preference for things that seem consistent and predictable. That is, data scientists sometimes fall into the trap of believing that a consistent and predictable solution to a problem is better, when a solution that acknowledges complexity and nuance might feel less efficient but would do better at avoiding harm.

8.3 Supporting Reproducibility

If I am sometimes resistant to calls for reproducibility, it is because I think the paradigm that these calls assume doesn’t always hold up. However, there’s nothing half-hearted about my teaching you about reproducibility in this class—it’s impossible to deny that there are amazing benefits to making your data science work reproducible (so long as data scientists critically examine their own paradigms).

Let’s consider an example that I like a lot. Josh Rosenberg is one of my data science heroes. We went to grad school together, and 90% of what I know about R is thanks to him; if I have questions about R or data science, Josh is the first person I go to. When we were still in grad school, Josh participated in a study in which:

independent analysts used the same dataset to test two hypotheses regarding the effects of scientists’ gender and professional status on verbosity during group meetings (Schweinsberg et al., 2021, p. 230)

In other words, Josh and dozens of other researchers (the list of contributors to this study is literally a page long) were all given the same data and encouraged to test the same hypotheses. However, apart from the data and the hypotheses, the organizers of the study were intentionally skimpy on details. For example, it was up to individual researchers to determine how to best measure “professional status” using the available data, what statistical tests to use, etc. As a result,

Researchers reported radically different analyses and dispersed empirical outcomes, in a number of cases obtaining significant effects in opposite directions for the same research question. (Schweinsberg et al., 2021, p. 230)

In short, even though they were all given the same prompts and the same data, researchers made very different analytical decisions and came to very different results, sometimes producing conflicting (but individually compelling) findings. The article reports that

Subjective researcher decisions play a critical role in driving the reported empirical results, underscoring the need for open data, systematic robustness checks, and transparency regarding both analytic paths taken and not taken. (Schweinsberg et al., 2021, p. 230)

That is, scientists make lots of decisions when analyzing data. Even if they’re using the same data and asking the same questions, different, equally qualified scientists might make wildly different decisions. In theory, any research article is supposed to contain a detailed description of how the analysis was carried out, but as we’ve already determined, word limits and researchers’ attention spans are too short for articles to cover every decision that could potentially make a difference.

This is why reproducibility is important. If you can carefully document every decision you made, every line of code you wrote, and every other aspect of your analysis, other people will know not just what data you used and what questions you asked, but how—in very specific terms—you asked those questions. This is an undeniably good thing.

So, how do we support reproducibility in practical terms? Here are a few approaches, with a focus on how they play out in traditional academic research:

8.3.1 Open Design

Research design is essential to any study as it dictates the scope and use of the study. This phase includes formulating the key research question(s), designing methods to address these questions, and making decisions about practical and technical aspects of the study. Typically, this entire phase is the private affair of the involved researchers. In many studies, the hypotheses are obscured or even unspecified until the authors are preparing an article for publication. Readers often cannot determine how hypotheses and other aspects of the research design have changed over the course of a study since usually only the final version of a study design is published.

Moreover, there is compelling evidence that much of what does get published is misleading or incomplete in important ways. A meta-analysis (that is, research on other research) found that 33% of authors admitted to questionable research practices, such as “dropping data points based on a gut feeling,” “concealment of relevant findings,” and/or “withholding details of methodology” (Fanelli, 2009). Given that these numbers are based on self-reports and thus subject to social desirability bias, it is plausible that they are underestimates.

For Open Design, researchers make every reasonable effort to give readers access to a truthful account of the design of a study and how that design changed over the duration of the study. Since study designs can be complex, this often means publishing different elements of a study design in different places. For instance, many academic journals publish short methodological summaries in the full text of an article and allow more detailed supplementary materials of unlimited length online. In addition, analytic code might be published in a linked GitHub account, and data might be published in an online repository. These various approaches allow for more detail about methods to be published, with convenient summaries for general readers and more complete specifics for specialists and those interested in replication and reproduction. There are also a variety of approaches for increasing transparency by publishing a time-stamped record of methodological decisions before publication: a strategy known as preregistration. Preregistration is the practice of documenting and sharing the hypotheses, methodology, analytic plans, and other relevant aspects of a study before it is conducted (Gehlbach & Robinson, 2018; Nosek & Lakens, 2014).

8.3.2 Open Analysis

Open Analysis is the systematic reproduction of analytic methods conducted by other researchers—it’s the main support for reproducibility that has come up in this reading. Reproducing research is central to scientific progress because any individual study is generally insufficient to make robust or generalizable claims—the kinds that others could clearly act on. It is only after ideas are tested and replicated in various conditions and contexts and results “meta-analyzed” across studies that more durable scientific principles and precepts can be established.

One form of replication is a reproduction study, in which researchers attempt to faithfully reproduce the results of a study using the same data and analyses. Such studies depend on Open Design, so that replication researchers can use not only the same methodological techniques but also the same exclusion criteria, coding schemes, and other analytic steps that allow for faithful replication. In recent years, perhaps the most famous reproduction study was by Thomas Herndon, a graduate student at UMass Amherst who discovered that two Harvard economists, Carmen Reinhart and Kenneth Rogoff, had failed to include five rows in an averaging operation in an Excel spreadsheet (The Data Team, 2016). When the average was recomputed across the full data set, the study’s claims had a much weaker empirical basis. Ouch!

In data science, where statistical code is central to conducting analyses, the sharing of that code is one way to make analytic methods more transparent. GitHub and similar repositories allow researchers to store code, track revisions, and share with others—GitHub’s importance for reproducibility is a major reason that we’re using it in this course. At a minimum, these repositories allow researchers to publicly post analytic code in a transferable, machine-readable platform. Used more fully, GitHub repositories can allow researchers to share preregistered codebases that present a proposed implementation of hypotheses, final code as used in publication, and all of the changes in between. Simply making code “available on request” will not be as powerful as creating additional mechanisms that encourage researchers to proactively share their analytic code: as a requirement for journal or conference submissions, as an option within study preregistrations, or in other venues. Reinhart and Rogoff’s politically consequential error might have been discovered much sooner if their analyses had been made available along with publication rather than after the idiosyncratic query of an individual researcher.
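As a small illustration of why shared, scripted analysis makes this kind of checking easier, compare a hardcoded row range to an operation over the whole dataset. The numbers and country labels below are invented and have nothing to do with the actual Reinhart and Rogoff spreadsheet; the point is that a script makes the scope of a calculation visible to anyone reading the shared code.

# Invented example data: one growth figure per country.
growth <- data.frame(
  country = c("A", "B", "C", "D", "E", "F"),
  value   = c(2.1, 1.8, -0.3, 2.5, 1.1, 0.9)
)

# A spreadsheet-style mistake reproduced in code: averaging a hardcoded
# subset of rows silently drops the last two countries.
mean(growth$value[1:4])

# Operating on the whole column makes the intended computation explicit,
# and a reviewer of the shared script can see exactly what was averaged.
mean(growth$value)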

8.3.3 Open Publication

Open Access (OA) literature is digital, online, available to read free of charge, and free of most copyright and licensing restrictions (Suber, 2004). Most for-profit publishers obtain all the rights to a scholarly work and give back limited rights to the authors. With Open Access, the authors retain the copyright to their article and the right to allow anyone to download and reprint it, provided that the authors and source are cited, for example under a Creative Commons Attribution License (CC BY 4.0). Opening access increases the ability of researchers, policymakers, and practitioners to leverage scientific findings for the public good.

Sharing a publicly accessible preprint is also a way to receive comments and feedback from fellow scientists, a form of informal peer review. Preprints are publicly shared manuscripts that have not (yet) been peer reviewed. A variety of peer-reviewed journals acknowledge the benefits of preprints. Across the physical and computer sciences, repositories such as ArXiv have dramatically changed publication practices and instituted a new form of public peer review across blogs and social media. In the social sciences, the Social Science Research Network and SocArXiv offer additional repositories for preprints and white papers. Preprints enable more iterative feedback from the scientific community and provide public venues for work that addresses timely issues or that otherwise would not benefit from formal peer review.

Whereas historically peer review has been considered a major advantage over these forms of nonreviewed publishing, the limited amount of available evidence suggests that the typically closed peer-review process has limited or no benefits (Bruce et al., 2016; Jefferson et al., 2002). Public scholarly scrutiny may prove to be an excellent complement, or perhaps even an alternative, to formal peer review.

8.4 Conclusion

Data science is a field that values reproducibility, and we’re going to practice being reproducible as part of this course. It’s important to recognize that the value of reproducibility is based on certain assumptions about what good research is and how the world works; those assumptions do not always hold up, and they can even be dangerous if we’re not careful about them. However, that doesn’t change the fact that there are real, powerful benefits to reproducibility.

8.5 References

Bruce, R., Chauvin, A., Trinquart, L., Ravaud, P., & Boutron, I. (2016). Impact of interventions to improve the quality of peer review of biomedical journals: A systematic review and meta-analysis. BMC Medicine, 14(1), 85. https://doi.org/10.1186/s12916-016-0631-5

The Data Team. (2016, September 7). Excel errors and science papers. The Economist. Retrieved from https://www.economist.com/blogs/graphicdetail/2016/09/daily-chart-3

Fanelli, D. (2009). How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PloS One, 4(5), e5738. https://doi.org/10.1371/journal.pone.0005738

Gehlbach, H., & Robinson, C. D. (2018). Mitigating illusory results through preregistration in education. Journal of Research on Educational Effectiveness, 11, 296–315. https://doi.org/10.1080/19345747.2017.1387950

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124

Jefferson, T., Alderson, P., Wager, E., & Davidoff, F. (2002). Effects of editorial peer review: A systematic review. JAMA, 287(21), 2784–2786. https://doi.org/10.1001/jama.287.21.2784

Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137–141. https://doi.org/10.1027/1864-9335/a000192

Schweinsberg, M., Feldman, M., Staub, N., van den Akker, O. R., van Aert, R. C. M., van Assen, M. A. L. M., Liu, Y., Althoff, T., Heer, J., Kale, A., Mohamed, Z., Amireh, H., Prasad, V. V., Bernstein, A., Robinson, E., Snellman, K., Sommer, S. A., Otner, S. M. G., … Uhlmann, E. L. (2021). Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organizational Behavior and Human Decision Processes, 165, 228–249. https://doi.org/10.1016/j.obhdp.2021.02.003

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Suber, P. (2004, June 21). Open Access overview. Retrieved from https://legacy.earlham.edu/~peters/fos/overview.htm