13 M4U: The Value of Open Data
This chapter draws on material from Open Education Science by Tim van der Zee and Justin Reich, licensed under CC BY-NC 4.0.
Changes to the source material include removal of original material, the addition of new material, combining of sources, and editing of original material for a different audience.
The resulting content is licensed under CC BY-NC 4.0.
13.1 Introduction
This reading’s focus on open data is closely related to a previous reading’s focus on reproducibility, which means that there’s value in revisiting the idea of paradigms. I’ll use some different examples and language here, but you ought to be able to see how what I’ve written here connects with our previous discussion.
A great way to remember what paradigms are and how they differ is to compare molecular biology (my spouse’s major in college) and French (my major in college). A professor of molecular biology (or of any of the other “hard sciences”) assumes that there is a predictable reality behind what they’re studying. That is, organic molecules work in particular, quasi-universal ways, and if you can figure that out, you can introduce established causes to bring about desired effects. This is often called a positivist paradigm, with “positivist” having connotations of “rational” and “data-driven.”
In contrast, a professor of French (or any of the other disciplines in the humanities) assumes that what they’re studying is important, but largely arbitrary. French is decidedly not universal—it’s something that humans made up instead of discovered. Plus, it doesn’t apply to most humans and isn’t consistent across the humans that it does apply to (for example, French-speaking people from Switzerland count differently than French-speaking people from Canada—they can’t even agree on numbers). Thus, while you could try to talk about French in terms of cause and effect, most French professors are more interested in understanding and describing French than in predicting it. This is often called an interpretivist paradigm, underlining its focus on trying to understand humans’ semi-arbitrary meaning-making.
In between the hard sciences and the humanities, there’s a large, important mega-discipline we can call the social sciences, which includes library and information sciences, technology studies, education research, and many applications of data science. There’s a fair bit of dispute within the social sciences about whether they ought to be modeled after the hard sciences (with a positivist paradigm) or the humanities (with an interpretivist paradigm—or one of a few others we haven’t covered here). That is, when we’re studying people and people-related phenomena, can we assume that there are quasi-universal laws that govern and therefore predict human behavior in the same way we can of atoms, chemicals, and molecules? Or, is it safer to assume that people’s behavior is context-dependent rather than universal and something to be understood rather than predicted?
Based on what I’ve already shared, it shouldn’t surprise you that I am more of an interpretivist than a positivist—my research is much more interested in describing and understanding phenomena because I’m skeptical about the possibility or value of predicting human behavior. That said, I have a healthy respect for positivism, and I’ll be the first to admit that trying to determine cause and effect is more directly “useful” than trying to document contextual variations in human behavior. I prefer the interpretivist paradigm, but the fact is that we need positivist views, too.
13.2 Positivism and the Need for Data
The most committed positivist will still acknowledge that even if there are universal laws governing the behavior of atoms, bacteria, humans, or organizations, it can be tremendously difficult to determine what those laws are, especially since a lot of the easy stuff was figured out decades or centuries ago by folks like Newton, Mendel, and Curie.
Identifying a cause-and-effect relationship with a high degree of confidence requires a few things, including access to appropriate data. Appropriate data can cost a lot of money to gather, be difficult to gather, take a lot of time to gather, or all of the above. As we’ve touched on previously, modern information and communication technologies make it a lot easier to collect data than ever before, but just because one researcher has collected data doesn’t necessarily mean that they will share it with others. Many researchers, understandably, take the stance that “if I put all the effort into gathering this data, I’m not going to give it to others to analyze for free.”
Nonetheless, governments and research funding agencies are increasingly requiring the scientists they fund to share their data as a condition of that funding. In parallel, work is being done to think more comprehensively about how to collect, manage, and share data in appropriate, responsible ways. These open science efforts overlap considerably with the push for reproducibility that we’ve already discussed; if there is a difference, I would describe it as a difference between quality and quantity.
Reproducibility largely argues that:
- testable science leads to better science,
- sharing scientific materials allows for testing science, and
- sharing scientific materials therefore makes for better science.
Open science largely argues that:
- more science leads to better science,
- sharing data allows for more science, and
- sharing data therefore makes for better science.
In general terms, I think that both of these arguments are sound and that locking down scientific data is morally questionable. However, we’ll see later that there are privacy issues closely connected with data sharing.
13.6 Benefits of Sharing Data
Open Data can improve the scientific process both during and after publication, in keeping with the connections to reproducibility and open science that we’ve made earlier.
Without access to the data underlying a paper under review, peer reviewers are substantially hindered in their ability to assess the evidential value of its claims. Allowing reviewers to audit statistical calculations should reduce the number of calculation errors, unsupported claims, and erroneous descriptive statistics that are later found in the published literature (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016; Van der Zee, Anaya, & Brown, 2017).
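To make the idea of auditing a reported statistic concrete, here is a minimal sketch of one simple consistency check, sometimes called a granularity (or GRIM-style) check. It is offered as an illustration, not as the procedure used in the studies cited above. It assumes that every participant gave a whole-number response (for example, a single Likert item), so the sum of responses must be an integer. The function name `grim_consistent` and the example numbers are hypothetical.

```python
import math


def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some integer sum of n whole-number responses
    could round to the reported mean at the given number of decimals."""
    target = reported_mean * n            # the sum implied by the reported mean
    half_unit = 0.5 * 10 ** (-decimals)   # largest error that rounding can explain
    # Only the two integers closest to the implied sum need to be checked.
    for total in (math.floor(target), math.ceil(target)):
        if abs(total / n - reported_mean) <= half_unit + 1e-9:
            return True
    return False


# Hypothetical reported values, checked against their sample sizes.
print(grim_consistent(3.48, 25))  # True: 87 / 25 = 3.48
print(grim_consistent(5.19, 28))  # False: no integer sum over 28 people gives 5.19
```

A check like this needs only the summary statistics, which is exactly why it is limited: it can flag a number as impossible, but only access to the underlying data can show what the correct number is.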
Open Data also enables secondary analyses, which require direct access to the data and cannot be performed using only the summary statistics typically presented in a paper. This reuse increases the value of gathering data in the first place: data collection can be a lengthy and costly process, which makes it economically wasteful not to share this valuable commodity. Open Data is a research accelerator that can speed up the process of establishing important new findings (Pisani et al., 2016; Woelfle, Olliaro, & Todd, 2011).
13.7 Downsides of Sharing Data
Perhaps the strongest objection to Open Data sharing concerns issues of privacy protection. Safeguarding the identity and other valuable information of research participants is of utmost importance and takes priority over data sharing, but these are not mutually exclusive endeavors. Sharing data is not a binary decision, and there is a growing body of research around differential privacy that suggests a variegated approach to data sharing (Daries et al., 2014; Gaboardi et al., 2016; Wood et al., 2014).
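For readers who have not met differential privacy before, the sketch below shows one of its basic building blocks, the Laplace mechanism, applied to a simple counting question. It is a toy illustration under stated assumptions, not the approach of any of the systems cited above: the `ages` list is made-up data, the function names are hypothetical, and the privacy parameter `epsilon` is chosen arbitrarily.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy predicate.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)
ages = [19, 23, 31, 35, 42, 47, 52, 58, 64, 70]  # hypothetical participants

# Smaller epsilon means more noise and stronger privacy; larger epsilon
# means answers closer to the true count of 6.
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(private_count(ages, lambda a: a >= 40, epsilon, rng), 2))
```

The design choice worth noticing is that the researcher never releases the raw rows, only noisy answers to questions about them, which is one way that sharing can be something other than a binary decision.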
Even when a data set cannot be shared publicly in its entirety, it may be possible to share de-identified data or, as a minimum, information about the shape and structure of the data (i.e., metadata). Daries et al. (2014) provided one case study of a de-identified data set from MOOC learners, which was too “blurred” for accurately estimating distributions or correlations about the population but could provide useful insights about the structure of the data set and opportunities for hypothesis generation. However, it should also be noted that it is sometimes easier to “re-identify” data than people think—especially in the world of big data:
According to a paper … in Scientific Reports … researchers at MIT and the Université Catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them.
In other words, to extract the complete location information for a single person from an “anonymized” data set of more than a million people, all you would need to do is place him or her within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour, four times in one year. A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person’s whereabouts. (Hardesty, 2013)
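The logic of that finding can be sketched with a toy simulation. The code below does not use the data or method from the study described above; it generates small synthetic "traces" of (tower, hour) points and asks how often a handful of known points singles out one person. All names, sizes, and parameters are hypothetical.

```python
import random


def matching_users(traces, known_points):
    """Return the users whose traces contain every known (tower, hour) point."""
    known = set(known_points)
    return [user for user, trace in traces.items() if known <= trace]


def fraction_unique(traces, k, seed=0):
    """Share of users singled out by k points drawn at random from their own trace."""
    rng = random.Random(seed)
    unique = 0
    for user, trace in traces.items():
        points = rng.sample(sorted(trace), min(k, len(trace)))
        if matching_users(traces, points) == [user]:
            unique += 1
    return unique / len(traces)


def make_toy_traces(n_users=200, n_points=8, seed=1):
    """Synthetic data: each user gets up to n_points random (tower, hour) pairs."""
    rng = random.Random(seed)
    return {
        f"user{i}": {(rng.randrange(20), rng.randrange(24)) for _ in range(n_points)}
        for i in range(n_users)
    }


toy = make_toy_traces()
for k in (1, 2, 4):
    print(k, fraction_unique(toy, k))
```

Note that the traces contain no names at all. The identification comes entirely from how distinctive a person's pattern of points is, which is why removing names from a data set is not the same as making it anonymous.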
For textual data, such as transcripts from interviews and other forms of qualitative research, there are tools that allow researchers to quickly de-identify large bodies of texts, but textual data can also be deeply personal, so I’d personally have misgivings about sharing that data with others.
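As a rough sense of what the simplest kind of text de-identification involves, here is a minimal sketch that masks email addresses and phone numbers and swaps known participant names for codes. It is an assumption-laden toy, not a description of any particular tool: the patterns only cover simple formats (the phone pattern assumes a 3-3-4 digit layout), and the `pseudonyms` mapping is something the researcher would have to supply.

```python
import re

# Deliberately simple patterns; real transcripts will contain formats these miss.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def deidentify(text: str, pseudonyms: dict) -> str:
    """Mask emails and phone numbers, then replace known names with codes."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    for name, code in pseudonyms.items():
        # Word boundaries keep "Jane" from matching inside a longer word.
        text = re.sub(rf"\b{re.escape(name)}\b", code, text)
    return text


sample = "Jane said to email jane@example.org or call 555-123-4567."
print(deidentify(sample, {"Jane": "P01"}))
```

Pattern matching of this kind removes only the obvious identifiers. A story about a distinctive event, workplace, or family situation can identify someone just as surely, which is part of why caution about sharing qualitative data is reasonable.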
Even when a whole data set cannot be shared, subsets might be shareable to provide more insight into coding techniques or other analytic approaches. Privacy concerns should absolutely shape decisions about what researchers choose to share, and researchers should pay particular attention to implications for informed consent and data collection practices, but research into differential privacy shows that openness and privacy can be balanced in thoughtful ways.
Another concern with data sharing is “scooping” and problems with how research production is incentivized. Researchers are often under a lot of pressure to produce a lot of research, and research is nearly always judged on whether it contributes something new to human understanding. So, sharing your own data could potentially allow someone else to do an analysis that you were hoping to publish, and I don’t blame people for worrying about that. Furthermore, data scientists often work in the corporate world, and companies are probably even more protective of their data, since they don’t want competitors to see it.
However, it’s also possible to go too far with this concern. For example, in an editorial in the New England Journal of Medicine, Longo and Drazen (2016) stated that: “There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as ‘research parasites’” (para. 3). Specifically, these authors were concerned that scholars might “parasitically” use data gathered by others; they suggested that data should instead be shared “symbiotically,” for example by requiring that the original researchers be given co-author status on all papers that use their data. This editorial, and especially the framing of scholars as “parasites” for reusing valuable data, sparked considerable discussion. In fact, this discussion resulted in the “Research Parasite Award,” which reclaimed the derisive reference in the service of genuinely celebrating rigorous secondary analysis.
13.8 Incentives for Sharing Data
These concerns, and their connection to how scientific research works, demonstrate the need to provide incentives for sharing data. The U.S. National Research Council (1997) has argued: “The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research” (p. 10).
There are various ways to make better use of the data that we have already generated. One is to give data sets persistent identifiers so that they can be properly cited by whoever reuses the data. This way, the data collectors continue to benefit from sharing their data, as they will be repeatedly cited and have proof of how their data have been fundamental to others’ research. There is evidence that Open Data increase citation rates (Piwowar, Day, & Fridsma, 2007), and other institutional actors could play a role in elevating the status of Open Data.
An increasing number of academic journals have started to award special badges, displayed on the paper itself, to papers that are accompanied by publicly available data in an Open Access repository. Journal policies can also have a strong positive effect on the prevalence of Open Data (Nuijten et al., 2017). Scholarly societies and research foundations could create new awards for the contribution of valuable data sets in education research. Perhaps most importantly, promotion and tenure committees in universities should recognize the value of contributing data sets to the public good and ensure that young scholars can be recognized for those contributions.
13.9 Conclusion
Open data is an important component of the data science community, and it’s also important for supporting reproducibility and advancing science. Like other components of reproducibility, the value of open data emerges from a particular set of assumptions about how science works, and other scientific perspectives can help raise concerns about privacy and other issues. Nonetheless, so long as it is done responsibly, sharing data is a good and important thing.
13.10 References
Daries, J. P., Reich, J., Waldo, J., Young, E. M., Whittinghill, J., Ho, A. D., . . . Chuang, I. (2014). Privacy, anonymity, and big data in the social sciences. Communications of the ACM, 57(9), 56–63. https://doi.org/10.1145/2643132
Gaboardi, M., Honaker, J., King, G., Nissim, K., Ullman, J., & Vadhan, S. (2016). PSI: A private data sharing interface. Retrieved from https://arxiv.org/abs/1609.04340
Hardesty, L. (2013). How hard is it to ‘de-anonymize’ cellphone data? MIT News. https://news.mit.edu/2013/how-hard-it-de-anonymize-cellphone-data
King, G. (2007). An introduction to the dataverse network as an infrastructure for data sharing. Sociological Methods & Research, 36(2), 173–199. https://doi.org/10.1177/0049124107306660
Longo, D. L., & Drazen, J. M. (2016). Data sharing. New England Journal of Medicine, 374(3). https://doi.org/10.1056/NEJMe1516564
National Research Council. (1997). Bits of power: Issues in global access to scientific data. Washington, DC: National Academy Press.
Nuijten, M. B., Borghuis, J., Veldkamp, C. L. S., Alvarez, L. D., van Assen, M. A. L. M., & Wicherts, J. M. (2017, July 13). Journal data sharing policies and statistical reporting inconsistencies in psychology. Retrieved from https://osf.io/preprints/psyarxiv/sgbta
Nuijten, M. B., Hartgerink, C. H., van Assen, M. A., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226. https://doi.org/10.3758/s13428-015-0664-2
Pisani, E., Aaby, P., Breugelmans, J. G., Carr, D., Groves, T., Helinski, M., . . . Mboup, S. (2016). Beyond open data: Realising the health benefits of sharing data. BMJ, 355. https://doi.org/10.1136/bmj.i5295
Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS ONE, 2(3), e308. https://doi.org/10.1371/journal.pone.0000308
Van der Zee, T., Anaya, J., & Brown, N. J. L. (2017). Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition, 3(54). https://doi.org/10.1186/s40795-017-0167-x
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS ONE, 6(11), e26828. https://doi.org/10.1371/journal.pone.0026828
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726–728. https://doi.org/10.1037/0003-066X.61.7.726
Woelfle, M., Olliaro, P., & Todd, M. H. (2011). Open science is a research accelerator. Nature Chemistry, 3(10), 745–748. https://doi.org/10.1038/nchem.1149
Wood, A., O’Brien, D., Altman, M., Karr, A., Gasser, U., Bar-Sinai, M., . . . Wojcik, M. J. (2014). Integrating approaches to privacy across the research lifecycle: Long-term longitudinal studies. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2469848