13 M4U: The Value of Open Data

This chapter draws on material from Open Education Science by Tim van der Zee and Justin Reich, licensed under CC BY-NC 4.0.

Changes to the source material include removal of original material, the addition of new material, combining of sources, and editing of original material for a different audience.

The resulting content is licensed under CC BY-NC 4.0.

13.1 Introduction

This reading’s focus on open data is closely related to a previous reading’s focus on reproducibility, which means that there’s value in revisiting the idea of paradigms. I’ll use some different examples and language here, but you ought to be able to see how what I’ve written here connects with our previous discussion.

A great way to remember what paradigms are and how they differ is to compare molecular biology (my spouse’s major in college) and French (my major in college). A professor of molecular biology (or of any of the other “hard sciences”) assumes that there is a predictable reality behind what they’re studying. That is, organic molecules work in particular, quasi-universal ways, and if you can figure that out, you can introduce established causes to bring about desired effects. This is often called a positivist paradigm, with “positivist” having connotations of “rational” and “data-driven.”

In contrast, a professor of French (or any of the other disciplines in the humanities) assumes that what they’re studying is important, but largely arbitrary. French is decidedly not universal—it’s something that humans made up instead of discovered. Plus, it doesn’t apply to most humans and isn’t consistent across the humans that it does apply to (for example, French-speaking people from Switzerland count differently than French-speaking people from Canada—they can’t even agree on numbers). Thus, while you could try to talk about French in terms of cause and effect, most French professors are more interested in understanding and describing French than in predicting it. This is often called an interpretivist paradigm, underlining its focus on trying to understand humans’ semi-arbitrary meaning-making.

In between the hard sciences and the humanities, there’s a large, important mega-discipline we can call the social sciences, which includes library and information sciences, technology studies, education research, and many applications of data science. There’s a fair bit of dispute within the social sciences about whether they ought to be modeled after the hard sciences (with a positivist paradigm) or the humanities (with an interpretivist paradigm—or one of a few others we haven’t covered here). That is, when we’re studying people and people-related phenomena, can we assume that there are quasi-universal laws that govern and therefore predict human behavior in the same way we can of atoms, chemicals, and molecules? Or, is it safer to assume that people’s behavior is context-dependent rather than universal and something to be understood rather than predicted?

Based on what I’ve already shared, it shouldn’t surprise you that I am more of an interpretivist than a positivist—my research is much more interested in describing and understanding phenomena because I’m skeptical about the possibility or value of predicting human behavior. That said, I have a healthy respect for positivism, and I’ll be the first to admit that trying to determine cause and effect is more directly “useful” than trying to document contextual variations in human behavior. I prefer the interpretivist paradigm, but the fact is that we need positivist views, too.

13.2 Positivism and the Need for Data

Even the most committed positivist will acknowledge that even if there are universal laws governing the behavior of atoms, bacteria, humans, or organizations, it can be tremendously difficult to determine what those laws are, especially since a lot of the easy stuff was figured out centuries or decades ago by folks like Newton, Mendel, and Curie.

Identifying a cause-and-effect relationship with a high degree of confidence requires a few things, including access to appropriate data. Appropriate data can cost a lot of money to gather, be difficult to gather, take a lot of time to gather, or all of the above. As we’ve touched on previously, modern information and communication technologies make it easier than ever to collect data, but just because one researcher has collected data doesn’t necessarily mean that they will share it with others. Many researchers—understandably—take the stance that “if I put all the effort into gathering this data, I’m not going to give it to others to analyze for free.”

Nonetheless, governments and research funding agencies are increasingly requiring the scientists they fund to share their data as a condition of that funding. In parallel, work is being done to think more comprehensively about how to collect, manage, and share data in appropriate, responsible ways. These open science efforts overlap considerably with the push for reproducibility that we’ve already discussed; if there is a difference, I would describe it as a difference between quality and quantity.

Reproducibility largely argues that:

  1. testable science leads to better science,
  2. sharing scientific materials allows for testing science, and
  3. sharing scientific materials therefore makes for better science.

Open science largely argues that:

  1. more science leads to better science,
  2. sharing data allows for more science, and
  3. sharing data therefore makes for better science.

In general terms, I think that both of these arguments are sound and that locking down scientific data is morally questionable—however, we’ll see later that there are privacy issues closely connected with data sharing.

13.3 Sharing Data

In this chapter, I will use the term Open Data to refer to the practice of proactively sharing the data associated with a study (often along with materials, analysis code, and other important elements of a study, in the spirit of reproducibility) on a public repository such as GitHub, the Open Science Framework, or others. Research data include all data that are collected or generated during scientific (or other) research. Although data science is chiefly concerned with quantitative data (that is, data in the form of numbers), many scientists will also share non-numeric qualitative data that they have collected, including texts, visual stimuli, interview transcripts, log data, diaries, and any other materials that were used or produced in a study.

I mentioned above that there’s a great deal of overlap between open science efforts and reproducibility efforts, so there shouldn’t be any surprise that Open Data supports both! In line with the statement that scholarly work should be verifiable, Open Data is the philosophy that authors should, as much as is practical, make all the relevant data publicly available so they can be inspected, verified, reused, and further built on. That is, you might use someone else’s research data in order to check their work—or you might be researching something similar, so the fact that they’ve already collected relevant data is going to save you time and effort. There are strong scientific ideals underpinning both: The U.S. National Research Council (1997) stated that “Freedom of inquiry, the full and open availability of scientific data on an international basis, and the open publication of results are cornerstones of basic research” (p. 2).

13.4 Sharing Data and Privacy

Sharing data is not a dichotomous issue: That is, it’s not a simple decision of “yes, I’m sharing my data” or “no, I’m not sharing my data.” Rather, researchers have to decide (1) what data to share, (2) with whom, and (3) when. Let’s look at some examples from the discipline of education, which is my professional background.

In the U.S., the National Center for Education Statistics (NCES) makes a wide variety of data sets publicly available for policymakers and researchers. This is good news! For several years, I’ve done research on how teachers use social media, and a few years ago, some other researchers and I put our heads together to ask whether teachers’ Twitter use in different U.S. states might have a connection with what education looks like in those states. We had the Twitter data, but it would have been really difficult for the three of us (especially with our barely existent budget) to come up with meaningful measures of what education policy looked like across different states. Luckily, there were gobs of data just waiting for us to download and consider on the NCES website!

However, it’s not hard to imagine educational data that shouldn’t be available to any researcher who wants it. Examples related to the National Assessment of Educational Progress (a regular assessment of what students in the United States know) demonstrate the need to balance privacy and openness.

For example, school-level data (which contain no personally identifiable information) are made easily accessible through public data sets. In contrast, student-level data are maintained with far stricter guidelines for accessibility and use—as they should be. However, statistical summaries, documentation of data collection methods, and codebooks of data are made easily and widely available.

13.5 How to Share Data

The most common approach to open data is simply making data “available on request.” A scientist mentions somewhere that they’ve collected some interesting data and asks people to get in touch with them if they want access to it.

However, this approach simply doesn’t work. In one study, researchers requested data from 140 authors with articles published in journals that required authors to share data on request, but only 25.7% of these data sets were actually shared (Wicherts, Borsboom, Kats, & Molenaar, 2006). What is even more worrying is that reluctance to share data is associated with weaker evidence and a higher prevalence of apparent errors in the reporting of statistical results (Wicherts, Bakker, & Molenaar, 2011). To increase the transparency of research, data should be shared proactively on a publicly accessible repository. GitHub can be—and is—used for this, but other services are better equipped for it.

Long-term and discoverable storage is advisable for data that are unique (i.e., can be produced just once) or that involved a considerable amount of resources to generate. These features are often true of qualitative and quantitative data alike. The value of shared data depends on the quality of its documentation. Simply placing a data set online somewhere, without any explanation of its content, structure, and origin, is of limited value. A critical aspect of Open Data is ensuring that research data are findable (in a certified repository) as well as clearly documented by metadata and process documents (this is one reason GitHub, which doesn’t explicitly support these features, isn’t as well suited to this sort of sharing).
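To make “documentation” concrete, here’s a minimal sketch of a machine-readable codebook that could be shipped alongside a data set. Every file name, variable, and label here is invented for illustration; many repositories also have their own metadata schemas that serve the same purpose.

```python
import json

# A minimal, machine-readable codebook for a hypothetical survey data set.
# Every file name, variable, and label below is illustrative, not from a real study.
codebook = {
    "dataset": "teacher_survey_2020.csv",
    "description": "Survey of K-12 teachers' social media use (hypothetical).",
    "collected": "2020-03-01 to 2020-05-31",
    "variables": {
        "id": {"type": "string", "label": "Anonymized participant identifier"},
        "years_teaching": {"type": "integer", "label": "Years of teaching experience"},
        "uses_twitter": {"type": "boolean", "label": "Reported using Twitter professionally"},
        "state": {"type": "string", "label": "U.S. state, two-letter code"},
    },
}

# Ship this file alongside the data so reusers can interpret every column.
with open("codebook.json", "w") as f:
    json.dump(codebook, f, indent=2)
```

Even a small file like this answers the questions a would-be reuser asks first: what is in each column, how was it measured, and when was it collected.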

When research data cannot be shared at all, due to privacy issues or legal requirements, it is typically still possible to at least share metadata: information about the scope, structure, and content of the data set. For example, even though I’m generally excited about the idea of open science, I’m reluctant to share my Twitter data. In the study I described above, we didn’t ask teachers for permission to collect their tweets before we studied them; that seems to me like a (perhaps justifiable) violation of privacy, so it’s important to me to protect these teachers’ privacy in other ways. So, I would never share the full set of data that I have, but when my co-authors and I published on this, we shared a summary of the data that leaves out any individual-level considerations.
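As a toy illustration of that kind of summary, here’s a minimal sketch of deriving a shareable aggregate from individual-level records. The data frame, column names, and values are all invented; this is not the actual data or analysis from the study described above.

```python
import pandas as pd

# Hypothetical individual-level records that could NOT be shared publicly.
tweets = pd.DataFrame({
    "user_id": ["a1", "a2", "a3", "a4", "a5"],
    "state":   ["UT", "UT", "NY", "NY", "NY"],
    "n_tweets": [120, 45, 300, 12, 88],
})

# A state-level summary that omits individual-level information and is
# therefore a better candidate for public sharing.
summary = (
    tweets.groupby("state")
    .agg(n_users=("user_id", "nunique"), median_tweets=("n_tweets", "median"))
    .reset_index()
)
print(summary)
```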

In addition, researchers can share “process documents,” which outline how, when, and where the data were collected and processed. In both cases (metadata and process documentation), transparency can be increased even when the research data themselves are not shared. New data-sharing repositories like Dataverse allow institutions or individual researchers to create data projects and share different elements of that project under different requirements so that some elements are accessible publicly and others require data use agreements (King, 2007).

13.6 Benefits of Sharing Data

Open Data can improve the scientific process both during and after publication, in keeping with the connections to reproducibility and open science that we’ve made earlier.

Without access to the data underlying a paper under review, peer reviewers are substantially hindered in their ability to assess the evidential value of its claims. Allowing reviewers to audit statistical calculations can reduce the number of calculation errors, unsupported claims, and erroneous descriptive statistics that later appear in the published literature (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016; Van der Zee, Anaya, & Brown, 2017).

Open Data also enables secondary analyses, which require direct access to the data and cannot be performed using only the summary statistics typically presented in a paper; this, in turn, increases the value of gathering data in the first place. Data collection can be a lengthy and costly process, which makes it economically wasteful not to share this valuable commodity. Open Data is a research accelerator that can speed up the process of establishing important new findings (Pisani et al., 2016; Woelfle, Olliaro, & Todd, 2011).

13.7 Downsides of Sharing Data

Perhaps the strongest objection to Open Data sharing concerns issues of privacy protection. Safeguarding the identity and other valuable information of research participants is of utmost importance and takes priority over data sharing, but these are not mutually exclusive endeavors. Sharing data is not a binary decision, and there is a growing body of research around differential privacy that suggests a variegated approach to data sharing (Daries et al., 2014; Gaboardi et al., 2016; Wood et al., 2014).
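To give a flavor of what differential privacy means in practice, here’s a minimal sketch of the Laplace mechanism, one of its most basic building blocks: a true count is released with calibrated random noise so that no single individual’s presence or absence in the data can be confidently inferred. The count and epsilon values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one person
    changes the count by at most 1, so noise with scale 1/epsilon suffices.
    """
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Smaller epsilon = stronger privacy = noisier released value.
true_count = 412  # hypothetical: number of students who passed a course
for eps in (0.1, 1.0, 10.0):
    print(eps, round(dp_count(true_count, eps), 1))
```

The trade-off the literature describes is visible right in the loop: stronger privacy guarantees (smaller epsilon) make the shared statistic less precise, which is exactly why sharing is a graduated decision rather than a binary one.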

Even when a data set cannot be shared publicly in its entirety, it may be possible to share de-identified data or, at a minimum, information about the shape and structure of the data (i.e., metadata). Daries et al. (2014) provided one case study of a de-identified data set from MOOC learners, which was too “blurred” to accurately estimate distributions or correlations in the population but could provide useful insights about the structure of the data set and opportunities for hypothesis generation. However, it should also be noted that it is sometimes easier to “re-identify” data than people think—especially in the world of big data:

According to a paper … in Scientific Reports … researchers at MIT and the Université Catholique de Louvain, in Belgium, analyzed data on 1.5 million cellphone users in a small European country over a span of 15 months and found that just four points of reference, with fairly low spatial and temporal resolution, was enough to uniquely identify 95 percent of them.

In other words, to extract the complete location information for a single person from an “anonymized” data set of more than a million people, all you would need to do is place him or her within a couple of hundred yards of a cellphone transmitter, sometime over the course of an hour, four times in one year. A few Twitter posts would probably provide all the information you needed, if they contained specific information about the person’s whereabouts. (Hardesty, 2013)
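The “blurring” that Daries et al. (2014) describe rests on generalization: coarsening quasi-identifiers (like age and location) until every combination of them is shared by at least k records. Here’s a toy sketch of that idea; the records and column names are entirely invented, and this is not the actual procedure from that study.

```python
import pandas as pd

# Hypothetical learner records; age and ZIP are quasi-identifiers that,
# in combination, could single someone out even without names.
df = pd.DataFrame({
    "age": [21, 22, 23, 34, 35, 36],
    "zip": ["02139", "02139", "02142", "02142", "02139", "02139"],
    "grade": [0.91, 0.85, 0.72, 0.64, 0.88, 0.95],
})

# Generalize ("blur") the quasi-identifiers: bucket ages, truncate ZIPs.
df["age"] = (df["age"] // 10) * 10        # 21 -> 20, 34 -> 30, ...
df["zip"] = df["zip"].str[:3] + "**"      # 02139 -> 021**

# Check k-anonymity: every (age, zip) combination should appear >= k times.
k = 2
group_sizes = df.groupby(["age", "zip"]).size()
print(group_sizes)
print("2-anonymous:", bool((group_sizes >= k).all()))
```

Notice what the cellphone example above implies for this sketch: the more quasi-identifying columns a data set has, the more aggressively they must be blurred, which is why the de-identified MOOC data were no longer suitable for estimating distributions or correlations.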

For textual data, such as transcripts from interviews and other forms of qualitative research, there are tools that allow researchers to quickly de-identify large bodies of text, but textual data can also be deeply personal, so I’d personally have misgivings about sharing that data with others.
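As a rough sketch of what the simplest such tools do, here’s a pattern-based redaction pass over an invented transcript. Real de-identification tools are considerably more sophisticated, and a naive pass like this would miss many identifiers, which is part of why misgivings are warranted.

```python
import re

# An invented interview excerpt; no real people or institutions.
transcript = (
    "Interviewer: Can you describe your school?\n"
    "Participant: Sure. I'm Jane Doe, I teach at Lincoln Elementary, "
    "and you can reach me at jane.doe@example.org or 555-867-5309."
)

# Each pattern removes one class of identifier; names and institutions
# usually have to be listed explicitly or found with NLP tools.
redactions = [
    (r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]"),      # email addresses
    (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]"),        # U.S.-style phone numbers
    (r"\bJane Doe\b", "[NAME]"),                  # known participant names
    (r"\bLincoln Elementary\b", "[SCHOOL]"),      # known institution names
]

for pattern, replacement in redactions:
    transcript = re.sub(pattern, replacement, transcript)

print(transcript)
```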

Even when a whole data set cannot be shared, subsets might be shareable to provide more insight into coding techniques or other analytic approaches. Privacy concerns should absolutely shape decisions about what researchers choose to share, and researchers should pay particular attention to implications for informed consent and data collection practices, but research into differential privacy shows that openness and privacy can be balanced in thoughtful ways.

Another concern with data sharing is “scooping” and problems with how research production is incentivized. Researchers are often under a lot of pressure to produce a lot of research, and research is nearly always judged on whether it contributes something new to human understanding. So, sharing your own data could potentially allow someone else to do an analysis that you were hoping to publish, and I don’t blame people for worrying about that. Furthermore, data scientists often work in the corporate world, and companies are probably even more protective of their data, since they don’t want competitors to see it.

However, it’s also possible to take this concern too far. For example, in an editorial in the New England Journal of Medicine, Longo and Drazen (2016) stated: “There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as ‘research parasites’” (para. 3). Specifically, these authors were concerned that scholars might “parasitically” use data gathered by others; they suggested that data should instead be shared “symbiotically,” for example by demanding that the original researchers be given co-author status on all papers that use data they gathered. This editorial, and especially the framing of scholars as “parasites” for reusing valuable data, sparked considerable discussion. In fact, this discussion resulted in the “Research Parasite Award,” which reclaimed the derisive label in the service of genuinely celebrating rigorous secondary analysis.

13.8 Incentives for Sharing Data

These concerns—and their connection to how scientific research works—demonstrate the need to provide incentives for sharing data. The U.S. National Research Council (1997) has argued: “The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research” (p. 10).

There are various ways to make better use of the data that we have already generated, such as giving data sets persistent identifiers so they can be properly cited by whoever reuses them. This way, data collectors continue to benefit from sharing their data, as they will be repeatedly cited and have proof of how their data have been fundamental to others’ research. There is evidence that Open Data increases citation rates (Piwowar, Day, & Fridsma, 2007), and other institutional actors could play a role in elevating the status of Open Data.

An increasing number of academic journals have started to award special badges, displayed on the paper itself, that indicate it is accompanied by publicly available data in an Open Access repository. Journal policies can also have a strong positive effect on the prevalence of Open Data (Nuijten et al., 2017). Scholarly societies and research foundations could create new awards for the contribution of valuable data sets in education research. Perhaps most importantly, promotion and tenure committees in universities should recognize the value of contributing data sets to the public good and ensure that young scholars can be recognized for those contributions.

13.9 Conclusion

Open data is an important component of the data science community, and it’s also important for supporting reproducibility and advancing science. Like other components of reproducibility, the value of open data emerges from a particular set of assumptions about how science works, and other scientific perspectives can help raise concerns about privacy and other issues. Nonetheless, so long as it is done responsibly, sharing data is a good and important thing.

13.10 References

Daries, J. P., Reich, J., Waldo, J., Young, E. M., Whittinghill, J., Ho, A. D., . . . Chuang, I. (2014). Privacy, anonymity, and big data in the social sciences. Communications of the ACM, 57(9), 56–63. https://doi.org/10.1145/2643132

Gaboardi, M., Honaker, J., King, G., Nissim, K., Ullman, J., & Vadhan, S. (2016). PSI: A private data sharing interface. Retrieved from https://arxiv.org/abs/1609.04340

Hardesty, L. (2013). How hard is it to ‘de-anonymize’ cellphone data? MIT News. https://news.mit.edu/2013/how-hard-it-de-anonymize-cellphone-data

King, G. (2007). An introduction to the Dataverse Network as an infrastructure for data sharing. Sociological Methods & Research, 36(2), 173–199. https://doi.org/10.1177/0049124107306660

Longo, D. L., & Drazen, J. M. (2016). Data sharing. New England Journal of Medicine, 374(3). https://doi.org/10.1056/NEJMe1516564

National Research Council. (1997). Bits of power: Issues in global access to scientific data. Washington, DC: National Academy Press.

Nuijten, M. B., Borghuis, J., Veldkamp, C. L. S., Alvarez, L. D., van Assen, M. A. L. M., & Wicherts, J. M. (2017, July 13). Journal data sharing policies and statistical reporting inconsistencies in psychology. Retrieved from https://osf.io/preprints/psyarxiv/sgbta

Nuijten, M. B., Hartgerink, C. H., van Assen, M. A., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226. https://doi.org/10.3758/s13428-015-0664-2

Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PloS One, 2(3), e308. https://doi.org/10.1371/journal.pone.0000308

Pisani, E., Aaby, P., Breugelmans, J. G., Carr, D., Groves, T., Helinski, M., . . . Mboup, S. (2016). Beyond open data: Realising the health benefits of sharing data. BMJ, 355. https://doi.org/10.1136/bmj.i5295

Van der Zee, T., Anaya, J., & Brown, N. J. L. (2017). Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition, 3(54). https://doi.org/10.1186/s40795-017-0167-x

Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PloS One, 6(11), e26828. https://doi.org/10.1371/journal.pone.0026828

Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61(7), 726–728. https://doi.org/10.1037/0003-066X.61.7.726

Woelfle, M., Olliaro, P., & Todd, M. H. (2011). Open science is a research accelerator. Nature Chemistry, 3(10), 745–748. https://doi.org/10.1038/nchem.1149

Wood, A., O’Brien, D., Altman, M., Karr, A., Gasser, U., Bar-Sinai, M., . . . Wojcik, M. J. (2014). Integrating approaches to privacy across the research lifecycle: Long-term longitudinal studies. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2469848