14 M5U: Numbers Don’t Speak for Themselves

This chapter draws on material from Chapter 6, “The Numbers Don’t Speak for Themselves,” of Data Feminism by Catherine D’Ignazio and Lauren Klein, licensed under CC BY 4.0.

Changes to the source material include light editing, adding new material, deleting original material, rearranging material, changing the citation style, adding links, replacing images, changing original authors’ voice to third person, and adding first-person language from current author.

The resulting content is licensed under CC BY 4.0.

14.1 Introduction

In April 2014, 276 young women were kidnapped from their high school in the town of Chibok in northern Nigeria. Boko Haram, a militant terrorist group, claimed responsibility for the attacks. The press coverage, both in Nigeria and around the world, was fast and furious. SaharaReporters.com challenged the government’s ability to keep its students safe. CNN covered parents’ anguish. The Japan Times connected the kidnappings to the increasing unrest in Nigeria’s northern states. And the BBC told the story of a girl who had managed to evade the kidnappers. Several weeks after this initial reporting, the popular blog FiveThirtyEight published its own data-driven story about the event, titled “Kidnapping of Girls in Nigeria Is Part of a Worsening Problem” (Chalabi, 2014). The story reported skyrocketing rates of kidnappings. It asserted that in 2013 alone there had been more than 3,608 kidnappings of young women. Charts and maps accompanied the story to visually make the case that abduction was at an all-time high—here’s a link to one prominent one.

Shortly thereafter, the news website had to issue an apologetic retraction because its numbers were just plain wrong. The outlet had used the Global Database of Events, Language and Tone (GDELT) as its data source. GDELT is a big data project led by computational social scientist Kalev Leetaru. It collects news reports about events around the world and parses the news reports for actors, events, and geography with the aim of providing a comprehensive set of data for researchers, governments, and civil society. GDELT tries to focus on conflict—for example, whether conflict is likely between two countries or whether unrest is sparking a civil war—by analyzing media reports.

However, as political scientist Erin Simpson pointed out to FiveThirtyEight in a widely cited Twitter thread (archived here), GDELT’s primary data source is media reports. The project is not at a stage at which its data can be used to make reliable claims about independent cases of kidnapping. The kidnapping of schoolgirls in Nigeria was a single event. There were thousands of global media stories about it. Although GDELT de-duplicated some of those stories to a single event, it still logged, erroneously, that hundreds of kidnapping events had happened that day. The FiveThirtyEight report had counted each of those GDELT pseudoevents as a separate kidnapping incident.
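To make the nature of this error concrete, here’s a minimal sketch in Python (with made-up records, not GDELT’s actual schema) of how counting media reports instead of distinct events inflates a total:

```python
# Hypothetical media reports about one underlying event.
# (Invented records for illustration; not the real GDELT data model.)
reports = [
    {"outlet": "Outlet A", "event_id": "chibok-2014-04-14"},
    {"outlet": "Outlet B", "event_id": "chibok-2014-04-14"},
    {"outlet": "Outlet C", "event_id": "chibok-2014-04-14"},
]

# Counting rows treats every report as its own incident...
report_count = len(reports)

# ...while de-duplicating on the underlying event counts it once.
event_count = len({r["event_id"] for r in reports})

print(report_count, event_count)  # 3 1
```

Multiply that gap across thousands of news stories and you get FiveThirtyEight’s error.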

The error was embarrassing for FiveThirtyEight, not to mention for the reporter, but it also helps to illustrate some of the larger problems related to data found “in the wild.” First, the hype around “big data” leads projects like GDELT to wildly overstate the completeness and accuracy of their data and algorithms. On the website and in publications, the project leads have stated that GDELT is “an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what’s happening around the world, what its context is and who’s involved, and how the world is feeling about it, every single day” (Leetaru, n.d.).

14.2 Bigger Is Not Always Better

This boasting by GDELT is typical of projects that ignore context, fetishize size, and inflate their technical and scientific capabilities. In doing so, they tap into patriarchal, cis-masculinist, totalizing fantasies of world domination as enacted through data capture and analysis. In fact, D’Ignazio and Klein (the original authors of this material) make a pointed comparison between masculinity and data science in terms of obsession with… size… that I have edited out here. You can read their comparison at the link in the attributions above—or review Gieseking (2018), which uses similar wordplay to make an important critique of big data.

In GDELT’s case, the question is whether we should take its claims of big data at face value or whether the emphasis on size is trying to trick funding organizations into giving the project massive amounts of research funding. (This trick has worked many times before.) The GDELT technical documentation does not provide any more clarity as to whether it is counting media reports (as Simpson asserts) or single events.

The database FiveThirtyEight used is called the GDELT Event Database, which certainly makes it sound like it’s counting events. The GDELT documentation states that “if an event has been seen before it will not be included again,” which also makes it sound like it’s counting events. And a 2013 research paper related to the project confirms that GDELT is indeed counting events, but only events that are unique to specific publications. So it’s counting events, but with an asterisk. Compounding the matter, the documentation offers no guidance as to what kinds of research questions are appropriate to ask the database or what the limitations might be. People like Simpson who are familiar with the area of research known as event detection may know to not believe (1) the title of the database, (2) the documentation, and (3) the marketing hype. But how would outsiders, let alone newcomers to the platform, ever know that?

Here, we’re focusing on GDELT, but the truth is that it’s not very different from any number of other data repositories out there on the web. A proliferating number of portals, observatories, and websites make it possible to download all manner of government, corporate, and scientific data. Consider application programming interfaces (APIs): programs that serve as go-betweens, allowing individual users to access data stored by digital platforms. There are APIs that make it possible to write little programs to query massive datasets (like, for instance, all of Twitter) and download them in a structured way.
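As a concrete sketch of what “querying an API” involves, here’s a minimal Python example; the endpoint URL and parameters below are hypothetical placeholders, not a real service:

```python
import requests

# Hypothetical endpoint and parameters, for illustration only;
# a real API documents its own URL, parameters, and authentication.
BASE_URL = "https://api.example.com/v1/events"
params = {"country": "NG", "year": 2013, "format": "json"}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()  # stop loudly on HTTP errors

data = response.json()  # parse the JSON payload into Python objects
print(f"Downloaded {len(data['results'])} records")  # 'results' key is assumed
```

The ease of this pattern is exactly the point: a few lines of code stand between you and a massive dataset, with no context attached.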

There are test datasets for network analysis, machine learning, social media, and image recognition. There are fun datasets, curious datasets, and newsletters that inform readers of datasets to explore for journalism or analysis: You may enjoy exploring datasets about dogs in Zurich, UFO sightings, or a list of abandoned shopping carts in Bristol rivers. My favorite is perhaps this dataset of all the murals in Brussels inspired by the proud Franco-Belgian comics tradition; in fact, this is such a perfect combination of my personal interests in French, comics, and data that I feel obligated to specify that the original authors of this material included this reference—I didn’t add it after the fact.

In our current moment, we tend to think of this unfettered access to information as an inherent good. And in many ways, it is kind of amazing that one can just google and download data on, for instance, pigeon racing, the length of guinea pig teeth, or every single person accused of witchcraft in Scotland between 1562 and 1736. Personally, I’ve benefited tremendously from this! I’m not exaggerating when I say I owe my career in great part to the ability to download truckloads and truckloads of tweets. (Though under current Twitter leadership, this ability has disappeared for me and many other researchers.)

However, it’s also easy to let this convenient access to data distract us from other important commitments in data science.

14.3 Theory

The word theory gets tossed around a lot when discussing science and scientific research, but sometimes people attribute different meanings or values to it. You may have heard someone disparage some scientific conclusion as “just a theory,” suggesting that the word has a flavor of “less than fact.” That’s not entirely wrong, but it’s easy to see how the word theory can be—and has been—abused by over-emphasizing that flavor. In response, some scientists and science educators will emphasize that a theory is actually the result of tremendous amounts of scientific inquiry, and that theories related to evolution and climate change are much more trustworthy than their detractors would have you believe. This is an appropriate and accurate response, but all of this talk about the truth values associated with theory (as compared to fact) actually sidesteps an important component of what theory actually is—and that component is key to our understanding and use of theory as it pertains to data science.

Let’s take a look at a definition from one of my old textbooks to bring our attention to the elements of theory that we care about. Maxwell (2013) suggests that we can understand theory as:

a conception or model of what is out there that you plan to study, and of what is going on with these things and why (p. 39)

That is, if a phenomenon is a thing we want to study (likely in relationship to other phenomena—that is, other things we want to study), a theory is a description of the relationship between all of those different things. For example, one of my preferred theoretical perspectives is that of literacies (as those of you who have taken my LIS 618 class will remember). In short, this theoretical perspective suggests that people in different contexts use different technologies (or the same technologies in different ways) to express and understand meaning. Thus, when I began my dissertation research and started working with data from 60ish different Twitter hashtags used by teachers in different U.S. states and Canadian provinces or territories, my literacies perspective suggested that there ought to be differences in the activity in those hashtags based on contextual differences. By trying to describe the differences between hashtags, I could get a glimpse at how teachers in different contexts valued different features of Twitter.

This example is a good start, but you might notice that this study was pretty descriptive and exploratory—it wasn’t interested in determining a clear cause-effect relationship. In positivist sciences, like data science, a theory isn’t just a relationship between phenomena; it’s a causal relationship between phenomena. More X causes more Y. More X causes more Y, but only if Z remains steady. More X causes more Y, and a combination of more Y and more Z causes more A. Remember that positivist sciences are concerned largely with predictions—theories give us the means by which to make and test predictions. In the context of research, Maxwell (2013) writes that

The function of this theory is to inform the rest of your design—to help you to assess and refine your goals, develop realistic and relevant research questions, select appropriate methods, and identify potential validity threats to your conclusions. It also helps you justify your research… (pp. 39-40)

In short, if a given theory suggests that more X causes more Y, then as a researcher, your job is to figure out how to measure X, how to measure Y, and what statistical tests can help you determine if more X does indeed cause more Y in the context that you’re doing research in. Your results will then add to our collective understanding of that theory. After all, it is true that a theory isn’t quite the same as fact—especially in the social sciences, theories are always being poked, prodded, tested, challenged, and refined. Theories are necessarily abstract and necessarily make particular assumptions—it’s worth asking (always in good faith) if something important has been abstracted out, or if a given assumption actually holds true.
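To make that concrete, here’s a minimal sketch in Python of testing a theory’s prediction against simulated measurements (the effect size and noise level are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated measurements: suppose a theory predicts that more X causes more Y.
x = rng.normal(size=200)                       # our measurements of X
y = 0.5 * x + rng.normal(scale=1.0, size=200)  # measurements of Y (invented effect)

# A simple statistical test of the predicted association.
result = stats.linregress(x, y)
print(f"slope = {result.slope:.2f}, p = {result.pvalue:.4f}")

# A significant positive slope supports the theory's prediction, but the
# causal claim rests on theory and study design, not on this test alone.
```

Note how the theory did the real work here: it told us which variables to measure and what relationship to look for before we ever touched the data.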

That said, if there’s been a lot of research about a particular theory, then professionals may find it trustworthy enough to make practical decisions based on that theory. If researchers have compellingly demonstrated that more X causes more Y, and if more Y is desired in a particular workplace, then the obvious solution is more X.

14.4 The Importance of Theory

To oversimplify things a bit (okay, a lot), the traditional (positivist) research process involves two main phases. In the first phase, researchers collect some data, explore that data, and then use that data to generate a theory. In the second phase, other researchers ask a question, identify a theory that ought to predict the answer to that question, collect data to test that prediction, and publish their results, offering new insights into the theory. In the traditional research process, data is assumed to be difficult to come by, so—as we read earlier—researchers focus their efforts on collecting only that data that the theory directs them to.

However, as we’ve discussed in previous weeks, we live in a world in which data is not scarce. In a 2008 Wired article, “The End of Theory,” editor Chris Anderson made the now-infamous claim that “the numbers speak for themselves” (Anderson, 2008). His main assertion was that the advent of big data would soon allow data scientists to conduct analyses at the scale of the entire human population, without needing to restrict their analysis to a smaller sample. To understand his claim, you need to understand one of the basic premises of statistics, which we’ll revisit (though not uncritically) later in the semester.

Statistical inference is based on the idea of sampling: that you can infer things about a large-scale phenomenon by studying a random and/or representative sample of people or things making up that phenomenon and then mapping those findings back on the phenomenon as a whole. Say that you want to know who all of the voters in the US will vote for in a coming presidential election. You couldn’t contact all of them, of course, but you could call three thousand of them on the phone and then use those results to predict how the rest of the people would likely vote. There would also need to be some statistical modeling and theory involved, because how do you know that those three thousand people are an accurate representation of the whole population? This is where Anderson made his intervention: at the point at which we have data collected on the entire population, we no longer need modeling or any other “theory” to first test and then prove. We can look directly at the data themselves.
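Here’s a toy simulation of that logic in Python; the population size and the “true” preference rate are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented population: 10 million voters, 52% of whom favor candidate A.
population = rng.random(10_000_000) < 0.52

# "Call" a random sample of 3,000 voters.
sample = rng.choice(population, size=3_000, replace=False)
estimate = sample.mean()

# Standard error of a proportion; roughly 2 SEs gives a ~95% margin of error.
se = (estimate * (1 - estimate) / len(sample)) ** 0.5
print(f"sample estimate: {estimate:.3f} +/- {1.96 * se:.3f}")
```

The sample of 3,000 recovers the population rate to within a couple of percentage points, and the statistical theory of sampling is what licenses that inference.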

Now, you can’t write an article claiming that the basic structure of scientific inquiry is obsolete and not expect some pushback. Anderson wrote the piece to be provocative, and sure enough, it prompted numerous responses and debates, including those that challenge the idea that this argument is a “new” way of thinking in the first place (for example, in the early seventeenth century, Francis Bacon argued for a form of inductive reasoning, in which the scientist gathers data, analyzes them, and only thereafter forms a hypothesis; see Mazzocchi, 2015).

Lack of novelty was not the only critique, though. One of Anderson’s major examples is Google Search. Google’s search algorithms don’t need to have a hypothesis about why some websites have more incoming links—other pages that link to the site—than others; they just need a way to determine the number of links so they can use that number to determine the popularity and relevance of the site in search results. We no longer need causation, Anderson (2008) insists: “Correlation is enough.” But what happens when the number of links is also highly correlated with sexist, racist, and pornographic results?
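The raw link-counting logic really is that simple. Here’s a sketch over a made-up set of hyperlinks (Google’s actual ranking algorithms are far more elaborate, but the correlation-first spirit is the same):

```python
from collections import Counter

# Made-up hyperlink graph: (source_page, target_page) pairs.
links = [
    ("blog.example", "news.example"),
    ("forum.example", "news.example"),
    ("blog.example", "shop.example"),
    ("news.example", "shop.example"),
    ("wiki.example", "news.example"),
]

# Rank pages by incoming links, with no hypothesis about *why* pages
# get linked to; this is exactly Anderson's point, and the problem.
incoming = Counter(target for _, target in links)
for page, count in incoming.most_common():
    print(page, count)
```

Nothing in this code asks where the links come from or what values they encode.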

The influence of racism, sexism, and colonialism is precisely what we see described in Algorithms of Oppression, information studies scholar Safiya Umoja Noble’s study of the harmful stereotypes about Black and Latinx women perpetuated by search algorithms such as Google’s. Noble demonstrates that Google Search results do not simply correlate with our racist, sexist, and colonialist society; rather, that society causes the racist and sexist results. More than that, Google Search reinforces these oppressive views by ranking results according to how many other sites link to them. The rank order, in turn, encourages users to continue to click on those same sites.

Here, correlation without context is clearly not enough because it recirculates racism and sexism and perpetuates inequality. In fact, not only does correlation fail to capture the whole picture, but we can actually use theory to account for the racism, sexism, and colonialism that are associated with Google results! A theoretical perspective that sees technological platforms (like Google) as having specific values (rather than being neutral tools; see van Dijck, 2013) gives us the conceptual tools we need to ask ourselves what the Google software values (say, popularity in the form of hyperlinks) and what it doesn’t actively value (say, anti-racism and feminism). Taking the data at face value—like Anderson would have us do—misses a huge part of what’s going on here.

14.5 Research Isn’t Just Empirical

Let’s continue along these lines. One of the things that sets scientific research apart is its empiricism. That is, research is empirical—it’s based on data. However, it’s easy to forget that research isn’t just empirical. Scientific research represents a particular branch of philosophy, so to be done properly, we need to have our philosophical caps on, not just our data ones. Berry (2011) argues that the sources of big data we now have access to:

provide destabilising amounts of knowledge and information that lack the regulating force of philosophy (p. 8)

When we use theory as a guiding force—instead of just relying on data—this helps us consider research from a philosophical point of view. It forces us to state our assumptions out loud and gives us additional means of evaluating research. To give a specific example, let’s consider a Sheriff’s Office in the Tampa, Florida area that received criticism for collecting student information from schools, matching it up with other data, and using data science techniques to predict which students “could ‘fall into a life of crime’” (Bedi & McGrory, 2020).

I admittedly don’t know all the details about what this Sheriff’s Office is doing, and I have a number of objections to this project based on what I do know, but let’s use this story as a starting point for a hypothetical situation. You’ll notice that some of the details I include in this hypothetical example are similar to those I raised when discussing a hypothetical example of an algorithm that makes sentencing recommendations for criminals. This isn’t to be redundant but rather to emphasize how widespread these issues are.

More data is generally more helpful for making predictions, so if I had the job of predicting future criminals (and I didn’t immediately resign from the job), I would use all the school, county, and other data I could find to start making these predictions. One of the first things I’d do is look at current criminals, collect all the data I could about them, and use that data to determine what to look for in students that might serve as a red flag.

Now, let’s imagine that people of color are disproportionately represented among the criminal population of the county I’m in. This could be due to a number of factors, including a relationship between race and socioeconomic status (that is, race predicts socioeconomic status because of centuries of White people enriching themselves at the expense of people of color, and socioeconomic status predicts criminal activity). Alternatively, the local police could be more likely to arrest a person of color for something that they would give a White person a pass for. In short, it would be irresponsible and disgustingly racist to argue that there is something inherent to race that makes a person more likely to be a criminal.

Let’s say that these additional factors aren’t accounted for in my data but that my data do say that there are proportionally more people of color in the criminal records that I have access to. As a result, in my number crunching, it’s possible that I come up with a statistically defensible argument that race predicts criminal activity. If I’m committed to theory, I’m going to have to explicitly argue to my supervisors that I’m operating off of a theoretical perspective that race causes criminal activity. At that point, I would hope that my supervisors would recognize that theory as inexcusably racist, dismiss it as a valid way of predicting future criminal activity, and send me back to the drawing board.
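To show the statistical trap at work without reproducing the offensive analysis itself, here’s an abstract simulation of omitted-variable bias in Python: a confounder Z drives both group membership X and the recorded outcome Y, so X looks “predictive” of Y even though it has no direct effect. All numbers are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5_000

# The confounder Z (think: structural disadvantage) shapes both group
# membership X and the recorded outcome Y. X has NO direct effect on Y.
z = rng.normal(size=n)
x = ((z + rng.normal(size=n)) > 0).astype(float)  # group membership, shaped by Z
y = 2.0 * z + rng.normal(size=n)                  # outcome, caused only by Z

# Naive analysis: X alone appears strongly "predictive" of Y.
r, p = stats.pearsonr(x, y)
print(f"naive correlation of X with Y: r = {r:.2f}, p = {p:.2g}")

# Controlling for Z: the apparent effect of X collapses toward zero.
resid_y = y - np.polyval(np.polyfit(z, y, 1), z)
resid_x = x - np.polyval(np.polyfit(z, x, 1), z)
r2, p2 = stats.pearsonr(resid_x, resid_y)
print(f"after controlling for Z: r = {r2:.2f}, p = {p2:.2g}")
```

Without the theoretical commitment to ask what Z might be, the naive correlation looks like an objective finding, which is precisely the danger.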

However, if I have Anderson-like faith that the data will tell me everything I need to know, that intermediary step of evaluating and justifying the theory may not be necessary. There’s no philosophical oversight of the causal and other relationships that I’m proposing, and it’s easy for terrible decisions to be rubber stamped in the name of objective data.

In fact, things get worse: With advanced data science techniques like machine learning, it’s possible to develop a tool that makes predictions but is not transparent about the criteria that it uses to make those predictions and therefore does not allow its “theory” to be evaluated. It’s entirely possible that these tools are based on racist “theory” that is not (or cannot be!) properly evaluated. This ought to worry us.

14.6 The Importance of Context

We should also consider the ways that the importance of context pushes back on Anderson’s argument. One of the central tenets of feminist thinking is that all knowledge is situated. A less academic way to put this is that context matters. When approaching any new source of knowledge, whether it be a dataset or dinner menu (or a dataset of dinner menus; see Muñoz & Rawson, 2016), it’s essential to ask questions about the social, cultural, historical, institutional, and material conditions under which that knowledge was produced, as well as about the identities of the people who created it.

Rather than seeing knowledge artifacts, like datasets, as raw input that can be simply fed into a statistical analysis or data visualization, a feminist approach insists on connecting data back to the context in which they were produced. This context allows us, as data scientists, to better understand any functional limitations of the data and any associated ethical obligations, as well as how the power and privilege that contributed to their making may be obscuring the truth.

The major issue with much of the data that can be downloaded from web portals or through APIs is that they come without context or metadata. If you are lucky, you might get a paragraph about where the data are from or a data dictionary that describes what each column in a particular spreadsheet means, but more often than not, you don’t even get that.

Let’s consider an example of data about government procurement. The data may not look very technically complicated, but you still have to figure out how the business process behind them works. How does the government run the bidding process? How does it decide who gets awarded a contract? Are all the bids published here, or just the ones that were awarded contracts? What do terms like competition, cooperation agreement, and terms of collaboration mean to the data publisher? Without answers to even some of these questions—to say nothing of the local knowledge required to understand how power is operating in this particular ecosystem—it would be difficult to even begin a data exploration or analysis project, even if the data are easily accessible and seem straightforward.

This scenario is not uncommon. Most data arrive on our computational doorstep context-free. And this lack of context becomes even more of a liability when accompanied by the kind of marketing hype we see in GDELT and similar projects. In fact, the 1980s version of these claims is what led Donna Haraway to propose the concept of situated knowledge in the first place (see Haraway, 1988).

Subsequent feminist work has drawn on the concept of situated knowledge to elaborate ideas about ethics and responsibility in relation to knowledge-making. Along this line of thinking, it becomes the responsibility of the person evaluating that knowledge, or building upon it, to ensure that its “situatedness” is taken into account. For example, information studies scholar Christine Borgman advocates for understanding data in relation to the “knowledge infrastructure” from which they originate. As Borgman (2015) defines it, a knowledge infrastructure is “an ecology of people, practices, technologies, institutions, material objects, and relationships.” In short, it is the context that makes the data possible.

There’s another reason that context is necessary for making sense of correlation, and it has to do with how racism, sexism, and other forces of oppression enter into the environments in which data are collected. The next example has to do with sexual assault and violence. If you do not want to read about these topics, you may want to wrap up your reading here.

In April 1986, Jeanne Clery, a student at Lehigh University, was sexually assaulted and murdered in her dorm room. Her parents later found out that there had been thirty-eight violent crimes at Lehigh in the prior three years, but nobody had viewed that as important data that should be made available to parents or to the public. The Clerys mounted a campaign to improve data collection and communication efforts related to crimes on college campuses, and it was successful: the Jeanne Clery Act was passed in 1990, requiring all US colleges and universities to make on-campus crime statistics available to the public.

So we have an ostensibly comprehensive national dataset about an important public topic. In 2016, three students in Catherine D’Ignazio’s data journalism class at Emerson College—Patrick Torphy, Michaela Halnon, and Jillian Meehan—downloaded the Clery Act data and began to explore it, hoping to better understand the rape culture that has become pervasive on college campuses across the United States. The term rape culture was coined by second-wave feminists in the 1970s to denote a society in which male sexual violence is normalized and pervasive, victims are blamed, and the media exacerbates the problem. Rape culture includes jokes, music, advertising, laws, words, and images that normalize sexual violence. In 2017, following the election of a US president who joked about sexual assault on the campaign trail and the exposé of Harvey Weinstein’s predatory behavior in Hollywood, high-profile women began speaking out against rape culture with the #MeToo hashtag. #MeToo, a movement started over a decade ago by activist Tarana Burke, encourages survivors to break their silence and build solidarity to end sexual violence.

These students soon became puzzled, however. Williams College, a small, wealthy liberal arts college in rural Massachusetts, seemed to have an epidemic of sexual assault, whereas Boston University (BU), a large research institution in the center of the city, seemed to have strikingly few cases relative to its size and population (not to mention that several high-profile sexual assault cases at BU had made the news in recent years). The students were suspicious of these numbers, and investigated further. After comparing the Clery Act data with anonymous campus climate surveys, consulting with experts, and interviewing survivors, they discovered, paradoxically, that the truth was closer to the reverse of the picture that the Clery Act data suggested.

For example, many of the colleges with higher reported rates of sexual assault were actually places where more institutional resources were being devoted to support for survivors. This is important context, and it clearly adds considerations that the original data did not cover. Note, though, that we still need “the regulating force of philosophy” (Berry, 2011, p. 8) to help us puzzle this out. Even with additional context, these data could be interpreted in a number of ways, and philosophy and theory are key for helping us figure out what’s going on here. For example, a bad faith actor might argue that “more support for survivors leads to more sexual assault” and conclude that universities should actually provide less support for survivors. This theory strikes me as unlikely (and irresponsible)—it doesn’t stand up to philosophical inquiry. Rather, it seems much more likely to me that providing more support for survivors creates an environment where people are more comfortable reporting sexual assault. In short, in a faint echo of the GDELT controversy, we have to understand reports of sexual assault as just that—not as actual instances of sexual assault. Context, theory, and data are all critical parts of coming to the right conclusion.

Indeed, the context and theory that we’ve considered also helps explain the colleges with lower numbers. The Clery Act requires colleges and universities to provide annual reports of sexual assault and other campus crimes, and there are stiff financial penalties for not reporting. But the numbers are self-reported, and there are also strong financial incentives for colleges not to report. No college wants to tell the government—let alone parents of prospective students—that it has a high rate of sexual assault on campus. This is compounded by the fact that survivors of sexual assault often do not want to come forward—because of social stigma, the trauma of reliving their experience, or the resulting lack of social and psychological support.

Mainstream culture has taught survivors that their experiences will not be treated with care and that they may in fact face more harm, blame, and trauma if they do come forward. In fact, my undergraduate alma mater—a conservative, religious university—has sometimes interpreted reports of sexual assault as admissions of sexual relations outside of marriage and disciplined survivors instead of supporting them. Is it any surprise that some students might hesitate to report sexual assault?

There are further power differentials reflected in the data when race and sexuality are taken into account. For example, in 2014, twenty-three students filed a complaint against Columbia University, alleging that Columbia was systematically mishandling cases of rape and sexual violence reported by LGBTQ students. Zoe Ridolfi-Starr, the lead student named in the complaint, told the Daily Beast, “We see complete lack of knowledge about the specific dynamics of sexual violence in the queer community, even from people who really should be trained in those issues” (Golden, 2014).

14.7 Conclusion

Data alone clearly do not show the whole picture, and even context without theory can be misleading. We need all three, and we must not fall into the trap of believing that numbers don’t speak for themselves.

14.8 References

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired. https://www.wired.com/2008/06/pb-theory/

Bedi, N., & McGrory, K. (2020, November 19). Pasco’s sheriff uses grades and abuse histories to label schoolchildren potential criminals. The kids and their parents don’t know. Tampa Bay Times. https://projects.tampabay.com/projects/2020/investigations/police-pasco-sheriff-targeted/school-data/

Berry, D. M. (2011). The computational turn: Thinking about the digital humanities. Culture Machine, 12.

Borgman, C. L. (2015). Big data, little data, no data: Scholarship in the networked world. MIT Press.

Chalabi, M. (2014, May 8). Kidnapping of girls in Nigeria is part of a worsening problem. FiveThirtyEight. https://fivethirtyeight.com/features/nigeria-kidnapping/

Gieseking, J. J. (2018). Size matters to lesbians, too: Queer feminist interventions into the scale of big data. Professional Geographer, 70(1), 150-156.

Golden, A. (2014, April 30). Is Columbia University mishandling LGBT rape cases? Daily Beast. https://www.thedailybeast.com/is-columbia-university-mishandling-lgbt-rape-cases?ref=scroll

Haraway, D. (1988). Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies, 14(3), 575-599.

Leetaru, K. (n.d.). The GDELT Project. GDELT. https://www.gdeltproject.org/

Maxwell, J. A. (2013). Qualitative research design: An interactive approach (3rd edition). SAGE Publications, Inc.

Mazzocchi, F. (2015). Could big data be the end of theory in science? EMBO Reports, 16(10), 1250-1255.

Muñoz, T., & Rawson, K. (2016). Data dictionary. Curating Menus. http://curatingmenus.org/data_dictionary/

van Dijck, J. (2013). The culture of connectivity: A critical history of social media. Oxford University Press.