40 M13U: The Dangers of False Positives

This chapter draws on material from articles published by the Electronic Frontier Foundation (EFF).

Changes to the source material include removing original material, adding new material, combining sources, and editing the original material for a different audience.

The resulting content is licensed under CC BY 4.0.

40.1 Introduction

Tools like confidence intervals and hypothesis testing are critical for work involving inferential statistics. They acknowledge that there’s always ambiguity in the data but give us some mathematical tools for making the most of the data that we do have. However, it’s very important that we don’t turn “making the most” into absolute confidence. Confidence intervals help us narrow in on an expected statistical range with which to make judgments, but as we discussed last week, they rely on assumptions about what is “normal” that may not hold up. Likewise, hypothesis testing helps us quantify the possibility of being wrong about new assumptions that we want to adopt; however, its very existence also acknowledges the possibility of wrongly adopting those assumptions.

Later this week, you’ll read about Type I errors: when we reject a null hypothesis (a term we’ll read more about shortly) even though that hypothesis is actually true. Another name for a Type I error is a false positive. For example, if I take a COVID-19 rapid test, the null hypothesis is that I don’t have COVID-19. While COVID-19 testing is important, generally trustworthy, and really helpful, it isn’t perfect. There’s always the possibility that the test will reject the null hypothesis (that I don’t have COVID-19) and lead me to believe that I’m sick when in reality I’m perfectly healthy. Of course, in the case of COVID-19, what’s more dangerous is a false negative—when a test doesn’t find evidence to reject the null hypothesis but the test-taker is, in fact, COVID-positive.
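
To make this concrete, here’s a minimal simulation sketch of false positives in testing. The specific numbers below (the infection rate, sensitivity, and false-positive rate) are made up purely for illustration; they are not real COVID-19 test statistics.

```python
import random

random.seed(42)

# Illustrative, made-up parameters -- NOT real COVID-19 test statistics.
prevalence = 0.02           # fraction of people who actually have the disease
sensitivity = 0.95          # P(test positive | sick)
false_positive_rate = 0.05  # P(test positive | healthy): the Type I error rate

n_people = 100_000
true_positives = 0
false_positives = 0

for _ in range(n_people):
    sick = random.random() < prevalence
    if sick and random.random() < sensitivity:
        true_positives += 1
    elif not sick and random.random() < false_positive_rate:
        false_positives += 1

total = true_positives + false_positives
print(f"{total} positive tests, of which {false_positives} "
      f"({false_positives / total:.0%}) are false positives")
```

With these made-up numbers, most positive results turn out to be false positives: healthy people vastly outnumber sick people, so even a small false-positive rate generates more false alarms than true detections. Keep that base-rate effect in mind as you read the rest of this chapter.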

False positives can crop up in many applications of data science, and they also serve as a helpful metaphor for other ways that we can read too much into our data. In this chapter, we’ll read about ways that educators can read too much into educational data (including data provided by platforms like Canvas!) and come to wrong conclusions. This is not a literal case of false positives, but the story remains an important reminder that we need to be humble and cautious about trusting data!

40.2 The Dartmouth Cheating Scandal

Like many schools, Dartmouth College has increasingly turned to technology to monitor students taking exams at home—this was especially true during the worst of the COVID-19 pandemic. While many universities have used proctoring tools that purport to help educators prevent cheating, Dartmouth’s Geisel School of Medicine went dangerously further in 2021. Apparently working under an assumption of guilt, the university launched a dragnet investigation of complicated system logs, searching for data that might reveal student misconduct, without a clear understanding of how those logs can be littered with false positives. Worse still, those attempting to assert their rights have been met with a university administration willing to trust opaque investigations of inconclusive data sets more than their own students.

The Boston Globe explained that the medical school administration’s attempts to detect supposed cheating became a flashpoint on campus, exemplifying a worrying trend of schools prioritizing misleading data over the word of their students. The misguided dragnet investigation has cast a shadow over the career aspirations of over twenty medical students.

40.3 What Was Wrong With Dartmouth’s Investigation

In March 2021, Dartmouth’s Committee on Student Performance and Conduct (CSPC) accused several students of accessing restricted materials online during exams. These accusations were based on a flawed review of an entire year’s worth of the students’ log data from Canvas, the online learning platform that contains class lectures and information. This broad search was instigated by a single incident of confirmed misconduct, according to a contentious town hall between administrators and students (the Electronic Frontier Foundation—or EFF—re-uploaded this town hall after the original recording was put behind a Dartmouth login screen). These logs show traffic between students’ devices and specific files on Canvas, some of which contain class materials, such as lecture slides. At first glance, the logs showing that a student’s device connected to class files would appear incriminating: timestamps indicate the files were retrieved while students were taking exams.

But after the EFF reviewed the logs sent to it by a student advocate, it became clear that there was no way to determine whether this traffic happened intentionally (as a result of student action) or automatically (as background requests from student devices that were logged into Canvas but not in use). In other words, rather than the files being deliberately accessed during exams, the logs could easily have been generated by the automated syncing of course material to devices logged into Canvas but not used during an exam. It was simply impossible to know from the logs alone whether a student intentionally accessed any of the files, or whether the pings existed due to automatic refresh processes that are commonplace in most websites and online services. After all, most of us don’t log out of every app, service, or webpage on our smartphones when we’re not using them.

Much like a cell phone pinging a tower, the logs showed files being pinged in short time periods, sometimes at the exact second that students were entering information into the exam, suggesting a non-deliberate process. The logs also revealed that the files accessed were largely irrelevant to the tests in question, again indicating an automated, random process. A UCLA statistician wrote a letter explaining that even an automated process can result in multiple false-positive outcomes. Canvas’s own documentation explicitly states that the data in these logs “is meant to be used for rollups and analysis in the aggregate, not in isolation for auditing or other high-stakes analysis involving examining single users or small samples.” Given the technical realities of how these background refreshes take place, the log data alone should be nowhere near sufficient to convict a student of academic dishonesty.
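
We can sketch the statistician’s point with a small Monte Carlo simulation. Every number below (class size, exam length, background request rate) is invented for illustration; we don’t know the actual rate of Canvas’s background traffic.

```python
import random

random.seed(0)

# Invented, illustrative numbers -- not Dartmouth's actual data.
n_students = 100     # innocent students with Canvas open in a background tab
pings_per_day = 30   # automatic background requests per device per day
exam_start = 9.0     # exam runs from hour 9 to hour 11 of a 24-hour day
exam_end = 11.0

flagged = 0
for _ in range(n_students):
    # Each background ping lands at a uniformly random time of day.
    ping_times = [random.uniform(0, 24) for _ in range(pings_per_day)]
    if any(exam_start <= t <= exam_end for t in ping_times):
        flagged += 1

print(f"{flagged} of {n_students} innocent students show "
      "at least one ping during the exam window")
```

Under these assumptions, roughly nine out of ten innocent students produce at least one “suspicious” log entry during the two-hour exam, simply because their devices stayed logged in. A dragnet search over an entire year of logs only makes this worse: the more data you comb through, the more coincidences you will find.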

Along with the Foundation for Individual Rights in Education (FIRE), the EFF sent a letter to the Dean of the Medical School on March 30th, explaining how these background connections work and pointing out that the university had likely turned random correlations into accusations of misconduct. The Dean’s reply was that the cases were being reviewed fairly.

The EFF disagreed, arguing that the administration had fallen victim to confirmation bias, turning specious evidence into accusations of cheating. The school admitted in some cases that the log data appeared to have been created automatically, acquitting some students who pushed back. However, other students were sanctioned, apparently based entirely on this spurious interpretation of the log data. Many others waited anxiously to hear whether they would be convicted so they could begin the appeal process, potentially with legal counsel.

These convictions carry heavy consequences, leaving permanent marks on students’ transcripts that could make it harder for them to enter residencies and complete their medical training. At this level of education, this is not just about being accused of cheating on a specific exam. Being convicted of academic dishonesty could derail an entire career.

40.4 Investigations of Students Must Start With Concrete Evidence

Though this investigation wasn’t the result of proctoring software, it was part and parcel of a larger problem: educators have used the pandemic as an excuse to comb for evidence of cheating in places that are far outside their technical expertise. Proctoring tools and investigations like this one flag students based on flawed metrics and misunderstandings of technical processes, rather than concrete evidence of misconduct.

Proctoring software that assumes all students take tests the same way—for example, in rooms that they can control, their eyes straight ahead, fingers typing at a routine pace—puts a black mark on the record of students who operate outside the norm. One problem that has been widely documented with proctoring software is that students with disabilities (especially those with motor impairments) are consistently flagged as exhibiting suspicious behavior by software suites intended to detect cheating. Other proctoring software has flagged students for technical snafus such as device crashes and Internet outages, as well as completely normal behavior that could indicate misconduct only if you squint hard enough.

Throughout the pandemic, far too many schools ignored legitimate student concerns about inadequate or overbroad anti-cheating software. Across the country, thousands of students, and some parents, created petitions against the use of proctoring tools, most of which (though not all) were ignored. Students taking the California and New York bar exams—as well as several advocacy organizations and a group of deans—advocated against the use of proctoring tools for those exams. As expected, many of those students then experienced “significant software problems” with the ExamSoft proctoring software, causing some to fail.

Many proctoring companies have defended their dangerous, inequitable, privacy-invasive, and often flawed software tools by pointing out that humans—meaning teachers or administrators—usually have the ability to review flagged exams to determine whether or not a student was actually cheating. That defense rings hollow when those reviewing the results don’t have the technical expertise—or in some cases, the time or inclination—to properly examine them.

Similar to schools that rely heavily on flawed proctoring software, Dartmouth medical school has cast suspicion on students by relying on access logs that are far from concrete evidence of cheating. Simply put: these logs should not be used as the sole evidence for potentially ruining a student’s career.

After press coverage from the New York Times, which also found that students’ devices could automatically generate Canvas activity data even when no one was using them, Dartmouth’s administration retracted the allegations—but not before students spent months fighting them and fearing what they could mean for their futures. Unfortunately, this has not stopped school officials from considering their interpretation of data to be above reproach.

40.5 A Problem with Companies and Transparency

Though Canvas and Blackboard have publicly stated that their logs should not be used for disciplinary decisions regarding high-stakes testing, we’ve heard other stories from students who were accused of misconduct—even well before the pandemic—based on inadequate interpretations of data from these platforms.

Since January of 2021, the Canvas instructor guide has explicitly stated: “Quiz logs should not be used to validate academic integrity or identify occurrences of cheating.” In February 2021, an employee of Instructure (the company behind Canvas) commented in a community forum that “weirdness in Canvas Quiz Logs may appear because of various end-user [student] activities or because Canvas prioritizes saving student’s quiz data ahead of logging events. Also, there is a known issue with logging of ‘multiple answer’ questions.” The employee concluded that “unfortunately, I can’t definitively predict what happened on the users’ end in that particular case.”

Despite the admitted and inherent unreliability of Canvas logs, and an outcry from accused students and digital rights organizations, schools continue to rely on Canvas logs to determine cheating—and Instructure continues to act as if nothing is wrong. Meanwhile, students’ educational careers are being harmed by these flimsy accusations.

In 2020, the administration of James Madison University lowered the grades of students who had been flagged as “inactive” during an exam according to Canvas test-taking logs. Students there spoke out to criticize the validity of the logs. And in the past few months, EFF has heard from other students around the country who have been accused of cheating by their schools, based solely or primarily on Canvas logs.

Cheating accusations can result in lowered grades, a black mark on student transcripts, suspension, and even expulsion. Despite the serious consequences students face, they often have very limited recourse. Disturbingly, Canvas provides logs to administrators and teachers, but accused students have been unable to see those same logs, either via the platform itself or from school officials. Students deserve better.

Instructure (and other companies) must do better. Admitting to the unreliability of Canvas logs on obscure webpages is not enough. Instructure ought to issue a clear, public announcement that Canvas logs are unreliable and should not be used to fuel cheating accusations. The company should also allow students to access the same logs their schools are increasingly using to accuse them of academic misconduct—which is important because, when viewed in their entirety, Canvas logs often don’t reveal activity consistent with cheating.

Instructure has a responsibility to prevent schools from misusing its products. Taking action now would show the company’s commitment to the integrity of the academic process, and would give students a chance to face their accusers on equal footing, rather than resigning themselves to an unjust and opaque process.

40.6 Not Just Dartmouth

In 2018, an undergraduate student at Brown University was penalized based on Canvas access logs and subsequently contacted the EFF. The student had accessed Canvas immediately before an exam and had then left multiple browser windows open to the Canvas site on her mobile phone. The phone remained unused in her backpack during the exam, yet the log records (inaccurately) appeared to indicate that the site was being actively accessed by a user.

After being accused of cheating based on this Canvas log data, the student reached out to Canvas. Multiple Canvas technical support representatives responded by explaining that the log data was not a reliable record of user access. The student shared their statements with Brown. Notably, the student, who had a 4.0 record, had little motive to cheat on the exam at issue, rendering the cheating accusation—based as it was virtually entirely on the log data—all the more flimsy.

A Brown disciplinary panel nonetheless ruled against the student and placed a permanent mark on her academic record, grounding its decision in the presumed accuracy of the Canvas log data.

Last year, in the wake of Dartmouth’s apology to its wrongfully accused students, and Canvas’s more public acknowledgment that its logs should not be used as the basis for disciplinary decisions, the former student asked Brown to clear her academic record. Brown, however, refused even to consider voiding its disciplinary decision, on the grounds that the student had no remaining right to appeal. This is a common thread we’ve seen in these situations: the students at Dartmouth were also not afforded reasonable due process—they were not provided complete data logs for the exams, were given less than 48 hours to respond to the charges (and only two minutes to make their cases in online hearings), and were reportedly told to admit guilt.

In an implicit acknowledgment of its error, Brown now says that it will provide its former student with a letter of support if she applies to graduate school. An implicit admission of injustice, however, is not a sufficient remedy. Like Dartmouth, Brown should withdraw the record of the discipline it wrongfully imposed on this student, as well as any others who may have likewise been found responsible for cheating based on such unreliable log records.

Schools should accept that Canvas logs cannot replace concrete, dispositive evidence of cheating. In fact, the EFF has written a guide for educating administrators and teachers on these logs’ inaccuracy.

40.7 Conclusion

The “false positives” in these stories are not literally the Type I errors we’ll discuss in the context of hypothesis testing. Nonetheless, they serve as powerful reminders that data cannot always be trusted. These universities didn’t necessarily act maliciously—indeed, it’s understandable that institutions of higher education would want to crack down on cheating! Instead, the main error these schools made was assuming that data are always objective and trustworthy, instead of staying open to the ways in which seemingly straightforward data can fail us. As data scientists, it’s important that we avoid the same mistake.