co-written with Pippin Lee
The Leaders Prize is a $1M prize which incentivizes Canadian teams to develop solutions that will “automatically verify a series of claims, flag whether they are true, partly true, or false, and provide evidence to support their determination.”
Pippin Lee and I were initially excited about the problem statement, and about the opportunity to apply natural language processing solutions to a real-world problem. Understanding the detectability of synthetic media, the ways in which media is evolving, and disinformation in new mediums is something we’re deeply interested in.
This competition is a joint collaboration between Communitech and the Schulich Foundation, and is financially supported by the Leaders Fund. The competition aims to encourage the development of fully-automated strategies for fact-checking. It is made up of two parts:
Phase 1: Teams are given a series of claims and text from supporting documents and asked to determine the truth value of the claim. (June - December 2019)
Phase 2: Teams must generate reasons for their assessment of the claims being true or false. (January - June 2020)
According to David Stein, co-founder and Managing Partner of the Leaders Fund, “It is our goal that the solutions developed through this competition provide the foundation for new companies and solutions that have a distinctive edge and global applicability.”
Unfortunately, after spending time with the dataset and consulting with researchers in misinformation, we've concluded that the competition will not produce anything of value for the fact-checking community. What follows are the lessons learned from experts in the field, as well as some personal anecdotes, in hopes that these lessons can be used to guide future work.
While researching these problems and solutions, we found that Mevan Babakar from UK-based fact-checking organization Full Fact had written an excellent Twitter thread on the issues with the current implementation of this competition. These include major oversights in how the truth value of a claim should be determined (i.e. claims often require confirmation by authoritative sources, which in turn often requires an email or phone call).
The initial dataset is a collection of claims (e.g. "Viral posts claim that climate change is a made-up catastrophe.") along with text from 1-5 “supporting documents,” such as news articles. This formulation reduces the task to semantic matching against a pre-decided list of sources (“does this claim match up with any of the pre-supplied articles on the topic?”), which ignores the nuance of selecting credible sources, as real-world fact-checkers are required to do, and is not at all representative of the complexity of a real fact-checking task.
A critical piece of the fact-checking workflow involves sourcing information and checking its validity against known ground truth, as well as other authoritative sources. Removing this element from the Leaders’ Prize problem by supplying a predefined set of so-called authoritative material reduces the task to a matching problem, which is neither novel nor useful in the context of determining the truth value of information.
If that's the case, then what should a fake news challenge solve for? Similar competitions in this space, such as the Fake News Challenge, recognized this complexity and, after extensive consultation with journalists, decided against framing their task as a truth-value assessment. Instead, they opted for a stance-detection task, which aims to classify whether a body of text supports or refutes a claim rather than assessing the claim's absolute truth value. This enables fact-checkers to quickly assess where sources stand on controversial issues.
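To make the stance-detection framing concrete, here is a toy sketch in Python. The label set follows the Fake News Challenge (agree, disagree, discuss, unrelated), but the word-overlap heuristic, cue words, and threshold are purely illustrative assumptions of ours, not any competition's actual baseline.

```python
# Toy stance classifier for (claim, document) pairs.
# Labels follow the Fake News Challenge: agree / disagree / discuss / unrelated.
# Heuristics and the 0.3 threshold are illustrative assumptions only.

def tokenize(text: str) -> set[str]:
    return {w.strip(".,!?\"'").lower() for w in text.split()}

def stance(claim: str, body: str) -> str:
    """Word overlap gates 'unrelated'; crude cue words split the rest."""
    claim_words, body_words = tokenize(claim), tokenize(body)
    overlap = len(claim_words & body_words) / max(len(claim_words), 1)
    if overlap < 0.3:                                    # assumed threshold
        return "unrelated"
    if body_words & {"false", "hoax", "debunked", "myth"}:
        return "disagree"
    if body_words & {"confirmed", "verified", "true"}:
        return "agree"
    return "discuss"

print(stance("Vaccines cause autism",
             "The claim that vaccines cause autism has been debunked."))
# prints "disagree"
```

A real stance model would be learned rather than rule-based, but even this sketch shows why the framing is useful: it reports how a document relates to a claim without the system ever claiming to know what is true.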
After speaking with Aviv Ovadya, we recommended reframing the problem in the following way: given a (claim, document) pair, assess the extent to which the document would make a reader more likely to believe the claim. For example, a false claim might be “vaccines cause autism,” and the resulting system would help identify content which is pushing such misinformation. This helps fact-checkers triage claims to check, assists their research, and directly enables a single fact-check to be matched against thousands of misinformation variants, which would be immediately impactful for researchers assessing the rate at which, and the ways in which, misinformation spreads.
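The variant-matching idea can be sketched with plain bag-of-words cosine similarity. A deployed system would use learned sentence embeddings; the 0.4 threshold and the example variants below are arbitrary assumptions for illustration.

```python
# Sketch: match one fact-check against many claim variants by
# bag-of-words cosine similarity. Threshold and data are illustrative.
from collections import Counter
from math import sqrt

def bow(text: str) -> Counter:
    return Counter(text.lower().replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

fact_check = "vaccines do not cause autism"
variants = [
    "VACCINES CAUSE AUTISM.",
    "my child developed autism after a vaccine",
    "the moon landing was faked",
]
matches = [v for v in variants if cosine(bow(fact_check), bow(v)) > 0.4]
print(matches)  # only the first, near-verbatim variant matches
```

Note that the second variant, a paraphrase with little word overlap, falls below the threshold here, which is exactly why semantic rather than purely lexical matching is needed to pair one fact-check with thousands of misinformation variants.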
We then declined to continue participating in the competition, concluding that it would not be a good use of our time, and instead raised these issues with the Leaders' Fund, along with our recommendations. We received the following response from one of the organizers, a computer science professor at the University of Waterloo, who admits that building systems which are immediately applicable to the real world is not a priority for this competition: “The competition has two goals: a) get the scientific community to work on a societal cause of great importance and b) advance the state of the art in NLP and ML. In that respect the competition is achieving those goals. Naturally, we all hope that the competition will lead to some new solutions that will help to reduce misinformation, but we will have to wait till after the competition (and perhaps several years after the competitions) to see the outcome of potential commercializations."
The million-dollar prize is a significant amount to spend on solutions which are unlikely to be deployed in the real world, built on a dataset that misinformation researchers deem to have limited practical relevance to the actual task in question. We believe this funding could have been used more effectively toward the goal of improving fact-checking, either by directing it to fact-checking organizations with experience building complex fact-checking systems, or by designing the task with enough nuance to motivate solutions that create short-to-mid-term real-world value.
This is reminiscent of the way many corporate-sponsored Kaggle competitions are run: without regard for the nuance the real-world task requires, and with more weight placed on how “cool” the results are (with improvements often measured in basis points), despite their irrelevance to the real world. Unfortunately, the Leaders' Prize competition does not even accomplish its goal of advancing NLP in any meaningful way. Stripped of nuance, meaning, and fidelity to the real-world problem, the task cannot serve as a benchmark in future NLP research.
There are also issues with how the competition has been organized, both technical and non-technical.
Limiting participation to Canadian teams: Fact-checking is a global issue. Media is not limited by country boundaries, and work on this issue would benefit most from collaboration from people across geographies and disciplines.
Untested, buggy submission platform: The DataCup platform is relatively new and has not hosted many competitions to date. The Slack group created for the competition reflects this: the majority of messages come from teams spending precious development time deciphering cryptic error messages and failures from DataCup, and working around platform issues and limitations which should have been ironed out before the competition began.
Misaligned incentives: The single $1M prize incentivizes the wrong reasons for participation; distributing the sizeable prize money among several winners would do more to get as many motivated people involved as possible. Despite several participating teams asking for the prize money to be split so that more participants would find the endeavour worthwhile, the organizers have declined to consider the option.
The results of Phase 1 were announced in December 2019. Several teams expressed surprise at the standings: several leading teams dropped significantly in rank on the leaderboard despite good scores on the hidden validation dataset, and even the top scores fell by more than 10 points of F1, the evaluation metric. In response, the same organizer posted this justification:
It is critical to highlight that the test dataset is made up of 103 claims (the training dataset contained ~15k claims), of which only 8 belong to the True class. The F1 metric is sensitive to the number of elements per class, which explains why there may be large discrepancies in scoring. A test dataset of this size does not provide statistical significance for the outcomes, and the same organizer admits that it is possible for lucky high-scorers to sit at the top of the leaderboard.
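The class-imbalance point can be made concrete with a back-of-the-envelope calculation. The 103-claim / 8-positive counts come from the organizer's explanation above; the per-class error rates of the hypothetical classifier are assumed purely for illustration.

```python
# How class imbalance in a tiny test set distorts minority-class F1.
# Test-set counts (103 claims, 8 True) are from the competition;
# the classifier's error rates below are assumed for illustration.

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Same hypothetical classifier in both settings: it catches ~88% of
# true claims and mislabels ~10% of negatives as positive.

# Skewed test set: 8 positives, 95 negatives -> 7 TP, 10 FP, 1 FN.
skewed = f1(tp=7, fp=10, fn=1)       # = 14/25 = 0.56

# Balanced test set: 50 positives, 53 negatives -> 44 TP, 5 FP, 6 FN.
balanced = f1(tp=44, fp=5, fn=6)     # = 88/99 ~ 0.89

print(f"minority-class F1, skewed:   {skewed:.2f}")
print(f"minority-class F1, balanced: {balanced:.2f}")
```

With only 8 positives, a single claim flipping from correct to incorrect moves F1 by several points on its own, so leaderboard order at this scale is largely noise.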
It is confusing to us that the organizers defend the results by noting that “there was no guarantee that the test set would be statistically similar to the training set,” which demonstrates a fundamental flaw in the competition design and a misunderstanding of the capabilities of offline machine learning systems. Given static training data and offline learners (as opposed to online ones), it should not come as a surprise that the submitted models fail to generalize to wildly out-of-distribution data. As a result, the final results are only slightly better than random, and it is understandable that teams would be frustrated to see their machine learning efforts rendered irrelevant by the stochasticity of the test dataset.
More critically, the key problem to highlight is this: fully-automated end-to-end fact-checking is not a reasonable problem to solve solely with machine learning methods at this point in time. Several sub-problems could likely be tackled with machine learning if broken down into specific tasks, but assuming the entire workflow can be automated is an error born of technologist hubris. Julia Evans has an excellent blog post highlighting the complexity that is lost when real-world problems are simplified to fit the Kaggle competition format.
Perhaps the clearest reflection of the competition's failure is the lukewarm discussion around it, even among participants. The official Slack group is made up almost entirely of participants complaining about errors on the DataCup platform, rather than any meaningful discussion of the nature of the problem itself or the task's applicability to the real world, as you might expect from a participant group motivated by the mission.
It is disappointing for us to see Canadian leaders put significant funding behind competitions which seem to lack understanding of the domain and do not account for lessons from previous fake news challenges. It is difficult not to feel that these organizations highlighted this important problem domain but viewed it purely through a machine learning lens, as a form of virtue signaling.
As machine learning engineers, we’re excited about how ML can be useful for high-impact societal problems, but as we’ve consistently found, solutions require complex cross-disciplinary collaboration with domain experts, and machine learning alone will not solve these problems outright. Let’s work together.
Many thanks to Aviv Ovadya, Mevan Babakar, and Dean Pomerleau for facilitating conversations and providing feedback during this process.
These views belong to Pippin Lee and Helen Ngo and do not represent the views of any other individual or organization.
For more information on the state of automated fact-checking, we recommend this report by UK-based fact-checking charity Full Fact.