Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale

Ai-Jou Chou
National Chiao Tung University
Taiwan
ajchou@cs.nctu.edu.tw

Sam Walton
Wikimedia Foundation
United Kingdom
swalton@wikimedia.org

Guilherme Gonçalves
Google
Ireland
guilherme.p.gonc@gmail.com

Miriam Redi
Wikimedia Foundation
United Kingdom
miriam@wikimedia.org

ABSTRACT

Machine learning models designed to improve citation quality in Wikipedia, such as text-based classifiers that detect sentences needing citations (“Citation Need” models), have received substantial attention from both the scientific and Wikimedia communities. However, due to their highly technical nature, such models are not easily accessible, and their usage is generally restricted to machine learning researchers and practitioners. To fill this gap, we present Citation Detective, a system designed to periodically run Citation Need models on a large number of articles in English Wikipedia and to release public, usable, monthly data dumps exposing the sentences classified as missing citations. By making Citation Need models usable by the broader public, Citation Detective opens up new opportunities for research and applications. We provide an example of a research direction enabled by Citation Detective by conducting a large-scale analysis of citation quality in Wikipedia, showing that article citation quality is positively correlated with article quality, and that articles in Medicine and Biology are the best sourced in English Wikipedia. The Citation Detective data and source code will be made publicly available and are being integrated with community tools for citation improvement such as Citation Hunt.

KEYWORDS

datasets, neural networks, Wikipedia, data dumps

ACM Reference Format:

Ai-Jou Chou, Guilherme Gonçalves, Sam Walton, and Miriam Redi. 2020. Citation Detective: a Public Dataset to Improve and Quantify Wikipedia Citation Quality at Scale. In Proceedings of The Web Conference (Wiki Workshop’20). ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

The core content policy of Verifiability[1] is one of the key mechanisms that Wikipedia communities adopt to monitor the quality of their content. The policy requires that any information likely to be challenged be backed by a citation to a reliable source.

One of the methods by which Wikipedia’s editor communities flag verifiability issues is by tagging material with a [Citation Needed] flag. This flag can apply to one or more sentences of text, alerting readers and fellow editors that the preceding content is missing a citation to a reliable source. Articles with any content tagged with [Citation Needed] are added to maintenance categories for review. On the English Wikipedia, as of February 2020, this category contains more than 380,000 articles.[2]
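This maintenance category can be enumerated programmatically. Below is a minimal sketch, assuming Python with the requests library, that pages through the category via the MediaWiki API's categorymembers list; the helper name and the ten-article cutoff are illustrative choices, not part of any existing tool.

    import itertools
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def flagged_articles():
        """Yield article titles from the 'All articles with unsourced
        statements' maintenance category via the MediaWiki API."""
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": "Category:All_articles_with_unsourced_statements",
            "cmnamespace": 0,    # main (article) namespace only
            "cmlimit": "500",    # maximum batch size for anonymous clients
            "format": "json",
        }
        while True:
            data = requests.get(API, params=params).json()
            for member in data["query"]["categorymembers"]:
                yield member["title"]
            if "continue" not in data:
                break
            params.update(data["continue"])  # follow the continuation token

    # Print the first ten flagged articles.
    for title in itertools.islice(flagged_articles(), 10):
        print(title)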

Wikipedia’s editor communities have created tools and workflows to address the backlog of unsourced content on the encyclopedia, particularly to aid in navigating and filtering the list. One such tool is Citation Hunt[3], a microcontribution tool that presents users with a single sentence or paragraph ending in a [Citation Needed] flag and allows the selected articles to be filtered by topic. The user is asked to find a reliable source of information which could verify the content, and to add it to the article. In this way, Wikipedia editors can address reference gaps one entry at a time, search for unsourced content by topic, or even use the tool as a simple entry point for new contributors, such as in the 1Lib1Ref campaign.[4]

At the time of writing there is no simple way for Wikipedia’s editor communities to monitor citation needs at scale across the encyclopedia, nor to find cases of content missing a citation without prior addition of a [Citation Needed] flag. The true extent of the encyclopedia’s unsourced content is therefore currently unknown.

A recent research work aimed to fill this gap by designing machine learning classifiers able to detect sentences needing citations in Wikipedia [7]: through a qualitative analysis of Wikipedia's citation guidelines, the authors created a taxonomy of reasons why inline citations are required, and then designed and open-sourced text-based classifiers to determine whether a sentence needs a citation (“Citation Need” model), and why. While the “Citation Need” model is a first step towards understanding Wikipedia citation quality at scale, its usability is limited.
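To illustrate what such a sentence-level classifier does, the sketch below trains a toy stand-in on a handful of labeled sentences. The published Citation Need model [7] is a neural network; this TF-IDF plus logistic regression pipeline, the toy training data, and the names used here are assumptions for illustration only, not the released model's interface.

    # Minimal stand-in for a Citation Need classifier, for illustration.
    # The published model [7] is a neural network; this sketch swaps in
    # a TF-IDF + logistic regression pipeline over toy training data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labeled sentences (1 = needs a citation, 0 = does not).
    train_sentences = [
        "The city has a population of 1.2 million.",
        "The study reported a 40% reduction in symptoms.",
        "She was born in Paris in 1891.",
        "See the following section for details.",
        "The album was released to critical acclaim.",
        "This section summarizes the plot.",
    ]
    train_labels = [1, 1, 1, 0, 1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_sentences, train_labels)

    # Score new sentences: estimated probability a citation is needed.
    for sentence in ["The river is roughly 300 km long.",
                     "The following list is alphabetical."]:
        p = clf.predict_proba([sentence])[0, 1]
        print(f"{p:.2f}  {sentence}")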


  1. https://en.wikipedia.org/wiki/Wikipedia:Verifiability
  2. https://en.wikipedia.org/wiki/Category:All_articles_with_unsourced_statements
  3. https://tools.wmflabs.org/citationhunt
  4. https://meta.wikimedia.org/wiki/The_Wikipedia_Library/1Lib1Ref/Resources