
to those researchers and practitioners who are familiar with classifiers based on natural language processing. To overcome this issue, in this paper, we present a system called Citation Detective, which makes these models readily usable by the broader Wikipedia and research community. Citation Detective "productionizes" the Citation Need model by applying the classifier to a large number of articles from the English Wikipedia and periodically releasing a public dataset of unsourced statements on the encyclopedia.

The Citation Detective dataset enables a number of research works and applications. First, it makes it possible to understand Wikipedia citation quality at scale, by allowing researchers to quantify and track the proportion of unsourced and well-sourced content in Wikipedia articles. To show the potential of such a dataset for this task, in this paper we provide a large-scale analysis of the encyclopedia's citation coverage, exploring the data along dimensions of topic, quality, and popularity. Second, the dataset produced by Citation Detective can easily be integrated with tools such as Citation Hunt to improve community workflows, by surfacing unsourced content with no prior [Citation Needed] tag. At the time of writing, the Citation Hunt tool is being extended to accommodate sentence suggestions from the Citation Detective dataset.

In this paper we provide an overview of the relevant research, data, and tools, a summary of our work on the Citation Detective dataset, and an analysis of the state of citation coverage on Wikipedia.

2 BACKGROUND AND STATE OF THE ART

This paper is closely related to the body of research and tools supporting efforts to improve citation coverage and quality in Wikipedia.

The Wikipedia editor community monitors the quality of information and citations through various mechanisms, including templates such as [Citation Needed] or {Unreferenced}. However, recent studies estimate that many articles might still have a small number of references or no references at all, and that readers rarely verify statements by clicking on inline citations [5, 6].

Tools such as Citation Hunt provide user-friendly interfaces to help contributors fix sentences which are missing reliable sources, and initiatives such as The Wikipedia Library help editors find the right sources to cite. To further support researchers and editors in this task, the Wikimedia Foundation has recently released structured datasets to aid navigation of the citation space in Wikipedia. These datasets include a list of all citations with identifiers in Wikipedia, for all articles in all languages [3], and its extended version containing all citations with identifiers tagged with topics and accessibility labels [8].

Some recent publications have focused on machine-assisted recommendations for citation quality improvement. These efforts include source recommendations for outdated citations [2], and automatic detection of the citation span, namely the portion of a paragraph which is covered by an inline citation [1]. Redi et al. [7] designed a set of classifiers based on natural language processing that, given a sentence, can automatically detect whether it needs a citation ("Citation Need" classifier), and why ("Citation Reason").

See prototype at https://tools.wmflabs.org/aco-citationhunt
https://en.wikipedia.org/wiki/Wikipedia:The_Wikipedia_Library


In this paper, we extend the work in [7] in two ways. First, we design a framework to make the Citation Need model available to the public, by creating a system that periodically classifies a large number of sentences in English Wikipedia with the Citation Need model, and releases a dump of the sentences which are classified as needing citations. Second, we provide an analysis of citation quality in English Wikipedia by applying the Citation Need model at scale on a sample of articles.

3 CITATION DETECTIVE

We present here Citation Detective, a system that applies the Citation Need model to a large number of articles in English Wikipedia, producing a dataset which contains sentences detected as missing citations, with their associated metadata (such as article name and revision id).
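This page does not specify the exact schema of the released dataset; the record sketch below is purely illustrative, with hypothetical field names, of the kind of entry described above (the flagged sentence plus the metadata needed to locate it in its article).

```python
from dataclasses import dataclass

@dataclass
class UnsourcedSentence:
    """Illustrative (hypothetical) record in the Citation Detective dataset:
    a sentence predicted to need a citation, plus metadata to locate it."""
    sentence: str       # text of the sentence flagged by the Citation Need model
    section: str        # section title in which the sentence appears
    article_title: str  # name of the English Wikipedia article
    revision_id: int    # revision of the article that was classified
    score: float        # model confidence that the sentence needs a citation
```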

3.1 System Workflow

The workflow of producing the Citation Detective database includes the following steps.

3.1.1 Generating a List of Pages

Given an article_sample_rate, we query the page table from the Wikipedia SQL replicas to generate a page_id list. The Page ID is a unique identifier for Wikipedia articles, preserved across edits and renames for pages in Wikipedia. The result of this step is a random sample of articles from English Wikipedia (which can be replicated for any other Wikipedia).
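As a rough illustration, this sampling step can be expressed as a query against the page table on the SQL replicas. The sketch below is minimal and assumes Toolforge-style replica access via pymysql; the host name, credentials file, and RAND()-based sampling are assumptions, not necessarily the authors' exact implementation.

```python
import pymysql

ARTICLE_SAMPLE_RATE = 0.01  # fraction of articles to sample (illustrative value)

# Connection details are illustrative; real replica access requires
# Toolforge credentials (e.g. a replica.my.cnf file).
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",
)

with conn.cursor() as cursor:
    # Sample main-namespace, non-redirect pages at the given rate.
    cursor.execute(
        """
        SELECT page_id
        FROM page
        WHERE page_namespace = 0
          AND page_is_redirect = 0
          AND RAND() < %s
        """,
        (ARTICLE_SAMPLE_RATE,),
    )
    page_ids = [row[0] for row in cursor.fetchall()]
```

Restricting the query to page_namespace = 0 and excluding redirects keeps the sample to actual encyclopedia articles.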

3.1.2 Retrieving Page Content

The page list is passed to the MediaWiki API to retrieve the page content. For each page in the list, we query the MediaWiki API to get the title, revision ID, and content (text) of the article.
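A minimal sketch of this retrieval step, using the public MediaWiki Action API through the requests library, is shown below; batching, error handling, and rate limiting are omitted, and the exact parameters used by Citation Detective are not specified on this page.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_page(page_id: int) -> dict:
    """Fetch title, latest revision ID, and wikitext for one page ID."""
    params = {
        "action": "query",
        "pageids": page_id,
        "prop": "revisions",
        "rvprop": "ids|content",
        "rvslots": "main",
        "format": "json",
        "formatversion": 2,
    }
    data = requests.get(API_URL, params=params).json()
    page = data["query"]["pages"][0]
    revision = page["revisions"][0]
    return {
        "title": page["title"],
        "revision_id": revision["revid"],
        "wikitext": revision["slots"]["main"]["content"],
    }
```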

3.1.3 Constructing Model Input Data

An input instance for the Citation Need model is made of (1) a set of FastText [4] word vectors representing each word in a sentence, and (2) the average word vector for all the words in the section title where the sentence lies (see [7] for more details). The public code repository for the Citation Need model provides pre-defined dictionaries of words and section titles based on FastText. In this step, we aim to extract individual sentences and their section titles, and transform them into FastText embeddings using the sentence dictionary and the section dictionary provided along with the Citation Need model.

First, we break an article into sections by the highest-level section titles, and we discard sections that do not need citations, such as "See also", "References", and "External links". Then, we split a section into paragraphs, and further divide each paragraph into sentences using NLTK's sentence tokenizer. Next, we split a sentence into words, and transform each word into its embedding by matching it with a key in the sentence dictionary. Similarly, we transform the section title into a section embedding using the section dictionary. If a word or a section title is not included in a dictionary, it is assigned an average word embedding corresponding to unknown words (following the procedure in [7]). At the end of this step, we have, for each article, a set of sentences converted into word vectors and ready to be used as input data for the Citation Need model.
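The sketch below illustrates this preprocessing under stated assumptions: sections are split with a simple regular expression over wikitext, the dictionaries are plain word-to-vector mappings loaded from the Citation Need repository, and the helper names (word_dict, section_dict, unk_vector) are hypothetical rather than the project's actual identifiers.

```python
import re
import numpy as np
import nltk

SKIP_SECTIONS = {"See also", "References", "External links"}

def sentences_with_embeddings(wikitext, word_dict, section_dict, unk_vector):
    """Split an article into (sentence, word-vector sequence, section vector) triples."""
    # Split on top-level section headings of the form "== Title ==".
    parts = re.split(r"\n==([^=].*?)==\n", wikitext)
    sections = [("MAIN_SECTION", parts[0])] + [
        (parts[i].strip(), parts[i + 1]) for i in range(1, len(parts) - 1, 2)
    ]

    instances = []
    for title, body in sections:
        if title in SKIP_SECTIONS:
            continue
        # Average of the word vectors for the section title words
        # (one reading of the input format described above).
        title_vecs = [section_dict.get(w, unk_vector) for w in title.lower().split()]
        section_vec = np.mean(title_vecs, axis=0) if title_vecs else unk_vector

        for paragraph in body.split("\n\n"):
            for sentence in nltk.sent_tokenize(paragraph):
                word_vecs = [
                    word_dict.get(w, unk_vector) for w in nltk.word_tokenize(sentence)
                ]
                if word_vecs:
                    instances.append((sentence, word_vecs, section_vec))
    return instances
```

Each resulting triple (sentence text, word vectors, section vector) corresponds to one input instance for the Citation Need model.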

https://www.mediawiki.org/wiki/Manual:Page_table
https://www.mediawiki.org/wiki/API:Main_page
https://github.com/mirrys/citation-needed-paper