
Feature Article

Science Forum: Wikidata as a knowledge graph for the life sciences

databases with experimental data, such as MassBank (Horai et al., 2010; Wohlgemuth et al., 2016) and PDB Ligand (Shin, 2004), and toxicological information, such as the EPA CompTox Dashboard (Williams et al., 2017). Additionally, these items contain links to compound classes, disease indications, pharmaceutical products, and protein targets.

Pathways: Wikidata has items for almost three thousand human biological pathways, primarily from two established public pathway repositories: Reactome (Fabregat et al., 2018) and WikiPathways (Slenter et al., 2018). The full details of the different pathways remain with the respective primary sources. Our bots enter data for Wikidata properties such as pathway name, identifier, organism, and the list of component genes, proteins, and chemical compounds. Properties for contributing authors (via ORCID properties; Sprague, 2017), descriptions and ontology annotations are also being added for Wikidata pathway entries.

Diseases: Wikidata has items for over 16 thousand diseases, the majority of which were created based on imports from the Human Disease Ontology (Schriml et al., 2019), with additional disease terms added from the Monarch Disease Ontology (Mungall et al., 2017). Disease attributes include medical classifications, symptoms, relevant drugs, as well as subclass relationships to higher-level disease categories. In instances where the Human Disease Ontology specifies a related anatomic region and/or a causative organism (for infectious diseases), corresponding statements are also added.

References: Whenever practical, the provenance of each statement added to Wikidata was also added in a structured format. References are part of the core data model for a Wikidata statement.
References can either cite the primary resource from which the statement was retrieved (including details such as the version number of the resource), or they can link to a Wikidata item corresponding to a publication as provided by a primary resource (as an extension of the WikiCite project; Ayers et al., 2019), or both. Wikidata contains over 20 million items corresponding to publications across many domains, with a heavy emphasis on biomedical journal articles.
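The reference model described above can be sketched in a few lines. This is an illustrative data structure, not a Wikidata library API; the property identifiers P248 ("stated in"), P813 ("retrieved"), and P699 (Disease Ontology ID) are real Wikidata properties, but the QID passed below is a placeholder.

```python
# Minimal sketch of a Wikidata statement carrying structured provenance,
# using P248 ("stated in") and P813 ("retrieved") as reference properties.
# The helper and field names are illustrative, not any library's API.

def make_statement(prop, value, stated_in_qid, retrieved_date):
    """Build a statement dict whose provenance travels with it as a reference block."""
    return {
        "property": prop,
        "value": value,
        "references": [
            {
                "P248": stated_in_qid,   # stated in: item for the source database release
                "P813": retrieved_date,  # retrieved: date the source was accessed
            }
        ],
    }

# Example: a Disease Ontology ID (P699) statement sourced from a primary
# resource; "Q12345" is a placeholder QID, not the real source item.
stmt = make_statement("P699", "DOID:8577", "Q12345", "2020-01-15")
print(stmt["references"][0]["P248"])
```

Because the reference is part of the statement itself, a bot can cite either the database release, a publication item, or both, simply by adding further reference blocks.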

Waagmeester et al. eLife 2020;9:e52614. DOI: https://doi.org/10.7554/eLife.52614

Bot automation

To programmatically upload biomedical knowledge to Wikidata, we developed a series of computer programs, or bots. Bot development began by reaching a consensus on data modeling with the Wikidata community, particularly the Molecular Biology WikiProject. We then coded each bot to retrieve, transform, normalize and upload data from a primary resource to Wikidata via the Wikidata application programming interface (API). We generalized the common code modules into a Python library, called Wikidata Integrator (WDI), to simplify the process of creating Wikidata bots (https://github.com/SuLab/WikidataIntegrator; archived at Burgstaller-Muehlbacher et al., 2020). Relative to accessing the API directly, WDI has convenient features that improve the bot development experience. These features include the creation of items for scientific articles as references, basic detection of data model conflicts, automated detection of items needing update, detailed logging and error handling, and detection and preservation of conflicting human edits. Just as important as the initial data upload is the synchronization of updates between the primary sources and Wikidata. We utilized Jenkins, an open-source automation server, to automate all our Wikidata bots. This system allows for flexible scheduling, job tracking, dependency management, and automated logging and notification. Bots are run either on a predefined schedule (for continuously updated resources) or when new versions of the original databases are released.
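The update-detection idea mentioned above can be illustrated with a small, self-contained sketch. This is not WDI's actual implementation; the function and variable names are hypothetical, and the comparison is deliberately simplified to a whole-record equality check.

```python
# Sketch of "automated detection of items needing update": compare each
# primary-source record against the statements currently on the matching
# Wikidata item, and write only the ones that differ. Names are illustrative.

def items_needing_update(source_records, wikidata_items):
    """Yield (item_id, record) pairs whose source data differs from Wikidata."""
    for item_id, record in source_records.items():
        current = wikidata_items.get(item_id)
        if current != record:  # item is new, or its statements are stale
            yield item_id, record

# Toy data: P351 is the Wikidata property for the Entrez gene ID.
source = {"Q1": {"P351": "1017"}, "Q2": {"P351": "1018"}}
wikidata = {"Q1": {"P351": "1017"}}  # Q2 is missing from Wikidata

stale = dict(items_needing_update(source, wikidata))
print(stale)  # only Q2 needs an upload
```

Restricting writes to changed items keeps scheduled bot runs cheap and avoids churning item histories with no-op edits.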

Applications of Wikidata

Translating between identifiers from different databases is one of the most common operations in bioinformatics analyses. Unfortunately, these translations are most often done by bespoke scripts based on entity-specific mapping tables. These translation scripts are written repetitively and redundantly across our community, are rarely kept up to date, and are seldom integrated in a reusable fashion. An identifier translation service is a simple and straightforward application of the biomedical content in Wikidata. Based on the mapping tables that have been imported, Wikidata items can be mapped to databases that are both widely and rarely used in the life sciences community. Because all of these mappings are stored in a centralized database and use a systematic data model, generic and reusable translation scripts can easily be written (Figure 2). These scripts can be used as a foundation for more complex Wikidata queries, or the results can be
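A generic translation script of the kind described above reduces to building one SPARQL query over a pair of external-identifier properties. The sketch below is an assumption-laden illustration in the spirit of Figure 2, not the paper's exact query; P351 (Entrez gene ID) and P594 (Ensembl gene ID) are real Wikidata properties, and any other pair of external-ID properties could be substituted.

```python
# Build a reusable identifier-translation SPARQL query for the Wikidata
# Query Service: map values of one external-ID property to another.
# Property IDs are parameters, so one function covers every ID pair.

def translation_query(from_prop, to_prop, identifiers):
    """Return a SPARQL query translating identifiers between two properties."""
    values = " ".join(f'"{i}"' for i in identifiers)
    return (
        "SELECT ?fromId ?toId WHERE {\n"
        f"  VALUES ?fromId {{ {values} }}\n"
        f"  ?item wdt:{from_prop} ?fromId ;\n"
        f"        wdt:{to_prop} ?toId .\n"
        "}"
    )

# Entrez gene IDs (P351) to Ensembl gene IDs (P594):
query = translation_query("P351", "P594", ["1017", "1018"])
print(query)
```

Because the mapping lives in Wikidata rather than in a local table, the same function stays current as the community updates the underlying items.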
