
Ten Simple Rules for Digital Data Storage
Edmund M. Hart; Pauline Barmby; David LeBauer; François Michonneau; Sarah Mount; Patrick Mulrooney; Timothée Poisot; Kara H. Woo; Naupaka B. Zimmerman; Jeffrey W. Hollister, edited by Scott Markel
PLoS Computational Biology, vol. 12

Introduction

Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to the development of a wide variety of data formats, dataset sizes, data complexity, data use cases, and data sharing practices. Improvements in high-throughput DNA sequencing, sustained institutional support for large sensor networks [[1],[2]], and sky surveys with large-format digital cameras [[3]] have created massive quantities of data. At the same time, the combination of increasingly diverse research teams [[4]] and data aggregation in portals (e.g., for biodiversity data, GBIF.org or iDigBio) necessitates increased coordination among data collectors and institutions [[5],[6]]. As a consequence, “data” can now mean anything from petabytes of information stored in professionally maintained databases, to spreadsheets on a single computer, to handwritten tables in lab notebooks on shelves. All remain important, but data curation practices must continue to keep pace with the changes brought about by new forms of data and new data collection and storage practices.

While much has been written about both the virtues of data sharing [[7],[8]] and the best practices to do so [[9],[10]], data storage has received comparatively little attention. Proper storage is a prerequisite to sharing, and indeed inadequate storage contributes to the phenomenon of data decay or “data entropy,” in which data, whether publicly shared or not, become less accessible through time [[11],[12]]. Best practices for data storage often begin and end with this statement: “Deposit your data in a community standard repository.” This is good advice, especially considering your data is most likely to be reused if it is available on a community site. Community repositories can also provide guidance for best practices. As an example, if you are archiving sequencing data, a repository such as those run by the National Center for Biotechnology Information (NCBI) (e.g., GenBank) not only provides a location for data archival but also encourages a set of practices related to consistent data formatting and the inclusion of appropriate metadata. However, data storage policies are highly variable between repositories [[13]]. A data management plan utilizing best practices across all stages of the data life cycle will facilitate the transition from local storage to repository [[14]]. Similarly, having such a plan can facilitate the transition from one repository to another if funding runs out or requirements change. Good storage practices are important even (or especially) in cases when data may not fit with an existing repository, when only derived data products (versus raw data) are suitable for archiving, or when an existing repository may have lax standards.

This article describes ten simple rules for digital data storage that grew out of a long discussion among instructors for the Software and Data Carpentry initiatives [[15],[16]]. Software and Data Carpentry instructors are scientists from diverse backgrounds who have encountered a variety of data storage challenges and are active in teaching other scientists best practices for scientific computing and data management. Thus, this paper represents a distillation of collective experience, and hopefully will be useful to scientists facing a variety of data storage challenges. We additionally provide a glossary of common vocabulary for readers who may not be familiar with particular terms.

Rule 1: Anticipate How Your Data Will Be Used

One can avoid most of the troubles encountered during the analysis, management, and release of data by having a clear roadmap of what to expect before data acquisition starts. For instance:

  1. How will the raw data be received? Are they delivered by a machine or software, or typed in?
  2. What is the format expected by the software used for analysis?
  3. Is there a community standard format for this type of data?
  4. How much data will be collected, and over what period of time?

The answers to these questions can range from simple cases (e.g., sequencing data stored in the FASTA format, which can be used “as is” throughout the analysis), to experimental designs involving multiple instruments, each with its own output format and processing conventions. Knowing the state in which the data needs to be at each step of the analysis can help to (i) identify software tools to use in converting between data formats, (ii) orient technological choices about how and where the data should be stored, and (iii) rationalize the analysis pipeline, making it more amenable to re-use [[17]].

Also key is the ability to estimate the storage volume needed to store the data, both during and after the analysis. The required strategy will differ for datasets of varying size. Smaller datasets (e.g., a few megabytes in size) can be managed locally with a simple data management plan, whereas larger datasets (e.g., gigabytes to petabytes) will in almost all cases require careful planning and preparation (Rule 10).
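A rough storage estimate can be computed directly from the study design. The sketch below uses hypothetical numbers (rows per day, columns, bytes per value) purely for illustration; substitute your own acquisition rates.

```python
# Back-of-envelope storage estimate for a tabular dataset.
# All numbers here are hypothetical placeholders.
rows_per_day = 10_000    # observations logged per day
n_columns = 25           # variables per observation
bytes_per_value = 8      # e.g., a double-precision float
days = 365               # one year of collection

raw_bytes = rows_per_day * n_columns * bytes_per_value * days
print(f"~{raw_bytes / 1e9:.1f} GB per year before compression")
```

An estimate like this, made before collection starts, tells you whether a laptop, a departmental server, or institutional storage is the right target (Rule 10).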

Lastly, early consideration and planning should be given to the metadata of the project. A plan should be developed early as to what metadata will be collected and how it will be maintained and stored (Rule 7). Also be sure to consider community software tools that can facilitate metadata curation and repository submission. Examples in the biological sciences include Morpho for ecological metadata [[18]] and mothur [[19]] for submitting to NCBI’s Sequence Read Archive.

Rule 2: Know Your Use Case

Well-identified use cases make data storage easier. Ideally, prior to beginning data collection, researchers should be able to answer the following questions:

  1. Should the raw data be archived (Rule 3)?
  2. Should the data used for analysis be prepared once or re-generated from the raw data each time (and what difference would this choice make for storage, computing requirements, and reproducibility)?
  3. Can manual corrections be avoided in favor of programmatic or self-documenting approaches (e.g., Jupyter notebook or R markdown)?
  4. How will changes to the data be tracked, and where will these tracked changes be logged?
  5. Will the final data be released, and if so, in what format?
  6. Are there restrictions or privacy concerns associated with the data (e.g., survey results with personally identifiable information [PII], threatened species, or confidential business information)?
  7. Will institutional validation be required prior to releasing the data?
  8. Does the funding agency mandate data deposition in a publicly available archive, and if so, when, where, and under what license?
  9. Does the target journal mandate data deposition?

None of these questions have universal answers, nor are they the only questions to ask before starting data acquisition. But knowing the what, when, and how of your use of the data will bring you close to a reliable roadmap on how to handle data from acquisition through publication and archival.

Rule 3: Keep Raw Data Raw

Since analytical and data processing procedures improve or otherwise change over time, having access to the “raw” (unprocessed) data can facilitate future re-analysis and analytical reproducibility. As processing algorithms improve and computational power increases, new analyses will be enabled that were not possible at the time of the original work. If only derived data are stored, it can be difficult for other researchers to confirm analytical results, to assess the validity of statistical models, or to directly compare findings across studies.

Therefore, data should always be kept in raw format whenever possible (within the constraints of technical limitations). In addition to being the most appropriate way to ensure transparency in analysis, having the data stored and archived in their original state gives a common point of reference for derivative analyses. What constitutes sufficiently “raw” data is not always clear (e.g., ohms from a temperature sensor or images of an Illumina sequencing flowcell are generally not archived after the initial processing). Yet the spirit of this rule is that data should be as “pure” as possible when they are stored. If derivations occur, they should be documented by also archiving relevant code and intermediate datasets.

A cryptographic hash (e.g., SHA or MD5) of the raw data should be generated and distributed with the data. These hashes ensure that the dataset has not suffered any silent corruption and/or manipulation while being stored or transferred (see Internet2 Silent Data Corruption). For large enough datasets, the likelihood of silent data corruption is high. This technique has been widely used by many Linux distributions to distribute images and has been very effective with minimal effort.
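Generating such a hash requires only the standard library. The sketch below computes a SHA-256 digest in constant memory by reading the file in chunks, so it works for arbitrarily large datasets; the filename is a placeholder.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB
    chunks so arbitrarily large files fit in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Distribute the digest alongside the data (e.g., dataset.csv.sha256)
# so recipients can verify integrity after download or transfer.
```

Recipients can then re-run the same function (or a tool such as `sha256sum`) and compare digests; any mismatch signals corruption or modification.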

Rule 4: Store Data in Open Formats

To maximize accessibility and long-term value, it is preferable to store data in formats that have freely available specifications. The appropriate file type will depend on the data being stored (e.g., numeric measurements, text, images, video), but the key idea is that accessing data should not require proprietary software, hardware, or purchase of a commercial license. Proprietary formats change, maintaining organizations go out of business, and changes in license fees make access to data in proprietary formats unaffordable and risky for end-users. Examples of open data formats include comma-separated values (CSV) for tabular data, hierarchical data format (HDF) [[20]] and NetCDF [[21]] for hierarchically structured scientific data, portable network graphics (PNG) for images, KML (or other Open Geospatial Consortium [OGC] format) for spatial data, and extensible markup language (XML) for documents. Examples of closed formats include DWG for AutoCAD drawings, Photoshop document (PSD) for bitmap images, Windows Media Audio (WMA) for audio recording files, and Microsoft Excel (XLS) for tabular data. Even if day-to-day processing uses closed formats (e.g., due to software requirements), data being stored for archival purposes should be stored in open formats. This is generally not prohibitive; most closed-source software products enable users to export data to an open format.

Not only should data be stored in an open format but it should also be stored in a format that computers can easily use for processing. This is especially crucial as datasets become larger. Making data easily usable is best achieved by using standard data formats that have open specifications (e.g., CSV, XML, JSON, HDF5), or by using databases. Such data formats can be handled by a variety of programming languages, as efficient and well-tested libraries for parsing them are typically available. These standard data formats also ensure interoperability, facilitate re-use, and reduce the chances of data loss or mistakes being introduced during conversion between formats. By contrast, examples of open formats that are nonetheless not machine-readable include data embedded in the text of a PDF file or scanned images of tabular data from a paper source.

Rule 5: Data Should Be Structured for Analysis

To take full advantage of data, it can be useful for it to be structured in a way that makes use, interpretation, and analysis easy. One such structure for data stores each variable as a column, each observation as a row, and each type of observational unit as a table (Fig 1). The technical term for this structure is “Codd’s 3rd normal form,” but it has been made more accessible as the concept of tidy data [[22]]. When data is organized in this way, the duplication of information is reduced and it is easier to subset or summarize the dataset to include the variables or observations of interest.

[Fig 1: Example of an untidy dataset (A) and its tidy equivalent (B).] Dataset A is untidy because it mixes observational units (species, location of observations, measurements about individuals), the units are mixed and listed with the observations, more than one variable is listed per column (both latitude and longitude for the coordinates, and genus and species for the species names), and several formats are used in the same column for dates and geographic coordinates. Dataset B is an example of a tidy version of dataset A that reduces the amount of information that is duplicated in each row, limiting the chances of introducing mistakes in the data. By having species in a separate table, they can be identified uniquely using the Taxonomic Serial Number (TSN) from the Integrated Taxonomic Information System (ITIS), and it becomes easy to add information about the classification of these species. It also allows researchers to edit the taxonomic information independently from the table that holds the measurements about the individuals. Unique values for each observational unit facilitate the programmatic combination of information using “join” operations. With this example, if the focus of the study for which these data were collected is based upon the size measurements of the individuals (weight and length), information about “where,” “when,” and “what” animals were measured can be considered metadata. Using the tidy format makes this distinction clearer.
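A join over tidy tables can be sketched in a few lines. Here measurements reference species through a key, so taxonomic details live in exactly one place; the key values and measurements below are hypothetical placeholders, not real TSNs.

```python
# Two tidy tables: a species lookup keyed by a (hypothetical)
# taxonomic identifier, and per-individual measurements that
# reference species only through that key.
species = {
    101: {"genus": "Myotis", "species": "lucifugus"},
    102: {"genus": "Eptesicus", "species": "fuscus"},
}
measurements = [
    {"key": 101, "weight_g": 7.1},
    {"key": 102, "weight_g": 15.4},
]

# "Join": enrich each observation with its species record.
joined = [{**m, **species[m["key"]]} for m in measurements]
print(joined[0]["genus"])  # Myotis
```

Editing a species name now means changing one row in the lookup table rather than every measurement that mentions it, which is precisely the duplication reduction tidy data aims for.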

One axiom about the structure of data and code holds that one should “write code for humans, write data for computers” [[23]]. When data can be easily imported and manipulated using familiar software (whether via a scripting language, a spreadsheet, or any other computer program that can import these common files), data becomes easier to re-use. Furthermore, having the source code for the software doing the analysis available provides provenance for how the data is processed and analyzed. This makes analysis more transparent, since all assumptions about the structure of the data are implicitly stated in the source code. This also enables extraction of the analyses performed, their reproduction, and their modification.

Interoperability is facilitated when variable names are mapped to existing data standards. For instance, for biodiversity data, the Darwin Core Standard provides a set of terms that describe observations, specimens, samples, and related information for a taxa. For earth science and ecosystem models and data, the Climate Forecasting Conventions are widely adopted, such that a large ecosystem of software and data products exist to reduce the technical burden of reformatting and reusing large and complex data. Because each term in such standards is clearly defined and documented, each dataset can use the terms consistently; this facilitates data sharing across institutions, applications, and disciplines. With machine-readable, standards-compliant data, it becomes easier to build an Application Programming Interface (API) to query the dataset and retrieve a subset of interest, as outlined in Rule 10.

Rule 6: Data Should Be Uniquely Identifiable

To aid reproducibility, the data used in a scientific publication should be uniquely identifiable. Ideally, datasets should have a unique identifier such as a Digital Object Identifier (DOI), Archival Resource Key (ARK), or a persistent URL (PURL). An increasing number of online services, such as Figshare, Zenodo, or DataOne, are able to provide these; institutional initiatives also exist, and your local librarians will know about them. Some repositories may require specific identifiers, and these can change with time. For instance, NCBI sequence data will in the future only be identified by “accession.version” IDs; the “GI” identifiers (in use since 1994) will be retired in late 2016 [[24]].

Even as identifier standards change, datasets themselves can evolve. To distinguish between different versions of the same data, each dataset should have a distinct name that includes a version identifier. A simple way to do this is to use date stamps as part of the dataset name. Using the ISO 8601 standard avoids regional ambiguities: it mandates the date format YYYY-MM-DD.
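A practical benefit of ISO 8601 stamps is that plain lexical sorting puts versioned filenames in chronological order. The sketch below shows this with Python's standard library; the dataset name is a hypothetical example.

```python
from datetime import date

# ISO 8601 date stamps (YYYY-MM-DD) sort chronologically even
# under plain string (lexical) sorting, unlike DD-MM-YYYY or
# MM/DD/YYYY regional formats.
stamps = ["2016-02-01", "2016-10-30", "2016-09-05"]
assert sorted(stamps) == ["2016-02-01", "2016-09-05", "2016-10-30"]

# Tagging today's snapshot of a (hypothetical) dataset:
filename = f"bird_counts_{date.today().isoformat()}.csv"
print(filename)
```

Directory listings and backup tools then show dataset versions in the order they were produced, with no extra bookkeeping.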
