
community. In addition to metadata for specific works, we can also have metadata for large collections of knowledge resources. We believe that this is integral to the greater reuse and recombination of knowledge resources.

2. Componentization and Open Knowledge

The collaborative production and distribution of data is gradually progressing towards the level of sophistication displayed in software. Data licensing is important to this progression, but is often over-examined. Instead, we believe the crucial development is componentization, which can be defined as the process of atomizing (breaking down) resources into separate reusable packages that can be easily recombined. By focusing on the packaging and distribution of data in a shared context, one can resolve issues of rights, report-back, attribution and competition. Looking across different domains for “spike solutions”, we see the componentization of data as the common, core concern.
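
To make this concrete, the sketch below (in Python) shows what a metadata descriptor for one such reusable data package might contain. The field names, package names and URL are hypothetical illustrations, not any fixed standard or existing tool.

    # A minimal sketch of a descriptor for a reusable data package.
    # Every name and value here is a hypothetical illustration.
    knowledge_package = {
        "name": "uk-postcodes",          # identifier used when recombining packages
        "version": "1.0",                # versioning makes reuse stable over time
        "license": "open-data-licence",  # rights declared up front
        "sources": ["http://example.org/raw-postcode-data"],  # attribution and report-back
        "resources": ["postcodes.csv"],  # the atomized data itself
        "depends": ["uk-boundaries"],    # other packages this one builds upon
    }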

For those familiar with the Debian distribution of Linux, the initial ideal is of a “Debian of data”. When one installs a piece of software through the “apt” package management engine, all the libraries and other programs which it needs to run are walked through and downloaded along with it. The packaging system helps one “divide and conquer” the problems of organising and conceptualising highly complex systems. The effort of a few makes reuse easier for many; sets of related packages are managed in social synchrony between existing software producers.
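
As an illustration of that dependency walk, the following sketch resolves the dependencies of a hypothetical data package in the way an apt-like tool resolves software dependencies. The package names and the index are invented for the example, and circular dependencies are not handled.

    # A minimal sketch of apt-style dependency resolution, applied to data
    # packages. The index below is a hypothetical illustration, not a real
    # apt or Debian interface.
    PACKAGE_INDEX = {
        "uk-postcodes": ["uk-boundaries", "os-grid-refs"],
        "uk-boundaries": ["os-grid-refs"],
        "os-grid-refs": [],
    }

    def resolve(package, index, installed=None):
        """Walk the dependency graph depth-first, returning packages in
        install order (dependencies before the packages that need them)."""
        if installed is None:
            installed = []
        for dependency in index.get(package, []):
            resolve(dependency, index, installed)
        if package not in installed:
            installed.append(package)
        return installed

    print(resolve("uk-postcodes", PACKAGE_INDEX))
    # prints: ['os-grid-refs', 'uk-boundaries', 'uk-postcodes']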

3. Code Got There First

In the early days of software, there was little arms-length reuse of code because there was little packaging. Hardware was so expensive, and so limited, that it made sense for all software to be bespoke and for little effort to be put into building libraries or packages. Only gradually did the modern system, complex though still crude, develop. These days, to package is to propagate, and to be discoverable in a package repository is critical to utility.

The size of the data set with which one is dealing changes the terms of the debate. Genome analysis or Earth Observation data stretches to petabytes. Updates to massive banks of vectors or of imagery scatter many tiny changes across those petabytes. At this volume of data, it helps to establish