More Than a Collection: Applied Uses of Supplemental Data

The importance of supplemental data in reproducibility has gained renewed focus with the development of online technology. New questions facing science publishing include these: What is the value of traditional peer review in the presentation of data? How can various fields address their specific needs and limitations? Who maintains data (independently of corresponding research reports) and how? In this session, Christine Laine, Liz Williams, and William Michener gave real-world examples of how supplemental data are used and maintained to address those questions.

Laine began with a summary of guidelines of the Annals of Internal Medicine that require each research article to be published with a Reproducible Research Statement that indicates whether and under what conditions study materials can be shared. The policy is not a mandate that data be made publicly available but rather constitutes a standardization of how data availability is noted; concerns about patient confidentiality prevent strict blanket requirements that may be possible in other fields.

Last year, the Yale Open Data Access Project (YODA) conducted twin meta-analyses of previously gathered primary patient-level data on rhBMP-2, a product used in spinal-fusion surgery; the Annals of Internal Medicine conducted separate reviews of the two resulting articles and kept the information in the reports completely independent until publication. The scope and conclusions of the YODA project highlight the importance of neutral third-party data analysis, but the experiment also highlights the importance of peer review as a curator of scientific research. All review materials and manuscript versions were published alongside the papers, and this allowed readers to see the development of initial findings into the authoritative final versions. More information on the project and its conclusions is available in an editorial: http://annals.org/article.aspx?articleid=1696651.

Williams discussed another field-specific concern and a major limitation faced at The Journal of Cell Biology (JCB): images are the primary data produced in some research, but images in a PDF or online figure lose complexity. They become flat and static regardless of how large or high-resolution a single image is, and a reader cannot interact with or further analyze the image. In response, JCB developed and maintains the JCB DataViewer, a browser-based image repository that allows users to zoom in on, download, and otherwise interact with the original image files, including ultralarge images and large datasets that are the basis of the published paper. Those primary data are considered supplemental data for the paper and are assigned a unique digital object identifier (DOI).

Michener addressed the question of data accessibility and archiving. Until fairly recently, the responsibility for archiving data underlying research reports was left largely to the authors themselves, and the data were therefore not reliably and consistently retained. Michener presented two projects, Dryad Digital Repository and DataONE, both aimed at archiving data, promoting discoverability, and encouraging outside analysis and reproducible science. The former is a data repository that allows authors and publishers to deposit datasets in a variety of formats; data are assigned a unique DOI for easy referencing, are archived in CLOCKSS, and can be integrated with journal submission systems and made compliant with journal embargo policies. DataONE seeks to connect networks of data to maximize discoverability and indexing and to promote data use and analysis among institutions and countries.

Dryad, DataONE, and the JCB DataViewer support and illustrate the necessity of persistent access to underlying research data. At the end of the session, Michener and Williams indicated that authors tend to gather data and then use the data for only a short period, as little as 1 to 2 years according to Michener. The research community stands to benefit greatly from continued and standardized access to data long after initial experiments. The Internet has given us several powerful tools for easily sharing and analyzing all types of data, but peer review remains crucial for validating results, standardizing presentation, and providing context to the data. When data are hosted outside a journal, they should be easy to reference and discover.