Big Data Science: Challenges and Opportunities for Authors, Reviewers, Editors, and Publishers

It is no secret that scientists are generating a deluge of data. The publishing landscape has to evolve quickly to keep up, not just with the terabytes but with scientists' need to explore, interpret, access, and analyze this information.

First in the session to discuss those challenges and opportunities was Eleonora Presani, particle physicist and publisher for Elsevier’s journals of nuclear and high-energy physics, who opened with a powerful quote from Kirk Borne, chair of information and statistics for the Large Synoptic Survey Telescope: “We don’t have a big data problem. Data storage isn’t a problem. The volume of data isn’t a problem. Our problem is pulling meaningful insights out of the data avalanche.”

Presani emphasized that publishers must allow scientists to express problems, communicate data, and get the right information out. Publishers can provide tools, move to a more interactive way to communicate science both to scientists and to the public, and provide streamlined methods of downloading data in usable formats.

More than a data-storage problem, today’s challenges involve interpreting the data and providing the right information at the right time and to the right audience.

Veronique Kiermer, executive editor and head of researcher services at Nature Publishing Group (NPG), spoke on behalf of Ruth Wilson, head of publishing services at NPG.

The scientific community is starting to see publications dedicated to disseminating data about data. In May 2014, NPG launched Scientific Data, an open-access, online-only publication for descriptions of scientifically valuable datasets that features a new type of content called data descriptors, which are designed to make data more discoverable, interpretable, and usable.

Kiermer showed a graph of growth in research articles indexed in PubMed and growth of available data (in repositories) over time. The mass of data is not only catching up to the number of publications but surpassing the capacity of individual journals to properly host and curate the data. Kiermer, a molecular biologist, recalled that the landmark 1953 Watson and Crick paper describing the structure of DNA¹ contained no actual data (as we have come to define it) and in fact started an entire field with a single page of text!

By 2012, broad collaborations and large-scale big-data projects had become the norm. The ENCODE Project, for example, involved 30 papers, 3 journals, 442 consortium members, and 15 terabytes of data. Team science and big data have unique challenges and many stakeholders, including funders.

Kiermer mentioned the Royal Society’s report Science as an Open Enterprise² and the idea that although open data are useful, access alone is not sufficient. Reaping the full benefits requires substantial investment of time and effort and a paradigm shift. Data repositories (such as the National Center for Biotechnology Information, Dryad, figshare, and BioSharing) are crucial for providing data access, but the ecosystem is still fragmented, and publishers can help in several ways.

The publishing community must consider two critical elements of data sharing: first, replication and reproducibility; second, the ability to build on research and to have access to everything described in published research. Journals can help by having clear data-sharing policies with specific recommendations of where to deposit data. Kiermer cautioned that data policies need to be constructed in collaboration with the communities of scientists represented by the journals.

One trend on the rise is data citation, which raises both the credit given for data production and the visibility, usability, and utility of a dataset; that in turn will spur researchers to make data available. Journals may help by integrating supplemental information, which signals the importance of how data are perceived and presented, and by making the data behind figures downloadable and verifiable.

Session moderator Christine G Casey noted that one size doesn’t fit all when it comes to data policies. Because data repositories are fragmented, access alone is not the “be-all and the end-all.” Newly launched data journals have found a niche, and publishers can help by providing authors with tools for the storage, presentation, and communication of data. Casey also noted that funder policies are a key driver in making research available, although journals can and indeed do influence data deposits. Casey raised an interesting question about how to elucidate and document the difference between authors and data contributors. She offered an example from physics, in which 8,000 people may have contributed to an experiment (hence perhaps creating the arduous task of defining and agreeing on individual contributions).

If publishers provide guidelines on authorship and contributorship, individual scientific communities may determine how best to fit the guidelines to their disciplines. It was clear from the overflowing session that big-data science is on the minds of publishers, who, regardless of field or size, may shape data policies, access to data, and the pace of discovery.

References

1. Watson JD, Crick FHC. 1953. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171:737–738.
2. The Royal Society. 2012. Science as an open enterprise. http://royalsociety.org/policy/projects/science-public-enterprise/report/ (accessed 20 May 2014).