Karen Stocks and 11 others

The Rolling Deck to Repository (R2R; www.rvdata.us) program is entering its second decade of managing underway data from US-operated academic research vessels to ensure preservation of, and access to, these national oceanographic research assets. Reflecting on the move from decentralized data submission by chief scientists to an operational centralized facility has yielded insights that may inform other communities served by distributed networks of data acquisition providers with diverse practices and resources. 4,000 cruises and 100+ TB of data later, here are the lessons R2R has learned.

- Managing data via a central aggregating system, where both curation and domain data expertise can be optimally leveraged, promotes more complete and efficient data preservation.
- Identifying key organizing elements for the data, and implementing persistent identifiers and metadata for those elements, facilitates management and usability. R2R developed authoritative DOIs and standard metadata for cruises to organize R2R data for discoverability and access, and to facilitate reciprocal linking to related data in external repositories (a minimal illustration follows this list). When data submissions from diverse providers are heterogeneous, standardizing data at ingest supports the aggregation and synthesis that promote broad data re-use.
- Providing tools and expertise to assist with standardization, such as recommended data structures and best-practice guidance for data acquisition, reduces heterogeneous practices over time even when compliance is voluntary.
- Developing organized and persistent communication mechanisms with all main stakeholders is central to success. R2R holds annual community-level meetings, as well as more frequent individual interactions, with vessel operators/technicians, NOAA National Centers for Environmental Information staff, and oceanographic research scientists. These communications have been critical for informing high-level priorities, overall approaches, and specific technical details and decisions.
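To make the identifier-and-metadata lesson concrete, the sketch below shows one way a cruise-level record could be organized around a persistent identifier, with standard descriptive fields and reciprocal links to related data in external repositories. The class, field names, and values are illustrative assumptions for this note, not R2R's actual schema or services.

```python
# Hypothetical sketch (not R2R's schema): a cruise-level metadata record
# keyed by a persistent identifier, with links out to related datasets
# held in external repositories.
from dataclasses import dataclass, field
from typing import List


@dataclass
class CruiseRecord:
    doi: str                     # persistent identifier for the cruise (illustrative)
    cruise_id: str               # operator-assigned cruise designation
    vessel: str
    start_date: str              # ISO 8601 dates
    end_date: str
    chief_scientist: str
    related_datasets: List[str] = field(default_factory=list)  # DOIs/URLs in external repositories


# Example record; every value is a placeholder, not real cruise data.
record = CruiseRecord(
    doi="doi:10.xxxx/example-cruise",
    cruise_id="EX1234",
    vessel="R/V Example",
    start_date="2019-06-01",
    end_date="2019-06-21",
    chief_scientist="A. Researcher",
    related_datasets=["doi:10.yyyy/ctd-profiles"],
)

print(record.doi, "->", record.related_datasets)
```
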
Over time, I have worked on a suite of data infrastructure projects, ranging from small to large and from simple to complex. Some have been successful, others less so, and a couple are so notorious they are generally not mentioned in polite company. Sometimes the outcome was predictable: well-organized and well-managed projects are likely to succeed, and those with clear flaws often don't rise above them. Sometimes, though, it was unexpected. Building effective data infrastructure is not a solved problem; it is an area of research in its own right, and the larger the project, the more difficult and uncertain it is. For each of my current and past projects, I consider their greatest strength, their showcase best practice, their weaknesses, and their epic fails, as well as the external and internal factors that contributed to positive or negative outcomes. Despite the heterogeneity, certain patterns emerge as lessons learned:

- Listening to your users is critical, and there are no shortcuts; it requires a long-term investment, including cultivating individual relationships.
- Good is better than more; hardening infrastructure is time-consuming but critical for adoption.
- Have a clear mission with concrete benefits to a defined user community, and expand thoughtfully from there.
- In spite of our best intentions, meeting emerging community expectations usually requires a catalyst: external groups identifying best practices, mini-grants for implementation, and multi-project groups collaborating to rise together provide the needed nudges.
- Interestingly, I have never been involved with a project that failed because it picked the wrong technology. A poor choice can cause pain and consume resources, but I have not seen a terminal impact.
- Finally, invest in the people behind the infrastructure. Committed, engaged staff supported by ongoing professional development, rational management, and sufficient autonomy can, and regularly do, accomplish the impossible.