The research data repository of the Environmental Data Initiative (EDI) is a signatory of the FAIR Data Principles. Building on over 30 years of data curation research and experience in the NSF-funded US Long-Term Ecological Research (LTER) program, it provides mature functionality, well-established workflows, and support for ‘long-tail’ environmental data publication. High-quality scientific metadata are enforced through automatic checks against community-developed rules and the Ecological Metadata Language (EML) standard. Although the EDI repository is far along the continuum of making its data FAIR, representatives from EDI and the LTER Information Management community have recently been developing best practices for edge cases in environmental data publishing. Here we discuss, and seek feedback on, how best to handle the publication of these ‘long-tail’ data when they are accompanied by extensive additional data, e.g., genomics data, physical specimens, or flux tower data. While these latter data are better handled in discipline-specific repositories such as NCBI, iDigBio, and AmeriFlux, they are frequently associated with other data collected at the same time and location, or even from the same samples. This is particularly relevant across the LTER Network, where sites represent integrative research projects. Questions we address (and on which we seek community input) include: How should documents and images be archived when they are data, e.g., field notebooks or time-lapse photographs of plant phenology? How should data from unmanned vehicles (e.g., drones and underwater gliders), acoustic data, or model outputs, which may be several terabytes in size, be handled? How should processing scripts or modeling code be associated with data? Overall, these best practices address issues of Findability and Accessibility of data, as well as greater transparency of the research process.
Assessing and understanding the extent and trajectory of change in inland waters is a great challenge, due in part to the differing methods and cultures of the agencies that provide synoptic observations of Earth’s systems and of the community of lake scientists whose research generates heterogeneous and distributed in situ data. Advancement requires socio-technological initiatives that harness the resources of the highly diverse and distributed community of ecologists, as well as the products and expertise of the satellite remote sensing community. Here we describe a prototype for linking in situ and remotely sensed data for lakes through the collaborative efforts of the Global Lake Ecological Observatory Network (GLEON), the Environmental Data Initiative (EDI), and NASA. GLEON provides a community of lake scientists and data from lake observatories. EDI curates and publishes data and ensures conformity to rigorous FAIR principles. NASA provides the expertise and workflows to deliver remotely sensed data products on demand. The integration of the data and the communities provides a foundation for a new generation of lake science.
Data repositories and research networks worldwide are publishing a diverse array of long-term and experimental data for meaningful reuse, repurposing, and integration. However, in synthesis research the largest time investment is still in discovering, cleaning, and combining primary datasets until all are completely understood and converted to a usable format. To accelerate this process, we have developed an approach to define flexible domain-specific data models and convert primary data to these models using a lightweight and distributed workflow framework. The approach is based on extensive experience with synthesis research workflows, takes into account the distributed nature of original data curation, satisfies the requirement for regular additions to the original data, and is not determined by a single synthesis research question. Furthermore, all data describing the sampling context are preserved, and the harmonization may be performed by data scientists who are not specialists in each specific research domain. Our harmonization process has three phases. First, a Design Phase captures essential attributes and considers existing standardization efforts and external vocabularies that disambiguate meaning. Second, an Implementation Phase publishes the data model and best practice guides for reference, followed by conversion of relevant repository contents by data managers and creation of software for data discovery and exploration. Third, a Maintenance Phase implements programmatic workflows that run automatically when parent data are revised, using event notification services. In this presentation we demonstrate the harmonization process for ecological community survey data and highlight the unique challenges and lessons learned. Additionally, we demonstrate the maintenance workflow and the data exploration and aggregation tools that plug into this data model.
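The event-driven Maintenance Phase described above can be sketched in a few lines. The following is a minimal, hypothetical illustration, not EDI's actual implementation: the event type, the `convert_to_model` stub, and the package identifiers are all invented for this example, and a real workflow would subscribe to a repository's notification service and run the full conversion pipeline.

```python
# Hypothetical sketch of an event-driven maintenance workflow: when a
# notification reports a new revision of a parent dataset, re-run the
# conversion of that dataset into the harmonized data model.
# All names here (RevisionEvent, convert_to_model, package ids) are
# illustrative, not part of any real repository API.
from dataclasses import dataclass


@dataclass
class RevisionEvent:
    package_id: str  # e.g. "knb-lter-xyz.12" (made-up identifier)
    revision: int    # new revision number of the parent dataset


def convert_to_model(package_id: str, revision: int) -> dict:
    """Placeholder for the conversion step: map a parent data package
    to the domain-specific harmonized model (here, a stub record)."""
    return {
        "source": f"{package_id}.{revision}",
        "model": "harmonized community survey model",  # illustrative label
    }


class MaintenanceWorkflow:
    """Re-harmonizes parent datasets whenever a revision event arrives."""

    def __init__(self) -> None:
        self.harmonized: dict[str, dict] = {}

    def on_event(self, event: RevisionEvent) -> None:
        # Idempotent update: the latest revision replaces earlier output,
        # so derived data always track the newest parent revision.
        self.harmonized[event.package_id] = convert_to_model(
            event.package_id, event.revision
        )


# Usage: simulate two notification events for the same parent package.
wf = MaintenanceWorkflow()
wf.on_event(RevisionEvent("knb-lter-xyz.12", 3))
wf.on_event(RevisionEvent("knb-lter-xyz.12", 4))
print(wf.harmonized["knb-lter-xyz.12"]["source"])  # knb-lter-xyz.12.4
```

The key design point is that harmonization is triggered by the repository's event stream rather than by a synthesis researcher's manual refresh, so derived products stay current as parent data accrue new revisions.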
In this era of open data and reproducible science, graduate students need to learn where and how to publish their data and to become conversant with the challenges inherent in reusing someone else’s data. The Environmental Data Initiative partnered with UNM Libraries and the Florida Coastal Everglades LTER to organize a 1-credit, semester-long distributed graduate seminar to learn whether this approach could be an effective mechanism for transmitting such information. Each week during the Spring 2021 semester, an informatics specialist spoke remotely to students at the University of New Mexico, Florida International University, and the University of Wisconsin-Madison on topics ranging from FAIR principles to data security, team science to data provenance. Students prepared for the lecture with one or more readings, and in-class exercises reinforced the material covered. Student assignments included writing quality metadata for their own data and archiving their data in the EDI Repository. The capstone writing assignment, a data management plan for their own research project, allowed the students to integrate much of what they had learned. Student response to this class was positive, and students indicated that they learned a lot of immediately useful information without the course being a significant time sink. The low registration numbers at UNM and FIU (6 and 7 students, respectively), where the seminar was not required, suggest a need to better inform both students and their advisors of the opportunity and the value provided by the training. Instructors also learned that it would be easier to create a cohesive flow to the course, without repetition, if the group of instructors took turns lecturing, rather than bringing in specialists on each subject. It was also apparent from student comments that many felt this information should be integrated, at an introductory level, into undergraduate classes or classes for new graduate students.
1. The Environmental Data Initiative (EDI) is a trustworthy, stable data repository and data management support organization for environmental scientists. EDI was built through a bottom-up community process on the premise that freely and easily available data are necessary to advance the understanding of complex environmental processes and change, to improve transparency of research results, and to democratize ecological research.
2. EDI provides tools and support that allow environmental researchers to easily integrate data publishing into their research workflows.
3. Almost ten years after going into production, we analyze metadata to provide a general description of EDI’s data collection, its data management philosophy, and its placement in the repository landscape. We discuss how comprehensive metadata and the repository infrastructure lead to highly findable, accessible, interoperable, and reusable (FAIR) data by evaluating compliance with specific community-proposed FAIR criteria.
4. Finally, we review measures and patterns of data (re)use, verifying that EDI is fulfilling its stated premise.