Making Materials Science and Engineering Data More Valuable Research Products

Charles H. Ward1, James A. Warren2, Robert J. Hanisch3


  1. Air Force Research Laboratory, Wright-Patterson AFB, OH

  2. National Institute of Standards and Technology, Gaithersburg, MD

  3. Space Telescope Science Institute, Baltimore, MD

Abstract

Both the global research community and federal governments are embracing a move toward more open sharing of the products of research. Historically, the primary product of research has been the peer-reviewed journal article for fundamental research and, for government-sponsored applied research and engineering, the government technical report. However, advances in information technology, new “open access” business models, and government policies are working to make publications and supporting materials much more accessible to the general public. These same drivers are blurring the distinction between the data generated through the course of research and the associated publications. These developments have the potential to significantly enhance the value of both publications and the supporting digital research data, turning them into valuable assets that can be shared and reused by other researchers. The confluence of these shifts in the research landscape leads one to the conclusion that technical publications and their supporting research data must become bound together in a rational fashion. However, bringing these two research products together will require the establishment of new policies and a supporting data infrastructure that have essentially no precedent in the materials community, and indeed are stressing many other fields of research. This document raises the key issues that must be addressed in developing these policies and infrastructure, and suggests a path forward in creating the solutions.

Keywords

materials data, data policy, data repository, ICME, MGI, Integrated Computational Materials Engineering, Materials Genome Initiative, data archiving

Introduction

Reliance on shared digital data in scientific and engineering pursuits, whether the data are derived from computation or experiment, is becoming more commonplace within the materials science and engineering (MSE) community. Concurrently, government policies across the globe are embracing an “open science” model which sets a requirement for sharing digital data generated from research. A recent MRS-TMS survey on “Big Data” in materials science and engineering showed 74% of respondents would be willing to participate in sharing their data if it was encouraged as a term and condition of funding or publishing, assuming the proper safeguards were in place.(TMS-MRS 2013) However, it is fair to say that the MSE community currently lacks the strategy, framework, and standards needed to support materials data curation and sharing. A unified approach is needed to meet the growing demands of the community, together with a plan to meet government-mandated requirements for broad access to digital data. It is clear that the peer-reviewed journals and government technical reports serving the MSE community can be an essential component of the solution, and there is now an opportunity to proactively plan how they may best serve the growing needs of their constituency.

Review

Global Context

The 2008 NRC report on Integrated Computational Materials Engineering (ICME) highlighted the importance digital data will play in the future of materials science and engineering.(NRC 2008) MSE’s ever increasing reliance on computational modeling and simulation will demand digital data as the feedstock for solutions in both science and engineering.

In the US, the National Institutes of Health have long promoted a policy of open access to data generated from their grants.(NIH) In the mid-1990’s the Human Genome Initiative spawned the Bermuda Principles which called for immediate public posting of sequences of the human genome.(Conference) More recently, the National Science Foundation has adopted a requirement that grantees provide a Data Management Plan in grant proposals.(Foundation) Specific to the materials community, the sharing of digital data is a key strategy component of the US’s Materials Genome Initiative, and mechanisms to foster and enable sharing are actively under consideration.(House)

The European Union has been very proactive in studying the impacts of a digitally-linked world on the scientific community. The EU Framework Programme 7 has funded a project called Opportunities for Data Exchange that has produced several relevant reports on publishing digital data in the scientific community.(Access) In June 2012 the Royal Society published “Science: An Open Enterprise” which promotes free and open access to scientific results, including data.(Science as an Open En...) These studies are now broadly informing government policy. For example, recent policy in the UK in July 2012 calls for government funded research to be published in open access journals, and requires access to supporting research data.(UK, UK) In February 2013, Dr. John Holdren, Director of the Office of Science and Technology Policy (OSTP), issued a directive to all Federal agencies to develop plans to make the results of Federally-funded research more accessible to the public. A key component of this directive is a call for agency plans to include a means by which the digital data resulting from research can be made available to the public.(Science) In support of this policy, the White House has established a useful web site providing resources supporting the establishment of open data.(Housea) US Government funding agencies have provided their plans to address OSTP’s open research policy and results are imminent.

Other technical communities have addressed the challenges of access to digital data with a variety of approaches. Indeed, the biology community has implemented a number of differing approaches, for example, the approach taken in genetics versus that adopted by evolutionary biology.(NIH-Genbank, Datadryad) In other disciplines, one subfield of thermodynamics has already adopted a very structured approach to archiving data, while the earth sciences community is embarked on an effort to define its approach.(NIST, Leichester) The astronomy community has dedicated international resources to the development of the Virtual Observatory, an infrastructure that enables global data discovery and access across hundreds of distributed archives.(International Virtual...) Despite the differing mechanics of implementation, all the approaches were rooted in a community-led effort to define the path best suited for that particular technical field.

In response to these trends, technical communities and publishers have developed and implemented Open Access journals and data archiving policies. Again, the field of biology appears to be leading the way on both these fronts. Perhaps the best example of this trend is Database: The Journal of Biological Databases and Curation, an Open Access journal dedicated to the discussion of digital data in biology.(Press) And in a recent development, Nature Publishing Group is launching a new open access journal, titled Scientific Data, which will be dedicated to publishing descriptions of scientific datasets and their acquisition.(Group) It will initially focus on the life, biomedical and environmental science communities. The Public Library of Science recently strengthened its policy on data access: “PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.”1

In order to begin a dialogue within the MSE community, NIST convened a workshop on digital materials data in May of 2012 under the auspices of MGI. The workshop identified a number of barriers that need to be addressed in creating a data strategy for materials, including: materials schema/ontology; data and metadata standards; data repositories/archives; data quality; incentives for data sharing; intellectual property; and tools for finding data.(Warren 2012) Notable among these for this discussion are data repositories and incentives for data sharing. Other disciplines, notably evolutionary biology, have demonstrated that peer-reviewed journals have the potential to contribute solutions to these barriers to data sharing.(Whitlock 2010)


  1. http://www.plosone.org/static/policies.action#sharing

Benefits of Archiving Materials Science and Engineering Data

There is a growing realization within the global scientific community that the data generated in the course of research is an oft-overlooked asset with considerable residual value to other scientists and engineers, and that often a significant portion of the data is stored but not used. The following are several benefits of increasing access to materials science and engineering data in digital form:

Data Reuse

  • Scientific productivity and return on investment in research infrastructure

  • Secondary hypothesis testing

  • Reduction/elimination of paying for data generation multiple times

  • Comparisons with previous studies

  • Integration with previous and future work

  • Reproducing and checking analyses

  • Simplifying and enhancing subsequent systematic reviews and meta-analyses

  • Interdisciplinary research

  • Teaching

Incentives

  • Increasing academic credit (citations)

  • Access to one’s own data at a future date

  • Convenience and security of cloud storage

Other

  • Validated reference datasets for testing algorithms/computations

  • Meeting funding agency requirements to share data

  • Reducing the potential for duplication of effort

  • Reduction of error and fraud

The MRS-TMS “Big Data” survey asked participants to evaluate whether given attributes would act as impediments or motivators to sharing data (Figure \ref{fig:IMPEDE}).(TMS-MRS 2013) The bottom of the graph shows the largest impediments, which are primarily driven by legal considerations. The top of the graph shows that the strongest positive motivators are the increased attention and credit researchers may draw for their work.

The impact on research productivity owing to the provision of well-calibrated, well-documented archival data products is clearly demonstrated in the case of NASA’s Hubble Space Telescope. Initially archival data was not used very extensively; the data suffered from spherical aberration, of course, resulting in a factor of 10 decrease in sensitivity from expectations. But in the early 1990s there was also somewhat of a stigma attached to using archival data for research: this was somehow not as good or pure as collecting one’s own data at a telescope. But times have changed, and HST archival data is now used in more than half of all peer-reviewed publications, by astronomers not affiliated with the teams who proposed for the original observations (see Figure \ref{fig:HST}). There are a number of reasons for the big increase in archival data use. HST observing time is very difficult to get, with typically a seven-to-one oversubscription ratio in the proposal process. All HST data is routinely pipeline processed, yielding an archive of “science ready” data products. All HST data becomes public after a nominal twelve-month proprietary period. And HST data taken for one purpose can often be utilized for studies of a substantially different intent. While this high level of reuse may not be achieved for all research experiments, the HST example clearly shows that a substantial improvement in research productivity can be achieved, at a very modest incremental cost, when proper care is taken in designing the data management system.

\label{fig:IMPEDE} Do you consider the following items to be impediments or motivation for you to share your data with the world? X-axis shows response rate to the three choices: Impediment, Neutral, Motivation.

\label{fig:HST} Data in the archive of the Hubble Space Telescope is used more than twice as often in research papers written by scientists with no connection to the original investigators proposing the research. This more than doubles the productivity of HST at a marginal extra cost of providing well-calibrated data in an easy to access archive.

Background for Data Archiving

A common approach to archiving materials data has several benefits, but its primary value would be to provide unified, consistent guidance and expectations throughout the scientific and engineering community. However, while development of the archiving policy itself may be relatively straightforward, the infrastructural issues necessary to support policy implementation are extraordinarily complex. These issues include the establishment of viable:

  • Repositories for materials data

  • Standards for data exchange

  • Citation and attribution protocols

  • Data quality metrics

  • Clear intellectual property and liability determination

Characteristics of an Archiving Solution

In order for a data archiving solution to be of lasting value to researchers and maintain the rigorous, archival standards of relevant publications it should have the following minimum set of characteristics:

  • Persistent citation

  • Data discoverability

  • Open access (for journals)

  • Ease of use

  • Minimal cost

To Archive or Not to Archive?

The most critical question to be answered in setting policies for publications is “what data should be archived?” The answer is essential in providing clear expectations for authors, editors, and reviewers, as well as determining the size of the data repositories needed. Other disciplines have already embarked on this journey and have devised a variety of approaches that suit the data needs of their community for their stage of “digital maturity.” Two ends of the spectrum in addressing this question are presented here. The first assumes all data supporting a publication are worthy of archiving. This criterion is found most often in peer reviewed journals that have narrow technical scope and generally deal with very limited data types. For example, journals in crystallography and fluid thermodynamics have very stringent data archiving policies that prescribe formats and specific repositories for the data submitted.(Notes for authors 201..., Koga 2013) Other journals that cover broader technical scope, and therefore deal with more heterogeneous data, have implemented more subjective criteria for data archiving and a distributed repository philosophy. Earth sciences and evolutionary biology have typically taken this approach. It is likely that the approach adopted by MSE publications may also span a similar spectrum, depending on the scope of the publication.

The MRS-TMS “Big Data” survey provided insight into the community’s perspective on the relative value of access to various types of materials data, shown in Figure \ref{fig:COMPLEX}. It’s interesting to note that as the complexity of the data and metadata increase (generally) toward the right-hand side of the chart, the community’s perceived need to have access to this data decreases. This could be due to many factors including the difficulty in assuring the quality of such data as well as the lack of familiarity with tools to handle the data complexity. However, with complexity comes a richness of information that if properly tapped could be extraordinarily valuable. In astronomy, for example, the Sloan Digital Sky Survey created a very complex database of attributes of stars, galaxies, and quasars. The wealth of information and immense discovery potential led many in the research community to become expert users of SQL, and for the survey to yield nearly 6,000 peer-reviewed publications.1

For those publications with wide technical scope, it will be difficult to provide a universal answer to “what data should be archived?” In these cases, the decision on what data to archive may best be left to the judgment of the authors, peer reviewers, and editors. A particularly useful metric might be the cost/effort to produce the data. For example, the “exquisite” experimental data associated with a high energy diffraction microscopy experiment provide unique, expensive, and rich datasets with great potential use to other researchers. Clearly, based on these factors the dataset should be archived. On the other hand, the results from a model run on commercial software that takes five minutes of desktop computation time may not be worthy of archiving as long as the input data, boundary conditions, and software version are well defined in the manuscript. Of course, one must account for the perishable nature of code, particularly old versions of commercial code. However, even the data from a simple tensile test may be worthy of archiving, as publications do not typically provide the entire curve; while the paper may report only yield strength, another researcher may be interested in work hardening behavior. Having the complete dataset in hand allows another researcher to explore alternative facets of the material’s behavior. The basic elements of criteria for determining the data required for archiving could include:

  • Are the data central to the main scientific conclusions of the paper?

  • Are the data likely to be usable by other scientists working in the field?

  • Are the data described with sufficient pedigree and provenance that other scientists can reuse them in their proper context?

  • Is the cost of reproducing the dataset substantially larger than the cost of archiving the fully curated dataset?

  • Is the dataset reproducible at all, or does it stem from a unique event or experiment?

Data itself can come in a variety of “processed” levels including “raw”, “cleaned”, and “analyzed”. Such characterizations are subjective, though some disciplines have adopted quite rigorous definitions. Nonetheless, given the diversity of materials data, care will need to be taken in determining the appropriate amount of processing performed on a dataset to be archived. While raw or cleaned data is much preferred for its relative simplicity in reuse, it is probably much more important at this stage of our digital maturity that the metadata accompanying the dataset provide sufficient pedigree and provenance to make the data useful to others, including definition of the post-acquisition (experiment or computation) processing performed.
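The metadata requirement described above can be made concrete with a small record type. The following is only a minimal sketch under stated assumptions: the class and field names (`DatasetMetadata`, `processing_steps`, and so on) are hypothetical and do not correspond to any existing MSE metadata standard. It records the processing level, pedigree, and provenance of a dataset, and flags a dataset that claims to be cleaned or analyzed without saying what post-acquisition processing was performed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ProcessingLevel(Enum):
    """The three processing levels named in the text."""
    RAW = "raw"
    CLEANED = "cleaned"
    ANALYZED = "analyzed"


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record; field names are illustrative."""
    title: str
    processing_level: ProcessingLevel
    pedigree: str      # where the data came from
    provenance: str    # how the data were generated (protocols, equipment)
    processing_steps: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of descriptive fields that were left empty."""
        missing = [
            name
            for name in ("title", "pedigree", "provenance")
            if not getattr(self, name).strip()
        ]
        # Data beyond the raw level should state what was done to it.
        if (self.processing_level is not ProcessingLevel.RAW
                and not self.processing_steps):
            missing.append("processing_steps")
        return missing
```

A repository or journal could run a check of this kind at submission time and ask the author to complete whatever `missing_fields()` reports, without prescribing any particular data format.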

Another factor to consider in setting guidelines for which data need to be archived is the expected annual and continuing storage capacity required. A very informal survey of 15 peer-reviewed journal article authors at NIST and AFRL found that most articles in the survey had less than 2 GB of supporting data per paper. Currently the time and resources required to upload (by authors) and download (by users) data files of less than 2 GB are quite reasonable. However, those papers reporting on emerging characterization techniques such as 3-D serial sectioning and high energy diffraction microscopy were dependent on considerably larger datasets, approximately 500 GB per paper. Other disciplines have established data repositories to support their technical journals. Experience to date indicates that datasets of up to approximately 10 GB can be efficiently and cost effectively curated.(Vision 2012) Repositories such as www.datadryad.org show that datasets of this magnitude can be indefinitely stored at a cost of $80 or less.(Datadryad) However, datasets approaching 500 GB will very likely require a different approach for storage and access. Thus a data repository strategy needs to consider this range in the distribution of dataset sizes. An additional factor when considering long-term storage requirements is the high global rate of growth in materials science and engineering publications. Figure \ref{fig:GROWTH} shows the dramatic growth in the number of MSE journal articles published over the past two decades, indicating a commensurate growth in the amount of accompanying data.
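The capacity question can be framed as simple arithmetic. The sketch below is illustrative only: the per-paper sizes default to the figures quoted above (2 GB as an upper bound for a typical paper, roughly 500 GB for a paper built on large characterization datasets), while the number of papers per year and the fraction of large-dataset papers are assumptions the reader must supply, since the informal survey does not establish them.

```python
def annual_storage_gb(papers_per_year, large_fraction,
                      typical_gb=2.0, large_gb=500.0):
    """Rough upper estimate of new archive storage needed per year, in GB.

    papers_per_year and large_fraction are assumed inputs, not survey results.
    typical_gb and large_gb default to the per-paper figures quoted in the text.
    """
    if not 0.0 <= large_fraction <= 1.0:
        raise ValueError("large_fraction must lie between 0 and 1")
    large_papers = papers_per_year * large_fraction
    typical_papers = papers_per_year - large_papers
    return typical_papers * typical_gb + large_papers * large_gb
```

For a hypothetical journal publishing 1,000 papers a year, of which 1% rely on 500 GB datasets, the estimate is about 6,980 GB per year, and the ten large papers account for 5,000 GB of it. Even a small fraction of very large datasets dominates the total, which is why the two size classes are likely to need different storage strategies.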


  1. This is based on a query to the Astrophysics Data System, http://adsabs.harvard.edu/, for peer-reviewed papers mentioning either “SDSS” or “Sloan” in the title or abstract of the paper. A query executed on April 9, 2014 resulted in 5,825 papers.

\label{fig:COMPLEX} What scientific/technical databases and data mining tools would be most useful if they could be created?

\label{fig:GROWTH} Number of materials science and engineering publications per year.

Data Repositories

Aside from crystallographic data repositories, there are at this time perhaps no dedicated materials data repositories that meet the required characteristics defined above. The materials science and engineering community does have numerous publicly accessible data repositories; however, the majority of these are associated with specific projects or research groups, and their persistence is therefore dependent on individual funding decisions. These repositories are primarily established to house and share the research data generated within a specific project or program. They generally do not follow uniform standards for data and metadata, nor do they provide for data discoverability and citation. There are very few repositories established with the explicit objective of providing MSE with public repositories for accessible digital data. In short, publicly accessible, built-for-purpose repositories and the associated infrastructure for access, safe storage and management still need to be developed and sustainably funded; this is the largest impediment to implementing viable data archiving policies. (See, for example, “Sustaining Domain Repositories for Digital Data: A White Paper”.(Ember 2013))

Evolutionary biology, for example, allows a mix of repositories that meet established criteria. Such criteria may be as simple as requiring data cited to be permanently archived in data repositories that meet the following conditions:

  1. Publicly accessible throughout the world

  2. Committed to archiving data sets indefinitely

  3. Allow bi-directional linking between paper and dataset

  4. Provide persistent digital identifier
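The four conditions above lend themselves to a simple checklist that a journal could apply when deciding whether a candidate repository is acceptable. This is a hedged sketch, not an established tool: the `Repository` type, its field names, and the `unmet_conditions` helper are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Repository:
    """Hypothetical description of a candidate data repository."""
    name: str
    publicly_accessible: bool     # condition 1
    archives_indefinitely: bool   # condition 2
    bidirectional_linking: bool   # condition 3
    persistent_identifier: bool   # condition 4


def unmet_conditions(repo: Repository) -> List[str]:
    """List the conditions a repository fails; an empty list means it qualifies."""
    checks = {
        "publicly accessible throughout the world": repo.publicly_accessible,
        "committed to archiving data sets indefinitely": repo.archives_indefinitely,
        "bi-directional linking between paper and dataset": repo.bidirectional_linking,
        "persistent digital identifier": repo.persistent_identifier,
    }
    return [condition for condition, met in checks.items() if not met]
```

A project-specific repository whose persistence depends on a single grant would, for instance, fail the second condition even if it satisfied the other three.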

One tempting option might be to take advantage of the on-line storage capability several journals already offer for supplementary materials accompanying journal articles. However, as presently constructed these are not amenable to best practices for dataset storage, as they generally are not independently discoverable, searchable, separately citable, or aggregated in one location. In fact, some publishers are reducing or eliminating supplementary file storage due to the haphazard structure and rules associated with their use. Further, new global government policies promoting open access to research works have the publishing industry in a state of flux with regard to their long-standing, subscription-based business model. Publishers have been extremely reluctant to take on a data archiving responsibility given the economic uncertainties in the publishing marketplace.(Ward 2012) Also, there is a risk that for-profit publishers might restrict access to digital data assets that are co-located with the journal.

As alluded to in the previous section, a fundamental consideration in repository design and/or selection is the level to which the repository will present structured versus unstructured data. Structured technical databases tend to be more useful to a technical community due to their uniformity, as evidenced by their data reuse rate.(Acharya 2012) A perfect construct would see the vast majority of materials data resident within structured repositories. A disciplined data structure provides enormous advantages to the researcher both in terms of data discoverability and confidence in its use. However, this structure must be enabled by the application of broader and deeper standards for data and metadata, standards that do not currently exist.

In all likelihood, like biology, MSE publications will be dependent on a collection of repositories that are tailored to specific materials data. For example, NIST is building and demonstrating a data file repository for CALPHAD and interatomic potentials.(NIST 2012) These may be expandable and largely sufficient for thematic publications such as those devoted to thermodynamics and diffusion. However, repositories such as this will only fill a relatively small niche need in MSE. Integrating Materials and Manufacturing Innovation is piloting an effort to link articles with their supporting data using the NIST repository according to the criteria outlined above; an example can be found in an article by Shade et al.(Shade 2013)(Shade 2013a)

Finally, a business model for sustainably archiving materials data is required. Other technical fields, such as earth sciences, can at least partially rely on government-provided repositories for large and complex datasets. Without these types of repositories to build on, MSE will need to establish viable repository solutions. In response to funding agency requirements for data management plans some universities, Johns Hopkins for example, are beginning to provide centrally-hosted data repositories, but these are not yet common.(University) Private fee-for-service repository services, such as labarchives and figshare, are also evolving to meet growing demand for accessible data storage.(labarchives, Figshare) Additionally, ASM International is working to create a prototype materials data repository through its close association with Granta Design. Termed the Computational Materials Data Network (CMDN), this is an interesting option as the data repository will provide a structured database specifically for materials data, but the business model for CMDN has not yet been solidified.(Network) A key open question remains how funding agencies will respond to the OSTP open research policy memo, and how they will fund activities making data open to the public.

Standards Enabling Data Discoverability, Exchange and Reuse

As noted in the previous section, standards for data and metadata provide the basis for a structured data archive, enabling the rapid discovery of data and assisting in determining the data’s relevance and usefulness. At the most basic level, good data practice generally requires the generation, and acceptance, of a vocabulary defining the terms used to describe reported data. This assures the data user they precisely understand the context of the data they are reviewing. From this level, other attributes, features, or requirements can be levied on a data management system including ontologies, schema and formats.

Other fields have studied these issues as a community, and MSE is now starting to develop a concerted effort to define its approach to setting data standards. The European Union is studying the creation of standards for exchange of engineering materials data through the European Committee for Standardization.(Austin 2013) The target for these standards is structural materials with an early emphasis on aerospace applications. The European Commission is also funding a broader activity called the Integrated Computational Materials Engineering expert group (ICMEg) with the aim of developing the standards and protocols needed to support the digital exchange of materials data needed to conduct ICME.(Schmitz 2014) ASTM International had issued data standards relevant to materials in the early to mid-1990’s, but those standards have since been abandoned, likely because they were ahead of true need. However, ASTM International has been reviving its efforts in providing guidance on the digitization of materials test data by exploring the re-establishment of its Computerization and Networking of Materials Databases Symposium Series.(Rumble 2014) These two efforts address a relatively narrow, but industrially important, segment of materials data. Several recent papers are starting to propose standards for other types of materials data, including thermodynamic and image-based data.(Jackson 2014)(Campbell 2014) There are also closed-loop approaches to materials data standardization that exist within commercial data management software packages, Granta Design being one example, but these are not generally available to the public.

While the field of information technology is continuously evolving to provide solutions to more productively use unstructured data, at present there is no community-wide accepted practice for MSE data and metadata standards. Near-term solutions for governing the archiving of materials data will need to be relatively loose, flexible, and evolutionary with a drive toward more standardization. While publishers may not be able to directly provide data repository services, they are reasonably well positioned and willing to aid the community in establishment of data standards. Concerning the pursuit of standardization across a technical field, Michael Whitlock, a primary champion of journal data archiving in the field of evolutionary biology, offered this quote from Voltaire based on his experience: “the perfect is the enemy of the good”.(Whitlock 2012) It is perhaps much more important at this stage of our digital maturity that MSE first implement data archiving with the best guidance available, and work to build in standardization over time.

Data Citation and Attribution

Well developed and uniform data citation standards are required to ensure linkages between publications and datasets are enduring and that creators of digital datasets receive appropriate credit when their data are used by others. Standards for data citation practices and implementation provide the mechanism by which digital datasets can be reliably discovered and retrieved. Closely related to data citation, other challenges include the ability to reliably identify, locate, access, interpret, and verify the version, integrity, and provenance of digital datasets.(Paul E. Uhlir 2012) Any data archiving policy must concern itself not only with how publications should appropriately cite the datasets used, but must also require attribution to authors of datasets outside the document.

Numerous organizations in the EU and US have studied this issue, and are continuing to refine technology solutions and best practices. For example, CODATA and the National Academy of Sciences released an in-depth international study and recommendations on citation of technical data.(Standards 2013) Recently, these transnational initiatives have coalesced to produce a unified Joint Declaration of Data Citation Principles that is appropriate for any type of technical publication.(Joint Declaration of ...) The eight principles define the purpose, function, and attributes of data citations and address the need for citations to be both understood by humans and processed by machines. With a slightly different perspective focused more on the mechanics of linking published articles with data repositories, DataCite and the International Association of Scientific, Technical and Medical Publishers have issued a joint statement recommending best practices for citation of technical datasets in journals:(DataCite 2012)

  1. To improve the availability and findability of research data, encourage authors of research Papers to deposit researcher validated data in trustworthy and reliable Data Archives.

  2. Encourage Data Archives to enable bi-directional linking between Datasets and publications by using established and community endorsed unique persistent identifiers such as database accession codes and Digital Object Identifiers (DOIs). DOI was approved as ISO Standard 26324:2012 in May 2012.

  3. Encourage publishers to make visible or increase visibility of these links from publications to datasets.

  4. Encourage Data Archives to make visible or increase visibility of these links from datasets to publications.

  5. Support the principle of data reuse and for this purpose actively participate in initiatives for best practice recommendations for the citation of datasets.

  6. Invite other organizations involved in research data management to join and support this statement.

An outstanding technical issue yet to be resolved concerns the granularity of the datasets used in a publication, both spatially and temporally. Spatial granularity refers to a subset of the dataset used in the research. Temporal granularity can refer to either the version of the dataset used, or the temporal state of the dataset used if the dataset itself is dynamic.
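To make the granularity question concrete, the following is a minimal sketch of a citation record that carries a persistent identifier together with optional version, snapshot date, and subset fields, and that can be rendered both for human readers and for machines. The class name, field names, output format, and the identifier shown are all illustrative assumptions; they are not taken from the Joint Declaration, the DataCite statement, or any existing citation standard.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical dataset citation record (field names are assumptions)."""
    creators: Tuple[str, ...]
    title: str
    repository: str
    year: int
    identifier: str                 # persistent identifier, e.g. a DOI
    version: Optional[str] = None   # temporal granularity: a fixed release
    accessed: Optional[str] = None  # temporal granularity: snapshot of a dynamic dataset
    subset: Optional[str] = None    # spatial granularity: portion of the dataset used

    def to_text(self) -> str:
        """Render a human-readable citation string."""
        parts = [f"{', '.join(self.creators)} ({self.year}). {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        parts.append(f"{self.repository}.")
        parts.append(f"{self.identifier}.")
        if self.subset:
            parts.append(f"Subset: {self.subset}.")
        if self.accessed:
            parts.append(f"Accessed {self.accessed}.")
        return " ".join(parts)

    def to_json(self) -> str:
        """Render a machine-readable form of the same record."""
        return json.dumps(asdict(self), sort_keys=True)


# Example with placeholder values; the identifier is not a real DOI.
citation = DataCitation(
    creators=("A. Researcher", "B. Colleague"),
    title="Example tensile test dataset",
    repository="Example Materials Repository",
    year=2014,
    identifier="doi:10.0000/example",
    version="1.2",
    subset="specimens 3-7",
)
print(citation.to_text())
print(citation.to_json())
```

Keeping version, snapshot date, and subset as separate optional fields lets a single record describe either a fixed release or the state of a dynamic dataset at the time of use, which is one possible way to approach the unresolved granularity issue.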

Data Quality

A key concern in linking datasets to publications is the provision of quality metrics, that is, can the data’s ultimate reliability be assessed in a meaningful manner? Materials data can be provided as two basic types: experimental and computational; both types assume underlying models. In order for data and these associated models to be usable, their quality must be ascertained. In this context, it is useful to define the following for data and models:

  • Pedigree – Where did the information come from?

  • Provenance – How was the information generated (protocols and equipment)? This metadata should be sufficient to reproduce the provided data.

In addition to these qualitative descriptors of the data, there are any number of meaningful quantitative measures of the data’s quality. However, in general the following metrics are a strong basis for such an assessment:

  • Verification – (Applies to computational data only). How accurately does the computation solve the underlying equations of the model for the quantities of interest?

  • Validation – How well do the computational or, more rarely, analytic results of a model agree with experiment?

  • Uncertainty – What is the quantitative level of confidence in our predictions?

  • Sensitivity – How sensitive are results to changes in inputs or in assumed boundary conditions?

Similar, and perhaps more difficult, problems pertain to simulation data. While such data may be perfectly precise in a numerical sense, simulations typically rely on many parameters, assumptions, and/or approximations. In principle, if the above are specified, and the quantitative metrics meet user requirements, the data can be used with a high level of confidence. A similar approach to defining data quality was recently proposed within the context of the Nanotechnology Knowledge Infrastructure Signature Initiative within the National Nanotechnology Initiative (Data Readiness Levels...).
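The idea that data are usable once the descriptors are specified and the metrics meet user requirements can be sketched as a simple completeness and threshold check. This is only an illustration under stated assumptions: the record layout, the choice to express every metric as a single relative number, and the user-supplied thresholds are all hypothetical, and in practice each element would need a community-agreed definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QualityRecord:
    """Hypothetical quality metadata for one dataset (all fields are assumptions)."""
    pedigree: str = ""                          # where the information came from
    provenance: str = ""                        # how it was generated
    is_computational: bool = False
    verification_error: Optional[float] = None  # relative numerical error (computational only)
    validation_error: Optional[float] = None    # relative disagreement with experiment
    uncertainty: Optional[float] = None         # relative uncertainty of the quantity of interest
    sensitivity: Optional[float] = None         # reported sensitivity measure


def quality_issues(record: QualityRecord,
                   max_error: float,
                   max_uncertainty: float) -> List[str]:
    """Return the reasons, if any, that the record fails the user's requirements."""
    issues = []
    if not record.pedigree:
        issues.append("pedigree missing")
    if not record.provenance:
        issues.append("provenance missing")
    if record.is_computational:
        # Verification applies to computational data only.
        if record.verification_error is None:
            issues.append("verification missing")
        elif record.verification_error > max_error:
            issues.append("verification error too large")
    if record.validation_error is None:
        issues.append("validation missing")
    elif record.validation_error > max_error:
        issues.append("validation error too large")
    if record.uncertainty is None:
        issues.append("uncertainty missing")
    elif record.uncertainty > max_uncertainty:
        issues.append("uncertainty too large")
    if record.sensitivity is None:
        issues.append("sensitivity missing")
    return issues


# Example: an experimental dataset with placeholder values.
experimental = QualityRecord(
    pedigree="Hypothetical laboratory A",
    provenance="Hypothetical tensile test protocol and load frame",
    validation_error=0.03,
    uncertainty=0.05,
    sensitivity=0.8,
)
print(quality_issues(experimental, max_error=0.05, max_uncertainty=0.10))  # prints []
```

An empty list means every element was reported and fell within the thresholds the user chose; any other result names the element that is missing or out of bounds.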

An often-posed question in the research community with regard to data associated with peer-reviewed journal articles is that of peer review of the data itself. Indeed, it has been reported that approximately 50% of data being reviewed for submission to The American Mineralogist Crystal Structure Database contained errors (Downs 2003). The elements defined above represent the key criteria by which to judge the quality of the data. General pedigree and provenance information are typically conveyed in most research articles, though they may be provided in insufficient detail to reproduce the data. The remaining elements of validation, verification, uncertainty, and sensitivity are relatively loosely defined within materials science and engineering, and best practices have not generally been developed for each element, or, where developed, are not in widespread use.

Intellectual Property and Liability

There is considerable confusion, complexity, and even ambiguity with regard to the legal protections governing scientific data.(Paul F. Uhlir 2012) In general, scientific data are treated as facts and are therefore not copyrightable under US law. However, the aggregation of the data into a single compilation or database may be copyrightable in the US. Additionally, and importantly, the codes, formats, metadata, data structures, or any ‘added value’ to the data could also be subject to copyright. Laws in other parts of the globe, particularly the European Union, add complexity to the situation. The EU’s Database Directive, for example, protects databases against wholesale use by other parties without permission.

There may be instances where the authors of a document do not want their data released immediately on publication of the supported manuscript. They may have very good, justifiable grounds to protect their data for some period following publication. One likely reason is the additional time required to file an invention disclosure related to the data. Another is that the authors are in the midst of writing another manuscript dependent on the same data. To account for these special cases, the publication should have allowance to grant the author an ‘embargo’ period to protect the data for a short time after document publication. Typically, under an embargo the author must still post the supporting data to a repository prior to manuscript publication, but the data are not released to the public until the embargo period has expired. This is a standard practice in other technical disciplines, with limits of 12 months being typical and at the discretion of the editor.
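The embargo mechanism described above amounts to a simple date rule: data are deposited before publication but become public only after a fixed period has elapsed. The sketch below illustrates that rule. The 12-month ceiling is an assumption taken from the typical limit mentioned above, and the function names are hypothetical; an actual policy would leave the limit to the editor.

```python
import calendar
from datetime import date

MAX_EMBARGO_MONTHS = 12  # assumption: the typical limit cited in the text


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamping to the end of shorter months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(published: date, embargo_months: int) -> date:
    """Date on which embargoed data become public."""
    if not 0 <= embargo_months <= MAX_EMBARGO_MONTHS:
        raise ValueError("embargo must be between 0 and 12 months")
    return add_months(published, embargo_months)


def is_public(published: date, embargo_months: int, today: date) -> bool:
    """True once the embargo period has expired."""
    return today >= release_date(published, embargo_months)


# Example: a six-month embargo on an article published 15 January 2014.
print(release_date(date(2014, 1, 15), 6))                   # 2014-07-15
print(is_public(date(2014, 1, 15), 6, date(2014, 7, 14)))   # False
print(is_public(date(2014, 1, 15), 6, date(2014, 7, 15)))   # True
```

A repository applying such a rule would hold the deposited data from the moment of submission and change only their visibility on the release date, so the link between publication and dataset exists throughout the embargo.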

Proprietary and export control restrictions may also affect the release of the metadata associated with the dataset, and could warrant embargo or even permanent withholding of the entire metadata description. Consider a researcher who has been provided a quantity of material by an industrial partner. The researcher may be free to report on a newly observed deformation phenomenon in the material with respect to its microstructure, but may be restricted by the partner from providing proprietary details about how the material was processed. In this case, the metadata may not contain the full pedigree and provenance needed to reproduce the experimental results. Export control provides an analogous situation; the data may not be restricted, but the metadata needed to provide full pedigree and provenance may reveal export-controlled information.(Ward 2013) Allowances for the withholding of metadata from publication must be in place, and the decision to either accept the embargo or reject the dataset should be left to the reviewer and editor. It should be noted in publication policy that authors take full responsibility for review and release of proprietary and export-controlled information.

Given the discussion above regarding intellectual property protection of data, policy regarding the requirements for licensure of data for reuse should be made clear. Of course, one must also consider where the data repository resides, so any policy may have somewhat limited scope. One desirable route is to require that all new data be covered by a CC-BY license, as defined by Creative Commons.(Commons) A CC-BY license grants free use of data by all parties, including for commercial use, but does require attribution. Unanswered questions remain regarding liability for making data accessible. Again, in determining liability, consideration must be given to where the data reside (that is, who is making them available).

Archiving Policy

We advocate establishing a working group from the MSE community to craft a common data archiving policy. The policy must address:

  1. A general definition of data to be archived; flexible to meet specific publication needs

  2. Criteria for suitable repositories

  3. Expectations or requirements to follow data or metadata standards

  4. Definition of standards for data citation and attribution

  5. Requirements and/or measures for data quality

  6. Clarity on intellectual property and liability issues

  7. Areas of opportunity for targeting pilot data archiving efforts (e.g. thermodynamic data)

Repositories

We also suggest establishing a complementary working group from the MSE community to develop a plan to provide supporting repositories for the MSE community. Some anticipated tasks and options include:

  1. Catalogue and explore the suitability of and potential for existing materials repositories to host datasets associated with peer-reviewed journals (e.g. NIST CALPHAD database)

  2. Explore the use of other established journal data repositories for their suitability for MSE data (e.g. www.datadryad.org)

  3. Engage funding agencies for help in establishing specialized MSE data repositories.

  4. Develop a time-phased strategy to provide well-structured materials repository architectures.

  5. Consider business models that would sustain these repository services over the long-term.

Conclusion

The era of Open Science is upon us, and the MSE community must generate a response that best suits the needs of not only the individual researcher but also the larger community, including academia, industry, and government. It is becoming clearer with the advance of materials research that supporting data can no longer remain hidden behind a technical publication. We have highlighted the key issues that will need to be considered as the community develops an approach to archiving the data that support publications. Charting the right course is a complex task that will take time and much effort. Fortunately, other technical disciplines have already started down this path, and we can learn from and capitalize on their experience. We have outlined some suggested community actions that would help pave the way in setting a common approach to archiving of materials data.

Competing interests

The authors declare that they have no competing interests.

Authors’ Contributions

CHW structured the flow of the paper, CHW and JAW contributed a substantive portion of the manuscript, while RJH added valuable complementary perspectives from outside materials science and engineering throughout the subsections in the paper.

Acknowledgements

The authors wish to thank Clare Paul and Jeff Simmons for helpful discussions in preparing this manuscript.

References

  1. TMS-MRS. TMS-MRS Survey on Big Data and Open Data: Preliminary Results. (2013). Link

  2. Committee on Integrated Computational Materials Engineering NRC. Integrated Computational Materials Engineering: A Transformational Discipline for Improved Competitiveness and National Security. The National Academies Press, 2008. Link

  3. NIH. Final NIH Statement on Sharing Research Data. Link

  4. Bermuda Conference. Wikipedia Bermuda Principles. Link

  5. National Science Foundation. Dissemination and Sharing of Research Results. Link

  6. The White House. Materials Genome Initiative. Link

  7. Alliance for Permanent Access. Opportunities for Data Exchange. Link

  8. Science as an Open Enterprise. The Royal Society, 2012. Link

  9. Research Council UK. Common Principles on Data Policy. Link

  10. Research Council UK. Common Principles on Data Policy. Link

  11. Office of Science and Technology Policy. OSTP Public Access Memo. Link

  12. The White House. Project Open Data. Link

  13. NIH-Genbank. Genbank. Link

  14. Datadryad. Datadryad. Link

  15. NIST. The Thermodynamic Research Center. Link

  16. University of Leicester. Peer REview for Publication & Accreditation of Research. Link

  17. International Virtual Observatory Alliance. Link

  18. Oxford University Press. The Journal of Biological Databases and Curation. Link

  19. Nature Publishing Group. Scientific Data. Link

  20. James A. Warren, Ronald F. Boisvert. Building the Materials Innovation Infrastructure: Data and Standards. National Institute of Standards and Technology, 2012. Link

  21. Michael C. Whitlock, Mark A. McPeek, Mark D. Rausher, Loren Rieseberg, Allen J. Moore. Data Archiving. The American Naturalist 175, 145–146 University of Chicago Press, 2010. Link

  22. Notes for authors 2012. Acta Crystallographica Section C 68, e3–e11 (2012). Link

  23. N. Koga, C. Schick, S. Vyazovkin. New procedures for articles reporting thermophysical properties. Thermochimica Acta 555, iii Elsevier BV, 2013. Link

  24. Todd Vision. Discussion regarding www.datadryad.org, private communication. (2012).

  25. Carol Ember, Robert Hanisch. Sustaining Domain Repositories for Digital Data: A White Paper. (2013). Link

  26. Charles H. Ward, James A. Warren. Discussion with AAP, STM, AIP, ACS, Elsevier. (2012).

  27. A. Acharya. Private Communication with A. Acharya, Google, Inc. (2012).

  28. NIST. NIST File Repositories. (2012). Link

  29. Paul A Shade, Michael A Groeber, Jay C Schuren, Michael D Uchic. Experimental measurement of surface strains and local lattice rotations combined with 3D microstructure reconstruction from deformed polycrystalline ensembles at the micro-scale. Integrating Materials and Manufacturing Innovation 2, 5 Springer Science + Business Media, 2013. Link

  30. Paul A. Shade, Michael A. Groeber, Jay C. Schuren, Michael D. Uchic. 3D microstructure reconstruction of polycrystalline nickel micro-tension test. (2013). Link

  31. Johns Hopkins University. Johns Hopkins University Data Management Services. Link

  32. labarchives. Lab Archives. Link

  33. Figshare. Figshare. Link

  34. Computational Materials Data Network. Computational Materials Data Network. Link

  35. Tim Austin, Chris Bullough, Dimitri Gagliardi, David Leal, Malcolm Loveday. Prenormative Research into Standard Messaging Formats for Engineering Materials Data. International Journal of Digital Curation 8, 5–13 Edinburgh University Library, 2013. Link

  36. Georg J Schmitz, Ulrich Prahl. ICMEg – the Integrated Computational Materials Engineering expert group – a new European coordination action. Integrating Materials and Manufacturing Innovation 3, 2 Springer Science + Business Media, 2014. Link

  37. John Rumble. E-Materials Data. Standardization News ASTM International, 2014. Link

  38. Michael A Jackson, Michael A Groeber, Michael D Uchic, David J Rowenhorst, Marc De Graef. h5ebsd: an archival data format for electron back-scatter diffraction data sets. Integrating Materials and Manufacturing Innovation 3, 4 Springer Science + Business Media, 2014. Link

  39. Carelyn E Campbell, Ursula R Kattner, Zi-Kui Liu. The development of phase-based property data using the CALPHAD method and infrastructure needs. Integrating Materials and Manufacturing Innovation 3, 12 Springer Science + Business Media, 2014. Link

  40. Michael C. Whitlock. Private Communication with M. Whitlock, U. British Columbia. (2012).

  41. Paul E. Uhlir, Rapporteur; Board on Research Data and Information; Policy and Global Affairs; National Research Council. For Attribution – Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop. The National Academies Press, 2012. Link

  42. CODATA-ICSTI Task Group on Data Citation Standards and Practices. Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data. Data Science Journal 12, CIDCR1-CIDCR75 (2013).

  43. Joint Declaration of Data Citation Principles. Data Citation Synthesis Group Link

  44. DataCite. DataCite Joint Statement. (2012). Link

  45. Data Readiness Levels. (2013). Link

  46. Robert T. Downs, Michelle Hall-Wallace. The American Mineralogist crystal structure database. American Mineralogist 88, 247-250 Mineralogical Society of America, 2003.

  47. Paul F. Uhlir, Rapporteur; Board on Research Data and Information; Policy and Global Affairs; National Research Council. The Future of Scientific Knowledge Discovery in Open Networked Environments: Summary of a Workshop. The National Academies Press, 2012. Link

  48. Charles H. Ward. Implications of Integrated Computational Materials Engineering with Respect to Export Control. Air Force Research Laboratory, 2013. Link

  49. Creative Commons. Creative Commons License 3.0. Link
