Introduction

Scientific experiments produce data that lead to the scientific conclusions generally published in articles. Although new data are constantly being produced (at an accelerating rate [ref]), it is important to be able to return to old data and repeat the analysis. There are several reasons for this, of which only a few are listed here:
  1. Checking that the analysis was done well in the first place. This is especially important when new experiments do not fully agree with the original conclusions.
  2. Using the old data as a reference, in order to check how data produced by a newly developed methodology compares with it.
  3. Using the old data for new types of analyses that were not originally thought of at the time.
  4. Using the old data as part of a larger meta-analysis that aggregates data from many sources.
  5. Using the data as a basis for student exercises.
In order to facilitate this, the original data, the original analysis code, and the metadata about the data must all be available.
In non-trivial cases it is not possible to store all the metadata. In particular, one cannot record why each particular choice was made in the analysis, although such details can be of great importance for judging its validity. Suitable metadata would allow an expert to check what has been done, repeat the analysis, and verify the conclusions. If some aspect is not understood (for example, the choice of a particular filter to remove noise), a thorough repeat analysis would test whether that arbitrary choice was crucial or whether the result is robust with respect to it.
If one strives for the goal that all published results are accompanied by the full datasets and sufficient metadata to repeat the experiment or the analysis, then the practices are probably good enough to support the development of science. In that case, however, datasets that have not yet been analyzed and published may be neglected.

What to store

We approach this question from the point of view of experimental X-ray techniques, where the properties of some form of material have been studied using X-rays. There are at least three categories of metadata that need to be stored:
  1. Metadata describing the sample
  2. Metadata about the experimental procedure
  3. Metadata about the analysis
Finally, in many cases several different kinds of studies will be performed on the same sample, using X-rays and other techniques, and the links between these studies must also be stored.
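The three metadata categories and the links between studies could be captured in a simple structured record. The following is a minimal sketch only; all field names and identifiers are hypothetical examples, not an established standard.

```python
import json

# Hypothetical metadata record for one study on one sample.
record = {
    "sample_id": "S-2014-0042",                  # unambiguous sample identifier
    "sample_description": "thin-film oxide on Si substrate",
    "experiment": {                              # experimental procedure
        "technique": "X-ray diffraction",
        "instrument": "laboratory diffractometer",
    },
    "analysis": {                                # analysis provenance
        "macro_commit": "abc1234",               # commit of the versioned macros
    },
    # Links to other studies performed on the same sample,
    # by their (hypothetical) record identifiers.
    "related_studies": ["S-2014-0042/SAXS-01", "S-2014-0042/AFM-01"],
}

print(json.dumps(record, indent=2))
```

A record like this can be searched by sample description and followed across techniques via the related-study links.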
With such information it would probably be possible to repeat the complete experiment and analysis. However, this amounts to more information than fits into a metadata field of a data repository; instead, a log book would be needed for each sample or sample set, describing in detail and in chronological order all the steps taken to arrive at the results quoted in the publication. At present it is customary for each participant in a research project to record (hopefully) what they have done with the sample, what the results are, and how they have analyzed the data. Keeping a full log summarizing all the steps has not been possible in practice, at least in the research collaborations the author has worked in.
Later in this document we discuss ideas for unifying the metadata and logs so that everything is in one place; at this point we outline a strategy for storing the metadata of a particular experimental technique, so that this particular piece of the work can be repeated or the results re-analyzed.

Metadata for a given experiment with X-rays

Identification of the sample. The sample must be identified in such a way that it can be relocated unambiguously. The metadata could optionally contain a brief description of the sample, so that it can be used to search for experiments done on certain types of samples. Such descriptions are difficult to standardize.
Experimental parameters. At the first level, these parameters describe the experimental apparatus and settings in such detail that a person familiar with the particular instrument can repeat the experiment. This first level also ensures that the person doing the analysis can make correct choices about the required analysis steps. At the second level, the description should be detailed enough that a person unfamiliar with the instrument can plan a similar experiment on another instrument (for example, estimate the signal-to-noise ratio or the required resolution) or analyze the data.
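As an illustration of the second level: if the stored parameters include the photon count per detector pixel, an outsider can estimate whether another instrument could match the signal-to-noise ratio. The sketch below assumes a purely Poisson-limited measurement, which is a simplification.

```python
import math

def estimate_snr(photons_per_pixel: float) -> float:
    """For Poisson counting statistics, SNR = N / sqrt(N) = sqrt(N)."""
    return math.sqrt(photons_per_pixel)

# With 10^4 photons per pixel recorded in the metadata, a comparable
# experiment needs to reach SNR of about 100.
print(estimate_snr(10000.0))
```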
Analysis macros. The analysis macros should be stored in a version-controlled system, and a link to the commit of the version that was used should be stored along with the actual data. In more extreme cases the whole computational environment should be stored as a virtual machine. When the general macros are updated, a non-regression test could be run against the old version; if the results match, this newer compatible version could additionally be recorded in the metadata (automatically). If the results do not match, the validity of the old macros may be called into question and should be investigated.
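The automatic non-regression check could work roughly as follows. This is a sketch under the assumption that the analysis is a deterministic function from raw data to a result; `run_analysis` is a hypothetical placeholder for the real versioned macro, and the archived result is compared via a hash digest.

```python
import hashlib

def run_analysis(raw: bytes) -> bytes:
    # Placeholder for the real, version-controlled analysis macro.
    return raw[::-1]

def results_match(raw: bytes, archived_digest: str) -> bool:
    """Re-run the analysis and compare against the archived result digest."""
    new_digest = hashlib.sha256(run_analysis(raw)).hexdigest()
    return new_digest == archived_digest

# When the result was first published, its digest was archived:
raw = b"detector frame 001"
archived = hashlib.sha256(run_analysis(raw)).hexdigest()

# After updating the macros, the check either confirms compatibility
# (record the new commit in the metadata) or flags a discrepancy.
print(results_match(raw, archived))
```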
Human input in the analysis. Human input is important in many cases, such as choosing a threshold, creating a mask of the good portion of the data, or indicating the position of the direct beam. These are fundamental parameters required in the analysis. They should be stored along with the original data, and the analysis macros should be able to read this input together with the actual data, so that a repeat analysis does not require the human input to be given again.
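Storing the interactive choices could be as simple as writing them to a file next to the raw data and having the macros reload it. The parameter names and file name below are hypothetical examples, not a fixed format.

```python
import json
from pathlib import Path

# Interactive choices recorded once, during the original analysis.
human_input = {
    "threshold": 0.15,                         # chosen intensity threshold
    "mask_rectangles": [[10, 10, 200, 200]],   # good-data regions (x0, y0, x1, y1)
    "direct_beam_xy": [512.3, 487.9],          # direct-beam position on detector
}

path = Path("human_input.json")
path.write_text(json.dumps(human_input, indent=2))

# In a repeat analysis, the macro reads the stored input instead of
# asking the operator again.
reloaded = json.loads(path.read_text())
assert reloaded == human_input
```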