This replacement is justified by the fact that the MUE conveys information similar to the root mean square error (which, up to a constant in the definition, is the standard deviation of the distribution of the errors), but is usually preferred in the DFT literature as an indicator of functional performance. While Goerigk et al. warned against using their weighted MUE indicators as estimates of the statistical error for specific chemical problems,\cite{goerigk_look_2017} the connection with the coefficient of variation makes the wMUE useful beyond classification purposes, as a balanced measure of the empirical risk of a functional, as demonstrated by the results presented below. It is important to keep in mind, however, in accordance with Goerigk et al.'s suggestion, that weighted MUE values for different databases should never be compared in absolute terms, because they intrinsically depend on the molecules included in each database, and their main purpose is to provide a basic criterion for the ranking of functionals.
\(f(n,p)\) in eq. \ref{eqn:9} is an additive penalty function that depends on the number of training data, \(n\), and on the degrees of freedom (number of free parameters) of the fitted function, \(p\). Assuming a Gaussian distribution of the errors, the penalty function can be calculated for regression models as:
\(\begin{equation}\label{eqn:13} f\left(n,p\right)=\frac{2p}{n}\sigma^{2}, \end{equation}\)
with:
\(\begin{equation}\label{eqn:14}\sigma^{2}=\frac{n}{n-p}R_{\text{emp}}\ ,\end{equation}\)
Substituting eq. \ref{eqn:14} into eq. \ref{eqn:13}, and identifying the empirical risk \(R_{\text{emp}}\) with the wMUE, gives the final formula for the AIC:
\(\begin{equation}\label{eqn:15}\text{AIC}=w\text{MUE}\cdot\left(1+\frac{2p}{n-p}\right). \end{equation}\)
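As a purely numerical illustration of eq. \ref{eqn:15}, the short Python sketch below computes the AIC from a functional's wMUE, the number of data points \(n\), and the number of degrees of freedom \(p\); the error values used in the example are hypothetical and only serve to show how the penalty grows with \(p\).

\begin{verbatim}
def aic(wmue, n, p):
    # Eq. 15: the empirical risk (wMUE) inflated by the additive
    # penalty 2p/(n - p) obtained by combining eqs. 13 and 14.
    return wmue * (1.0 + 2.0 * p / (n - p))

# Hypothetical wMUE values with n = 200 (ASCDB):
print(aic(wmue=4.0, n=200, p=1))   # non-fitted functional, DoF = 1
print(aic(wmue=3.5, n=200, p=30))  # heavily parameterized functional
\end{verbatim}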
Among the second class of methods, the Vapnik–Chervonenkis criterion (VCC)\cite{vapnik_uniform_1971} can be selected. This criterion inflates the empirical risk by a multiplicative penalty function related to Vapnik's measure:
\(\begin{equation}\label{eqn:16}\text{VCC}=w\text{MUE}\cdot\left(1-\sqrt{p-p\ln p+\frac{\ln n}{2n}}\right)^{-1}. \end{equation}\)
For both AIC and VCC, \(n=200\) when evaluated using ASCDB, while \(p\) is an estimate of the number of degrees of freedom (DoF), taken as the number of fitted parameters for fitted functionals and set to 1 for non-fitted ones (Table \ref{182001}).
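A corresponding sketch for eq. \ref{eqn:16} is shown below. Note that in Cherkassky's practical form of Vapnik's measure the quantity under the square root is evaluated with the ratio of the DoF to the number of data points, and this convention is assumed here; the numerical values are again hypothetical.

\begin{verbatim}
import math

def vcc(wmue, n, dof):
    # Eq. 16: the empirical risk (wMUE) inflated by the multiplicative
    # penalty based on Vapnik's measure. The ratio p = dof/n is assumed
    # under the square root (Cherkassky's convention; an assumption).
    p = dof / n
    factor = 1.0 - math.sqrt(p - p * math.log(p) + math.log(n) / (2.0 * n))
    return wmue / factor

# Hypothetical wMUE values with n = 200 (ASCDB):
print(vcc(wmue=4.0, n=200, dof=1))   # non-fitted functional, DoF = 1
print(vcc(wmue=3.5, n=200, dof=30))  # heavily parameterized functional
\end{verbatim}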
The definition of a resampling criterion for xc functionals is unfortunately not as straightforward, since in several cases it might be difficult to find data that can be used as an external, unbiased validation set for cross-validation methods. As such, cross-validation criteria are intrinsically dependent on the data set that is used to obtain them,\cite{geisser_predictive_1993,devijver_pattern_1982} and particular effort has to be devoted to creating a criterion that is representative and transferable across as many functionals and data sets as possible. The purpose of cross-validation is, in practice, to highlight inconsistencies in the treatment of external data, when compared to the data that are used for the training of the parameters. Therefore, every overfitted model presents a large difference between the errors for the training set and those for the validation set.

The major hurdle in the evaluation of xc functionals from different sources and development philosophies is to find two appropriate and independent sets of data that can function as a “training set” and as a “validation set”. On the one hand, the first 12 subsets of ASCDB include chemical systems that are conventionally used to train and evaluate computational methods. While none of the existing functionals is specifically trained on all molecules of these subsets, most modern fitted functionals were trained on similar systems (e.g., the Minnesota and \(\omega\)B97 families), and even non-fitted ones (e.g., revTPSS and SCAN) have undergone a convergent evolution that provides at least reasonable results for these basic chemical systems. On the other hand, the last four subsets of ASCDB contain unconventional systems that are very far from what current functionals have been designed or trained for, and they therefore represent a good data set for validation. (The three main subsets in this category come from the mindless benchmark database of Grimme and coworkers, while the last one includes the energies of atoms on a per-electron basis.) A simple cross-validation measure of the overfitting of a functional can then be obtained from the ratio between the MUE of the unconventional subsets (\(\text{MUE}_{\text{UC}}\)), used as a “validation set”, and the overall wMUE of ASCDB, used as the “training set”. Interpreting this last quantity as a cross-validation estimate of the unknown noise variance of the distribution of the errors, \(\sigma^{2}\), the cross-validation criterion (CVC) can then be calculated by inflating the empirical risk using eqs. \ref{eqn:9} and \ref{eqn:13}, as:
\(\begin{equation}\label{eqn:17}\text{CVC}=w\text{MUE}+\frac{2h}{n}\frac{\text{MUE}_{\text{UC}}}{w\text{MUE}}, \end{equation}\)
where the wMUE of the full ASCDB database is used in the denominator in place of the MUE (or weighted MUE) of the first twelve subsets of ASCDB, because numerical evidence showed no significant differences in the rankings when this substitution was performed. Apart from giving a much simpler formula for CVC, the main advantage of using the weighted MUE of the entire database is that eq. \ref{eqn:17} also becomes extensible to other databases. For example, a straightforward extension of CVC to the GMTKN55 database is obtained by replacing the wMUE with the overall WTMAD-2 and the \(\text{MUE}_{\text{UC}}\) with the MUE (MAD in Goerigk et al.'s notation) of the mindless benchmark subset:
\(\begin{equation}\label{eqn:18}\text{CVC}^{\text{GMTKN55}}=\text{WTMAD-2}+\frac{2h}{n}\frac{\text{MAD}_{\text{MB16-43}}}{\text{WTMAD-2}}.\end{equation}\)
As in the wMUE case discussed above, it is important to keep in mind that, despite providing very similar rankings, CVC values from different databases are difficult to compare in absolute terms, because they intrinsically depend on the molecules that are included in each database.
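To make the use of eqs. \ref{eqn:17} and \ref{eqn:18} concrete, the sketch below evaluates the CVC from the overall weighted error of a database and the error on its validation subsets; all error values and the DoF \(h\) are hypothetical and only illustrate the mechanics of the formula.

\begin{verbatim}
def cvc(overall_error, validation_error, n, h):
    # Eqs. 17 and 18: the overall weighted error of the database is
    # inflated by an additive penalty built from the ratio between the
    # error on the validation subsets and the overall weighted error.
    return overall_error + (2.0 * h / n) * (validation_error / overall_error)

# Hypothetical ASCDB example (eq. 17): wMUE and MUE_UC, with n = 200.
print(cvc(overall_error=4.0, validation_error=6.5, n=200, h=30))

# Hypothetical GMTKN55 example (eq. 18): WTMAD-2 and the MAD of the
# MB16-43 (mindless benchmark) subset, with n = 1505 relative energies.
print(cvc(overall_error=5.0, validation_error=9.0, n=1505, h=30))
\end{verbatim}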