This replacement is justified by the fact that the MUE conveys information
similar to the root mean square error (which, up to a constant in its
definition, is the standard deviation of the distribution of the errors),
but it is usually preferred in the DFT literature as an indicator of
functional performance. While Goerigk et al. warned not to use their
weighted MUE indicators as an estimation of statistical error for
specific chemical problems,\cite{goerigk_look_2017} the connection with the coefficient
of variation makes wMUE useful beyond classification purposes as a
balanced measure of the empirical risk of a functional, as demonstrated
by the results presented below. It is important to keep in mind though
that—in accordance with Goerigk et al.’s suggestion—weighted MUE
values for different databases should never be compared in absolute
terms, because they intrinsically depend on the molecules that are
included in each database, and their main purpose is to provide a basic
criterion for the ranking of functionals.
The term \(f(n,p)\) in Eq. \ref{eqn:9} is an additive penalty function that depends on the
number of training data, \(n\), and on the degrees of freedom (number of
free parameters) of the fitted function, \(p\). Assuming a Gaussian
distribution of the errors, the penalty function can be calculated for
regression models as:
\(\begin{equation}\label{eqn:13} f\left(n,p\right)=\frac{2p}{n}\sigma^{2}, \end{equation}\)
with:
\(\begin{equation}\label{eqn:14}\sigma^{2}=\frac{n}{n-p}R_{\text{emp}}\ ,\end{equation}\)
resulting in a final formula for AIC that is:
\(\begin{equation}\label{eqn:15}\text{AIC}=w\text{MUE}\cdot\left(1+\frac{2p}{n-p}\right). \end{equation}\)
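This expression follows directly from Eqs. \ref{eqn:9}, \ref{eqn:13}, and \ref{eqn:14}: once the empirical risk \(R_{\text{emp}}\) is identified with the wMUE, the additive penalty becomes \(\frac{2p}{n}\sigma^{2}=\frac{2p}{n}\cdot\frac{n}{n-p}\,w\text{MUE}=\frac{2p}{n-p}\,w\text{MUE}\).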
Among the second class of methods, the Vapnik–Chervonenkis criterion
(VCC) \cite{vapnik_uniform_1971} can be selected. This criterion inflates the empirical risk by a multiplicative penalty
function related to Vapnik’s measure:
\(\begin{equation}\label{eqn:16}\text{VCC}=w\text{MUE}\cdot\left(1-\sqrt{p-p\ln p+\frac{\ln n}{2n}}\right)^{-1}. \end{equation}\)
For both AIC and VCC, \(n=200\) when evaluated using ASCDB, while \(p\) is an estimate of the degrees of freedom (DoF): it is equal to the number
of fitted parameters for fitted functionals, and it is set to 1 for non-fitted ones (Table \ref{182001}).
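As a purely illustrative example (the value \(p=30\) is hypothetical and does not refer to any specific functional), Eq. \ref{eqn:15} inflates the wMUE of a heavily parametrized functional with \(p=30\) by a factor of \(1+\frac{60}{170}\approx 1.35\) on ASCDB (\(n=200\)), whereas a non-fitted functional with \(p=1\) is penalized only by a factor of \(1+\frac{2}{199}\approx 1.01\).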
The definition of a resampling criterion for xc functionals is unfortunately not as
straightforward, since in several cases it might be difficult to find
data that can serve as an external, unbiased validation set for
cross-validation methods. As such, cross-validation criteria are
intrinsically dependent on the data set that is used to obtain them \cite{geisser_predictive_1993,devijver_pattern_1982}, and
particular effort has to be devoted to creating a criterion that is
representative and transferable across as many functionals and data sets
as possible. The purpose of cross-validation is, in practice, to
highlight inconsistencies in the treatment of external data, when
compared to the data that are used for the training of the parameters.
Therefore, every overfitted model presents a large difference between the
errors for the training set and those for the validation set. The major
hurdle in the evaluation of xc functionals from different sources
and development philosophies is to find two appropriate and independent
sets of data that can function as a “training set” and as a
“validation set”. On the one hand, the first 12 subsets of ASCDB
include chemical systems that are conventionally used to train and
evaluate computational methods. While none of the existing functionals
is specifically trained on all molecules of these subsets, most of the
modern fitted functionals were trained on similar systems (e.g. the
Minnesota and wB97 families), and even non‑fitted ones (e.g. revTPSS and
SCAN) have been subject to convergent evolution to provide at least
reasonable results for those basic chemical systems. On the other hand,
the last four subsets of ASCDB contain unconventional systems that are
very far from what current functionals have been designed or trained for
and represent a good dataset for validation. (The three main subsets in
this category come from the mindless benchmark database of Grimme and
coworkers, while the last one includes the energies of atoms on a
per-electron basis.) A simple cross-validation measurement of the
overfitting of a functional can then be obtained from the ratio between
the MUE of the unconventional subsets (used as a “validation set”) and
the overall wMUE of ASCDB (used as the “training set”).
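In symbols, denoting by \(\text{MUE}_{\text{UC}}\) the MUE over the four unconventional subsets (the same notation used in Eq. \ref{eqn:17} below), this ratio can be written compactly as \(\hat{\sigma}^{2}=\text{MUE}_{\text{UC}}/w\text{MUE}\).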
Interpreting this last quantity as a cross-validation estimate of the
unknown noise variance of the distribution of the errors, \(\sigma^{2}\), the cross-validation criterion (CVC) can then be
calculated by inflating the empirical risk using eqs. \ref{eqn:9} and \ref{eqn:13}, as:
\(\begin{equation}\label{eqn:17}\text{CVC}=w\text{MUE}+\frac{2p}{n}\frac{\text{MUE}_{\text{UC}}}{w\text{MUE}}, \end{equation}\)
where the wMUE of the full ASCDB database is used in the denominator in place of the MUE (or weighted MUE) of the first twelve subsets of ASCDB, because numerical evidence showed no significant
differences in the rankings when this substitution was made. Apart from providing a much simpler formula to
calculate CVC, the main advantage of using the weighted MUE of the
entire database is that eq. \ref{eqn:17} also becomes extensible to other
databases. For example, a straightforward extension of CVC to the
GMTKN55 database is obtained by using the overall WTMAD-2 in place of
the wMUE, and the MUE (MAD, in Goerigk et al.’s notation) of the
mindless benchmark subset, MB16-43, in place of \(\text{MUE}_{\text{UC}}\):
\(\begin{equation}\label{eqn:18}\text{CVC}^{\text{GMTKN55}}=\text{WTMAD-2}+\frac{2p}{n}\frac{\text{MAD}_{\text{MB16-43}}}{\text{WTMAD-2}}.\end{equation}\)
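Both expressions behave in the same way. As a purely hypothetical numerical illustration of Eq. \ref{eqn:17} (none of these values refers to an actual functional): a functional with \(w\text{MUE}=3.0\), \(\text{MUE}_{\text{UC}}=9.0\), and \(p=30\) evaluated on ASCDB (\(n=200\)) would obtain \(\text{CVC}=3.0+\frac{60}{200}\cdot\frac{9.0}{3.0}=3.9\), showing how a validation error three times larger than the overall wMUE translates directly into a sizeable penalty.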
As for the wMUE case discussed before, it is important to keep in mind
that, despite providing very similar rankings, CVC values from different
databases are difficult to compare in absolute terms because they
intrinsically depend on the molecules that are included in each
database.