Abstract:
Ecologists often rely on observational data to understand causal relationships. Although observational causal inference methodologies exist, model selection based on information criteria (e.g., AIC) remains a common approach for understanding ecological relationships. However, such approaches are designed for predictive inference and are not appropriate for drawing causal conclusions. Here, we highlight the distinction between predictive and causal inference and show how model selection techniques can lead to biased causal estimates. Instead, we encourage ecologists to apply the backdoor criterion, a graphical rule for identifying which variables must (and must not) be adjusted for when estimating causal effects from observational data.
As ecologists, we are often interested in answering causal questions about human impacts on the natural world, such as the effect of climate-induced bleaching events on coral reef ecosystems (e.g., Graham et al. 2015), the impact of deforestation on biodiversity (e.g., Brook et al. 2003), or the effect of conservation and management responses on restoring ecosystem services (e.g., Sala et al. 2018). Often, randomized controlled experiments are infeasible, and ecologists instead rely on observational data to answer fundamental causal questions in ecology (MacNeil 2008). Recently, advances in technology such as remote sensing and animal-borne sensors, as well as the increased availability of citizen science and electronic data, have further expanded opportunities to answer causal questions from observational data (Sagarin and Pauchard 2010).
In recent years, researchers have advocated for the increased application of causal inference in ecology to infer cause-and-effect relationships from observational data (e.g., Larsen et al. 2019; Laubach et al. 2021), but these approaches have yet to be widely adopted. Instead, drawing causal conclusions from observational data is typically taboo, with Pearson's oft-cited "correlation doesn't equal causation" used to block attempts to do so (Glymour 2009). This misconception, that causality cannot be inferred using observational data, has resulted in a culture in which ecologists who depend on observational data to understand causal relationships avoid acknowledging the causal goal of their research and instead use coded language that implies causality without explicitly saying so (Hernán 2018; Arif et al. 2021).
A common strategy for understanding ecological relationships is to apply model selection using information criteria such as Akaike's information criterion (AIC; Akaike 1973). Such approaches select the 'best' model among a candidate set and subsequently draw inferences from parameters of ecological interest within the winning model. Often, these inferences are couched in causal language, implying that once the best model has been selected, its parameters can be interpreted causally (Table 1). However, model selection is not a valid method for inferring causal relationships; rather, these techniques aim to select the best model for predicting a response variable of interest. For example, AIC approximates a model's out-of-sample predictive accuracy using only within-sample data (Akaike 1973). Although numerous model selection criteria exist (e.g., BIC, Schwarz 1978; DIC, Spiegelhalter et al. 2002; WAIC, Watanabe 2013; LOO-CV, Vehtari et al. 2017), they all compare models based on predictive accuracy (McElreath 2020; Laubach et al. 2021; Tredennick et al. 2021). Thus, model selection is appropriate for predictive inference (i.e., which model best predicts Y?), which is fundamentally distinct from causal inference (i.e., what is the effect of X on Y?).
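Concretely, for a model with $k$ estimated parameters and maximized likelihood $\hat{L}$, Akaike (1973) defines

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

where lower values indicate better estimated out-of-sample predictive accuracy. The $-2\ln\hat{L}$ term rewards within-sample fit and the $2k$ term penalizes complexity; nothing in this trade-off refers to the causal structure that generated the data.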
To demonstrate this distinction, the directed acyclic graph (DAG) in Figure 1 shows the causal structure of a hypothetical ecological system. DAGs can be used to visualize causal relationships: variables (nodes) are connected by directed arrows pointing from cause to effect (Elwert 2013). For example, forestry affects species Y both directly (there is a directed arrow between them) and indirectly, via the directed arrow from forestry to species A and from species A to species Y (Figure 1). To illustrate the difference between model selection and causal inference, we created a simulated dataset matching the linear causal structure of this DAG, setting the total (i.e., direct plus indirect) causal effect of forestry on species Y to -0.75 (Appendix S1). We then specified candidate linear regression models comprising all possible covariate combinations, with species Y as the response. Using our simulated data and candidate models, both AIC and BIC selected a 'best' model in which forestry, species A, human gravity, climate, and invasive species Z were included as covariates (Appendix S1). However, interpreting the coefficients of this model yields biased causal estimates: the estimated effect of forestry on species Y is -0.36 [-0.38, -0.33], rather than the true value of -0.75 (Appendix S1).
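To make this concrete, below is a minimal sketch in Python of a simulation with the same qualitative causal structure. All variable names and coefficients here are illustrative assumptions rather than the parameterization of Appendix S1 (only the total effect of -0.75 is matched), so the biased estimate it returns differs numerically from -0.36.

```python
import itertools

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000

# Exogenous drivers (assumed independent standard normals in this sketch)
human_gravity = rng.normal(size=n)
climate = rng.normal(size=n)
forestry = rng.normal(size=n)

# Mediator: forestry -> species A
species_a = 0.5 * forestry + rng.normal(size=n)

# Species Y: direct effect of forestry = -0.5; indirect effect via
# species A = 0.5 * -0.5 = -0.25; total effect = -0.75, as in the text
species_y = (-0.5 * forestry - 0.5 * species_a
             + 0.4 * human_gravity + 0.4 * climate
             + rng.normal(size=n))

# Collider: invasive species Z is caused by BOTH forestry and species Y
invasive_z = 0.6 * forestry + 0.6 * species_y + rng.normal(size=n)

covariates = {
    "forestry": forestry,
    "species_a": species_a,
    "human_gravity": human_gravity,
    "climate": climate,
    "invasive_z": invasive_z,
}

# Fit every covariate combination that includes forestry and keep the
# AIC-best model, mimicking the model-selection exercise described above
best_fit, best_names = None, None
for k in range(1, len(covariates) + 1):
    for names in itertools.combinations(covariates, k):
        if "forestry" not in names:
            continue
        X = sm.add_constant(np.column_stack([covariates[v] for v in names]))
        fit = sm.OLS(species_y, X).fit()
        if best_fit is None or fit.aic < best_fit.aic:
            best_fit, best_names = fit, names

print(best_names)  # the AIC-best model includes the collider invasive_z
beta = best_fit.params[1 + best_names.index("forestry")]  # skip intercept
print(beta)  # biased relative to the true total effect of -0.75
```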
In this scenario, there are two statistical biases at play. The first is overcontrol bias, which occurs when the inclusion of intermediate variables along a causal pathway removes the indirect causal effect between predictor and response (Cinelli et al. 2021). Here, the inclusion of the intermediate variable species A removes the indirect effect between forestry and species Y. Second, the inclusion of invasive species Z as a covariate leads to collider bias, which can result from adjusting for a variable that is caused by both predictor and response (Cinelli et al. 2021). Here, the inclusion of invasive species Z induces an additional, but non-causal, association between forestry and species Y.
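Continuing the illustrative sketch above, each bias can be isolated by adjusting for the mediator and the collider one at a time:

```python
# Overcontrol bias: adjusting for the mediator species A blocks the
# indirect path, so the forestry coefficient approaches the direct
# effect (-0.5 in this parameterization), not the total effect (-0.75)
X = sm.add_constant(np.column_stack([forestry, species_a]))
print(sm.OLS(species_y, X).fit().params[1])

# Collider bias: adjusting for invasive species Z, a common effect of
# forestry and species Y, opens a non-causal path and distorts the
# forestry coefficient away from both the direct and the total effect
X = sm.add_constant(np.column_stack([forestry, invasive_z]))
print(sm.OLS(species_y, X).fit().params[1])
```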
It is worth noting that although the true predictive model (i.e., the data-generating model for species Y, in which all direct predictor variables, namely human gravity, species A, forestry, and climate, were included as covariates) was among the candidate models, both AIC and BIC selected a more complex model with invasive species Z as a covariate. Even though invasive species Z is not a predictor variable for species Y, its statistical (non-causal) association with species Y improved within-sample fit enough to yield better estimated out-of-sample predictive accuracy. Indeed, non-causal associations, including those arising from collider bias and reverse causation, have been shown to increase predictive accuracy (e.g., Luque-Fernandez et al. 2019; Griffith et al. 2020). Thus, a model selected on the basis of predictive accuracy should not be assumed to be causally accurate.
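This can also be checked in the sketch: adding the collider to the data-generating predictor set lowers (improves) AIC even though the resulting model is causally misleading.

```python
# Adding the collider to the data-generating predictors improves AIC
direct_predictors = [human_gravity, species_a, forestry, climate]
fit_true = sm.OLS(species_y,
                  sm.add_constant(np.column_stack(direct_predictors))).fit()
fit_coll = sm.OLS(species_y,
                  sm.add_constant(np.column_stack(direct_predictors
                                                  + [invasive_z]))).fit()
print(fit_coll.aic < fit_true.aic)  # True: AIC prefers the causally wrong model
```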
A more subtle point is that even if a model captures the data-generating process for a response variable of interest, it may not be appropriate for answering specific causal queries. For example, if we want to know the total effect of forestry on species Y, a model with all direct predictor variables (human gravity, species A, forestry, and climate) included as covariates returns a causal estimate of -0.21 [-0.23, -0.18] instead of -0.75 (Appendix S1). Here, the inclusion of species A as a covariate leads to overcontrol bias between forestry and species Y, removing the indirect effect. Moreover, this model cannot be used to determine the causal estimates of other, more distal drivers, such as climate or fire. Ultimately, causal models must be built around the specific causal question at hand, with careful consideration of the overall causal structure, including how different predictor variables may relate to one another.
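In the illustrative sketch, the total effect is recovered by a model that omits both the mediator and the collider; forestry has no parents in the assumed DAG, so no backdoor adjustment is needed in this particular simulation.

```python
# Total effect of forestry on species Y: leave out the mediator
# (species A) and the collider (invasive species Z). Human gravity and
# climate are included only to absorb residual variance, not to remove
# bias, since they are independent of forestry in this simulation
X = sm.add_constant(np.column_stack([forestry, human_gravity, climate]))
print(sm.OLS(species_y, X).fit().params[1])  # ~ -0.75
```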