For each method below, the description and comments are followed by three summaries: threats to validity of causal inference, relevance of the evaluation evidence, and practical / logistical considerations.
Experimental
Randomized Controlled Trial (RCT) [Duflo et al., 2007]
RCTs are generally not feasible for network water infrastructure, as such interventions are clustered, directional, and designed to serve populations at scale or to address known (selected) system deficiencies. Some complementary interventions (e.g., information campaigns) can be evaluated with this approach. Smaller-scale rural infrastructure (e.g., condominial sewerage, village-scale piped water) can be evaluated with cluster RCTs or stepped-wedge RCTs.
Threats to validity:
- Confounding due to unbalanced randomization
- Spillovers (violation of the stable unit treatment value assumption, or SUTVA), whereby some units benefit as a result of other units' uptake
- Vulnerable to selective attrition

Relevance of evaluation evidence:
- Typically artefactual, with limited evaluation questions
- Treatment effect can be representative
- "Gold standard" for causal researchers
- Results are not conditioned by assumptions
- Statistical power is a design feature, but usually sufficient for a few pre-identified outcomes

Practical / logistical considerations:
- Cost: High, especially when powered for multiple outcomes or interventions
- Contamination risk: Moderate, as pressure to help "untreated" units increases over time
- Coordination: Mainly pertains to maintaining the integrity of randomization
- Interpretation: Intuitive and highly transparent
- Pre-intervention data needs: Low to none
- Flexibility to adapt: Very low
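Where a cluster RCT is feasible (e.g., village-scale piped water), the basic estimator is a difference in means between randomized arms. A minimal sketch on simulated data; all variable names, outcome definitions, and parameter values (including the assumed treatment effect of -0.5) are illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated cluster RCT: 40 villages, half randomized to a piped-water arm.
n_clusters, n_per_cluster = 40, 25
treated = rng.permutation(np.repeat([0, 1], n_clusters // 2))

# Outcome (e.g., child diarrhea days/month) with a village random effect
# and an assumed true treatment effect of -0.5.
cluster_effect = rng.normal(0.0, 0.3, n_clusters)
y = np.concatenate([
    rng.normal(2.0 - 0.5 * t + u, 1.0, n_per_cluster)
    for t, u in zip(treated, cluster_effect)
])
arm = np.repeat(treated, n_per_cluster)

# Difference-in-means ATE estimate; inference should use cluster-robust
# standard errors, which are omitted here for brevity.
ate_hat = y[arm == 1].mean() - y[arm == 0].mean()
print(round(ate_hat, 2))
```

Because assignment is randomized at the cluster level, no modeling assumptions beyond the design itself are needed for this comparison, which is what the table means by "results are not conditioned by assumptions."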
Experimental encouragement design [Katz et al., 2001]
Subsidies or other assistance to customers can generate exogenous variation in the take-up of infrastructure connections, which can be used as an instrumental variable to isolate impacts. The resulting local average treatment effect is specific to those who respond to the encouragement [Heckman et al., 2006].
Threats to validity: Same as for RCTs.

Relevance of evaluation evidence: Same as for RCTs, except that the treatment effect applies only to the population that responds to the encouragement.
Quasi-experimental
Natural experiment [Angrist et al., 2002]
Some infrastructure placements are determined by geographic or other factors that are "as good as random" in determining exposure to improvements, providing researchers with "natural experiments" [Cerdá et al., 2012] that give rise to comparable treatment and control groups. Another version of this is an interrupted time series analysis, in which a time-dependent event (e.g., rehabilitation of one part of a water network) gives rise to a sharp change that affects some households but not others.
Threats to validity:
- Geographic or other factors that determine exposure may also confound outcomes
- Spillovers (i.e., violation of SUTVA) outside of the treatment area

Relevance of evaluation evidence:
- Evidence arises directly from the real world
- Treatment effect is representative but contingent on the natural experiment's conditions
- Generally accepted by researchers
- Results are not conditioned by assumptions
- Statistical power: Difficult to anticipate ex ante

Practical / logistical considerations:
- Cost: Low to moderate, depending on data collection needs
- Contamination risk: Low
- Coordination: Moderate; mainly in combining with other methods (e.g., DiD) to strengthen validity
- Interpretation: Intuitive but not always transparent
- Pre-intervention data needs: Low to none
- Flexibility to adapt: Impossible
- Other: Natural experiments can be hard to anticipate
Difference-in-differences (DiD) [Card and Krueger, 2000]
In this approach, impacts are estimated by subtracting the trend in an unexposed sample, which represents the counterfactual, from that in an exposed sample. Such samples are created using variation in spatial targeting or other eligibility criteria, which are common for network water infrastructure extension or rehabilitation. The validity of the comparison relies on pre-treatment trends being similar across the groups, and can be enhanced using matching or econometric models that control for differences in baseline covariates.
Threats to validity:
- Confounding by time-varying unobservables
- Spillovers (i.e., violation of SUTVA)
- Vulnerable to selective attrition

Relevance of evaluation evidence:
- Evidence arises directly from the real world
- Treatment effect is usually representative (unless combined with other methods)
- Generally accepted by researchers, subject to showing parallel trends
- Results are not conditioned by assumptions
- Statistical power is a design feature

Practical / logistical considerations:
- Cost: Moderate to high, depending on data collection needs
- Contamination risk: Moderate to high
- Coordination: Moderate; mainly in combining with other methods (e.g., matching) to strengthen validity
- Interpretation: Intuitive and transparent
- Pre-intervention data needs: Moderate to high (parallel trends)
- Flexibility to adapt: Moderate
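The DiD estimator described above reduces to subtracting the control group's before/after change from the treated group's. A minimal sketch on simulated data; the groups, outcome, and parameter values (a common trend of +1.0 and an assumed intervention effect of +3.0 hours of supply per day) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups observed before and after a network rehabilitation.
# Both share a common time trend; only the exposed group receives the
# intervention effect (assumed +3.0 hours of supply/day).
n = 200
common_trend, true_effect = 1.0, 3.0
pre_ctrl  = rng.normal(10.0, 2.0, n)
post_ctrl = rng.normal(10.0 + common_trend, 2.0, n)
pre_trt   = rng.normal(8.0, 2.0, n)
post_trt  = rng.normal(8.0 + common_trend + true_effect, 2.0, n)

# DiD: the control-group change stands in for the counterfactual trend
# and is subtracted from the treated-group change.
did = (post_trt.mean() - pre_trt.mean()) - (post_ctrl.mean() - pre_ctrl.mean())
print(round(did, 2))
```

Note that the baseline levels differ across groups (10.0 vs 8.0); DiD tolerates level differences, but only because the parallel-trends assumption does the identifying work.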
Matching or synthetic control [Abadie and Gardeazabal, 2003; Rosenbaum and Rubin, 1985]
These methods are best combined with DiD analysis, but can also be used on their own to improve comparability when targeting is correlated with baseline characteristics. Various matching approaches enhance comparability by sampling untreated observations that approximate the treatment counterfactual. For example, propensity score matching (PSM) pairs treated and untreated observations that have a similar probability of being treated, estimated from a regression of participation on observables. Synthetic control uses a time series of pre-intervention observations to "train" an algorithm that identifies weights for a pool of untreated observations whose weighted combination reproduces the counterfactual trend of one or more treated units.
Threats to validity:
- Confounding by unobservables (violating the conditional independence assumption, or CIA), worse when match quality is low
- Spillovers (i.e., violation of SUTVA)

Relevance of evaluation evidence:
- Evidence arises directly from the real world
- Treatment effect only applies to units with suitable comparisons (the common support region)
- Researchers are often skeptical that the CIA has been met
- Results are conditioned by the assumptions of the matching algorithm
- Statistical power is a design feature

Practical / logistical considerations:
- Cost: Moderate to high, depending on data collection needs
- Contamination risk: High
- Coordination: Moderate; mainly in combining with other methods (e.g., DiD) to strengthen validity
- Interpretation: Intuitive, but matching may lack transparency
- Pre-intervention data needs: Moderate (matching)
- Flexibility to adapt: Moderate
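To illustrate how matching removes bias from targeting on observables, the sketch below uses one-nearest-neighbor matching on a single baseline covariate rather than a full PSM workflow; the data-generating process and all parameter values (including the assumed true effect of 1.5) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Targeting depends on a baseline covariate x (e.g., baseline poverty),
# so a raw treated/untreated comparison is confounded by x.
n = 500
x = rng.normal(0.0, 1.0, n)
treated = (x + rng.normal(0.0, 1.0, n)) > 0          # selection on x
y = 2.0 * x + 1.5 * treated + rng.normal(0.0, 1.0, n)  # assumed effect 1.5

# Naive comparison is biased upward because treated units have higher x.
naive = y[treated].mean() - y[~treated].mean()

# 1-nearest-neighbor matching on x: for each treated unit, find the
# untreated unit with the closest covariate value, then average the
# within-pair differences to estimate the effect on the treated (ATT).
x_c, y_c = x[~treated], y[~treated]
matches = np.abs(x[treated][:, None] - x_c[None, :]).argmin(axis=1)
att = (y[treated] - y_c[matches]).mean()
print(round(naive, 2), round(att, 2))
```

The matched estimate recovers something close to the assumed effect, while the naive difference does not; if an unobservable also drove both targeting and outcomes (a CIA violation), matching on x alone would not fix the bias.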
Instrumental variables (IV) [Angrist and Krueger, 2001]
An instrumental variable is a factor that predicts exposure to or participation in an intervention, but does not affect outcomes through channels other than its effect on participation. This creates exogenous variation in the intervention that can be leveraged to determine its impacts. The impact measure is a local average treatment effect (LATE): the effect of the intervention on those ("compliers") whose participation is shifted by the instrument. Program placement rules or constraints may give rise to valid instruments.
Threats to validity:
- Confounding: for many interventions and outcomes, there are few plausibly "exogenous" assignments of this type, at least in a statistical sense
- Spillovers (i.e., violation of SUTVA)

Relevance of evaluation evidence:
- Evidence arises directly from the real world
- Treatment effect (LATE) is not representative, and not always for the most relevant population
- Researchers are often skeptical about the exclusion restriction
- Results are conditioned by exogeneity assumptions
- Statistical power is often reduced by two-stage estimation

Practical / logistical considerations:
- Cost: Low to moderate, depending on data collection needs
- Contamination risk: Not applicable
- Coordination: Low
- Interpretation: Unintuitive and lacking transparency
- Pre-intervention data needs: Low
- Flexibility to adapt: High
- Other: A suitable IV may not exist
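With a single binary instrument, the IV estimate reduces to the Wald ratio: the reduced-form effect of the instrument on the outcome divided by its first-stage effect on take-up. A sketch on simulated data; the subsidy-offer instrument, the confounder, and all parameter values (including the assumed true effect of 2.0) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Unobserved wealth u confounds both connection take-up d and the
# outcome y; a randomized subsidy offer z shifts take-up but, by
# assumption, affects y only through d (the exclusion restriction).
n = 2000
u = rng.normal(0.0, 1.0, n)                   # unobserved confounder
z = rng.integers(0, 2, n)                     # instrument (subsidy offer)
d = ((0.8 * z + 0.5 * u + rng.normal(0.0, 1.0, n)) > 0.5).astype(float)
y = 2.0 * d + 1.0 * u + rng.normal(0.0, 1.0, n)  # assumed effect of d: 2.0

ols = np.polyfit(d, y, 1)[0]                  # biased upward by u

# Wald/IV estimator: reduced form over first stage.
iv = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(round(ols, 2), round(iv, 2))
```

The denominator is the share of compliers, which is why the estimate is a LATE for that subpopulation and why a weak first stage inflates the variance of the ratio.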
Regression discontinuity (RD) [Imbens and Lemieux, 2008; Thistlethwaite and Campbell, 1960]
RD exploits discontinuities in eligibility for an intervention with respect to an assignment variable, such as population thresholds or a poverty-line threshold for subsidy eligibility.
Threats to validity:
- Confounding: eligibility rule violations or manipulation, or "fuzzy" discontinuities that are difficult to characterize well
- Spillovers (i.e., violation of SUTVA)
- Vulnerable to selective attrition

Relevance of evaluation evidence:
- Evidence arises directly from the real world
- Treatment effect is limited to units very near the discontinuity
- Generally accepted by researchers
- Results are conditioned on proximity to the eligibility cutoff
- Statistical power may be limited

Practical / logistical considerations:
- Cost: Low to moderate, depending on data collection needs
- Contamination risk: Moderate, depending on the rigor with which eligibility is assessed
- Coordination: Low
- Interpretation: Intuitive, but transparency may be lacking due to the definition of the RD bandwidth
- Pre-intervention data needs: Low
- Flexibility to adapt: Low
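A common way to operationalize a sharp RD is to fit local linear regressions on each side of the cutoff within a bandwidth and take the gap between the two fitted values at the cutoff. A sketch on simulated data; the poverty-score cutoff, bandwidth, and all parameter values (including the assumed jump of 2.0) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sharp RD: a subsidy applies below a poverty-score cutoff of 0.
# The outcome trends smoothly in the score; treatment adds a jump of 2.0.
n = 2000
score = rng.uniform(-1.0, 1.0, n)
treated = score < 0
y = 5.0 + 1.5 * score + 2.0 * treated + rng.normal(0.0, 1.0, n)

# Local linear fit on each side of the cutoff within a bandwidth h;
# the RD estimate is the gap between the two intercepts at score = 0.
h = 0.3
left = (score > -h) & (score < 0)
right = (score >= 0) & (score < h)
b_left = np.polyfit(score[left], y[left], 1)
b_right = np.polyfit(score[right], y[right], 1)
rd_hat = np.polyval(b_left, 0.0) - np.polyval(b_right, 0.0)
print(round(rd_hat, 2))
```

Shrinking h reduces bias from curvature in the outcome but discards observations, which is the power/transparency trade-off the bandwidth choice creates.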
Other
Ex post regression
Statistical comparison of treated and untreated units, controlling for observed differences between the groups; also commonly called an "observational" comparison.
Threats to validity:
- Selection: units that participate are systematically different from those that do not
- Confounding by unobservables
- Spillovers (i.e., violation of SUTVA)

Relevance of evaluation evidence:
- Evidence arises directly from the real world
- Treatment effect is usually representative
- Causal researchers are typically highly skeptical of results
- Results are conditioned on the choice of controls
- Statistical power: Difficult to anticipate ex ante

Practical / logistical considerations:
- Cost: Low to moderate, depending on data collection needs
- Contamination risk: Not applicable
- Coordination: Low
- Interpretation: Intuitive, but transparency may be lacking (contingent on the choice of controls)
- Pre-intervention data needs: None
- Flexibility to adapt: High
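In practice this is an OLS regression of the outcome on a treatment indicator plus observed controls. A sketch on simulated data; the covariate, selection process, and parameter values (including the assumed true effect of 1.0) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Observational comparison: treated households differ in an observed
# covariate x. Regression adjusts for x, but any confounder omitted
# from the controls would still bias the treatment coefficient.
n = 1000
x = rng.normal(0.0, 1.0, n)
treated = (x + rng.normal(0.0, 1.0, n) > 0).astype(float)
y = 1.0 * treated + 2.0 * x + rng.normal(0.0, 1.0, n)  # assumed effect 1.0

# OLS of y on [intercept, treated, x] via least squares.
X = np.column_stack([np.ones(n), treated, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(beta[1], 2))
```

Here the adjustment works only because selection depends solely on the observed x; that fragility is why causal researchers are typically skeptical of such estimates.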
Counterfactual modeling [Balke and Pearl, 2013]
Complex water resources systems evolve stochastically according to both human and environmental influences. This approach leverages systems understanding from socio-hydrological or hydro-economic models to conduct “with” and “without” simulations of interventions, for construction of model-based comparisons [Srinivasan, 2015].
Threats to validity:
- Confounding by behavioral or other system-level factors not accounted for

Relevance of evaluation evidence:
- Evidence is artefactual; the model may diverge from real-world observations
- Treatment effect is usually representative, but may not align with policy-makers' priorities and needs
- Not widely used by causal social science researchers, who are wary of over-calibration
- Results are conditioned on model assumptions
- Statistical power: Not applicable

Practical / logistical considerations:
- Cost: Low
- Contamination risk: Not applicable
- Coordination: Low
- Interpretation: Not intuitive and not always transparent (requires interdisciplinary expertise)
- Pre-intervention data needs: Moderate to high, depending on calibration needs
- Flexibility to adapt: High
- Other: Required model effort is substantial
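The "with" and "without" logic can be sketched with a toy water-balance model run twice, once with a hypothetical intervention and once without; the model structure and every parameter below (loss rates, inflows, demand, capacity) are illustrative assumptions, far simpler than the socio-hydrological or hydro-economic models the text refers to:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stylized reservoir-and-network system. The hypothetical intervention
# reduces network losses from 30% to 15%.
T = 120  # monthly time steps
inflow = np.maximum(rng.normal(100.0, 30.0, T), 0.0)
demand = 80.0

def simulate(loss_rate, capacity=500.0):
    """Run the water balance and return total unmet demand."""
    storage, unmet = 200.0, 0.0
    for q in inflow:
        storage = min(storage + q, capacity)
        # Withdraw enough to meet demand after losses, if storage allows.
        delivered = min(storage, demand / (1.0 - loss_rate))
        storage -= delivered
        unmet += max(demand - delivered * (1.0 - loss_rate), 0.0)
    return unmet

# Model-based "impact": factual run minus counterfactual run on the
# same inflow sequence.
impact = simulate(loss_rate=0.30) - simulate(loss_rate=0.15)
print(round(impact, 1))
```

Running both scenarios on the same stochastic inflow sequence is what makes the difference interpretable as a model-based treatment effect; the estimate is conditioned entirely on the model's assumptions, as the table notes.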