1. Introduction
Forest inventories continuously monitor the status of forested
ecosystems through the implementation of field campaigns for data
collection and subsequent analysis (Smith, 2002). As forests play a key
role in maintaining ecological stability, national forest inventories
are playing an increasingly important role in driving academic and
governmental decision making (Saarela et al., 2020). For example,
Mexico’s National Forest and Soils Inventory (INFyS) is a pillar of its
measurement, reporting and verification system (MRV), and the foundation
for the national inventory of greenhouse gas (GHG) emissions in the
Land Use, Land-Use Change and Forestry (LULUCF) sector and for the
national forest reference emissions level (FREL). MRV and FREL are
components of a carbon accounting system used by the United Nations to
incentivize practices that lower carbon emissions (Mitchell et al.,
2017). National forest inventories usually focus on collecting field
data over large geographic areas. Developing analytical tools that
enhance the accessibility and understanding of nation-wide forest
inventory data is critical for democratizing information about forest
structure at national and international scales.
Forest inventories based on a statistical sample are used to estimate
mean or total amounts of forest inventory attributes within the
population of interest (Tomppo, Haakana, et al., 2008). However, field
surveys can be costly, time-consuming, and logistically challenging.
Furthermore, relying exclusively on field surveys can yield realized
samples that violate the statistical assumptions of the design and that
have reduced sample sizes because of nonresponse, which occurs
when field plots that were part of the design cannot be accessed.
Improper management of nonresponse can produce bias or increase
uncertainty when generating estimates (McRoberts et al., 2005). Emerging
satellite and machine learning (ML) technologies give us the opportunity
to build standardized analytical tools that can mitigate problems
associated with nonresponse and produce maps that serve multiple
purposes (Tomppo, Olsson, et al., 2008).
Methods for mapping forest attributes have evolved by modeling
attributes measured in the field as a function of remotely sensed
satellite data and then using these models to predict the spatial
distribution of forest attributes (Schumacher et al., 2020; Wang et al.,
2009). The integration of both data sources has been widely applied to
better visualize national-scale estimates, reduce uncertainty, and
improve dataset robustness (Haakana et al., 2019; Ohmann et al., 2014;
Saarela et al., 2020; Tomppo et al., 2010). This approach has played a
key role in modeling national estimates of forest structure such as
aboveground biomass (AGB) as well as attributes such as forest age
(Saarela et al., 2020; Schumacher et al., 2020). Both tree height and
tree density are drivers of AGB and bioenergy potential in forest
ecosystems. To obtain accurate spatial predictions of forest attributes,
many studies employ ML models using a multivariate approach (Khaledian
& Miller, 2020; Li et al., 2020; Soriano-Luna et al., 2018; Wadoux et
al., 2020). ML is a field of artificial intelligence (AI), and one of
its main objectives is to identify and model relationships between
dependent data (such as forest inventory attributes) and independent
data (such as remotely sensed covariates), and to apply these models to
generate predictions in a semi-autonomous manner (James et al., 2013a). The
performance of different types of ML models often varies when modeling
forest attributes. For example, spatially explicit estimates of AGB
varied by as much as 19% among linear (LM), generalized
additive (GAM), and random forest (RF) empirical models fitted for a temperate
forest in central Mexico (Soriano-Luna et al., 2018). The three fitted
AGB models performed well when predicting AGB spatial distribution, but
GAM was better for representing AGB variations across the landscape.
Thus, different ML models yield different results, and studies often
compare multiple models or algorithms to identify the best solution for
predicting forest attributes or other response variables, as no
silver bullet exists in ecological modeling (Qiao et al., n.d.).
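As an illustration of how candidate models are commonly compared, the following minimal sketch scores two simple learners by k-fold cross-validated RMSE. It is not the workflow of any cited study: the data are synthetic, the covariate `x` stands in for a hypothetical remote sensing predictor, the response stands in for a tree-height-like attribute, and a linear model and a k-nearest-neighbour regressor are used only as convenient stand-ins for models such as LM, GAM, or RF.

```python
import math
import random


def make_data(n=200, seed=0):
    """Synthetic data (assumption): x is a hypothetical covariate, y a nonlinear response."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [10.0 + 15.0 * math.sin(3.0 * x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x; returns a prediction function."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=7):
    """k-nearest-neighbour regression: average the k closest training responses."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def cv_rmse(fit, xs, ys, folds=5):
    """Cross-validated RMSE; observation i belongs to fold i % folds."""
    sq_err = 0.0
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        test = [i for i in range(len(xs)) if i % folds == f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((model(xs[i]) - ys[i]) ** 2 for i in test)
    return math.sqrt(sq_err / len(xs))


xs, ys = make_data()
scores = {"linear": cv_rmse(fit_linear, xs, ys), "knn": cv_rmse(fit_knn, xs, ys)}
best = min(scores, key=scores.get)
print(scores, "-> lowest cross-validated RMSE:", best)
```

Because the synthetic response is strongly nonlinear, the flexible learner scores better here; with a different data-generating process the ranking could reverse, which is the reason several models are usually evaluated side by side.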
One commonly used family of ML approaches for spatial prediction is
ensemble learning, which integrates multiple ML models and
algorithms (Holloway & Mengersen, 2018). Ensemble ML models are used in
mapping forest attributes because they offer improvements in accuracy over
individual algorithms (Healey et al., 2018). Examples of popular
ensemble ML algorithms include RF (Breiman, 2001), which applies a
bagging method, and Super Learner, which applies a stacked method and
uses cross-validation to estimate the performance of multiple ML models
(Polley & Laan, 2010). The latter has been shown to outperform the
individual algorithms used to build the model (Davies & van der Laan,
n.d.; Taghizadeh-Mehrjardi et al., 2021).
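The stacking idea behind Super Learner can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the Super Learner implementation or the method used in this study: the data are synthetic, the two base learners (a linear model and a k-nearest-neighbour regressor) are arbitrary stand-ins, and the meta-learning step is reduced to a grid search over a single convex weight rather than the constrained regression typically used in practice.

```python
import math
import random


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b * x; returns a prediction function."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=7):
    """k-nearest-neighbour regression: average the k closest training responses."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def out_of_fold(fit, xs, ys, folds=5):
    """Predict every observation with a model trained without that observation's fold."""
    preds = [0.0] * len(xs)
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(f, len(xs), folds):
            preds[i] = model(xs[i])
    return preds


def mse(preds, ys):
    """Mean squared error between predictions and observations."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Synthetic data (assumption): a linear trend plus a nonlinear component and noise.
rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [5.0 + 8.0 * x + 3.0 * math.sin(6.0 * x) + rng.gauss(0.0, 1.0) for x in xs]

# Step 1: cross-validated (out-of-fold) predictions for each base learner.
learners = [fit_linear, fit_knn]
oof = [out_of_fold(fit, xs, ys) for fit in learners]


def blend(w):
    """Convex combination of the two learners' out-of-fold predictions."""
    return [w * a + (1.0 - w) * b for a, b in zip(oof[0], oof[1])]


# Step 2: meta-learner, here a grid search for the weight with the lowest CV error.
grid = [i / 20 for i in range(21)]
w_best = min(grid, key=lambda w: mse(blend(w), ys))

# Step 3: refit the base learners on all data and combine them with the chosen weight.
models = [fit(xs, ys) for fit in learners]


def super_learner(x):
    return w_best * models[0](x) + (1.0 - w_best) * models[1](x)


print("weight on linear model:", w_best)
print("CV MSE (linear, knn, blend):", mse(oof[0], ys), mse(oof[1], ys), mse(blend(w_best), ys))
```

Since the weight grid includes 0 and 1, the blended cross-validated error can never exceed that of the better base learner on the same folds, which gives an intuition for why a stacked ensemble tends to perform at least as well as its components.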
Forests in Mexico are a critical natural resource, containing vast
amounts of biodiversity and providing ecosystem goods and services
(e.g., timber production, water security, soil conservation) as well as
economic benefits. The National Forestry Commission of Mexico (CONAFOR)
has implemented the INFyS since 2004.
The INFyS is a national program in which a stratified, systematic sample
of permanent ground plots is used to measure tree (e.g., height,
diameter at breast height, count) and site (e.g., forest type, site
class, topographic data) variables across all forest lands every 5 years
(CONAFOR, 2017).
The main goal of this study is to develop a methodological framework
with which CONAFOR can generate country-level maps of INFyS forest
attributes. Specifically, this involves operationalizing methods based
on integrating field data with remote sensing data in an ensemble ML
framework to map forest attributes. We start with tree height and
tree density because they are key components of forest structure, can
provide information that helps mitigate the impacts of nonresponse, and
support the estimation of AGB, carbon storage, and forest
productivity over time (Humagain et al., 2017; Pirotti, 2010; Selkowitz
et al., 2012). Accurate spatial predictions of such structural variables
are fundamental for the management and conservation of forest
ecosystems, as they are important inputs to the study of
land-atmosphere interactions, carbon cycling, fire hazard assessment,
and timber volume estimation (Chopping et al., 2008; Selkowitz et al.,
2012). By developing workflows and products based on INFyS data, this
study aims to support CONAFOR in generating information that will be
used by decision makers to manage forests more effectively, preserve the
country’s forest patrimony, and improve national and international
reporting associated with MRV and FREL. We envision that this methodology
could be extended to other forest attributes, such as AGB, carbon
storage, and timber volume, and thereby improve Mexico’s national
estimates of these attributes.