University of Sheffield
School of Mathematics and Statistics
MAS316/MAS414/MAS6446 Mathematical Modelling of Natural Systems
João Carreiras - G39e - j.carreiras@sheffield.ac.uk
Topic III: Machine learning methods to retrieve forest biophysical
parameters
In this topic, you will explore the application of machine learning methods to estimate an important
forest biophysical parameter – aboveground biomass – in combination with predictors obtained from
satellite observations. These notes are organised in four chapters, one per lecture. In Lecture 1,
you will learn about the basics of tree-based machine learning methods and assessing model
generalisation. Lecture 2 will focus on several advanced ensemble techniques currently applied to
regression problems. Lecture 3 will cover a short general introduction to forest biophysical
parameters and the role of satellite observations. In Lecture 4, you will learn more about linking
satellite observations to ground measurements and the Sheffield-born BIOMASS mission that will
launch into space in 2023.
1 Machine learning for regression: decision trees, bagging and
boosting
1.1 Introduction and objectives
Machine learning designates a collection of data-driven methods relying on learning from training
experience. We have a response variable, either quantitative or categorical, that we are interested in
predicting based on a set of explanatory (or predictor) variables. We collected or have access to a set
of observations, in which we measured the response and predictor variables using a representative
sample. Using these data, we build an inductive-based prediction model, or learner, which will enable
us to estimate the response for new unseen predictors. A good learner is one that accurately predicts
the response variable. This is called supervised learning because of the presence of the response
variable to guide the learning process. We will focus on supervised machine learning methods
for the rest of this topic.
Supervised learning methods generally form their predictions via a learned function $f(x)$, which
produces an output $y$ for each input $x$ (or a probability distribution over $y$ given $x$). Many different
forms of the function $f$ exist, including decision trees, logistic regression, support vector machines, neural
networks, kernel machines, or Bayesian classifiers. There are also generic procedures, such as
boosting and bagging, which combine the outputs of multiple versions of the same learning algorithm.
In this topic, we will only address supervised tree-based methods applied to problems dealing
with quantitative response variables. The contents of this lecture are largely adapted from the
recommended textbook (Hastie et al. 2009). Those of you who enrolled in the Machine Learning
course will be familiar with most of the concepts given in the first two lectures.
1.2 Tree-based methods
Classification and Regression Trees (CART) is a term introduced by Leo Breiman (Breiman 1984) to
refer to decision tree algorithms that can be used for classification or regression predictive modelling
problems. The CART algorithm provides a foundation for important ensemble algorithms like bagged
decision trees, random forests and boosted decision trees.
1.2.1 Regression trees
Common to all CART-based methods, binary trees partition the space of all possible predictor
variables, starting with the entire training set at the root of the tree and ending with the leaves of a
more or less complex tree grown according to some criterion. In the case of regression trees, that
space is then split into two regions, with the response variable modelled by the mean of $y$ in each
region. The splitting rule is then defined by choosing the predictor variable and split value to achieve
the best fit. Then, one or both regions are split into two more regions, and this process is continued,
until some stopping rule is applied.
Consider a training set with $N$ observations, $p$ predictor variables and a response variable $y$, i.e., $(x_i, y_i)$ for $i = 1, 2, \ldots, N$. The CART algorithm needs to automatically decide on the splitting variables, split values, and on the shape of the tree. Assuming we have a partition into $M$ regions $R_1, R_2, \ldots, R_M$, and we model the response variable as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m) \tag{1.1}$$

If we choose to minimise a cost function of sum of squares, $\sum_i \big(y_i - f(x_i)\big)^2$, then the best estimate of $c_m$ is just the average of $y_i$ in region $R_m$:

$$\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m) \tag{1.2}$$
Ideally, we would like to find a partition that achieves minimal risk: lowest mean squared error
for regression problems. However, the number of potential partitions is too large to search
exhaustively. Therefore, CART uses a search heuristic to choose the best partition. Formally, consider
a splitting predictor variable $x_j$ ($j \in \{1, \ldots, p\}$) and split value $s$, and define the pair of half-planes:

$$R_1(j, s) = \{x \mid x_j \le s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j > s\} \tag{1.3}$$

The objective is then to seek the splitting variable $j$ and split value $s$ that solve:

$$\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right] \tag{1.4}$$

For any combination of $j$ and $s$, the inner minimisation is solved by:

$$\hat{c}_1 = \mathrm{ave}\big(y_i \mid x_i \in R_1(j, s)\big) \quad \text{and} \quad \hat{c}_2 = \mathrm{ave}\big(y_i \mid x_i \in R_2(j, s)\big) \tag{1.5}$$
For each splitting variable $j$, determining the split value $s$ can be done very quickly, and therefore
by scanning through all the predictor variables, determination of the best pair $(j, s)$ is feasible. Next, we
partition the data into the two resulting regions and recurse the splitting process on each of the two
regions. The process is repeated on all the resulting regions and stopped at some point according to
a defined criterion. Consider the following example: we are interested in estimating a response $y$ with
just one predictor $x$. Figure 1.1 shows the recursive partition process leading to a regression tree with
four terminal (or leaf) nodes.
Figure 1.1. Evolution of the partition of the predictor space for a regression tree based on the CART algorithm.
(a) starting at the root node and using the split value $s_1$ to create the first partition; (b) further splitting one
child node using the split value $s_2$; (c) finally splitting the other child node using the split value $s_3$. The
values of the terminal (or leaf) nodes are the averages of the corresponding observations of $y$.
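As a concrete illustration of Equations 1.3-1.5 (not part of the original notes), the short Python sketch below performs the greedy search for a single split: for each candidate predictor $j$ and split value $s$ it computes the two half-plane averages and the resulting sum of squares, and keeps the best pair. All variable names and the toy data are illustrative.

```python
import numpy as np

def best_split(X, y):
    """Greedy CART split: return the pair (j, s) with the lowest sum of squares, plus that cost."""
    p = X.shape[1]
    best_j, best_s, best_cost = None, None, np.inf
    for j in range(p):                          # loop over predictor variables (Eq. 1.3)
        for s in np.unique(X[:, j]):            # candidate split values
            left = X[:, j] <= s
            right = ~left
            if left.all() or right.all():       # skip splits that leave one region empty
                continue
            c1, c2 = y[left].mean(), y[right].mean()                           # Eq. 1.5
            cost = ((y[left] - c1) ** 2).sum() + ((y[right] - c2) ** 2).sum()  # Eq. 1.4
            if cost < best_cost:
                best_j, best_s, best_cost = j, s, cost
    return best_j, best_s, best_cost

# toy example with a single predictor, as in Figure 1.1
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y = np.where(x[:, 0] < 4.0, 2.0, 5.0) + rng.normal(0.0, 0.3, 50)
print(best_split(x, y))   # a split value close to 4 is expected
```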
Obviously, we’ll need to have a criterion to stop growing a tree, because a very large tree might
overfit the training data, whereas a small tree might not capture all the relevant structure in the data.
Tree size is a tuning parameter governing the model’s complexity, and the optimal tree size should be
adaptively chosen from the data. The preferred strategy is to grow a large tree $T_0$, stopping the
splitting process only when some minimum node size is reached. Then this large tree is pruned using
a cost-complexity pruning approach.
We define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, that is,
collapsing any number of its internal (non-terminal) nodes. We index terminal nodes by $m$, with node
$m$ representing region $R_m$. Let $|T|$ denote the number of terminal nodes in $T$. Letting:
$$N_m = \#\{x_i \in R_m\} \tag{1.6}$$

where $N_m$ is the number of observations in region $R_m$,

$$\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i \tag{1.7}$$

where $\hat{c}_m$ is the average value of the response variable in region $R_m$, and

$$Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2 \tag{1.8}$$

where $Q_m(T)$ is the squared-error node impurity measure, we then define a cost-complexity function:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T| \tag{1.9}$$
The idea is to find, for each $\alpha$, the subtree $T_\alpha \subseteq T_0$ that minimises $C_\alpha(T)$. The tuning parameter
$\alpha \ge 0$ governs the trade-off between tree size and its goodness of fit to the data. Large values of $\alpha$
result in smaller trees $T_\alpha$, and conversely for smaller values of $\alpha$. As the notation suggests, with $\alpha = 0$
the solution is the full tree $T_0$. We discuss how to adaptively choose $\alpha$ below.
For each $\alpha$ one can show that there is a unique smallest subtree $T_\alpha$ that minimises $C_\alpha(T)$. To find
$T_\alpha$ we use a weakest-link pruning approach: we successively collapse the internal node that produces
the smallest per-node increase in $\sum_m N_m Q_m(T)$, and continue until we produce the single-node (root)
tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain $T_\alpha$. In
case you are interested in more details, please see Breiman (1984). Estimation of $\alpha$ is achieved by
cross-validation, i.e., we choose the value $\hat{\alpha}$ that minimises the cross-validated sum of squares. Our final
tree is then $T_{\hat{\alpha}}$.
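In practice, the pruning sequence and the cross-validated choice of $\alpha$ do not need to be coded by hand: scikit-learn exposes them through `cost_complexity_pruning_path` and the `ccp_alpha` parameter of `DecisionTreeRegressor`. The sketch below is a minimal illustration on a synthetic dataset; all settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# grow a large tree T0 and compute the weakest-link pruning sequence (one alpha per subtree)
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# choose alpha-hat by cross-validated mean squared error
cv_mse = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(f"alpha-hat = {best_alpha:.3f}, terminal nodes = {final_tree.get_n_leaves()}")
```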
However, decision trees are sensitive to small perturbations in the training set, which, in the
case of regression problems, may translate into models with high variance (Breiman 1996). A small
change in the data can result in a very different series of splits, making interpretation somewhat
precarious. The major reason for this instability is the hierarchical nature of the process: the effect of
an error in the top split is propagated down to all the splits below it. One can alleviate this to some
degree by trying to use a more stable split criterion, but the inherent instability is not removed. It is
the price to be paid for estimating a simple, tree-based structure from the data.
These unstable methods (known as weak or base learners in machine learning jargon) can have
their accuracy improved with perturbing and combining techniques, that is, by generating multiple
perturbed versions of the model (a.k.a. ensemble or committee) and combining those into a single
predictor. Methods that use ensembles of base learners have proved very successful at improving the
accuracy of classification and regression trees and neural networks (Ren et al. 2016). These
methods can be divided into two types: those that adaptively change the distribution of the training set
based on the performance of previous base learners (e.g., boosting) and those that do not (e.g.,
bagging). In the following sections, you’ll learn more about the basics of bagging and boosting, as these
ensemble methods are the basis for the two classifiers we’ll learn about in the next lecture: random
forests (Breiman 2001) and stochastic gradient boosting (Friedman 2002). Bagging aims to produce an
ensemble model with lower variance than its components, whereas boosting mainly tries to produce
strong models that are less biased than their components (even though variance can also be reduced).
1.3 Bagging
Bagging is short for bootstrap aggregating and is a widely used method for regression and
classification problems, generating multiple versions of a predictor and using these to get an
aggregated predictor. Consider a training set $L = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, with a response variable $y$ and
a set of predictors $x$. We only have a single training set $L$, so we need to find a way to replicate the
process leading to multiple versions of a given predictor. This can be achieved by taking multiple
bootstrap samples $\{L_b\}$ (randomly drawn with replacement) of size $n$ ($n \le N$). Under some
assumptions, bootstrap samples have good statistical properties. As an approximation, they can be
seen as being drawn both (i) directly from the true underlying, often unknown, data distribution and
(ii) independently from each other. Consequently, they can be considered as representative and
independent samples of the true data distribution (almost independent and identically distributed
random samples, or i.i.d.). However, two hypotheses must be verified for this approximation to be valid.
H1: the size $N$ of the training set should be large enough to capture most of the complexity of
the underlying distribution, so that sampling from the dataset is a good approximation of sampling
from the real distribution (representativity).
H2: the size $N$ of the training dataset should be large enough compared to the size $n$ of the
bootstrap samples ($N \gg n$), so that the bootstrap samples are not too correlated (independence).
Assume we use the training set $L$ to fit a model producing a predictor $\hat{f}(\cdot)$. Due to the theoretical variance
of the training set (the training set is an observed sample coming from a true, unknown underlying
distribution), $\hat{f}(\cdot)$ is also subject to variability: if another dataset had been observed, we would have
obtained a different model.
The idea of bagging is then simple: we want to fit several independent models and combine
their predictions to obtain a model with lower variance. In practice, however, fitting fully independent
models would require too many observations. Therefore, we rely on the approximation properties of
bootstrap samples - representativity and independence - to fit models that are almost independent.
The bagging approach can be described as follows:
i) first, create multiple bootstrap samples, i.e., a sequence of training sets $\{L_b\}$, $b = 1, 2, \ldots, B$, each
consisting of $n$ observations drawn with replacement from $L$, so that each new bootstrap sample acts
as another (almost) independent dataset drawn from the true distribution:

$$L_1 = \big\{(x_1^1, y_1^1), \ldots, (x_n^1, y_n^1)\big\}, \quad L_2 = \big\{(x_1^2, y_1^2), \ldots, (x_n^2, y_n^2)\big\}, \quad \ldots, \quad L_B = \big\{(x_1^B, y_1^B), \ldots, (x_n^B, y_n^B)\big\},$$

where $(x_i^b, y_i^b)$ is the $i$th observation of the $b$th bootstrap sample;
ii) second, fit a base learner $\hat{f}_b(\cdot)$ to each of these samples:

$$\hat{f}_1(\cdot),\ \hat{f}_2(\cdot),\ \ldots,\ \hat{f}_B(\cdot)$$

iii) finally, combine the predictions through an averaging process to obtain an ensemble model
with lower variance, which for regression problems is the average of the individual models:

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \tag{1.10}$$
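A minimal sketch of these three bagging steps, using NumPy for the bootstrap resampling and scikit-learn regression trees as base learners; the number of bootstrap samples and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=1)
rng = np.random.default_rng(1)

B, n = 50, len(y)                                  # number and size of the bootstrap samples
learners = []
for b in range(B):
    idx = rng.integers(0, len(y), size=n)          # i) bootstrap sample, drawn with replacement
    tree = DecisionTreeRegressor(random_state=b)
    learners.append(tree.fit(X[idx], y[idx]))      # ii) fit one base learner per sample

def bagged_predict(X_new):
    """iii) average the B individual predictions (Equation 1.10)."""
    return np.mean([t.predict(X_new) for t in learners], axis=0)

print(bagged_predict(X[:3]))   # ensemble predictions for the first three observations
```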
Bootstrap samples are often used, for example, to evaluate variance or confidence intervals of
statistical estimators. By definition, a statistical estimator is a function of some observations and, so,
a random variable with variance coming from these observations. To estimate the variance of such
an estimator, we need to evaluate it on several independent samples drawn from the distribution of
interest. In most of the cases, considering truly independent samples would require too much data
compared to the amount available. We can then use bootstrapping to generate several bootstrap
samples that can be considered as being almost-representative and almost-independent (almost
independent and identically distributed samples). These bootstrap samples will allow us to
approximate the variance of the estimator, by evaluating its value for each of them.

The estimator of interest is evaluated for each
bootstrap sample, and the variance and confidence
intervals calculated based on the realisations of the
estimator.
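As a concrete illustration of this use of the bootstrap, the short sketch below approximates the variance and a 95% percentile confidence interval of the sample median; the estimator, the sample and the number of bootstrap samples are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.lognormal(mean=1.0, sigma=0.5, size=200)   # the single observed sample

B = 2000                                                # number of bootstrap samples
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(B)
])

print("bootstrap variance of the median:", boot_medians.var(ddof=1))
print("95% percentile interval:", np.percentile(boot_medians, [2.5, 97.5]))
```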
One of the big computational advantages of bagging and other similar ensemble methods is that
it can be parallelised. As the different models are fitted independently from each other, intensive
parallelisation techniques can be used if required.
1.4 Boosting
Parallel ensemble methods (e.g., bagging) aim to fit several base learners independently from each
other and, so, it is possible to train them concurrently. In sequential ensemble methods, the idea is to
fit models iteratively such that model training at a given step depends on the models fitted at previous
steps. Boosting is the most well-known of these approaches and it produces an ensemble model that
is in general less biased than the contributing base learners. However, unlike bagging, which mainly aims
at reducing variance, boosting consists of sequentially fitting multiple base learners
in a very adaptive way: each model in the sequence is fitted giving more importance to observations
in the dataset that were poorly handled by the previous models in the sequence. Intuitively, each new
model focuses its efforts on the observations that have been most difficult to fit so far, so that we obtain, at the end
of the process, a strong learner with lower bias (please note that boosting can also have the effect of
reducing variance). Being mainly focused on reducing bias, the base models that are often considered
for boosting are models with low variance but high bias. For example, if we want to use decision trees
as our base models, we will choose most of the time decision trees with only a few nodes (Figure 1.2).

Figure 1.2. Illustration of how bias and variance change with model complexity.
Another important reason that motivates the use of low variance but high bias models as base
learners for boosting is that these models are in general less computationally expensive to fit (few
degrees of freedom when parametrised). Indeed, as computations to fit the different models cannot
be done in parallel (unlike bagging), it would become too expensive to fit sequentially several complex
models. Once the base learners have been chosen, we still need to define how they will be sequentially
fitted (what information from previous models do we consider when fitting the current model?) and
how they will be aggregated (how do we aggregate the current model to the previous ones?). There
are different flavours of the boosting algorithm, but in this lecture we will focus on the adaptive
boosting (AdaBoost) algorithm proposed by Freund and Schapire (1997).
Assume we have a training set $L = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, with a response variable $y$ and a set
of predictors $x$. In the AdaBoost algorithm, we define our ensemble model $f_M$ as a weighted sum
of $M$ base predictors $w_m(\cdot)$:

$$f_M(\cdot) = \sum_{m=1}^{M} c_m \times w_m(\cdot) \tag{1.11}$$

where the $c_m$ are the coefficients and the $w_m(\cdot)$ the (weak) base learners.
Finding the best ensemble model with this form is a difficult optimisation problem. Instead
of trying to solve it in one go (finding all the coefficients and base learners that give the best
overall additive model), we make use of an iterative optimisation process that is much more
manageable, even if it can lead to a sub-optimal solution. Specifically, we add the base learners one by
one, looking at each iteration for the best possible pair (coefficient $c_m$, base learner $w_m(\cdot)$) to add to
the current ensemble model. In other words, we define $f_m$ recursively such that:

$$f_m = f_{m-1} + c_m \times w_m(\cdot) \tag{1.12}$$

where $c_m$ and $w_m(\cdot)$ are chosen such that $f_m$ is the model that best fits the training data
and, so, is the best possible improvement over $f_{m-1}$. We can then write:

$$\big(c_m, w_m(\cdot)\big) = \underset{c,\, w(\cdot)}{\arg\min} \sum_{i=1}^{N} E\big(y_i,\ f_{m-1}(x_i) + c \times w(x_i)\big) \tag{1.13}$$

where $E(\cdot\,,\cdot)$ is the loss/error function we want to minimise. Thus, instead of a complete
optimisation over all the models in the sum, we approximate the optimum by optimising locally,
building and adding the base learners to the strong model one by one.
Specifically, when considering a binary classification problem and a training dataset
composed of $N$ observations used to train a base model $w(\cdot)$, the AdaBoost algorithm starts (first model
of the sequence) with all the observations having the same weight $1/N$. Then, the following steps are
repeated $M$ times (once for each of the $M$ base models in the sequence), as sketched in the code example after this list:
i) fit the best possible base model $w_m(\cdot)$ with the current observation weights;
ii) compute the value of the update coefficient $c_m$, a scalar evaluation metric of the
base learner indicating how much it should contribute to the ensemble model;
iii) update the strong learner by adding the new base learner multiplied by its update
coefficient;
iv) compute new observation weights expressing which observations should be the focus
at the next iteration: the weights of the observations wrongly predicted by the
aggregated model increase and the weights of the correctly predicted observations
decrease.
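A minimal sketch of this procedure using scikit-learn's `AdaBoostRegressor`, whose default weak learners are shallow regression trees; the synthetic dataset and the hyperparameter values are illustrative, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

# the default weak learners are shallow regression trees (low variance, high bias)
model = AdaBoostRegressor(n_estimators=200, learning_rate=0.5,
                          loss="square", random_state=2)
model.fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```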
By repeating these steps, we sequentially build our $M$ models and aggregate them into
a simple linear combination weighted by coefficients expressing the performance of each learner. To
derive the AdaBoost algorithm in the context of a regression problem, we reduce the problem to a
binary classification problem and then apply the AdaBoost algorithm intended for classification
purposes. Therefore, the error/loss function $E(\cdot\,,\cdot)$ in Equation 1.13 that we want to minimise could be the
mean square error (MSE), although any other reasonably bounded error function can be applied:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - f_M(x_i)\big)^2 \tag{1.14}$$

where $f_M(\cdot)$ refers to the ensemble model fitted with Equation 1.12.
1.5 Assessing model generalisation: bias and variance
The generalisation performance of a model relates to its prediction capability using an independent
dataset. Again, consider a dataset $L = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, with a response variable $y$ and a set of
predictors $x$. Before proceeding to model fitting, we need to randomly select a proportion of the
original dataset to be used when evaluating the model's predictive capability - the testing subset ($L_t \subset L$). That subset should be kept separate and brought out only at the end of the data analysis.
Typical proportions used when splitting the original dataset into training and testing subsets are
70-80% and 20-30%, respectively.
Using the notation above, assume there is a relationship between $y$ and $x$ such that:

$$y = f(x) + e \tag{1.15}$$

where $e$ is the error term, which is normally distributed with $E(e) = 0$, $E(\cdot)$ being the
expected value. Our objective is then to model that relationship using $\hat{f}(x)$. The expected
squared error is usually referred to as the mean square error (MSE):

$$\mathrm{MSE} = \frac{1}{N_t} \sum_{i=1}^{N_t} (\hat{y}_i - y_i)^2 \tag{1.16}$$

where $\hat{y}_i$ and $y_i$ are the predicted and observed values of the response variable. This error can
be further decomposed as:

$$E\left[\frac{1}{N_t} \sum_{i=1}^{N_t} (\hat{y}_i - y_i)^2\right] = E\left[\frac{1}{N_t} \sum_{i=1}^{N_t} e_i^2\right] = \frac{1}{N_t} \sum_{i=1}^{N_t} E\big[e_i^2\big] = E\big[e^2\big] = \mathrm{Var}(e) + E[e]^2 \tag{1.17}$$

The first term is the variance of the error, and the second term is the squared bias. Using
this notation, the root mean square error (RMSE) of a model (discarding the irreducible error) can
be written as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N_t} \sum_{i=1}^{N_t} (\hat{y}_i - y_i)^2} = \sqrt{\mathrm{variance} + \mathrm{bias}^2} \tag{1.18}$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted values of the response variable, respectively, and
$N_t$ is the number of observations in the testing subset. An illustration of the bias-variance decomposition
of the model error is shown in Figure 1.3, which nicely complements the information provided in Figure
1.2.

Figure 1.3. Bullseye illustration of the bias-variance decomposition of the model error.
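The hold-out assessment described in this section can be sketched as follows: split the data into training and testing subsets, fit the learner on the training part only, and report the RMSE of Equation 1.18 on the unseen testing part. The data, split proportion and model settings below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=6, noise=20.0, random_state=3)

# 70/30 split into training and testing subsets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

model = DecisionTreeRegressor(max_depth=5, random_state=3).fit(X_tr, y_tr)

rmse = np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2))   # Equation 1.18
print(f"test RMSE: {rmse:.2f}")
```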

1.6 Summary
In this lecture, you were introduced to decision trees, a powerful machine learning algorithm relying
on recursive splitting of the training data. You also saw that these methods (often called base or weak
learners) can have their accuracy improved by combining multiple parallel (bagging) or sequential
(boosting) base learners (decision trees). The bias-variance decomposition of the root mean square
error was highlighted as a way of assessing and describing model generalisation.

2 Machine learning for regression: random forests and gradient
boosting
2.1 Introduction and objectives
Random forests (Breiman 2001) and gradient boosting (Friedman 2002) are also ensemble methods
relying on decision trees, and were designed to improve the accuracy of parallel (bagging) and
sequential (boosting) ensemble methods, respectively. In this lecture, we will learn the basic concepts
about random forests and gradient boosting. We will also see how to fine-tune them when addressing
regression problems.
2.2 Random forests
The random forests algorithm proposed by Breiman (2001) uses fundamentally the same approach as
bagging. In Lecture 1, we saw that, for regression problems, bagging relies on simply fitting the same
base learner many times to bootstrap samples of the training dataset and average the result. This has
proved to reduce the variance in high-variance, low-bias base learners, such as decision trees. For
regression problems, one would construct a random forests model using the following steps.

Random Forests for Regression

Consider a training dataset $L = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, with a response variable $y$ and a
set of $p$ predictors $x$.

Step 1: For each $b$ in $\{1, 2, \ldots, B\}$:
1.1. select a bootstrap sample $L_b$ of size $N$ from $L$ (i.e., drawn with replacement);
1.2. fit a regression tree $T_b$ to $L_b$, using recursive splitting until a minimum node size is
reached, applying the following approach at each node:
1.2.1. select $m$ predictors at random from the $p$ available ($m \le p$);
1.2.2. select the best predictor and split value, according to some criterion (e.g.,
minimising the mean squared error);
1.2.3. split the parent node into two child nodes.

Step 2: Create an ensemble of regression trees $\{T_b\}_{b=1}^{B}$, making new predictions at a point $x_0$ using:

$$\hat{f}_{\mathrm{rf}}^{B}(x_0) = \frac{1}{B} \sum_{b=1}^{B} T_b(x_0) \tag{2.1}$$

Since decision trees are notoriously noisy, they benefit greatly from the averaging process.
Moreover, since each tree generated in bagging is identically distributed (i.d.), the expectation of an
average of $B$ such trees is the same as the expectation of any one of them. This means the bias of
bagged trees is the same as that of the individual (bootstrap) trees, and the only hope of improvement
is through variance reduction. This contrasts with boosting, where the trees are grown in an adaptive
way to remove bias, and hence are not i.d.
An average of $B$ i.i.d. random variables, each with variance $\sigma^2$, has variance $\frac{1}{B}\sigma^2$. If the variables
are simply i.d. (identically distributed, but not necessarily independent) with positive pairwise
correlation $\rho$, the variance of the average is:

$$\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2 \tag{2.2}$$

As $B$ increases, the second term disappears, but the first remains, and hence the size of the
correlation of pairs of bagged trees limits the benefits of averaging. In random forests, Breiman’s idea
was to improve the variance reduction of bagging by reducing the correlation between the trees,
without increasing the variance too much. This is achieved in the tree-growing process through
random selection of the predictors.
An important feature of random forests is the use of out-of-bag (OOB) samples: for each
observation $z_i = (x_i, y_i)$, construct its random forest predictor by averaging only those regression
trees corresponding to bootstrap samples in which $z_i$ did not appear. According to Breiman (2001),
the OOB error estimate is almost identical to that obtained by $N$-fold cross-validation. On average,
about 36.8% of the training observations are left out of each bootstrap sample and are thus available
as the OOB sample for the corresponding regression tree.
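scikit-learn's `RandomForestRegressor` follows the algorithm outlined above: `max_features` controls the number $m$ of predictors sampled at each node, `min_samples_leaf` the minimum node size, and `oob_score=True` requests the out-of-bag assessment just described. The synthetic data and parameter values below are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=800, n_features=10, noise=15.0, random_state=4)

rf = RandomForestRegressor(
    n_estimators=500,      # B regression trees
    max_features="sqrt",   # m predictors tried at each split (m < p decorrelates the trees)
    min_samples_leaf=5,    # minimum node size
    oob_score=True,        # assess the model on the out-of-bag samples
    random_state=4,
    n_jobs=-1,             # trees are fitted in parallel, as in bagging
)
rf.fit(X, y)
print("OOB R^2:", rf.oob_score_)
```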
2.3 Gradient Boosting
The Gradient Boosting algorithm proposed by Friedman (2001) is a development of the original
boosting concept of Freund and Schapire (1997), including a numerical optimisation procedure. For
regression, consider a loss function – in this case, a measure (such as sum of squared errors) that
represents the loss in predictive performance due to a suboptimal model. Gradient boosting is a
numerical optimisation technique for minimizing the loss function by adding, at each step, a new base
learner that best reduces (steps down the gradient of) the loss function. For gradient boosting using
regression trees, the first regression tree is the one that, for the selected tree size, maximally reduces
the loss function. In each following step, the focus is on the residuals: on variation in the response that
is not so far explained by the model. For example, at the second step, a tree is fitted to the residuals
of the first tree, and that second tree could contain quite different variables and split points compared
with the first. The model is then updated to contain two trees (two terms), and the residuals from
this two‐term model are calculated, and so on. The process is stagewise (not stepwise), meaning that
existing trees are left unchanged as the model is enlarged. Only the fitted value for each observation
is re‐estimated at each step to reflect the contribution of the newly added tree. The final model is a
linear combination of many trees (usually hundreds to thousands) that can be thought of as a
regression model where each term is a tree. The model‐building process performs best if it moves
slowly down the gradient, so the contribution of each tree is usually shrunk by a learning rate that is
substantially less than one. Fitted values in the final model are computed as the sum of all trees
multiplied by the learning rate and are much more stable and accurate than those from a single
regression tree model. The implementation of a gradient boosted regression model should use the
following steps:

Gradient Boosting for Regression

Consider a training dataset $L = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, with a response variable $y$ and a
set of predictors $x$. Assume a differentiable loss function $E\big(y, f(x)\big)$.

Step 1: initialise $f_0(x)$:

$$f_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^{N} E(y_i, \gamma) \tag{2.3}$$

i.e., find the value of $\gamma$ minimising the loss function. $E\big(y_i, f(x_i)\big)$ is usually taken as
$\tfrac{1}{2}\big(y_i - f(x_i)\big)^2$; the function is multiplied by $\tfrac{1}{2}$ for ease of
derivation. This can be solved by gradient descent, finding where the derivative of the
equation is equal to 0.

Step 2: for $m = 1, 2, \ldots, M$ (for a gradient boosting model with $M$ regression trees):
i) for $i = 1, 2, \ldots, N$ calculate $r_{im}$ (the residual value)

$$r_{im} = -\left[\frac{\partial E\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f = f_{m-1}} \tag{2.4}$$

ii) fit a regression tree to the $r_{im}$ values, giving terminal nodes (leaves)
$R_{jm}$, $j = 1, 2, \ldots, J_m$;
iii) for $j = 1, 2, \ldots, J_m$ calculate

$$\gamma_{jm} = \underset{\gamma}{\arg\min} \sum_{x_i \in R_{jm}} E\big(y_i,\ f_{m-1}(x_i) + \gamma\big) \tag{2.5}$$

i.e., the output value for each terminal node, which (for the squared error loss) is equal to the
mean of the residuals within each node;
iv) update $f_m(x)$:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I\big(x \in R_{jm}\big) \tag{2.6}$$

i.e., make new predictions.

Step 3: output $\hat{f}(x) = f_M(x)$

Gradient boosting has several important features. First, the process is stochastic: it includes a
random or probabilistic component. The stochasticity improves predictive performance, reducing the
variance of the final model, by using only a random subset of the data to fit each new tree. Second, the
sequential model-fitting process builds on trees fitted previously, and increasingly focuses on the
hardest observations to predict. This distinguishes the process from one where a single large tree is
fitted to the data set. However, if the perfect fit were a single tree, in a boosted model it would probably
be fitted by a sum of identical shrunken versions of itself. Third, values must be provided for two
important parameters. The learning rate, also known as the shrinkage parameter, determines the
contribution of each tree to the growing model, and the tree complexity controls whether interactions
are fitted: a tree complexity of 1 (two terminal nodes) fits an additive model, a tree complexity of two
fits a model with up to two‐way interactions, and so on. These two parameters then determine the
number of trees required for optimal prediction. Finally, prediction from a gradient boosting model is
straightforward, but interpretation requires tools for identifying which variables and interactions are
important, and for visualizing fitted functions.
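These ideas map directly onto scikit-learn's `GradientBoostingRegressor`: `learning_rate` is the shrinkage parameter, `max_depth` controls tree complexity, `subsample < 1` makes the procedure stochastic, and `n_estimators` is the number of trees. The values below are illustrative starting points rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

gbm = GradientBoostingRegressor(
    n_estimators=1000,    # number of trees M
    learning_rate=0.05,   # shrinkage applied to each tree's contribution
    max_depth=2,          # tree complexity: stumps fit an additive model, deeper trees add interactions
    subsample=0.5,        # stochastic gradient boosting: a random half of the data per tree
    random_state=5,
)
gbm.fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, gbm.predict(X_te)))
```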
2.4 Summary
In this lecture, you learned about new developments to improve parallel and sequential ensemble
models relying on decision trees as base learners. The key parameters required to tune random
forests and gradient boosting models were highlighted.

3 Estimating forest biophysical parameters: ground and satellite
observations
3.1 Introduction and objectives
The 2015 Paris Climate Agreement acknowledged the enormous potential of forests to mitigate
climate change by absorbing the ever-increasing concentration of carbon dioxide in the atmosphere.
Forest loss and forest degradation, especially in tropical regions, make a major contribution to the
increase in atmospheric greenhouse gases, while forest regrowth acts to slow down this increase
(Mitchard 2018). Estimates of the size of these carbon sources and sinks are urgently needed but are
still highly uncertain because we have very poor knowledge about the amount of biomass stored in
tropical forests (biomass contains approximately 50% carbon) and its changes with time. Satellite
observations provide an unprecedented opportunity to significantly improve our understanding of the
spatial distribution, patterns, and dynamics of aboveground biomass - a key forest biophysical
parameter. The availability of spatially explicit information with much reduced uncertainty about the
magnitude, distribution, and dynamics of aboveground biomass change in the tropics is critical to
properly characterise the main forest dynamics operating in this biome. Further information about
the importance of accurately estimating forest aboveground biomass in combination with satellite
observations can be found in Rodriguez-Veiga et al. (2017).
In this lecture, you will learn some basic concepts about estimating forest biophysical
parameters using measurements made on the ground during field campaigns. We will focus on
estimating aboveground biomass, one of the most important forest biophysical parameters, and how
allometry can be used to estimate this parameter. Additionally, you will learn about the main satellite
observations currently used to estimate forest aboveground biomass.
3.2 Forest biophysical parameters
3.2.1 Definition of forest
According to the Food and Agriculture Organisation (FAO) of the United Nations (UN), a forest is
defined as “land spanning more than 0.5 hectares with trees higher than 5 meters and a canopy cover
of more than 10 percent”. National Forest Inventories (NFIs) are conducted by many countries to
maintain current estimates of the condition and trends of the countries' forest resources. These
inventories are often implemented as part of a monitoring system gathering ground observations,
which are typically collected from plots established using a probabilistic sampling design, satellite
observations and other data sources, such as climate and topography.
Forests can be characterised by several biophysical parameters, and these are often expressed per
unit area (usually 1 hectare = 10,000 m2). Plot sizes in NFIs are generally in the range 0.01 to 1 hectare.
There is an inventory cost trade-off between spending more time on fewer, larger plots and spending
more time traveling to visit a larger number of smaller plots. Larger plots typically lead to lower
variance in estimates, but fewer can be collected for a given budget. Observations and measurements
on these plots vary, but always include the amount of forest cover and tree-level data such as species
identification, diameter, and height, which can be used with allometric models to predict, e.g., the
volume and aboveground biomass of individual trees. Allometric models in this context are based on
relationships between the size of a parameter of interest (e.g., tree volume) and several morphological
variables such as height and diameter.
3.2.2 Key biophysical parameters characterising a forest
The most important biophysical parameters characterising the trees in a forest are: density (D,
number of trees per unit area), average diameter (d), average height (h), basal area (G, sum of the
cross-sectional area of all tree trunks per unit area), volume (V, sum of the volume of all tree trunks
per unit area), and above-ground biomass (W, sum of the biomass of all aboveground tree
components per unit area). Assuming we are measuring $n$ trees inside a plot of area $a$ m2, the following
equations can be used to estimate these forest biophysical parameters.
i) Density ($D$; number of trees ha-1):

$$D = \frac{10{,}000}{a}\, n \tag{3.1}$$

ii) Average diameter ($d$; cm):

$$d = \frac{1}{n} \sum_{i=1}^{n} d_i \tag{3.2}$$

where $d_i$ is the diameter of the $i$th tree.

iii) Average height ($h$; m):

$$h = \frac{1}{n} \sum_{i=1}^{n} h_i \tag{3.3}$$

where $h_i$ is the height of the $i$th tree.

iv) Basal area ($G$; m2 ha-1):

$$G = \frac{10{,}000}{a} \sum_{i=1}^{n} \frac{\pi d_i^2}{4} \tag{3.4}$$

v) Volume ($V$; m3 ha-1):

$$V = \frac{10{,}000}{a} \sum_{i=1}^{n} \frac{\pi d_i^2}{4}\, h_i f_i \tag{3.5}$$

where $f_i$ is the form factor of the $i$th tree.

vi) Above-ground biomass ($W$; t ha-1):

$$W = \frac{10{,}000}{a} \sum_{i=1}^{n} \frac{\pi d_i^2}{4}\, h_i f_i \rho_i \tag{3.6}$$

where $\rho_i$ is the wood density of the $i$th tree.
The diameter of a tree ($d$) is usually measured at a specific height (usually 1.30 m, known as
breast height). The tree form factor ($f$) defines the shape of the trunk, i.e., the reduction of diameter
with height. It is defined as the ratio of the volume of the tree trunk to the volume of a cylinder of the
same height as the trunk and with a cross-sectional area equal to that of the trunk at breast height. Wood
density defines the mass of wood substance present in a unit volume of wood and can be expressed
as grams per cubic centimetre (g cm-3) or kilograms per cubic meter (kg m-3); wood density is a
characteristic (trait) of each tree species, although some variation can occur. Equation 3.6 does not
account for the volume occupied by the branches and leaves of a tree and is, therefore, a simplification
of the true volume of a tree. This simplification has implications for accurately estimating the
aboveground biomass of a tree as we will see later in this lecture.
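Assuming diameters recorded in cm, heights in m and the plot area in m2 (and converting diameters to metres for the area terms), Equations 3.1-3.6 can be evaluated for a plot as in the sketch below; the tree list, plot area, form factors and wood densities are invented for illustration only.

```python
import numpy as np

a = 500.0                                         # plot area (m^2), illustrative
d_cm = np.array([12.0, 18.5, 25.0, 30.2, 41.0])   # diameters at breast height (cm)
h = np.array([9.0, 13.5, 17.0, 19.5, 24.0])       # tree heights (m)
f = np.full(d_cm.size, 0.5)                       # form factors (illustrative)
rho = np.array([0.55, 0.60, 0.62, 0.58, 0.65])    # wood densities (t m^-3 = g cm^-3), illustrative

d_m = d_cm / 100.0                                # diameters in metres for the area terms
scale = 10_000.0 / a                              # per-plot to per-hectare factor

D = scale * d_cm.size                             # density (trees ha^-1), Eq. 3.1
d_bar = d_cm.mean()                               # average diameter (cm), Eq. 3.2
h_bar = h.mean()                                  # average height (m), Eq. 3.3
G = scale * np.sum(np.pi * d_m**2 / 4)            # basal area (m^2 ha^-1), Eq. 3.4
V = scale * np.sum(np.pi * d_m**2 / 4 * h * f)    # volume (m^3 ha^-1), Eq. 3.5
W = scale * np.sum(np.pi * d_m**2 / 4 * h * f * rho)  # aboveground biomass (t ha^-1), Eq. 3.6

print(D, round(d_bar, 1), round(h_bar, 1), round(G, 2), round(V, 2), round(W, 2))
```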
3.2.3 Aboveground biomass
Aboveground biomass is crucial to understand both the source and sink terms in the global carbon
cycle - which is fundamentally what drives climate change by controlling the carbon dioxide loading of
the atmosphere. The source term comes from carbon emissions when aboveground biomass is lost
due to fire and land use change (predominantly tropical deforestation); the sink term arises because
growing forests extract carbon dioxide from the atmosphere and tie it up in long-lasting wood and
soil stores.
The aboveground biomass of a tree is not directly measured on the ground, but usually
estimated using tree-level allometric models, with one or more easy-to-measure predictors, such as
diameter, height, and wood density. A generic allometric model to estimate the aboveground biomass
of the $i$th tree ($w_i$) can take the form described in Equation 3.7: the aboveground biomass of the $i$th
tree is proportional to the product of its cross-sectional area (via $d_i^2$), height ($h_i$) and wood density
($\rho_i$). The coefficients $\alpha$ and $\beta$ need to be estimated from sampled data and characterise that
proportionality.

$$w_i = \alpha \big(\rho_i d_i^2 h_i\big)^{\beta} \tag{3.7}$$
These allometric equations are generated by destructive sampling, whereby a representative
sample of trees is felled and their aboveground biomass rigorously measured and regressed against
several predictors measured on the ground (i.e., diameter, height, and wood density). In the tropics,
an allometric model of the type described in Equation 3.7 was created, using a global database of
thousands of directly harvested trees, spanning a wide range of climatic conditions and vegetation
types (Chave et al. 2014), Equation 3.8:
$$w_i = 0.0559 \big(\rho_i d_i^2 h_i\big) \tag{3.8}$$

where $w_i$ is the aboveground biomass of the $i$th tree (kg), and $d_i$, $h_i$ and $\rho_i$ are the diameter (cm),
height (m) and wood density (g cm-3), respectively, of the $i$th tree. The sum of the
aboveground biomass of all trees measured inside a plot of area $a$ m2 is then given by Equation 3.9:

$$W = \frac{10{,}000}{a} \sum_{i=1}^{n} 0.0559 \big(\rho_i d_i^2 h_i\big) \tag{3.9}$$
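As an illustration of Equations 3.8 and 3.9, the sketch below applies the fixed-exponent Chave et al. (2014) model to each tree in a fictional 0.1 ha plot and expresses the summed biomass in t ha-1 (dividing by 1,000 to convert kg to tonnes); all tree measurements are invented.

```python
import numpy as np

a = 1_000.0                               # plot area (m^2), i.e. 0.1 ha, illustrative
d = np.array([14.0, 22.5, 31.0, 45.5])    # diameters at breast height (cm)
h = np.array([11.0, 16.0, 21.5, 28.0])    # tree heights (m)
rho = np.array([0.58, 0.63, 0.60, 0.67])  # wood densities (g cm^-3)

w_kg = 0.0559 * rho * d**2 * h                   # per-tree aboveground biomass in kg (Eq. 3.8)
W_t_ha = (10_000.0 / a) * w_kg.sum() / 1_000.0   # plot-level biomass in t ha^-1 (Eq. 3.9, kg -> t)
print(round(W_t_ha, 1))
```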
The estimates obtained with Equation 3.9 (or similar equations) are often the basis for creating
a reference dataset of aboveground biomass values over a given spatial region. We are then interested
in using these estimates in combination with some meaningful predictors to generate a model with
the lowest possible generalisation error. Machine learning methods are one such modelling option,
and observations collected by satellites provide several metrics that are often used as predictors,
allowing extrapolation of the estimates to much larger regions.
3.3 Predictors of aboveground biomass from space
Earlier, we saw that aboveground biomass is a biophysical parameter of major interest when
characterising a forest. You also learned how to calculate aboveground biomass with data collected
on the ground, using a volume-form factor approach, or supported by allometric models. However,
these estimates of aboveground biomass are often sparse, collected only over a few sites of interest.
In tropical regions, sampling to collect information about the aboveground biomass content in forests
is even scarcer, due to accessibility and budget constraints. Therefore, to extrapolate our knowledge
of the aboveground biomass distribution over larger areas (from country- to continental- and even
global-scales) we often rely on correlation with spatially explicit predictors, such as those collected
by satellite observations.
In tropical regions, there is a huge information gap arising from having minimal coverage by
ground observations, despite these forests having the highest total aboveground biomass content.
This contrasts with the extensive ground data for temperate and boreal latitudes (driven largely by
the needs of commercial forestry) (Figure 3.1).

Figure 3.1. The distribution of woody (forest and shrub land) area (black line) and biomass (blue line),
estimated by radar–LiDAR fusion compared to data availability from forest inventory (red histogram).
From Schimel et al. (2015); © 2014 John Wiley & Sons Ltd.
Data acquired by satellite observations cover large areas and contain information that can be
related to aboveground biomass. These sensors can be split into passive and active systems, with
passive systems using sunlight as the energy source to make measurements, whereas active sensors
use their own energy source. Some key active sensors provide useful information to estimate
aboveground biomass (and other forest biophysical parameters). Nevertheless, passive systems often
provide valuable ancillary information when retrieving aboveground biomass. Most of the time, the
best estimator of forest aboveground biomass includes a combination of data acquired from
both active and passive systems.
3.3.1 Active systems
Synthetic Aperture Radar (SAR) is an active system operating in the microwave domain of the
electromagnetic spectrum. Microwaves are not visible to the human eye and SAR sensors therefore
provide a different, thus complementary, view of the ground compared with optical satellites. The
radar emits an electromagnetic pulse and records the part of the pulse that is reflected, or scattered,
back to the satellite (hence the term backscatter). Unlike sunlight, which is non-polarised and
comprises a large range of different wavelengths, the radar is like a laser, but operates within narrow
and well-defined wavelength bands in the microwave spectrum, with specific polarisations. As
microwave wavelengths are several orders of magnitude longer than those of optical light, the signals
are almost unaffected by clouds, smoke, and haze, making SAR an important tool in areas with frequent cloud cover or haze.
SAR data are typically useful for the estimation of aboveground biomass. Pulses of microwave energy
are transmitted, and in forest land, they are reflected from the ground, canopy or trunk of woody
plants and trees. Using the strength and other attributes of the reflected pulses, the aboveground
biomass of woody vegetation, and their changes over time, can be estimated. Common present and
near-future (planned launch dates provided) spaceborne radar systems operated by space agencies
are listed below:
• P-band: 69.0 cm (BIOMASS: 2023)
• L-band: 23.5 cm (ALOS-2; SAOCOM-1; ALOS-4: 2023; NISAR-L: 2023)
• S-band: 9.4 cm (NovaSAR-1; NISAR-S: 2023)
• C-band: 5.6 cm (Sentinel-1; RADARSAT-2; RADARSAT Constellation Mission)
• X-band: 3.1 cm (TerraSAR-X; TanDEM-X; COSMO-SkyMed, PAZ)
LiDAR (Light Detection And Ranging) is an active remote sensing technology (the optical version
of radar) which uses pulses of laser light to measure distance and (in some cases) reflected energy.
The laser altimeter instrument emits light pulses that interact with different strata of the vegetation
and from which quantitative information on vegetation vertical structure can be estimated. As LiDAR
systems provide direct measurements of ground and vegetation height, they are highly relevant for
the estimation of emission factors. There is significant promise to use LiDAR (point) observations to
calibrate and validate estimations of forest stand height and aboveground biomass derived from SAR
(wall-to-wall) data to improve analysis feasibility and accuracy.
3.3.2 Passive systems
Passive systems, such as those relying on solar energy to make measurements (optical sensors), are
generally considered more useful for estimating, e.g., deforestation and forest degradation than for
retrieving forest biophysical parameters such as aboveground biomass. The minimum area observed
by these sensors (i.e., the spatial resolution or pixel size) influences their utility. Coarse spatial resolution
(generally taken as pixel sizes between 100 m and 1,000 m) is usually regarded as insufficient
are used for monitoring in this context, and specifically Landsat data at 30 m resolution are commonly
used for mapping deforestation activity. The temporal frequency, coverage, length of the archive,
availability of processed images, and free access of data also influence the utility of data. One of the
major constraints of optical data is the lack of images in cloudy areas, with parts of the humid tropics
experiencing persistent cloudiness. Two major optical data sources with an open data policy and long-
term service plan are highlighted.
The United States Landsat programme has a long history of use by providing systematic global
acquisitions. Data acquired by these sensors are provided free of charge and with several pre-
processing steps already applied. The quantitative information provided by these sensors (i.e., the
spectral bands) is critical to forest monitoring, including that acquired in the near and shortwave
infrared. Long time series of data are available for virtually any place on Earth since the early 1980s.
The Copernicus Sentinel-2 mission is part of the European Union's Earth Observation
Programme. Sentinel-2A was launched in June 2015 and Sentinel-2B in March 2017. Each satellite has
a design mission lifetime of more than 7 years and fuel for 12 years, and the mission has a free and open data policy.
Sentinel-2C and -2D are currently under development to guarantee data continuity. In the future, as
the time series increases, Sentinel-2 data are likely to become standard for monitoring forest
dynamics. The instrument has 13 spectral bands: 4 visible and near-infrared bands at 10 m spatial resolution, 6 red-
edge/shortwave-infrared bands at 20 m spatial resolution, and 3 bands at 60 m spatial resolution.
3.4 Summary
In this lecture, you learned about the main biophysical parameters characterising a forest, including
the importance of accurately estimating aboveground biomass with ground observations. The role of
satellite observations to retrieve forest aboveground biomass was highlighted, distinguishing between
data acquired by active and passive systems.

4 Advanced topics of aboveground biomass estimation from space
4.1 Introduction and objectives
In this lecture, you will learn about the two major approaches aimed at estimating aboveground
biomass with satellite observations. This lecture will end with an introduction to the Sheffield-born
BIOMASS mission and the progress sought by this soon to be launched mission.
4.2 Main methods to estimate aboveground biomass with satellite data
The two main approaches to estimate forest aboveground biomass with satellite observations rely on
i) data-driven or ii) process-based methods.
4.2.1 Data-driven methods
In the context of aboveground biomass estimation, data-driven methods, or statistical inference, rely
on a mapping function between a representative sample of aboveground biomass estimates, often
obtained from ground measurements, and variables obtained by satellite observations. These
mapping functions can be divided into parametric and non-parametric methods.
A parametric model presumes that the form of the mapping function is known. When a modeller
chooses one family of curves and feeds that choice into the inferential process, the information that
the data can supply about the model is then restricted by this assumed parametric form. The modeller
relies less on the data for information about the model.
Non-parametric methods allow great flexibility in the possible form of the regression model and
make no assumption about a parametric form. They rely on the modeller to supply only
qualitative information about the function and let the data speak for themselves concerning the actual
form of the regression model. They are best suited for inference in situations where there is very little
or no prior information about the regression model.
Most machine learning methods applied to problems dealing with estimating
aboveground biomass from satellite observations fall in the non-parametric category, such
as k-nearest neighbours, support vector machines, and tree-based methods (regression trees,
random forests and gradient boosting). These methods are driven by the sensitivity of the predictor
variables - obtained from the satellite observations - to the response variable (aboveground biomass).
The most promising satellite-based predictors are those obtained from active systems, essentially
from SAR and LiDAR sensors.
4.2.2 Process-based methods
Physically-based approaches relating satellite observations to aboveground biomass are often
preferred to empirical relationships because empirical models and regression approaches are typically
developed at local sites and have not been generalised to a level of performance that can be considered
sufficient from a global perspective. Model training requires a dataset of ground observations of
aboveground biomass, with high accuracy, with the same spatial scale as the satellite observations and
with a representative distribution within the range of aboveground biomass values in the area of
interest. While such requirements may be met locally, inventoried forests are a small fraction of
forests worldwide. It is believed that accurate estimates of model parameters can be obtained for
forests with ground data, but estimates for forests that are under-represented or not represented at
all in the training dataset may be erroneous because the model parameters are based on reference
material insufficiently descriptive of the behaviour of aboveground biomass as a function of the
satellite observations.
The Water-Cloud Model with gaps is a process-based model relying on backscattering
measurements made by SAR sensors (see Section 3.3.1). This model expresses the total forest
backscatter ($\sigma^0_{\mathrm{for}}$) as the sum of direct scattering from the ground ($\sigma^0_{\mathrm{gr}}$) through gaps in the canopy,
ground scattering attenuated by the canopy, and direct scattering from the vegetation ($\sigma^0_{\mathrm{veg}}$):

$$\sigma^0_{\mathrm{for}} = \sigma^0_{\mathrm{gr}}\, e^{-\beta V} + \sigma^0_{\mathrm{veg}} \big(1 - e^{-\beta V}\big) \tag{4.1}$$

where $V$ is the forest volume and $\beta$ an empirically defined coefficient. To estimate the forest
biophysical parameter of interest ($V$), Equation 4.1 needs to be solved for $V$ and we are then left with
three unknowns that need to be estimated: $\sigma^0_{\mathrm{gr}}$, $\sigma^0_{\mathrm{veg}}$ and $\beta$; $\sigma^0_{\mathrm{for}}$ is the quantity observed by the SAR
sensor.

$$V = -\frac{1}{\beta} \ln\left(\frac{\sigma^0_{\mathrm{for}} - \sigma^0_{\mathrm{veg}}}{\sigma^0_{\mathrm{gr}} - \sigma^0_{\mathrm{veg}}}\right) \tag{4.2}$$
Although these can be estimated by least squares regression using a dataset of forest volume
measurements, this approach is unfeasible for large areas because it requires a dense network of
training sites. This can be overcome by a model training approach that does not rely on ground
measurements. A detailed explanation of this model training process can be found in Santoro et al.
(2011).
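Given estimates of $\sigma^0_{\mathrm{gr}}$, $\sigma^0_{\mathrm{veg}}$ and $\beta$, inverting Equation 4.2 for forest volume is a one-line computation. The sketch below uses invented parameter values and backscatter observations purely for illustration, and clips the argument of the logarithm to keep it in the physically valid range.

```python
import numpy as np

def wcm_invert(sigma0_for, sigma0_gr, sigma0_veg, beta):
    """Invert the Water-Cloud Model with gaps (Equation 4.2) for forest volume V."""
    ratio = (sigma0_for - sigma0_veg) / (sigma0_gr - sigma0_veg)
    ratio = np.clip(ratio, 1e-6, 1.0)   # keep the argument of the logarithm physically valid
    return -np.log(ratio) / beta

# invented model parameters (linear backscatter units) and observations, for illustration only
sigma0_gr, sigma0_veg, beta = 0.02, 0.10, 0.008     # ground, vegetation, attenuation (ha m^-3)
sigma0_for = np.array([0.03, 0.06, 0.09])           # observed forest backscatter

print(wcm_invert(sigma0_for, sigma0_gr, sigma0_veg, beta))   # estimated volumes (m^3 ha^-1)
```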
4.3 The BIOMASS mission
BIOMASS was selected in 2013 as the 7th European Space Agency (ESA) Earth Explorer mission, with
the aim of providing accurate estimates of the distribution of aboveground biomass in the world's
forests at a spatial scale of 200 m. The science case showed clearly that the mission had to be based
on a P-band Synthetic Aperture Radar (SAR), given its ability to measure aboveground biomass within
dense tropical forests, which, as we saw before, have the highest total forest carbon content but
minimal coverage by ground data.
BIOMASS, to be launched in 2023 with an expected 5-year mission lifetime, will deliver three
primary geophysical products every six months: maps of forest aboveground biomass density and
forest height at 200 m spatial resolution, and maps of severe forest disturbances at 50 m spatial
resolution. In addition to the primary mission objectives, the mission will provide data for new
scientific applications, including topographic mapping below forests, mapping ice sheets, glacier flow
and structure analysis and mapping of subsurface geological features in arid areas.
4.4 Summary
In this lecture, you were given an overview of the main methods used to estimate aboveground
biomass from satellite observations. This topic ended with an introduction to the Sheffield-born
BIOMASS mission.

5 References
Breiman, L. (1984). Classification and regression trees. Belmont, Calif.: Wadsworth International
Group
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32
Chave, J., Rejou-Mechain, M., Burquez, A., Chidumayo, E., Colgan, M.S., Delitti, W.B., Duque, A., Eid, T.,
Fearnside, P.M., Goodman, R.C., Henry, M., Martinez-Yrizar, A., Mugasha, W.A., Muller-Landau, H.C.,
Mencuccini, M., Nelson, B.W., Ngomanda, A., Nogueira, E.M., Ortiz-Malavassi, E., Pelissier, R., Ploton,
P., Ryan, C.M., Saldarriaga, J.G., & Vieilledent, G. (2014). Improved allometric models to estimate the
aboveground biomass of tropical trees. Global Change Biology, 20, 3177-3190
Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of Computer and System Sciences, 55, 119-139
Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of
Statistics, 29, 1189-1232
Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38,
367-378
Hastie, T., Tibshirani, R., & Friedman, J.H. (2009). The elements of statistical learning: data mining,
inference, and prediction (2nd ed.). New York, NY: Springer
Mitchard, E.T.A. (2018). The tropical forest carbon cycle and climate change. Nature, 559, 527-534
Ren, Y., Zhang, L., & Suganthan, P.N. (2016). Ensemble Classification and Regression-Recent
Developments, Applications and Future Directions. IEEE Computational Intelligence Magazine, 11,
41-53
Rodriguez-Veiga, P., Wheeler, J., Louis, V., Tansey, K., & Balzter, H. (2017). Quantifying Forest Biomass
Carbon Stocks From Space. Current Forestry Reports, 3, 1-18
Santoro, M., Beer, C., Cartus, O., Schmullius, C., Shvidenko, A., McCallum, I., Wegmuller, U., &
Wiesmann, A. (2011). Retrieval of growing stock volume in boreal forest using hyper-temporal series
of Envisat ASAR ScanSAR backscatter measurements. Remote Sensing of Environment, 115, 490-
507
Schimel, D., Pavlick, R., Fisher, J.B., Asner, G.P., Saatchi, S., Townsend, P., Miller, C., Frankenberg, C.,
Hibbard, K., & Cox, P. (2015). Observing terrestrial ecosystems and the carbon cycle from space.
Global Change Biology, 21, 1762-1776

