Machine Learning Basics

This section covers the basics of machine learning (ML) theory as well as commonly used terms and concepts, so that you get comfortable with the two ML algorithms we will focus on: Random Forest (RF) and Support Vector Machine (SVM).

Section in a Box

You will learn what machine learning means and gain insight into two ML algorithms for image classification: Random Forest and Support Vector Machine.

What is Machine Learning?
– basic introduction and definition
– commonly used terms when dealing with ML
Random Forest
– Random Forest method explained
– concept of Decision Trees (CART) as part of Random Forests
Support Vector Machine
– concept of a Support Vector Machine

What is machine learning?

In the next two sections we will have a look at two supervised, non-parametric, and nonlinear algorithms: the Random Forest and the Support Vector Machine. These terms do not tell you anything yet? Then read the following short intro, which explains the most important terminology.

Machine Learning (ML) is actually a lot of things – it is a generic term for the artificial generation of knowledge, or artificial intelligence. An artificially learning system learns from examples and can generalize from them after the learning phase is completed. The examples are not simply memorized; instead, the underlying pattern is recognized. This comes in handy when new, unknown data have to be handled (learning transfer).
ML methods optimize their performance iteratively by learning from training examples, or training data. These methods can be descriptive, distinguishing between several classes of a phenomenon (i.e., classification), or predictive, making predictions of a phenomenon (i.e., regression). Ultimately, ML methods create a function which tries to map input data \(x\) (e.g., Landsat 8 band reflectances) to a desired output response \(y\) (e.g., class labels, such as urban):

\begin{equation} \label{poly}\tag{1}
\text{Find } f : x \mapsto y = f(x).
\end{equation}

In the field of remote sensing, descriptive ML algorithms are often used for land cover classification. The range of applications is huge, covering land use / land cover (LULC) changes of all kinds, e.g., urban sprawl, burned area detection, flood area forecasting, or land degradation. In order to perform a classification, the algorithm must learn to differentiate between the various types of patterns based on training samples. After the model has learned, its performance can be tested using independent testing samples. There are two main ways an algorithm can train: in an unsupervised or a supervised learning manner:

Unsupervised vs. Supervised Algorithms

The difference between unsupervised and supervised algorithms lies in whether or not they include a priori knowledge during the training phase in the form of labeled training samples.

Unsupervised algorithms have only the training samples \(x\) available, e.g., reflectance values of all 9 bands of Landsat 8. Thus, the class label information, e.g., urban, or forest, is missing. The objective of the algorithm is to describe how the data are organized or clustered – it has to find patterns and relationships by itself.
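The unsupervised case can be sketched as follows. This is an illustrative toy example with synthetic feature values (they do not correspond to real Landsat 8 reflectances); k-means clustering stands in for the unsupervised algorithm:

```python
# Unsupervised sketch: only the samples X are given, no class labels.
# The algorithm has to find the grouping in the data by itself.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic groups of 50 samples with 9 "band" features each
# (purely illustrative values, not real reflectances)
X = np.vstack([rng.normal(0, 1, size=(50, 9)),
               rng.normal(5, 1, size=(50, 9))])

# k-means only sees X and assigns each sample to one of two clusters
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note that the resulting cluster IDs carry no semantic meaning; an analyst still has to decide afterwards which cluster corresponds to, e.g., urban or forest.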

Supervised algorithms make use of a set of labeled training samples, including the samples \(x\) and the appropriate class labels \(y\). The objective here is to predict the value \(y\) corresponding to a new, unknown sample \(x\). In other words, we teach the algorithm what the individual classes look like in the feature space, e.g., the Landsat 8 bands.
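A minimal sketch of the supervised workflow described above, using synthetic data (the feature values and the two hypothetical classes are made up for illustration) and a Random Forest as the classifier:

```python
# Supervised sketch: labeled samples (X, y) teach the model what the
# classes look like; independent test samples then measure how well it
# predicts labels for unknown data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic samples for two hypothetical classes, e.g., 0 = forest, 1 = urban
X = np.vstack([rng.normal(0, 1, size=(60, 4)),
               rng.normal(3, 1, size=(60, 4))])
y = np.array([0] * 60 + [1] * 60)

# Hold back 30 % of the labeled samples as an independent test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                   # learning phase on labeled samples
test_accuracy = clf.score(X_test, y_test)   # performance on unseen samples
```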

There are also semisupervised methods, which combine the two aforementioned approaches. Waske et al. (2009) provide a more complete overview of the different classifier categories and examples.

Linear vs. Nonlinear Algorithms

This differentiation is straightforward:

Linear algorithms assume that the sample features \(x\) and the label output \(y\) are linearly related, i.e., that there is an affine function \(f(x) = \langle w, x \rangle + b\) describing the underlying relationship.

Nonlinear algorithms assume a nonlinear relationship between \(x\) and \(y\). Thus, \(f(x)\) can be a function of arbitrary complexity.
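The contrast can be illustrated with the classic XOR toy problem, in which no affine function can separate the two classes (this example is a sketch, not taken from the text; the SVM kernels anticipate the SVM section below):

```python
# Linear vs. nonlinear sketch on the XOR pattern: a linear decision
# function cannot classify all four points, a nonlinear (RBF-kernel)
# one can.
import numpy as np
from sklearn.svm import SVC

# XOR problem: diagonally opposite corners share a class
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

linear_acc = linear_svm.score(X, y)  # below 1.0: no affine f(x) separates XOR
rbf_acc = rbf_svm.score(X, y)        # 1.0: nonlinear f(x) fits all four points
```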

Parametric vs. Non-Parametric Algorithms

Some ML methods require the data to follow a specific distribution in the feature space, for example a multivariate Gaussian (normal) distribution. A method is called parametric when such assumptions are made. The Maximum Likelihood Classifier is such a parametric algorithm.

Non-parametric approaches are not constrained by prior assumptions about the data distribution. Such methods can therefore be applied to a wide range of tasks and data types, such as RADAR data.
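As a hedged sketch of this distinction on synthetic data: scikit-learn's QuadraticDiscriminantAnalysis fits a Gaussian model per class (essentially what the Maximum Likelihood Classifier does), while the Random Forest makes no distributional assumption. Data and class labels below are made up for illustration:

```python
# Parametric vs. non-parametric sketch on synthetic two-class data.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(60, 3)),
               rng.normal(3, 1, size=(60, 3))])
y = np.array([0] * 60 + [1] * 60)

# Parametric: estimates a mean vector and covariance matrix per class,
# i.e., assumes each class follows a multivariate Gaussian
mlc = QuadraticDiscriminantAnalysis().fit(X, y)

# Non-parametric: no assumption about how X is distributed
rf = RandomForestClassifier(random_state=0).fit(X, y)

mlc_acc = mlc.score(X, y)
rf_acc = rf.score(X, y)
```

On Gaussian data like this, both perform well; the non-parametric model keeps its flexibility when the distributional assumption does not hold.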

Overfitting vs. Underfitting

In statistics, the term fit refers to how well the function learned by the ML algorithm approximates the true mapping from input \(x\) to output \(y\).

Overfitting occurs when a model learns to map the training data too well, which negatively impacts its performance on new, unknown data. The model has then lost its ability to generalize.

A trained ML model is underfitting if it can model neither the training nor the test data correctly. Underfitting is easy to detect during the training phase by means of the given performance metrics. If the problem cannot be solved this way, another classifier should be considered.
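Overfitting can be spotted by comparing training and test performance, as sketched below on synthetic, noisy data (an unconstrained decision tree is used here because it readily memorizes the training set):

```python
# Overfitting sketch: an unconstrained decision tree fits the training
# data perfectly but performs noticeably worse on held-out test data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Noisy labels: even a perfect model cannot reach 100 % test accuracy
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = deep.score(X_train, y_train)  # memorizes the training samples
test_acc = deep.score(X_test, y_test)     # lower: generalization has suffered
```

A large gap between `train_acc` and `test_acc` indicates overfitting, whereas low accuracy on both sets indicates underfitting.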