# Learning Curve

One question that comes up again and again is how much training data a supervised classification needs. There is no general answer, because every classification and regression problem is different.
However, there is a straightforward empirical way to assess it for your own data: creating a learning curve.
Simply repeat your training procedure multiple times with different training sample sizes and compare the resulting performance!
Plotting those results will (hopefully) show a decreasing error rate with increasing amounts of training data.

The following complete script performs the RF classification from the previous section with a maximum of $$\{2, 4, 8, 16, …, 2048\}$$ samples per class (steps). To reduce the influence of random effects and to estimate a confidence interval, each classification is repeated 20 times (nrepeat).
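The confidence interval mentioned above is obtained in the script from a one-sample t-test over the repeated error rates. The following minimal sketch shows what that interval contains. The names `rtmp_demo`, `ttest_demo`, `ci` and `ci_manual` are hypothetical, and the error rates are simulated purely for illustration; they are not results from the classification.

```r
# Simulated OOB error rates from 20 repeated runs (made-up values, not real results)
set.seed(42)
rtmp_demo <- rnorm(20, mean = 0.15, sd = 0.02)

# One-sample t-test; conf.int holds the lower and upper bound for the mean
ttest_demo <- t.test(rtmp_demo, conf.level = 0.95)
ci <- as.numeric(ttest_demo$conf.int)

# The same interval computed by hand: mean +/- t quantile * standard error
n <- length(rtmp_demo)
half_width <- qt(0.975, df = n - 1) * sd(rtmp_demo) / sqrt(n)
ci_manual <- mean(rtmp_demo) + c(-1, 1) * half_width

print(ci)
print(ci_manual)
```

The first element of `conf.int` is the lower bound and the second the upper bound. Note that `t.test` stops with an error if all values are identical, which could happen if every repetition yields exactly the same error rate.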

To keep the code compact, we use two nested for loops to iterate over the individual sample sizes and repetitions and write the respective results into a matrix r. The code is commented throughout.

# import libraries
library(raster)
library(randomForest)

# import image (img) and shapefile (shp)
setwd("/media/sf_exchange/landsatdata/")
img <- brick("LC081930232017060201T1-SC20180613160412_subset.tif")
shp <- shapefile("training_data.shp")

# extract samples with class labels and put them all together in a dataframe
names(img) <- c("b1", "b2", "b3", "b4", "b5", "b6", "b7")
smp <- extract(img, shp, df = TRUE)
smp$cl <- as.factor(shp$classes[match(smp$ID, seq(nrow(shp)))])
smp <- smp[-1]

# number of samples per class chosen for each run
steps <- 2^(1:11)
# number of repetitions of each run
nrepeat <- 20

# create empty matrix for final results
# (rows: mean, upper CI bound, lower CI bound, max, min; one column per step)
r <- matrix(0, 5, length(steps))

for (i in 1:length(steps)) {
  # create empty vector for OOB error from each repetition
  rtmp <- rep(0, nrepeat)

  for (j in 1:nrepeat) {
    # shuffle all samples and subset according to size defined in "steps"
    # (seq_along numbers the rows within each class, also for classes with one row)
    sub <- smp[sample(nrow(smp)), ]
    sub <- sub[ave(1:(nrow(sub)), sub$cl, FUN = seq_along) <= steps[i], ]

    # RF classify as usual
    sub.size <- rep(min(summary(sub$cl)), nlevels(sub$cl))
    rfmodel <- tuneRF(x = sub[-ncol(sub)],
                      y = sub$cl,
                      sampsize = sub.size,
                      strata = sub$cl,
                      ntree = 250,
                      doBest = TRUE,
                      plot = FALSE
    )

    # extract OOB error rate of last tree (the longest trained) and save to rtmp
    ooberrors <- rfmodel$err.rate[ , 1]
    rtmp[j] <- ooberrors[length(ooberrors)]
  }

  # use repetitions to calc statistics (mean, CI, max, min) & save it to final results matrix
  ttest <- t.test(rtmp, conf.level = 0.95)
  r[ , i] <- c(mean(rtmp), ttest$conf.int[2], ttest$conf.int[1], max(rtmp), min(rtmp))
}
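The least obvious step in the loop is the per-class subsetting with `ave()`. The sketch below isolates it on a toy table so it can be run without the image data. The data frame `toy`, its class names and `n_per_class` are hypothetical and only serve the illustration.

```r
set.seed(1)

# Toy sample table: three classes with 10, 6 and 3 rows (hypothetical data)
toy <- data.frame(
  b1 = runif(19),
  cl = factor(rep(c("water", "forest", "urban"), times = c(10, 6, 3)))
)

# Maximum number of samples to keep per class
n_per_class <- 4

# Shuffle the rows, then number the rows within each class (1, 2, 3, ...)
shuffled <- toy[sample(nrow(toy)), ]
rank_in_class <- ave(seq_len(nrow(shuffled)), shuffled$cl, FUN = seq_along)

# Keep only the first n_per_class rows of every class
sub_demo <- shuffled[rank_in_class <= n_per_class, ]
counts <- table(sub_demo$cl)
print(counts)
```

Because the rows are shuffled first, the retained rows are a random draw from each class. A class with fewer rows than requested simply contributes all of its rows, which is why the step sizes are a maximum per class and why the script balances the classes afterwards via `sampsize`.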


When the individual experiments have been processed, we can create a nice plot that summarizes the findings:

# conversion in percent
r <- r * 100
# plot empty plot without x-axis
plot(x = 1:length(steps),
y = r[1,],
ylim = c(min(r), max(r)),
type = "n",
xaxt = "n",
xlab = "number of samples per class",
ylab = "OOB error [%]"
)
# complete the x-axis
axis(1, at=1:length(steps), labels=steps)
grid()
# draw min-max range of OOB errors
polygon(c(1:length(steps), rev(1:length(steps))),
c(r[5, ], rev(r[4, ])),
col = "grey80",
border = FALSE
)
# draw confidence interval 95%
polygon(c(1:length(steps), rev(1:length(steps))),
c(r[3, ], rev(r[2, ])),
col = "grey60",
border = FALSE
)
# draw line of mean OOB
lines(1:length(steps), r[1, ], lwd = 3) 
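
One possible way to turn the curve into a concrete number is to look for the smallest sample size whose mean error is close to the best mean error observed. The following is only a sketch under stated assumptions: `steps_demo`, `mean_err_demo`, `tol` and `good_enough` are hypothetical names, the error values are made up for illustration, and the tolerance of one percentage point is an arbitrary choice.

```r
# Made-up mean OOB errors in percent, one per step (not real results)
steps_demo <- 2^(1:11)
mean_err_demo <- c(38, 30, 24, 19, 15, 12, 10, 9, 8.6, 8.4, 8.3)

# Tolerance in percentage points; an arbitrary choice for this sketch
tol <- 1

# Smallest sample size whose mean error is within tol of the best mean error
within_tol <- which(mean_err_demo <= min(mean_err_demo) + tol)
good_enough <- steps_demo[within_tol[1]]
print(good_enough)
```

With the results of an actual run, `r[1, ]` and `steps` would take the place of the made-up vectors. Whether a given tolerance is acceptable depends on the application, so the plot remains the primary tool for judging the curve.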