Create Samples in R

Creating validation samples is a crucial step in validation workflows. Certainly, sampling is not a trivial task. Many publications deal with the optimal sampling method, trying to maintain heterogeneity and avoid autcorrelations within validation samples, e.g., Köhl et al. 2006, Mu et al. 2015, Oloffson et al. 2014.

In general, there are several random sampling strategies commonly found in remote sensing studies and software solutions:

  • random: Validation samples were picked completely random. Each pixel within the study area has the same probability of being picked.
  • stratified random: The proportions of the classes in the validation samples correspond to the area proportion on the classification map.That is, the larger the area of class on the map, the more samples will be drawn from it.
  • equalized stratified random: Ensures, that each class’s sample size is exactly the same regardless of area of class on the map

We will use equalized stratified random sampling for this example. This is a complete R script you need in order to automatically generate exactly 50 samples per class within your study extent:

# import package
library(raster)
 
# import classification image (last chapter)
setwd("/media/sf_exchange/landsatdata/")
img.classified <- raster("classification_RF.tif")
 
# create 50 test samples per class
samplesperclass <- 50
smp.test <- sampleStratified(classification, size = samplesperclass, na.rm = TRUE, sp = TRUE)
# shuffle test samples
smp.test <- smp.test[sample(nrow(smp.test)), ]
# delete attributes
smp.test <- smp.test[ , -c(1, 2)]
# create standard ID attribute
smp.test$ID <- 1:nrow(smp.test)
 
# save test samples as point shapefile
shapefile(smp.test,
          filename = "validation_RFnew.shp",
          overwrite = TRUE
          )


In-depth Guide

All we need is the raster package, so make sure you’ve imported it.

#install.packages("raster")
library(raster)

Furthermore, you should have your classification map ready (Chapter Analysis), lying in your working directory. Import it with the raster() function:

setwd("/media/sf_exchange/landsatdata/")
img.classified <- raster("classification_RF.tif")

The raster provides a function called sampleStratified(), which does all the work for us:

smp.test <- sampleStratified(x = classification,
                             size = 50,
                             na.rm = TRUE,
                             sp = TRUE)

This function needs an Raster-Object as x argument and a positive integer value as size argument. Latter is the number of sample points per class. Additionally we can exclude all NA values, by setting na.rm = TRUE. NA Values can arise if your scene lies diagonally in space and points are placed in the border areas. Line 4 ensures, that the returned object is a SpatialPointDataFrame (which is easier to handle).

We can now check the class labels of our newly extracted validation points:

smp.test$classification_RF
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [176] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5
## [211] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [246] 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [281] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

Note: the name of the column within smp.test depends on the file name of your classification map.

Due to the subsequent manual labeling in the next section, it makes sense to mix the samples so that the order is random. Again, we use the sample() function for this:

smp.test <- smp.test[sample(nrow(smp.test)), ]

By looking at the class labels again, we see that the order is now random:

smp.test$classification_RF
##   [1] 3 1 4 4 6 4 5 3 6 2 1 5 4 2 2 4 4 3 4 5 1 3 4 2 6 3 3 6 1 4 6 6 5 6 2
##  [36] 3 1 1 3 3 1 2 5 4 3 4 2 4 1 5 1 1 4 1 3 5 2 1 5 4 5 2 3 6 2 1 5 3 5 2
##  [71] 2 6 4 1 5 1 6 4 6 6 6 6 1 1 1 2 4 5 6 4 5 3 5 4 2 2 1 2 2 3 5 6 6 1 1
## [106] 2 1 2 1 1 3 4 6 1 6 6 5 4 6 2 1 6 6 5 1 5 2 3 6 4 2 6 6 3 4 6 4 2 6 3
## [141] 2 2 2 3 2 6 6 5 3 3 4 2 6 1 2 5 1 3 6 1 3 6 1 3 4 2 2 5 4 1 1 6 5 4 4
## [176] 4 2 5 1 6 1 5 5 6 4 5 2 2 2 5 3 2 6 2 5 3 6 4 1 5 5 3 1 3 5 4 4 4 4 2
## [211] 3 5 3 4 1 1 3 3 4 2 2 3 3 1 4 4 6 2 1 1 2 6 3 4 5 3 6 1 6 5 2 3 6 3 6
## [246] 5 1 1 4 1 2 4 3 4 5 5 3 5 2 3 2 2 6 3 6 5 5 3 2 4 3 1 3 6 5 6 5 5 5 6
## [281] 2 3 5 5 1 1 3 4 4 1 4 5 6 4 2 3 5 3 5 4

In addition, we can delete all variables in our dataframe smp.test and append a consecutive ID variable called ID, which will then be displayed to us in QGIS:

smp.test <- smp.test[, -c(1, 2)]
smp.test$ID <- 1:nrow(smp.test)

smp.test
## class       : SpatialPointsDataFrame 
## features    : 300 
## extent      : 369870, 400650, 5812410, 5827500  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
## variables   : 1
## names       :  ID 
## min values  :   1 
## max values  : 300

To visualize the distribution of our validation points, we can plot the SpatialPointDataFrame smp.test on top of our classification map in one plot:

plot(classification, 
     axes = FALSE, 
     box = FALSE,
     col = c("#fbf793", "#006601", "#bfe578", "#d00000", "#fa6700", "#6569ff")
     )

points(smp.test)

Last but not least, it is still necessary to save the SpatialPointDataFrame smp.test as a shapefile to your hard drive. This is also very easy with the shapefile() function from the raster package. Choose an appropriate filename = for the new shapefile created.

shapefile(smp.test,
          filename = "validation_RF.shp",
          overwrite = TRUE
          )