{"id":1966,"date":"2018-07-09T15:23:31","date_gmt":"2018-07-09T13:23:31","guid":{"rendered":"https:\/\/blogs.fu-berlin.de\/reseda\/?page_id=1966"},"modified":"2018-09-25T19:33:09","modified_gmt":"2018-09-25T17:33:09","slug":"rf-classification","status":"publish","type":"page","link":"https:\/\/blogs.fu-berlin.de\/reseda\/rf-classification\/","title":{"rendered":"RF Classification"},"content":{"rendered":"<p>There are some packages available containing the possibility to perform the standard RF algorthm described by <a href=\"https:\/\/www.stat.berkeley.edu\/~breiman\/randomforest2001.pdf\" rel=\"noopener\" target=\"_blank\">Breiman (2001)<\/a>, e.g., in the <a href=\"https:\/\/cran.r-project.org\/web\/packages\/caret\/caret.pdf\" rel=\"noopener\" target=\"_blank\">caret<\/a>, <a href=\"https:\/\/cran.r-project.org\/web\/packages\/randomForest\/randomForest.pdf\" rel=\"noopener\" target=\"_blank\">randomForest<\/a>, <a href=\"https:\/\/github.com\/imbs-hl\/ranger\" rel=\"noopener\" target=\"_blank\">ranger<\/a>, <a href=\"https:\/\/cran.r-project.org\/web\/packages\/xgboost\/xgboost.pdf\" rel=\"noopener\" target=\"_blank\">xgboost<\/a>, or <a href=\"https:\/\/cran.r-project.org\/web\/packages\/randomForestSRC\/randomForestSRC.pdf\" rel=\"noopener\" target=\"_blank\">randomForestSRC <\/a>packages. However, we will use the package called &#8220;randomForest&#8221; because it is the most common and therefore best supported.<\/p>\n<p>Below you can see a complete code implementation. While this is already executable with your input data, you should read the following comprehensive in-depth guide to understand the code in detail. Even better: You will learn how to generate numerous useful plots, which do great in each thesis!<\/p>\n<pre class=\"theme:amityreseda\">\r\n# import packages\r\nlibrary(raster)\r\nlibrary(randomForest)\r\n \r\n# import image (img) and shapefile (shp)\r\nsetwd(\"\/media\/sf_exchange\/landsatdata\/\")\r\nimg &lt;- brick(&quot;LC081930232017060201T1-SC20180613160412_subset.tif&quot;)\r\nshp &lt;- shapefile(&quot;training_data.shp&quot;)\r\n \r\n# extract samples with class labels and put them all together in a dataframe\r\nnames(img) &lt;- c(&quot;b1&quot;, &quot;b2&quot;, &quot;b3&quot;, &quot;b4&quot;, &quot;b5&quot;, &quot;b6&quot;, &quot;b7&quot;)\r\nsmp &lt;- extract(img, shp, df = TRUE)\r\nsmp$cl &lt;- as.factor( shp$classes[ match(smp$ID, seq(nrow(shp)) ) ] )\r\nsmp &lt;- smp[-1]\r\n\r\n# tune and train rf model\r\nsmp.size &lt;- rep(min(summary(smp$cl)), nlevels(smp$cl))\r\nrfmodel &lt;- tuneRF(x = smp[-ncol(smp)],\r\n                  y = smp$cl,\r\n                  sampsize = smp.size,\r\n                  strata = smp$cl,\r\n                  ntree = 250,\r\n                  importance = TRUE,\r\n                  doBest = TRUE\r\n                  )\r\n\r\n# save rf model \r\nsave(rfmodel, file = \"rfmodel.RData\")\r\n\r\n# predict image data with rf model\r\nresult &lt;- predict(img,\r\n                  rfmodel,\r\n                  filename = &quot;classification_RF.tif&quot;,\r\n                  overwrite = TRUE\r\n                  )\r\n<\/pre>\n<p><\/br><a name=\"1\"><\/a><\/p>\n<h1>In-depth Guide<\/h1>\n<p>In order to be able to use the functions of the randomForest package, we must additionally load the library into the current session via <span class=\"crayon-inline theme:amityreseda\">library()<\/span>. 
If you do not use our VM, you must first download and install the packages with `install.packages()`:

```r
#install.packages("raster")
#install.packages("randomForest")
library(raster)
library(randomForest)
```

First, the training samples must be processed into a data frame. The necessary steps are described in detail in the [previous section](https://blogs.fu-berlin.de/reseda/prepare-samples/).

```r
names(img) <- c("b1", "b2", "b3", "b4", "b5", "b6", "b7")
smp <- extract(img, shp, df = TRUE)
smp$cl <- as.factor( shp$classes[ match(smp$ID, seq(nrow(shp))) ] )
smp <- smp[-1]
```

After that, you can identify the number of available training samples per class with `summary()`. There is often an imbalance in the number of training pixels, i.e., one class is represented by a large number of pixels, while another class has very few samples:

```r
summary(smp$cl)
##  baresoil    forest grassland  urban_hd  urban_ld     water 
##       719      2074      1226      1284       969       763
```

This imbalance often causes classifiers to favor, and thus overclassify, strongly represented classes. However, the Random Forest algorithm, as an ensemble classifier, provides an ideal way to compensate for it: for each decision tree, we draw a bootstrap sample from the minority class (the class with the fewest samples) and then randomly draw the same number of cases, with replacement, from every other class. This technique is called down-sampling.
In our example the minority class is *baresoil* with 719 samples. With the `rep()` function we build a vector whose length equals the number of target classes. We will use this vector to tell the classifier how many samples it should randomly draw per class for each decision tree during training:

```r
smp.size <- rep(min(summary(smp$cl)), nlevels(smp$cl))
smp.size
## [1] 719 719 719 719 719 719
```
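If you want to see the imbalance at a glance before training, a simple bar chart of the class frequencies does the job. This is a minimal sketch using base R only; the styling arguments are optional choices:

```r
# bar chart of training pixels per class; summary() on a factor
# returns a named vector, which barplot() accepts directly
barplot(summary(smp$cl),
        las = 2,                               # rotate class labels
        ylab = "number of training pixels"
        )
```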
The complete training takes place via just one call of `tuneRF()`! This function automatically searches for the best setting of *mtry*, the number of variables available at each tree node. So we only have to worry about `ntree`, i.e., the number of trees to grow. 250 to 1000 trees are usually sufficient; basically, the more the better, but many trees also increase the computation time. When calling `tuneRF()`, we specify the training samples as `x`, i.e., all columns of our `smp` dataframe except the last one, and the corresponding class labels as `y`, i.e., the last column of our `smp` dataframe, named "cl":

```r
rfmodel <- tuneRF(x = smp[-ncol(smp)],
                  y = smp$cl,
                  sampsize = smp.size,
                  strata = smp$cl,
                  ntree = 250,
                  importance = TRUE,
                  doBest = TRUE
                  )
## mtry = 2  OOB error = 2.5% 
## Searching left ...
## mtry = 1     OOB error = 2.54% 
## -0.01704545 0.05 
## Searching right ...
## mtry = 4     OOB error = 2.7% 
## -0.07954545 0.05
```

We pass our `smp.size` vector as the `sampsize =` argument to define how many samples should be drawn per class, and `strata =` defines the column to use for this stratified sampling. Setting `importance = TRUE` allows the subsequent assessment of the variable importance. Setting `doBest = TRUE` makes the function directly return the RF trained with the optimal *mtry*.

![OOB error for different mtry settings, as plotted by tuneRF()](https://blogs.fu-berlin.de/reseda/files/2018/07/cla_006.png)

If you use `tuneRF()`, you automatically get a plot showing the OOB error as a function of the different *mtry* settings. As mentioned, the function automatically identifies the best *mtry* setting and uses it to generate the optimal RF.
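If you prefer to skip the tuning step and set *mtry* yourself, you can call `randomForest()` directly with the same stratified down-sampling arguments. This is only a sketch, not part of the workflow above; the object name `rfmodel.manual` and the fixed `mtry = 2` (the value `tuneRF()` found here) are our own choices:

```r
# direct training without tuning: mtry is fixed manually
rfmodel.manual <- randomForest(x = smp[-ncol(smp)],
                               y = smp$cl,
                               sampsize = smp.size,    # down-sampling as before
                               strata = smp$cl,        # stratified by class
                               ntree = 250,
                               mtry = 2,               # value found by tuneRF()
                               importance = TRUE
                               )
```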
When the model is created, we get some really useful information by executing the object name: the call with which we trained the model, the final number of variables tried at each split, i.e., the *mtry* parameter, an averaged out-of-bag (OOB) error estimate, and a complete confusion matrix based on the training data! The column headers contain the classes of the training pixels and the rows describe the corresponding classification. For a more detailed description of a confusion matrix, please refer to the chapter [Validate Classifiers](https://blogs.fu-berlin.de/reseda/validate-classifiers/).

```r
rfmodel
## 
## Call:
##  randomForest(x = smp[-ncol(smp)], y = smp$cl, ntree = 300, mtry = mtry.opt,      
##               strata = smp$cl, sampsize = smpsize, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.52%
## Confusion matrix:
##           baresoil forest grassland urban_hd urban_ld water class.error
## baresoil       708      0         2        3        6     0 0.015299026
## forest           0   2060         2        0       11     1 0.006750241
## grassland        4      0      1220        0        2     0 0.004893964
## urban_hd        18      0         0     1204       62     0 0.062305296
## urban_ld         7     10         9       39      904     0 0.067079463
## water            0      0         0        1        0   762 0.001310616
```

Before we started the training procedure, we set the argument `importance = TRUE`, which now allows us to have a look at the variable importance using the `varImpPlot()` command:

```r
varImpPlot(rfmodel)
```

![Variable importance plot of the RF model](https://blogs.fu-berlin.de/reseda/files/2018/07/cla_002.png)

The [variable importance](https://blogs.fu-berlin.de/reseda/random-forest/#VI) shows the most significant, or important, features for the classification, which are bands 5 and 6 in this case. For details on these metrics, please refer to the [Random Forest section](https://blogs.fu-berlin.de/reseda/random-forest/#5). It is interesting to note that the most important bands are also those showing the largest spectral differences between the classes [in the previous section](https://blogs.fu-berlin.de/reseda/prepare-samples/#plot).
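If you need the underlying numbers rather than the plot, e.g., for a table in a thesis, the `importance()` function returns the raw importance matrix. A minimal sketch:

```r
# raw variable importance: per-class and overall mean decrease in
# accuracy, plus mean decrease in Gini impurity
round(importance(rfmodel), 2)
```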
Additionally, you can plot the RF model itself, which shows the relationship between the [OOB error](https://blogs.fu-berlin.de/reseda/random-forest/#oob) and the number of trees used. We can color the lines by passing a vector of [hex colors](https://en.wikipedia.org/wiki/Web_colors) whose length equals the number of classes + 1; the first color is used for the average OOB line, which is also plotted automatically:

```r
plot(rfmodel, col = c("#000000", "#fbf793", "#006601", "#bfe578", "#d00000", "#fa6700", "#6569ff"))
```

![OOB error versus number of trees for the RF model](https://blogs.fu-berlin.de/reseda/files/2018/07/cla_003.png)

We can see a decrease in the error with an increasing number of trees; the urban classes have the highest OOB error values.
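The plot does not label the lines by itself. One way to add a legend is sketched below; it assumes the color order used above, which follows the column order of `rfmodel$err.rate` ("OOB" first, then the classes in the order of the factor levels):

```r
# same plot with solid lines and a legend; line colors follow the
# column order of the OOB error matrix rfmodel$err.rate
cols <- c("#000000", "#fbf793", "#006601", "#bfe578", "#d00000", "#fa6700", "#6569ff")
plot(rfmodel, col = cols, lty = 1)
legend("topright",
       legend = colnames(rfmodel$err.rate),
       col = cols,
       lty = 1,
       cex = 0.8
       )
```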
Save the model by using the `save()` function. It saves the model object `rfmodel` to your working directory, so that you have it permanently stored on your hard drive. If needed, you can load it at any time with `load()`.

```r
save(rfmodel, file = "rfmodel.RData")
#load("rfmodel.RData")
```

Since your model is now fully trained, you can use it to predict all the pixels in your image. The `predict()` function takes a lot of work off your hands: it recognizes that an image is passed, runs each image pixel individually through your Random Forest, just like the training pixels, and finally reassembles the results into your final classification image. Use the argument `filename =` to specify the name of your output map:

```r
result <- predict(img,
                  rfmodel,
                  filename = "classification.tif",
                  overwrite = TRUE
                  )
```

Random Forest classification completed!
If you store the result of the `predict()` function in an object, e.g., `result`, you can plot the map by passing this object to the standard plot command:

```r
plot(result,
     axes = FALSE,
     box = FALSE,
     col = c("#fbf793", # baresoil
             "#006601", # forest
             "#bfe578", # grassland
             "#d00000", # urban_hd
             "#fa6700", # urban_ld
             "#6569ff"  # water
             )
     )
```

![Final RF classification map](https://blogs.fu-berlin.de/reseda/files/2018/07/cla_004-1.png)
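Once the map is written, you may also want quick per-class statistics. A sketch using the raster package's `freq()` function; the conversion to hectares assumes the raster's map units are meters, as is the case for UTM-projected Landsat data:

```r
# pixel counts per class value in the classification map
f <- freq(result)

# convert pixel counts to hectares via the pixel area;
# res() returns the x and y resolution in map units (meters)
px.ha <- prod(res(result)) / 10000
data.frame(class = f[, "value"],
           hectares = f[, "count"] * px.ha)
```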
---

*NEXT: [SVM Classification](https://blogs.fu-berlin.de/reseda/svm-classification/)*