{"id":2155,"date":"2018-07-19T17:25:51","date_gmt":"2018-07-19T15:25:51","guid":{"rendered":"https:\/\/blogs.fu-berlin.de\/reseda\/?page_id=2155"},"modified":"2018-09-27T15:17:20","modified_gmt":"2018-09-27T13:17:20","slug":"learning-curve","status":"publish","type":"page","link":"https:\/\/blogs.fu-berlin.de\/reseda\/learning-curve\/","title":{"rendered":"Learning Curve"},"content":{"rendered":"<p>One question that comes up again and again is how much training data is needed for a supervised classification. It is not easy to answer, because every classification and regression problem is different.<br \/>\nHowever, there is a well-suited analysis method for assessing this empirically: creating a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Learning_curve\" rel=\"noopener\" target=\"_blank\">learning curve<\/a>:<br \/>\nSimply repeat your training procedure several times with different training sample sizes and compare the resulting performances!<br \/>\nPlotting those results will (hopefully) show a decreasing error rate with increasing amounts of training data.<\/p>\n<p>In the following, a complete script is developed, which performs the <a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/rf-classification\/\" rel=\"noopener\" target=\"_blank\">RF classification from the previous section<\/a> with a maximum of \\(\\{2, 4, 8, 16, &#8230;, 2048\\}\\) samples per class (<span class=\"crayon-inline theme:amityreseda\">steps<\/span>). In order to minimize random effects and derive a confidence interval, each classification is repeated 20 times (<span class=\"crayon-inline theme:amityreseda\">nrepeat<\/span>).<\/p>\n<p>To keep the code compact, we use for loops to iterate over the individual sample sizes and repetitions and write the respective results into a matrix <span class=\"crayon-inline theme:amityreseda\">r<\/span>. The following code is comprehensively documented. 
<\/p>\n<pre class=\"theme:amityreseda\">\r\n# import libraries\r\nlibrary(raster)\r\nlibrary(randomForest)\r\n\r\n# import image (img) and shapefile (shp)\r\nsetwd(\"\/media\/sf_exchange\/landsatdata\/\")\r\nimg &lt;- brick(&quot;LC081930232017060201T1-SC20180613160412_subset.tif&quot;)\r\nshp &lt;- shapefile(&quot;training_data.shp&quot;)\r\n\r\n# extract samples with class labels and put them all together in a dataframe\r\nnames(img) &lt;- c(&quot;b1&quot;, &quot;b2&quot;, &quot;b3&quot;, &quot;b4&quot;, &quot;b5&quot;, &quot;b6&quot;, &quot;b7&quot;)\r\nsmp &lt;- extract(img, shp, df = TRUE)\r\nsmp$cl &lt;- as.factor(shp$classes[match(smp$ID, seq(nrow(shp)))])\r\n# drop the ID column\r\nsmp &lt;- smp[-1]\r\n\r\n# number of samples per class chosen for each run\r\nsteps &lt;- 2^(1:11)\r\n# number of repetitions of each run\r\nnrepeat &lt;- 20\r\n\r\n# create empty matrix for final results (5 statistics x number of steps)\r\nr &lt;- matrix(0, 5, length(steps))\r\n\r\nfor (i in 1:length(steps)) {\r\n  # create empty vector for OOB error from each repetition\r\n  rtmp &lt;- rep(0, nrepeat)\r\n  \r\n  for (j in 1:nrepeat) {\r\n    # shuffle all samples and keep at most steps[i] samples per class\r\n    sub &lt;- smp[sample(nrow(smp)), ]\r\n    sub &lt;- sub[ave(1:(nrow(sub)), sub$cl, FUN = seq) &lt;= steps[i], ]\r\n    \r\n    # RF classification as usual: stratified, with balanced sample sizes\r\n    sub.size &lt;- rep(min(summary(sub$cl)), nlevels(sub$cl))\r\n    rfmodel &lt;- tuneRF(x = sub[-ncol(sub)],\r\n                      y = sub$cl,\r\n                      sampsize = sub.size,\r\n                      strata = sub$cl,\r\n                      ntree = 250,\r\n                      doBest = TRUE,\r\n                      plot = FALSE\r\n                      )\r\n    # extract OOB error rate of the full forest (all trees grown) and save it to rtmp\r\n    ooberrors &lt;- rfmodel$err.rate[ , 1]\r\n    rtmp[j] &lt;- ooberrors[length(ooberrors)]\r\n  }\r\n  \r\n  # use repetitions to calc statistics (mean, min, max, CI) &amp; save them to the final results matrix\r\n  ttest &lt;- 
t.test(rtmp, conf.level = 0.95)\r\n  # rows of r: mean, CI upper, CI lower, max, min\r\n  r[ , i] &lt;- c(mean(rtmp), ttest$conf.int[2], ttest$conf.int[1], max(rtmp), min(rtmp))\r\n}\r\n<\/pre>\n<p>Once all experiments have been processed, we can create a plot that summarizes the findings:<\/p>\n<pre class=\"theme:amityreseda\">\r\n# convert OOB error rates to percent\r\nr &lt;- r * 100\r\n# initialize an empty plot without x-axis\r\nplot(x = 1:length(steps),\r\n     y = r[1,],\r\n     ylim = c(min(r), max(r)),\r\n     type = &quot;n&quot;,\r\n     xaxt = &quot;n&quot;,\r\n     xlab = &quot;number of samples per class&quot;,\r\n     ylab = &quot;OOB error [%]&quot;\r\n     )\r\n# complete the x-axis \r\naxis(1, at = 1:length(steps), labels = steps)\r\n# add a grid\r\ngrid()\r\n# draw min-max range of OOB errors\r\npolygon(c(1:length(steps), rev(1:length(steps))), \r\n        c(r[5, ], rev(r[4, ])),\r\n        col = &quot;grey80&quot;, \r\n        border = FALSE\r\n        )\r\n# draw confidence interval 95%\r\npolygon(c(1:length(steps), rev(1:length(steps))), \r\n        c(r[3, ], rev(r[2, ])),\r\n        col = &quot;grey60&quot;, \r\n        border = FALSE\r\n        )\r\n# draw line of mean OOB \r\nlines(1:length(steps), r[1, ], lwd = 3)\r\n# add a legend (colors match the polygons drawn above)\r\nlegend(&quot;topright&quot;,\r\n       c(&quot;mean&quot;, &quot;t-test CI 95%&quot;, &quot;min-max range&quot;),\r\n       col = c(&quot;black&quot;, &quot;grey60&quot;, &quot;grey80&quot;),\r\n       lwd = 3,\r\n       bty = &quot;n&quot;\r\n       )\r\n<\/pre>\n<p><a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/07\/cla_021.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/07\/cla_021.png\" alt=\"\" width=\"1253\" height=\"761\" class=\"aligncenter size-full wp-image-2164\" \/><\/a><\/p>\n<p>Conclusion for this example: beyond 512 samples per class, the improvement achieved is negligible. 
So 512 samples per class should be enough to get a robust classification.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One question that is asked again and again is the amount of training data needed for a supervised classification. This question is not easy to answer, because every classification and regression problem is different. However, there is an ideal analysis method to assess this problem for certain: creating a learning curve: Simply repeat your training &hellip; <a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/learning-curve\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Learning 
Curve&#8221;<\/span><\/a><\/p>\n","protected":false},"author":3237,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2155","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages\/2155","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/users\/3237"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/comments?post=2155"}],"version-history":[{"count":16,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages\/2155\/revisions"}],"predecessor-version":[{"id":2626,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages\/2155\/revisions\/2626"}],"wp:attachment":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/media?parent=2155"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}