{"id":2614,"date":"2018-09-27T15:03:58","date_gmt":"2018-09-27T13:03:58","guid":{"rendered":"https:\/\/blogs.fu-berlin.de\/reseda\/?page_id=2614"},"modified":"2018-09-30T16:16:37","modified_gmt":"2018-09-30T14:16:37","slug":"prepare-samples-for-regression","status":"publish","type":"page","link":"https:\/\/blogs.fu-berlin.de\/reseda\/prepare-samples-for-regression\/","title":{"rendered":"Prepare Samples for Regression"},"content":{"rendered":"<p>We want to create training samples in the form of a data frame \u2013 which most regressors in R can handle easily. For this we will have to use a few GIS operations. Fortunately, these can also be done automatically in R, so that the entire workflow do not have to be performed manually again and again for other datasets. Finally, we will utilize the <a href=\"https:\/\/cran.r-project.org\/web\/packages\/raster\/raster.pdf\" rel=\"noopener\" target=\"_blank\">raster <\/a>and the <a href=\"https:\/\/cran.rstudio.com\/web\/packages\/rgeos\/rgeos.pdf\" rel=\"noopener\" target=\"_blank\">rgeos <\/a>packages for this. <\/p>\n<p>The complete code is shown below. The shapefile <em>reg_train_data.shp<\/em> and the imagery <em>LC081930232017060201T1-SC20180613160412_subset.tif<\/em> can be downloaded <a href=\"##\" rel=\"noopener\" target=\"_blank\">here<\/a>.<br \/>\nHowever, we strongly advise you to have a look at the detailed description in the following.<\/p>\n<pre class=\"theme:amityreseda\">\r\nlibrary(raster)\r\nlibrary(rgeos)\r\n\r\n# import low resolution image and digitized polygon training areas\r\nsetwd(\"\/media\/sf_exchange\/regression\/\")\r\nimg &lt;- brick(&quot;LC081930232017060201T1-SC20180613160412_subset.tif&quot;)\r\nshp &lt;- shapefile(&quot;reg_train_data.shp&quot;)\r\n\r\n# rename bands if needed\r\nnames(img) &lt;- c(&quot;b1&quot;, &quot;b2&quot;, &quot;b3&quot;, &quot;b4&quot;, &quot;b5&quot;, &quot;b6&quot;, &quot;b7&quot;)\r\n# create buffer with no width to avoid topology problems\r\nshp &lt;- gBuffer(shp, byid=TRUE, width=0)\r\n\r\n# crop image to extent of training polygons\r\nimg.subset &lt;- crop(img, shp)\r\n# extract pixel completely covered by training polygons\r\nimg.mask &lt;- rasterize(shp, img.subset, getCover = TRUE)\r\nimg.mask[img.mask &lt; 100] &lt;- NA\r\nimg.subset &lt;- mask(img.subset, img.mask)\r\n# transform those pixels to polygons\r\ngrid &lt;- rasterToPolygons(img.subset)\r\n# create a new ID attribute for addressing\r\ngrid$ID &lt;- seq.int(nrow(grid))\r\n\r\n# create new dataframe for final samples\r\nsmp &lt;- data.frame()\r\n# do for each cell in grid:\r\nfor (i in 1:length(grid)) {\r\n  # intersect grid and shapefile for cell i\r\n  cell &lt;- intersect(grid[i, ], shp)\r\n  # extract all target class polygons\r\n  cell &lt;- cell[cell$Class_name == &quot;building&quot; | cell$Class_name == &quot;impervious&quot;, ]\r\n  # if target class is in cell: calculate area, otherwise area is zero\r\n  if (length(cell) &gt; 0) {\r\n    areaPercent &lt;- sum(area(cell) \/ area(grid)[1])\r\n  } else {\r\n    areaPercent &lt;- 0\r\n  }\r\n  # create new sample and append to training samples\r\n  newsmp &lt;- cbind(grid@data[i, 1:nlayers(img)], areaPercent)\r\n  smp &lt;- rbind(smp, newsmp)\r\n}\r\n<\/pre>\n<p><\/br><a name=\"1\"><\/a><\/p>\n<h1>In-depth Guide<\/h1>\n<p>In order to use functionalities of the raster and rgeos packages, load them into your current session via <span class=\"crayon-inline theme:amityreseda\">library()<\/span>. If you do not use our VM, you must first download and install the packages with <span class=\"crayon-inline theme:amityreseda\">install.packages()<\/span>:<\/p>\n<pre class=\"theme:amityreseda\">\r\n#install.packages(\"raster\")\r\n#install.packages(\"rgeos\")\r\nlibrary(raster)\r\nlibrary(rgeos)\r\n<\/pre>\n<p>Next: set your working directory, in which all your image and shapefile data is stored by giving a character (do not forget the quotation marks <span class=\"crayon-inline theme:amityreseda\">&#8221; &#8220;<\/span>) variable to <span class=\"crayon-inline theme:amityreseda\">setwd()<\/span>. Check your path with <span class=\"crayon-inline theme:amityreseda\">getwd()<\/span> and the stored files in it via <span class=\"crayon-inline theme:amityreseda\">dir()<\/span>:<\/p>\n<pre class=\"theme:amityreseda\">\r\nsetwd(\"\/media\/sf_exchange\/regression\/\")\r\n \r\ngetwd()\r\n## [1] \"\/media\/sf_exchange\/regression\"\r\n\r\ndir()  \r\n##  [1] \"LC081930232017060201T1-SC20180613160412_subset.tif\"                                                \r\n##  [2] \"reg_train_data.dbf\"     \r\n##  [3] \"reg_train_data.prj\"\r\n##  [4] \"reg_train_data.qpj\"\r\n##  [5] \"reg_train_data.shp\" \r\n##  [6] \"reg_train_data.shx\"   \r\n<\/pre>\n<p>If you do not get your files listed, there is an error in your working path &#8211; check again! Everything ready to go? Fine, then import your raster file as <span class=\"crayon-inline theme:amityreseda\">img<\/span> and your shapefile as <span class=\"crayon-inline theme:amityreseda\">shp<\/span> and have a look at them:<\/p>\n<pre class=\"theme:amityreseda\">\r\nimg &lt;- brick(&quot;LC081930232017060201T1-SC20180613160412_subset.tif&quot;)\r\nimg\r\n## class       : RasterBrick \r\n## dimensions  : 504, 1030, 519120, 7  (nrow, ncol, ncell, nlayers)\r\n## resolution  : 30, 30  (x, y)\r\n## extent      : 369795, 400695, 5812395, 5827515  (xmin, xmax, ymin, ymax)\r\n## coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 \r\n## data source : \/media\/sf_exchange\/regression\/LC081930232017060201T1-SC20180613160412_subset.tif \r\n## names       : LC0819302\/\/2_subset.1, LC0819302\/\/2_subset.2, LC0819302\/\/2_subset.3, LC0819302\/\/2_subset.4, LC0819302\/\/2_subset.5, LC0819302\/\/2_subset.6, LC0819302\/\/2_subset.7 \r\n## min values  :                 -2000,                 -2000,                 -1502,                 -1384,                  -107,                   -72,                   -27 \r\n## max values  :                 13240,                 13539,                 14574,                 14549,                 14243,                 14896,                 15388\r\n\r\nshp &lt;- shapefile(&quot;reg_train_data.shp&quot;)\r\nshp\r\n## class       : SpatialPolygonsDataFrame \r\n## features    : 286 \r\n## extent      : 390233.4, 390698.4, 5817147, 5817720  (xmin, xmax, ymin, ymax)\r\n## coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 \r\n## variables   : 1\r\n## names       : Class_name \r\n## min values  :  bare_soil \r\n## max values  :      water\r\n\r\ncompareCRS(shp, img)\r\n## [1] TRUE\r\n<\/pre>\n<p>Both the <span class=\"crayon-inline theme:amityreseda\">brick()<\/span> and <span class=\"crayon-inline theme:amityreseda\">shapefile()<\/span> functions are provided by the raster package. As shown above, they create objects of the class RasterBrick and SpatialPolygonsDataFrame respectively. The L8 raster provides 7 bands, and our example shapefile 286 features, i.e., polygons. You can check whether the projections of the two datasets are identical or not by executing <span class=\"crayon-inline theme:amityreseda\">compareCRS(shp, img)<\/span>. If this is not the case (output equals <em>FALSE<\/em>), the raster package will automatically re-project your data on the fly later on. However, we recommend to adjust the projections manually in advance to prevent any future inaccuracies.<br \/>\nPlot your data to make sure everything is imported properly (check <a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/visualize-in-r\/\">Visualize in R<\/a> for an intro to plotting):<\/p>\n<pre class=\"theme:amityreseda\">\r\nplot(shp)\r\n<\/pre>\n<p><a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_009.png\"><img decoding=\"async\" src=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_009.png\" alt=\"\" width=\"250\" class=\"aligncenter size-full wp-image-2657\" srcset=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_009.png 768w, https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_009-240x300.png 240w\" sizes=\"(max-width: 768px) 85vw, 768px\" \/><\/a><\/p>\n<p>With the argument <span class=\"crayon-inline theme:amityreseda\">add = TRUE<\/span> in line 2 several data layers can be displayed one above the other:<\/p>\n<pre class=\"theme:amityreseda\">\r\nplot(img[[4]], col=gray.colors(100))\r\nplot(shp, add=TRUE)\r\n<\/pre>\n<p><a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003.png\"><img decoding=\"async\" src=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003.png\" alt=\"\" width=\"600\" class=\"aligncenter size-full wp-image-2639\" srcset=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003.png 1237w, https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003-300x164.png 300w, https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003-768x421.png 768w, https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003-1024x561.png 1024w, https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_003-1200x658.png 1200w\" sizes=\"(max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/a><\/p>\n<p>In Line 1, initially only the fourth channel of the image <span class=\"crayon-inline theme:amityreseda\">img[[4]]<\/span> was displayed with a gray color representation using 100 shades <span class=\"crayon-inline theme:amityreseda\">col=gray.colors(100)<\/span>. The black spots in the upper plot represent the polygons of the shapefile. It becomes clear that they cover only a very small part of our study area.<\/p>\n<p><strong>Optional<\/strong>: Let us take a look at the naming of the raster bands via the <span class=\"crayon-inline theme:amityreseda\">names()<\/span> function. Those names can be quite bulky and cause problems in some illustrations when used as axis labels. You can easily rename it to something more concise by overriding the names with any string vector of the same length:<\/p>\n<pre class=\"theme:amityreseda\">\r\nnames(img)\r\n## [1] \"LC081930232017060201T1.SC20180613160412_subset.1\"\r\n## [2] \"LC081930232017060201T1.SC20180613160412_subset.2\"\r\n## [3] \"LC081930232017060201T1.SC20180613160412_subset.3\"\r\n## [4] \"LC081930232017060201T1.SC20180613160412_subset.4\"\r\n## [5] \"LC081930232017060201T1.SC20180613160412_subset.5\"\r\n## [6] \"LC081930232017060201T1.SC20180613160412_subset.6\"\r\n## [7] \"LC081930232017060201T1.SC20180613160412_subset.7\"\r\n \r\nnames(img) &lt;- c(&quot;b1&quot;, &quot;b2&quot;, &quot;b3&quot;, &quot;b4&quot;, &quot;b5&quot;, &quot;b6&quot;, &quot;b7&quot;)\r\n \r\nnames(img)\r\n## [1] &quot;b1&quot; &quot;b2&quot; &quot;b3&quot; &quot;b4&quot; &quot;b5&quot; &quot;b6&quot; &quot;b7&quot;\r\n<\/pre>\n<p>In order to accelerate all computation processes in the following, we first trim the image to the extent of the shapefile using the <span class=\"crayon-inline theme:amityreseda\">crop()<\/span> function. It is often recommended to use a buffer of zero width wrapped around the polygons to avoid any topological errors in the upcoming process using <span class=\"crayon-inline theme:amityreseda\">gbuffer()<\/span> as shown in line 1:<\/p>\n<pre class=\"theme:amityreseda\">\r\nshp &lt;- gBuffer(shp, byid=TRUE, width=0)\r\n\r\nimg.subset &lt;- crop(img, shp)\r\n<\/pre>\n<p>Now we just want to extract the pixels that are 100% covered by the polygons. Full coverage is important to be able to extract reliable predictions about the ratio of an target class. In order to assess this coverage, we use the <span class=\"crayon-inline theme:amityreseda\">rasterize()<\/span> function with the argument <span class=\"crayon-inline theme:amityreseda\">getCover = TRUE<\/span> to generate a mask. In line 2, we set all pixels which have less than 100% coverage to <span class=\"crayon-inline theme:amityreseda\">NA<\/span>. In the end, we use the mask to subset our image data again using the <span class=\"crayon-inline theme:amityreseda\">mask()<\/span> function:<\/p>\n<pre class=\"theme:amityreseda\">\r\nimg.mask &lt;- rasterize(shp, img.subset, getCover = TRUE)\r\nimg.mask[img.mask &lt; 100] &lt;- NA\r\nimg.subset &lt;- mask(img.subset, img.mask)\r\n<\/pre>\n<p>Plot the image data again with the polygons in order to see our achievements:<\/p>\n<pre class=\"theme:amityreseda\">\r\nplot(shp)\r\nplot(img.subset[[4]], add=T)\r\n<\/pre>\n<p><a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_008.png\"><img decoding=\"async\" src=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_008.png\" alt=\"\" width=\"250\" class=\"aligncenter size-full wp-image-2641\" \/><\/a><\/p>\n<p>Only the raster cells within our polygons are left &#8211; great!<br \/>\nIn a next step, we transform those pixels to polygons, forming a grid with the information of the Landsat 8 bands maintained by using the <span class=\"crayon-inline theme:amityreseda\"> rasterToPolygons()<\/span> function. In addition, we want to assign each of these cells in the grid with an ID in order to address them (line 2):<\/p>\n<pre class=\"theme:amityreseda\">\r\ngrid &lt;- rasterToPolygons(img.subset)\r\ngrid$ID &lt;- seq.int(nrow(grid))\r\n\r\ngrid \r\n## class       : SpatialPolygonsDataFrame \r\n## features    : 93 \r\n## extent      : 390255, 390675, 5817165, 5817705  (xmin, xmax, ymin, ymax)\r\n## coord. ref. : +proj=utm +zone=33 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 \r\n## variables   : 8\r\n## names       :   b1,   b2,   b3,   b4,   b5,   b6,   b7, ID \r\n## min values  :  144,  198,  339,  279,  678,  563,  398,  1 \r\n## max values  : 1175, 1245, 1452, 1422, 3686, 2382, 2193, 93\r\n\r\nplot(grid)\r\n<\/pre>\n<p><a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_006.png\"><img decoding=\"async\" src=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_006.png\" alt=\"\" width=\"250\" class=\"aligncenter size-full wp-image-2642\" srcset=\"https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_006.png 768w, https:\/\/blogs.fu-berlin.de\/reseda\/files\/2018\/09\/reg_006-240x300.png 240w\" sizes=\"(max-width: 768px) 85vw, 768px\" \/><\/a><\/p>\n<p>Finally, we have generated 93 cells (polygons) in this <span class=\"crayon-inline theme:amityreseda\">grid<\/span> for which we now want to make a statement of how much of their areas is impervious, i.e., all polygons that have either <em>building <\/em> or <em>impervious<\/em> as <span class=\"crayon-inline theme:amityreseda\">Class_name<\/span> attribute.<br \/>\nFor this, we generate a new, empty data frame in which we write the training samples one after the other in line 1. We use a for-loop to iterate over each cell of the <span class=\"crayon-inline theme:amityreseda\">grid<\/span> in line 2:<\/p>\n<pre class=\"theme:amityreseda\">\r\nsmp &lt;- data.frame()\r\nfor (i in 1:length(grid)) {\r\n  cell &lt;- intersect(grid[i, ], shp)\r\n  cell &lt;- cell[cell$Class_name == &quot;building&quot; | cell$Class_name == &quot;impervious&quot;, ]\r\n  if (length(cell) &gt; 0) {\r\n    areaPercent &lt;- sum(area(cell) \/ area(grid)[1])\r\n  } else {\r\n    areaPercent &lt;- 0\r\n  }\r\n  newsmp &lt;- cbind(grid@data[i, 1:nlayers(img)], areaPercent)\r\n  smp &lt;- rbind(smp, newsmp)\r\n}\r\n\r\nhead(smp)\r\n##      b1   b2   b3   b4   b5   b6  b7  areaPercent\r\n## 2  1175 1245 1452 1422 1742 1321 950 9.765630e-01\r\n## 3   497  597  736  767 1335 1175 947 7.986431e-01\r\n## 5   277  330  477  462 1221  960 743 3.669043e-01\r\n## 4   265  310  444  428 1077  789 545 1.727224e-01\r\n## 41  230  267  390  343  809  613 404 9.765625e-06\r\n## 42  168  217  339  312  678  563 398 3.540961e-02\r\n<\/pre>\n<p><strong>Explanation<\/strong>:<br \/>\nLine 3: we select the ith cell of the grid and intersect this cell with the <span class=\"crayon-inline theme:amityreseda\">shp<\/span>, which gives us the geometries and attributes of both the shapefile and the cell.<br \/>\nLine 4: only the polygons belonging to the target classes were extracted, we discard all other polygons.<br \/>\nLine 5: if a polygon of a target class exists in the cell, do everything inside the curly brackets. If there is no polygon in the cell, jump to line 7.<br \/>\nLine 6: calculate the proportion of the area of each polygon and sum the proportions (<span class=\"crayon-inline theme:amityreseda\">area(grid)[1]<\/span> equals 900 m\u00b2 in our example due to the geometric resolution).<br \/>\nLine 7: if there is no polygon of class <em>building <\/em>or <em>impervious<\/em>, set <span class=\"crayon-inline theme:amityreseda\">areaPercent<\/span> to zero.<br \/>\nLine 10: use the column bind (<span class=\"crayon-inline theme:amityreseda\">cbind()<\/span>) function to combine the feature values (reflectances of Landsat image) and the <span class=\"crayon-inline theme:amityreseda\">areaPercent<\/span> value to form a new sample <span class=\"crayon-inline theme:amityreseda\">newsmp<\/span>.<br \/>\nLine 10: use row bind (<span class=\"crayon-inline theme:amityreseda\">rbind()<\/span>) to add the new sample <span class=\"crayon-inline theme:amityreseda\">newsmp<\/span> to our training dataset <span class=\"crayon-inline theme:amityreseda\">smp<\/span>.<\/p>\n<p>We now prepared a training data set <span class=\"crayon-inline theme:amityreseda\">smp<\/span> with all input features (Landsat 8 bands 1 to 7) and the percentage value of our target class (<em>imperviousness<\/em>). <\/p>\n<p>Time to train a Support Vector Machine <a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/svm-regression\/\">in the next section<\/a>!<\/p>\n<p><\/br><\/br><\/p>\n<hr style=\"height:4px;background-color:#6b9e1f\">\n<a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/svm-regression\/\"><br \/>\n<button style=\"width:100%;text-align:right;padding: 10 0;background-color:white;margin:-55px 0 0 0\"><\/p>\n<div style=\"font-family: 'Noto Sans',sans-serif;line-height: 1.2\">\n<span style=\"font-size: 12px;color:#bfbfbf\"><strong><em>NEXT<\/em><\/strong><\/span><br \/>\n<span style=\"font-size: 30px;color:#6b9e1f\"><strong><em>SVM Regression<\/em><\/strong><\/span>\n<\/div>\n<p><\/button><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We want to create training samples in the form of a data frame \u2013 which most regressors in R can handle easily. For this we will have to use a few GIS operations. Fortunately, these can also be done automatically in R, so that the entire workflow do not have to be performed manually again &hellip; <a href=\"https:\/\/blogs.fu-berlin.de\/reseda\/prepare-samples-for-regression\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Prepare Samples for Regression&#8221;<\/span><\/a><\/p>\n","protected":false},"author":3237,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-2614","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages\/2614","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/users\/3237"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/comments?post=2614"}],"version-history":[{"count":29,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages\/2614\/revisions"}],"predecessor-version":[{"id":2684,"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/pages\/2614\/revisions\/2684"}],"wp:attachment":[{"href":"https:\/\/blogs.fu-berlin.de\/reseda\/wp-json\/wp\/v2\/media?parent=2614"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}