7. Lists
A list is a special form of a vector that can hold elements of different classes at the same time. It thus serves as a container for other objects, such as numbers, strings, vectors or matrices. A list is created with the function list(). Existing lists can be given element names via the names() function, so that the elements can later be indexed by these names:
l <- list(13L, "Hello", matrix(1:6, 2, 3))
l
## [[1]]
## [1] 13
##
## [[2]]
## [1] "Hello"
##
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

names(l) <- c("my.integer", "my.string", "my.matrix")
l
## $my.integer
## [1] 13
##
## $my.string
## [1] "Hello"
##
## $my.matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

str(l)
## List of 3
##  $ my.integer: int 13
##  $ my.string : chr "Hello"
##  $ my.matrix : int [1:2, 1:3] 1 2 3 4 5 6
Alternatively, the element names can be assigned directly when the list is created:
l <- list("my.integer" = 13L,
          "my.string"  = "Hello",
          "my.matrix"  = matrix(1:6, 2, 3))
Indexing in lists
Using the index number or the assigned element name (if available), double square brackets [[]] give access to the contents of a list element. Single square brackets [] return only a part of the list, which still belongs to class list:
l[1]                # first part of the list
## $my.integer
## [1] 13

class(l[1])
## [1] "list"

l[[1]]              # extract first element (integer value)
## [1] 13

class(l[[1]])
## [1] "integer"

l[["my.string"]]    # extract element by its name
## [1] "Hello"

l[[3]]              # extract third element (matrix)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
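As a small supplementary sketch (the list l is recreated here so that the snippet runs on its own; the helper names sub and val are made up for illustration), the difference between single and double brackets can also be checked programmatically. The $ operator shown at the end is a shorthand for [[ ]] with a name:

```r
# Recreate the example list so the snippet is self-contained
l <- list(my.integer = 13L,
          my.string  = "Hello",
          my.matrix  = matrix(1:6, 2, 3))

sub <- l[c(1, 2)]           # single brackets: still a list, here with two elements
is.list(sub)                # TRUE
length(sub)                 # 2

val <- l[["my.integer"]]    # double brackets: the stored object itself
is.list(val)                # FALSE
val + 1L                    # arithmetic works on the extracted integer

l$my.string                 # `$` is shorthand for l[["my.string"]]
l[["my.matrix"]][2, 3]      # indexing can be chained: row 2, column 3 of the matrix
```

Single brackets are the right choice when several elements should be selected at once; double brackets (or $) are needed whenever the element itself is to be used in a computation.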
Modify lists
Lists can be expanded (assign a value to a new index number or a new element name), elements can be deleted (assign NULL), and individual list elements can be overwritten (reassign an existing index or name):
l["my.numeric"] <- 45.7325    # add new element to list
l[1] <- NULL                  # delete first element in list
l["my.string"] <- "World"     # overwrite existing element
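The effect of these three operations can be followed step by step with names(). The following is only a sketch on a shortened version of the example list, not part of the original example:

```r
# Shortened example list, recreated so the snippet is self-contained
l <- list(my.integer = 13L, my.string = "Hello")

l[["my.numeric"]] <- 45.7325   # new name: the element is appended
names(l)                       # "my.integer" "my.string" "my.numeric"

l[["my.integer"]] <- NULL      # assigning NULL removes the element
names(l)                       # "my.string" "my.numeric"

l[["my.string"]] <- "World"    # existing name: the element is overwritten
l[["my.string"]]               # "World"
length(l)                      # 2
```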
8. Data frames
The data frame is the most commonly used data type when working with datasets and allows you to manage two-dimensional tabular data. How does it differ from a matrix? While a matrix can only contain elements of one class, several classes can coexist in one data frame. Internally, a data frame is essentially a list of columns, where each column is a vector of the same length.
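A minimal sketch of this difference (the objects m and d are made up for illustration):

```r
# A matrix holds a single type: mixing strings and numbers coerces everything to character
m <- cbind(c("Ben", "Hanna"), c(185, 166))
typeof(m)            # "character"

# A data frame keeps a separate class for each column
d <- data.frame(name = c("Ben", "Hanna"), size = c(185, 166))
sapply(d, class)     # one class per column; size stays numeric
is.list(d)           # TRUE: a data frame is a list of columns
length(d)            # 2, the number of columns
```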
When external tabular data is read into R, for example with read.table() or read.csv(), the result is a data frame.
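As a hedged illustration, the following sketch reads a small CSV snippet from a string via the text argument instead of from a file, so that it runs without any external data; the variable names csv_text and people are assumptions made for this example:

```r
# Inline CSV content standing in for an external file
csv_text <- "name,size,weight
Ben,185,110
Hanna,166,60"

people <- read.csv(text = csv_text)   # same function as for files, but reading from a string
class(people)                         # "data.frame"
dim(people)                           # 2 rows, 3 columns
names(people)                         # "name" "size" "weight"
```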
df <- data.frame(
  "name"   = c("Ben", "Hanna", "Paul", "Arthur"),
  "size"   = c(185, 166, 175, 190),
  "weight" = c(110, 60, 76, 89)
)
df
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 3   Paul  175     76
## 4 Arthur  190     89

length(df)    # number of columns (variables)
## [1] 3

dim(df)       # dimensions (4 rows, 3 columns)
## [1] 4 3

nrow(df)      # number of rows (observations)
## [1] 4

ncol(df)      # number of columns (variables)
## [1] 3

str(df)       # shows structure of df
## 'data.frame':    4 obs. of  3 variables:
##  $ name  : Factor w/ 4 levels "Arthur","Ben",..: 2 3 4 1
##  $ size  : num  185 166 175 190
##  $ weight: num  110 60 76 89

summary(df)   # statistical summary
##      name        size           weight
##  Arthur:1   Min.   :166.0   Min.   : 60.00
##  Ben   :1   1st Qu.:172.8   1st Qu.: 72.00
##  Hanna :1   Median :180.0   Median : 82.50
##  Paul  :1   Mean   :179.0   Mean   : 83.75
##             3rd Qu.:186.2   3rd Qu.: 94.25
##             Max.   :190.0   Max.   :110.00
The output of the function str() is particularly interesting. It first shows that we have 4 observations (obs., i.e. "Ben", "Hanna", "Paul", "Arthur") with 3 variables each (variables, i.e. "name", "size", "weight"). For each variable it also reports whether it is numeric (num) or categorical (Factor); for factors, the number of distinct values is displayed as well (w/ 4 levels). Note that the factor shown here reflects the default of R versions before 4.0.0; since R 4.0.0, data.frame() keeps strings as character vectors (chr) unless stringsAsFactors = TRUE is set. Even more useful is the statistical summary of each column of the data frame via the function summary().
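Whether a string column ends up as a factor or as a character vector can be controlled explicitly with the stringsAsFactors argument of data.frame(). A short sketch (the object names people_chr and people_fct are made up for illustration):

```r
# Same data, once with strings kept as character and once converted to a factor
people_chr <- data.frame(name = c("Ben", "Hanna"), stringsAsFactors = FALSE)
people_fct <- data.frame(name = c("Ben", "Hanna"), stringsAsFactors = TRUE)

class(people_chr$name)     # "character"
class(people_fct$name)     # "factor"
levels(people_fct$name)    # "Ben" "Hanna" (the distinct values of the factor)
```

Setting the argument explicitly makes a script behave the same way regardless of the R version it runs on.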
Indexing in data frames
In a data frame, columns can be addressed either with double square brackets [[]] and an index number, or directly by the column name (if available) using the dollar sign $. In addition, rows and columns can be addressed in the same way as in a matrix, using single square brackets []:
df
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 3   Paul  175     76
## 4 Arthur  190     89

df[2]       # output column 2 as data frame
##   size
## 1  185
## 2  166
## 3  175
## 4  190

df[[2]]     # output as numeric
## [1] 185 166 175 190

df$size     # output as numeric
## [1] 185 166 175 190

df[ , 2]    # column output as numeric
## [1] 185 166 175 190

df[1, ]     # row output as data frame
##   name size weight
## 1  Ben  185    110

df[1, 2]    # element in row 1, col 2 as numeric
## [1] 185
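Columns can also be addressed by name inside single square brackets. The sketch below recreates the example data frame so that it runs on its own, and additionally shows the drop = FALSE argument, which keeps a single selected column as a data frame instead of simplifying it to a vector:

```r
# Example data frame, recreated so the snippet is self-contained
df <- data.frame(
  name   = c("Ben", "Hanna", "Paul", "Arthur"),
  size   = c(185, 166, 175, 190),
  weight = c(110, 60, 76, 89)
)

df[, "size"]                        # column by name: simplified to a numeric vector
df[, "size", drop = FALSE]          # same column, but kept as a one-column data frame
df[c(1, 4), c("name", "weight")]    # rows 1 and 4, columns "name" and "weight"
```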
Various queries are also possible, for which we use comparison and logical operators:
df
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 3   Paul  175     76
## 4 Arthur  190     89

df$size > 170
## [1]  TRUE FALSE  TRUE  TRUE

df[df$size > 170, ]
##     name size weight
## 1    Ben  185    110
## 3   Paul  175     76
## 4 Arthur  190     89

df[df$size > 180 & df$weight < 100, ]          # AND condition
##     name size weight
## 4 Arthur  190     89

df[df$size > 188 | df$weight < 70, ]           # OR condition
##     name size weight
## 2  Hanna  166     60
## 4 Arthur  190     89

df[df$name == "Ben" | df$name == "Hanna", ]    # OR condition
##    name size weight
## 1   Ben  185    110
## 2 Hanna  166     60
Explanation: The comparison df$size > 170 returns a logical vector that contains TRUE for every observation whose size is greater than 170. In df[df$size > 170, ] we use this vector as a row index, which outputs all observations marked TRUE. When chaining conditions, either both conditions must be fulfilled at the same time (AND, &) or at least one of them (OR, |).
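A few related tools are sketched below; they are additions for illustration and not part of the original example. The logical vector can be stored and inspected, %in% shortens chains of == comparisons joined by |, and subset() allows the same queries without repeating df$:

```r
# Example data frame, recreated so the snippet is self-contained
df <- data.frame(
  name   = c("Ben", "Hanna", "Paul", "Arthur"),
  size   = c(185, 166, 175, 190),
  weight = c(110, 60, 76, 89)
)

tall <- df$size > 170    # logical vector, one entry per row
sum(tall)                # 3: TRUE counts as 1, so this is the number of matches
which(tall)              # 1 3 4: the row numbers of the matches

df[df$name %in% c("Ben", "Hanna"), ]     # same result as name == "Ben" | name == "Hanna"
subset(df, size > 180 & weight < 100)    # AND condition without df$ prefixes (Arthur)
```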
Modify data frames
It is often necessary to delete data from a data frame or to add entries later. For both tasks there are several possibilities in R. Below are a few simple approaches:
df2 <- df[ , -2]                                   # delete column by index
df2
##     name weight
## 1    Ben    110
## 2  Hanna     60
## 3   Paul     76
## 4 Arthur     89

df3 <- subset(df, select = -c(weight, size))       # delete columns by name
df3
##     name
## 1    Ben
## 2  Hanna
## 3   Paul
## 4 Arthur

df4 <- df[-3, ]                                    # delete row by index
df4
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 4 Arthur  190     89

df5 <- subset(df, !name %in% c("Ben", "Hanna"))    # delete rows by attribute
df5
##     name size weight
## 3   Paul  175     76
## 4 Arthur  190     89
Columns can be excluded by name with the subset() function. In its select argument, we put a leading minus in front of the name of the column to be deleted (or in front of a vector created with c() to drop several columns at once). The ! symbol is a logical operator that negates a condition (see ?"!").
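The same deletions can also be written without subset(). The following sketch shows some base R alternatives; the object names keep, df_a, df_b and df_c are made up for this illustration:

```r
# Example data frame, recreated so the snippet is self-contained
df <- data.frame(
  name   = c("Ben", "Hanna", "Paul", "Arthur"),
  size   = c(185, 166, 175, 190),
  weight = c(110, 60, 76, 89)
)

# Drop columns by name: keep every column name except the listed ones
keep <- setdiff(names(df), c("weight", "size"))
df_a <- df[, keep, drop = FALSE]    # drop = FALSE keeps the result a data frame

# Drop a single column by assigning NULL (works as for lists)
df_b <- df
df_b$size <- NULL

# Drop rows by attribute with a negated condition
df_c <- df[!(df$name %in% c("Ben", "Hanna")), ]
```

Unlike subset(), these forms use only standard indexing, which some authors prefer inside functions.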
The addition of observations and variables is of course also possible:
df$gender <- c("m", "w", "m", "m")    # add a column (variable)
df
##     name size weight gender
## 1    Ben  185    110      m
## 2  Hanna  166     60      w
## 3   Paul  175     76      m
## 4 Arthur  190     89      m

newdata <- data.frame("name"   = "Lisa",    # add a row (observation)
                      "size"   = 180,
                      "weight" = 70,
                      "gender" = "w")
df <- rbind(df, newdata)
df
##     name size weight gender
## 1    Ben  185    110      m
## 2  Hanna  166     60      w
## 3   Paul  175     76      m
## 4 Arthur  190     89      m
## 5   Lisa  180     70      w
If a new row is to be added, the new data must have the same structure (the same column names) as the existing data frame.
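What "the same structure" means in practice can be sketched as follows. The data frames good and bad and the fallback string "rbind failed" are assumptions made for this illustration. rbind() matches columns by name, so the column order may differ, but a differing column name leads to an error:

```r
# Shortened example data frame
df <- data.frame(name = c("Ben", "Hanna"), size = c(185, 166))

# Same column names in a different order: rbind() matches the columns by name
good <- data.frame(size = 180, name = "Lisa")
combined <- rbind(df, good)
combined

# A differing column name ("height" instead of "size") makes rbind() fail;
# tryCatch() turns the error into a plain string so that the script continues
bad <- data.frame(name = "Tom", height = 170)
res <- tryCatch(rbind(df, bad), error = function(e) "rbind failed")
res    # "rbind failed"
```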
Time for training session number IV: