R Crash Course Part IV

7. Lists

A list is a generalized vector whose elements may belong to different classes. It thus serves as a container for other objects, such as numbers, strings, vectors, or matrices. A list is created with the function list(). Names can be assigned to the elements of an existing list with the names() function, so that the elements can later be indexed by name:

l <- list(13L, "Hello", matrix(1:6, 2, 3))
l
## [[1]]
## [1] 13
## 
## [[2]]
## [1] "Hello"
## 
## [[3]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

names(l) <- c("my.integer", "my.string", "my.matrix")
l
## $my.integer
## [1] 13
## 
## $my.string
## [1] "Hello"
## 
## $my.matrix
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

str(l)
## List of 3
##  $ my.integer: int 13
##  $ my.string : chr "Hello"
##  $ my.matrix : int [1:2, 1:3] 1 2 3 4 5 6

Alternatively, the element names can be assigned directly when the list is created:

l <- list("my.integer"=13L,
          "my.string"="Hello",
          "my.matrix"=matrix(1:6, 2, 3)
          )

Indexing in lists

The contents of a list element are accessed with double square brackets [[]], using either the index number or the assigned element name (if available). Single square brackets [] return only a sub-list, which still belongs to class list:

l[1]                        # first part of the list
## $my.integer
## [1] 13
class(l[1])
## [1] "list"

l[[1]]                      # extract first element (integer value)
## [1] 13
class(l[[1]])
## [1] "integer"

l[["my.string"]]           # extract element by its name
## [1] "Hello"

l[[3]]                      # extract third element (matrix)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
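
As a small supplementary sketch (the $ shortcut and the chained index are not covered above, but both are standard R), a named element can also be reached with $, and a further index can be applied directly to the extracted element. The list is recreated here so that the example runs on its own:

```r
# Recreate the example list so the sketch is self-contained
l <- list(my.integer = 13L,
          my.string  = "Hello",
          my.matrix  = matrix(1:6, 2, 3))

l$my.string                 # shortcut for l[["my.string"]]
## [1] "Hello"

l[["my.matrix"]][2, 3]      # row 2, column 3 of the stored matrix
## [1] 6

class(l[2:3])               # single brackets return a list again
## [1] "list"
```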

Modify lists

Lists can be extended (assign a value to a new index number or a new element name), elements can be deleted (assign NULL), and individual elements can be overwritten (assign a new value to an existing index or name):

l["my.numeric"] <- 45.7325          # add new element to list

l[1] <- NULL                         # delete first element in list

l["my.string"] <- "World"            # overwrite existing element
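
The following sketch repeats the three operations on a fresh copy of the list and inspects the result. It uses names rather than positions for all three steps, which is a choice made here for illustration: after a deletion the remaining elements move up, so a positional index may no longer point at the element you expect.

```r
# Fresh copy of the example list
l <- list(my.integer = 13L,
          my.string  = "Hello",
          my.matrix  = matrix(1:6, 2, 3))

l[["my.numeric"]] <- 45.7325   # add a new element
l[["my.integer"]] <- NULL      # delete an element by name
l[["my.string"]]  <- "World"   # overwrite an existing element

names(l)                       # my.integer is gone, my.numeric was appended
## [1] "my.string"  "my.matrix"  "my.numeric"

l[[1]]                         # position 1 now holds the former second element
## [1] "World"
```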


8. Data frames

The data frame is the most commonly used data structure for working with datasets in R; it holds two-dimensional, tabular data. How does it differ from a matrix? While a matrix can only contain elements of a single class, a data frame can hold a different class in each column. Internally, a data frame is basically a list of equally long vectors, one per column.
Functions that read external tabular data into R, such as read.table() or read.csv(), return a data frame.

df <- data.frame(
  "name"   = c("Ben", "Hanna", "Paul", "Arthur"), 
  "size"   = c(185, 166, 175, 190),
  "weight" = c(110, 60, 76, 89)
  )

df
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 3   Paul  175     76
## 4 Arthur  190     89

length(df)                  # number of columns (variables)
## [1] 3

dim(df)                     # dimensions (4 rows, 3 columns)
## [1] 4 3

nrow(df)                    # number of rows (observations)
## [1] 4

ncol(df)                    # number of columns (variables)
## [1] 3

str(df)                     # shows structure of df
## 'data.frame':    4 obs. of  3 variables:
##  $ name  : Factor w/ 4 levels "Arthur","Ben",..: 2 3 4 1
##  $ size  : num  185 166 175 190
##  $ weight: num  110 60 76 89

summary(df)                 # statistical summary
##      name       size            weight
##  Arthur:1   Min.   :166.0   Min.   : 60.00  
##  Ben   :1   1st Qu.:172.8   1st Qu.: 72.00  
##  Hanna :1   Median :180.0   Median : 82.50  
##  Paul  :1   Mean   :179.0   Mean   : 83.75  
##             3rd Qu.:186.2   3rd Qu.: 94.25  
##             Max.   :190.0   Max.   :110.00

The output of the function str() is particularly informative. It first tells us that the data frame holds 4 observations (obs., i.e. the rows for "Ben", "Hanna", "Paul", and "Arthur") of 3 variables each (i.e. the columns "name", "size", and "weight"). For each variable it then reports whether it is numeric (num) or categorical (Factor) and, for the latter, the number of distinct values (w/ 4 levels). The output shown here stems from an R version before 4.0.0, where data.frame() converts strings to factors by default; since R 4.0.0 the name column is reported as chr unless stringsAsFactors = TRUE is set. Even more useful is the statistical summary of each column provided by the function summary().
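
Whether a string column ends up as a factor or as a character vector depends on the stringsAsFactors argument of data.frame(). The sketch below sets the argument explicitly so that the result does not depend on the installed R version; the names df_chr and df_fct are chosen here purely for illustration:

```r
# Same data, once with strings kept as characters, once converted to factors
df_chr <- data.frame(name = c("Ben", "Hanna"), size = c(185, 166),
                     stringsAsFactors = FALSE)
df_fct <- data.frame(name = c("Ben", "Hanna"), size = c(185, 166),
                     stringsAsFactors = TRUE)

class(df_chr$name)
## [1] "character"

class(df_fct$name)
## [1] "factor"

levels(df_fct$name)            # the distinct values of the factor
## [1] "Ben"   "Hanna"
```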

Indexing in data frames

In a data frame, columns can be addressed either with double square brackets [[]] and an index number or, if the column has a name, directly by that name with the dollar sign $. In addition, rows and columns can be addressed with single square brackets [], analogous to a matrix:

df
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 3   Paul  175     76
## 4 Arthur  190     89

df[2]                                  # output column 2 as data frame
##   size
## 1  185
## 2  166
## 3  175
## 4  190

df[[2]]                                # output as numeric
## [1] 185 166 175 190

df$size                                # output as numeric
## [1] 185 166 175 190

df[ , 2]                               # column output as numeric
## [1] 185 166 175 190

df[1,  ]                               # row output as data frame
##   name size weight
## 1  Ben  185    110

df[1, 2]                               # element in row 1, col 2 as numeric
## [1] 185
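
As the examples show, df[ , 2] silently simplifies a single column to a plain vector. If a data frame is needed instead, the drop argument of the bracket operator can be set to FALSE. A minimal sketch, recreating the example data so that it runs on its own:

```r
# Recreate the example data frame
df <- data.frame(name   = c("Ben", "Hanna", "Paul", "Arthur"),
                 size   = c(185, 166, 175, 190),
                 weight = c(110, 60, 76, 89))

class(df[ , 2])                   # a single column is simplified to a vector
## [1] "numeric"

class(df[ , 2, drop = FALSE])     # drop = FALSE keeps the data frame
## [1] "data.frame"

class(df[ , c("name", "weight")]) # several columns always stay a data frame
## [1] "data.frame"
```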

Rows can also be selected by conditions, which are built from comparison and logical operators:

df
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 3   Paul  175     76
## 4 Arthur  190     89

df$size > 170
## [1]  TRUE FALSE  TRUE  TRUE

df[df$size > 170, ]                     
##     name size weight
## 1    Ben  185    110
## 3   Paul  175     76
## 4 Arthur  190     89

df[df$size > 180 & df$weight < 100, ]        # AND condition
##     name size weight
## 4 Arthur  190     89

df[df$size > 188 | df$weight < 70, ]         # OR condition
##     name size weight
## 2  Hanna  166     60
## 4 Arthur  190     89

df[df$name == "Ben" | df$name == "Hanna", ]  # OR condition
##    name size weight
## 1   Ben  185    110
## 2 Hanna  166     60

Explanation: The comparison df$size > 170 returns a logical vector that contains TRUE for every observation whose size is greater than 170. In df[df$size > 170, ] this vector is used as the row index, so that all observations marked TRUE are returned. When conditions are combined, & (AND) requires both conditions to hold at the same time, whereas | (OR) requires at least one of them to hold.
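
The logical vector can also be stored and inspected before it is used as an index, which makes longer queries easier to follow. The sketch below recreates the example data; the variable names tall and light are chosen here for illustration:

```r
# Recreate the example data frame
df <- data.frame(name   = c("Ben", "Hanna", "Paul", "Arthur"),
                 size   = c(185, 166, 175, 190),
                 weight = c(110, 60, 76, 89))

tall  <- df$size > 170        # one TRUE/FALSE per row
light <- df$weight < 100

sum(tall)                     # TRUE counts as 1: number of matching rows
## [1] 3

which(tall)                   # row numbers of the matching rows
## [1] 1 3 4

as.character(df$name[tall & light])   # names of rows meeting both conditions
## [1] "Paul"   "Arthur"
```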

Modify data frames

It is often necessary to delete data from a data frame or to add further entries later. R offers several ways to do both. Here are a few simple solutions:

df2 <- df[ , -2]                                # delete column by index
df2
##     name weight
## 1    Ben    110
## 2  Hanna     60
## 3   Paul     76
## 4 Arthur     89

df3 <- subset(df, select = -c(weight, size))    # delete column by name
df3
##     name
## 1    Ben
## 2  Hanna
## 3   Paul
## 4 Arthur

df4 <- df[-3, ]                                 # delete row by index
df4
##     name size weight
## 1    Ben  185    110
## 2  Hanna  166     60
## 4 Arthur  190     89

df5 <- subset(df, !name %in% c("Ben", "Hanna")) # delete row by attribute
df5
##     name size weight
## 3   Paul  175     76
## 4 Arthur  190     89

Columns can be excluded by name with the subset() function: the select argument takes the name of the column to drop, preceded by a minus sign (or a vector built with c() to drop several columns at once). The ! symbol is a logical operator that negates a condition (see ?"!").
Observations and variables can of course also be added:

df$gender <- c("m", "w", "m", "m")        # add a column (variable)
df
##     name size weight gender
## 1    Ben  185    110      m
## 2  Hanna  166     60      w
## 3   Paul  175     76      m
## 4 Arthur  190     89      m

newdata <- data.frame("name" = "Lisa",    # add a row (observation)
                      "size" = 180,
                      "weight" = 70,
                      "gender" = "w"
                      )

df <- rbind(df, newdata)
df
##     name size weight gender
## 1    Ben  185    110      m
## 2  Hanna  166     60      w
## 3   Paul  175     76      m
## 4 Arthur  190     89      m
## 5   Lisa  180     70      w

When a new row is added, the new data must have the same structure as the existing data frame, i.e. the same column names.
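
The sketch below illustrates what "same structure" means in practice. It uses a small data frame named people and two candidate rows, bad_row and good_row; all three names are made up for this example. rbind() matches columns by name, so a different column order is fine, while a differing column name raises an error, which is caught here with tryCatch():

```r
people <- data.frame(name = c("Ben", "Hanna"), size = c(185, 166),
                     stringsAsFactors = FALSE)

# Column "height" does not exist in people, so rbind() fails
bad_row <- data.frame(name = "Lisa", height = 180, stringsAsFactors = FALSE)
result  <- tryCatch(rbind(people, bad_row),
                    error = function(e) "rbind failed")
result
## [1] "rbind failed"

# Same column names in a different order: columns are matched by name
good_row <- data.frame(size = 180, name = "Lisa", stringsAsFactors = FALSE)
people   <- rbind(people, good_row)
nrow(people)
## [1] 3
```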

Time for training session number IV: