R Crash Course Part V

9. Missing Values

Sometimes observations (rows) in a data frame are incomplete. Handling missing values correctly is especially important in statistical analyses. Missing values are indicated by the symbol NA (Not Available). Undefined numeric results (e.g., 0/0) are indicated by the symbol NaN (Not a Number); note that dividing a non-zero number by 0 gives Inf or -Inf instead. For test purposes, we artificially set two elements of our df to NA:

df <- data.frame(
  name = c("Ben", "Hanna", "Paul", "Arthur"),
  size = c(185, 166, 175, 190)
)

df
##     name size
## 1    Ben  185
## 2  Hanna  166
## 3   Paul  175
## 4 Arthur  190

df[1:2, 2] <- NA              # generate NAs
df
##     name size
## 1    Ben   NA
## 2  Hanna   NA
## 3   Paul  175
## 4 Arthur  190

mean(df$size)                 # mean with NAs
## [1] NA

mean(df$size, na.rm = TRUE)   # mean ignoring NAs
## [1] 182.5
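
The difference between NA and NaN can be checked directly in the console. The following is a minimal sketch; the vector x is only an illustration and not part of our df. It shows that is.na() also reports NaN as missing, while is.nan() detects NaN only:

```r
0 / 0                         # undefined numeric result
## [1] NaN

1 / 0                         # division of a non-zero number by 0
## [1] Inf

x <- c(1, NA, 0 / 0)          # example vector with one NA and one NaN

is.na(x)                      # TRUE for NA and for NaN
## [1] FALSE  TRUE  TRUE

is.nan(x)                     # TRUE only for NaN
## [1] FALSE FALSE  TRUE
```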

Often it makes more sense to record which observations have missing values and then exclude them from your dataset. To detect them, we can use the is.na() function, which returns a logical vector: TRUE for every NA value and FALSE for every value that is present. Using this logical vector, we can then index our data frame:

is.na(df$size)               # logical vector (TRUE for all rows with NAs)
## [1]  TRUE  TRUE FALSE FALSE

df[is.na(df$size), ]         # index using logical vector
##    name size
## 1   Ben   NA
## 2 Hanna   NA

df[complete.cases(df), ]     # get all rows without NAs
##     name size
## 3   Paul  175
## 4 Arthur  190
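
Before removing rows, it can help to count how many values are missing. The sketch below rebuilds the data frame from above so that it runs on its own; the name df_clean is just an illustrative choice. It counts the NAs and then drops the incomplete rows with na.omit(), which here should give the same rows as the complete.cases() indexing:

```r
df <- data.frame(
  name = c("Ben", "Hanna", "Paul", "Arthur"),
  size = c(NA, NA, 175, 190)
)

sum(is.na(df$size))           # number of NAs in one column
## [1] 2

colSums(is.na(df))            # number of NAs per column
## name size
##    0    2

df_clean <- na.omit(df)       # drop all rows that contain an NA
nrow(df_clean)
## [1] 2
```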


10. Control Structures

Control structures give us finer control over how our scripts are executed, e.g., if, else if, else, for, while, switch, repeat, break, return.
Use the help function (e.g., ?Control) for more detailed documentation.
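
Two of the keywords listed above, while and repeat, are not covered in detail in this part. As a rough sketch with made-up counter variables (i and n), they could be used like this; the if used inside repeat is explained in the next section:

```r
i <- 1
while (i <= 3) {              # runs as long as the condition is TRUE
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3

n <- 0
repeat {                      # runs until break is reached
  n <- n + 1
  if (n >= 3) break
}
n
## [1] 3
```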

The IF statement

Use the if / else statement to run different code depending on a condition; the condition can be built from values of any data type. The Microsoft Excel equivalent would be the “IF” feature. In R, the syntax is as follows:

x <- 3                                # define x

x <= 4                                # logical condition
## [1] TRUE

if (x <= 4) {
  print("x is smaller than or equal to 4!")    
} else {
  print("x is larger than 4!")        
}
## [1] "x is smaller than or equal to 4!"

Explanation: An if statement needs three parts: the keyword if, a condition in parentheses that evaluates to a single logical value (here x <= 4), and a block of code in curly braces {} that is executed if the condition is TRUE. So, if the condition is TRUE, the code in the curly braces after if is run. If the condition is FALSE, the code block after else is executed instead.
The print() function writes the string in the parentheses to the console window. Here we use print() so that the output is written explicitly from inside the if statement, similar to return (see chapter 5). Note that if only accepts a single logical value as its condition. A vectorized notation, which tests every element of a vector at once (and is therefore more efficient than testing the elements one by one), is the following:

x <- c(3, 4, 5, 6, 7)                                

ifelse(x <= 4, "x is smaller than or equal to 4!", "x is larger than 4!")
## [1] "x is smaller than or equal to 4!" "x is smaller than or equal to 4!"
## [3] "x is larger than 4!"              "x is larger than 4!"             
## [5] "x is larger than 4!"

ifelse() evaluates the condition for each individual element and returns a vector of the same length. This is helpful, e.g., if we want to categorize data!
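
As a small illustration of such a categorization, the sketch below adds a new column to the data frame from chapter 9 (rebuilt here without NAs so that the example runs on its own). The cut-off of 180 and the labels "tall" and "short" are arbitrary assumptions for this example:

```r
df <- data.frame(
  name = c("Ben", "Hanna", "Paul", "Arthur"),
  size = c(185, 166, 175, 190)
)

# assumed cut-off: 180 and above counts as "tall"
df$category <- ifelse(df$size >= 180, "tall", "short")
df
##     name size category
## 1    Ben  185     tall
## 2  Hanna  166    short
## 3   Paul  175    short
## 4 Arthur  190     tall
```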

The FOR loop

Loops are incredibly useful when certain tasks need to be repeated many times in a script. A for loop runs over a sequence of defined length. But what does that mean? We define a loop variable, e.g., i, with an integer start value, e.g., 1. This value is then incremented until a second integer value, e.g., 8, is reached. This can be done via the sequence operator (:):

for (i in 1:8) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8

The practical thing: The code in curly brackets is automatically executed once for each value of i (eight times in total)! Inside the loop, we can use the current value of our variable i, e.g., to print it, to index a vector with it, and much more:

v <- c(23, 54, 12, 59, 67, 45)    # create numeric vector

length(v)                         # check length of vector
## [1] 6
  
for (i in 1:length(v)) {          # iterate length(v) times
  print(v[i])
}
## [1] 23
## [1] 54
## [1] 12
## [1] 59
## [1] 67
## [1] 45
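
One caveat of 1:length(v) is that it misbehaves for vectors of length 0, because 1:0 counts down from 1 to 0. A common alternative is seq_along(), sketched below; the variables total and empty are only illustrative names:

```r
v <- c(23, 54, 12, 59, 67, 45)

total <- 0
for (i in seq_along(v)) {         # same indices as 1:length(v) here
  total <- total + v[i]           # add up the elements one by one
}
total
## [1] 260

empty <- numeric(0)               # vector of length 0

1:length(empty)                   # counts down from 1 to 0
## [1] 1 0

seq_along(empty)                  # empty sequence, a loop body would never run
## integer(0)
```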

It is also possible to let the variable / iterator take on the elements of the vector directly instead of an integer index. However, the information about which iteration the loop is currently in is then no longer directly available:

v <- c("R", "is", "still", "fun")    

for (i in v) {
  print(i)
}
## [1] "R"
## [1] "is"
## [1] "still"
## [1] "fun"
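
If the iteration number is needed after all, one option is to keep a counter by hand. This is only a sketch; the names word and count are chosen for illustration:

```r
v <- c("R", "is", "still", "fun")

count <- 0                        # manual iteration counter
for (word in v) {
  count <- count + 1              # increase the counter in every run
  print(paste(count, word))       # iteration number and element together
}
## [1] "1 R"
## [1] "2 is"
## [1] "3 still"
## [1] "4 fun"
```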

Now, have a look at exercise V: