9. Missing values
Sometimes observations (rows) in a data frame are incomplete. The correct handling of missing values is especially needed in statistical evaluations. Missing values are indicated by the symbol NA (Not Available). Non-possible values (e.g., division by 0) are described by the symbol NaN (Not a Number). For test purposes, we artificially define two elements as NA in our df:
df <- data.frame( "name" = c("Ben", "Hanna", "Paul", "Arthur"), "size" = c(185, 166, 175, 190) ) df ## name size ## 1 Ben 185 ## 2 Hanna 166 ## 3 Paul 175 ## 4 Arthur 190 df[1:2, 2] <- NA # generate NAs df ## name size ## 1 Ben NA ## 2 Hanna NA ## 3 Paul 175 ## 4 Arthur 190 mean(df$size) # mean with NAs ## [1] NA mean(df$size, na.rm=TRUE) # mean with ignoring NAs ## [1] 182.5
Often it makes more sense to record the observations with missing values and to exclude them from your dataset. For detection we can use the is.na() function to create a logical vector. This outputs a TRUE for every NA value, a FALSE for every element present. Using the logical vector we can then index our data frame:
is.na(df$size) # logical vector (TRUE for all rows with NAs) ## [1] TRUE TRUE FALSE FALSE df[is.na(df$size), ] # index using logical vector ## name size ## 1 Ben NA ## 2 Hanna NA df[complete.cases(df), ] # get all rows without NAs ## name size ## 3 Paul 175 ## 4 Arthur 190
10. Control Structures
Control structures allow better control over the execution of our scripts, e.g., if, if else, else, while, switch, repeat, break, return …
Use the help function for a more detailed documentation.
The IF statement
Use the if / else command to perform simple queries on all data types. The Microsoft Excel equivalent would be the “IF” feature. In R, the syntax is as follows:
x <- 3 # define x x <= 4 # logical condition ## [1] TRUE if (x <= 4) { print("x is smaller than or equal to 4!") } else { print("x is larger than 4!") } ## [1] "x is smaller than or equal to 4!"
Explanation: An IF-command needs three parts: the keyword if(), a condition that results in a single logical output x <= 4 and a block of code in curly braces {}, which is executed if the expression is TRUE. So, if the condition is TRUE, the code will run in curly brackets after the IF-command. If the condition is FALSE, the code block after the elseis executed.
The print() function outputs the strings in the parentheses to the console window. Here we need the print() function so that the output can be written out of the IF-function, similar to return(see chapter 5). A vectorized (and therefore more efficient) notation is the following:
x <- c(3, 4, 5, 6, 7) ifelse(x <=4, "x is smaller than or equal to 4!", "x is larger than 4!") ## [1] "x is smaller than or equal to 4!" "x is smaller than or equal to 4!" ## [3] "x is larger than 4!" "x is larger than 4!" ## [5] "x is larger than 4!"
Using ifelse(), the condition for each individual element is determined as a vector. This is helpful, e.g., if we want to categorize data!
The FOR loop
Loops are incredibly useful when certain tasks need to be repeated very often in the script. A for loop is based on an iterable variable of defined length. But what does that mean? We define any variable, e.g., i, with a start integer value, e.g, 1. We then increment this integer value until a second integer value, e.g. 8, is reached. This can be done via the sequence operator ::
for (i in 1:8) { print(i) } ## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ## [1] 6 ## [1] 7 ## [1] 8
The practical thing: The code in curly brackets is automatically executed once each time (eight times in total)! And we can meanwhile pick up the expression of our variable i, in order to print it or index a vector with it and much more:
v <- c(23, 54, 12, 59, 67, 45) # create integer vector length(v) # check length of vector ## [1] 6 for (i in 1:length(v)) { # iterate length(v) times print(v[i]) } ## [1] 23 ## [1] 54 ## [1] 12 ## [1] 59 ## [1] 67 ## [1] 45
It is also possible to set the variable / iterator equal to the elements of the vector instead of an integer value for indexing. The information in which run the loop is, however, is initially lost:
v <- c("R", "is", "still", "fun") for (i in v) { print(i) } ## [1] "R" ## [1] "is" ## [1] "still" ## [1] "fun"
Now, have a look at exercise V: