Writing Functions in R, Know Your Data Types

Control flow in R can be a bit tricky, especially when you’re using functions. An if statement is just an if statement, right? Not quite in R.

Control Flow - Arrays

Functions are a common go-to tool for organizing complex logic. In R, functions can behave a bit strangely depending on the data you pass into them.

Let’s start with a simple function and data.

myFunction <- function(x)
{
	total <- x*2;

	if(total > 1)
	    return("GREAT");

	return("NOT GREAT");
}

data <- c(0, 2)

# > data
# [1] 0 2

Now, what we want to do is call this function with each item in our array (in R terms, a numeric vector). The easiest way to do this is with lapply:

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

> lapply(data, myFunction)
[[1]]
[1] "NOT GREAT"

[[2]]
[1] "GREAT"

Not only is this pretty cool (explicit loops are rarely needed in R), but it also shows how easy it is to wire up our functions to a chunk of data.
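Since lapply always returns a list, it can be handy to get a plain character vector back instead. Here is a minimal sketch, assuming the same one-argument myFunction as above (repeated so the block runs on its own). vapply is part of base R; the variable name labels is only illustrative.

```r
# Same one-argument function as above, repeated so this block is self-contained
myFunction <- function(x) {
  total <- x * 2
  if (total > 1) {
    return("GREAT")
  }
  "NOT GREAT"
}

data <- c(0, 2)

# vapply calls myFunction once per element and checks that every
# result is a single character string
labels <- vapply(data, myFunction, character(1))
labels
# [1] "NOT GREAT" "GREAT"
```

The third argument, character(1), is a template for each result, so a function that accidentally returned a number or a longer vector would fail loudly here.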

Control Flow - Vectors

Let’s take a slightly more complicated example, this time with vectors. We’re going to use the built-in mtcars data frame, which under the hood is a list of vectors, and our goal will be to add a new column containing the output of our function.
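The "list of vectors" claim is easy to check from the console. A small sketch using only base R functions; the variable name hp is just for illustration.

```r
# mtcars ships with base R (datasets package), so nothing needs to be loaded
is.list(mtcars)        # a data frame is a list...
is.data.frame(mtcars)  # ...with the data.frame class on top
length(mtcars)         # length of the list = number of columns

# Each column is an ordinary vector with one entry per row
hp <- mtcars$hp
is.numeric(hp)
length(hp) == nrow(mtcars)
head(hp, 3)
# [1] 110 110  93
```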

Working with data frames can be quite a bit nicer than working with arrays and matrices. For example, let’s say we want to add a new column which simply represents the hp column divided by the mpg column in our mtcars data.

# copy mtcars into a new variable
> data <- mtcars

#> data
#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1


#add a new column
> data$rating <- data$hp / data$mpg

#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb    rating
#Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  5.238095
#Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  5.238095
#Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  4.078947

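So, if we rewrite our function to take hp and mpg, compute rating <- hp / mpg, and return "GREAT" when rating >= 5 (and "NOT GREAT" otherwise), we should be able to just do the same with our function, right?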

#> data$rating <- myFunction(data$hp, data$mpg)

#> data
#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb rating
#Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  GREAT
#Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  GREAT
#Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  GREAT
#Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  GREAT

Wrong! Notice how we have “GREAT” for all of them? We should be showing “NOT GREAT” for the Datsun 710, whose rating is about 4.08. This happens because the if statement is not vectorized: it expects a single TRUE or FALSE, so when the condition is a whole vector, only its first element is used. If you run this same script in RStudio, you’ll notice you also get a warning explaining this (from R 4.2.0 onward, a condition of length greater than one is an error rather than a warning):

Warning message: In if (rating >= 5) return("GREAT") : the condition has length > 1 and only the first element will be used
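To see what if is actually given, it helps to look at the condition on its own. The sketch below uses the hp and mpg values of the first three cars shown above. The tryCatch handlers cover both possible outcomes, since whether R raises a warning or an error for this depends on the R version; the names cond and outcome are illustrative.

```r
# hp and mpg for the first three cars shown above
hp  <- c(110, 110, 93)
mpg <- c(21.0, 21.0, 22.8)

# The comparison is vectorized, so the condition has one value per car
cond <- (hp / mpg) >= 5
cond
# [1]  TRUE  TRUE FALSE
length(cond)
# [1] 3

# if wants exactly one TRUE/FALSE; record how this R version reacts
outcome <- tryCatch(
  {
    if (cond) "GREAT" else "NOT GREAT"
  },
  warning = function(w) "warning raised",
  error   = function(e) "error raised"
)
outcome
```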

ifelse Function

One way for us to fix this is to use the ifelse function, which is vectorized:

myFunction <- function(hp,mpg)
{
	rating <- hp / mpg

	ifelse(rating >= 5, "GREAT", "NOT GREAT");
}

#> data$rating <- myFunction(data$hp, data$mpg)
#> data
#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb    rating
#Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     GREAT
#Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     GREAT
#Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 NOT GREAT

Simple as that. ifelse takes a syntax similar to the ternary operator you see in languages like C#.

ifelse(CONDITION, TRUE_EXPR, FALSE_EXPR)
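As a quick illustration of that element-by-element behaviour, here is a sketch on a small vector. The rating values are made up for the example and labels is just an illustrative name.

```r
# Made-up ratings, one per car
rating <- c(5.24, 4.08, 6.10)

# The condition is evaluated for every element, and the matching
# branch value is picked position by position
labels <- ifelse(rating >= 5, "GREAT", "NOT GREAT")
labels
# [1] "GREAT"     "NOT GREAT" "GREAT"

# A missing rating produces a missing label rather than picking a branch
with_na <- ifelse(c(5.24, NA) >= 5, "GREAT", "NOT GREAT")
is.na(with_na)
# [1] FALSE  TRUE
```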

mapply Function

The ifelse function above is pretty easy and will work great in a lot of simple situations. However, if you imagine having 3-4 different conditions to apply, this gets pretty messy with nested ifelse calls. A better alternative here is to use the mapply function.

myFunction <- function(hp,mpg)
{
	rating <- hp / mpg

	if(rating >= 5)
	    return("GREAT");

	return("NOT GREAT");
}

#> data$rating <- mapply(myFunction, data$hp, data$mpg)
#> data
#                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb    rating
#Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     GREAT
#Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     GREAT
#Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 NOT GREAT

As you’ll notice above, we’ve reverted to the non-vectorized if statement that previously was not giving us our expected result. Because mapply calls our function once per row, passing a single hp value and a single mpg value each time, every if condition now has length one and the result is applied correctly.

mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each … argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

Remember, as mentioned earlier, a data frame is just a list of vectors, which means we are passing multiple vectors to mapply, and it hands their elements to our function one set at a time.
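For the multi-condition case mentioned earlier, plain if / else if inside the function stays readable where nested ifelse calls would not. A minimal sketch: the function rateCar, the "OK" tier, and the thresholds 5 and 8 are all hypothetical, and the input values are sample numbers rather than a full data frame.

```r
# Hypothetical three-tier rating; the cut-offs 5 and 8 are made up
rateCar <- function(hp, mpg) {
  rating <- hp / mpg
  if (rating >= 8) {
    return("GREAT")
  } else if (rating >= 5) {
    return("OK")
  }
  "NOT GREAT"
}

# Sample hp and mpg values
hp  <- c(110, 93, 245)
mpg <- c(21.0, 22.8, 14.3)

# mapply pairs hp[1] with mpg[1], hp[2] with mpg[2], and so on,
# so each call to rateCar sees single numbers
ratings <- mapply(rateCar, hp, mpg)
ratings
# [1] "OK"        "NOT GREAT" "GREAT"
```

The same call works on data frame columns, for example mapply(rateCar, data$hp, data$mpg), since each column is just a vector.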

Summary

The takeaway here is to always be aware of the data structure you’re passing through your control statements. This caught me off guard because, coming from a software development background, you expect the if statement to work the same way everywhere.