User Created Functions
When to create a function
Functions can be particularly useful when you are duplicating significant portions of your code. For example, perhaps you want to standardize various quantitative variables by subtracting each variable by the mean and dividing by the standard deviation. Below is the mathematics behind this operation.
\[ standardizedvar = \frac{var - mean(var)}{sd(var)} \]
Let’s use the midwest data that comes with the tidyverse to explore what code like this would look like.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
midwest
## # A tibble: 437 × 28
## PID county state area poptotal popdensity popwhite popblack popamerindian
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
## 2 562 ALEXAN… IL 0.014 10626 759 7054 3496 19
## 3 563 BOND IL 0.022 14991 681. 14477 429 35
## 4 564 BOONE IL 0.017 30806 1812. 29344 127 46
## 5 565 BROWN IL 0.018 5836 324. 5264 547 14
## 6 566 BUREAU IL 0.05 35688 714. 35157 50 65
## 7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
## 8 568 CARROLL IL 0.027 16805 622. 16519 111 30
## 9 569 CASS IL 0.024 13437 560. 13384 16 8
## 10 570 CHAMPA… IL 0.058 173025 2983. 146506 16559 331
## # … with 427 more rows, and 19 more variables: popasian <int>, popother <int>,
## # percwhite <dbl>, percblack <dbl>, percamerindan <dbl>, percasian <dbl>,
## # percother <dbl>, popadults <int>, perchsd <dbl>, percollege <dbl>,
## # percprof <dbl>, poppovertyknown <int>, percpovertyknown <dbl>,
## # percbelowpoverty <dbl>, percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
Suppose we wanted to standardize using the above equation, for the variables, poptotal
, popdensity
, popadults
, and perccollege
. Let’s do the computation for the variable poptotal
first.
midwest %>%
mutate(stand_poptotal = (poptotal - mean(poptotal)) / sd(poptotal)) %>%
select(poptotal, stand_poptotal)
## # A tibble: 437 × 2
## poptotal stand_poptotal
## <int> <dbl>
## 1 66090 -0.101
## 2 10626 -0.287
## 3 14991 -0.272
## 4 30806 -0.219
## 5 5836 -0.303
## 6 35688 -0.203
## 7 5322 -0.305
## 8 16805 -0.266
## 9 13437 -0.277
## 10 173025 0.258
## # … with 427 more rows
How would we move alter the code to do the computation for subsequent columns? You could rewrite the code, but I would likely copy and paste the code and adapt the pieces to the new column. Here I did it for the additional three variables.
midwest %>%
mutate(stand_poptotal = (poptotal - mean(poptotal)) / sd(poptotal),
stand_popdensity = (popdensity - mean(popdensity)) / sd(popdensity),
stand_popadults = (popadults - mean(popadults)) / sd(popadults),
stand_percollege = (percollege - mean(poptotal)) / sd(percollege)) %>%
select(poptotal, popdensity, popadults, percollege,
stand_poptotal, stand_popdensity, stand_popadults, stand_percollege)
## # A tibble: 437 × 8
## poptotal popdensity popadults percollege stand_poptotal stand_popdensity
## <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 66090 1271. 43298 19.6 -0.101 -0.238
## 2 10626 759 6724 11.2 -0.287 -0.305
## 3 14991 681. 9669 17.0 -0.272 -0.315
## 4 30806 1812. 19272 17.3 -0.219 -0.168
## 5 5836 324. 3979 14.5 -0.303 -0.362
## 6 35688 714. 23444 18.9 -0.203 -0.311
## 7 5322 313. 3583 11.9 -0.305 -0.363
## 8 16805 622. 11323 16.2 -0.266 -0.323
## 9 13437 560. 8825 14.1 -0.277 -0.331
## 10 173025 2983. 95971 41.3 0.258 -0.0149
## # … with 427 more rows, and 2 more variables: stand_popadults <dbl>,
## # stand_percollege <dbl>
Explore the code, notice any problems? When I copied and pasted, I missed changing the mean calculation for the percollege
variable. This issue can be particularly difficult to debug given that there are no errors in the code. We could save some code duplication by writing our own function. This can save on errors in copy and pasting or errors on data entry. Let’s write a function that implements this, focusing less on function specifics as that will come later.
standardize <- function(x) {
mean_var <- mean(x)
sd_var <- sd(x)
(x - mean_var) / sd_var
}
We could then use this function, standardize()
inside of the mutate()
function.
midwest %>%
mutate(stand_poptotal = standardize(poptotal),
stand_popdensity = standardize(popdensity),
stand_popadults = standardize(popadults),
stand_percollege = standardize(percollege)) %>%
select(poptotal, popdensity, popadults, percollege,
stand_poptotal, stand_popdensity, stand_popadults, stand_percollege)
## # A tibble: 437 × 8
## poptotal popdensity popadults percollege stand_poptotal stand_popdensity
## <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 66090 1271. 43298 19.6 -0.101 -0.238
## 2 10626 759 6724 11.2 -0.287 -0.305
## 3 14991 681. 9669 17.0 -0.272 -0.315
## 4 30806 1812. 19272 17.3 -0.219 -0.168
## 5 5836 324. 3979 14.5 -0.303 -0.362
## 6 35688 714. 23444 18.9 -0.203 -0.311
## 7 5322 313. 3583 11.9 -0.305 -0.363
## 8 16805 622. 11323 16.2 -0.266 -0.323
## 9 13437 560. 8825 14.1 -0.277 -0.331
## 10 173025 2983. 95971 41.3 0.258 -0.0149
## # … with 427 more rows, and 2 more variables: stand_popadults <dbl>,
## # stand_percollege <dbl>
Notice now we remove some, but not all of the duplication. Next week we will explore iteration in more detail that can remove this level of duplication.
Your Turn
- There is an R function,
scale()
, that does this standardization. Use this to recreate the standardization above. - Will the
standardize()
function above work when there are missing values? How could the function be adapted to work with missing values?
Conditional summarise()
or mutate()
statements
One thing I want to show briefly is if you are attempting to apply a particular function for data verification and there is a function already defined in R, summarise_if()
or mutate_if()
can be particularly useful. Take an example of wanting to calculate the mean of each numeric column, but the data has a mix of numeric and categorical variables. You could write multiple computations within mutate()
or summarise()
, but summarise_if()
and mutate_if()
can be particularly useful. Below is an example of this strategy with the mpg data from dplyr.
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
Let’s calculate the mean of each numeric column with summarise_if()
.
summarise_if(mpg, is.numeric, mean, na.rm = TRUE)
## # A tibble: 1 × 5
## displ year cyl cty hwy
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.47 2004. 5.89 16.9 23.4
The mutate_if()
function could be used if these values wanted to be saved to the data. Additional level of detail could be added by using summarise_if()
and group_by()
in tandem.
mpg %>%
group_by(manufacturer) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
## # A tibble: 15 × 6
## manufacturer displ year cyl cty hwy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 audi 2.54 2004. 5.22 17.6 26.4
## 2 chevrolet 5.06 2005. 7.26 15 21.9
## 3 dodge 4.38 2004. 7.08 13.1 17.9
## 4 ford 4.54 2003. 7.2 14 19.4
## 5 honda 1.71 2003 4 24.4 32.6
## 6 hyundai 2.43 2004. 4.86 18.6 26.9
## 7 jeep 4.58 2006. 7.25 13.5 17.6
## 8 land rover 4.3 2004. 8 11.5 16.5
## 9 lincoln 5.4 2002 8 11.3 17
## 10 mercury 4.4 2004. 7 13.2 18
## 11 nissan 3.27 2004. 5.54 18.1 24.6
## 12 pontiac 3.96 2003. 6.4 17 26.4
## 13 subaru 2.46 2004. 4 19.3 25.6
## 14 toyota 2.95 2003. 5.12 18.5 24.9
## 15 volkswagen 2.26 2003. 4.59 20.9 29.2
These can be great, simple ways to explore the data at a first inspection.
Function basics
Every function has three basic components, a function name, a function definition (i.e. arguments), and the function specifics/computations or function body. The following code chunk attempts to show these basics.
function_name <- function(argument1, argument2) {
function_body
}
The function()
function is used to define a new function. Inside the parentheses of this are where the function arguments are defined. The arguments can contain default values or can be unspecified, more details on this later. The function body is where the details of what the function is going to do is contained. From the previous function we created, the computation to standardize a variable was perfomed within the function body. Finally, the function name is specified just like naming any other object.
Picking a Function Name
Functions names are an important step to think about for a bit at the beginning. A few general heuristics are useful to keep in mind when thinking about a function name.
- Do not use a name from a function already defined within R. For example, avoid things like:
mean
,c
,t.test
, etc. - Create a semi-descriptive name that isn’t too detailed. For example, function names like
f
,s
,tr
, etc are not descriptive enough to be useful. Also, better to avoid particularly long names likestandardize_variable
. - If you define multiple functions, be consistent in their naming convention. If you use underscores, use them for all of your function names. If you like dots or titleCase be consistent.
Function Arguments
Function arguments need to be named and I would recommend using similar heuristics to picking a function name. The only element that can be ignored is the first, you are able to reuse function arguments across functions. This would actually be encouraged across functions that are similar to ensure that the function arguments work similarly across the function.
Arguments with Default Values
Arguments in R can either have default values or can be left without default values. The big distinction to keep in mind when thinking about this is whether it is an argument that users should explicitly specify everytime (i.e. major user argument) or is an argument that users may only change on occasion and there is an obvious default that makes sense. For the former situation, I would only define the argument name with no default compared to the latter I would typically define a default value.
Consider the standardize()
function we defined previously.
standardize <- function(x) {
mean_var <- mean(x)
sd_var <- sd(x)
(x - mean_var) / sd_var
}
Notice that the function argument, x
here is just named it does not have a default value. This was the case where this argument was specified to be the variable to be standardized, therefore having a default value did not make sense. The first function arguments are commonly do not have default values.
In a “Your Turn” section earlier, it was asked if this function would work with missing data. The short answer, is it would not, because the mean and sd would not be able to be calculated with missing values. For example, using the nycflights13
package used earlier, we could show that when missing data are present, the function breaks (i.e. produces NA values).
library(nycflights13)
flights %>%
mutate(stand_arr_delay = standardize(arr_delay)) %>%
select(arr_delay, stand_arr_delay)
## # A tibble: 336,776 × 2
## arr_delay stand_arr_delay
## <dbl> <dbl>
## 1 11 NA
## 2 20 NA
## 3 33 NA
## 4 -18 NA
## 5 -25 NA
## 6 12 NA
## 7 19 NA
## 8 -14 NA
## 9 -8 NA
## 10 8 NA
## # … with 336,766 more rows
We could adapt this to include another function argument, that deals with how missing data should be handled.
standardize <- function(x, na.rm = TRUE) {
mean_var <- mean(x, na.rm = na.rm)
sd_var <- sd(x, na.rm = na.rm)
(x - mean_var) / sd_var
}
flights %>%
mutate(stand_arr_delay = standardize(arr_delay)) %>%
select(arr_delay, stand_arr_delay)
## # A tibble: 336,776 × 2
## arr_delay stand_arr_delay
## <dbl> <dbl>
## 1 11 0.0920
## 2 20 0.294
## 3 33 0.585
## 4 -18 -0.558
## 5 -25 -0.715
## 6 12 0.114
## 7 19 0.271
## 8 -14 -0.468
## 9 -8 -0.334
## 10 8 0.0247
## # … with 336,766 more rows
This function can now handle missing values through the argument, na.rm
, which defaults to the value of TRUE
and since the default value is TRUE
, the missing data are automatically removed when computing the mean and sd. We could change the default value explicitly when we call the function.
flights %>%
mutate(stand_arr_delay = standardize(arr_delay, na.rm = FALSE)) %>%
select(arr_delay, stand_arr_delay)
## # A tibble: 336,776 × 2
## arr_delay stand_arr_delay
## <dbl> <dbl>
## 1 11 NA
## 2 20 NA
## 3 33 NA
## 4 -18 NA
## 5 -25 NA
## 6 12 NA
## 7 19 NA
## 8 -14 NA
## 9 -8 NA
## 10 8 NA
## # … with 336,766 more rows
Note here I used the same argument name, na.rm
, as is found within the mean and sd functions by default. This was to ensure that they are familiar to many users already comfortable with these function names.
Function arguments can also be specified as text strings, that will be discussed in more detail in the next section.
Conditional Functions
Conditional branching in functions are common, particularly when the functions attempt to do a few different things based on what the input values are. The general format for creating functions that use conditional logic are as follows:
if(some_condition) {
# do something
} else {
# do something else
}
This is an example of a branch with two different computations. For example, we could create a function that first calculates the skewness of a variable, if it is skewed returns the median otherwise computes the mean.
library(e1071)
mean_median <- function(variable, na.rm = TRUE) {
var_skew <- skewness(variable, na.rm = na.rm)
if(abs(var_skew) > 2) {
descriptive <- median(variable, na.rm = na.rm)
} else {
descriptive <- mean(variable, na.rm = na.rm)
}
data.frame(skewness = var_skew,
descriptive = descriptive)
}
We could then use this function to calculate the statistic of interest for a few different variables.
mean_median(flights$arr_delay)
## skewness descriptive
## 1 3.716783 -5
mean_median(gss_cat$tvhours)
## skewness descriptive
## 1 2.915875 2
mean_median(diamonds$price)
## skewness descriptive
## 1 1.618305 3932.8
Note here, that you need to name the variable explicitly, you can’t just do the following:
mean_median(arr_delay)
This is happening because the function is unable to find the variable, arr_delay
as it does not know to look in the flights data. To rectify this you need to use the $
notation shown above or other ways of telling the function which data to use to find the correct variable (e.g., with()
would also be an option).
Conditionals based on function arguments
You can also control flow of a function based on the arguments. For example, we could rewrite the mean_median()
function to include another argument that lets users specify whether they want the mean or median computed rather than basing this decision on the skewness. This may look as follows.
mean_median_arg <- function(variable, statistic = 'mean', na.rm = TRUE,
skewness = FALSE) {
if(statistic == 'mean') {
descriptive <- mean(variable, na.rm = na.rm)
} else {
descriptive <- median(variable, na.rm = na.rm)
}
if(skewness) {
skewness_stat <- skewness(variable, na.rm = na.rm)
} else {
skewness_stat <- NA
}
data.frame(skewness_stat, descriptive)
}
This function can now be tested using the variables we defined before:
mean_median_arg(flights$arr_delay)
## skewness_stat descriptive
## 1 NA 6.895377
Notice by default, the skewness is not calculated as it is set to FALSE
by default. The skewness could be returned by setting skewness = TRUE
. The mean is also calculated by default, due to the default value of setting statistic = 'mean'
. If we wanted the median instead, we can add this argument.
mean_median_arg(flights$arr_delay, statistic = 'median', skewness = TRUE)
## skewness_stat descriptive
## 1 3.716783 -5
Note, the way the function was written, specifying anything other than statistic = 'mean'
would compute the median.
mean_median_arg(flights$arr_delay, statistic = 'anything', skewness = TRUE)
## skewness_stat descriptive
## 1 3.716783 -5
This behavior is not ideal, but may be okay for simple functions. If it would be a problem, some additional structure inside the function to test for the specific arguments that are possible could be defined and an error could be returned if the arguments specified are not appropriate.
Larger conditional branching
Conditinoal branching can take on more than two levels as well. The general structure using nested if statements would look like the following.
if(some_condition) {
# do something computationally
} else if (another_condition) {
# do another computation
} else if (yet_another_condition) {
# do yet another computation
} else {
# final computation
}
In the example above, there are four different paths that the function can take depending on the function input. Commonly, this behavior would use a function argument defined as a character string to navigate through the different branches.
For example, explore the following function.
desc_stats <- function(variable,
statistic = c('sd', 'mean', 'median', 'quantile')) {
if(statistic == 'sd') {
sd(variable, na.rm = TRUE)
} else if(statistic == 'mean') {
mean(variable, na.rm = TRUE)
} else if(statistic == 'median') {
median(variable, na.rm = TRUE)
} else {
quantile(variable, na.rm = TRUE)
}
}
We can then see what is returned for this function with different arguments.
desc_stats(flights$arr_delay, statistic = 'sd')
## [1] 44.63329
desc_stats(flights$arr_delay, statistic = 'mean')
## [1] 6.895377
desc_stats(flights$arr_delay, statistic = 'median')
## [1] -5
desc_stats(flights$arr_delay, statistic = 'quantile')
## 0% 25% 50% 75% 100%
## -86 -17 -5 14 1272
Note, when defined this way, the statistic argument must always be specified, but when including the argument values in the function definition, the function can be more specific.
Using switch()
function
This behavior can make functions difficult to read and debug, therefore using the switch()
function can be a nice alternative if there are many different values for a function and each just calls a single function. We can redefine the previous function using switch()
. For the switch()
function, the argument name is passed as the first argument and subsequent arguments to switch()
are function calls based on the text of the arguments (shown before the equal sign) followed by the function or computation to perform.
desc_stats_sw <- function(variable,
statistic = c('sd', 'mean', 'median', 'quantile')) {
switch(statistic,
sd = sd(variable, na.rm = TRUE),
mean = mean(variable, na.rm = TRUE),
median = median(variable, na.rm = TRUE),
quantile = quantile(variable, na.rm = TRUE),
stop("Unknown statistics argument"))
}
Then to test the function.
desc_stats_sw(flights$arr_delay, statistic = 'sd')
## [1] 44.63329
desc_stats_sw(flights$arr_delay, statistic = 'mean')
## [1] 6.895377
desc_stats_sw(flights$arr_delay, statistic = 'median')
## [1] -5
desc_stats_sw(flights$arr_delay, statistic = 'quantile')
## 0% 25% 50% 75% 100%
## -86 -17 -5 14 1272
desc_stats_sw(flights$arr_delay, statistic = 'variance')
## Error in desc_stats_sw(flights$arr_delay, statistic = "variance"): Unknown statistics argument
Notice that now I also flagged the function to throw an error if an argument name does not match the statistic
argument definition.
Thinking about Returned Values
I only want to touch briefly on return values. These are commonly found at the end of the function and I usually just print them out to the console. Here I show the first function, standardize()
for reference. Notice that the last line of the function is the resulting statistic being printed. That is, the (x - mean_var) / sd_var
was not saved to an object therefore would print directly to the console. This is the return value for the function.
standardize <- function(x) {
mean_var <- mean(x)
sd_var <- sd(x)
(x - mean_var) / sd_var
}
The return()
function
The return()
function is useful to be explicit about the value that is returned from the function. I use it when I have more than one returned value that occur at different points in the function definition. If the value I return is the last line of code, I don’t explicitly use return, however if there is another value earlier in the function that gets returned, I’m more likely to use return()
.
Here is an example of what explicit return functions would look like using a function we have already used and defined.
desc_stats <- function(variable,
statistic = c('sd', 'mean', 'median', 'quantile')) {
if(statistic == 'sd') {
return(sd(variable, na.rm = TRUE))
} else if(statistic == 'mean') {
return(mean(variable, na.rm = TRUE))
} else if(statistic == 'median') {
return(median(variable, na.rm = TRUE))
} else {
return(quantile(variable, na.rm = TRUE))
}
}
The function would behave the same as before, just now the values that could be returned are wrapped in the return()
function. I leave it to you to test the function to ensure that the values are indeed the same.