User Created Functions

When to create a function

Functions can be particularly useful when you are duplicating significant portions of your code. For example, perhaps you want to standardize various quantitative variables by subtracting each variable by the mean and dividing by the standard deviation. Below is the mathematics behind this operation.

\[ standardizedvar = \frac{var - mean(var)}{sd(var)} \]

Let’s use the midwest data that comes with the tidyverse to explore what code like this would look like.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
midwest
## # A tibble: 437 × 28
##      PID county  state  area poptotal popdensity popwhite popblack popamerindian
##    <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
##  1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
##  2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
##  3   563 BOND    IL    0.022    14991       681.    14477      429            35
##  4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
##  5   565 BROWN   IL    0.018     5836       324.     5264      547            14
##  6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
##  7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
##  8   568 CARROLL IL    0.027    16805       622.    16519      111            30
##  9   569 CASS    IL    0.024    13437       560.    13384       16             8
## 10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
## # … with 427 more rows, and 19 more variables: popasian <int>, popother <int>,
## #   percwhite <dbl>, percblack <dbl>, percamerindan <dbl>, percasian <dbl>,
## #   percother <dbl>, popadults <int>, perchsd <dbl>, percollege <dbl>,
## #   percprof <dbl>, poppovertyknown <int>, percpovertyknown <dbl>,
## #   percbelowpoverty <dbl>, percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

Suppose we wanted to standardize using the above equation, for the variables, poptotal, popdensity, popadults, and perccollege. Let’s do the computation for the variable poptotal first.

midwest %>%
  mutate(stand_poptotal = (poptotal - mean(poptotal)) / sd(poptotal)) %>%
  select(poptotal, stand_poptotal)
## # A tibble: 437 × 2
##    poptotal stand_poptotal
##       <int>          <dbl>
##  1    66090         -0.101
##  2    10626         -0.287
##  3    14991         -0.272
##  4    30806         -0.219
##  5     5836         -0.303
##  6    35688         -0.203
##  7     5322         -0.305
##  8    16805         -0.266
##  9    13437         -0.277
## 10   173025          0.258
## # … with 427 more rows

How would we move alter the code to do the computation for subsequent columns? You could rewrite the code, but I would likely copy and paste the code and adapt the pieces to the new column. Here I did it for the additional three variables.

midwest %>%
  mutate(stand_poptotal = (poptotal - mean(poptotal)) / sd(poptotal),
         stand_popdensity = (popdensity - mean(popdensity)) / sd(popdensity),
         stand_popadults = (popadults - mean(popadults)) / sd(popadults),
         stand_percollege = (percollege - mean(poptotal)) / sd(percollege)) %>%
  select(poptotal, popdensity, popadults, percollege, 
         stand_poptotal, stand_popdensity, stand_popadults, stand_percollege)
## # A tibble: 437 × 8
##    poptotal popdensity popadults percollege stand_poptotal stand_popdensity
##       <int>      <dbl>     <int>      <dbl>          <dbl>            <dbl>
##  1    66090      1271.     43298       19.6         -0.101          -0.238 
##  2    10626       759       6724       11.2         -0.287          -0.305 
##  3    14991       681.      9669       17.0         -0.272          -0.315 
##  4    30806      1812.     19272       17.3         -0.219          -0.168 
##  5     5836       324.      3979       14.5         -0.303          -0.362 
##  6    35688       714.     23444       18.9         -0.203          -0.311 
##  7     5322       313.      3583       11.9         -0.305          -0.363 
##  8    16805       622.     11323       16.2         -0.266          -0.323 
##  9    13437       560.      8825       14.1         -0.277          -0.331 
## 10   173025      2983.     95971       41.3          0.258          -0.0149
## # … with 427 more rows, and 2 more variables: stand_popadults <dbl>,
## #   stand_percollege <dbl>

Explore the code, notice any problems? When I copied and pasted, I missed changing the mean calculation for the percollege variable. This issue can be particularly difficult to debug given that there are no errors in the code. We could save some code duplication by writing our own function. This can save on errors in copy and pasting or errors on data entry. Let’s write a function that implements this, focusing less on function specifics as that will come later.

standardize <- function(x) {
  mean_var <- mean(x)
  sd_var <- sd(x)
  
  (x - mean_var) / sd_var
}

We could then use this function, standardize() inside of the mutate() function.

midwest %>%
  mutate(stand_poptotal = standardize(poptotal),
         stand_popdensity = standardize(popdensity),
         stand_popadults = standardize(popadults),
         stand_percollege = standardize(percollege)) %>%
  select(poptotal, popdensity, popadults, percollege, 
         stand_poptotal, stand_popdensity, stand_popadults, stand_percollege)
## # A tibble: 437 × 8
##    poptotal popdensity popadults percollege stand_poptotal stand_popdensity
##       <int>      <dbl>     <int>      <dbl>          <dbl>            <dbl>
##  1    66090      1271.     43298       19.6         -0.101          -0.238 
##  2    10626       759       6724       11.2         -0.287          -0.305 
##  3    14991       681.      9669       17.0         -0.272          -0.315 
##  4    30806      1812.     19272       17.3         -0.219          -0.168 
##  5     5836       324.      3979       14.5         -0.303          -0.362 
##  6    35688       714.     23444       18.9         -0.203          -0.311 
##  7     5322       313.      3583       11.9         -0.305          -0.363 
##  8    16805       622.     11323       16.2         -0.266          -0.323 
##  9    13437       560.      8825       14.1         -0.277          -0.331 
## 10   173025      2983.     95971       41.3          0.258          -0.0149
## # … with 427 more rows, and 2 more variables: stand_popadults <dbl>,
## #   stand_percollege <dbl>

Notice now we remove some, but not all of the duplication. Next week we will explore iteration in more detail that can remove this level of duplication.

Your Turn

  1. There is an R function, scale(), that does this standardization. Use this to recreate the standardization above.
  2. Will the standardize() function above work when there are missing values? How could the function be adapted to work with missing values?

Conditional summarise() or mutate() statements

One thing I want to show briefly is if you are attempting to apply a particular function for data verification and there is a function already defined in R, summarise_if() or mutate_if() can be particularly useful. Take an example of wanting to calculate the mean of each numeric column, but the data has a mix of numeric and categorical variables. You could write multiple computations within mutate() or summarise(), but summarise_if() and mutate_if() can be particularly useful. Below is an example of this strategy with the mpg data from dplyr.

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

Let’s calculate the mean of each numeric column with summarise_if().

summarise_if(mpg, is.numeric, mean, na.rm = TRUE)
## # A tibble: 1 × 5
##   displ  year   cyl   cty   hwy
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  3.47 2004.  5.89  16.9  23.4

The mutate_if() function could be used if these values wanted to be saved to the data. Additional level of detail could be added by using summarise_if() and group_by() in tandem.

mpg %>%
  group_by(manufacturer) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
## # A tibble: 15 × 6
##    manufacturer displ  year   cyl   cty   hwy
##    <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 audi          2.54 2004.  5.22  17.6  26.4
##  2 chevrolet     5.06 2005.  7.26  15    21.9
##  3 dodge         4.38 2004.  7.08  13.1  17.9
##  4 ford          4.54 2003.  7.2   14    19.4
##  5 honda         1.71 2003   4     24.4  32.6
##  6 hyundai       2.43 2004.  4.86  18.6  26.9
##  7 jeep          4.58 2006.  7.25  13.5  17.6
##  8 land rover    4.3  2004.  8     11.5  16.5
##  9 lincoln       5.4  2002   8     11.3  17  
## 10 mercury       4.4  2004.  7     13.2  18  
## 11 nissan        3.27 2004.  5.54  18.1  24.6
## 12 pontiac       3.96 2003.  6.4   17    26.4
## 13 subaru        2.46 2004.  4     19.3  25.6
## 14 toyota        2.95 2003.  5.12  18.5  24.9
## 15 volkswagen    2.26 2003.  4.59  20.9  29.2

These can be great, simple ways to explore the data at a first inspection.

Function basics

Every function has three basic components, a function name, a function definition (i.e. arguments), and the function specifics/computations or function body. The following code chunk attempts to show these basics.

function_name <- function(argument1, argument2) {
  
  function_body
  
}

The function() function is used to define a new function. Inside the parentheses of this are where the function arguments are defined. The arguments can contain default values or can be unspecified, more details on this later. The function body is where the details of what the function is going to do is contained. From the previous function we created, the computation to standardize a variable was perfomed within the function body. Finally, the function name is specified just like naming any other object.

Picking a Function Name

Functions names are an important step to think about for a bit at the beginning. A few general heuristics are useful to keep in mind when thinking about a function name.

  1. Do not use a name from a function already defined within R. For example, avoid things like: mean, c, t.test, etc.
  2. Create a semi-descriptive name that isn’t too detailed. For example, function names like f, s, tr, etc are not descriptive enough to be useful. Also, better to avoid particularly long names like standardize_variable.
  3. If you define multiple functions, be consistent in their naming convention. If you use underscores, use them for all of your function names. If you like dots or titleCase be consistent.

Function Arguments

Function arguments need to be named and I would recommend using similar heuristics to picking a function name. The only element that can be ignored is the first, you are able to reuse function arguments across functions. This would actually be encouraged across functions that are similar to ensure that the function arguments work similarly across the function.

Arguments with Default Values

Arguments in R can either have default values or can be left without default values. The big distinction to keep in mind when thinking about this is whether it is an argument that users should explicitly specify everytime (i.e. major user argument) or is an argument that users may only change on occasion and there is an obvious default that makes sense. For the former situation, I would only define the argument name with no default compared to the latter I would typically define a default value.

Consider the standardize() function we defined previously.

standardize <- function(x) {
  mean_var <- mean(x)
  sd_var <- sd(x)
  
  (x - mean_var) / sd_var
}

Notice that the function argument, x here is just named it does not have a default value. This was the case where this argument was specified to be the variable to be standardized, therefore having a default value did not make sense. The first function arguments are commonly do not have default values.

In a “Your Turn” section earlier, it was asked if this function would work with missing data. The short answer, is it would not, because the mean and sd would not be able to be calculated with missing values. For example, using the nycflights13 package used earlier, we could show that when missing data are present, the function breaks (i.e. produces NA values).

library(nycflights13)

flights %>%
  mutate(stand_arr_delay = standardize(arr_delay)) %>%
  select(arr_delay, stand_arr_delay)
## # A tibble: 336,776 × 2
##    arr_delay stand_arr_delay
##        <dbl>           <dbl>
##  1        11              NA
##  2        20              NA
##  3        33              NA
##  4       -18              NA
##  5       -25              NA
##  6        12              NA
##  7        19              NA
##  8       -14              NA
##  9        -8              NA
## 10         8              NA
## # … with 336,766 more rows

We could adapt this to include another function argument, that deals with how missing data should be handled.

standardize <- function(x, na.rm = TRUE) {
  mean_var <- mean(x, na.rm = na.rm)
  sd_var <- sd(x, na.rm = na.rm)
  
  (x - mean_var) / sd_var
}

flights %>%
  mutate(stand_arr_delay = standardize(arr_delay)) %>%
  select(arr_delay, stand_arr_delay)
## # A tibble: 336,776 × 2
##    arr_delay stand_arr_delay
##        <dbl>           <dbl>
##  1        11          0.0920
##  2        20          0.294 
##  3        33          0.585 
##  4       -18         -0.558 
##  5       -25         -0.715 
##  6        12          0.114 
##  7        19          0.271 
##  8       -14         -0.468 
##  9        -8         -0.334 
## 10         8          0.0247
## # … with 336,766 more rows

This function can now handle missing values through the argument, na.rm, which defaults to the value of TRUE and since the default value is TRUE, the missing data are automatically removed when computing the mean and sd. We could change the default value explicitly when we call the function.

flights %>%
  mutate(stand_arr_delay = standardize(arr_delay, na.rm = FALSE)) %>%
  select(arr_delay, stand_arr_delay)
## # A tibble: 336,776 × 2
##    arr_delay stand_arr_delay
##        <dbl>           <dbl>
##  1        11              NA
##  2        20              NA
##  3        33              NA
##  4       -18              NA
##  5       -25              NA
##  6        12              NA
##  7        19              NA
##  8       -14              NA
##  9        -8              NA
## 10         8              NA
## # … with 336,766 more rows

Note here I used the same argument name, na.rm, as is found within the mean and sd functions by default. This was to ensure that they are familiar to many users already comfortable with these function names.

Function arguments can also be specified as text strings, that will be discussed in more detail in the next section.

Conditional Functions

Conditional branching in functions are common, particularly when the functions attempt to do a few different things based on what the input values are. The general format for creating functions that use conditional logic are as follows:

if(some_condition) {
  # do something
} else {
  # do something else
}

This is an example of a branch with two different computations. For example, we could create a function that first calculates the skewness of a variable, if it is skewed returns the median otherwise computes the mean.

library(e1071)
mean_median <- function(variable, na.rm = TRUE) {
  var_skew <- skewness(variable, na.rm = na.rm)
  
  if(abs(var_skew) > 2) {
    descriptive <- median(variable, na.rm = na.rm)
  } else {
    descriptive <- mean(variable, na.rm = na.rm)
  }
  data.frame(skewness = var_skew, 
             descriptive = descriptive)
}

We could then use this function to calculate the statistic of interest for a few different variables.

mean_median(flights$arr_delay)
##   skewness descriptive
## 1 3.716783          -5
mean_median(gss_cat$tvhours)
##   skewness descriptive
## 1 2.915875           2
mean_median(diamonds$price)
##   skewness descriptive
## 1 1.618305      3932.8

Note here, that you need to name the variable explicitly, you can’t just do the following:

mean_median(arr_delay)

This is happening because the function is unable to find the variable, arr_delay as it does not know to look in the flights data. To rectify this you need to use the $ notation shown above or other ways of telling the function which data to use to find the correct variable (e.g., with() would also be an option).

Conditionals based on function arguments

You can also control flow of a function based on the arguments. For example, we could rewrite the mean_median() function to include another argument that lets users specify whether they want the mean or median computed rather than basing this decision on the skewness. This may look as follows.

mean_median_arg <- function(variable, statistic = 'mean', na.rm = TRUE,
                            skewness = FALSE) {
  
  if(statistic == 'mean') {
    descriptive <- mean(variable, na.rm = na.rm)
  } else {
    descriptive <- median(variable, na.rm = na.rm)
  }
  
  if(skewness) {
    skewness_stat <- skewness(variable, na.rm = na.rm)
  } else {
    skewness_stat <- NA
  }
  data.frame(skewness_stat, descriptive)
}

This function can now be tested using the variables we defined before:

mean_median_arg(flights$arr_delay)
##   skewness_stat descriptive
## 1            NA    6.895377

Notice by default, the skewness is not calculated as it is set to FALSE by default. The skewness could be returned by setting skewness = TRUE. The mean is also calculated by default, due to the default value of setting statistic = 'mean'. If we wanted the median instead, we can add this argument.

mean_median_arg(flights$arr_delay, statistic = 'median', skewness = TRUE)
##   skewness_stat descriptive
## 1      3.716783          -5

Note, the way the function was written, specifying anything other than statistic = 'mean' would compute the median.

mean_median_arg(flights$arr_delay, statistic = 'anything', skewness = TRUE)
##   skewness_stat descriptive
## 1      3.716783          -5

This behavior is not ideal, but may be okay for simple functions. If it would be a problem, some additional structure inside the function to test for the specific arguments that are possible could be defined and an error could be returned if the arguments specified are not appropriate.

Larger conditional branching

Conditinoal branching can take on more than two levels as well. The general structure using nested if statements would look like the following.

if(some_condition) {
  # do something computationally
} else if (another_condition) {
  # do another computation
} else if (yet_another_condition) {
  # do yet another computation
} else {
  # final computation
}

In the example above, there are four different paths that the function can take depending on the function input. Commonly, this behavior would use a function argument defined as a character string to navigate through the different branches.

For example, explore the following function.

desc_stats <- function(variable, 
                       statistic = c('sd', 'mean', 'median', 'quantile')) {
  
  if(statistic == 'sd') {
    sd(variable, na.rm = TRUE)
  } else if(statistic == 'mean') {
    mean(variable, na.rm = TRUE)
  } else if(statistic == 'median') {
    median(variable, na.rm = TRUE)
  } else {
    quantile(variable, na.rm = TRUE)
  }
}

We can then see what is returned for this function with different arguments.

desc_stats(flights$arr_delay, statistic = 'sd')
## [1] 44.63329
desc_stats(flights$arr_delay, statistic = 'mean')
## [1] 6.895377
desc_stats(flights$arr_delay, statistic = 'median')
## [1] -5
desc_stats(flights$arr_delay, statistic = 'quantile')
##   0%  25%  50%  75% 100% 
##  -86  -17   -5   14 1272

Note, when defined this way, the statistic argument must always be specified, but when including the argument values in the function definition, the function can be more specific.

Using switch() function

This behavior can make functions difficult to read and debug, therefore using the switch() function can be a nice alternative if there are many different values for a function and each just calls a single function. We can redefine the previous function using switch(). For the switch() function, the argument name is passed as the first argument and subsequent arguments to switch() are function calls based on the text of the arguments (shown before the equal sign) followed by the function or computation to perform.

desc_stats_sw <- function(variable, 
                          statistic = c('sd', 'mean', 'median', 'quantile')) {
  
  switch(statistic,
         sd = sd(variable, na.rm = TRUE),
         mean = mean(variable, na.rm = TRUE),
         median = median(variable, na.rm = TRUE),
         quantile = quantile(variable, na.rm = TRUE),
         stop("Unknown statistics argument"))
}

Then to test the function.

desc_stats_sw(flights$arr_delay, statistic = 'sd')
## [1] 44.63329
desc_stats_sw(flights$arr_delay, statistic = 'mean')
## [1] 6.895377
desc_stats_sw(flights$arr_delay, statistic = 'median')
## [1] -5
desc_stats_sw(flights$arr_delay, statistic = 'quantile')
##   0%  25%  50%  75% 100% 
##  -86  -17   -5   14 1272
desc_stats_sw(flights$arr_delay, statistic = 'variance')
## Error in desc_stats_sw(flights$arr_delay, statistic = "variance"): Unknown statistics argument

Notice that now I also flagged the function to throw an error if an argument name does not match the statistic argument definition.

Thinking about Returned Values

I only want to touch briefly on return values. These are commonly found at the end of the function and I usually just print them out to the console. Here I show the first function, standardize() for reference. Notice that the last line of the function is the resulting statistic being printed. That is, the (x - mean_var) / sd_var was not saved to an object therefore would print directly to the console. This is the return value for the function.

standardize <- function(x) {
  mean_var <- mean(x)
  sd_var <- sd(x)
  
  (x - mean_var) / sd_var
}

The return() function

The return() function is useful to be explicit about the value that is returned from the function. I use it when I have more than one returned value that occur at different points in the function definition. If the value I return is the last line of code, I don’t explicitly use return, however if there is another value earlier in the function that gets returned, I’m more likely to use return().

Here is an example of what explicit return functions would look like using a function we have already used and defined.

desc_stats <- function(variable, 
                       statistic = c('sd', 'mean', 'median', 'quantile')) {
  
  if(statistic == 'sd') {
    return(sd(variable, na.rm = TRUE))
  } else if(statistic == 'mean') {
    return(mean(variable, na.rm = TRUE))
  } else if(statistic == 'median') {
    return(median(variable, na.rm = TRUE))
  } else {
    return(quantile(variable, na.rm = TRUE))
  }
}

The function would behave the same as before, just now the values that could be returned are wrapped in the return() function. I leave it to you to test the function to ensure that the values are indeed the same.

Previous