Assignment 5

User Created Functions and Bootstrap: 10 pts

Due: May 8th, 2022 - No penalty for late submissions, but due no later than May 8th.


For this assignment, you will write your own functions to perform a bootstrap. The assignment will guide you through the elements to include and the functions to look into when building the bootstrap function. Note that we have not specifically covered the bootstrap in this course, so if any elements are unclear, please reach out for assistance.

Please turn in the source file (the .Rmd file) as well as the compiled version (html). Create a new .Rmd file for this assignment. If a question asks for code or there are particular elements of code you’d like to share, you can include those in the final compiled version, but please limit those sections to specific elements rather than long chunks of code. Long chunks of code are best left for the Rmd document. Assignments will be uploaded to ICON.

Any graphics you create should be of high quality; this includes formatting of axes, axis labels, etc. If none of the graphics are of high quality, a 2 pt penalty will apply.

Bootstrap Procedure

  1. First, we treat the sample data as fixed and as the population of interest. For this reason, we hope the original sample data collected is representative of the population of interest.
  2. Generate a random sample of the same size as the original sample, with replacement, from the sample data.
  3. Compute a statistic of interest. This could be a summary statistic (e.g., mean or median) or an estimate from a fitted statistical model.
  4. Repeat steps 2 and 3 many times.
  5. Visualize or compute descriptive statistics about the multiple statistics of interest computed from step 3.
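
The five steps above can be sketched on a small toy vector. This is only an illustration under stated assumptions: the values in toy_sample are made up, one_boot_mean() is a hypothetical helper name, and base R's sample() and replicate() are used here rather than the tidyverse tools the questions below point to.

```r
# Step 1: treat the observed sample as the population (toy values, made up)
set.seed(2022)
toy_sample <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.5, 4.9, 3.2)

# Steps 2 and 3: resample with replacement at the original size,
# then compute the statistic of interest (here, the mean)
one_boot_mean <- function(x) {
  resampled <- sample(x, size = length(x), replace = TRUE)
  mean(resampled)
}

# Step 4: repeat steps 2 and 3 many times
boot_means <- replicate(1000, one_boot_mean(toy_sample))

# Step 5: describe and visualize the bootstrap distribution
summary(boot_means)
sd(boot_means)
hist(boot_means,
     main = "Bootstrap distribution of the mean",
     xlab = "Resampled mean")
```

The standard deviation of boot_means is the bootstrap estimate of the standard error of the mean for this toy sample.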

Code to run prior to starting the assignment.

To keep track of the bootstrap procedure, let’s add some IDs to the individual penguins from the penguins data.

library(palmerpenguins)
library(tidyverse)

penguins <- penguins %>%
  mutate(penguin_id = 1:n())

Questions

  1. Create a function that generates a random sample, with replacement, from the original data, penguins from the palmerpenguins package. More specifically, create a function that accomplishes step 2 from the bootstrap procedure discussion above. Exploring the sample_n() function may be helpful for this question. For this question, provide the code for the bootstrap function you created and also show the first few rows of the bootstrapped data when you run the function once. The head() function may be helpful to show just the first few rows. 1 pt

  2. From a single bootstrap (i.e., using your function from #1), verify that the sampling with replacement is occurring. Use the penguin_id attribute created from the code used prior to starting the assignment to show that sampling with replacement is working properly. What would you expect if sampling with replacement is working? 1 pt

  3. Given the function created in #1, let’s add a function argument to the function. Add a function argument that allows users to control whether sampling with replacement or without replacement is performed. For this question, the function code is sufficient. 1 pt

  4. Verify that the function argument created in #3 is working properly by running the function specifying the argument to do the sampling with and without replacement. Use the penguin_id attribute created from the code used prior to starting the assignment to show that the function argument is working properly. What difference in the output helps to show that the function argument is working properly? 1 pt

  5. Add an argument that uses branching logic to select which statistic is computed in step 3 of the bootstrap procedure above. You can use any statistics you wish, but to simplify this step, compute them for a single continuous attribute from the penguins data (you can choose the attribute you are most interested in). This would make use of the if and else statements discussed in the user-written functions portion of the course. The code is sufficient for this question. Note: make sure your function returns the statistic of interest. 1 pt

  6. Verify that the branching logic and different statistics are working properly from step #5. Run the function for each branch of the function to confirm that different statistics are being computed. 1 pt

  7. Replicate the bootstrap procedure many times using the following general structure:

map_dbl(1:num_reps, function_name, your_function_arguments)

where num_reps is the number of bootstrap replications, function_name is the name of the function you created as part of this assignment, and your_function_arguments are any function arguments you need to pass. You may not need to pass any function arguments to the function above. Note that if your function returns a data frame, you may need to replace the map_dbl() function above with map_dfr().

For this question, pick two different numbers of bootstrap replications and save the two resulting objects. Just the code is sufficient for this question. 1 pt

  8. Visualize the bootstrap results for both numbers of bootstrap replications chosen in #7 using a histogram, density curve, violin plot, boxplot, or some other similar visualization to explore the distribution of the bootstrapped statistic. Are there any differences in the two distributions based on the different numbers of replications chosen? Why might these differences be showing up? Which number of replications would be more defensible in your opinion? 3 pts
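
As a hint for the general structure in #7, note that map_dbl() passes each element of 1:num_reps to your function as its first argument, so the function needs a first argument to receive it even if that value is never used. The sketch below shows the calling pattern on a toy vector. The names boot_stat, i, and data, and the values in toy_sample, are hypothetical and chosen only for illustration; your own function will work on the penguins data and may take different arguments.

```r
library(purrr)

set.seed(2022)
toy_sample <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.5, 4.9, 3.2)  # made-up values

# Hypothetical function: `i` receives the replication index supplied by
# map_dbl() and is otherwise unused; `data` is passed through by name.
boot_stat <- function(i, data) {
  resampled <- sample(data, size = length(data), replace = TRUE)
  median(resampled)
}

num_reps <- 500

# One value per replication, collected into a numeric vector
boot_medians <- map_dbl(1:num_reps, boot_stat, data = toy_sample)
length(boot_medians)
```

If the function is written without an argument to receive the index, the call fails or the index is silently matched to the wrong argument, which is a common source of confusion with this pattern.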