Assignment 3

Exploratory Data Analysis and Data Restructuring: 10 pts

Due: Around March 27th, 2021 - No penalty for late submissions, but due no later than May 8th.


In this assignment, you will perform exploratory data analysis using the gss_cat data from the forcats package. You do not need to answer the research question explicitly; it is there to guide the first part of the assignment.

You also do not need to include the code for most questions in the compiled (html) version since the Rmd will be turned in as well. If a question asks for code or there are particular elements of code you’d like to share, you can include those in the final compiled version, but please limit those sections to specific elements rather than long chunks of code. Long chunks of code are best left for the Rmd document.

Turn in both the source file (.Rmd) and the compiled version (html). Please create a new Rmd document for this assignment rather than continuing the one from the first assignment. Submit the completed assignment, including the Rmd and html files, to ICON.

All graphics should be of high quality; this includes formatting of axes, axis labels, etc. If none of the graphics are of high quality, a 2 pt penalty will apply over and above any item-specific reductions.
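As one illustration of what formatted axes and labels can look like, here is a minimal ggplot2 sketch. It uses the built-in mtcars data purely as a stand-in; the label text, axis breaks, and theme are illustrative choices, not requirements of the assignment.

```r
library(ggplot2)

# mtcars ships with R and stands in for whatever data you are plotting.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.7) +
  # Explicit breaks so the axis ticks fall on round values.
  scale_x_continuous(breaks = seq(1, 6, by = 1)) +
  # Readable labels with units instead of raw variable names.
  labs(
    x = "Weight (1,000 lbs)",
    y = "Fuel economy (miles per gallon)",
    title = "Heavier cars tend to have lower fuel economy"
  ) +
  theme_bw(base_size = 14)

p
```

The same pattern (descriptive labs(), deliberate axis breaks, a legible base font size) carries over to any plot of the assignment data.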

Research Questions

The following research question guides the assignment, but you do not need to answer it directly. The questions below refer to it.

  1. How is age related to the amount of television a person reports watching?

Questions

  1. Do there appear to be patterns in the missing data for the variable tvhours across different income and age levels? Provide evidence for your reasoning. 1 pt

  2. Provide a descriptive analysis exploring the research question above. Does age appear to be related to the amount of time a person spends watching television? Ensure in your discussion you provide justification for why you feel one way or the other and be as descriptive as possible. 1 pt
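One common way to look for patterns in missing data is to compute the proportion of missing values within each level of a grouping variable. The sketch below shows that idea on a small made-up data frame; the column names group and hours are hypothetical and are not variables from gss_cat.

```r
library(dplyr)

# Hypothetical toy data; `group` and `hours` are illustrative names only.
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b", "b"),
  hours = c(2, NA, 4, NA, NA, 1, 3)
)

# Count and proportion of missing values within each group.
missing_by_group <- toy %>%
  group_by(group) %>%
  summarise(
    n = n(),
    n_missing = sum(is.na(hours)),
    prop_missing = mean(is.na(hours))
  )

missing_by_group
```

For a continuous grouping variable, one option is to bin it first (for example with cut()) so that each group holds enough observations for the proportions to be meaningful.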


Data Import, Restructuring, Joining

In this part of the assignment, you will import data and perform data manipulations on this data file.

Questions

  1. Read in the “ECLS_6250.csv” data file from the course website.

    • Using the head() function, print the first few rows of the data and using the dim() function, print out the dimensions of the data. 1 pt
  2. Verify that values that are truly missing are read in as missing values. Use the “ECLS_6250.pdf” codebook (found on the course website) to confirm the missing-data codes, which are listed for each variable directly in the codebook.

    • Compute the mean of the following two variables: “C4R4MSCL” and “W3SESL”. 1 pt
  3. Using the tidyr package, convert this data into an extra long format where the variables CHILDID, KURBAN_R, GENDER, and RACE are the id variables. The other data attributes would all represent data values to be restructured.

    • Using the head() function, print the first few rows of the data and using the dim() function, print out the dimensions of the data. Note, you should have 6 columns when you are done with this step. 1 pt
  4. Using the restructured data from #3 above, create three new variables that represent the type of variable (S = School, C = Child, W = Family), the wave number, and the remaining information. The type of variable and the wave number are the first and second characters of the variable names that were restructured (i.e. stacked) in #3. Hint: the separate() function would be useful for this step, and exploring some random rows of the data with head(), tail(), or sample_n() may be helpful.

    • Using the head() function, print the first few rows of the data and using the dim() function, print out the dimensions of the data. Note, you should have 8 columns when you are done with this step. 1 pt
  5. Using the restructured data from #4, combine the variable type (i.e., the attribute that is either S, C, or W) and left-over names. This new variable should be similar to the original variable names from when reading in the data from #1, but should not include the wave information in it. Hint, the unite() function should be helpful for this step.

    • Using the head() function, print the first few rows of the data and using the dim() function, print out the dimensions of the data. Note, you should have 7 columns when you are done with this step. 1 pt
  6. Finally, using the restructured data from #5, widen the data set to create a variable for each unique value from the newly created variable from #5.

    • Using the head() function, print the first few rows of the data and using the dim() function, print out the dimensions of the data. Note, you should have 9 columns and fewer rows than #5 when you are done with this step. 1 pt
  7. Read in the “ECLS_6250_school.csv” data file found on the course website. Merge this school data into the restructured child-level data created in #6. Use the type of join that leaves the number of rows in the child-level data unchanged. More specifically, the final data should have the same number of rows as #6 but with 5 new columns.

    • Using the head() function, print the first few rows of the merged data and using the dim() function, print out the dimensions of the merged data. 1 pt
  8. This question has a number of steps which are highlighted below in more detail.

    1. Identify the 25 schools at wave 1 that have the highest proportion of female students and create a new data file that has the school ID and proportion of female students in the school. Note: Use the codebook to identify which code represents males and females. It may also be helpful, although not necessary, to create a new variable for gender.
    2. Perform a filtering join that returns only the rows from the child level data (i.e., the final data from #7) at wave 1 that belong to the 25 schools that had the highest proportion of females at wave 1 (from step one of this question).
    • Using the head() function, print the first few rows of the data and using the dim() function, print out the dimensions of the data. 1 pt
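The restructuring and joining steps above can be sketched on a tiny made-up data set. Everything here is an assumption for illustration: the column names (id, school, C1SCORE, and so on), the missing-data code -9, and the school table are hypothetical and are not taken from the ECLS files or codebook. The sketch only shows how the tidyr and dplyr verbs fit together and how the dimensions change at each step.

```r
library(dplyr)
library(tidyr)

# Hypothetical CSV content; -9 plays the role of a missing-data code.
csv_text <- "id,school,C1SCORE,C2SCORE,W1INCOME,W2INCOME
1,10,50,52,3,3
2,10,55,-9,4,5
3,20,60,61,5,5"

# Declare the missing code on import so it becomes NA rather than a number.
wide <- read.csv(text = csv_text, na.strings = c("NA", "-9"))
mean(wide$C2SCORE, na.rm = TRUE)

# 1. Stack every non-id column into name/value pairs.
long <- wide %>%
  pivot_longer(cols = -c(id, school), names_to = "variable", values_to = "value")

# 2. Split each name by character position: 1st char, 2nd char, remainder.
parts <- long %>%
  separate(variable, into = c("type", "wave", "rest"), sep = c(1, 2))

# 3. Glue type and remainder back together, leaving wave in its own column.
united <- parts %>%
  unite("measure", type, rest, sep = "")

# 4. Widen again: one row per id and wave, one column per measure.
by_wave <- united %>%
  pivot_wider(names_from = measure, values_from = value)

# Hypothetical school-level table keyed by `school`.
schools <- tibble(
  school = c(10L, 20L, 30L),
  sector = c("public", "private", "public")
)

# Mutating join: keeps every row of by_wave and adds the school columns.
joined <- left_join(by_wave, schools, by = "school")

# Filtering join: keeps only rows whose school appears in `keep`,
# without adding any columns.
keep <- tibble(school = 20L)
filtered <- semi_join(by_wave, keep, by = "school")

head(joined)
dim(joined)
```

Checking dim() after each step, as the questions ask, is a quick way to confirm that a reshape or join did what was intended: a left join should leave the row count unchanged, while a semi join should only ever reduce it.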