Data Project One

100 points

Due Sept. 18 at midnight

Below are the questions for the first data practical assignment. This project uses the “FossilAnts.csv” file, located in the data directory for the project. The point value of each question is denoted next to it. A blank cell is below each for your answer; feel free to create more blank cells as needed.

  1. (5 pts) Create a directory called projects_lastname (sub your last name for lastname), and in it, a subdirectory called project_one_lastname. Use download.files to download the instructions: https://raw.githubusercontent.com/BiologicalDataAnalysis2019/2020/master/vignettes/ProjectOne.Rmd

and the data

https://raw.githubusercontent.com/BiologicalDataAnalysis2019/2020/master/projects/project_one/data/FossilAnts.csv

  1. 5 pts. Import the tidyverse package and load the data using read_csv. The data for this part of the practical is located in the data directory. Save the data in a variable called project_dat. Print the data to the screen to ensure it loaded correctly.
#Enter Your Answer Here
  1. (5pts) Check the datatypes of each column. There is a column called reference number. This is a static identifier - it should not be changed, and is an indexer used to identify this specimen uniquely. Do we want to treat it as an integer (this is an opinion question - answers may vary).
# Answer here
  1. (5 pts) Change the datatype of the reference number column to character. Hint: use mutate. Look at the as.character function.
#Answer Here
  1. (5 pts) Look at your data. What are the missing data values? In particular, have a look at the Tribe column. In your opinion, are these intelligent missing values for the dataset? Why or why not? If not, how would you like to change them?
#Answer here
  1. (5pts) Now do it - replace your missing values with these more logical missing values. Look at the function na_if, which replaces nonstandard NA values. Please first look at the help page for na_if.
#Answer here
  1. (5 points) Are there any columns in your dataset that contain two pieces of data? If so, identify them.
# Answer here
  1. (5pts) Please look at the help page for the separate function.
# Show how you would pull up the help
  1. (10 pts) How could you separate one column into two?
#Answer here

Next, we will test a hypothesis. Your hypothesis is that there are more specimens in the 75 million years ago (mya) - 100 mya interval than the 30 mya to the present interval.

  1. (5 pts) Write out the steps you would take to address this question. Will you need to split up the data? Will you need to group the data based on the values in some column?
#Answer Here
  1. (15 pts) Perform the operations you described in (8).
#Answer here
  1. (5 pts) Do the results of your code support the hypothesis?
#Answer here
  1. (10 pts) Save the dataframe with the split columns into a new directory called project_one_data_output_lastname. Save it as a csv file called “column_separated.csv”
# Answer here
  1. (10 pts) What is the most numerous subfamily? Print the subfamilies in descnding order.
#Answer here

Grad Students

Do the undergrad part of the exam. It’s actually kind of hard? Find your name below for the additional part of the exam to complete.

Tyler

  1. Replace your ‘?’ with NAs.

  2. Convert your date column to a date object with lubridate. Store as a year column, month column, day column.

  3. Save in your project one folder as “tyler_modified_data.csv”

Ariel

  1. If you were to group by Oiling Category, and calculate, say, the mean above ground biomass, some of your categories are going to be near zero because of all the zeroes. Replace your 0s with NAs. (Note - I’m aware I might have just described the correct outcome of your analysis. Humor me anyway.)

  2. Convert your date column to a date object with lubridate. Store as a year column, month column, day column.

  3. Save in your project one folder as “ariel_modified_data.csv”

Tess & Nick

  1. Which site has the highest mean diameter for the Acer rubrum in 2016? Does the same site have the largest diameter all years?

  2. Which tree has the largest average diameter?

Victoria

I’m going to have you do something a little different. You’re going to reshape your dataset. Ultimately, you will have four columns: Treatment, Specimen, Date Entered, Days until death. The final columns should be how many days the animal lived before death. If it was entered 3/13/20 and made it to 3/15/20, put ‘2’.

Export as a CSV and put on your RStudio server. Which treatment had the highest average days until death?

Josh

  1. I’m looking at the SirenSpecimenRecords spreadsheet. Make sure when you read it in, the blanks read in as NA. If not, replace them.

  2. Split your species column into genus and species columns, but retain what you’re currently calling species as “Specific Epithet”.

  3. Save in your project one folder as “josh_modified_data.csv”

Elisabeth

These spreadsheets are great. Very consistent.

  1. What is the most commonly-collected species on here. I’m not sure if any of the species names are shared across genera (ie Genus_species, Genus1_species). If so, you might consider using unite() to put the genus and species columns into one cell before counting.

  2. Which site has the most observations?

Brittany and Claire

  1. There is a package loaded we haven’t used yet, readexl
library(readxl)

See if you can read in the measurements spreadsheet in your Excel worksheet. Pick two or three columns in the ratios spreadsheet and see if you can reproduce them in R here:

  1. Save the measurements spreadsheet as a csv file.