ProjectOne • compbio2020

Data Project One

100 points

DUE Sept 15 at 11:59 PM

Below are the questions for the first data practical assignment. This project uses the “FossilAnts.csv” file, located in the data directory for the project. The point value of each question is given next to it. A blank cell follows each question for your answer; feel free to create more blank cells as needed.

(5 pts) Create a directory called projects, and in it, a subdirectory called project_one. Use download.file to get this file, and this one, and save them to the project_one directory.
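
As a rough sketch of how these two steps fit together in base R, the snippet below creates the nested directories and then attempts a download. The URL is a hypothetical placeholder, not the real link from the assignment page, so the download call is wrapped so that a failed request does not stop the script.

```r
# Create projects/project_one in one call; recursive = TRUE builds both levels.
project_dir <- file.path("projects", "project_one")
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Hypothetical placeholder URL: substitute the actual link from the assignment.
data_url <- "https://example.com/FossilAnts.csv"
dest <- file.path(project_dir, "FossilAnts.csv")

# download.file() returns 0 on success; errors and warnings are caught here
# because the placeholder URL is not expected to resolve to a real file.
downloaded <- tryCatch(
  download.file(url = data_url, destfile = dest, quiet = TRUE) == 0,
  error = function(e) FALSE,
  warning = function(w) FALSE
)
```

Using file.path() rather than pasting strings keeps the path separators correct on any operating system.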

(5 pts) Load the tidyverse package and load the data. The data for this part of the practical is located in the data directory. Save the data in a variable called project_dat. Print the data to the screen to ensure it loaded correctly.

#Enter Your Answer Here

(5 pts) Check the data type of each column. There is a column called reference number. This is a static identifier - it should not be changed, and is an index used to identify each specimen uniquely. Do we want to treat it as an integer? (This is an opinion question; answers may vary.)

# Answer here

(5 pts) Change the datatype of the reference number column to character. Take a peek at the function as.character().
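
To illustrate what as.character() does, here is a minimal sketch on made-up toy data, not the assignment data. The column name reference_no is an assumption; check the actual column name in your own data frame.

```r
library(dplyr)

# Made-up toy data; `reference_no` is an assumed column name.
toy <- tibble(
  reference_no = c(1001L, 1002L, 1003L),
  taxon = c("a", "b", "c")
)

# Overwrite the integer column with its character representation.
toy <- toy %>%
  mutate(reference_no = as.character(reference_no))

# Confirm the new type of the column.
class(toy$reference_no)
```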

#Answer Here

(5 pts) Look at your data. What are the missing data values? In particular, have a look at the Tribe column. In your opinion, are these intelligent missing values for the dataset? Why or why not? If not, how would you like to change them?

#Answer here

(5 pts) In the surveys dataset, we have genus and species split between two columns. Here, we have them combined. What are the pros and cons of the way we have recorded taxa in these two datasets?

# Answer here

(5 pts) Please look at the help page for the separate function.

# Show how you would pull up the help

(10 pts) How could you separate one column into two?
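
For orientation, the sketch below shows the general shape of a separate() call on a made-up toy table. The column name taxon, the new column names, and the underscore delimiter are all assumptions for illustration; inspect your own data to see what applies.

```r
library(tidyr)

# Made-up toy data; the `taxon` column and "_" delimiter are assumptions.
toy_taxa <- tibble::tibble(
  taxon = c("Formica_rufa", "Lasius_niger")
)

# Split one column into two at the delimiter; the original column is dropped.
split_taxa <- separate(
  toy_taxa,
  col = taxon,
  into = c("genus", "species"),
  sep = "_"
)

split_taxa
```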

#Answer here

Next, we will test a hypothesis. Your hypothesis is that there are more specimens in the 75 to 100 million years ago (mya) interval than in the 30 mya to present interval.

(5 pts) Write out the steps you would take to address this question. Will you need to split up the data? Will you need to group the data based on the values in some column?

#Answer Here

(15 pts) Perform the operations you described in the previous question.

#Answer here

(5 pts) Do the results of your code support the hypothesis?

#Answer here

(10 pts) Save the data frame with the split taxon columns into a new directory called project_one_data_output. Save it as a CSV file called “column_separated.csv”.

# Answer here

(10 pts) Produce a histogram of the number of specimens by minimum age. Look at the geom_histogram() help.
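
As a starting point, this sketch draws a histogram from made-up ages, not the fossil data. The column name min_ma and the bin width of 10 are assumptions chosen only to show how geom_histogram() is layered onto a plot.

```r
library(ggplot2)

# Made-up toy ages in millions of years; `min_ma` is an assumed column name.
toy_ages <- data.frame(min_ma = c(5, 12, 12, 30, 45, 45, 45, 80, 95))

# geom_histogram() bins the x variable and counts the rows in each bin.
p <- ggplot(toy_ages, aes(x = min_ma)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Minimum age (mya)", y = "Number of specimens")

print(p)
```

The binwidth argument controls how coarse the age intervals are; try a few values to see how the shape of the distribution changes.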

#Answer here

(5 pts) Change the font size on your histogram so that it can be read comfortably on your computer screen from five feet away. Save the file as “large_font.pdf”.

Finally, produce a histogram of counts for each subfamily.

Graduate Students

You will do the above steps with the classroom dataset.
Next, you will choose three data steps from above to perform with your own data. These could be grouping, mutating, cleaning NA values, visualization, or any combination of these.
Why did you choose these steps? What purpose do they serve for your thesis and/or other research projects?