10_GBIF_and_Location.Rmd
This week, we will be starting with geospatial data and mapping. Today, we will begin with locality data in GBIF. GBIF is a website for aggregating location data from other sources.
We will work with two packages today. One is RGBIF, an interface to GBIF data maintained by the ROpenSci group.
install.packages("rgbif")
install.packages("leaflet")
Later in this lesson, we will also use the rotl
package we’ve seen last week.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
First, we’re going to try an example for one ant. We will query locality data and map it in a few ways. Then, you will figure out how to make the process of mapping data iterable over a set of taxa.
name_suggest("Atta mexicana")
## Records returned [1]
## No. unique hierarchies [0]
## Args [q=Atta mexicana, limit=100, fields1=key, fields2=canonicalName,
## fields3=rank]
## # A tibble: 1 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 5035745 Atta mexicana SPECIES
Next, we’re going to use occ_search
to search the GBIF database for where these things are occurring.
occurences <- occ_search(taxonKey = 5035745, limit = 20)
Next, we will filter the resultant dataset to name and lat and longitude data.
no_na <- occurences$data %>%
select(scientificName, decimalLatitude, decimalLongitude) %>%
drop_na()
And finally, we will plot the data using leaflet
k <- leaflet::leaflet(no_na) %>%
addTiles() %>%
addMarkers(~decimalLongitude, ~decimalLatitude, popup = no_na$scientificName)
This looks much like plotting with ggplot
and tidyverse. That’s because leaflet is based on the same principles. We establish a canvas and the data in the first line, then we add tiles (the actual map we will use), and then we add our points. When we think about it, this is quite similar to establishing a ggplot canvas, the axes, and the points to plot.
One point is easy. How often do we only want one point, though? Probably, we will want to plot several species. First, with a partner, work out how to get GBIF ids for a set of taxa.
ants <- c("Martialis", "Atta", "Ectatomma", "Tatuidris", "Aneuretus", "Xymmer", "Aenictus")
#Get GBIF ids for our vector of species.
ids <-c()
for (ant in ants){
print(ant)
raw <- name_suggest(ant)
print(raw)
ids[ant] <- c(raw$data$key)
}
## [1] "Martialis"
## Records returned [74]
## No. unique hierarchies [0]
## Args [q=Martialis, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 74 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 7744356 Martialis GENUS
## 2 9599316 ? martialis UNRANKED
## 3 7437085 Martialis heureka SPECIES
## 4 5977251 Erynnis martialis SPECIES
## 5 4544067 Aculus martialis SPECIES
## 6 5168773 Hahnia martialis SPECIES
## 7 5025232 Agathis martialis SPECIES
## 8 5510493 Globulina martialis SPECIES
## 9 2829565 Habenaria martialis SPECIES
## 10 4499030 Entedon martialis SPECIES
## # … with 64 more rows
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
## [1] "Atta"
## Records returned [100]
## No. unique hierarchies [0]
## Args [q=Atta, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 100 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 6006602 Atta GENUS
## 2 1323108 Atta GENUS
## 3 5035740 Atta silvae SPECIES
## 4 5035749 Atta domicola SPECIES
## 5 8368352 Atta penetrans SPECIES
## 6 5035739 Atta robusta SPECIES
## 7 5035737 Atta goiana SPECIES
## 8 8166766 Atta diligens SPECIES
## 9 7390841 Atta scalpturata SPECIES
## 10 5035746 Atta saltensis SPECIES
## # … with 90 more rows
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
## [1] "Ectatomma"
## Records returned [87]
## No. unique hierarchies [0]
## Args [q=Ectatomma, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 87 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 1314135 Ectatomma GENUS
## 2 1314136 Ectatomma gibbum SPECIES
## 3 1314147 Ectatomma edentatum SPECIES
## 4 9175376 Ectatomma rimulosum SPECIES
## 5 1314150 Ectatomma tuberculatum SPECIES
## 6 4673125 Ectatomma rothneyi SPECIES
## 7 4673127 Ectatomma socrus SPECIES
## 8 7570313 Ectatomma parasiticum SPECIES
## 9 8217186 Ectatomma tornatum SPECIES
## 10 8224209 Ectatomma dolo SPECIES
## # … with 77 more rows
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
## [1] "Tatuidris"
## Records returned [3]
## No. unique hierarchies [0]
## Args [q=Tatuidris, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 3 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 1319364 Tatuidris GENUS
## 2 1319365 Tatuidris tatusia SPECIES
## 3 9541003 Tatuidris kapasi SPECIES
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
## [1] "Aneuretus"
## Records returned [4]
## No. unique hierarchies [0]
## Args [q=Aneuretus, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 4 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 1320037 Aneuretus GENUS
## 2 1320038 Aneuretus simoni SPECIES
## 3 1737071 Decodes aneuretus SPECIES
## 4 1590997 Bibio aneuretus SPECIES
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
## [1] "Xymmer"
## Records returned [3]
## No. unique hierarchies [0]
## Args [q=Xymmer, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 3 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 4674293 Xymmer GENUS
## 2 10000176 Xymmer ambvky SPECIES
## 3 8741954 Xymmer phungi SPECIES
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
## [1] "Aenictus"
## Records returned [100]
## No. unique hierarchies [0]
## Args [q=Aenictus, limit=100, fields1=key, fields2=canonicalName, fields3=rank]
## # A tibble: 100 × 3
## key canonicalName rank
## <int> <chr> <chr>
## 1 1317499 Aenictus GENUS
## 2 1317552 Aenictus vagans SPECIES
## 3 1317644 Aenictus lifuiae SPECIES
## 4 1317647 Aenictus fergusoni SPECIES
## 5 1317629 Aenictus obscurus SPECIES
## 6 9091750 Aenictus maneerati SPECIES
## 7 9534984 Aenictus fuchuanensis SPECIES
## 8 9055270 Aenictus punctatus SPECIES
## 9 9265079 Aenictus nishimurai SPECIES
## 10 1317525 Aenictus binghami SPECIES
## # … with 90 more rows
## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length
Now, we will get the actual specimen occurrences for these ants. Practice using an apply function to do this. Inspect the object you’ve queried when you’ve completed your loop. Have we dealt with an object like this before?
ids <-c()
for (ant in ants){
raw <- name_suggest(ant)
occ_search(taxonKey = raw$data$key, limit = 10)
}
This is a vector or dataframes. Each individual query made it’s own dataframe. For simplicity, lets combine them into one dataframe object.
#Combine the resultant dataframes into one large dataframe
mega_df <- bind_rows(.id = "column_label")
for (ant in ants){
print(df$ant$data)
}
Since we will be plotting these with a map, we need complete data in both columns - lat and long. Let’s drop any columns without data in both.
#Drop rows with NA values in the lat and long
no_na <- mega_df %>%
select(scientificName, decimalLatitude, decimalLongitude) %>%
drop_na()
Now let’s take a peek at our results!
# Plot the dataframe of observations
k <- leaflet::leaflet(no_na) %>%
addTiles() %>%
addMarkers(~decimalLongitude, ~decimalLatitude, popup = no_na$scientificName)
Below are six code snippets to do different types of map-based plotting. With a partner, discuss what you think would be interesting or important to view in these data. Choose a block of code to modify to accomplish your visualization. We will reconvene in about 30 minutes and do a Round Robin showing everyone’s maps. Have Fun!
leaflet(no_na) %>%
addTiles() %>%
addCircles(~decimalLongitude, ~decimalLatitude)
leaflet(no_na) %>%
addTiles() %>%
addCircleMarkers(~decimalLongitude, ~decimalLatitude, radius = runif(100, 4, 10), color = c('red'))
leaflet(no_na) %>%
addTiles() %>%
addMarkers(~decimalLongitude, ~decimalLatitude, clusterOptions = markerClusterOptions())
pal <- colorBin(
palette = "Blues",
no_na$scientificName,
pretty = TRUE)
## Warning in pretty.default(domain %||% x, n = bins): NAs introduced by coercion
levs <- factor(no_na$scientificName)
factpal <- colorFactor(topo.colors(5), levs)
no_na %>%
group_by(scientificName) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(
~decimalLongitude,
~decimalLatitude,
color = ~factpal(scientificName),
stroke = FALSE, fillOpacity = 0.5
)
pal <- colorBin(
palette = "Blues",
no_na$scientificName,
pretty = TRUE)
## Warning in pretty.default(domain %||% x, n = bins): NAs introduced by coercion
levs <- factor(no_na$scientificName)
factpal <- colorFactor(topo.colors(5), levs)
no_na %>%
group_by(scientificName) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(
~decimalLongitude,
~decimalLatitude,
color = ~factpal(scientificName),
stroke = FALSE, fillOpacity = 0.5
) %>%
setView( lng = -100,
lat = 20,
zoom = 11 ) %>%
setMaxBounds( lng1 = -100,
lat1 = 19.432241,
lng2 = -98,
lat2 = 20 )
We will need one package we haven’t used yet, Liam Revell’s Phytools.
devtools::install_github("liamrevell/phytools")
## Downloading GitHub repo liamrevell/phytools@HEAD
## Skipping 1 packages not available: phangorn
##
checking for file ‘/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/RtmpXMOCYt/remotes4fd01c2d05f3/liamrevell-phytools-a54910f/DESCRIPTION’ ...
✓ checking for file ‘/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/RtmpXMOCYt/remotes4fd01c2d05f3/liamrevell-phytools-a54910f/DESCRIPTION’
##
─ preparing ‘phytools’:
##
checking DESCRIPTION meta-information ...
✓ checking DESCRIPTION meta-information
##
─ checking for LF line-endings in source and make files and shell scripts
##
─ checking for empty or unneeded directories
##
─ looking to see if a ‘data/datalist’ file should be added
##
─ building ‘phytools_0.7-95.tar.gz’
##
##
## Installing package into '/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/RtmppNL377/temp_libpath4e15404a9177'
## (as 'lib' is unspecified)
## Loading required package: ape
## Loading required package: maps
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
Plotting a phylogeny to a map is a fairly simple task, but has a lot of data preparation work involved. The basic steps look like this:
You may have noticed that the GBIF package includes citation information in the scientificName column. We will need to remove that. To do this, let’s try the strsplit
function. This function splits a character string on a defined character and returns a vector of the elements in that character string.
For example:
my_string = "This is my string"
split_up <- strsplit(my_string, " ")
split_up
## [[1]]
## [1] "This" "is" "my" "string"
Try indexing this object. What do you need to do to access data? Now, with a partner, make this iterable across every row in the scientificName
column. Then, unite the split objects into one new column called genusSpecies
.
split_names <- no_na %>%
mutate(genus = map_chr(scientificName, function(s) strsplit(s, " ")[[1]][1]))%>% mutate(species = map_chr(scientificName, function(s) strsplit(s, " ")[[1]][2])) %>%
unite(col = genusSpecies, genus, species)
If you look at the data, there are some obviously mistaken values in there. For example, BOLD is not an ant species. Let’s drop that.
# Use ROTL to resolve names
no_bold <- split_names[ grep("BOLD", split_names$genusSpecies, invert = TRUE) , ]
We also ended up with far more ant taxa than I thought. What a nice problem to have! But let’s filter down to, say, five of them:
a_couple_ants <- c("Martialis_heureka", "Atta_mexicana", "Atta_cephalotes", "Atta_sexdens", "Atta_Fabricius")
subset_data <- no_bold %>%
filter(genusSpecies %in% a_couple_ants)
Now, let’s use ROTL to make sure we don’t have spelling errors.
reconciled_names <- rotl::tnrs_match_names(unique(subset_data$genusSpecies))
good_names <- reconciled_names %>%
drop_na()
Now, we will query our tree from OpenTree.
tree <- rotl::tol_induced_subtree(good_names$ott_id, label="name")
Now, we must combine our GBIF data and our taxon names into a matrix, a two-dimensional data structure with fewer neat data parsing features than a dataframe or tibble. These structures are often preferred when speed is an issue. Matrices can only be one type, which means we must add the row names after generating the object.
And the big reveal! Let’s overlay our OpenTree with our map:
library(viridis)
cols<-setNames(sample(viridis(n=Ntip(tree))),
tree$tip.label)
obj<-phylo.to.map(tree,only_lat_long, plot=FALSE, direction="rightwards")
Oh no! What has gone wrong?
This tree has several quirks. Have a look at the object and see if you can spot them.
#Answer here
- No branch lengths
- Not fully bifurcating
First, let’s resolve the polytomy issue. We will do this using Ape’s multi2di
function, which arbitrarily resolves polytomies. In real life, you would probably want to think about this a little more.:
We’ll also need to add some branch lengths. In our case, we will draw them from an exponential distribution. The exponential is often assumed to be a reasonable approximation for banch lengths - you’ll hear more about this if you take my systematics lab ;)
In our case, we will use rexp
to make the draws, and we will map them to a new attribute edge.length
that is a standard attribute of the tree object. In reality, you would likely not want to do this for a publication quality analysis, and would want to estimate branch lengths from data. But being able to rescale branch lengths is a good skill for sensitivity and other similar analyses.
tree$edge.length <- rexp(tree$Nnode*2)
tree$edge.length
Now let’s try that map again.
plot_object<-phylo.to.map(tree,only_lat_long, plot=FALSE, direction="rightwards")
Finally, we will actually plot to space.