This week, we will be starting with geospatial data and mapping. Today, we will begin with locality data in GBIF. GBIF is a website for aggregating location data from other sources.

We will work with two packages today. One is RGBIF, an interface to GBIF data maintained by the ROpenSci group.

install.packages("rgbif")
install.packages("leaflet")

Later in this lesson, we will also use the rotl package we’ve seen last week.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(rotl)
library(rgbif)
library(leaflet)
library(tidyverse)

First, we’re going to try an example for one ant. We will query locality data and map it in a few ways. Then, you will figure out how to make the process of mapping data iterable over a set of taxa.

name_suggest("Atta mexicana")

## Records returned [1] 
## No. unique hierarchies [0] 
## Args [q=Atta mexicana, limit=100, fields1=key, fields2=canonicalName,
##      fields3=rank] 
## # A tibble: 1 × 3
##       key canonicalName rank   
##     <int> <chr>         <chr>  
## 1 5035745 Atta mexicana SPECIES

Next, we’re going to use occ_search to search the GBIF database for where these things are occurring.

 occurences <- occ_search(taxonKey = 5035745, limit = 20)

Next, we will filter the resultant dataset to name and lat and longitude data.

no_na <- occurences$data %>%
  select(scientificName, decimalLatitude, decimalLongitude) %>% 
  drop_na()

And finally, we will plot the data using leaflet

k <- leaflet::leaflet(no_na) %>% 
  addTiles() %>% 
  addMarkers(~decimalLongitude, ~decimalLatitude, popup = no_na$scientificName)

This looks much like plotting with ggplot and tidyverse. That’s because leaflet is based on the same principles. We establish a canvas and the data in the first line, then we add tiles (the actual map we will use), and then we add our points. When we think about it, this is quite similar to establishing a ggplot canvas, the axes, and the points to plot.

One point is easy. How often do we only want one point, though? Probably, we will want to plot several species. First, with a partner, work out how to get GBIF ids for a set of taxa.

ants <- c("Martialis", "Atta", "Ectatomma", "Tatuidris", "Aneuretus", "Xymmer", "Aenictus")

#Get GBIF ids for our vector of species.


ids <-c()
  for (ant in ants){
   print(ant)
   raw <- name_suggest(ant)
   print(raw)
   ids[ant] <- c(raw$data$key)
}

## [1] "Martialis"
## Records returned [74] 
## No. unique hierarchies [0] 
## Args [q=Martialis, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 74 × 3
##        key canonicalName       rank    
##      <int> <chr>               <chr>   
##  1 7744356 Martialis           GENUS   
##  2 9599316 ? martialis         UNRANKED
##  3 7437085 Martialis heureka   SPECIES 
##  4 5977251 Erynnis martialis   SPECIES 
##  5 4544067 Aculus martialis    SPECIES 
##  6 5168773 Hahnia martialis    SPECIES 
##  7 5025232 Agathis martialis   SPECIES 
##  8 5510493 Globulina martialis SPECIES 
##  9 2829565 Habenaria martialis SPECIES 
## 10 4499030 Entedon martialis   SPECIES 
## # … with 64 more rows

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

## [1] "Atta"
## Records returned [100] 
## No. unique hierarchies [0] 
## Args [q=Atta, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 100 × 3
##        key canonicalName    rank   
##      <int> <chr>            <chr>  
##  1 6006602 Atta             GENUS  
##  2 1323108 Atta             GENUS  
##  3 5035740 Atta silvae      SPECIES
##  4 5035749 Atta domicola    SPECIES
##  5 8368352 Atta penetrans   SPECIES
##  6 5035739 Atta robusta     SPECIES
##  7 5035737 Atta goiana      SPECIES
##  8 8166766 Atta diligens    SPECIES
##  9 7390841 Atta scalpturata SPECIES
## 10 5035746 Atta saltensis   SPECIES
## # … with 90 more rows

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

## [1] "Ectatomma"
## Records returned [87] 
## No. unique hierarchies [0] 
## Args [q=Ectatomma, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 87 × 3
##        key canonicalName          rank   
##      <int> <chr>                  <chr>  
##  1 1314135 Ectatomma              GENUS  
##  2 1314136 Ectatomma gibbum       SPECIES
##  3 1314147 Ectatomma edentatum    SPECIES
##  4 9175376 Ectatomma rimulosum    SPECIES
##  5 1314150 Ectatomma tuberculatum SPECIES
##  6 4673125 Ectatomma rothneyi     SPECIES
##  7 4673127 Ectatomma socrus       SPECIES
##  8 7570313 Ectatomma parasiticum  SPECIES
##  9 8217186 Ectatomma tornatum     SPECIES
## 10 8224209 Ectatomma dolo         SPECIES
## # … with 77 more rows

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

## [1] "Tatuidris"
## Records returned [3] 
## No. unique hierarchies [0] 
## Args [q=Tatuidris, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 3 × 3
##       key canonicalName     rank   
##     <int> <chr>             <chr>  
## 1 1319364 Tatuidris         GENUS  
## 2 1319365 Tatuidris tatusia SPECIES
## 3 9541003 Tatuidris kapasi  SPECIES

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

## [1] "Aneuretus"
## Records returned [4] 
## No. unique hierarchies [0] 
## Args [q=Aneuretus, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 4 × 3
##       key canonicalName     rank   
##     <int> <chr>             <chr>  
## 1 1320037 Aneuretus         GENUS  
## 2 1320038 Aneuretus simoni  SPECIES
## 3 1737071 Decodes aneuretus SPECIES
## 4 1590997 Bibio aneuretus   SPECIES

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

## [1] "Xymmer"
## Records returned [3] 
## No. unique hierarchies [0] 
## Args [q=Xymmer, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 3 × 3
##        key canonicalName rank   
##      <int> <chr>         <chr>  
## 1  4674293 Xymmer        GENUS  
## 2 10000176 Xymmer ambvky SPECIES
## 3  8741954 Xymmer phungi SPECIES

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

## [1] "Aenictus"
## Records returned [100] 
## No. unique hierarchies [0] 
## Args [q=Aenictus, limit=100, fields1=key, fields2=canonicalName, fields3=rank] 
## # A tibble: 100 × 3
##        key canonicalName         rank   
##      <int> <chr>                 <chr>  
##  1 1317499 Aenictus              GENUS  
##  2 1317552 Aenictus vagans       SPECIES
##  3 1317644 Aenictus lifuiae      SPECIES
##  4 1317647 Aenictus fergusoni    SPECIES
##  5 1317629 Aenictus obscurus     SPECIES
##  6 9091750 Aenictus maneerati    SPECIES
##  7 9534984 Aenictus fuchuanensis SPECIES
##  8 9055270 Aenictus punctatus    SPECIES
##  9 9265079 Aenictus nishimurai   SPECIES
## 10 1317525 Aenictus binghami     SPECIES
## # … with 90 more rows

## Warning in ids[ant] <- c(raw$data$key): number of items to replace is not a
## multiple of replacement length

Now, we will get the actual specimen occurrences for these ants. Practice using an apply function to do this. Inspect the object you’ve queried when you’ve completed your loop. Have we dealt with an object like this before?

ids <-c()
  for (ant in ants){
   raw <- name_suggest(ant)
   occ_search(taxonKey = raw$data$key, limit = 10)
}

This is a vector or dataframes. Each individual query made it’s own dataframe. For simplicity, lets combine them into one dataframe object.

#Combine the resultant dataframes into one large dataframe
mega_df <- bind_rows(.id = "column_label")
  for (ant in ants){
   print(df$ant$data)
}

Since we will be plotting these with a map, we need complete data in both columns - lat and long. Let’s drop any columns without data in both.

#Drop rows with NA values in the lat and long
no_na <- mega_df %>%
select(scientificName, decimalLatitude, decimalLongitude) %>%
drop_na()

Now let’s take a peek at our results!

# Plot the dataframe of observations
k <- leaflet::leaflet(no_na) %>%
addTiles() %>%
addMarkers(~decimalLongitude, ~decimalLatitude, popup = no_na$scientificName)

GeoSpatial Choose Your Own Adventure

Below are six code snippets to do different types of map-based plotting. With a partner, discuss what you think would be interesting or important to view in these data. Choose a block of code to modify to accomplish your visualization. We will reconvene in about 30 minutes and do a Round Robin showing everyone’s maps. Have Fun!

leaflet(no_na) %>% 
  addTiles() %>% 
  addCircles(~decimalLongitude, ~decimalLatitude)

Draw point sizes from a distribution

leaflet(no_na) %>% 
  addTiles() %>% 
  addCircleMarkers(~decimalLongitude, ~decimalLatitude, radius = runif(100, 4, 10), color = c('red'))

Cluster observations

leaflet(no_na) %>%
  addTiles() %>% 
  addMarkers(~decimalLongitude, ~decimalLatitude, clusterOptions = markerClusterOptions())

Size points by number of observations

no_na %>% 
  group_by(decimalLatitude) %>% 
  mutate(n_areas = n())  %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(~decimalLongitude, ~decimalLatitude, radius = ~n_areas, color = c('red'))

Color points by species

pal <- colorBin(
  palette = "Blues",
  no_na$scientificName,
  pretty = TRUE)

## Warning in pretty.default(domain %||% x, n = bins): NAs introduced by coercion

levs <- factor(no_na$scientificName)
factpal <- colorFactor(topo.colors(5), levs)

no_na %>% 
  group_by(scientificName) %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(
    ~decimalLongitude, 
    ~decimalLatitude,
    color = ~factpal(scientificName),
    stroke = FALSE, fillOpacity = 0.5
  )

Set view Width

pal <- colorBin(
  palette = "Blues",
  no_na$scientificName,
  pretty = TRUE)

## Warning in pretty.default(domain %||% x, n = bins): NAs introduced by coercion

levs <- factor(no_na$scientificName)
factpal <- colorFactor(topo.colors(5), levs)

no_na %>% 
  group_by(scientificName) %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(
    ~decimalLongitude, 
    ~decimalLatitude,
    color = ~factpal(scientificName),
    stroke = FALSE, fillOpacity = 0.5
  ) %>% 
    setView( lng = -100,
            lat = 20,
            zoom = 11 ) %>%
  setMaxBounds( lng1 = -100,
                lat1 = 19.432241,
                lng2 = -98,
                lat2 = 20 )

Extra map color options (base maps)

https://rstudio.github.io/leaflet/basemaps.html

Mapping a tree to space

We will need one package we haven’t used yet, Liam Revell’s Phytools.

devtools::install_github("liamrevell/phytools")

## Downloading GitHub repo liamrevell/phytools@HEAD

## Skipping 1 packages not available: phangorn

##   
   checking for file ‘/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/RtmpXMOCYt/remotes4fd01c2d05f3/liamrevell-phytools-a54910f/DESCRIPTION’ ...
  
✓  checking for file ‘/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/RtmpXMOCYt/remotes4fd01c2d05f3/liamrevell-phytools-a54910f/DESCRIPTION’
## 
  
─  preparing ‘phytools’:
## 
  
   checking DESCRIPTION meta-information ...
  
✓  checking DESCRIPTION meta-information
## 
  
─  checking for LF line-endings in source and make files and shell scripts
## 
  
─  checking for empty or unneeded directories
## 
  
─  looking to see if a ‘data/datalist’ file should be added
## 
  
─  building ‘phytools_0.7-95.tar.gz’
## 
  
   
##

## Installing package into '/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/RtmppNL377/temp_libpath4e15404a9177'
## (as 'lib' is unspecified)

library(phytools)

## Loading required package: ape

## Loading required package: maps

## 
## Attaching package: 'maps'

## The following object is masked from 'package:purrr':
## 
##     map

Plotting a phylogeny to a map is a fairly simple task, but has a lot of data preparation work involved. The basic steps look like this:

Synchronize names between GBIF and OpenTree
Format GBIF data into a matrix
Add branch lengths to our tree
Plot

You may have noticed that the GBIF package includes citation information in the scientificName column. We will need to remove that. To do this, let’s try the strsplit function. This function splits a character string on a defined character and returns a vector of the elements in that character string.

For example:

my_string = "This is my string"
split_up <- strsplit(my_string, " ")
split_up

## [[1]]
## [1] "This"   "is"     "my"     "string"

Try indexing this object. What do you need to do to access data? Now, with a partner, make this iterable across every row in the scientificName column. Then, unite the split objects into one new column called genusSpecies.

split_names <- no_na %>% 
mutate(genus = map_chr(scientificName, function(s) strsplit(s, " ")[[1]][1]))%>% mutate(species = map_chr(scientificName, function(s) strsplit(s, " ")[[1]][2])) %>% 
unite(col = genusSpecies, genus, species)

If you look at the data, there are some obviously mistaken values in there. For example, BOLD is not an ant species. Let’s drop that.

# Use ROTL to resolve names

no_bold <- split_names[ grep("BOLD", split_names$genusSpecies, invert = TRUE) , ]

We also ended up with far more ant taxa than I thought. What a nice problem to have! But let’s filter down to, say, five of them:

a_couple_ants <- c("Martialis_heureka", "Atta_mexicana", "Atta_cephalotes", "Atta_sexdens", "Atta_Fabricius")
subset_data <- no_bold %>% 
    filter(genusSpecies %in% a_couple_ants)

Now, let’s use ROTL to make sure we don’t have spelling errors.

reconciled_names <- rotl::tnrs_match_names(unique(subset_data$genusSpecies))
good_names <-  reconciled_names %>%
  drop_na()

Now, we will query our tree from OpenTree.

tree <- rotl::tol_induced_subtree(good_names$ott_id, label="name")

Now, we must combine our GBIF data and our taxon names into a matrix, a two-dimensional data structure with fewer neat data parsing features than a dataframe or tibble. These structures are often preferred when speed is an issue. Matrices can only be one type, which means we must add the row names after generating the object.

And the big reveal! Let’s overlay our OpenTree with our map:

library(viridis)
cols<-setNames(sample(viridis(n=Ntip(tree))),
    tree$tip.label)

obj<-phylo.to.map(tree,only_lat_long, plot=FALSE, direction="rightwards")

Oh no! What has gone wrong?

This tree has several quirks. Have a look at the object and see if you can spot them.


#Answer here

- No branch lengths
- Not fully bifurcating

First, let’s resolve the polytomy issue. We will do this using Ape’s multi2di function, which arbitrarily resolves polytomies. In real life, you would probably want to think about this a little more.:

We’ll also need to add some branch lengths. In our case, we will draw them from an exponential distribution. The exponential is often assumed to be a reasonable approximation for banch lengths - you’ll hear more about this if you take my systematics lab ;)

In our case, we will use rexp to make the draws, and we will map them to a new attribute edge.length that is a standard attribute of the tree object. In reality, you would likely not want to do this for a publication quality analysis, and would want to estimate branch lengths from data. But being able to rescale branch lengths is a good skill for sensitivity and other similar analyses.

tree$edge.length <- rexp(tree$Nnode*2)
tree$edge.length

Now let’s try that map again.

plot_object<-phylo.to.map(tree,only_lat_long, plot=FALSE, direction="rightwards")

Finally, we will actually plot to space.

plot(plot_object,direction="rightwards",colors=cols,cex.points=c(0,1),
    lwd=c(3,1),ftype="i")

GBIF And Spatial Data