This week, we will be starting with geospatial data and mapping. Today, we will begin with locality data in GBIF. GBIF is a website for aggregating location data from other sources.

We will work with two packages today. One is RGBIF, an interface to GBIF data maintained by the ROpenSci group.

Later in this lesson, we will also use the rotl package we’ve seen last week.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
##  dplyr     1.1.3      readr     2.1.4
##  forcats   1.0.0      stringr   1.5.0
##  ggplot2   3.4.4      tibble    3.2.1
##  lubridate 1.9.3      tidyr     1.3.0
##  purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
##  dplyr::filter() masks stats::filter()
##  dplyr::lag()    masks stats::lag()
##  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

First, we’re going to try an example for one ant. We will query locality data and map it in a few ways. Then, you will figure out how to make the process of mapping data iterable over a set of taxa.

name_suggest("Atta mexicana")
## Records returned [1] 
## No. unique hierarchies [0] 
## Args [q=Atta mexicana, limit=100, fields1=key, fields2=canonicalName,
##      fields3=rank] 
## # A tibble: 1 × 3
##       key canonicalName rank   
##     <int> <chr>         <chr>  
## 1 5035745 Atta mexicana SPECIES

Next, we’re going to use occ_search to search the GBIF database for where these things are occurring.

 occurences <- occ_search(taxonKey = 5035745, limit = 20)

Next, we will filter the resultant dataset to name and lat and longitude data.

no_na <- occurences$data %>%
  select(scientificName, decimalLatitude, decimalLongitude) %>% 
  drop_na()

And finally, we will plot the data using leaflet

k <- leaflet::leaflet(no_na) %>% 
  addTiles() %>% 
  addMarkers(~decimalLongitude, ~decimalLatitude, popup = no_na$scientificName)

k

This looks much like plotting with ggplot and tidyverse. That’s because leaflet is based on the same principles. We establish a canvas and the data in the first line, then we add tiles (the actual map we will use), and then we add our points. When we think about it, this is quite similar to establishing a ggplot canvas, the axes, and the points to plot.

One point is easy. How often do we only want one point, though? Probably, we will want to plot several species. First, with a partner, work out how to get GBIF ids for a set of taxa.

ants <- c("Martialis", "Atta", "Ectatomma", "Tatuidris", "Aneuretus", "Xymmer")

Now, we will get the actual specimen occurrences for these ants. Practice using an apply function to do this. Inspect the object you’ve queried when you’ve completed your loop. Have we dealt with an object like this before?

search_results <- list()
for (ant in ants){
   query <- name_suggest(ant, rank = "genus")
   search <- occ_search(taxonKey = query$data$key, limit = 10)
   search_results[[ant]]<- search$data
 }

This is a vector or dataframes. Each individual query made it’s own dataframe. For simplicity, lets combine them into one dataframe object.

#Combine the resultant dataframes into one large dataframe
mega_df <- bind_rows(search_results, .id = "column_label")

Since we will be plotting these with a map, we need complete data in both columns - lat and long. Let’s drop any columns without data in both.

#Drop rows with NA values in the lat and long
no_na <- mega_df %>%
select(scientificName, decimalLatitude, decimalLongitude) %>%
drop_na()

Now let’s take a peek at our results!

# Plot the dataframe of observations
k <- leaflet::leaflet(no_na) %>%
addTiles() %>%
addMarkers(~decimalLongitude, ~decimalLatitude, popup = no_na$scientificName)

GeoSpatial Choose Your Own Adventure

Below are six code snippets to do different types of map-based plotting. With a partner, discuss what you think would be interesting or important to view in these data. Choose a block of code to modify to accomplish your visualization. We will reconvene in about 30 minutes and do a Round Robin showing everyone’s maps. Have Fun!

leaflet(no_na) %>% 
  addTiles() %>% 
  addCircles(~decimalLongitude, ~decimalLatitude)

Draw point sizes from a distribution

leaflet(no_na) %>% 
  addTiles() %>% 
  addCircleMarkers(~decimalLongitude, ~decimalLatitude, radius = runif(100, 4, 10), color = c('red'))

Cluster observations

leaflet(no_na) %>%
  addTiles() %>% 
  addMarkers(~decimalLongitude, ~decimalLatitude, clusterOptions = markerClusterOptions())

Size points by number of observations

no_na %>% 
  group_by(decimalLatitude) %>% 
  mutate(n_areas = n())  %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(~decimalLongitude, ~decimalLatitude, radius = ~n_areas, color = c('red'))

Color points by species

pal <- colorBin(
  palette = "Blues",
  no_na$scientificName,
  pretty = TRUE)
## Warning in pretty.default(domain %||% x, n = bins): NAs introduced by coercion
levs <- factor(no_na$scientificName)
factpal <- colorFactor(topo.colors(5), levs)

no_na %>% 
  group_by(scientificName) %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(
    ~decimalLongitude, 
    ~decimalLatitude,
    color = ~factpal(scientificName),
    stroke = FALSE, fillOpacity = 0.5
  ) 

Set view Width

pal <- colorBin(
  palette = "Blues",
  no_na$scientificName,
  pretty = TRUE)
## Warning in pretty.default(domain %||% x, n = bins): NAs introduced by coercion
levs <- factor(no_na$scientificName)
factpal <- colorFactor(topo.colors(5), levs)

no_na %>% 
  group_by(scientificName) %>% 
  leaflet() %>%
  addTiles() %>% 
  addCircleMarkers(
    ~decimalLongitude, 
    ~decimalLatitude,
    color = ~factpal(scientificName),
    stroke = FALSE, fillOpacity = 0.5
  ) %>% 
    setView( lng = -100,
            lat = 20,
            zoom = 11 ) %>%
  setMaxBounds( lng1 = -100,
                lat1 = 19.432241,
                lng2 = -98,
                lat2 = 20 )

Extra map color options (base maps)

https://rstudio.github.io/leaflet/basemaps.html

Mapping a tree to space

We will need one package we haven’t used yet, Liam Revell’s Phytools.

devtools::install_github("liamrevell/phytools")
## Downloading GitHub repo liamrevell/phytools@HEAD
## rlang  (1.1.3   -> 1.1.4  ) [CRAN]
## cli    (3.6.2   -> 3.6.3  ) [CRAN]
## digest (0.6.35  -> 0.6.36 ) [CRAN]
## Rcpp   (1.0.12  -> 1.0.13 ) [CRAN]
## igraph (1.5.1   -> 2.0.3  ) [CRAN]
## ape    (5.7-1   -> 5.8    ) [CRAN]
## expm   (0.999-7 -> 0.999-9) [CRAN]
## Installing 7 packages: rlang, cli, digest, Rcpp, igraph, ape, expm
## Installing packages into '/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/Rtmpie5FUp/temp_libpath3b62461540c7'
## (as 'lib' is unspecified)
## 
##   There are binary versions available but the source versions are later:
##         binary  source needs_compilation
## rlang    1.1.2   1.1.4              TRUE
## cli      3.6.2   3.6.3              TRUE
## digest  0.6.33  0.6.36              TRUE
## Rcpp    1.0.11  1.0.13              TRUE
## igraph   1.5.1   2.0.3              TRUE
## ape      5.7-1     5.8              TRUE
## expm   0.999-8 0.999-9              TRUE
## installing the source packages 'rlang', 'cli', 'digest', 'Rcpp', 'igraph', 'ape', 'expm'
## Warning in i.p(...): installation of package 'expm' had non-zero exit status
## Warning in i.p(...): installation of package 'igraph' had non-zero exit status
## Warning in i.p(...): installation of package 'ape' had non-zero exit status
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/Rtmp8yxV8f/remotes44a77fccc966/liamrevell-phytools-8d0aa6d/DESCRIPTION’ ... OK
## * preparing ‘phytools’:
## * checking DESCRIPTION meta-information ... OK
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * looking to see if a ‘data/datalist’ file should be added
## * building ‘phytools_2.3-6.tar.gz’
## Installing package into '/private/var/folders/54/9kd8nf1x4fnft0ymvb11qmc80000gn/T/Rtmpie5FUp/temp_libpath3b62461540c7'
## (as 'lib' is unspecified)
## Loading required package: ape
## 
## Attaching package: 'ape'
## The following object is masked from 'package:dplyr':
## 
##     where
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map

Plotting a phylogeny to a map is a fairly simple task, but has a lot of data preparation work involved. The basic steps look like this:

  1. Synchronize names between GBIF and OpenTree
  2. Format GBIF data into a matrix
  3. Add branch lengths to our tree
  4. Plot

You may have noticed that the GBIF package includes citation information in the scientificName column. We will need to remove that. To do this, let’s try the strsplit function. This function splits a character string on a defined character and returns a vector of the elements in that character string.

For example:

my_string = "This is my string"
split_up <- strsplit(my_string, " ")
split_up
## [[1]]
## [1] "This"   "is"     "my"     "string"

Try indexing this object. What do you need to do to access data? Now, with a partner, make this iterable across every row in the scientificName column. Then, unite the split objects into one new column called genusSpecies.

split_names <- no_na %>% 
mutate(genus = map_chr(scientificName, function(s) strsplit(s, " ")[[1]][1]))%>% mutate(species = map_chr(scientificName, function(s) strsplit(s, " ")[[1]][2])) %>% 
unite(col = genusSpecies, genus, species)

If you look at the data, there are some obviously mistaken values in there. For example, BOLD is not an ant species. Let’s drop that.

# Use ROTL to resolve names

no_bold <- split_names[ grep("BOLD", split_names$genusSpecies, invert = TRUE) , ]

We also ended up with far more ant taxa than I thought. What a nice problem to have! But let’s filter down to, say, five of them:

a_couple_ants <- c("Martialis_heureka", "Ectatomma_ruidum", "Ectatomma_tuberculatum", "Aneuretus_simoni", "Ectatomma_opaciventre")
subset_data <- no_bold %>% 
    filter(genusSpecies %in% a_couple_ants)

Now, let’s use ROTL to make sure we don’t have spelling errors.

reconciled_names <- rotl::tnrs_match_names(unique(subset_data$genusSpecies))
good_names <-  reconciled_names %>%
  drop_na()

Now, we will query our tree from OpenTree.

tree <- rotl::tol_induced_subtree(good_names$ott_id, label="name")
## 
Progress [----------------------------------] 0/14 (  0) ?s
Progress [================================] 14/14 (100)  0s
                                                            
## Warning in collapse_singles(tr, show_progress): Dropping singleton nodes with
## labels: mrcaott1682ott35311, mrcaott1682ott17752, mrcaott1682ott7438,
## mrcaott1682ott34821, mrcaott34821ott950983, Aneuretinae, Aneuretus,
## mrcaott4706ott5525, mrcaott5525ott48019, mrcaott48019ott688185, Ectatomminae,
## mrcaott48019ott162057, Martialinae, Martialis
plot(tree)

Now, we must combine our GBIF data and our taxon names into a matrix, a two-dimensional data structure with fewer neat data parsing features than a dataframe or tibble. These structures are often preferred when speed is an issue. Matrices can only be one type, which means we must add the row names after generating the object.

only_lat_long <- subset_data %>% 
  select(decimalLatitude, decimalLongitude) %>% 
  as.matrix()

rownames(only_lat_long) <- subset_data$genusSpecies

Lastly, let’s pick some colors for each species:

## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:maps':
## 
##     unemp
color_selection <- setNames(sample(viridis(n=Ntip(tree))),
    tree$tip.label)

And the big reveal! Let’s overlay our OpenTree with our map:

obj<-phylo.to.map(tree,only_lat_long, plot=FALSE, direction="rightwards")
plot(obj)

Oh no! What has gone wrong?

This tree has several quirks. Have a look at the object and see if you can spot them.


#Answer here

- No branch lengths
- Not fully bifurcating

First, let’s resolve the polytomy issue. We will do this using Ape’s multi2di function, which arbitrarily resolves polytomies. In real life, you would probably want to think about this a little more.:

library(ape)
tree <- multi2di(tree)

We’ll also need to add some branch lengths. In our case, we will draw them from an exponential distribution. The exponential is often assumed to be a reasonable approximation for banch lengths - you’ll hear more about this if you take my systematics lab ;)

In our case, we will use rexp to make the draws, and we will map them to a new attribute edge.length that is a standard attribute of the tree object. In reality, you would likely not want to do this for a publication quality analysis, and would want to estimate branch lengths from data. But being able to rescale branch lengths is a good skill for sensitivity and other similar analyses.

tree$edge.length <- rexp(tree$Nnode*2)
tree$edge.length
## [1] 0.8337838 1.2722607 0.4795754 0.6752742 0.9590611 1.0967122

Now let’s try that map again.

plot_object <- phylo.to.map(tree, only_lat_long, plot=FALSE)
## objective: 6
## objective: 2
## objective: 2
plot_object
## Object of class "phylo.to.map" containing:
## 
## (1) A phylogenetic tree with 4 tips and 3 internal nodes.
## 
## (2) A geographic map with range:
##      -85.19N, 83.6N
##      -180W, 180W.
## 
## (3) A table containing 27 geographic coordinates (may include
##     more than one set per species).
## 
## If optimized, tree nodes have been rotated to maximize alignment
## with the map when the tree is plotted in a downwards direction.

Finally, we will actually plot to space.

plot(plot_object,direction="rightwards",colors=color_selection, cex.points=c(0,1), lwd=c(3,1), ftype="off")
## "phylo.to.map" direction is "downwards" but plot direction has been given as "rightwards".
## Re-optimizing object....
## objective: 4
## objective: 4
## objective: 2