About Neotoma

Neotoma is a Paleoecological database that stores many different types of data. It focuses on data from 5.333 million years ago (Paleocene) to now. The majority of the data is from collection sites, where scientists dug into the ground, collected, and then carefully characterized the bones and fossils they found. They included their best guess on the geological age in which they believe the organism exisited on this planet. The data ranges from diatoms to pollen to insects to bones. Bones! So cool.

Neotoma is implemented in a Microsoft SQL server. You can use Neotoma data without SQL skills using the Neotoma R Package which is essentially an API wrapper package which is part of my favorite R community Ropensci! This package was largely written by Simon Goring, who appears to have spear-headed this project.

Note: If you are an SQL wizard, you are in luck, this database is extensive and has excellent documentation. You can read about all the details in the Neotoma Database Manual.

If you are working on your own machine, you must first install these packages.

Set Up

# If you are working on your own machine, you must first install these packages.
# install.packages(c("ggplot", "ggmap", "tidyverse", "skimr", "taxize", "neotoma", "WikipediR"))

library(ggplot2)
library(ggmap)
library(tidyverse)
library(skimr)
library(taxize)
library(neotoma)
library(WikipediR)
library(maps)

## Set working directory to source
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

Exploring with a short example

The neotoma::get_table() function allows you to retrieve whole datasets. This is a superfast way to get to the data as a data frame. Go here for a list of all 63 the tables. These can be accessed using get_table() function and is a way to search through the database by specific parameters.

For instance, if you are interested in a particular species, age, altitude, ect. you can use neotoma::get_dataset. Let’s try searching by taxonname by including just the genus. Can anyone tell me the common name of Smilodons?

smilodon <- get_dataset(taxonname = 'Smilodon*') 
## The API call was successful, you have returned 45 records.
head(smilodon)
## $`4564`
## A dataset for Blackwater Draw Loc. 1
## Accessed 2019-04-16 14:27h. 
##  dataset.id              site.name      long      lat             type
##        4564 Blackwater Draw Loc. 1 -103.3167 34.28333 vertebrate fauna
## 
## $`4939`
## A dataset for Avery Island
## Accessed 2019-04-16 14:27h. 
##  dataset.id    site.name   long      lat             type
##        4939 Avery Island -91.75 29.86667 vertebrate fauna
## 
## $`5366`
## A dataset for Clamp Cave
## Accessed 2019-04-16 14:27h. 
##  dataset.id  site.name   long      lat             type
##        5366 Clamp Cave -98.75 31.11667 vertebrate fauna
## 
## $`5371`
## A dataset for Friesenhahn Cave
## Accessed 2019-04-16 14:27h. 
##  dataset.id        site.name      long      lat             type
##        5371 Friesenhahn Cave -98.36667 29.61667 vertebrate fauna
## 
## $`5479`
## A dataset for First American Bank [40DV40]
## Accessed 2019-04-16 14:27h. 
##  dataset.id                    site.name      long      lat
##        5479 First American Bank [40DV40] -86.78333 36.18333
##              type
##  vertebrate fauna
## 
## $`6060`
## A dataset for Rancho La Brea
## Accessed 2019-04-16 14:27h. 
##  dataset.id      site.name      long     lat             type
##        6060 Rancho La Brea -118.3568 34.0637 vertebrate fauna

Activity (10 min)

  1. What does the * mean in this code?
  2. What genus is your favorite animal in?
  3. Find your favorite animal or group of animals using the get_dataset function.

Exploring with large example

I basically want to visualize everything that the database has to offer. Of course that would mean downloading the entire database, which I shouldn’t do because it is too big. I limited my question to an aspect of the dataset I was interested in: What is the distribution of animal samples through time?

My R workflow is always motivated by the question: How do I organize all the data that interests me into a data frame in the tidy format i.e. where each sample is a row.

The key to getting all the data I wanted was the get_data() function, which allows me to download all the data for a given site. I wanted all the species information in the vertebrate fauna dataset, which is still a lot of data, but manageable.

We will skip the part of finding all the vetebrate data in Neotoma for now, but if you are interested please see this post on curiositydata.org.

Instead we will just read in the data.

## Read in previous data.
all_fauna_data <- read.csv("../data/all_fauna_data.csv")

Discussion Its important to “get a feel” for your data. Here are a few ways to do that. What do each of these funcyions do in plain english?

## Check out all the pretty data
str(all_fauna_data)
## 'data.frame':    44468 obs. of  13 variables:
##  $ X               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ taxon.name      : Factor w/ 1768 levels "?Alces alces",..: 127 157 190 244 248 762 839 916 1027 1061 ...
##  $ variable.units  : Factor w/ 8 levels "1-4 scale","grams",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ variable.element: Factor w/ 312 levels "antler","antorbital",..: 16 13 16 16 16 16 16 16 16 16 ...
##  $ variable.context: Factor w/ 5 levels "articulated",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ taxon.group     : Factor w/ 9 levels "Birds","Crustaceans undiff.",..: 6 1 6 6 6 6 6 6 6 6 ...
##  $ site.id         : int  3531 3531 3531 3531 3531 3531 3531 3531 3531 3531 ...
##  $ site.name       : Factor w/ 3372 levels "101 Ranch","10AA15",..: 768 768 768 768 768 768 768 768 768 768 ...
##  $ long            : num  -106 -106 -106 -106 -106 ...
##  $ lat             : num  40.9 40.9 40.9 40.9 40.9 ...
##  $ iteration       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age.older       : num  11980 11980 11980 11980 11980 ...
##  $ age.younger     : num  1450 1450 1450 1450 1450 1450 1450 1450 1450 1450 ...
head(all_fauna_data)
##   X            taxon.name variable.units variable.element variable.context
## 1 1 Antilocapra americana present/absent       bone/tooth             <NA>
## 2 2                  Aves present/absent        bone/bill             <NA>
## 3 3             Bison sp. present/absent       bone/tooth             <NA>
## 4 4         Canis latrans present/absent       bone/tooth             <NA>
## 5 5           Canis lupus present/absent       bone/tooth             <NA>
## 6 6           Cynomys sp. present/absent       bone/tooth             <NA>
##   taxon.group site.id                site.name    long      lat iteration
## 1     Mammals    3531 Chimney Rock Animal Trap -105.75 40.86667         1
## 2       Birds    3531 Chimney Rock Animal Trap -105.75 40.86667         1
## 3     Mammals    3531 Chimney Rock Animal Trap -105.75 40.86667         1
## 4     Mammals    3531 Chimney Rock Animal Trap -105.75 40.86667         1
## 5     Mammals    3531 Chimney Rock Animal Trap -105.75 40.86667         1
## 6     Mammals    3531 Chimney Rock Animal Trap -105.75 40.86667         1
##   age.older age.younger
## 1     11980        1450
## 2     11980        1450
## 3     11980        1450
## 4     11980        1450
## 5     11980        1450
## 6     11980        1450
dim(all_fauna_data) 
## [1] 44468    13

Visualizing

Let’s visualize the question: What type of taxon.groups do I have?

## Summarize the taxon.groups
ggplot(all_fauna_data, aes(taxon.group)) + 
  geom_histogram(stat = "count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Looks like the Fauna dataset is mostly made up of mammals anyway, so I am just going to remove anything else to narrow my questions a bit.

## subset only mammals
mammals <- all_fauna_data %>%
  filter(taxon.group == "Mammals")

# Compare size to see how many samples we removed?
dim(all_fauna_data)
## [1] 44468    13
dim(mammals) 
## [1] 39739    13

Let’s visualize another question: What are the top taxa groups?

mammals %>% 
  count(taxon.name) %>% 
  top_n(35) %>%
  ggplot(., aes(x = reorder(taxon.name, -n), y = n)) +
    geom_bar(stat = "identity") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(x = "Taxon Name")
## Selecting by n

What are all these species?! With the exception of Bison bison (which is Bison or buffalo) and Homo sapiens, I really couldn’t tell you. To just explore a few of the other species I just Googled.

Activity (10 min)

  1. Google some of these taxa to get an idea of what animals are most represented in this data.
  2. If you want to be fancy you can access Wikipedia from R. This won’t work for every taxon name, but it’s worth a try.
wp_content <- page_content("en", "wikipedia", page_name = mammals$taxon.name[1], as_wikitext = T)
wp_content$parse$wikitext

## This is what happens when there is no page
wp_content2 <- page_content("en","wikipedia", page_name = mammals$taxon.name[2],as_wikitext = T) ## page doesn't exist

wp_content3 <- page_content("en", "wikipedia", page_name = mammals$taxon.name[3], as_wikitext = T)
wp_content3$parse$wikitext

Getting to know the data even better to ask even more questions

I used str and head to summarize our data, but now we want to know just a few more details about our data with a lot more style. So we are going to try out the fancy new skimr package….yet another Ropensci package.

Anyway, check out all the information you can get from all_fauna_data with skimr’s main function skimr::skim.

skim(all_fauna_data) 
## Skim summary statistics
##  n obs: 44468 
##  n variables: 13 
## 
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────
##          variable missing complete     n n_unique
##         site.name       0    44468 44468     3372
##       taxon.group       0    44468 44468        9
##        taxon.name       0    44468 44468     1768
##  variable.context   44260      208 44468        5
##  variable.element       0    44468 44468      312
##    variable.units       0    44468 44468        8
##                                    top_counts ordered
##        Dry: 292, Pec: 274, Zet: 245, Bry: 220   FALSE
##   Mam: 39739, Bir: 2115, Rep: 1364, Fis: 1126   FALSE
##      Odo: 1276, Ave: 1093, Cas: 981, Pro: 962   FALSE
##         NA: 44260, int: 121, art: 44, red: 32   FALSE
##    bon: 37079, bon: 3210, bon: 1244, bon: 932   FALSE
##  pre: 17936, NIS: 14271, MNI: 12155, MNE: 100   FALSE
## 
## ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────────────
##   variable missing complete     n     mean       sd  p0      p25     p50
##  iteration       0    44468 44468  1621.6   1133.84   1   594.75  1511  
##    site.id       0    44468 44468  5641.98  2839.89 244  4064     4865  
##          X       0    44468 44468 22234.5  12836.95   1 11117.75 22234.5
##       p75  p100     hist
##   2539     3853 ▇▅▅▅▃▃▅▃
##   5760    16264 ▁▅▇▁▁▁▁▁
##  33351.25 44468 ▇▇▇▇▇▇▇▇
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
##     variable missing complete     n     mean       sd      p0     p25
##    age.older    5654    38814 44468 15553.07 50860.17   29    1000   
##  age.younger    5654    38814 44468  5314.49 21743.28    0     350   
##          lat       3    44465 44468    39.66     7.15  -18.68   35.62
##         long       3    44465 44468   -99.64    28.55 -176.62 -110.62
##      p50      p75       p100     hist
##  2200    11135    1500000    ▇▁▁▁▁▁▁▁
##   950     6700      4e+05    ▇▁▁▁▁▁▁▁
##    38.87    42.77      82.5  ▁▁▁▁▇▂▁▁
##   -99.12   -89        177.63 ▁▇▂▁▁▁▁▁

Emoji Love interlude: 😍😍😍 * 100. This package is amazing!!! Please read about all the features on the skimr Github repo and the blog post describing the motivation and collaborative effort involved in making it. I just read it and I am bursting with love for the R community. Man. Now I just want to do a whole post on how 💣 bomb skimr is.

Mapping taxon onto a map

Right away I see that I have enough information to start visualizing species distributions in time.

# I have a love / hate relationship with this first species
humans <- mammals %>% 
  filter(taxon.name == "Homo sapiens" & long > -160) #ignoring samples out in the middle of the ocean

# Start with world
world <- map_data("world") 

ggplot() + 
  geom_polygon(data = world, 
               fill = "grey38", 
               aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) +
  geom_point(data = humans, 
             aes(x = long, 
                 y = lat,
                 color = age.older),
             alpha = .7, size = .7) +
  scale_color_continuous(guide = guide_legend(title = "Age (years ago)"))  +
    guides(colour = guide_legend(override.aes = list(size = 3))) +
  ggtitle("Geographic Distribution of Homo sapien Samples")

Alert: We notice that the sampling is skewed towards North America for sure! Why is this the case? More often than not, it is because of sampling bias, a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. Sometimes the bias is because of the scope of the database, which is the case here. Neotoma is a US data aggregation database. Other times, it can because of limitations in geographic area, actually there have been some studies showing sampling occurs closest to roads! Other times, it is taxonomic bias, or “taxonomic chauvinism”, which is a bias in which organisms are sampled. It is very important to know your data as best you can before making any conclusions.

So let’s zoom in on the United States since that is what Neotoma focuses on.

usa <- map_data("state") 

ggplot() + 
  geom_polygon(data = usa, 
               fill = "grey38", 
               aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) +
  geom_point(data = humans, 
             aes(x = long, 
                 y = lat,
                 color = age.older),
             alpha = .7, size = .4) +
  guides(colour = guide_legend(override.aes = list(size = 3))) +
   ggtitle("Geographic Distribution of Homo sapien Samples") 

Awesome! Looks like you can really see how the age of these samples are really skewed towards younger age samples. Which you can see clearly when you use skimr.

skim(mammals$age.older)
## 
## Skim summary statistics
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
##           variable missing complete     n     mean       sd p0  p25  p50
##  mammals$age.older    3995    35744 39739 16256.64 52095.64 29 1000 2250
##    p75    p100     hist
##  11470 1500000 ▇▁▁▁▁▁▁▁
skim(mammals$age.younger)
## 
## Skim summary statistics
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
##             variable missing complete     n    mean       sd p0 p25  p50
##  mammals$age.younger    3995    35744 39739 5523.97 22507.27  0 350 1000
##   p75  p100     hist
##  7805 4e+05 ▇▁▁▁▁▁▁▁

Let’s try another another genus, like a genus of Saber Tooth Tigers, Smilodon.

smilodon <- mammals[grepl("Smilodon ", mammals$taxon.name),]

unique(smilodon$taxon.name) # looks like 5 unique categories.
## [1] Smilodon fatalis        cf. Smilodon fatalis    Smilodon sp.           
## [4] cf. Smilodon floridanus Smilodon gracilis      
## 1768 Levels: ?Alces alces ?Alces sp. ?Bison bison antiquus ... Zoarcoidei
usa <- map_data("state") 

ggplot() + 
  geom_polygon(data = usa, fill = "grey30", 
               aes(x = long, y = lat, group = group)) + 
  coord_fixed(1.3) +
  geom_point(data = smilodon, 
             aes(x = long, y = lat, color = taxon.name),
             alpha = .9, size = .7) +
    scale_color_brewer(palette = "Pastel1") + # I heart pastels
    theme(legend.key = element_rect(fill = "white")) +
    guides(colour = guide_legend(override.aes = list(size = 3))) +
    ggtitle("Geographic Distribution of Smilodon Samples")

Ooooo Sabertooth Tigers seem to like warmer climates, that is, if North American climate during the times of these samples was anything like today’s climate, which it wasn’t really, but maybe an hypothesis to look into? You could even try and get climate temperature data and merge that in to go further into this hypothesis?

Activity (15 minute)

  1. Make a map of the group you chose in the first activity.
  2. Experiment with different inputs to scale_color_brewer, theme, and guides to customize your map.

Now I can only go so far in exploring the data this way. The groupings of taxon is limited to species or genus, but what if I really want to see the distribution of higher order taxonomic groups. This is where taxize comes in.

Adding taxonomic classifications with taxize

taxize is such a useful package. I cannot imagine doing work on biodiversiry data without this package. There have been a lot of people who have contributed to this package, but Scott Chamberlain (R wizard, Ropensci co-founder, and fellow cat lover) is the main person behind these efforts. Thanks Scott and all you other contributors!

The taxize package provides a nice tutorial to get an overview of how to use the package. There are a lot of great functions to add meaning to any dataset with taxa/species names.

For us to understand how to get the most out of taxize for this particular dataset, we should mess around with a smaller subset of species. In this case, let’s grab all the species we can find from the extinct genus Mammuthus (Mammoths).

The main reason we want to use this package is to group the taxon.name information into something more digestible - for instance higher taxonomic groups (order, class, genus).

## Mammuthus
mammoth_test <- mammals[grepl("Mammuthus", mammals$taxon.name),]
unique(mammoth_test$taxon.name) # 16 unique Mammoth taxon names
##  [1] Mammuthus columbi            Mammuthus sp.               
##  [3] cf. Mammuthus columbi        cf. Mammuthus jeffersonii   
##  [5] Mammuthus primigenius        Mammuthus imperator         
##  [7] Mammuthus jeffersonii        Mammuthus cf. M. columbi    
##  [9] cf. Mammuthus sp.            Mammuthus exilis            
## [11] Mammuthus cf. M. primigenius Mammuthus cf. M. jeffersonii
## [13] Mammuthus cf. M. imperator   Mammuthus                   
## [15] ?Mammuthus sp.               cf. Mammuthus primigenius   
## 1768 Levels: ?Alces alces ?Alces sp. ?Bison bison antiquus ... Zoarcoidei

Note 1: The key to this package is that it is the glue that unites many databases and some of these databases require you to obtain an API key to access. See full list on the Taxize Github site.

Note 2: Omitted in this tutorial is where I basically played with many of the functions with different data sources. This package is so diverse because of the different databases it interacts with. Each database might have specific qualities for a data set. I would play with different functions with different data sources quite a bit if you plan on using taxize for your own data set. We ended up using the Open Tree of Life (tol) database as a data source.

Note: If you are interested in a new package that instead of using an API to a database, uses a a local database, check out taxadb. It is still under development, but looks super promising.

Test of taxize with small example of Mammoth samples

The taxize::classification function works great using just the word “Mammuthus” and the Tree of Life Database.

classification("Mammuthus", db = 'tol') 
## 
## Retrieving data for taxon 'Mammuthus'
## $Mammuthus
##                    name       rank      id
## 1                  life    no rank  805080
## 2    cellular organisms    no rank   93302
## 3             Eukaryota     domain  304358
## 4          Opisthokonta    no rank  332573
## 5               Holozoa    no rank 5246131
## 6               Metazoa    kingdom  691846
## 7             Eumetazoa    no rank  641038
## 8             Bilateria    no rank  117569
## 9         Deuterostomia    no rank  147604
## 10             Chordata     phylum  125642
## 11             Craniata  subphylum  947318
## 12           Vertebrata  subphylum  801601
## 13        Gnathostomata superclass  278114
## 14           Teleostomi    no rank  114656
## 15         Euteleostomi    no rank  114654
## 16        Sarcopterygii      class  458402
## 17 Dipnotetrapodomorpha    no rank 4940726
## 18            Tetrapoda superclass  229562
## 19              Amniota    no rank  229560
## 20             Mammalia      class  244265
## 21               Theria   subclass  229558
## 22             Eutheria    no rank  683263
## 23           Afrotheria superorder  746703
## 24          Proboscidea      order  226176
## 25         Elephantidae     family  541924
## 26            Mammuthus      genus  106255
## 
## attr(,"class")
## [1] "classification"
## attr(,"db")
## [1] "tol"

But when we use the whole list, we run into taxa name problems. Some of the taxon.names from Neotoma do not work well with taxize data sources. See below.

## Takes a min to run
## Does not like how my taxa.names are written
whole_mammoth_test <- classification(unique(mammoth_test$taxon.name), db = 'tol') 

## Show which species did not work
as.data.frame(is.na(whole_mammoth_test))

Activity

  1. Find your favorite animal genus and use taxize to get all species in that genus.

In order to get the most out of taxize we have to clean up the taxon.names so that Neotoma and Tree of Life work friendlier together. I grabbed a regex I saw from a Github gist Simon Goring wrote gist to help with the issue of cleaning up taxon name data. See the longer version of this workshop (neotoma_tutorial_log.Rmd) or the original post to learn more on how the data was cleaned.

Apply to larger dataset - extinct species

While wondering around the Neotoma database I found out that if you use neotoma::get_table on the “taxa” dataset, you get a column that tells you if the animal is extinct or not. What is more interesting than extinct species? Nothing. Expecially in North America! There are Mammoths, huge armidillos, Saber Tooth Tigers, Giant Sloths and more!

Let’s try to get the taxonomic classifications of all the extinct animals.

First we subset by extinct taxa:

neotoma_taxa <- neotoma::get_table("taxa")
extinct_taxa <- neotoma_taxa %>% 
  filter(Extinct == "TRUE")

## Find the intersect of extinct taxa and our mammal dataset
extinct_species_in_mammals <- intersect(extinct_taxa$TaxonName, mammals$taxon.name)

## Great! over 350 of these extinct species are found in our
## mammal dataset.
skim(extinct_species_in_mammals)
## 
## Skim summary statistics
## 
## ── Variable type:character ───────────────────────────────────────────────────────────────────────────────────────────
##                    variable missing complete   n min max empty n_unique
##  extinct_species_in_mammals       0      382 382   6  39     0      382

Now let’s subset our mammals dataset to include only extinct species.

extinct_mammals <- mammals %>% 
                      filter(taxon.name %in% extinct_species_in_mammals)

## Check
## skim(extinct_mammals) 
nrow(extinct_mammals)
## [1] 3012

3,0012 (number may change) bone samples in our data set come from extinct animals. Now that we have a nice subset of our species, let’s go ahead and use taxize to merge in genus, order, and family details for each sample. So there is a bit more cleaning steps we will omit from this workshop today for times, but please see the longer version of this workshop (neotoma_tutorial_log.Rmd) or the original post to learn more on how the data was cleaned.

Visualizing extinct animals

Below is the cleaned data set of all the extinct mammals bone samples found in the Neotoma database. So exciting. Let’s start visualizing to 1. make sure everything is correct 2. See if there are patterns in the data to start exploring further.

## Read in data frame
extinct_merged <- read.csv("../data/part2_extinct_merged_27March2018.csv")

extinct_merged %>%
  group_by(order) %>%
  count() %>%
    ggplot(., aes(reorder(x = order, -n), y = n)) + 
      geom_bar(stat = "identity") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      xlab("Extinct Mammals") +
      ylab("# of samples") +
      ggtitle("Number of Extinct Mammal Samples in Each Animal Order")
## Warning: Factor `order` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `order` contains implicit NA, consider using
## `forcats::fct_explicit_na`

There are 18 different orders of species, with the most abundant being Cetarticodactyla or even-toed undulates. There are also orders that are clearly not supposed to be there. I may not know animal taxonomic information very well, but for sure I know plants. Zingerberales is not supposed to be here!

I tried using taxize to retrieve the common names, but in the end, I could not easily. I ended up just Googling each one and actually had wayyyyy too much fun doing it. I figured out which are animals and which are not and made a list for subsetting. It may not be programmatically reproducible, but that’s okay because I learned so much.

orders_to_keep <- c("Rodentia", "Proboscidea", "Pilosa", "Perissodactyla", "Lagomorpha", "Didelphimorphia", "Cingulata", "Chiroptera", "Cetartiodactyla", "Carnivora", "Anguilliformes", "Primates")    

extinct_merged %>%  
  filter(order %in% orders_to_keep) %>% 
  group_by(order) %>%   
  count %>% 
  ggplot(., aes(reorder(x = order, -n), y = n)) + 
    geom_bar(stat = "identity") +
    theme_bw() +    
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("Extinct Mammals") +
    ylab("# of samples") +
  ggtitle("Number of Extinct Mammal Samples in Each Animal Order")

Visualize on map

So cool!

usa <- map_data("state") 

extinct_merged %>%
  filter(order %in% orders_to_keep) %>%
  filter(lat < 50) %>%
    ggplot(.) + 
      geom_polygon(data = usa, 
              fill = "grey30", 
              aes(x = long, y = lat, group = group)) +
      coord_fixed(1.3) +
      geom_point( aes(x = long, y = lat, color = order),  
             alpha = .7, size = .5) +
  scale_color_brewer(palette = "Set3") +
  theme(legend.key = element_rect(fill = "white")) +
  guides(colour = guide_legend(override.aes = list(size = 3))) +
  ggtitle("Geographic Distribution of Extinct Mammal Samples")

What is going on with the Animal Order Proboscidea?

We see some clustering in Michigan, let’s break that down a little bit to see what is going on.

Note: Proboscidea is the animal order in which elephants belong. Interestingly, the only Proboscidea two species left on earth are in Elephantidae 😭🐘

extinct_merged %>%
  filter(order %in% orders_to_keep) %>%
  filter(order == "Proboscidea") %>%
  filter(lat < 50) %>%
    ggplot(.) + 
      geom_polygon(data = usa, 
              fill = "grey30", 
              aes(x = long, y = lat, group = group)) +
      coord_fixed(1.3) +
      geom_point( aes(x = long, y = lat, color = genus ),  
             alpha = .7, size = .6) +
    scale_color_brewer(palette = "Pastel1") +
    theme(legend.key = element_rect(fill = "white")) +
    guides(colour = guide_legend(override.aes = list(size = 3))) +
    facet_grid(family~.) +
    ggtitle("Geographic Distribution of Proboscidea Samples")
## Warning: Removed 6 rows containing missing values (geom_point).

Wow! This is awesome. I didn’t really know anything about Gomphothere before. This family is distinct from elephants and became extinct before Elephants and Mammoths / Mastodons, so it makes sense the scarcity of samples in the North America during these times. They were slowly replaced by modern elephants until their extinction around 10,000 years ago.

extinct_merged %>%
  filter(order %in% orders_to_keep) %>%
  filter(order == "Proboscidea") %>%
  skim(age.older)
## Skim summary statistics
##  n obs: 754 
##  n variables: 17 
## 
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
##   variable missing complete   n     mean       sd   p0   p25   p50    p75
##  age.older     162      592 754 57976.92 46002.23 1150 15000 35000 110000
##    p100     hist
##  160000 ▇▃▁▁▁▇▁▁

That Michigan cluster of samples seems to be prodomidently from the Mammutidae Family.

extinct_merged %>%
  filter(order %in% orders_to_keep) %>%
  filter(family == "Mammutidae") %>%
  filter(lat < 50) %>%
    ggplot(.) + 
      geom_polygon(data = usa, 
              fill = "grey30", 
              aes(x = long, y = lat, group = group)) +
      coord_fixed(1.3) +
      geom_point( aes(x = long, y = lat, color = taxon.name ),  
             alpha = .7, size = .7) +
    scale_color_brewer(palette="Pastel1") +
    theme(legend.key = element_rect(fill = "white")) +
    guides(colour = guide_legend(override.aes = list(size = 3))) +
    ggtitle("Geographic Distribution of Mammutidae Samples")

The majority of the taxa is made up of Mastodon samples Mammut americanum, which apparently were the most successful North American Proboscidea species.

Incorporation of More Data

I wanted to see if I could make sense of the habitat of these species or climate during the Pliocene to map onto the plot, but unfortunatley I could not find mappable data. I wish I had this paper’s data. This is a question I had many times during these this analysis. Does anyone know where we can get predicted ancient global temperature?

As we explore more databases we hope to gain a better understanding of the data available and how they can be combined to provide insights into our natural world.

Stay tuned for more exploration!

Activity (rest of the time)

  1. Brainstorm question(s) that you could answer using this dataset.
  2. How would you approach you question(s)?
  3. If you pursue your idea further, consider contributing to the Cabinet of Curiosity.

Resources