Neotoma is a Paleoecological database that stores many different types of data. It focuses on data from 5.333 million years ago (Paleocene) to now. The majority of the data is from collection sites, where scientists dug into the ground, collected, and then carefully characterized the bones and fossils they found. They included their best guess on the geological age in which they believe the organism exisited on this planet. The data ranges from diatoms to pollen to insects to bones. Bones! So cool.
Neotoma is implemented in a Microsoft SQL server. You can use Neotoma data without SQL skills using the Neotoma
R Package which is essentially an API wrapper package which is part of my favorite R community Ropensci! This package was largely written by Simon Goring, who appears to have spear-headed this project.
Note: If you are an SQL wizard, you are in luck, this database is extensive and has excellent documentation. You can read about all the details in the Neotoma Database Manual.
If you are working on your own machine, you must first install these packages.
# If you are working on your own machine, you must first install these packages.
# install.packages(c("ggplot", "ggmap", "tidyverse", "skimr", "taxize", "neotoma", "WikipediR"))
library(ggplot2)
library(ggmap)
library(tidyverse)
library(skimr)
library(taxize)
library(neotoma)
library(WikipediR)
library(maps)
## Set working directory to source
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
The neotoma::get_table()
function allows you to retrieve whole datasets. This is a superfast way to get to the data as a data frame. Go here for a list of all 63 the tables. These can be accessed using get_table()
function and is a way to search through the database by specific parameters.
For instance, if you are interested in a particular species, age, altitude, ect. you can use neotoma::get_dataset
. Let’s try searching by taxonname by including just the genus. Can anyone tell me the common name of Smilodons?
smilodon <- get_dataset(taxonname = 'Smilodon*')
## The API call was successful, you have returned 45 records.
head(smilodon)
## $`4564`
## A dataset for Blackwater Draw Loc. 1
## Accessed 2019-04-16 14:27h.
## dataset.id site.name long lat type
## 4564 Blackwater Draw Loc. 1 -103.3167 34.28333 vertebrate fauna
##
## $`4939`
## A dataset for Avery Island
## Accessed 2019-04-16 14:27h.
## dataset.id site.name long lat type
## 4939 Avery Island -91.75 29.86667 vertebrate fauna
##
## $`5366`
## A dataset for Clamp Cave
## Accessed 2019-04-16 14:27h.
## dataset.id site.name long lat type
## 5366 Clamp Cave -98.75 31.11667 vertebrate fauna
##
## $`5371`
## A dataset for Friesenhahn Cave
## Accessed 2019-04-16 14:27h.
## dataset.id site.name long lat type
## 5371 Friesenhahn Cave -98.36667 29.61667 vertebrate fauna
##
## $`5479`
## A dataset for First American Bank [40DV40]
## Accessed 2019-04-16 14:27h.
## dataset.id site.name long lat
## 5479 First American Bank [40DV40] -86.78333 36.18333
## type
## vertebrate fauna
##
## $`6060`
## A dataset for Rancho La Brea
## Accessed 2019-04-16 14:27h.
## dataset.id site.name long lat type
## 6060 Rancho La Brea -118.3568 34.0637 vertebrate fauna
*
mean in this code?get_dataset
function.I basically want to visualize everything that the database has to offer. Of course that would mean downloading the entire database, which I shouldn’t do because it is too big. I limited my question to an aspect of the dataset I was interested in: What is the distribution of animal samples through time?
My R workflow is always motivated by the question: How do I organize all the data that interests me into a data frame in the tidy format i.e. where each sample is a row.
The key to getting all the data I wanted was the get_data()
function, which allows me to download all the data for a given site. I wanted all the species information in the vertebrate fauna dataset, which is still a lot of data, but manageable.
We will skip the part of finding all the vetebrate data in Neotoma for now, but if you are interested please see this post on curiositydata.org.
Instead we will just read in the data.
## Read in previous data.
all_fauna_data <- read.csv("../data/all_fauna_data.csv")
Discussion Its important to “get a feel” for your data. Here are a few ways to do that. What do each of these funcyions do in plain english?
## Check out all the pretty data
str(all_fauna_data)
## 'data.frame': 44468 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ taxon.name : Factor w/ 1768 levels "?Alces alces",..: 127 157 190 244 248 762 839 916 1027 1061 ...
## $ variable.units : Factor w/ 8 levels "1-4 scale","grams",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ variable.element: Factor w/ 312 levels "antler","antorbital",..: 16 13 16 16 16 16 16 16 16 16 ...
## $ variable.context: Factor w/ 5 levels "articulated",..: NA NA NA NA NA NA NA NA NA NA ...
## $ taxon.group : Factor w/ 9 levels "Birds","Crustaceans undiff.",..: 6 1 6 6 6 6 6 6 6 6 ...
## $ site.id : int 3531 3531 3531 3531 3531 3531 3531 3531 3531 3531 ...
## $ site.name : Factor w/ 3372 levels "101 Ranch","10AA15",..: 768 768 768 768 768 768 768 768 768 768 ...
## $ long : num -106 -106 -106 -106 -106 ...
## $ lat : num 40.9 40.9 40.9 40.9 40.9 ...
## $ iteration : int 1 1 1 1 1 1 1 1 1 1 ...
## $ age.older : num 11980 11980 11980 11980 11980 ...
## $ age.younger : num 1450 1450 1450 1450 1450 1450 1450 1450 1450 1450 ...
head(all_fauna_data)
## X taxon.name variable.units variable.element variable.context
## 1 1 Antilocapra americana present/absent bone/tooth <NA>
## 2 2 Aves present/absent bone/bill <NA>
## 3 3 Bison sp. present/absent bone/tooth <NA>
## 4 4 Canis latrans present/absent bone/tooth <NA>
## 5 5 Canis lupus present/absent bone/tooth <NA>
## 6 6 Cynomys sp. present/absent bone/tooth <NA>
## taxon.group site.id site.name long lat iteration
## 1 Mammals 3531 Chimney Rock Animal Trap -105.75 40.86667 1
## 2 Birds 3531 Chimney Rock Animal Trap -105.75 40.86667 1
## 3 Mammals 3531 Chimney Rock Animal Trap -105.75 40.86667 1
## 4 Mammals 3531 Chimney Rock Animal Trap -105.75 40.86667 1
## 5 Mammals 3531 Chimney Rock Animal Trap -105.75 40.86667 1
## 6 Mammals 3531 Chimney Rock Animal Trap -105.75 40.86667 1
## age.older age.younger
## 1 11980 1450
## 2 11980 1450
## 3 11980 1450
## 4 11980 1450
## 5 11980 1450
## 6 11980 1450
dim(all_fauna_data)
## [1] 44468 13
Let’s visualize the question: What type of taxon.groups
do I have?
## Summarize the taxon.groups
ggplot(all_fauna_data, aes(taxon.group)) +
geom_histogram(stat = "count") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Looks like the Fauna dataset is mostly made up of mammals anyway, so I am just going to remove anything else to narrow my questions a bit.
## subset only mammals
mammals <- all_fauna_data %>%
filter(taxon.group == "Mammals")
# Compare size to see how many samples we removed?
dim(all_fauna_data)
## [1] 44468 13
dim(mammals)
## [1] 39739 13
Let’s visualize another question: What are the top taxa groups?
mammals %>%
count(taxon.name) %>%
top_n(35) %>%
ggplot(., aes(x = reorder(taxon.name, -n), y = n)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Taxon Name")
## Selecting by n
What are all these species?! With the exception of Bison bison (which is Bison or buffalo) and Homo sapiens, I really couldn’t tell you. To just explore a few of the other species I just Googled.
wp_content <- page_content("en", "wikipedia", page_name = mammals$taxon.name[1], as_wikitext = T)
wp_content$parse$wikitext
## This is what happens when there is no page
wp_content2 <- page_content("en","wikipedia", page_name = mammals$taxon.name[2],as_wikitext = T) ## page doesn't exist
wp_content3 <- page_content("en", "wikipedia", page_name = mammals$taxon.name[3], as_wikitext = T)
wp_content3$parse$wikitext
I used str
and head
to summarize our data, but now we want to know just a few more details about our data with a lot more style. So we are going to try out the fancy new skimr
package….yet another Ropensci package.
Anyway, check out all the information you can get from all_fauna_data
with skimr
’s main function skimr::skim
.
skim(all_fauna_data)
## Skim summary statistics
## n obs: 44468
## n variables: 13
##
## ── Variable type:factor ──────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique
## site.name 0 44468 44468 3372
## taxon.group 0 44468 44468 9
## taxon.name 0 44468 44468 1768
## variable.context 44260 208 44468 5
## variable.element 0 44468 44468 312
## variable.units 0 44468 44468 8
## top_counts ordered
## Dry: 292, Pec: 274, Zet: 245, Bry: 220 FALSE
## Mam: 39739, Bir: 2115, Rep: 1364, Fis: 1126 FALSE
## Odo: 1276, Ave: 1093, Cas: 981, Pro: 962 FALSE
## NA: 44260, int: 121, art: 44, red: 32 FALSE
## bon: 37079, bon: 3210, bon: 1244, bon: 932 FALSE
## pre: 17936, NIS: 14271, MNI: 12155, MNE: 100 FALSE
##
## ── Variable type:integer ─────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## iteration 0 44468 44468 1621.6 1133.84 1 594.75 1511
## site.id 0 44468 44468 5641.98 2839.89 244 4064 4865
## X 0 44468 44468 22234.5 12836.95 1 11117.75 22234.5
## p75 p100 hist
## 2539 3853 ▇▅▅▅▃▃▅▃
## 5760 16264 ▁▅▇▁▁▁▁▁
## 33351.25 44468 ▇▇▇▇▇▇▇▇
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## age.older 5654 38814 44468 15553.07 50860.17 29 1000
## age.younger 5654 38814 44468 5314.49 21743.28 0 350
## lat 3 44465 44468 39.66 7.15 -18.68 35.62
## long 3 44465 44468 -99.64 28.55 -176.62 -110.62
## p50 p75 p100 hist
## 2200 11135 1500000 ▇▁▁▁▁▁▁▁
## 950 6700 4e+05 ▇▁▁▁▁▁▁▁
## 38.87 42.77 82.5 ▁▁▁▁▇▂▁▁
## -99.12 -89 177.63 ▁▇▂▁▁▁▁▁
Emoji Love interlude: 😍😍😍 * 100. This package is amazing!!! Please read about all the features on the skimr
Github repo and the blog post describing the motivation and collaborative effort involved in making it. I just read it and I am bursting with love for the R community. Man. Now I just want to do a whole post on how 💣 bomb skimr
is.
Right away I see that I have enough information to start visualizing species distributions in time.
# I have a love / hate relationship with this first species
humans <- mammals %>%
filter(taxon.name == "Homo sapiens" & long > -160) #ignoring samples out in the middle of the ocean
# Start with world
world <- map_data("world")
ggplot() +
geom_polygon(data = world,
fill = "grey38",
aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_point(data = humans,
aes(x = long,
y = lat,
color = age.older),
alpha = .7, size = .7) +
scale_color_continuous(guide = guide_legend(title = "Age (years ago)")) +
guides(colour = guide_legend(override.aes = list(size = 3))) +
ggtitle("Geographic Distribution of Homo sapien Samples")
Alert: We notice that the sampling is skewed towards North America for sure! Why is this the case? More often than not, it is because of sampling bias, a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. Sometimes the bias is because of the scope of the database, which is the case here. Neotoma is a US data aggregation database. Other times, it can because of limitations in geographic area, actually there have been some studies showing sampling occurs closest to roads! Other times, it is taxonomic bias, or “taxonomic chauvinism”, which is a bias in which organisms are sampled. It is very important to know your data as best you can before making any conclusions.
So let’s zoom in on the United States since that is what Neotoma focuses on.
usa <- map_data("state")
ggplot() +
geom_polygon(data = usa,
fill = "grey38",
aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_point(data = humans,
aes(x = long,
y = lat,
color = age.older),
alpha = .7, size = .4) +
guides(colour = guide_legend(override.aes = list(size = 3))) +
ggtitle("Geographic Distribution of Homo sapien Samples")
Awesome! Looks like you can really see how the age of these samples are really skewed towards younger age samples. Which you can see clearly when you use skimr
.
skim(mammals$age.older)
##
## Skim summary statistics
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## mammals$age.older 3995 35744 39739 16256.64 52095.64 29 1000 2250
## p75 p100 hist
## 11470 1500000 ▇▁▁▁▁▁▁▁
skim(mammals$age.younger)
##
## Skim summary statistics
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## mammals$age.younger 3995 35744 39739 5523.97 22507.27 0 350 1000
## p75 p100 hist
## 7805 4e+05 ▇▁▁▁▁▁▁▁
Let’s try another another genus, like a genus of Saber Tooth Tigers, Smilodon.
smilodon <- mammals[grepl("Smilodon ", mammals$taxon.name),]
unique(smilodon$taxon.name) # looks like 5 unique categories.
## [1] Smilodon fatalis cf. Smilodon fatalis Smilodon sp.
## [4] cf. Smilodon floridanus Smilodon gracilis
## 1768 Levels: ?Alces alces ?Alces sp. ?Bison bison antiquus ... Zoarcoidei
usa <- map_data("state")
ggplot() +
geom_polygon(data = usa, fill = "grey30",
aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_point(data = smilodon,
aes(x = long, y = lat, color = taxon.name),
alpha = .9, size = .7) +
scale_color_brewer(palette = "Pastel1") + # I heart pastels
theme(legend.key = element_rect(fill = "white")) +
guides(colour = guide_legend(override.aes = list(size = 3))) +
ggtitle("Geographic Distribution of Smilodon Samples")
Ooooo Sabertooth Tigers seem to like warmer climates, that is, if North American climate during the times of these samples was anything like today’s climate, which it wasn’t really, but maybe an hypothesis to look into? You could even try and get climate temperature data and merge that in to go further into this hypothesis?
scale_color_brewer
, theme
, and guides
to customize your map.Now I can only go so far in exploring the data this way. The groupings of taxon is limited to species or genus, but what if I really want to see the distribution of higher order taxonomic groups. This is where taxize comes in.
taxize
taxize
is such a useful package. I cannot imagine doing work on biodiversiry data without this package. There have been a lot of people who have contributed to this package, but Scott Chamberlain (R wizard, Ropensci co-founder, and fellow cat lover) is the main person behind these efforts. Thanks Scott and all you other contributors!
The taxize
package provides a nice tutorial to get an overview of how to use the package. There are a lot of great functions to add meaning to any dataset with taxa/species names.
For us to understand how to get the most out of taxize
for this particular dataset, we should mess around with a smaller subset of species. In this case, let’s grab all the species we can find from the extinct genus Mammuthus (Mammoths).
The main reason we want to use this package is to group the taxon.name
information into something more digestible - for instance higher taxonomic groups (order, class, genus).
## Mammuthus
mammoth_test <- mammals[grepl("Mammuthus", mammals$taxon.name),]
unique(mammoth_test$taxon.name) # 16 unique Mammoth taxon names
## [1] Mammuthus columbi Mammuthus sp.
## [3] cf. Mammuthus columbi cf. Mammuthus jeffersonii
## [5] Mammuthus primigenius Mammuthus imperator
## [7] Mammuthus jeffersonii Mammuthus cf. M. columbi
## [9] cf. Mammuthus sp. Mammuthus exilis
## [11] Mammuthus cf. M. primigenius Mammuthus cf. M. jeffersonii
## [13] Mammuthus cf. M. imperator Mammuthus
## [15] ?Mammuthus sp. cf. Mammuthus primigenius
## 1768 Levels: ?Alces alces ?Alces sp. ?Bison bison antiquus ... Zoarcoidei
Note 1: The key to this package is that it is the glue that unites many databases and some of these databases require you to obtain an API key to access. See full list on the Taxize Github site.
Note 2: Omitted in this tutorial is where I basically played with many of the functions with different data sources. This package is so diverse because of the different databases it interacts with. Each database might have specific qualities for a data set. I would play with different functions with different data sources quite a bit if you plan on using taxize
for your own data set. We ended up using the Open Tree of Life (tol
) database as a data source.
Note: If you are interested in a new package that instead of using an API to a database, uses a a local database, check out taxadb
. It is still under development, but looks super promising.
The taxize::classification
function works great using just the word “Mammuthus” and the Tree of Life Database.
classification("Mammuthus", db = 'tol')
##
## Retrieving data for taxon 'Mammuthus'
## $Mammuthus
## name rank id
## 1 life no rank 805080
## 2 cellular organisms no rank 93302
## 3 Eukaryota domain 304358
## 4 Opisthokonta no rank 332573
## 5 Holozoa no rank 5246131
## 6 Metazoa kingdom 691846
## 7 Eumetazoa no rank 641038
## 8 Bilateria no rank 117569
## 9 Deuterostomia no rank 147604
## 10 Chordata phylum 125642
## 11 Craniata subphylum 947318
## 12 Vertebrata subphylum 801601
## 13 Gnathostomata superclass 278114
## 14 Teleostomi no rank 114656
## 15 Euteleostomi no rank 114654
## 16 Sarcopterygii class 458402
## 17 Dipnotetrapodomorpha no rank 4940726
## 18 Tetrapoda superclass 229562
## 19 Amniota no rank 229560
## 20 Mammalia class 244265
## 21 Theria subclass 229558
## 22 Eutheria no rank 683263
## 23 Afrotheria superorder 746703
## 24 Proboscidea order 226176
## 25 Elephantidae family 541924
## 26 Mammuthus genus 106255
##
## attr(,"class")
## [1] "classification"
## attr(,"db")
## [1] "tol"
But when we use the whole list, we run into taxa name problems. Some of the taxon.name
s from Neotoma do not work well with taxize
data sources. See below.
## Takes a min to run
## Does not like how my taxa.names are written
whole_mammoth_test <- classification(unique(mammoth_test$taxon.name), db = 'tol')
## Show which species did not work
as.data.frame(is.na(whole_mammoth_test))
Activity
In order to get the most out of taxize
we have to clean up the taxon.names
so that Neotoma and Tree of Life work friendlier together. I grabbed a regex I saw from a Github gist Simon Goring wrote gist to help with the issue of cleaning up taxon name data. See the longer version of this workshop (neotoma_tutorial_log.Rmd) or the original post to learn more on how the data was cleaned.
While wondering around the Neotoma database I found out that if you use neotoma::get_table
on the “taxa” dataset, you get a column that tells you if the animal is extinct or not. What is more interesting than extinct species? Nothing. Expecially in North America! There are Mammoths, huge armidillos, Saber Tooth Tigers, Giant Sloths and more!
Let’s try to get the taxonomic classifications of all the extinct animals.
First we subset by extinct taxa:
neotoma_taxa <- neotoma::get_table("taxa")
extinct_taxa <- neotoma_taxa %>%
filter(Extinct == "TRUE")
## Find the intersect of extinct taxa and our mammal dataset
extinct_species_in_mammals <- intersect(extinct_taxa$TaxonName, mammals$taxon.name)
## Great! over 350 of these extinct species are found in our
## mammal dataset.
skim(extinct_species_in_mammals)
##
## Skim summary statistics
##
## ── Variable type:character ───────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## extinct_species_in_mammals 0 382 382 6 39 0 382
Now let’s subset our mammals
dataset to include only extinct species.
extinct_mammals <- mammals %>%
filter(taxon.name %in% extinct_species_in_mammals)
## Check
## skim(extinct_mammals)
nrow(extinct_mammals)
## [1] 3012
3,0012 (number may change) bone samples in our data set come from extinct animals. Now that we have a nice subset of our species, let’s go ahead and use taxize
to merge in genus, order, and family details for each sample. So there is a bit more cleaning steps we will omit from this workshop today for times, but please see the longer version of this workshop (neotoma_tutorial_log.Rmd) or the original post to learn more on how the data was cleaned.
Below is the cleaned data set of all the extinct mammals bone samples found in the Neotoma database. So exciting. Let’s start visualizing to 1. make sure everything is correct 2. See if there are patterns in the data to start exploring further.
## Read in data frame
extinct_merged <- read.csv("../data/part2_extinct_merged_27March2018.csv")
extinct_merged %>%
group_by(order) %>%
count() %>%
ggplot(., aes(reorder(x = order, -n), y = n)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Extinct Mammals") +
ylab("# of samples") +
ggtitle("Number of Extinct Mammal Samples in Each Animal Order")
## Warning: Factor `order` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `order` contains implicit NA, consider using
## `forcats::fct_explicit_na`
There are 18 different orders of species, with the most abundant being Cetarticodactyla or even-toed undulates. There are also orders that are clearly not supposed to be there. I may not know animal taxonomic information very well, but for sure I know plants. Zingerberales is not supposed to be here!
I tried using taxize
to retrieve the common names, but in the end, I could not easily. I ended up just Googling each one and actually had wayyyyy too much fun doing it. I figured out which are animals and which are not and made a list for subsetting. It may not be programmatically reproducible, but that’s okay because I learned so much.
orders_to_keep <- c("Rodentia", "Proboscidea", "Pilosa", "Perissodactyla", "Lagomorpha", "Didelphimorphia", "Cingulata", "Chiroptera", "Cetartiodactyla", "Carnivora", "Anguilliformes", "Primates")
extinct_merged %>%
filter(order %in% orders_to_keep) %>%
group_by(order) %>%
count %>%
ggplot(., aes(reorder(x = order, -n), y = n)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Extinct Mammals") +
ylab("# of samples") +
ggtitle("Number of Extinct Mammal Samples in Each Animal Order")
So cool!
usa <- map_data("state")
extinct_merged %>%
filter(order %in% orders_to_keep) %>%
filter(lat < 50) %>%
ggplot(.) +
geom_polygon(data = usa,
fill = "grey30",
aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_point( aes(x = long, y = lat, color = order),
alpha = .7, size = .5) +
scale_color_brewer(palette = "Set3") +
theme(legend.key = element_rect(fill = "white")) +
guides(colour = guide_legend(override.aes = list(size = 3))) +
ggtitle("Geographic Distribution of Extinct Mammal Samples")
What is going on with the Animal Order Proboscidea?
We see some clustering in Michigan, let’s break that down a little bit to see what is going on.
Note: Proboscidea is the animal order in which elephants belong. Interestingly, the only Proboscidea two species left on earth are in Elephantidae 😭🐘
extinct_merged %>%
filter(order %in% orders_to_keep) %>%
filter(order == "Proboscidea") %>%
filter(lat < 50) %>%
ggplot(.) +
geom_polygon(data = usa,
fill = "grey30",
aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_point( aes(x = long, y = lat, color = genus ),
alpha = .7, size = .6) +
scale_color_brewer(palette = "Pastel1") +
theme(legend.key = element_rect(fill = "white")) +
guides(colour = guide_legend(override.aes = list(size = 3))) +
facet_grid(family~.) +
ggtitle("Geographic Distribution of Proboscidea Samples")
## Warning: Removed 6 rows containing missing values (geom_point).
Wow! This is awesome. I didn’t really know anything about Gomphothere before. This family is distinct from elephants and became extinct before Elephants and Mammoths / Mastodons, so it makes sense the scarcity of samples in the North America during these times. They were slowly replaced by modern elephants until their extinction around 10,000 years ago.
extinct_merged %>%
filter(order %in% orders_to_keep) %>%
filter(order == "Proboscidea") %>%
skim(age.older)
## Skim summary statistics
## n obs: 754
## n variables: 17
##
## ── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## age.older 162 592 754 57976.92 46002.23 1150 15000 35000 110000
## p100 hist
## 160000 ▇▃▁▁▁▇▁▁
That Michigan cluster of samples seems to be prodomidently from the Mammutidae Family.
extinct_merged %>%
filter(order %in% orders_to_keep) %>%
filter(family == "Mammutidae") %>%
filter(lat < 50) %>%
ggplot(.) +
geom_polygon(data = usa,
fill = "grey30",
aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) +
geom_point( aes(x = long, y = lat, color = taxon.name ),
alpha = .7, size = .7) +
scale_color_brewer(palette="Pastel1") +
theme(legend.key = element_rect(fill = "white")) +
guides(colour = guide_legend(override.aes = list(size = 3))) +
ggtitle("Geographic Distribution of Mammutidae Samples")
The majority of the taxa is made up of Mastodon samples Mammut americanum, which apparently were the most successful North American Proboscidea species.
I wanted to see if I could make sense of the habitat of these species or climate during the Pliocene to map onto the plot, but unfortunatley I could not find mappable data. I wish I had this paper’s data. This is a question I had many times during these this analysis. Does anyone know where we can get predicted ancient global temperature?
As we explore more databases we hope to gain a better understanding of the data available and how they can be combined to provide insights into our natural world.
Stay tuned for more exploration!