Mapping Proximity of Dinosaur and Plant Fossils

Mapping Proximity of Dinosaur and Plant Fossils

2019, May 28    
Zoe Liu

Zoe Liu


This post is a continuation of my first tutorial where I looked at the dinosaur fossil location data, in this post I will walk through other fossil types in the the Paleobiology Database (PBDB) - plants! We will be exploring how to map different types of fossils onto an interacitve map using plot.ly. We will also look at where and which plant species fossil are most common and introduce a strategy to find which dinosaur fossils are closest to plant fossils to understand the plant landscape and ecosystem in which specific dinosaur species lived (and possibly even ate!).


import pandas as pd
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None #ignore Setting with copy warnings
data_file = Path("data/part1", "dino_data.hdf")
interesting = pd.read_hdf(data_file, "interesting")
dino_plants = pd.read_hdf(data_file, "dino_plants")
dino_df = pd.read_hdf(data_file, "dino_df")

Now, let’s take the dinosaur and plant data we gathered above and combine them to see what we can find!

interesting.head() #our cleaned dinosaur data
accepted_name state lat lng environment created country early_interval late_interval interval
19 Theropoda Connecticut 41.566666 -72.633331 terrestrial indet. 2011-07-28 02:09:51 US Hettangian Sinemurian Jurassic
20 Camarasaurus grandis Colorado 39.068802 -108.699989 fluvial-lacustrine indet. 2017-11-02 14:56:21 US Kimmeridgian Tithonian Jurassic
21 Camarasaurus supremus Colorado 39.111668 -108.717499 fluvial-lacustrine indet. 2001-09-19 09:11:44 US Kimmeridgian NaN Jurassic
22 Ankylosaurus magniventris Montana 47.637699 -106.569901 terrestrial indet. 2001-09-19 10:03:19 US Maastrichtian NaN Cretaceous
23 Titanosauriformes Oklahoma 34.180000 -96.278053 coastal indet. 2005-08-25 14:56:00 US Late Aptian Early Albian Cretaceous
dino_plants.head() #our cleaned plant data
accepted_name state lat lng environment created country early_interval late_interval interval
96 Neocalamites New Mexico 35.200001 -105.783333 fluvial indet. 2001-06-08 06:53:35 US Late Triassic NaN Triassic
97 Brachyphyllum New Mexico 35.200001 -105.783333 fluvial indet. 2001-06-08 06:53:35 US Late Triassic NaN Triassic
98 Masculostrobus New Mexico 35.200001 -105.783333 fluvial indet. 2001-06-08 06:53:35 US Late Triassic NaN Triassic
99 Samaropsis New Mexico 35.200001 -105.783333 fluvial indet. 2001-06-08 06:46:34 US Late Triassic NaN Triassic
100 Samaropsis New Mexico 35.200001 -105.783333 fluvial indet. 2001-06-08 06:46:34 US Late Triassic NaN Triassic
dino_df.head() #our original dinosaur data
authorizer authorizer_no country collection_no county cx_int_no created modified max_ma reid_no ... srb sro regional_section stratscale state szn diference tec accepted_no accepted_name
0 M. Carrano prs:14 CA col:11890 NaN 39 2011-05-13 07:45:44 2011-05-12 16:46:47 83.5 rei:24752 ... 4.135 bottom to top Dinosaur Park bed Alberta NaN subjective synonym of NaN txn:53194 Gorgosaurus libratus
1 M. Carrano prs:14 CA col:11892 NaN 39 2001-09-18 13:58:56 2013-05-10 07:22:24 83.5 NaN ... NaN NaN NaN bed Alberta NaN NaN NaN txn:38755 Hadrosauridae
2 M. Carrano prs:14 CA col:11893 NaN 39 2010-07-27 08:36:19 2010-07-27 10:37:07 83.5 rei:23301 ... NaN NaN NaN bed Alberta NaN NaN NaN txn:53194 Gorgosaurus libratus
3 M. Carrano prs:14 CA col:11894 NaN 39 2001-09-18 14:04:56 2006-04-21 12:43:50 83.5 NaN ... 4.03 NaN Dinosaur Park bed Alberta NaN NaN NaN txn:63911 Centrosaurus apertus
4 M. Carrano prs:14 CA col:11895 NaN 39 2010-07-27 08:39:37 2010-07-27 10:39:58 83.5 rei:23302 ... NaN NaN NaN bed Alberta NaN NaN NaN txn:53194 Gorgosaurus libratus

5 rows × 61 columns

First, we would like to visualize our cleaned and accumulated data. We will use plotly to plot the data points onto a map of the United States.

import plotly
import plotly.plotly as py
import pandas as pd  
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
from ipywidgets import interactive, HBox, VBox, widgets, interact

plotly.offline.init_notebook_mode()

interesting["text"] = "name: " + interesting["accepted_name"].astype(str)
dinos = {'lat': interesting["lat"],
  'lon': interesting["lng"],
  'marker': {'color': 'rgb(116,0,217)',
   'line': {'color': 'rgb(40,40,40)', 'width': 0.5},
   'size': 2.700000000000003,
   'sizemode': 'diameter'},
  'text': interesting["text"],
  'type': 'scattergeo',
        "name": "Dinosaurs"}

dino_plants["text"] = "name: " + dino_plants["accepted_name"].astype(str)
plants = {'lat': dino_plants["lat"],
  'lon': dino_plants["lng"],
  'marker': {'color': 'rgb(0, 217, 108)',
   'line': {'color': 'rgb(40,40,40)', 'width': 0.5},
   'size': 2.700000000000003,
   'sizemode': 'diameter'},
  'text': dino_plants["text"],
  'type': 'scattergeo',
         "name": "Plants"}

era_range = ["Triassic", "Jurassic", "Cretaceous", "Paleogene", "Neogene", "Quaternary"]

for era in era_range:
    slider_step = {'args': [
            [era]
         ],
         'label': era,
         }

layout = go.Layout(
    title = "Dinosaurs and Plant Fossils of America (Triassic - Quaternary)",
    showlegend = True,
    geo = dict(
            scope='usa',
            projection=dict( type='albers usa'),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 255)",
            countrycolor="rgb(255, 255, 255)"
        ))
comp_data = [dinos, plants]

fig = go.Figure(layout=layout, data=comp_data)
iplot(fig, validate=False)

Above, we see all the dinosaur and plant fossils found in the United States! We notice a large clustering of data points in Central America. This is due to the fact that much of central US was covered by an inland sea in the Cretaceous era. Rivers covered dinosaur remains with sediment, preserving them as fossils, and the formation of the Rocky Mountains also aided in burying dinosaur remains and fossilizing them.

Dinosaur tracks left at Dinosaur Ridge, Colorado, one of the world’s most famous dinosaur fossil sites.

Hadrosaur Diet

From our previous analysis we saw that some of the most prominant dinosaurs in our dataset were Hadrosuars. There is a lot of debate on Hadrosaur diet, it is generally believed that teir diet consisted of vegetation. I thought it would be fun to examine Hadrosaur diet based on plant fossil presence in proximity to Hadrosaurs.

With our newly cleaned data, we can now answer questions we have using our data. For example, what did hadrosaur eat?. Again, this question has been a subject of debate among paleontologists over the past century. From the 1870’s to the 1960’s, it was widely believed that hadrosaurs were only suited to eat soft, aquatic plants. However, later research contended that hadrosaurs only ate land plants such as leaves and twigs. We will attempt to gain further insight on this debate using our data.

First, let’s visualize the spread of hadrosaur fossils. Maybe the location of their fossils can tell us something about what they ate.

interesting["text"] = "name: " + interesting["accepted_name"].astype(str)
hadro = interesting[interesting["accepted_name"] == "Hadrosauridae"]
hadros = {'lat': hadro["lat"],
  'lon': hadro["lng"],
  'marker': {'color': 'rgb(116,0,217)',
   'line': {'color': 'rgb(40,40,40)', 'width': 0.5},
   'size': 2.700000000000003,
   'sizemode': 'diameter'},
  'text': hadro["text"],
  'type': 'scattergeo',
        "name": "Dinosaurs"}

dino_plants["text"] = "name: " + dino_plants["accepted_name"].astype(str)
nm = dino_plants[dino_plants["state"] == "New Mexico"]
plants = {'lat': nm["lat"],
  'lon': nm["lng"],
  'marker': {'color': 'rgb(0, 217, 108)',
   'line': {'color': 'rgb(40,40,40)', 'width': 0.5},
   'size': 2.700000000000003,
   'sizemode': 'diameter'},
  'text': nm["text"],
  'type': 'scattergeo',
         "name": "Plants"}


layout = go.Layout(
    title = "Hadrosauridae Fossils of America (Triassic - Quaternary)",
    showlegend = True,
    geo = dict(
            scope='usa',
            projection=dict( type='albers usa'),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 255)",
            countrycolor="rgb(255, 255, 255)"
        ))

fig = go.Figure(layout=layout, data=[hadros])
iplot(fig, validate=False)

We notice that the majority of the hadrosaurs are located in Central North America. While we don’t know exactly the river and lakes that could have exsisted in this area during this time. We can rule out that hadrosaurs ate oceann aquatic plants due to the location of the fossils in landlocked areas.

Let’s examine the relationship between hadrosaur fossil coordinates and plant fossil coordinates. We’ll take the coordinates of the hadrosaur and plant fossils, zip them into a list, and add them to our dataframe. But first, let’s remove all instances of “Plantae” from our dino_plant dataframe, since Plantae is a broad generalization of any plant and isn’t very useful.

dino_plants = dino_plants[dino_plants["accepted_name"] != "Plantae"]
hadro["coords"] = list(zip(hadro.lat, hadro.lng))
dino_plants["coords"] = list(zip(dino_plants.lat, dino_plants.lng))
hadro.head()
accepted_name state lat lng environment created country early_interval late_interval interval text coords nearest_plant
26 Hadrosauridae Montana 47.695831 -106.227776 "channel" 2002-01-15 15:16:43 US Maastrichtian NaN Cretaceous name: Hadrosauridae (47.695831, -106.227776) Plantae
63 Hadrosauridae Wyoming 43.349400 -104.482002 "channel" 2002-07-10 20:49:32 US Maastrichtian NaN Cretaceous name: Hadrosauridae (43.3494, -104.482002) Celastrus
118 Hadrosauridae Texas 29.138056 -103.196945 fine channel fill 2002-07-10 20:49:32 US Late Campanian NaN Cretaceous name: Hadrosauridae (29.138056, -103.196945) Selaginella
147 Hadrosauridae Montana 48.966599 -112.650002 lacustrine - small 2002-07-10 20:49:32 US Campanian NaN Cretaceous name: Hadrosauridae (48.966599, -112.650002) Protophyllum
186 Hadrosauridae New Jersey 40.299999 -74.300003 estuary/bay 2002-08-20 14:43:52 US Judithian NaN Cretaceous name: Hadrosauridae (40.299999, -74.300003) Microaltingia apocarpela
hadro["interval"].value_counts()
Cretaceous    460
Name: interval, dtype: int64
hadro["interval"].value_counts().index[0]
'Cretaceous'

Because the hadrosaurs were only found in the Cretaceous era, we can filter our dino_plant dataframe to only contain plants in the Cretaceous era.

cret_plants = dino_plants[dino_plants["interval"] == "Cretaceous"]
cret_plants.head()
accepted_name state lat lng environment created country early_interval late_interval interval text coords
3175 Ginkgo Arizona 31.811390 -110.418053 fluvial indet. 2002-01-21 11:54:06 US Late Albian Early Cenomanian Cretaceous name: Ginkgo (31.81139, -110.418053)
3176 Cycadophyta Arizona 31.811390 -110.418053 fluvial indet. 2002-01-21 11:54:06 US Late Albian Early Cenomanian Cretaceous name: Cycadophyta (31.81139, -110.418053)
3579 Magnoliid Nebraska 40.056946 -97.182777 fluvial indet. 2002-06-07 13:31:38 US Early Cenomanian Middle Cenomanian Cretaceous name: Magnoliid (40.056946, -97.182777)
3580 Pandemophyllum Nebraska 40.056946 -97.182777 fluvial indet. 2002-06-07 13:59:40 US Early Cenomanian Middle Cenomanian Cretaceous name: Pandemophyllum (40.056946, -97.182777)

2776 rows × 12 columns

Next, we’ll use the KDTree, or a k-dimensional tree, data structure to find the nearest plant neighbor to each hadrosaurus fossil. Then, we’ll extract the accepted name of each plant and add it to our DataFrame for examination.

from scipy import spatial
closest = []
all_plants = cret_plants["coords"].values.tolist()
tree = spatial.KDTree(all_plants)
for coord in hadro["coords"]:
    query = tree.query([coord])
    closest += [query]
import numpy as np
hadro_plant = []
for i in np.arange(len(closest)):
    index = closest[i][1][0]
    hadro_plant += [cret_plants.iloc[index]["accepted_name"]]
hadro["nearest_plant"] = hadro_plant
hadro.head()
accepted_name state lat lng environment created country early_interval late_interval interval text coords nearest_plant
26 Hadrosauridae Montana 47.695831 -106.227776 "channel" 2002-01-15 15:16:43 US Maastrichtian NaN Cretaceous name: Hadrosauridae (47.695831, -106.227776) Azolla
63 Hadrosauridae Wyoming 43.349400 -104.482002 "channel" 2002-07-10 20:49:32 US Maastrichtian NaN Cretaceous name: Hadrosauridae (43.3494, -104.482002) Celastrus
118 Hadrosauridae Texas 29.138056 -103.196945 fine channel fill 2002-07-10 20:49:32 US Late Campanian NaN Cretaceous name: Hadrosauridae (29.138056, -103.196945) Selaginella
147 Hadrosauridae Montana 48.966599 -112.650002 lacustrine - small 2002-07-10 20:49:32 US Campanian NaN Cretaceous name: Hadrosauridae (48.966599, -112.650002) Protophyllum
186 Hadrosauridae New Jersey 40.299999 -74.300003 estuary/bay 2002-08-20 14:43:52 US Judithian NaN Cretaceous name: Hadrosauridae (40.299999, -74.300003) Microaltingia apocarpela

460 rows × 13 columns

Now let’s visually examine the nearest plant to hadrosaurs.

plt.figure(figsize=(12,6)) #create a figure that is 12 x 6
ax = sns.countplot(x='nearest_plant', data=hadro, order=hadro['nearest_plant'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Top 20 Nearest Plant to Hadrosauridae (Except Plantae)")
plt.show()

png

Our most common plant is of the clade tracheophyta, or vascular plants, which is a broad umbrella term for a number of land plants with xylems that conduct water throughout the plant body. Second on our list is coniferophyta, or cone-bearing trees like pines, cypresses, and spruces. This is phylogenetically a descendant of the tracheophyta, and it is plausible that some of the tracheophyta fossils found are potentially conifers. This abundance of land plants near the hadrosaur fossils supports the idea that hadrosaurs ate land plants like conifer needles and fruit. As a matter of fact, John Ostrom, a revolutionary paleontologist, showed in 1964 that there was little evidence of hadrosaurs having an aquatic diet. For one, aquatic plant pollen was very uncommon near the hadrosaur herds he was investigating (as we can see from our data above). This argument was further supported by a 1922 study of the gut contents of hadrosaurs, which found conifer needles, fruit, and seeds.

An example of a tracheophyte. Note the feathery fronds of the plant, which are vascular tissues that carry water throughout the plant.

Again, this is a very superficial way to explore Dinosaur diet, but let’s look into plant proximity to dinosaurs a bit broader. Now that we’ve explored what the hadrosaurs might have eaten based on the nearest fossilized plants, let’s expand this to all other dinosaurs. What did every type of dinosaurs eat based on proximity?

To answer this question, we have to know:

  1. The coordinates of every dinosaur, grouped by type
  2. The coordinates of every plant fossil originating from the same geologic period

Using KDTrees to find proximity of latitude and longitude points

KDtrees are a way to organize points in space and a very useful strategy for finding nearest neighbor searches. We can construct a KDTree and find the nearest plant fossil to each dinosaur. To make this easier, we can construct six KDTrees, each corresponding to a different geologic period, and run them on the dinosaurs in the same period.

First, let’s create a function that will filter out plant fossils by geologic period. This will aid us in creating the period-grouped KDTrees.

def make_period_plant(period):
    period_plants = dino_plants[dino_plants["interval"] == period]
    return period_plants
cret_plants = make_period_plant("Cretaceous")
jur_plants = make_period_plant("Jurassic")
tri_plants = make_period_plant("Triassic")
neo_plants = make_period_plant("Neogene")
paleo_plants = make_period_plant("Paleogene")
quat_plants = make_period_plant("Quaternary")

Next, we’ll construct a KDTree for each geologic period. Below, I’ve grouped the dinosaur dataframe interesting by interval, which will make it easier to iterate through every item in each group and apply every dinosaur data point to the constructed KDTree. At the end, we end up with six dataframes, with rows corresponding to each dinosaur fossil and their nearest plant fossil. We merge these together and our resulting dataframe, dinonplants, is the same as interesting, except with a nearest_plant column.

dinonplants = pd.DataFrame()
grouped = interesting.groupby("interval")
for interval, groups in grouped:
    if interval == "Cretaceous":
        plants = cret_plants
    elif interval == "Jurassic":
        plants = jur_plants
    elif interval == "Triassic":
        plants = tri_plants
    elif interval == "Neogene":
        plants = neo_plants
    elif interval == "Paleogene":
        plants = paleo_plants
    elif interval == "Quaternary":
        plants = quat_plants
    
    print("Calculating: ", interval)
    values = grouped.groups.get(interval)
    new_df = interesting[interesting.index.isin(values)]
    
    new_df["coords"] = list(zip(new_df.lat, new_df.lng))
    plants["coords"] = list(zip(plants.lat, plants.lng))
    closest = []
    all_plants = plants["coords"].values.tolist()
    tree = spatial.KDTree(all_plants)
    for coord in new_df["coords"]:
        query = tree.query([coord], k=2)
        closest += [query[1][0][0]] #list of nearest plant indices
    nearest_plant = []
    for i in closest:
        nearest_plant += [plants.iloc[i]["accepted_name"]]
    new_df["nearest_plant"] = nearest_plant
    dinonplants = dinonplants.append(new_df, ignore_index=False)

Calculating:  Cretaceous
Calculating:  Jurassic
Calculating:  Neogene
Calculating:  Paleogene
Calculating:  Quaternary
Calculating:  Triassic
dinonplants = dinonplants.sort_index() #this sorts the merged dataframe by increasing index 
dinonplants.head()
accepted_name state lat lng environment created country early_interval late_interval interval text coords nearest_plant
19 Theropoda Connecticut 41.566666 -72.633331 terrestrial indet. 2011-07-28 02:09:51 US Hettangian Sinemurian Jurassic name: Theropoda (41.566666, -72.633331) Baiera
20 Camarasaurus grandis Colorado 39.068802 -108.699989 fluvial-lacustrine indet. 2017-11-02 14:56:21 US Kimmeridgian Tithonian Jurassic name: Camarasaurus grandis (39.068802, -108.699989) Tracheophyta
21 Camarasaurus supremus Colorado 39.111668 -108.717499 fluvial-lacustrine indet. 2001-09-19 09:11:44 US Kimmeridgian NaN Jurassic name: Camarasaurus supremus (39.111668, -108.717499) Tracheophyta
22 Ankylosaurus magniventris Montana 47.637699 -106.569901 terrestrial indet. 2001-09-19 10:03:19 US Maastrichtian NaN Cretaceous name: Ankylosaurus magniventris (47.637699, -106.569901) Azolla
23 Titanosauriformes Oklahoma 34.180000 -96.278053 coastal indet. 2005-08-25 14:56:00 US Late Aptian Early Albian Cretaceous name: Titanosauriformes (34.18, -96.278053) Dichastopollenites

9143 rows × 13 columns

Now that we’ve obtained the nearest plant to every individual fossil, let’s group these fossils by name and pick out the highest occurring nearest plant. What we end up with is a table that shows every type of dinosaur fossil and their most frequently seen nearest plant.

vals = dinonplants.groupby("accepted_name")["nearest_plant"].agg(lambda x:x.value_counts().index[0])
dinos_nearest_plant = vals.to_frame()
dinos_nearest_plant.head()
nearest_plant
accepted_name
Abydosaurus mcintoshi Magnoliopsida
Accipiter cooperii Vitis
Accipiter gentilis Pinus
Accipiter striatus Carya
Accipiter striatus velox Zosteraceae

As a sanity check, let’s make sure that the nearest plant to Hadrosauridae is the Tracheophyta (which we found in our exercise on hadrosaur diet above).

dinos_nearest_plant.loc["Hadrosauridae"]
nearest_plant    Tracheophyta
Name: Hadrosauridae, dtype: object

Great, our code looks correct. Now let’s perform some visual analysis on the data we collected. What plants were every type of dinosaur eating the most?

plt.figure(figsize=(12,6)) #create a figure that is 12 x 6
ax = sns.countplot(x='nearest_plant', data=dinonplants, order=dinonplants['nearest_plant'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Nearest Plant (Except Plantae) to Each Type of Dinosaur")
plt.show()

png

Pinus, or pine trees, equisetum, or horsetail grass, and zosteraceae, a type of seagrass, are the most commonly found plants. These findings aren’t too surprising. Paleontologists have known for decades that herbivorous dinosaurs fed on plants like pine needles, evergreen conifers (such as redwoods, and pine trees), horsetail grass, and seagrasses.

Horsetail grass, or equisetum, are considered living fossils due to the fact that they haven’t changed in the past 100 million years. In the age of the dinosaurs, they could grow up to 30 meters tall, though they typically only grow 1 m tall in the modern age.

Now let’s visualize some period-specific dinosaur and plant relations.

period_plants = dinonplants.groupby(["accepted_name", "interval"])["nearest_plant"].agg(lambda x:x.value_counts().index[0])
period_plant_df = period_plants.to_frame()
period_plant_df.head()
nearest_plant
accepted_name interval
Abydosaurus mcintoshi Cretaceous Magnoliopsida
Accipiter cooperii Quaternary Vitis
Accipiter gentilis Quaternary Pinus
Accipiter striatus Quaternary Carya
Accipiter striatus velox Quaternary Zosteraceae
cret_dinonplants = period_plant_df[period_plant_df.index.get_level_values(1) == "Cretaceous"]
plt.figure(figsize=(12,6)) #create a figure that is 12 x 6
ax = sns.countplot(x='nearest_plant', data=cret_dinonplants, order=cret_dinonplants['nearest_plant'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Nearest Plant (Except Plantae) to Each Type of Dinosaurs in Cretaceous Era")
plt.show()

png

This graph shows much of the same trends as the chart showing plants closest to hadrosaurs, which makes sense, as hadrosaurs were common in the Cretaceous period.

quat = period_plant_df[period_plant_df.index.get_level_values(1) == "Quaternary"]
plt.figure(figsize=(12,6)) #create a figure that is 12 x 6
ax = sns.countplot(x='nearest_plant', data=quat, order=quat['nearest_plant'].value_counts().index.tolist()[:20])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.title("Nearest Plant (Except Plantae) to Each Type of Dinosaurs in Quaternary Period")
plt.show()

png

In the Quaternary period, we start moving into familiar territory. The dinosaurs in this period (which have now evolved to be the birds we know today) were around common plants like pine trees (pinus) and grape vines (vitis).

In Conclusion

In this section, we visualized our plant and dinsoaur data by plotting them onto a map. We were able to visually comprehend the spatial relationship between plant and dinosaur fossils, and saw that central USA was a popular dinosaur graveyard. With our cleaned data, we were able to answer questions like “what did hadrosaur eat?” by performing a nearest neighbor search of every hadrosaur fossil on plant fossils in the same geologic period. We saw that hadrosaurs were likely feeding on conifers due to the abundance of conferophyta and land plant fossils. We were able to scale up our question by looking at the likely diets of every other type of dinosaur. We ran six different KDTrees, corresponding to the geologic periods, and found the nearest plant fossil to every dinosaur fossil in each. By looking at the most common nearest plant to every dinosaur fossil, we were able to make a guess that the dinosaur in question was feeding on that plant. Some things to keep in mind:

  1. Our analysis did not account for carnivorous dinosaurs, since we assumed all dinosaurs eat plants. Our question can be rephrased to “what is the spatial relationship between dinosaur fossils and plant fossils?” instead, if we cared about this difference.
  2. Visualization is a fantastic way of guiding research questions. Think back to how we second guessed the hadrosaur diet of sea plants based on the spread of hadrosaur fossils in land-locked areas. Visualizations can be just as important as more math-based methods when it comes to analytics.

Thank you for reading, and I hope this tutorial has clarified the methods of exploratory analysis, data cleaning, and data analysis.