Measuring the interactions of species with GloBI

2019, Feb 07    
Yikang Li

I am Yikang Li, an international student from Tianjin, China. I recently graduated from UC Berkeley with a BA in Statistics. I have long been interested in data science and have therefore taken every opportunity to join projects that improve my practical skills in data mining. This interest led me to explore biodiversity data with the Cabinet of Curiosity Team. During this research experience, I, together with other interns, studied different natural history databases and reviewed each other’s work using GitHub. During our first week together we explored what types of data were available to us and created a brief list with summaries of the databases we could find. We were each given the option to explore one database of our choosing (link). I chose Global Biotic Interactions (GloBI), globalbioticinteractions.org, a biodiversity database which, as the name implies, collates species interaction data from around the world.

GloBI does a fantastic job of explaining itself:

Global Biotic Interactions (GloBI) provides open access to species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software. By providing an infrastructure to capture and share interaction data, individual biologists can focus on gathering new interaction data and analyzing existing datasets without having to spend resources on (re-) building a cyberinfrastructure to do so.

I was initially interested in the GloBI database because it provides data in the form of interactions, which sets it apart from other biodiversity databases. Instead of focusing on one species at a time, it connects species by describing the interactions between them. My work will be posted in two parts: (1) accessing and understanding the data, and (2) exploring the data with network visualizations. Hope you enjoy!

How to access GloBI

To start, we must get access to the data! I will discuss a few options that I tried.

1. The R package rglobi

There is an R package called rglobi that allows us to access the Global Biotic Interactions (GloBI) database. Here is a description from the package documentation:

A programmatic interface to the web service methods provided by Global Biotic Interactions (GloBI). GloBI provides access to spatial-temporal species interaction records from sources all over the world. rglobi provides methods to search species interactions by location, interaction type, and taxonomic name. In addition, it supports Cypher, a graph query language, to allow for executing custom queries on the GloBI aggregate species interaction data set.

To use its methods and functions, we need to install and load the package “rglobi” in R.

install.packages("rglobi")
library(rglobi)

Users can search species interaction data by location, interaction type, taxonomic name and so on. Please check out the rglobi vignette to learn more about the package. While the R package provides built-in methods and functions, it limits the amount of data returned by default. See the Pagination section of the vignette to understand the limits: https://github.com/ropensci/rglobi/blob/master/vignettes/rglobi_vignette.Rmd#L410

By default, the number of results is limited. If you’d like to retrieve all results, you can use pagination. For instance, to retrieve parasitic interactions using pagination, you can use:

otherkeys = list("limit"=10, "skip"=0)
first_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
otherkeys = list("limit"=10, "skip"=10)
second_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)

Basically, to exhaust all available interactions, you can keep paging through results until the size of a page is less than the limit (e.g., nrow(interactions) < limit).

2. GloBI API

Another way to access the GloBI data is through the API: https://github.com/jhpoelen/eol-globi-data/wiki/API. The page linked above documents the API, which provides access to interaction data for the purpose of integrating it into wikis, custom webpages or other interaction exploration tools.

3. Download Everything

The third option is to download the whole dataset directly at https://www.globalbioticinteractions.org/data. Datasets are available to download in different formats including tsv, csv and N-Quads/RDF. I chose the .tsv version.

Basic data exploration and characteristics

I ended up choosing option 3 and explored the dataset with Python in the Jupyter notebook environment. One reason is that I didn’t want to be limited by the built-in functions of the rglobi package. Importing the whole dataset allows me to explore it in whatever way I want. Also, with option 3 I work from the same dataset every time, so the results are fully reproducible.

If you would like to follow along in a Jupyter notebook, please check out the notebook here: Notebook. You will first need to download the interactions.tsv file here: interactions.tsv.gz.
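
If you would rather not unzip the download or hold the whole table in memory, here is a sketch, using only the Python standard library, that streams rows straight from the gzipped file. The function name `iter_interactions` and the tiny sample file are my own illustration, and I am assuming the fields are plain tab-separated text without quoting.

```python
import csv
import gzip
import os
import tempfile
from typing import Dict, Iterator


def iter_interactions(path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time from a gzipped, tab-separated file."""
    # "rt" reads the gzip stream as text, so nothing is unzipped to disk.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: assumes fields are never wrapped in quote characters.
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


# Build a two-row sample file so the sketch can run without the real download.
sample = (
    "sourceTaxonName\tinteractionTypeName\ttargetTaxonName\n"
    "Desmodus rotundus\teats\tBos taurus\n"
    "Carollia perspicillata\teats\tSolanum\n"
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.tsv.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as out:
        out.write(sample)
    rows = list(iter_interactions(sample_path))

print(len(rows), rows[0]["targetTaxonName"])
```

Streaming trades convenience for memory: you lose the DataFrame methods used in the rest of this post, but you can filter the rows you care about before loading them into pandas.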

%matplotlib inline
import pandas as pd
# Takes a few minutes to load.
# If following along, please download and unzip interactions.tsv.gz from
# https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/interactions.tsv.gz
# The unzipped file is ~6.5 GB
# Don't forget to change the path to the file on your computer

data = pd.read_csv('~/Desktop/interactions.tsv', delimiter='\t', encoding='utf-8')
/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (21,22,23,24,25,26,27,28,29,30,41,42,43,44,45,46,47,48,49,50,55,58,59,60,61,62,63,64,65,68,69,72,73,78) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
# See the first few rows
data.head()
sourceTaxonId sourceTaxonIds sourceTaxonName sourceTaxonRank sourceTaxonPathNames sourceTaxonPathIds sourceTaxonPathRankNames sourceTaxonSpeciesName sourceTaxonSpeciesId sourceTaxonGenusName ... eventDateUnixEpoch argumentTypeId referenceCitation referenceDoi referenceUrl sourceCitation sourceNamespace sourceArchiveURI sourceDOI sourceLastSeenAtUnixEpoch
0 EOL:4472733 EOL:4472733 | EOL:4472733 Deinosuchus genus Deinosuchus EOL:4472733 genus NaN NaN Deinosuchus ... NaN https://en.wiktionary.org/wiki/support Rivera-Sylva H.E., E. Frey and J.R. Guzmán-Gui... 10.4267/2042/28152 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
1 EOL:4433651 EOL:4433651 | EOL:4433651 Daspletosaurus genus Daspletosaurus EOL:4433651 genus NaN NaN Daspletosaurus ... NaN https://en.wiktionary.org/wiki/support doi:10.1666/0022-3360(2001)075<0401:GCFACT>2.0... 10.1666/0022-3360(2001)075<0401:GCFACT>2.0.CO;2 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
2 EOL_V2:24210058 EOL_V2:24210058 | OTT:3617018 | GBIF:4975216 |... Repenomamus robustus species Eucarya | Opisthokonta | Metazoa | Eumetazoa |... EOL:5610326 | EOL:2910700 | EOL:42196910 | EOL... | | subkingdom | | | | | | | | | supe... Repenomamus robustus EOL_V2:24210058 Repenomamus ... NaN https://en.wiktionary.org/wiki/support doi:10.1038/nature03102 10.1038/nature03102 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
3 EOL:4433892 EOL:4433892 | EOL:4433892 Sinocalliopteryx gigas species Sinocalliopteryx gigas EOL:4433892 species Sinocalliopteryx gigas EOL:4433892 NaN ... NaN https://en.wiktionary.org/wiki/support doi:10.1371/journal.pone.0044012 10.1371/journal.pone.0044012 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
4 EOL:4433892 EOL:4433892 | EOL:4433892 Sinocalliopteryx gigas species Sinocalliopteryx gigas EOL:4433892 species Sinocalliopteryx gigas EOL:4433892 NaN ... NaN https://en.wiktionary.org/wiki/support doi:10.1371/journal.pone.0044012 10.1371/journal.pone.0044012 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z

5 rows × 80 columns

# Check the number of rows
len(data)
3456395
# How many columns?
len(data.columns)
80
# What are the 80 columns of this dataset?
data.columns
Index(['sourceTaxonId', 'sourceTaxonIds', 'sourceTaxonName', 'sourceTaxonRank',
       'sourceTaxonPathNames', 'sourceTaxonPathIds',
       'sourceTaxonPathRankNames', 'sourceTaxonSpeciesName',
       'sourceTaxonSpeciesId', 'sourceTaxonGenusName', 'sourceTaxonGenusId',
       'sourceTaxonFamilyName', 'sourceTaxonFamilyId', 'sourceTaxonOrderName',
       'sourceTaxonOrderId', 'sourceTaxonClassName', 'sourceTaxonClassId',
       'sourceTaxonPhylumName', 'sourceTaxonPhylumId',
       'sourceTaxonKingdomName', 'sourceTaxonKingdomId', 'sourceId',
       'sourceOccurrenceId', 'sourceCatalogNumber', 'sourceBasisOfRecordId',
       'sourceBasisOfRecordName', 'sourceLifeStageId', 'sourceLifeStageName',
       'sourceBodyPartId', 'sourceBodyPartName', 'sourcePhysiologicalStateId',
       'sourcePhysiologicalStateName', 'interactionTypeName',
       'interactionTypeId', 'targetTaxonId', 'targetTaxonIds',
       'targetTaxonName', 'targetTaxonRank', 'targetTaxonPathNames',
       'targetTaxonPathIds', 'targetTaxonPathRankNames',
       'targetTaxonSpeciesName', 'targetTaxonSpeciesId',
       'targetTaxonGenusName', 'targetTaxonGenusId', 'targetTaxonFamilyName',
       'targetTaxonFamilyId', 'targetTaxonOrderName', 'targetTaxonOrderId',
       'targetTaxonClassName', 'targetTaxonClassId', 'targetTaxonPhylumName',
       'targetTaxonPhylumId', 'targetTaxonKingdomName', 'targetTaxonKingdomId',
       'targetId', 'targetOccurrenceId', 'targetCatalogNumber',
       'targetBasisOfRecordId', 'targetBasisOfRecordName', 'targetLifeStageId',
       'targetLifeStageName', 'targetBodyPartId', 'targetBodyPartName',
       'targetPhysiologicalStateId', 'targetPhysiologicalStateName',
       'decimalLatitude', 'decimalLongitude', 'localityId', 'localityName',
       'eventDateUnixEpoch', 'argumentTypeId', 'referenceCitation',
       'referenceDoi', 'referenceUrl', 'sourceCitation', 'sourceNamespace',
       'sourceArchiveURI', 'sourceDOI', 'sourceLastSeenAtUnixEpoch'],
      dtype='object')

How many different taxa appear as sources and targets?

You can see that many of the columns start with either “source” or “target”. Columns that start with “source” describe the organism or group of organisms that acts upon the “target” organism. These columns are different ways of describing those organisms. The TaxonId columns link the organisms to an established database of organisms such as the Encyclopedia of Life. The great thing about these columns is that they hold unique IDs.

Let’s check how many unique organisms or organism groups there are in GloBI.

# Source taxon
len(data['sourceTaxonId'].unique())
147510
#Target taxon
len(data['targetTaxonId'].unique())
106613
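
To see which external databases those IDs point to, one option is to count the prefix in front of each ID. This is only a sketch: I am assuming the text before the first colon names the source database (as the sample rows with EOL and GBIF suggest), and `count_id_prefixes` is a helper name I made up. With the real data you would pass in something like `data['sourceTaxonId'].dropna().unique()`.

```python
from collections import Counter
from typing import Iterable


def count_id_prefixes(taxon_ids: Iterable[str]) -> Counter:
    """Count taxon IDs by the prefix before the first colon (e.g. 'EOL')."""
    counts: Counter = Counter()
    for taxon_id in taxon_ids:
        prefix, sep, _ = taxon_id.partition(":")
        # IDs without a colon are grouped under 'unknown'.
        counts[prefix if sep else "unknown"] += 1
    return counts


# A few IDs copied from the output above, plus one malformed value for illustration.
sample_ids = ["EOL:4472733", "EOL:4433651", "EOL_V2:24210058", "GBIF:3189394", "no-prefix"]
prefix_counts = count_id_prefixes(set(sample_ids))
print(prefix_counts.most_common())
```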

What interaction types are there?

The source and target organisms are connected by the way in which they interact. This is described by the interaction columns, whose values fall into 37 interaction types.

data['interactionTypeName'].unique()
array(['eats', 'preysOn', 'interactsWith', 'pollinates', 'parasiteOf',
       'pathogenOf', 'visitsFlowersOf', 'adjacentTo', 'dispersalVectorOf',
       'hasHost', 'endoparasitoidOf', 'symbiontOf', 'endoparasiteOf',
       'hasVector', 'ectoParasiteOf', 'vectorOf', 'livesOn', 'livesNear',
       'parasitoidOf', 'guestOf', 'livesInsideOf', 'farms',
       'ectoParasitoid', 'inhabits', 'kills', 'hasDispersalVector',
       'livesUnder', 'kleptoparasiteOf', 'hostOf', 'eatenBy',
       'flowersVisitedBy', 'preyedUponBy', 'hasParasite', 'pollinatedBy',
       'visits', 'commensalistOf', 'hasPathogen'], dtype=object)
# number of different types of interaction
len(data['interactionTypeName'].unique())
37
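
Several of these types look like two directions of the same relationship (for example eats and eatenBy). If you want to compare records regardless of direction, one approach is to rewrite the passive form into the active one. The sketch below does that; the pairing in `INVERSE_OF` is my own assumption based purely on the names in the list above, not something taken from GloBI’s documentation, and it covers only a few of the types.

```python
from typing import Tuple

# Assumed inverse pairs, inferred only from the type names in the list above.
INVERSE_OF = {
    "eatenBy": "eats",
    "preyedUponBy": "preysOn",
    "pollinatedBy": "pollinates",
    "flowersVisitedBy": "visitsFlowersOf",
    "hasParasite": "parasiteOf",
    "hasPathogen": "pathogenOf",
}

Triple = Tuple[str, str, str]


def to_active_form(source: str, interaction: str, target: str) -> Triple:
    """Rewrite passive records (e.g. 'eatenBy') so the actor comes first."""
    if interaction in INVERSE_OF:
        # Swap source and target and use the active interaction name.
        return target, INVERSE_OF[interaction], source
    return source, interaction, target


# Both calls describe the same relationship and yield the same triple.
print(to_active_form("Bos taurus", "eatenBy", "Desmodus rotundus"))
print(to_active_form("Desmodus rotundus", "eats", "Bos taurus"))
```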

Each record in GloBI comes from a specific dataset. One of the great things about GloBI is its transparency about exactly where the data comes from. GloBI has a system that gathers information from its sources on a daily basis. Because of this, a data provider can fix a mistake on their end and, without further intervention, GloBI will incorporate the change into its dataset. You can tell the source of the data from a few columns, but especially interesting is the sourceNamespace column, which shows the exact place on GitHub where the data comes from.

# Top 10 data sources ranked by amount of records contributed to GloBI
data['sourceNamespace'].value_counts().head(10)
globalbioticinteractions/fishbase                                            504260
globalbioticinteractions/arthropodEasyCaptureAMNH                            350213
millerse/Wardeh-et-al.-2015                                                  271904
globalbioticinteractions/natural-history-museum-london-interactions-bank     242429
millerse/Dapstrom-integrated-database-and-portal-for-fish-stomach-records    225564
globalbioticinteractions/ices                                                183935
EOL/pseudonitzchia                                                           183773
globalbioticinteractions/noaa-reem                                           122328
millerse/US-National-Parasite-Collection                                      99713
globalbioticinteractions/roopnarine                                           96647
Name: sourceNamespace, dtype: int64

To see where GloBI is getting this data from, simply append the namespace (the first column above) to github.com/.

Example: The largest contributor appears to be FishBase: github.com/globalbioticinteractions/fishbase. You can also check the status of GloBI’s integration with its data sources here: https://www.globalbioticinteractions.org/status.html.
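
If you want to turn many namespaces into links at once, a tiny helper is enough. This is a sketch; `namespace_to_github_url` is a name I made up, and it assumes every namespace follows the owner/repository pattern seen in the table above.

```python
def namespace_to_github_url(namespace: str) -> str:
    """Build the GitHub URL for a sourceNamespace value such as 'owner/repo'."""
    # Strip stray slashes so the result never contains '//' after the host.
    return "https://github.com/" + namespace.strip("/")


for name in ["globalbioticinteractions/fishbase", "millerse/Wardeh-et-al.-2015"]:
    print(namespace_to_github_url(name))
```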

I’m interested in how many unique interaction records are found in GloBI. Many of the columns describe the type of organism involved, but the most interesting columns, and really the heart of the database, are sourceTaxonId, interactionTypeName, and targetTaxonId. With these three columns you can see what an organism interacts with and how.

data[['sourceTaxonId', 'interactionTypeName', 'targetTaxonId', 'sourceTaxonName']].head(10)
sourceTaxonId interactionTypeName targetTaxonId sourceTaxonName
0 EOL:4472733 eats EOL_V2:42417811 Deinosuchus
1 EOL:4433651 eats EOL_V2:42417811 Daspletosaurus
2 EOL_V2:24210058 eats EOL:4532049 Repenomamus robustus
3 EOL:4433892 eats EOL_V2:4433896 Sinocalliopteryx gigas
4 EOL:4433892 eats EOL:4433563 Sinocalliopteryx gigas
5 EOL:4433551 eats EOL:42331729 Microraptor gui
6 EOL:4531246 eats EOL_V2:4530741 Baryonyx walkeri
7 EOL:4531246 eats EOL:4653801 Baryonyx walkeri
8 EOL:4433582 eats EOL_V2:4531936 Deinonychus antirrhopus
9 EOL:4433881 preysOn EOL:4518630 Compsognathus longipes
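
To count unique interaction records, you can treat each (source, interaction, target) combination as one triple and collapse repeats. Here is a minimal sketch on a handful of triples copied from the table above, with one deliberate repeat added for illustration. On the full DataFrame, the pandas method `drop_duplicates` on those three columns should give the same kind of result.

```python
from collections import Counter

# Triples taken from the table above; the last one is a deliberate repeat.
records = [
    ("EOL:4433892", "eats", "EOL_V2:4433896"),
    ("EOL:4433892", "eats", "EOL:4433563"),
    ("EOL:4531246", "eats", "EOL_V2:4530741"),
    ("EOL:4531246", "eats", "EOL:4653801"),
    ("EOL:4433881", "preysOn", "EOL:4518630"),
    ("EOL:4433892", "eats", "EOL:4433563"),
]

# A set keeps each distinct triple exactly once.
unique_triples = set(records)

# A Counter additionally tells us how often each triple was recorded.
triple_counts = Counter(records)

print(len(records), len(unique_triples))
print(triple_counts[("EOL:4433892", "eats", "EOL:4433563")])
```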

How to search by Organism - Sanity check with bats

There are many columns that describe the species, genus, order and so on, and you can search by any of them. One of the main ways researchers will want to use this data is to find the records for the species or taxa they are interested in. If you want to search for a specific taxon, you can simply search with a string containing the organism’s name. I explored this feature a bit as a sanity check, to see whether the data makes sense.

I chose to search for a few types of bats and browse the results to see if they made sense. First, I looked at what Carollia, a genus of short-tailed fruit bats, eats.

# Subset to rows whose source taxon name contains 'Carollia'
carollia = data[data['sourceTaxonName'].str.contains('Carollia', na=False)]

# Keep only the records describing what Carollia eats
carollia = carollia.loc[carollia.interactionTypeName == 'eats']

# Show only relevant columns
carollia[['sourceTaxonName','sourceTaxonId', 'interactionTypeName', 'targetTaxonName','targetTaxonId']].head(10)
sourceTaxonName sourceTaxonId interactionTypeName targetTaxonName targetTaxonId
785626 Carollia perspicillata EOL:327438 eats Terminalia catappa GBIF:3189394
785631 Carollia perspicillata EOL:327438 eats Terminalia catappa GBIF:3189394
785683 Carollia perspicillata EOL:327438 eats Syzygium malaccense EOL:2508662
785688 Carollia perspicillata EOL:327438 eats Syzygium jambos EOL:2508661
785693 Carollia perspicillata EOL:327438 eats Syzygium jambos EOL:2508661
785727 Carollia perspicillata EOL:327438 eats Syzygium cumini EOL:2508660
785900 Carollia perspicillata EOL:327438 eats Spondias EOL:61097
785963 Carollia perspicillata EOL:327438 eats Solanum EOL:590245
785968 Carollia perspicillata EOL:327438 eats Solanum EOL:590245
785970 Carollia perspicillata EOL:327438 eats Solanum EOL:590245

From the output above you can see that Carollia perspicillata eats yummy things like Terminalia catappa, which is some type of nut, and Syzygium malaccense, some apple-like fruit. Seems right.
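
Since this kind of lookup gets repeated for every organism, it can be wrapped in a small function. The sketch below works on plain dictionaries rather than the DataFrame so it runs on its own; `targets_of` is a hypothetical helper name, the sample rows are copied from the outputs in this post, and unlike `str.contains` above it matches case-insensitively and treats the search term as plain text rather than a pattern.

```python
from typing import Dict, Iterable, List, Set


def targets_of(rows: Iterable[Dict[str, str]], taxon: str, interactions: Set[str]) -> List[str]:
    """Return sorted target names for sources matching `taxon` and `interactions`."""
    needle = taxon.lower()
    found = {
        row["targetTaxonName"]
        for row in rows
        # Missing names are treated as empty strings, like na=False in pandas.
        if needle in (row.get("sourceTaxonName") or "").lower()
        and row.get("interactionTypeName") in interactions
    }
    return sorted(found)


# Sample rows copied from the outputs in this post.
sample_rows = [
    {"sourceTaxonName": "Carollia perspicillata", "interactionTypeName": "eats", "targetTaxonName": "Terminalia catappa"},
    {"sourceTaxonName": "Carollia perspicillata", "interactionTypeName": "eats", "targetTaxonName": "Syzygium jambos"},
    {"sourceTaxonName": "Carollia perspicillata", "interactionTypeName": "eats", "targetTaxonName": "Solanum"},
    {"sourceTaxonName": "Carollia perspicillata", "interactionTypeName": "eats", "targetTaxonName": "Solanum"},
    {"sourceTaxonName": "Desmodus rotundus", "interactionTypeName": "eats", "targetTaxonName": "Bos taurus"},
]

print(targets_of(sample_rows, "Carollia", {"eats"}))
print(targets_of(sample_rows, "desmodus", {"eats", "preysOn"}))
```

Passing the interaction types as a set makes it easy to widen the search, for example to both eats and preysOn, without touching the function.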

Now let’s try another type of bat, Desmodus - the vampire bats!

# Subset by the term Desmodus
Desmodus = data[data['sourceTaxonName'].str.contains('Desmodus', na=False)]

# Keep only the records describing what Desmodus eats or preys on
Desmodus = Desmodus.loc[(Desmodus.interactionTypeName == 'eats') | (Desmodus.interactionTypeName == 'preysOn')]

# Show only relevant columns
Desmodus[['sourceTaxonName','sourceTaxonId', 'interactionTypeName', 'targetTaxonName','targetTaxonId']].head(10)
sourceTaxonName sourceTaxonId interactionTypeName targetTaxonName targetTaxonId
1542376 Desmodus rotundus GBIF:2433298 eats Bos taurus GBIF:2441022

So if you look up Bos taurus, you see that this animal is “cattle”. A mammal with blood. So creepy. So cool. I highly recommend just trying the above code with

Conclusions

Now that I have a handle on the data, I see a few different ways to explore it. If you are like me and need to google every species or taxon in the dataset, you should read next week’s post on automatically hyperlinking species names directly in a Jupyter Notebook, which makes exploring the GloBI data really intuitive. Also, in the next post I will be creating tools that wrap the interaction data into informative network visualizations. Below is a sneak peek at the type of visualization I will be creating.

Network visualization using GloBI data to show which species humans interact with.