Yikang Test
How to access the GloBI database:
Author: Yikang Li
Date: November 17, 2018
What is GloBI:
Global Biotic Interactions (GloBI) provides open access to species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software. By providing an infrastructure to capture and share interaction data, individual biologists can focus on gathering new interaction data and analyzing existing datasets without having to spend resources on (re-) building a cyberinfrastructure to do so.
GloBI is made possible by a community of software engineers, bioinformaticists and biologists. Software engineers such as Jorrit Poelen, Göran Bodenschatz, and Robert Reiz collaborate with bioinformaticists like Chris Mungall, data managers like Sarah E. Miller and biologists like Jim Simons, Anne Thessen, Jen Hammock and Brian Hayden to capture, provide access to and use interaction data that is provided by biologists and citizen scientists around the world. GloBI is funded by EOL’s Rubenstein Fellowship Program.
How to access GloBI:
The R package “rglobi” provides programmatic access to the Global Biotic Interactions (GloBI) database.
Description from the package documentation:
“A programmatic interface to the web service methods provided by Global Biotic Interactions (GloBI). GloBI provides access to spatial-temporal species interaction records from sources all over the world. rglobi provides methods to search species interactions by location, interaction type, and taxonomic name. In addition, it supports Cypher, a graph query language, to allow for executing custom queries on the GloBI aggregate species interaction data set.”
To use its methods and functions, install the package “rglobi” and load it in R:
install.packages("rglobi")
library(rglobi)
Users can search species interaction data by location, interaction type, and taxonomic name.
The R package provides built-in methods and functions, but by default it limits the number of results returned.
To access all the data, use one of the following options:
Choice 1:
Use Pagination: https://github.com/ropensci/rglobi/blob/master/vignettes/rglobi_vignette.Rmd#L410
“By default, the amount of results are limited. If you’d like to retrieve all results, you can used pagination. For instance, to retrieve parasitic interactions using pagination, you can use:
otherkeys = list("limit"=10, "skip"=0)
first_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
otherkeys = list("limit"=10, "skip"=10)
second_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
To exhaust all available interactions, you can keep paging results until the size of the page is less than the limit (e.g., nrows(interactions) < limit).”
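The stopping rule quoted above (keep paging until a page comes back shorter than the limit) can be sketched in Python, the language used for the rest of this notebook. This is only an illustration of the loop: fetch_page is a hypothetical callable standing in for whatever query is being paged, and fake_fetch serves made-up records rather than GloBI data.

```python
def fetch_all(fetch_page, limit=10):
    """Collect every record by paging until a short page comes back.

    fetch_page is a hypothetical callable taking (limit, skip) and
    returning a list of records; it is not part of rglobi.
    """
    records = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)
        records.extend(page)
        if len(page) < limit:  # short (or empty) page: nothing left to fetch
            break
        skip += limit          # move the window forward by one page
    return records


# Stand-in for a real query: 25 made-up records served in slices.
fake_records = list(range(25))


def fake_fetch(limit, skip):
    return fake_records[skip:skip + limit]


all_records = fetch_all(fake_fetch, limit=10)
print(len(all_records))
```

When the total is an exact multiple of the limit, the loop makes one extra request that returns an empty page before it stops.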
Choice 2:
Through the API: https://github.com/jhpoelen/eol-globi-data/wiki/API
The page linked above documents the API, which provides access to interaction data for integration into wikis, custom webpages or other interaction exploration tools.
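As a rough sketch of what a paged API request could look like from Python, the snippet below only builds a query URL and sends nothing. The base URL and the parameter names (interactionType, limit, skip) are assumptions modelled on the rglobi pagination example, not confirmed here; check them against the API wiki page before relying on them.

```python
from urllib.parse import urlencode

# Assumption: this endpoint and these parameter names are illustrative only;
# verify them against the API wiki before use.
BASE_URL = "https://api.globalbioticinteractions.org/interaction"


def build_query_url(interaction_type, limit=10, skip=0):
    """Return a query URL for one page of interactions of the given type."""
    params = {"interactionType": interaction_type, "limit": limit, "skip": skip}
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("hasParasite", limit=10, skip=10)
print(url)
```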
Choice 3:
Download the whole dataset directly at https://www.globalbioticinteractions.org/data
Datasets are available for download in several formats, including TSV, CSV and N-Quads/RDF.
Import the data into a Jupyter notebook:
import pandas as pd
data = pd.read_csv('./with_new_csv/interactions.tsv', delimiter='\t', encoding='utf-8')
/Users/glance/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,41,42,43,44,45,46,47,48,49,50,55,58,59,60,61,62,63,64,65,68,69,71,72,77) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
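The DtypeWarning above means pandas inferred different types for different chunks of the listed columns. One way to avoid it, besides the low_memory=False option the warning suggests, is to read every column as text and convert only the columns known to be numeric. The sketch below uses a tiny made-up sample in place of interactions.tsv; choosing decimalLatitude as the numeric column is an assumption made for illustration.

```python
import io

import pandas as pd

# Tiny made-up stand-in for interactions.tsv (the real file has 79 columns).
sample_tsv = (
    "sourceTaxonId\tdecimalLatitude\treferenceDoi\n"
    "EOL:4472733\t\t10.4267/2042/28152\n"
    "EOL:4433651\t45.5\t\n"
)

# Read everything as text so no column ends up with mixed types.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=str)

# Convert only the columns that should be numeric; bad values become NaN.
sample["decimalLatitude"] = pd.to_numeric(sample["decimalLatitude"], errors="coerce")

print(sample.dtypes)
```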
data.head()
sourceTaxonId | sourceTaxonIds | sourceTaxonName | sourceTaxonRank | sourceTaxonPathNames | sourceTaxonPathIds | sourceTaxonPathRankNames | sourceTaxonSpeciesName | sourceTaxonSpeciesId | sourceTaxonGenusName | ... | localityName | eventDateUnixEpoch | referenceCitation | referenceDoi | referenceUrl | sourceCitation | sourceNamespace | sourceArchiveURI | sourceDOI | sourceLastSeenAtUnixEpoch | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EOL:4472733 | EOL:4472733 | EOL:4472733 | Deinosuchus | genus | Deinosuchus | EOL:4472733 | genus | NaN | NaN | Deinosuchus | ... | NaN | NaN | Rivera-Sylva H.E., E. Frey and J.R. Guzmán-Gui... | 10.4267/2042/28152 | NaN | Katja Schulz. 2015. Information about dinosaur... | KatjaSchulz/dinosaur-biotic-interactions | https://github.com/KatjaSchulz/dinosaur-biotic... | NaN | 2018-11-14T23:55:44.895Z |
1 | EOL:4433651 | EOL:4433651 | EOL:4433651 | Daspletosaurus | genus | Daspletosaurus | EOL:4433651 | genus | NaN | NaN | Daspletosaurus | ... | NaN | NaN | doi:10.1666/0022-3360(2001)075<0401:GCFACT>2.0... | 10.1666/0022-3360(2001)075<0401:GCFACT>2.0.CO;2 | NaN | Katja Schulz. 2015. Information about dinosaur... | KatjaSchulz/dinosaur-biotic-interactions | https://github.com/KatjaSchulz/dinosaur-biotic... | NaN | 2018-11-14T23:55:44.895Z |
2 | EOL:24210058 | EOL:24210058 | OTT:3617018 | GBIF:4975216 | EO... | Repenomamus robustus | species | Eucarya | Opisthokonta | Metazoa | Eumetazoa |... | EOL:5610326 | EOL:2910700 | EOL:42196910 | EOL... | | | subkingdom | | | | | | | | | supe... | Repenomamus robustus | EOL:24210058 | Repenomamus | ... | NaN | NaN | doi:10.1038/nature03102 | 10.1038/nature03102 | NaN | Katja Schulz. 2015. Information about dinosaur... | KatjaSchulz/dinosaur-biotic-interactions | https://github.com/KatjaSchulz/dinosaur-biotic... | NaN | 2018-11-14T23:55:44.895Z |
3 | EOL:4433892 | EOL:4433892 | EOL:4433892 | Sinocalliopteryx gigas | species | Sinocalliopteryx gigas | EOL:4433892 | species | Sinocalliopteryx gigas | EOL:4433892 | NaN | ... | NaN | NaN | doi:10.1371/journal.pone.0044012 | 10.1371/journal.pone.0044012 | NaN | Katja Schulz. 2015. Information about dinosaur... | KatjaSchulz/dinosaur-biotic-interactions | https://github.com/KatjaSchulz/dinosaur-biotic... | NaN | 2018-11-14T23:55:44.895Z |
4 | EOL:4433892 | EOL:4433892 | EOL:4433892 | Sinocalliopteryx gigas | species | Sinocalliopteryx gigas | EOL:4433892 | species | Sinocalliopteryx gigas | EOL:4433892 | NaN | ... | NaN | NaN | doi:10.1371/journal.pone.0044012 | 10.1371/journal.pone.0044012 | NaN | Katja Schulz. 2015. Information about dinosaur... | KatjaSchulz/dinosaur-biotic-interactions | https://github.com/KatjaSchulz/dinosaur-biotic... | NaN | 2018-11-14T23:55:44.895Z |
5 rows × 79 columns
Basic data exploration and characteristics:
#check the number of rows:
len(data)
3445494
Columns in the database:
data.columns
Index(['sourceTaxonId', 'sourceTaxonIds', 'sourceTaxonName', 'sourceTaxonRank',
'sourceTaxonPathNames', 'sourceTaxonPathIds',
'sourceTaxonPathRankNames', 'sourceTaxonSpeciesName',
'sourceTaxonSpeciesId', 'sourceTaxonGenusName', 'sourceTaxonGenusId',
'sourceTaxonFamilyName', 'sourceTaxonFamilyId', 'sourceTaxonOrderName',
'sourceTaxonOrderId', 'sourceTaxonClassName', 'sourceTaxonClassId',
'sourceTaxonPhylumName', 'sourceTaxonPhylumId',
'sourceTaxonKingdomName', 'sourceTaxonKingdomId', 'sourceId',
'sourceOccurrenceId', 'sourceCatalogNumber', 'sourceBasisOfRecordId',
'sourceBasisOfRecordName', 'sourceLifeStageId', 'sourceLifeStageName',
'sourceBodyPartId', 'sourceBodyPartName', 'sourcePhysiologicalStateId',
'sourcePhysiologicalStateName', 'interactionTypeName',
'interactionTypeId', 'targetTaxonId', 'targetTaxonIds',
'targetTaxonName', 'targetTaxonRank', 'targetTaxonPathNames',
'targetTaxonPathIds', 'targetTaxonPathRankNames',
'targetTaxonSpeciesName', 'targetTaxonSpeciesId',
'targetTaxonGenusName', 'targetTaxonGenusId', 'targetTaxonFamilyName',
'targetTaxonFamilyId', 'targetTaxonOrderName', 'targetTaxonOrderId',
'targetTaxonClassName', 'targetTaxonClassId', 'targetTaxonPhylumName',
'targetTaxonPhylumId', 'targetTaxonKingdomName', 'targetTaxonKingdomId',
'targetId', 'targetOccurrenceId', 'targetCatalogNumber',
'targetBasisOfRecordId', 'targetBasisOfRecordName', 'targetLifeStageId',
'targetLifeStageName', 'targetBodyPartId', 'targetBodyPartName',
'targetPhysiologicalStateId', 'targetPhysiologicalStateName',
'decimalLatitude', 'decimalLongitude', 'localityId', 'localityName',
'eventDateUnixEpoch', 'referenceCitation', 'referenceDoi',
'referenceUrl', 'sourceCitation', 'sourceNamespace', 'sourceArchiveURI',
'sourceDOI', 'sourceLastSeenAtUnixEpoch'],
dtype='object')
How many distinct taxa appear as sources and as targets?
#source taxon
len(data['sourceTaxonId'].unique())
147156
#Target taxon
len(data['targetTaxonId'].unique())
105196
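One caveat about these counts: unique() treats a missing value as a value of its own, so if any taxon IDs are missing, the totals above include one extra entry. A small sketch with made-up IDs shows the difference between len(unique()) and nunique():

```python
import numpy as np
import pandas as pd

# Made-up IDs, one of them missing.
ids = pd.Series(["EOL:1", "EOL:2", "EOL:1", np.nan])

with_missing = len(ids.unique())   # NaN is counted as one of the values
without_missing = ids.nunique()    # NaN is ignored by default

print(with_missing, without_missing)
```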
What interaction types are there?
data['interactionTypeName'].unique()
array(['eats', 'preysOn', 'interactsWith', 'pollinates', 'parasiteOf',
'pathogenOf', 'visitsFlowersOf', 'adjacentTo', 'dispersalVectorOf',
'hasHost', 'endoparasitoidOf', 'symbiontOf', 'endoparasiteOf',
'hasVector', 'ectoParasiteOf', 'vectorOf', 'livesOn', 'livesNear',
'parasitoidOf', 'guestOf', 'livesInsideOf', 'farms',
'ectoParasitoid', 'inhabits', 'kills', 'hasDispersalVector',
'livesUnder', 'kleptoparasiteOf', 'hostOf', 'visits', 'eatenBy',
'flowersVisitedBy', 'preyedUponBy', 'hasParasite', 'pollinatedBy',
'hasPathogen'], dtype=object)
#number of different types of interaction
len(data['interactionTypeName'].unique())
36
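Beyond listing the interaction types, it is often useful to see how many records each type has. A minimal sketch with value_counts() on a made-up column; on the real data the same call would be data['interactionTypeName'].value_counts().

```python
import pandas as pd

# Made-up records standing in for the interactionTypeName column.
toy = pd.DataFrame(
    {"interactionTypeName": ["eats", "preysOn", "eats", "pollinates", "eats"]}
)

# Number of records per interaction type, most frequent first.
counts = toy["interactionTypeName"].value_counts()

print(counts)
```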
Drop duplicates:
data.drop_duplicates(['sourceTaxonId', 'interactionTypeName', 'targetTaxonId'], inplace=True)
len(data)
956380
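Deduplicating on the (source, interaction type, target) triple keeps only the first record for each triple, so differing values in the other columns, such as the reference, are discarded. The sketch below shows this on made-up rows and returns a new frame instead of modifying in place, which makes it easy to count the rows removed.

```python
import pandas as pd

# Made-up rows: the first two share the same (source, type, target) triple.
toy = pd.DataFrame({
    "sourceTaxonId": ["EOL:1", "EOL:1", "EOL:2"],
    "interactionTypeName": ["eats", "eats", "eats"],
    "targetTaxonId": ["EOL:9", "EOL:9", "EOL:9"],
    "referenceCitation": ["paper A", "paper B", "paper C"],
})

key = ["sourceTaxonId", "interactionTypeName", "targetTaxonId"]

# keep="first" is the default: later rows with the same key are dropped.
deduped = toy.drop_duplicates(subset=key)
n_removed = len(toy) - len(deduped)

print(n_removed)
print(deduped["referenceCitation"].tolist())
```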