Yikang Test

2018, Oct 10    

How to access the GloBI database:

Author: Yikang Li

Date: Nov 17th, 2018

What is GloBI:

Global Biotic Interactions (GloBI) provides open access to species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host) by combining existing open datasets using open source software. By providing an infrastructure to capture and share interaction data, individual biologists can focus on gathering new interaction data and analyzing existing datasets without having to spend resources on (re-) building a cyberinfrastructure to do so.

GloBI is made possible by a community of software engineers, bioinformaticists and biologists. Software engineers such as Jorrit Poelen, Göran Bodenschatz, and Robert Reiz collaborate with bioinformaticists like Chris Mungall, data managers like Sarah E. Miller and biologists like Jim Simons, Anne Thessen, Jen Hammock and Brian Hayden to capture, provide access to and use interaction data that is provided by biologists and citizen scientists around the world. GloBI is funded by EOL’s Rubenstein Fellowship Program.

How to access GloBI:

The R package “rglobi” gives us programmatic access to the Global Biotic Interactions (GloBI) database.

Description from the package documentation:
“A programmatic interface to the web service methods provided by Global Biotic Interactions (GloBI). GloBI provides access to spatial-temporal species interaction records from sources all over the world. rglobi provides methods to search species interactions by location, interaction type, and taxonomic name. In addition, it supports Cypher, a graph query language, to allow for executing custom queries on the GloBI aggregate species interaction data set.”

To use its methods and functions, we need to install and load the package “rglobi” in R.

install.packages("rglobi")
library(rglobi)

Users can search species interaction data by location, interaction type, and taxonomic name.

While the R package provides built-in methods and functions, it limits the number of results returned by default.
To access all the data, use one of the following options:

Choice 1:

Use Pagination: https://github.com/ropensci/rglobi/blob/master/vignettes/rglobi_vignette.Rmd#L410

“By default, the amount of results are limited. If you’d like to retrieve all results, you can used pagination. For instance, to retrieve parasitic interactions using pagination, you can use:

otherkeys = list("limit"=10, "skip"=0)
first_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)
otherkeys = list("limit"=10, "skip"=10)
second_page_of_ten <- get_interactions_by_type(interactiontype = c("hasParasite"), otherkeys = otherkeys)

To exhaust all available interactions, you can keep paging results until the size of the page is less than the limit (e.g., nrows(interactions) < limit).”
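
The paging rule quoted above can be sketched as a simple loop. The version below is a minimal Python sketch, not rglobi code: `fetch_page` is a hypothetical callable standing in for whatever returns one page of records for a given `limit` and `skip`, and `fake_fetch_page` is an in-memory stand-in so the loop can be run without network access.

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]], limit: int = 10) -> List[dict]:
    """Collect every record by paging until a page is shorter than `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    records: List[dict] = []
    skip = 0
    while True:
        page = fetch_page(limit, skip)
        records.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < limit:
            break
        skip += limit
    return records


# Hypothetical stand-in for a real data source: 25 made-up records.
_FAKE_ROWS = [{"id": i} for i in range(25)]


def fake_fetch_page(limit: int, skip: int) -> List[dict]:
    return _FAKE_ROWS[skip:skip + limit]


all_rows = fetch_all(fake_fetch_page, limit=10)
print(len(all_rows))
```

When the total happens to be an exact multiple of the limit, the loop makes one extra request that returns an empty page before it stops.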

Choice 2:

Through the API: https://github.com/jhpoelen/eol-globi-data/wiki/API
The page linked above documents the API, which provides access to interaction data for integration into wikis, custom webpages, or other interaction exploration tools.
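
As a rough illustration of what a request to the API might look like, the sketch below only builds a query URL and does not send it. The base URL, the `/interaction` path, and the `type` and `interactionType` parameter names are assumptions to be checked against the wiki page above; `limit` and `skip` mirror the keys used in the rglobi pagination example.

```python
from urllib.parse import urlencode

# Assumed base URL; verify it against the API wiki before relying on it.
API_BASE = "https://api.globalbioticinteractions.org"


def interaction_url(interaction_type, limit=10, skip=0, result_format="csv"):
    """Build (but do not send) a query URL for one page of interactions."""
    params = {
        "type": result_format,                # assumed parameter name
        "interactionType": interaction_type,  # assumed parameter name
        "limit": limit,
        "skip": skip,
    }
    return f"{API_BASE}/interaction?{urlencode(params)}"


url = interaction_url("hasParasite", limit=10, skip=10)
print(url)
```

Keeping URL construction separate from the request itself makes it easy to inspect the query before any data is fetched.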

Choice 3:

Download the whole dataset directly at https://www.globalbioticinteractions.org/data
Datasets are available to download in different formats, including TSV, CSV, and N-Quads/RDF.
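
A full export can be large, so one option is to stream the TSV line by line instead of loading it all at once. The sketch below uses only the Python standard library and tallies the `interactionTypeName` column; the sample rows are made up for illustration and are not real GloBI records.

```python
import csv
import io
from collections import Counter


def count_interaction_types(lines):
    """Tally interactionTypeName values from an iterable of TSV lines."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["interactionTypeName"] for row in reader)


# Made-up rows for illustration only; not real GloBI records.
sample = io.StringIO(
    "sourceTaxonName\tinteractionTypeName\ttargetTaxonName\n"
    "taxon A\teats\ttaxon B\n"
    "taxon C\teats\ttaxon D\n"
    "taxon E\tpollinates\ttaxon F\n"
)

type_counts = count_interaction_types(sample)
print(type_counts.most_common())
```

For a downloaded file, the same function can be given an open text handle, for example one returned by `open(path, encoding="utf-8", newline="")`, or by `gzip.open(path, "rt", encoding="utf-8")` if the archive is compressed.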

Import the data into a Jupyter notebook:

import pandas as pd
data = pd.read_csv('./with_new_csv/interactions.tsv', delimiter='\t', encoding='utf-8')
/Users/glance/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (13,14,15,16,17,18,19,21,22,23,24,25,26,27,28,29,30,41,42,43,44,45,46,47,48,49,50,55,58,59,60,61,62,63,64,65,68,69,71,72,77) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
data.head()
sourceTaxonId sourceTaxonIds sourceTaxonName sourceTaxonRank sourceTaxonPathNames sourceTaxonPathIds sourceTaxonPathRankNames sourceTaxonSpeciesName sourceTaxonSpeciesId sourceTaxonGenusName ... localityName eventDateUnixEpoch referenceCitation referenceDoi referenceUrl sourceCitation sourceNamespace sourceArchiveURI sourceDOI sourceLastSeenAtUnixEpoch
0 EOL:4472733 EOL:4472733 | EOL:4472733 Deinosuchus genus Deinosuchus EOL:4472733 genus NaN NaN Deinosuchus ... NaN NaN Rivera-Sylva H.E., E. Frey and J.R. Guzmán-Gui... 10.4267/2042/28152 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-11-14T23:55:44.895Z
1 EOL:4433651 EOL:4433651 | EOL:4433651 Daspletosaurus genus Daspletosaurus EOL:4433651 genus NaN NaN Daspletosaurus ... NaN NaN doi:10.1666/0022-3360(2001)075<0401:GCFACT>2.0... 10.1666/0022-3360(2001)075<0401:GCFACT>2.0.CO;2 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-11-14T23:55:44.895Z
2 EOL:24210058 EOL:24210058 | OTT:3617018 | GBIF:4975216 | EO... Repenomamus robustus species Eucarya | Opisthokonta | Metazoa | Eumetazoa |... EOL:5610326 | EOL:2910700 | EOL:42196910 | EOL... | | subkingdom | | | | | | | | | supe... Repenomamus robustus EOL:24210058 Repenomamus ... NaN NaN doi:10.1038/nature03102 10.1038/nature03102 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-11-14T23:55:44.895Z
3 EOL:4433892 EOL:4433892 | EOL:4433892 Sinocalliopteryx gigas species Sinocalliopteryx gigas EOL:4433892 species Sinocalliopteryx gigas EOL:4433892 NaN ... NaN NaN doi:10.1371/journal.pone.0044012 10.1371/journal.pone.0044012 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-11-14T23:55:44.895Z
4 EOL:4433892 EOL:4433892 | EOL:4433892 Sinocalliopteryx gigas species Sinocalliopteryx gigas EOL:4433892 species Sinocalliopteryx gigas EOL:4433892 NaN ... NaN NaN doi:10.1371/journal.pone.0044012 10.1371/journal.pone.0044012 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-11-14T23:55:44.895Z

5 rows × 79 columns
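
The DtypeWarning raised during the import appears because pandas infers column types chunk by chunk and finds conflicting types within the same column. One way to avoid it, sketched below on a tiny made-up sample rather than the real file, is to read every column as text with `dtype=str` and then convert only the columns that should be numeric. Passing `low_memory=False`, as the warning itself suggests, is the other common option.

```python
import io

import pandas as pd

# Made-up rows for illustration only; column names follow the GloBI export.
sample_tsv = (
    "sourceTaxonId\tinteractionTypeName\ttargetTaxonId\tdecimalLatitude\n"
    "EOL:1\teats\tEOL:2\t12.5\n"
    "EOL:3\tpreysOn\tEOL:4\t\n"
)

# Read everything as text so no column ends up with mixed types.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=str)

# Convert selected columns afterwards; unparseable values become NaN.
df["decimalLatitude"] = pd.to_numeric(df["decimalLatitude"], errors="coerce")

print(df.dtypes)
```

Reading identifiers as text also keeps values such as `EOL:4472733` intact, since they are labels rather than numbers.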

Basic data exploration and characteristics:

#check the number of rows:
len(data)
3445494

Columns in the database:

data.columns
Index(['sourceTaxonId', 'sourceTaxonIds', 'sourceTaxonName', 'sourceTaxonRank',
       'sourceTaxonPathNames', 'sourceTaxonPathIds',
       'sourceTaxonPathRankNames', 'sourceTaxonSpeciesName',
       'sourceTaxonSpeciesId', 'sourceTaxonGenusName', 'sourceTaxonGenusId',
       'sourceTaxonFamilyName', 'sourceTaxonFamilyId', 'sourceTaxonOrderName',
       'sourceTaxonOrderId', 'sourceTaxonClassName', 'sourceTaxonClassId',
       'sourceTaxonPhylumName', 'sourceTaxonPhylumId',
       'sourceTaxonKingdomName', 'sourceTaxonKingdomId', 'sourceId',
       'sourceOccurrenceId', 'sourceCatalogNumber', 'sourceBasisOfRecordId',
       'sourceBasisOfRecordName', 'sourceLifeStageId', 'sourceLifeStageName',
       'sourceBodyPartId', 'sourceBodyPartName', 'sourcePhysiologicalStateId',
       'sourcePhysiologicalStateName', 'interactionTypeName',
       'interactionTypeId', 'targetTaxonId', 'targetTaxonIds',
       'targetTaxonName', 'targetTaxonRank', 'targetTaxonPathNames',
       'targetTaxonPathIds', 'targetTaxonPathRankNames',
       'targetTaxonSpeciesName', 'targetTaxonSpeciesId',
       'targetTaxonGenusName', 'targetTaxonGenusId', 'targetTaxonFamilyName',
       'targetTaxonFamilyId', 'targetTaxonOrderName', 'targetTaxonOrderId',
       'targetTaxonClassName', 'targetTaxonClassId', 'targetTaxonPhylumName',
       'targetTaxonPhylumId', 'targetTaxonKingdomName', 'targetTaxonKingdomId',
       'targetId', 'targetOccurrenceId', 'targetCatalogNumber',
       'targetBasisOfRecordId', 'targetBasisOfRecordName', 'targetLifeStageId',
       'targetLifeStageName', 'targetBodyPartId', 'targetBodyPartName',
       'targetPhysiologicalStateId', 'targetPhysiologicalStateName',
       'decimalLatitude', 'decimalLongitude', 'localityId', 'localityName',
       'eventDateUnixEpoch', 'referenceCitation', 'referenceDoi',
       'referenceUrl', 'sourceCitation', 'sourceNamespace', 'sourceArchiveURI',
       'sourceDOI', 'sourceLastSeenAtUnixEpoch'],
      dtype='object')

How many distinct taxa appear as sources and as targets?

#source taxon
len(data['sourceTaxonId'].unique())
147156
#Target taxon
len(data['targetTaxonId'].unique())
105196
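
One detail worth keeping in mind when reading these counts: `len(series.unique())` treats a missing value as one more distinct entry, whereas `series.nunique()` ignores missing values by default. The toy series below (made-up IDs) shows the difference; whether the GloBI columns above contain missing IDs is not checked here.

```python
import numpy as np
import pandas as pd

# Made-up IDs for illustration; the last entry is missing.
ids = pd.Series(["EOL:1", "EOL:2", "EOL:1", np.nan])

count_with_missing = len(ids.unique())   # NaN counts as a distinct value
count_without_missing = ids.nunique()    # NaN is excluded by default

print(count_with_missing, count_without_missing)
```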

What interaction types are there?

data['interactionTypeName'].unique()
array(['eats', 'preysOn', 'interactsWith', 'pollinates', 'parasiteOf',
       'pathogenOf', 'visitsFlowersOf', 'adjacentTo', 'dispersalVectorOf',
       'hasHost', 'endoparasitoidOf', 'symbiontOf', 'endoparasiteOf',
       'hasVector', 'ectoParasiteOf', 'vectorOf', 'livesOn', 'livesNear',
       'parasitoidOf', 'guestOf', 'livesInsideOf', 'farms',
       'ectoParasitoid', 'inhabits', 'kills', 'hasDispersalVector',
       'livesUnder', 'kleptoparasiteOf', 'hostOf', 'visits', 'eatenBy',
       'flowersVisitedBy', 'preyedUponBy', 'hasParasite', 'pollinatedBy',
       'hasPathogen'], dtype=object)
#number of different types of interaction
len(data['interactionTypeName'].unique())
36

Drop duplicates:

data.drop_duplicates(['sourceTaxonId', 'interactionTypeName', 'targetTaxonId'], inplace = True)
len(data)
956380
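
Note that `drop_duplicates` with a column subset keeps the first row of each (source, interaction type, target) triple and discards the rest, so information that differs between those rows, such as the reference, is lost. If the number of supporting records matters, one possible approach is to count rows per triple before dropping. The sketch below uses a small made-up table, not real GloBI records.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "sourceTaxonId": ["EOL:1", "EOL:1", "EOL:3"],
    "interactionTypeName": ["eats", "eats", "eats"],
    "targetTaxonId": ["EOL:2", "EOL:2", "EOL:2"],
    "referenceCitation": ["ref A", "ref B", "ref C"],
})

key = ["sourceTaxonId", "interactionTypeName", "targetTaxonId"]

# How many records support each distinct triple.
record_counts = toy.groupby(key).size().reset_index(name="recordCount")

# One row per triple; the first occurrence is kept.
unique_triples = toy.drop_duplicates(key)

print(record_counts)
print(unique_triples)
```

By default `groupby` leaves out rows whose key contains a missing value, while `drop_duplicates` keeps them, so the two results can differ in length on real data unless `dropna=False` is passed to `groupby`.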