Yikang Test

2018, Oct 15    

Globi_exploration_new

GloBI database exploration:

Author: Yikang Li

Date: Nov, 17th, 2018

In [2]:
import pandas as pd
import pytaxize
import re
import matplotlib.pyplot as plt

Import interaction data:

In [3]:
data =pd.read_csv('/Users/iamciera/Desktop/interactions.tsv', delimiter='\t', encoding='utf-8')
//anaconda/envs/ipykernel_py3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (21,22,23,24,25,26,27,28,29,30,41,42,43,44,45,46,47,48,49,50,55,58,59,60,61,62,63,64,65,68,69,72,73,78) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
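
The DtypeWarning above is what pandas reports when a large file is parsed in chunks and a column's inferred type differs between chunks. A minimal sketch of the two remedies the warning itself suggests, using a tiny in-memory TSV as a stand-in for the real file (the column name `mixedColumn` and its values are hypothetical):

```python
import io
import pandas as pd

# Tiny stand-in for interactions.tsv; the rows and 'mixedColumn' are made up.
tsv_text = (
    "sourceTaxonName\tinteractionTypeName\tmixedColumn\n"
    "Homo sapiens\teats\t9.15\n"
    "Homo sapiens\teats\t\n"
)

# Option 1: parse the file in a single pass so each column's type is inferred once.
df_whole = pd.read_csv(io.StringIO(tsv_text), sep="\t", low_memory=False)

# Option 2: declare the type of the ambiguous columns explicitly.
df_typed = pd.read_csv(io.StringIO(tsv_text), sep="\t", dtype={"mixedColumn": str})

print(df_whole.dtypes)
print(df_typed.dtypes)
```

Option 1 uses more memory while parsing; option 2 needs the column names up front but makes the intended types explicit.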
In [5]:
data.head()
Out[5]:
sourceTaxonId sourceTaxonIds sourceTaxonName sourceTaxonRank sourceTaxonPathNames sourceTaxonPathIds sourceTaxonPathRankNames sourceTaxonSpeciesName sourceTaxonSpeciesId sourceTaxonGenusName ... eventDateUnixEpoch argumentTypeId referenceCitation referenceDoi referenceUrl sourceCitation sourceNamespace sourceArchiveURI sourceDOI sourceLastSeenAtUnixEpoch
0 EOL:4472733 EOL:4472733 | EOL:4472733 Deinosuchus genus Deinosuchus EOL:4472733 genus NaN NaN Deinosuchus ... NaN https://en.wiktionary.org/wiki/support Rivera-Sylva H.E., E. Frey and J.R. Guzmán-Gui... 10.4267/2042/28152 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
1 EOL:4433651 EOL:4433651 | EOL:4433651 Daspletosaurus genus Daspletosaurus EOL:4433651 genus NaN NaN Daspletosaurus ... NaN https://en.wiktionary.org/wiki/support doi:10.1666/0022-3360(2001)075<0401:GCFACT>2.0... 10.1666/0022-3360(2001)075<0401:GCFACT>2.0.CO;2 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
2 EOL_V2:24210058 EOL_V2:24210058 | OTT:3617018 | GBIF:4975216 |... Repenomamus robustus species Eucarya | Opisthokonta | Metazoa | Eumetazoa |... EOL:5610326 | EOL:2910700 | EOL:42196910 | EOL... | | subkingdom | | | | | | | | | supe... Repenomamus robustus EOL_V2:24210058 Repenomamus ... NaN https://en.wiktionary.org/wiki/support doi:10.1038/nature03102 10.1038/nature03102 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
3 EOL:4433892 EOL:4433892 | EOL:4433892 Sinocalliopteryx gigas species Sinocalliopteryx gigas EOL:4433892 species Sinocalliopteryx gigas EOL:4433892 NaN ... NaN https://en.wiktionary.org/wiki/support doi:10.1371/journal.pone.0044012 10.1371/journal.pone.0044012 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z
4 EOL:4433892 EOL:4433892 | EOL:4433892 Sinocalliopteryx gigas species Sinocalliopteryx gigas EOL:4433892 species Sinocalliopteryx gigas EOL:4433892 NaN ... NaN https://en.wiktionary.org/wiki/support doi:10.1371/journal.pone.0044012 10.1371/journal.pone.0044012 NaN Katja Schulz. 2015. Information about dinosaur... KatjaSchulz/dinosaur-biotic-interactions https://github.com/KatjaSchulz/dinosaur-biotic... NaN 2018-12-14T23:59:22.189Z

5 rows × 80 columns

In [6]:
data['interactionTypeName'].unique()
Out[6]:
array(['eats', 'preysOn', 'interactsWith', 'pollinates', 'parasiteOf',
       'pathogenOf', 'visitsFlowersOf', 'adjacentTo', 'dispersalVectorOf',
       'hasHost', 'endoparasitoidOf', 'symbiontOf', 'endoparasiteOf',
       'hasVector', 'ectoParasiteOf', 'vectorOf', 'livesOn', 'livesNear',
       'parasitoidOf', 'guestOf', 'livesInsideOf', 'farms',
       'ectoParasitoid', 'inhabits', 'kills', 'hasDispersalVector',
       'livesUnder', 'kleptoparasiteOf', 'hostOf', 'eatenBy',
       'flowersVisitedBy', 'preyedUponBy', 'hasParasite', 'pollinatedBy',
       'visits', 'commensalistOf', 'hasPathogen'], dtype=object)

Drop duplicates:

In [7]:
data.drop_duplicates(['sourceTaxonId', 'interactionTypeName', 'targetTaxonId'], inplace = True)
In [8]:
len(data)
Out[8]:
965611
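
A small sketch of what this de-duplication does, on a toy frame (the rows and the `doi-*` values are made up for illustration): rows that share the same source, interaction type and target collapse to the first one, even when their other columns differ.

```python
import pandas as pd

# Toy data: the first two rows differ only in their reference.
toy = pd.DataFrame({
    "sourceTaxonId": ["EOL:1", "EOL:1", "EOL:1"],
    "interactionTypeName": ["eats", "eats", "hostOf"],
    "targetTaxonId": ["EOL:2", "EOL:2", "EOL:2"],
    "referenceDoi": ["doi-a", "doi-b", "doi-c"],
})

key = ["sourceTaxonId", "interactionTypeName", "targetTaxonId"]

# keep='first' is the default, so the 'doi-b' row is discarded.
deduped = toy.drop_duplicates(subset=key)
print(deduped)
```

As a consequence, the record counts later in this notebook count distinct source/interaction/target triples rather than the number of references supporting each one.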

Data Exploration:

Let's look at a specific taxon:

For example, suppose we are interested in the interactions involving 'Homo sapiens'.

In [9]:
#Types of interactions involving Homo sapiens as sourceTaxon:
data[data['sourceTaxonName'] == 'Homo sapiens']['interactionTypeName'].unique()
Out[9]:
array(['interactsWith', 'eats', 'hostOf'], dtype=object)
In [10]:
#Number of records of interactions involving Homo sapiens as sourceTaxon:
len(data[data['sourceTaxonName'] == 'Homo sapiens'])
Out[10]:
667

Let's focus on one type of interaction with 'Homo sapiens' as the source taxon, for example 'eats':

In [11]:
hs_eats_data = data[(data['sourceTaxonName'] == 'Homo sapiens') & (data['interactionTypeName'] == 'eats')]
In [12]:
hs_eats_data.head()
Out[12]:
sourceTaxonId sourceTaxonIds sourceTaxonName sourceTaxonRank sourceTaxonPathNames sourceTaxonPathIds sourceTaxonPathRankNames sourceTaxonSpeciesName sourceTaxonSpeciesId sourceTaxonGenusName ... localityName eventDateUnixEpoch referenceCitation referenceDoi referenceUrl sourceCitation sourceNamespace sourceArchiveURI sourceDOI sourceLastSeenAtUnixEpoch
755562 EOL:327955 EOL:327955 | INAT_TAXON:43584 | NBN:NHMSYS0000... Homo sapiens species Animalia | Chordata | Mammalia | Primates | Ho... EOL:1 | EOL:694 | EOL:1642 | EOL:1645 | EOL:16... kingdom | phylum | class | order | family | ge... Homo sapiens EOL:327955 Homo ... Barro Colorado Island, Panama NaN Worthington, A. 1989. Adaptations for avian fr... 10.1007/BF00379040. NaN F. Gabriel. Muñoz. 2017. Palm-Animal frugivore... fgabriel1891/Plant-Frugivore-Interactions-Sout... https://github.com/fgabriel1891/Plant-Frugivor... NaN 2018-11-14T23:08:24.277Z
756855 EOL:327955 EOL:327955 | INAT_TAXON:43584 | NBN:NHMSYS0000... Homo sapiens species Animalia | Chordata | Mammalia | Primates | Ho... EOL:1 | EOL:694 | EOL:1642 | EOL:1645 | EOL:16... kingdom | phylum | class | order | family | ge... Homo sapiens EOL:327955 Homo ... Mizoram, India NaN Hazarika, T.k. Lalramchuana. Nautiyal. B.P. 20... 10.1007/s10722-012-9799-5 NaN F. Gabriel. Muñoz. 2017. Palm-Animal frugivore... fgabriel1891/Plant-Frugivore-Interactions-Sout... https://github.com/fgabriel1891/Plant-Frugivor... NaN 2018-11-14T23:08:24.277Z
756856 EOL:327955 EOL:327955 | INAT_TAXON:43584 | NBN:NHMSYS0000... Homo sapiens species Animalia | Chordata | Mammalia | Primates | Ho... EOL:1 | EOL:694 | EOL:1642 | EOL:1645 | EOL:16... kingdom | phylum | class | order | family | ge... Homo sapiens EOL:327955 Homo ... Mizoram, India NaN Hazarika, T.k. Lalramchuana. Nautiyal. B.P. 20... 10.1007/s10722-012-9799-5 NaN F. Gabriel. Muñoz. 2017. Palm-Animal frugivore... fgabriel1891/Plant-Frugivore-Interactions-Sout... https://github.com/fgabriel1891/Plant-Frugivor... NaN 2018-11-14T23:08:24.277Z
756857 EOL:327955 EOL:327955 | INAT_TAXON:43584 | NBN:NHMSYS0000... Homo sapiens species Animalia | Chordata | Mammalia | Primates | Ho... EOL:1 | EOL:694 | EOL:1642 | EOL:1645 | EOL:16... kingdom | phylum | class | order | family | ge... Homo sapiens EOL:327955 Homo ... Mizoram, India NaN Hazarika, T.k. Lalramchuana. Nautiyal. B.P. 20... 10.1007/s10722-012-9799-5 NaN F. Gabriel. Muñoz. 2017. Palm-Animal frugivore... fgabriel1891/Plant-Frugivore-Interactions-Sout... https://github.com/fgabriel1891/Plant-Frugivor... NaN 2018-11-14T23:08:24.277Z
756858 EOL:327955 EOL:327955 | INAT_TAXON:43584 | NBN:NHMSYS0000... Homo sapiens species Animalia | Chordata | Mammalia | Primates | Ho... EOL:1 | EOL:694 | EOL:1642 | EOL:1645 | EOL:16... kingdom | phylum | class | order | family | ge... Homo sapiens EOL:327955 Homo ... Mizoram, India NaN Hazarika, T.k. Lalramchuana. Nautiyal. B.P. 20... 10.1007/s10722-012-9799-5 NaN F. Gabriel. Muñoz. 2017. Palm-Animal frugivore... fgabriel1891/Plant-Frugivore-Interactions-Sout... https://github.com/fgabriel1891/Plant-Frugivor... NaN 2018-11-14T23:08:24.277Z

5 rows × 79 columns

In [13]:
len(hs_eats_data)
Out[13]:
379
In [14]:
#Drop missing values
target_hs_eats = hs_eats_data[['targetTaxonId',
       'targetTaxonName','targetTaxonPathNames',
       'targetTaxonPathIds', 'targetTaxonPathRankNames',
       'targetTaxonSpeciesName', 'targetTaxonSpeciesId',
       'targetTaxonGenusName', 'targetTaxonGenusId', 'targetTaxonFamilyName',
       'targetTaxonFamilyId', 'targetTaxonOrderName', 'targetTaxonOrderId',
       'targetTaxonClassName', 'targetTaxonClassId', 'targetTaxonPhylumName',
       'targetTaxonPhylumId', 'targetTaxonKingdomName', 'targetTaxonKingdomId']].dropna(subset=['targetTaxonId',
       'targetTaxonName','targetTaxonPathNames','targetTaxonPathIds'])
target_hs_eats.head()
Out[14]:
targetTaxonId targetTaxonName targetTaxonPathNames targetTaxonPathIds targetTaxonPathRankNames targetTaxonSpeciesName targetTaxonSpeciesId targetTaxonGenusName targetTaxonGenusId targetTaxonFamilyName targetTaxonFamilyId targetTaxonOrderName targetTaxonOrderId targetTaxonClassName targetTaxonClassId targetTaxonPhylumName targetTaxonPhylumId targetTaxonKingdomName targetTaxonKingdomId
755562 EOL:1142757 Hyphaene petersiana Plantae | Tracheophyta | Liliopsida | Arecales... EOL:281 | EOL:4077 | EOL:4074 | EOL:8192 | EOL... kingdom | phylum | class | order | family | ge... Hyphaene petersiana EOL:1142757 Hyphaene EOL:29186 Arecaceae EOL:8193 Arecales EOL:8192 Liliopsida EOL:4074 Tracheophyta EOL:4077 Plantae EOL:281
756856 EOL:2508660 Syzygium cumini Plantae | Tracheophyta | Magnoliopsida | Myrta... EOL:281 | EOL:4077 | EOL:283 | EOL:4328 | EOL:... kingdom | phylum | class | order | family | ge... Syzygium cumini EOL:2508660 Syzygium EOL:2508658 Myrtaceae EOL:8095 Myrtales EOL:4328 Magnoliopsida EOL:283 Tracheophyta EOL:4077 Plantae EOL:281
756857 EOL:4263 Styracaceae Plantae | Tracheophyta | Magnoliopsida | Erica... EOL:281 | EOL:4077 | EOL:283 | EOL:4186 | EOL:... kingdom | phylum | class | order | family NaN NaN NaN NaN Styracaceae EOL:4263 Ericales EOL:4186 Magnoliopsida EOL:283 Tracheophyta EOL:4077 Plantae EOL:281
756858 EOL:2888768 Spondias pinnata Plantae | Tracheophyta | Magnoliopsida | Sapin... EOL:281 | EOL:4077 | EOL:283 | EOL:4311 | EOL:... kingdom | phylum | class | order | family | ge... Spondias pinnata EOL:2888768 Spondias EOL:61097 Anacardiaceae EOL:4410 Sapindales EOL:4311 Magnoliopsida EOL:283 Tracheophyta EOL:4077 Plantae EOL:281
756859 EOL:1082661 Smilax ovalifolia Plantae | Tracheophyta | Liliopsida | Liliales... EOL:281 | EOL:4077 | EOL:4074 | EOL:4173 | EOL... kingdom | phylum | class | order | family | ge... Smilax ovalifolia EOL:1082661 Smilax EOL:107257 Smilacaceae EOL:8171 Liliales EOL:4173 Liliopsida EOL:4074 Tracheophyta EOL:4077 Plantae EOL:281
In [15]:
len(target_hs_eats)
Out[15]:
304
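
The drop from 379 to 304 rows comes from `dropna(subset=...)`, which only inspects the listed columns. A minimal sketch on toy data (the 'unresolved name' row is hypothetical) shows why a row such as Styracaceae survives with NaN in its species columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "targetTaxonId": ["EOL:4263", None, "EOL:1142757"],
    "targetTaxonName": ["Styracaceae", "unresolved name", "Hyphaene petersiana"],
    "targetTaxonSpeciesName": [np.nan, np.nan, "Hyphaene petersiana"],
})

# Only rows missing an ID or a name are removed; NaN elsewhere is tolerated.
kept = toy.dropna(subset=["targetTaxonId", "targetTaxonName"])
print(kept)
```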
In [16]:
target_hs_eats.groupby(target_hs_eats['targetTaxonClassName']).size().sort_values(ascending = False)
Out[16]:
targetTaxonClassName
Mammalia           102
Magnoliopsida       52
Actinopterygii      49
Aves                26
Bivalvia            19
Liliopsida           8
Malacostraca         7
Gastropoda           5
Reptilia             4
Elasmobranchii       3
Ascidiacea           3
Insecta              3
Anthozoa             2
Holothuroidea        2
Cephalopoda          2
Anopla               1
Bangiophyceae        1
Ulvophyceae          1
Chondrichthyes       1
Chrysophyceae        1
Dothideomycetes      1
Teleostei            1
Phaeophyceae         1
Echinoidea           1
dtype: int64
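
The counts above sum to 296, fewer than the 304 rows, presumably because rows without a class name are left out of the grouping. A short sketch on toy values shows that behaviour, and that `value_counts()` gives the same counts in one call:

```python
import pandas as pd

# Toy column with one missing class name.
classes = pd.Series(
    ["Mammalia", "Aves", "Mammalia", None, "Aves", "Mammalia"],
    name="targetTaxonClassName",
)

# The pattern used in the cell above.
by_groupby = classes.groupby(classes).size().sort_values(ascending=False)

# Equivalent shortcut: sorted in descending order and skipping missing values by default.
by_value_counts = classes.value_counts()

print(by_groupby)
print(by_value_counts)
```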

In summary, we now have a ranked list of the target classes that 'Homo sapiens' eats.
In the same way, we can get such a list at any rank, for any source taxon and any interaction type, with the following function 'find_top_target':

In [17]:
def find_top_target(source, interaction_type, rank):
    """Find the most frequent target taxa for a source taxon and interaction type.

    Args:
        source: name of the source taxon, matched exactly against 'sourceTaxonName';
                it can be at any taxonomic level.
        interaction_type: interaction type name, as it appears in the
                          'interactionTypeName' column of the TSV file.
        rank: target column to group by, given as a column name of the TSV file,
              such as 'targetTaxonFamilyName', 'targetTaxonOrderName' or
              'targetTaxonClassName'.
    Returns:
        A Series of record counts per target taxon at the given rank,
        in descending order of number of records.
    """
    d = data[data['sourceTaxonName'] == source]
    interacts_d = d[d['interactionTypeName'] == interaction_type]
    interacts_d_cleaned = interacts_d[['targetTaxonId',
       'targetTaxonName','targetTaxonPathNames',
       'targetTaxonPathIds', 'targetTaxonPathRankNames',
       'targetTaxonSpeciesName', 'targetTaxonSpeciesId',
       'targetTaxonGenusName', 'targetTaxonGenusId', 'targetTaxonFamilyName',
       'targetTaxonFamilyId', 'targetTaxonOrderName', 'targetTaxonOrderId',
       'targetTaxonClassName', 'targetTaxonClassId', 'targetTaxonPhylumName',
       'targetTaxonPhylumId', 'targetTaxonKingdomName', 'targetTaxonKingdomId']].dropna(subset=['targetTaxonId',
       'targetTaxonName','targetTaxonPathNames','targetTaxonPathIds'])
    return interacts_d_cleaned.groupby(interacts_d_cleaned[rank]).size().sort_values(ascending = False)
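
The function above reads the global `data` frame. As a sketch of an alternative, the hypothetical variant below (`find_top_target_in` is not part of the notebook) takes the frame as an argument, which makes it easy to try on a small toy table:

```python
import pandas as pd

def find_top_target_in(frame, source, interaction_type, rank):
    """Hypothetical variant of find_top_target that takes the DataFrame explicitly."""
    mask = (frame["sourceTaxonName"] == source) & (
        frame["interactionTypeName"] == interaction_type
    )
    required = ["targetTaxonId", "targetTaxonName",
                "targetTaxonPathNames", "targetTaxonPathIds"]
    # Same cleaning rule as above: drop rows missing any identifying column.
    cleaned = frame.loc[mask].dropna(subset=required)
    return cleaned[rank].value_counts()

# Toy table; all values are made up for illustration.
toy = pd.DataFrame({
    "sourceTaxonName": ["Homo sapiens"] * 4 + ["Felis catus"],
    "interactionTypeName": ["eats", "eats", "eats", "hostOf", "eats"],
    "targetTaxonId": ["T:1", "T:2", "T:3", "T:4", "T:5"],
    "targetTaxonName": ["a", "b", "c", "d", "e"],
    "targetTaxonPathNames": ["p"] * 5,
    "targetTaxonPathIds": ["i", "i", None, "i", "i"],
    "targetTaxonClassName": ["Mammalia", "Mammalia", "Aves", "Arachnida", "Aves"],
})

result = find_top_target_in(toy, "Homo sapiens", "eats", "targetTaxonClassName")
print(result)
```

The third row is dropped because its path IDs are missing, so only the two Mammalia records are counted.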

Examples:

In [18]:
#Find top target taxa at the class level for Homo sapiens with interaction type 'eats'
find_top_target('Homo sapiens', 'eats', 'targetTaxonClassName')
Out[18]:
targetTaxonClassName
Mammalia           102
Magnoliopsida       52
Actinopterygii      49
Aves                26
Bivalvia            19
Liliopsida           8
Malacostraca         7
Gastropoda           5
Reptilia             4
Elasmobranchii       3
Ascidiacea           3
Insecta              3
Anthozoa             2
Holothuroidea        2
Cephalopoda          2
Anopla               1
Bangiophyceae        1
Ulvophyceae          1
Chondrichthyes       1
Chrysophyceae        1
Dothideomycetes      1
Teleostei            1
Phaeophyceae         1
Echinoidea           1
dtype: int64
In [19]:
#Find top target taxa at the family level for Homo sapiens with interaction type 'hostOf'
find_top_target('Homo sapiens', 'hostOf', 'targetTaxonFamilyName')
Out[19]:
targetTaxonFamilyName
Ixodidae              11
Diphyllobothriidae     4
Rhopalopsyllidae       3
Pulicidae              3
Trombiculidae          1
Taeniidae              1
Pediculidae            1
Oxyuridae              1
Echinorhynchidae       1
dtype: int64

Instead of a source species, what if we pass a source at another level, such as a class or a family?

In [20]:
#Find top target taxons in Class for Actinopterygii with interaction type 'preysOn'
find_top_target('Actinopterygii', 'preysOn', 'targetTaxonClassName')
Out[20]:
targetTaxonClassName
Actinopterygii    7
Cephalopoda       1
dtype: int64

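Here, the source 'Actinopterygii' is itself at the class level.
The top target class for 'Actinopterygii' with 'preysOn' is also 'Actinopterygii' (7 of the 8 records), which suggests that these records mostly describe predation within the same class. Note that 'find_top_target' matches 'sourceTaxonName' exactly, so only records whose source is recorded as 'Actinopterygii' are counted, not records for individual species within that class.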
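
Because the source name is compared for exact equality, records whose source is a species inside the class are not part of this result. One possible way to include them, sketched here on toy rows, is to test the lineage stored in 'sourceTaxonPathNames' (the helper `source_within` and the example lineages are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

def source_within(frame, taxon):
    """Boolean mask of rows whose source lineage contains `taxon` (hypothetical helper)."""
    # Lineages look like 'Animalia | Chordata | ...'; missing ones become ''.
    lineage = frame["sourceTaxonPathNames"].fillna("").str.split("|")
    return lineage.apply(lambda names: taxon in [n.strip() for n in names])

# Toy rows with illustrative lineages.
toy = pd.DataFrame({
    "sourceTaxonName": ["Actinopterygii", "Gadus morhua", "Homo sapiens"],
    "sourceTaxonPathNames": [
        "Animalia | Chordata | Actinopterygii",
        "Animalia | Chordata | Actinopterygii | Gadiformes | Gadidae | Gadus | Gadus morhua",
        None,
    ],
})

mask = source_within(toy, "Actinopterygii")
print(toy[mask])
```

Splitting on the separator and comparing whole names avoids accidental substring matches that a plain `str.contains` could produce.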

If we want to know more about the resulting taxa, we can also link them to their Wikipedia pages:

In [21]:
def make_clickable_both(val): 
    name, url = val.split('#')
    return f'<a href="{url}">{name}</a>'
In [22]:
def top_targets_with_wiki(source, interaction_type, rank):
    """Find the top targets and link each one to its Wikipedia page.

    Args:
        source: name of the source taxon, matched exactly against 'sourceTaxonName';
                it can be at any taxonomic level.
        interaction_type: interaction type name, as it appears in the
                          'interactionTypeName' column of the TSV file.
        rank: target column to group by, given as a column name of the TSV file,
              such as 'targetTaxonFamilyName', 'targetTaxonOrderName' or
              'targetTaxonClassName'.
    Returns:
        A pandas Styler of record counts whose index entries are clickable
        Wikipedia links, in descending order of number of records.
    """
    top_targets = find_top_target(source, interaction_type, rank)
    target_df = pd.DataFrame(top_targets)
    target_df.columns = ['count']

    # Pair each taxon name with a Wikipedia URL built from that name.
    names = list(target_df.index)
    urls = ['https://en.wikipedia.org/wiki/' + str(name) for name in names]

    # Replace each index entry with an HTML anchor; the Styler renders it as a link.
    target_df.index = [make_clickable_both(name + '#' + url) for name, url in zip(names, urls)]

    return target_df.style
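
The URLs above are built by appending the taxon name to the Wikipedia base address. Names with spaces or non-ASCII characters are safer to encode first. A minimal sketch using only the standard library (`wikipedia_url` is a hypothetical helper, and nothing here checks that the page actually exists):

```python
from urllib.parse import quote

def wikipedia_url(name):
    """Hypothetical helper: build an English Wikipedia URL from a taxon name."""
    # Wikipedia titles use underscores in place of spaces.
    title = str(name).strip().replace(" ", "_")
    # Percent-encode anything that is not URL-safe.
    return "https://en.wikipedia.org/wiki/" + quote(title)

print(wikipedia_url("Homo sapiens"))
print(wikipedia_url("Mammalia"))
```

Note also that 'make_clickable_both' splits on '#', so it assumes that neither the name nor the URL contains that character.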

Examples:

In [23]:
top_targets_with_wiki('Homo sapiens', 'eats', 'targetTaxonClassName')
In [24]:
top_targets_with_wiki('Homo sapiens', 'hostOf', 'targetTaxonFamilyName')
In [25]:
top_targets_with_wiki('Actinopterygii', 'preysOn', 'targetTaxonClassName')
Make directed graphs:

In [28]:
import networkx as nx
In [38]:
def plot(source, interaction_type, rank, n = None):
    """Plot a directed graph of the results from 'find_top_target'.

    Args:
        source: name of the source taxon, matched exactly against 'sourceTaxonName';
                it can be at any taxonomic level.
        interaction_type: interaction type name, as it appears in the
                          'interactionTypeName' column of the TSV file.
        rank: target column to group by, given as a column name of the TSV file,
              such as 'targetTaxonFamilyName', 'targetTaxonOrderName' or
              'targetTaxonClassName'.
        n: number of top targets to plot; by default all targets are plotted.
    Returns:
        None. Displays a directed graph with one edge from the source to each
        target taxon, labelled with interaction_type.
    """
    G = nx.DiGraph()
    
    if n:
        top_targets = find_top_target(source, interaction_type, rank)[: n]
    else:
        top_targets = find_top_target(source, interaction_type, rank)

    for name in ([source]+ list(top_targets.index)):
        G.add_node(name)

    for target in top_targets.index:
        G.add_edge(source, target, label = interaction_type)

    plt.figure(figsize=(8,8))
    edge_labels = nx.get_edge_attributes(G,'label')

    pos = nx.spring_layout(G) 
    nx.draw_networkx_edge_labels(G,pos, edge_labels = edge_labels, font_size=15, font_color='orange')

    nx.draw_networkx(G, pos, with_labels=True, node_size=1500, node_color="skyblue", alpha= 1, arrows=True, 
                    linewidths=1, font_color="grey", font_size=15, style = 'dashed')

    plt.axis('off')
    plt.tight_layout()
    plt.show()
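
The graph drawn by 'plot' shows which targets are connected but not how many records support each edge. A possible extension, sketched here with a hypothetical helper `build_interaction_graph`, stores the record count as an edge attribute so that it can drive the edge width (the three counts are taken from the class-level result for 'Homo sapiens' above):

```python
import networkx as nx
import pandas as pd

def build_interaction_graph(source, interaction_type, top_targets):
    """Hypothetical helper: one edge per target, weighted by record count."""
    G = nx.DiGraph()
    for target, count in top_targets.items():
        G.add_edge(source, target, label=interaction_type, weight=int(count))
    return G

# First three class-level counts from the 'eats' result for Homo sapiens.
top_targets = pd.Series({"Mammalia": 102, "Magnoliopsida": 52, "Actinopterygii": 49})
G = build_interaction_graph("Homo sapiens", "eats", top_targets)

# Scale counts down to usable line widths; the divisor is an arbitrary choice.
widths = [G[u][v]["weight"] / 20 for u, v in G.edges()]
print(list(G.edges(data=True)))
print(widths)
```

The `widths` list could then be passed as the `width` argument of `nx.draw_networkx` in place of a fixed line width.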
In [34]:
plot('Homo sapiens', 'eats', 'targetTaxonClassName', 5)
In [35]:
plot('Homo sapiens', 'eats', 'targetTaxonClassName', 10)
In [39]:
plot('Homo sapiens', 'eats', 'targetTaxonClassName')
In [36]:
plot('Homo sapiens', 'hostOf', 'targetTaxonFamilyName', 5)
In [37]:
plot('Actinopterygii', 'preysOn', 'targetTaxonClassName', 5)