Isobase (IsoRank PPI Network Alignment Based Ortholog Database) is a database of functionally related orthologs, which we term "isologs", developed from the multiple alignment of five major eukaryotic PPI networks, as computed by the global network alignment tools IsoRank & IsoRankN - the "iso-" being motivated by the connection of our work to graph isomorphism. Isologs are proteins that perform functionally equivalent roles in different species. By emphasizing both sequence similarity and functional similarity, isologs are intended to address some of the shortcomings of traditional sequence-only orthology-prediction approaches.

How to use Isobase: You can bulk-download all isologs across multiple species, and search for orthologs based on a constituent gene id/name or keyword. A detailed tutorial is available here.

Predictions that combine both protein-protein and genetic interactions in the IsoRank algorithm are also available. Gene-gene interactions for yeast, fly, worm, and human were considered. To view these predictions, check the checkbox on the right side of the navigation bar.

Summary
Eukaryotic species (yeast, fly, worm, mouse, human): 5
Protein sequences: 87773
Orthologs: 12693
Constituent proteins: 48120
Last updated on 11/24/14



Downloads

Cluster predictions, cluster entropies, and id maps
File | Description | Tab format
cid.entropy.nentropy.pids.txt | Ortholog clusters and entropy scores | cluster id, entropy, normalized entropy, space-delimited list of internal pids
pid.symbol.id.syns.txt | Id map from internal pid to external ids | pid, gene symbol, gene id, synonym1|synonym2|...
pid.go.goevid.goaspect.txt | Pid to various GO information | pid, go id, go evidence, go aspect
go.desc.txt | GO to description map | go id, go description
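
As an illustration of the tab format listed above for cid.entropy.nentropy.pids.txt, the following is a minimal Python sketch of a reader for that file. It assumes the file has no header row and that both entropy columns are always numeric; neither is confirmed here, so check the downloaded file first. The names Cluster and parse_clusters are hypothetical, and the sample rows are made up.

```python
from typing import Iterable, List, NamedTuple


class Cluster(NamedTuple):
    cid: str             # cluster id
    entropy: float       # entropy score
    nentropy: float      # normalized entropy score
    pids: List[str]      # internal protein ids in the cluster


def parse_clusters(lines: Iterable[str]) -> List[Cluster]:
    """Parse tab-delimited rows: cid, entropy, normalized entropy, pids.

    Assumes no header row and numeric entropy columns (unverified).
    """
    clusters = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cid, entropy, nentropy, pids = line.split("\t", 3)
        # The last column is a space-delimited list of internal pids.
        clusters.append(Cluster(cid, float(entropy), float(nentropy), pids.split()))
    return clusters


# Made-up rows for illustration only; real rows come from the downloaded file.
sample = ["c1\t0.0\t0.0\tp1 p2 p3", "c2\t0.693\t1.0\tp4 p5"]
clusters = parse_clusters(sample)
```

To read the real file, pass an open file handle instead of the sample list, for example parse_clusters(open("cid.entropy.nentropy.pids.txt")).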


GO hierarchy tools
File | Description | Tab format
gene_ontology_ext.obo.txt | Full ontology file (version 1.2, 04/2010), including cross-products, inter-ontology links, and has-part relationships | See Gene Ontology website
go.dag.obo.v1.2.txt | Compact directed acyclic graph representation of the full ontology file | GO id followed by its children, space-delimited
OBOfile_to_GOdag.py | Python script to convert a GO OBO file to a directed acyclic graph (DAG)
OBOfile_to_GOdesc.py | Python script to convert a GO OBO file to a list of GO ids and descriptions
calcEntropy.py | Generic script to calculate entropy scores for a given set of values. For details on the input file format, execute the script without any parameters to display help. The script was originally used to calculate mean normalized entropies for clusters of predicted orthologs, with GO ids provided for each protein or gene id in a given cluster.
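
The compact DAG format described above (a GO id followed by its space-delimited children) can be loaded with a few lines of Python. The sketch below is not one of the distributed scripts. It uses hypothetical function names and placeholder ids rather than real GO terms, and it computes each term's depth as the shortest distance from a chosen root, which is only one possible definition of a hierarchy level.

```python
from collections import deque
from typing import Dict, Iterable, List


def parse_go_dag(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each GO id to its list of children (space-delimited rows)."""
    children: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.split()
        if fields:
            children[fields[0]] = fields[1:]
    return children


def depths_from_root(children: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Breadth-first search: shortest distance from root to each reachable term.

    Assumption: "level" means shortest path length from the root; the
    original evaluation may have used a different definition.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in depth:
                depth[child] = depth[term] + 1
                queue.append(child)
    return depth


# Placeholder ids for illustration; these are not real GO terms.
sample_dag = ["GO:A GO:B GO:C", "GO:B GO:D", "GO:C GO:D", "GO:D"]
dag = parse_go_dag(sample_dag)
depth = depths_from_root(dag, "GO:A")
```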


IsoRank and IsoRankN binaries, PPI networks, GI networks, and BLAST scores
File | Description
IsoRankN-HighSpeed.tar | IsoRankN High Speed Version executable and README
IsoRankN.tgz | IsoRankN executable and README
IsoRankN2.tgz | IsoRankN2 executable and README. IsoRankN2 accepts and simultaneously aligns two unrelated sets of networks. The algorithm was used to integrate network alignments of genetic interactions (GI). An optional --beta flag takes an argument from 0 to 1. A beta of 0.5 weights both sets of networks equally, and a beta of 0.75 weights the first set of networks 3 times more than the second set.
BLAST_Bit_Scores.tar.gz | BLAST bit scores (sequences from Ensembl) for C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae
ppi_networks.1.0.2.tar.gz | PPI networks from BioGRID (release 3.0.68, 08/31/2010), DIP (06/14/2010), HPRD (release 9, 04/13/2010), MINT (07/28/2010), and IntAct (10/07/2010) for C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae
gi_networks.1.0.1.tar.gz | Gene-gene interaction networks from BioGRID (release 3.0.68, 08/31/2010) for C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae

Statistics


Datasets

We used IsoRank & IsoRankN on five eukaryotic PPI networks: H. sapiens (Human), M. musculus (Mouse), D. melanogaster (Fly), C. elegans (Worm), and S. cerevisiae (Yeast). Two forms of data were required as inputs: PPI networks and sequence similarity scores. The PPI networks were constructed by combining data from the DIP (06/14/2010), BioGRID (release 3.0.68, 08/31/2010), and HPRD (release 9, 04/13/2010) databases. In total, these five networks contained 87,737 proteins and 114,897 known interactions. The sequence similarity score for each pair of proteins was the BLAST bit score of their sequences as retrieved from Ensembl.


Species | Number of Proteins | Number of Interactions
H. sapiens | 22369 | 43757
M. musculus | 24855 | 452
D. melanogaster | 14098 | 26726
C. elegans | 19756 | 5853
S. cerevisiae | 6659 | 38109

Evaluation

We evaluated the biological relevance of our results against the Gene Ontology database (GO). We first measured the consistency of the predicted network alignment by computing the mean entropy of the predicted clusters. The entropy of a given cluster $S^*_v$ is:

$$H(S^*_v) = -\sum_{i=1}^{d} p_i \log p_i$$

where $p_i$ is the fraction of $S^*_v$ with GO group ID $i$ and $d$ is the number of distinct GO group IDs in the cluster. Thus a cluster has lower entropy if its GO annotations are more consistent within the cluster. We also measured the fraction of clusters that are exact, i.e. those in which all proteins have the same GO ID. To choose a proper set of GO annotations, we projected all GO terms to the same level of the GO hierarchy (k=5), removing questions of generality of terms and relatedness of annotations having different IDs. Note that only 60-70% of the proteins in any of the aligned networks have an assigned GO ID, comparable to the fraction of all known proteins included in GO. Additionally, the relative performance of this consistency measure does not change when projecting GO terms to GO hierarchy levels of k=4, k=5, or k=6.
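
The cluster entropy and the exact-cluster test described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the distributed calcEntropy.py: it assumes standard Shannon entropy over the fractions p_i, one GO ID per protein, and a normalized entropy obtained by dividing by the log of the number of distinct GO IDs in the cluster. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def cluster_entropy(go_ids: Sequence[str]) -> float:
    """Shannon entropy of the GO ids in one cluster (natural log)."""
    counts = Counter(go_ids)
    total = sum(counts.values())
    # p * log(1/p) is the same as -p * log(p), with p = c / total.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def normalized_entropy(go_ids: Sequence[str]) -> float:
    """Entropy divided by log(d), d = number of distinct GO ids.

    Assumption: this normalization is a guess at how the "normalized
    entropy" column is defined. Clusters with d <= 1 are given 0.0.
    """
    d = len(set(go_ids))
    if d <= 1:
        return 0.0
    return cluster_entropy(go_ids) / math.log(d)


def is_exact(go_ids: Sequence[str]) -> bool:
    """A cluster is exact when all of its proteins share the same GO id."""
    return len(go_ids) > 0 and len(set(go_ids)) == 1
```

Under these assumptions an exact cluster has entropy 0, and a cluster split evenly between two GO IDs has normalized entropy 1.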

Consistency | IsoRank & IsoRankN | Homologene | OrthoMCL
Mean normalized entropy (all species) | 0.086 | 0.262 | 0.206
Mean normalized entropy (human, fly) | 0.066 | 0.298 | 0.260
Exact cluster ratio* | 0.250 (1752 of 7010) | 0.232 (1805 of 7769) | 0.220 (794 of 3602)
Exact protein ratio* | 0.253 (7488 of 29636) | 0.288 (5196 of 18057) | 0.270 (1996 of 7387)
*The fraction of predicted clusters that are exact, and the fraction of proteins in exact clusters.


Coverage* (# of species) | IsoRank & IsoRankN | Homologene | OrthoMCL
Total | 12848/48978 | 11746/22527 | 10008/27601
2 | 3844/8739 | 5584/11151 | 2576/5213
3 | 4022/13533 | 1940/5801 | 638/2016
4 | 2926/13991 | 1652/6615 | 470/1977
5 | 2056/12715 | 745/3729 | 366/1931
*The number of predicted clusters containing exactly # species and the number of constituent proteins in those clusters (#clusters / #proteins)


GO/KEGG | IsoRank & IsoRankN
p-value* | 1.28e-90
GO/KEGG category | 712/2490
Human | 632/2200
Mouse | 605/2124
Fly | 574/1787
Worm | 552/1698
Yeast | 368/938

The number of GO/KEGG categories enriched by IsoRank & IsoRankN. *As computed by GO TermFinder; this excludes proteins tagged IEA (inferred from electronic annotation).

FAQ


1. How can I cite Isobase?

If you would like to cite Isobase, the references are given as follows:

1. Rohit Singh, Jinbo Xu, and Bonnie Berger. (2008) Global alignment of multiple protein interaction networks with application to functional orthology detection, Proc. Natl. Acad. Sci. USA, 105:12763-12768.

2. Chung-Shou Liao, Kanghao Lu, Michael Baym, Rohit Singh, and Bonnie Berger. (2009) IsoRankN: Spectral methods for global alignment of multiple protein networks, Bioinformatics, 25:i253-i258.

3. Daniel Park, Rohit Singh, Michael Baym, Chung-Shou Liao, and Bonnie Berger. (2011) IsoBase: A Database of Functionally Related Proteins across PPI Networks, Nucleic Acids Research, 39:D295-D300.


2. What is the main difference between Isobase and other ortholog databases?

Isobase has been developed by global network alignment on multiple PPI networks. We demonstrate that incorporating PPI network data in ortholog prediction results in improvements over existing sequence-only approaches and over predictions from local alignments. In addition, our network alignment tools outperform existing algorithms for global network alignment in coverage and consistency on multiple alignments of the five available eukaryotic PPI networks.


3. How does Isobase generate functionally related proteins across multiple species?

We use the global PPI network alignment tools IsoRank & IsoRankN, which are based on ideas similar to PageRank and graph spectral clustering, to detect and generate functionally related orthologs (isologs) for Isobase. We also evaluate the biological relevance of our predictions against two gene ontology databases: GO and KEGG.


4. What ortholog information is provided?

For each ortholog across multiple species, brief information, such as constituent protein names and their respective synonyms, is provided. We also give the entropy of every ortholog cluster as a measure of its consistency: a cluster has lower entropy if its GO and KEGG annotations are more consistent within the cluster. The GO and KEGG categories to which the constituent proteins belong are also shown.


5. Why does Isobase collect orthologs from only five eukaryotic species?

Isobase is a collection of functionally related orthologs (isologs) predicted by our network alignment tools. Hence we only provide the orthologs from five available eukaryotic PPI networks so far. The isologs across prokaryotic species (PPI networks) will be presented in the near future. With the increasing availability of large PPI networks, Isobase will collect orthologs from more species.


Help




Table of Contents
1. Searching for isologs of a gene
2. Including genetic interactions in IsoRank network alignments
3. Supported gene ids
4. Walkthrough
5. Input page: isolog search
6. Results page: isolog search

1. Searching for isologs of a gene

To identify isologs of a specific gene, enter the gene id, symbol, or any synonym/keyword.


2. Including genetic interactions in IsoRank network alignments

The IsoRank executables support an option to include an additional unrelated network in the alignment. This additional network is aligned simultaneously with the sequence data and the PPI network. Genes with low coverage in the PPI network can be covered by an additional genetic interaction network.


3. Supported gene ids

Isobase supports querying for isologs by multiple gene id types. Currently supported ids include WormBase, FlyBase, SGD (Saccharomyces Genome Database), HPRD (Human Protein Reference Database), MGI (Mouse Genome Informatics), gene names, and gene symbols.


4. Walkthrough
5. Input page: isolog search


The user can query for isologs by supplying a gene id, symbol, name, or any of its synonyms. In this example, Isobase searches for the isolog(s) having id "CG4252".




Including genetic interactions in the network alignment


To incorporate a genetic interaction network into the PPI network alignment, the IsoRank executables can accept two unrelated sets of networks and align them simultaneously. The two network types are averaged together using the parameter beta, which is the weight of the primary network. A beta of 0.5 weights both networks equally, and a beta of 0.75 weights the first network 3 times more than the second. The secondary network is enabled in the executable with the flag --beta, which takes an argument from 0 to 1. The executable looks for the secondary network's interactions in the supplied data directory, in files with the .tab2 suffix. For example, if your human PPI data is in data/hsapi.tab, the human genetic interaction data goes in data/hsapi.tab2.
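
The beta weighting and the .tab2 naming convention described above can be illustrated with a short Python sketch. It assumes the averaging is a convex combination, beta times the primary value plus (1 - beta) times the secondary value. That form matches the stated 1:1 and 3:1 weightings, but it is not taken from the IsoRankN2 source, and which quantities the executable actually averages is not specified here. The function names are hypothetical.

```python
from pathlib import PurePosixPath


def combine_scores(primary: float, secondary: float, beta: float) -> float:
    """Weighted average of two values, with beta as the primary weight.

    Assumption: a convex combination; the executable's internals may differ.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    return beta * primary + (1.0 - beta) * secondary


def weight_ratio(beta: float) -> float:
    """How many times more the primary network counts than the secondary."""
    return beta / (1.0 - beta)


def secondary_network_path(primary_path: str) -> str:
    """Derive the .tab2 file name that sits next to a .tab PPI file."""
    return str(PurePosixPath(primary_path).with_suffix(".tab2"))


# The example path from the text above.
gi_path = secondary_network_path("data/hsapi.tab")
```

With beta = 0.75, weight_ratio returns 3.0, matching the statement that the first network is weighted 3 times more than the second.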






6. Search results page: isologs of the gene nemo



Search results table (fields: Entropy, Species, Gene name, GO, Synonyms)






D. Park, R. Singh, M. Baym, C. Liao, and B. Berger. 2011. "IsoBase: A Database of Functionally Related Proteins across PPI Networks." Nucleic Acids Research, doi:10.1093/nar/gkq1234.
R. Singh, J. Xu, and B. Berger. 2008. "Global alignment of multiple protein interaction networks with application to functional orthology detection." Proc. Natl. Acad. Sci. USA, 105:12763-12768.
C. Liao, K. Lu, M. Baym, R. Singh, and B. Berger. 2009. "IsoRankN: Spectral methods for global alignment of multiple protein networks." Bioinformatics, 25:i253-i258.

Questions or comments? Please contact isobase at csail.mit.edu Berger Lab | CSAIL | MIT