Isobase (IsoRank PPI Network Alignment Based Ortholog Database) is a database of functionally related orthologs, which we term "isologs", developed from the multiple alignment of five major eukaryotic PPI networks, as computed by the global network alignment tools IsoRank & IsoRankN - the "iso-" being motivated by the connection of our work to graph isomorphism. Isologs are proteins that perform functionally equivalent roles in different species. By emphasizing both sequence similarity and functional similarity, isologs are intended to address some of the shortcomings of traditional sequence-only orthology-prediction approaches.

How to use Isobase: You can bulk-download all isologs across multiple species, and search for orthologs based on a constituent gene id/name or keyword. A detailed tutorial is available in the Help section below.

Predictions that combine both protein-protein and genetic interactions in the IsoRank algorithm are also available. Gene-gene interactions for yeast, fly, worm, and human were considered. To view these predictions, check the checkbox on the right side of the navigation bar.

Summary
Eukaryotic species (yeast, fly, worm, mouse, human)	5
Protein sequences	87773
Orthologs	12693
Constituent proteins	48120
Last updated on 11/24/14

Downloads

Cluster predictions, cluster entropies, and id maps
File	Description	Tab format
cid.entropy.nentropy.pids.txt	Ortholog clusters and entropy scores	cluster id, entropy, normalized entropy, space-delimited list of internal pids
pid.symbol.id.syns.txt	Id map from internal pid to external ids	pid, gene symbol, gene id, synonym1|synonym2|...
pid.go.goevid.goaspect.txt	Pid to various GO information	pid, go id, go evidence, go aspect
go.desc.txt	GO to description map	go id, go description
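
The cluster file layout described above (cluster id, entropy, normalized entropy, space-delimited pids, separated by tabs) can be read with a few lines of Python. This is a minimal sketch: the `Cluster` type, the `parse_clusters` function, and the sample rows are hypothetical, and it assumes both entropy columns are always numeric.

```python
from typing import Iterable, List, NamedTuple


class Cluster(NamedTuple):
    cid: str
    entropy: float
    normalized_entropy: float
    pids: List[str]


def parse_clusters(lines: Iterable[str]) -> List[Cluster]:
    """Parse rows of: cid <TAB> entropy <TAB> normalized entropy <TAB> pid pid ..."""
    clusters = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cid, entropy, nentropy, pids = line.split("\t", 3)
        # Assumption: the two entropy columns parse as floats.
        clusters.append(Cluster(cid, float(entropy), float(nentropy), pids.split()))
    return clusters


# Hypothetical sample rows, not taken from the real file.
sample = [
    "1\t0.0\t0.0\t101 202 303\n",
    "2\t0.693\t1.0\t404 505\n",
]
clusters = parse_clusters(sample)
for c in clusters:
    print(c.cid, c.normalized_entropy, len(c.pids))
```

To use it on the downloaded file, pass an open file handle for cid.entropy.nentropy.pids.txt in place of `sample`.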

GO hierarchy tools
File	Description	Tab format
gene_ontology_ext.obo.txt	Full ontology file (version 1.2, 04/2010), including cross-products, inter-ontology links, and has-part relationships	See Gene Ontology website
go.dag.obo.v1.2.txt	Directed acyclic graph compact representation of full ontology file.	GO id followed by its children, space-delimited
OBOfile_to_GOdag.py	Python script to convert a GO OBO file to a directed acyclic graph (DAG)
OBOfile_to_GOdesc.py	Python script to convert GO OBO file to a list of GO id/descriptions
calcEntropy.py	This is a generic script to calculate entropy scores for a given set of values. For details on input file format, execute the script without any parameters to display help. The script was originally used to calculate mean normalized entropies for clusters of predicted orthologs. GO ids were provided for each protein or gene id in a given cluster.
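
The compact DAG format listed above (a GO id followed by its children, space-delimited) maps naturally onto a dictionary of child lists. The following is a sketch only: `parse_go_dag` and `descendants` are hypothetical helper names, not functions from the scripts above, and the GO ids in the sample are placeholders with made-up relationships.

```python
from collections import deque


def parse_go_dag(lines):
    """Map each GO id to the list of its direct children."""
    children = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        children[fields[0]] = fields[1:]
    return children


def descendants(children, go_id):
    """Collect every term reachable from go_id by following child links."""
    seen = set()
    queue = deque(children.get(go_id, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a DAG node can be reached through several parents
        seen.add(node)
        queue.extend(children.get(node, []))
    return seen


# Placeholder ids and relationships, for illustration only.
sample = [
    "GO:9000001 GO:9000002 GO:9000003",
    "GO:9000002 GO:9000004",
    "GO:9000003 GO:9000004",
    "GO:9000004",
]
dag = parse_go_dag(sample)
print(sorted(descendants(dag, "GO:9000001")))
```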

IsoRank and IsoRankN binaries, PPI networks, GI networks, and BLAST scores
File	Description
IsoRankN-HighSpeed.tar	IsoRankN High Speed Version executable and README
IsoRankN.tgz	IsoRankN executable and README
IsoRankN2.tgz	IsoRankN2 executable and README. IsoRankN2 accepts and simultaneously aligns two unrelated sets of networks. Algorithm was used to integrate network alignments of genetic interactions (GI). Optional parameter includes a --beta flag which takes an argument from 0 to 1. A beta of 0.5 weights both sets of networks equally and a beta of 0.75 weights the first set of networks 3 times more than the second set of networks.
BLAST_Bit_Scores.tar.gz	BLAST bit scores (sequences from Ensembl) for C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae
ppi_networks.1.0.2.tar.gz	PPI networks from Biogrid (release 3.0.68, 08/31/2010), DIP (06/14/2010), HPRD (release 9, 04/13/2010), MINT (07/28/2010), IntAct (10/07/2010) for C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae
gi_networks.1.0.1.tar.gz	Gene-gene interaction networks from Biogrid (release 3.0.68, 08/31/2010) for C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae

Statistics

Datasets

We used IsoRank & IsoRankN on five eukaryotic PPI networks: H. sapiens (Human), M. musculus (Mouse), D. melanogaster (Fly), C. elegans (Worm), and S. cerevisiae (Yeast). Two forms of data were required as inputs: PPI networks and sequence similarity scores. The PPI networks were constructed by combining data from the DIP (06/14/2010), BioGRID (release 3.0.68, 08/31/2010), and HPRD (release 9, 04/13/2010) databases. In total, these five networks contained 87,737 proteins and 114,897 known interactions. The sequence similarity scores of pairs of proteins were the BLAST bit scores of the sequences as retrieved from Ensembl.

Species	Number of Proteins	Number of Interactions
H. sapiens	22369	43757
M. musculus	24855	452
D. melanogaster	14098	26726
C. elegans	19756	5853
S. cerevisiae	6659	38109
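
As a quick arithmetic check, the per-species counts in the table above can be summed and compared with the totals quoted in the Datasets paragraph. The dictionary below simply restates the table; the variable names are illustrative.

```python
# (number of proteins, number of interactions) per species, copied from the table.
networks = {
    "H. sapiens": (22369, 43757),
    "M. musculus": (24855, 452),
    "D. melanogaster": (14098, 26726),
    "C. elegans": (19756, 5853),
    "S. cerevisiae": (6659, 38109),
}

total_proteins = sum(proteins for proteins, _ in networks.values())
total_interactions = sum(interactions for _, interactions in networks.values())
print(total_proteins, total_interactions)  # 87737 114897
```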

Evaluation

We evaluated the biological relevance of our results against the Gene Ontology database (GO). We first measured the consistency of the predicted network alignment by computing the mean entropy of the predicted clusters. The entropy of a given cluster S*_v is:

H(S*_v) = -\sum_{i=1}^{d} p_i \log p_i

where p_i is the fraction of S*_v with GO group ID i and d is the number of distinct GO group IDs in the cluster. Thus a cluster has lower entropy if its GO annotations are more within-cluster consistent. We also measured the fraction of clusters which are exact, i.e. those in which all proteins have the same GO ID. With regard to choosing a set of proper GO annotations, we projected all GO terms to the same level of the GO hierarchy (k=5), removing questions of generality of terms and relatedness of annotations having different IDs. Note that only 60-70% of the proteins in any of the aligned networks have an assigned GO ID, comparable to the fraction of all known proteins included in GO. Additionally, the relative performance of this consistency measure does not change when projecting GO terms to GO hierarchy levels of k=4, k=5, or k=6.
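
The cluster entropy described above can be illustrated with a short Python sketch. It is not the calcEntropy.py script from the Downloads section. It assumes one GO ID per annotated protein, uses the natural logarithm, and normalizes by log d (d being the number of distinct GO IDs in the cluster); the normalization is an assumption, since this page does not define it.

```python
import math
from collections import Counter


def cluster_entropy(go_ids):
    """Shannon entropy of the GO IDs annotating the proteins of one cluster."""
    counts = Counter(go_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no annotated proteins
    # p_i = fraction of the cluster carrying GO ID i
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def normalized_entropy(go_ids):
    """Entropy divided by log(d); assumed normalization, see lead-in."""
    d = len(set(go_ids))
    if d <= 1:
        return 0.0  # a cluster with a single GO ID is perfectly consistent
    return cluster_entropy(go_ids) / math.log(d)


# Hypothetical clusters, labelled with placeholder GO IDs.
consistent = ["GO:a", "GO:a", "GO:a"]
mixed = ["GO:a", "GO:a", "GO:b", "GO:c"]
print(cluster_entropy(consistent), normalized_entropy(mixed))
```

A cluster whose proteins all share one GO ID scores 0, and the score rises as the annotations become more evenly spread across different IDs.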

Consistency	IsoRank & IsoRankN	Homologene	OrthoMCL
Mean normalized entropy (all species)	0.086	0.262	0.206
Mean normalized entropy (human, fly)	0.066	0.298	0.260
Exact cluster ratio*	0.250 (1752 of 7010)	0.232 (1805 of 7769)	0.220 (794 of 3602)
Exact protein ratio*	0.253 (7488 of 29636)	0.288 (5196 of 18057)	0.270 (1996 of 7387)
*The fraction of predicted clusters which are exact, fraction of proteins in exact clusters.
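
The two "exact" ratios in the table above can be sketched as follows, again assuming one GO ID per annotated protein and counting a cluster as exact when all of its GO IDs are identical. The function name and sample clusters are hypothetical; the last lines recompute the IsoRank & IsoRankN ratios from the counts given in the table.

```python
def exact_ratios(clusters):
    """Return (exact cluster ratio, exact protein ratio).

    clusters: list of clusters, each a list with one GO ID per annotated protein.
    """
    exact = [c for c in clusters if c and len(set(c)) == 1]
    cluster_ratio = len(exact) / len(clusters)
    protein_ratio = sum(len(c) for c in exact) / sum(len(c) for c in clusters)
    return cluster_ratio, protein_ratio


# Hypothetical clusters with placeholder GO IDs.
sample = [
    ["GO:a", "GO:a"],
    ["GO:a", "GO:b", "GO:c"],
    ["GO:b", "GO:b", "GO:b"],
    ["GO:a", "GO:c"],
]
cluster_ratio, protein_ratio = exact_ratios(sample)
print(cluster_ratio, protein_ratio)

# Ratios recomputed from the counts reported in the table.
isorank_cluster_ratio = round(1752 / 7010, 3)
isorank_protein_ratio = round(7488 / 29636, 3)
print(isorank_cluster_ratio, isorank_protein_ratio)
```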

Coverage* (# of species)	IsoRank & IsoRankN	Homologene	OrthoMCL
Total	12848/48978	11746/22527	10008/27601
2	3844/8739	5584/11151	2576/5213
3	4022/13533	1940/5801	638/2016
4	2926/13991	1652/6615	470/1977
5	2056/12715	745/3729	366/1931
*The number of predicted clusters containing exactly # species and number of constituent proteins in those clusters (#cluster / #proteins)

GO/KEGG	IsoRank & IsoRankN
p-value*	1.28 e-90
GO/KEGG category	712/2490
Human	632/2200
Mouse	605/2124
Fly	574/1787
Worm	552/1698
Yeast	368/938
The number of GO/KEGG categories enriched by IsoRank & IsoRankN. *As computed by GO TermFinder; this excludes proteins tagged IEA (inferred from electronic annotation).

FAQ

1. How can I cite Isobase?

If you would like to cite Isobase, the references are given as follows:

1. Rohit Singh, Jinbo Xu, and Bonnie Berger. (2008) Global alignment of multiple protein interaction networks with application to functional orthology detection, Proc. Natl. Acad. Sci. USA, 105:12763-12768.

2. Chung-Shou Liao, Kanghao Lu, Michael Baym, Rohit Singh, and Bonnie Berger. (2009) IsoRankN: Spectral methods for global alignment of multiple protein networks, Bioinformatics, 25:i253-i258.

3. Daniel Park, Rohit Singh, Michael Baym, Chung-Shou Liao, and Bonnie Berger. (2011) IsoBase: A Database of Functionally Related Proteins across PPI Networks, Nucleic Acids Research, 39:D295-D300.

2. What is the main difference between Isobase and other ortholog databases?

Isobase has been developed by global network alignment on multiple PPI networks. We demonstrate that incorporating PPI network data in ortholog prediction results in improvements over existing sequence-only approaches and over predictions from local alignments. In addition, our network alignment tools outperform existing algorithms for global network alignment in coverage and consistency on multiple alignments of the five available eukaryotic PPI networks.

3. How does Isobase generate functionally related proteins across multiple species?

We use the global PPI network alignment tools, IsoRank & IsoRankN, based on an idea similar to PageRank and graph spectral clustering, to detect and generate functionally related orthologs (isologs) for Isobase. We also evaluate the biological relevance of our predictions against two functional annotation databases: GO and KEGG.

4. What ortholog information is provided?

For each ortholog cluster across multiple species, we provide brief information such as the constituent protein names and their respective synonyms. Moreover, we give the entropy of every ortholog cluster to represent its consistency: a cluster has lower entropy if its GO and KEGG annotations are more within-cluster consistent. The GO and KEGG categories that the constituent proteins belong to are also shown.

5. Why does Isobase collect orthologs from only five Eukaryotic species?

Isobase is a collection of functionally related orthologs (isologs) predicted by our network alignment tools. Hence we only provide the orthologs from five available eukaryotic PPI networks so far. The isologs across prokaryotic species (PPI networks) will be presented in the near future. With the increasing availability of large PPI networks, Isobase will collect orthologs from more species.

Help

Table of Contents
1. Searching for isologs of a gene
2. Including genetic interactions in IsoRank network alignments
3. Supported gene ids
4. Walkthrough
5. Input page: isolog search
6. Results page: isolog search

1. Searching for isologs of a gene
To identify isologs of a specific gene, enter the gene id, symbol, or any synonym/keyword.

2. Including genetic interactions in IsoRank network alignments
The IsoRank executables support an option to include an additional unrelated network in the alignment. This additional network is simultaneously aligned with the sequence data and the PPI network. Genes with low coverage in the PPI network can be covered with an additional genetic interaction network.

3. Supported gene ids
Isobase supports querying for isologs by multiple gene id types. Currently, supported ids include Wormbase, FlyBase, SGD (Saccharomyces Genome Database), HPRD (Human Protein Reference Database), MGI (Mouse Genome Informatics), gene names and gene symbols.

4. Walkthrough
5. Input page: isolog search

The user can query for isologs by supplying a gene id, symbol, name, or any of its synonyms. In this example, Isobase searches for the isolog(s) having id "CG4252".
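
The same kind of lookup can be approximated offline from the two downloadable files pid.symbol.id.syns.txt and cid.entropy.nentropy.pids.txt. The sketch below is not the website's search implementation. All function names and sample rows are hypothetical, matching is assumed to be case-insensitive, and synonyms are assumed to be separated by a plain "|".

```python
from collections import defaultdict


def build_name_index(id_map_lines):
    """Map every symbol, gene id, and synonym (lower-cased) to its internal pids."""
    index = defaultdict(set)
    for line in id_map_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or blank rows
        pid, symbol, gene_id = fields[:3]
        synonyms = fields[3].split("|") if len(fields) > 3 and fields[3] else []
        for name in [symbol, gene_id, *synonyms]:
            if name:
                index[name.lower()].add(pid)
    return index


def build_cluster_index(cluster_lines):
    """Map each pid to the id of the cluster that contains it."""
    pid_to_cid = {}
    for line in cluster_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        for pid in fields[3].split():
            pid_to_cid[pid] = fields[0]
    return pid_to_cid


def find_isolog_clusters(query, name_index, pid_to_cid):
    """Return the ids of all clusters containing a protein that matches the query."""
    pids = name_index.get(query.lower(), set())
    return {pid_to_cid[p] for p in pids if p in pid_to_cid}


# Hypothetical rows; only the query id "CG4252" comes from the walkthrough.
id_map = ["11\tgeneA\tCG4252\tsynA1|synA2\n", "22\tgeneB\tIDB\tsynB1\n"]
cluster_rows = ["7\t0.0\t0.0\t11 22\n"]

name_index = build_name_index(id_map)
pid_to_cid = build_cluster_index(cluster_rows)
print(find_isolog_clusters("CG4252", name_index, pid_to_cid))
```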

Including genetic interactions in the network alignment

To incorporate a genetic interaction network into the PPI network alignment, the IsoRank executables can accept two unrelated sets of networks and align them simultaneously. The two network types are averaged together using the parameter beta, which is the weight of the primary network. A beta of 0.5 weights both networks equally, and a beta of 0.75 weights the first network 3 times more than the second. The secondary network is enabled in the executable with the flag --beta, which takes an argument from 0 to 1. The executable looks for the connections of this network in the supplied data directory, in files with the .tab2 suffix. For example, if your human PPI data is in data/hsapi.tab, the human genetic interaction data goes in data/hsapi.tab2.
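
The beta weighting and the .tab2 naming convention can be illustrated with a small Python sketch. It assumes that "averaged together" means a convex combination, beta times the primary score plus (1 - beta) times the secondary score; the executable's internal computation is not documented here, and all function names are hypothetical.

```python
from pathlib import Path


def secondary_network_path(primary):
    """Derive the genetic interaction file name from the PPI file name (.tab -> .tab2)."""
    p = Path(primary)
    return p.with_suffix(p.suffix + "2")


def combine(primary_score, secondary_score, beta):
    """Weighted average of two scores; assumed form of the beta weighting."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    return beta * primary_score + (1.0 - beta) * secondary_score


def weight_ratio(beta):
    """How many times more weight the primary network gets than the secondary one."""
    if beta == 1.0:
        return float("inf")  # secondary network is ignored entirely
    return beta / (1.0 - beta)


print(secondary_network_path("data/hsapi.tab").as_posix())  # data/hsapi.tab2
print(weight_ratio(0.5), weight_ratio(0.75))  # 1.0 3.0
```

Under this assumption, a beta of 0.75 gives the 3-to-1 weighting described in the text, and a beta of 1 reduces to the PPI-only alignment.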

6. Search results page: isologs of the gene nemo

Search results fields: Entropy, Species, Gene name, Synonyms

D. Park, R. Singh, M. Baym, C. Liao, and B. Berger. 2011. "IsoBase: A Database of Functionally Related Proteins across PPI Networks." Nucleic Acids Research, doi:10.1093/nar/gkq1234.

R. Singh, J. Xu, and B. Berger. 2008. "Global alignment of multiple protein interaction networks with application to functional orthology detection." Proc. Natl. Acad. Sci. USA, 105:12763-12768.

C. Liao, K. Lu, M. Baym, R. Singh, and B. Berger. 2009. "IsoRankN: Spectral methods for global alignment of multiple protein networks." Bioinformatics, 25:i253-i258.

Questions or comments? Please contact isobase at csail.mit.edu Berger Lab | CSAIL | MIT