Think globally and solve locally: secondary memorybased network learning for automated multispecies function prediction
 Marco Mesiti†^{1},
 Matteo Re†^{1} and
 Giorgio Valentini†^{1}Email author
DOI: 10.1186/2047217X35
© Mesiti et al.; licensee BioMed Central Ltd. 2014
Received: 4 December 2013
Accepted: 1 April 2014
Published: 23 April 2014
Abstract
Background
Networkbased learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and limited a priori known functional annotations. As a consequence their application to model organisms is often restricted to well characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution to this problem might consist in the construction of big networks including multiple species, but this in turn poses challenging computational problems, due to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the usage of big computers could in principle respond to these issues, but raises further algorithmic problems and require resources not satisfiable with simple offtheshelf computers.
Results
We propose a novel framework for scalable networkbased learning of multispecies protein functions based on both a local implementation of existing algorithms and the adoption of innovative technologies: we solve “locally” the AFP problem, by designing “vertexcentric” implementations of networkbased algorithms, but we do not give up thinking “globally” by exploiting the overall topology of the network. This is made possible by the adoption of secondary memorybased technologies that allow the efficient use of the large memory available on disks, thus overcoming the main memory limitations of modern offtheshelf computers. This approach has been applied to the analysis of a large multispecies network including more than 300 species of bacteria and to a network with more than 200,000 proteins belonging to 13 Eukaryotic species. To our knowledge this is the first work where secondarymemory based network analysis has been applied to multispecies function prediction using biological networks with hundreds of thousands of proteins.
Conclusions
The combination of these algorithmic and technological approaches makes feasible the analysis of large multispecies networks using ordinary computers with limited speed and primary memory, and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using wellequipped standalone machines.
Keywords
Biomolecular networks Big data analysis Networkbased learningBackground
In recent years many efforts have been devoted to build automated tools for large scale automated function prediction of proteins (AFP) exploiting the knowledge generated by high throughput biotechnologies [1, 2]. As highlighted by a recent international challenge for the critical assessment of automated function prediction [3], scalability and heterogeneity of the available data represent two of the main challenges posed by AFP. Indeed on the one hand no single experimental method can fully characterize the multiplicity of the protein functions, and on the other hand the huge amount of data to be processed poses serious computational problems. The complexity of the problem is furthermore exacerbated by the different level of the functional annotation coverage in different organisms, thus making very difficult the effective transfer of the available functional knowledge from one organism to another.
Computational automated function prediction approaches can be useful for the integration of diverse types of data coming from multiple, often unrelated, proteomic and genomic pipelines. A recent example is represented by the Integrative multispecies prediction (IMP) web server [4] which integrates prior knowledge and data collections from multiple organisms for the generation of novel functional working hypotheses used in experimental followup. Despite its undoubted usefulness, IMP actually covers only seven model organisms, preventing its application to the prediction of the functions of proteins belonging to the proteomes of poorly annotated organisms.
Another popular approach for gene functional annotation transfer between species relies on the availability of a collection of orthology relationships across interspecies proteins, and on the usage of an evolutionary relationships network as a suitable medium for transferring functional annotations to the proteins of poorly annotated organisms [5]. Even if orthology is an evolutionary concept, rather than a functional one, it can be used to link functionally equivalent genes across genomes and enables the functional inference of unknown proteins using one or more functionally characterized orthologs in other species [6, 7].
As noticed in [4], the accuracy of machinelearning algorithms for AFP tasks is negatively affected by the sparse coverage of experimental data and by the limited availability of prior functional knowledge. Consequently, these methods are often applied only to biological processes and pathways that are already well characterized for an organism. The construction of large scale multi species networks can be a solution to this problem. Following this approach, network based learning algorithms might benefit of the availability of a priori functional knowledge coming from well annotated species to effectively perform a functional transfer to the proteins of poorly annotated organisms.
Unfortunately this solution is only apparently simple, since the application of classical graphbased algorithms such as the ones based on random walks [8] or label propagation methods [9, 10] are often unfeasible with large multispecies networks, especially when only single offtheshelf machines are available. These approaches, indeed, usually rely on an inmemory adjacency matrix representation of the graph network, scale poorly with the size of the graph [11], and may have time complexity that becomes quickly prohibitive. Performance optimization is usually realized by adopting an adjacencylist representation of the graph to take its sparsity into account, or by using parallel strategies for matrix multiplication [12]. However, when the size of the graph becomes so high that is not possible to maintain it entirely in primary memory, either approaches based on parallel distributed computation [13–15], or secondary memorybased computation [16–18] can be considered. With distributed computation techniques, the graph is spread on different machines and the results are finally collected. However, as outlined in [16], a key issue of these approaches is the need to identify a cut of the graph in order to minimize the communication overhead among machines and their synchronization activities. With secondary memorybased computation, the graph is stored on the disk of a single machine and only limited portions of the graph are loaded in primary memory for computation. In this way, it is possible to overcome the lack of enough primary memory. The use of smart strategies for caching the portions of graph needed for computation [19], the minimization of the number of accesses to secondary memory [20], and the usage of compressed data structures for maintaining the graph in primary memory [21] are the main challenges for making the management of large graph networks in offtheshelf machines comparable to distributed approaches.
In this work we propose a novel framework for scalable semisupervised networkbased learning of multispecies protein functions: on the one hand we adopt a “local learning strategy” to implement classical graphbased algorithms for protein function prediction, and on the other hand we apply secondary memorybased technologies to exploit the large disks available in ordinary offtheshelf computers. The combination of these algorithmic and technological approaches makes feasible the analysis of large multispecies networks in ordinary computers with limited speed and primary memory and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using wellequipped standalone machines.
Only very recently a paper has been devoted to the application of graph database technologies in bioinformatics [22], and to our knowledge this is the first work where secondarymemory based network analysis has been applied to multispecies function prediction using big biological networks with hundreds of thousands of proteins.
This paper is organized as follows. In the next section we introduce our proposed approach based on the local implementation of networkbased algorithms and secondary memorybased computation for the multispecies AFP problem. In particular we discuss the characteristics of Neo4j, a database technology for graph querying and processing, and GraphChi, a diskbased system for graph processing. Then, we show their application to a multispecies network involving proteins of about 300 bacteria species, and to a network including 13 species of Eukaryotes with more than 200.000 proteins, using offtheshelf notebook and desktop computers.
Methods
 1.
Local Implementation of networkbased algorithms. To solve the overall AFP problem we adopt a local learning strategy, according to a “vertexcentric” computational model.
 2.
Secondary memorybased computation. We exploit novel technologies for fast and efficient secondarymemory access: the overall graph is maintained on disk and only small parts of it are loaded each time into primary memory.
It is worth noting that we do not propose novel algorithms, but simply their “local implementation”, according to a vertexcentric programming model, necessary for secondary memorybased computation [14]. Indeed the strength of the proposed approach consists precisely in coupling a “local” vertexcentric implementation of networkbased algorithms with technologies based on secondary memory, to make efficient the local access to graphs stored on disk, thus also allowing the processing of big biological networks when limited RAM memory is available.
Local implementation of networkbased algorithms
The most effective networkbased algorithms for AFP learn by exploiting the overall topology of the networks [23–25], and their implementation usually requires to process in primary memory a large part or the overall underlying graph. The main drawback of this implementation is that big networks cannot be entirely loaded into primary memory using offtheshelf machines.
We aim at providing local implementations of “global” network algorithms by iteratively processing only one vertex and its incident edges at a time. In other words we do not reject to think “globally” by exploiting the overall topology of the network, but at the same time we solve “locally” by designing implementations of these algorithms through a vertexcentric programming model [14, 26].
As an example, we consider the local implementation of the “vanilla” random walk (RW) algorithm [8], a popular networkbased method just successfully applied to AFP [24]. It is worth noting that the RW algorithm is “global”, in the sense that it may exploit the global topology of the graph, but it is also intrinsically local, since at each step each vertex can be processed considering only its direct neighbours. From this standpoint its local implementation is straightforward, since it is sufficient to iteratively process each vertex, its edges and its directly connected vertices to obtain a “vertexcentric” implementation of the algorithm. Other algorithms that can process the adjacency matrix of a graph row by row (e.g., label propagation algorithms [9]) can be easily implemented according to a vertexcentric programming model and can benefit from diskbased approaches. More in general the proposed approach can be extended to any other networkbased method for which a local implementation can be provided.
Basic notation
Having a graph G=<V,E>, representing a functional network, where the vertices V correspond to proteins, and edges E to functional relationships between proteins, we indicate proteins with integers, i.e. V={1,2,…,n}, where n=V is the number of vertices/proteins, and edges (i,j)∈E represent functional relationships between vertices i,j∈V. The weights ${w}_{\mathit{\text{ij}}}\in \mathbb{R}$ associated with edges (i,j) represent the “strength” of their functional relationships and are elements of the symmetric weight matrix W. C⊂V indicates the proteins belonging to a functional class c (e.g., a specific Gene Ontology (GO) term [27]).
Local implementation of random walks
Random walk (RW) algorithms [8] explore and exploit the topology of the functional network, starting and walking around from a subset C⊂V of nodes belonging to a specific class c by using a transition probability matrix Q=D ^{−1} W, where D is a diagonal matrix with diagonal elements ${d}_{\mathit{\text{ii}}}=\sum _{j}{w}_{\mathit{\text{ij}}}$. The elements q _{ i j } of Q represent the probability of a random step from i to j.
and the update (1) is iterated until convergence or until a finite number of steps is reached.
 1.
${d}_{\mathit{\text{jj}}}^{1}\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}\frac{1}{\sum _{i}{w}_{\mathit{\text{ji}}}}\phantom{\rule{0.3em}{0ex}}$ (i.e., the inverse of the sum of weights of the edges coming from j)
 2.
w _{ j i },1≤j≤n (i.e., the weights of the inedges of i)
 3.
${p}_{j}^{t},1\le j\le n$ (i.e., the probabilities of node j at the previous step).
 a)If the graph is undirected (and this is the case for the AFP problem), the weights of incoming and outcoming edges are the same, that is ∀i,∀j w _{ i j }=w _{ j i }. This implies that only the list of edge weights outcoming from i: L(i)={w _{ i j }w _{ i j }>0} should be stored. This in turn implies that in sparse graphs the spatial (and temporal) complexity at each node is sublinear, and (3) can be expressed as:${p}_{i}^{t+1}=\sum _{j\in N\left(i\right)}{d}_{\mathit{\text{jj}}}^{1}\phantom{\rule{2.77626pt}{0ex}}{w}_{\mathit{\text{ji}}}\phantom{\rule{2.77626pt}{0ex}}{p}_{j}^{t}$(4)
 b)
We need to store ${p}_{j}^{t},$ and ${p}_{j}^{t+1}$, 1≤j≤n, that is the probabilities at the current and previous step. Once a step is completed, the current probabilities (${p}_{j}^{t+1}$) can be used as starting probabilities for the next iteration.
 c)
We can store ${d}_{\mathit{\text{jj}}}^{1},1\le j\le n$, as a value associated to each node j. It could be computed at each node j as a preprocessing step: ${d}_{\mathit{\text{jj}}}^{1}=\frac{1}{\sum _{i}{w}_{\mathit{\text{ji}}}}$.
 d)
The algorithm iterates for a predefined number of steps or until convergence.
 e)
It is easy to see from (3) that the complexity of each iteration of the algorithm is $\mathcal{O}\left({n}^{2}\right)$, but with sparse graphs, i.e. when ∀i,{(j,i)w _{ j i }>0}<<n, the complexity is $\mathcal{O}\left(n\right)$.
Secondary memorybased computation
To be actually applicable to realworld big networks, the local implementations of the algorithm described in Section “Local implementation of networkbased algorithms” require specific technologies for an efficient access to the secondary memory: indeed we need to efficiently load small parts of a graph, update them in primary memory and finally store them back to disk.
To this end we experimented with two different secondary memorybased technologies. The first one is based on graph DB technologies [28], and the second one on efficient technologies for diskbased processing of graphs.
Neo4j: a DB technology for graph querying and processing
Neo4j[17] is a data management system written in Java based on the graph data model. Nodes, relationships and their properties are first class citizen in the model and their storage on disk is optimized by the adoption of specific data structures for graph networks. The Neo4j Kernel is a fast graph engine with the main characteristics expected by a DBMS, like recovery, management of transactions and indexing structures. Neo4j can be used both as an embedded database within a Java application and as a standalone server with an extensive REST interface for easy integration with Web applications. A declarative query language, named cypher, for the specification of SQLstyle queries is provided.
Internally, Neo4j stores graph data spread across a number of files. Each store file contains the data for a specific part of the graph (e.g. nodes, relationships, properties) and their overall organization, which entails the separation of graph structure from property data, allows the efficient traversal of the graph and the generation of query answers. Both nodes, relationships and properties have a fixed size representation (e.g. nodes have a fixed dimension of 9 bytes), and relationships are implemented using doubly linked lists on disk in order to render efficient their traversal. The fixedsize representation of nodes, relationships and properties has the advantage that identifiers should not be stored (corresponds to the file offset) and that their retrieval by means of their identifiers can be done in constant time.
Since this information is stored in secondary memory, its access is made efficient through the use of caching techniques. At file system level, each store file is divided in equally sized regions and these regions are cached. The cache holds a fixed number of regions for each file, and regions are replaced relying on a least frequently used (LFU)like policy. On top of this mechanism, a more specific node/relationship cache has been implemented that is optimized for traversal (for example, relationships of a node are organized relying on their type and their direction).
GraphChi: a diskbased system for graph processing
GraphChi is a diskbased system for the analysis of big graphs on single offtheshelf computers [16]. Differently from Neo4j, GraphChi has not been conceived for querying large graphstructured databases, but for efficiently processing graphs stored in secondary memory. To this end it implements specialized data structures to efficiently break large graphs into small parts that can be quickly loaded into primary memory, and provides efficient disk I/O operations to reduce the number of non sequential accesses to disk. Moreover, it offers an asynchronous model of computation that directly supports the vertexcentric programming model.
 1.
Read: select an interval and load in primary memory its inedges stored in the associated shard (the “memory shard”). Through at most K non sequential reads load its outedges.
 2.
Execute: perform a parallel update of vertices and edges of the memory shard through multithread asynchronous computation in primary memory.
 3.
Write: The updated vertices and edges are written back to disk.
Note that the mechanism of Parallel Sliding Windows requires at most K ^{2} non sequential reads/writes on disk for a full visit of the entire graph (K reads/writes for each interval), thus resulting in a very efficient management of primary and secondary memory [16].
The C++ implementation of this algorithm in GraphChi entails to keep in main memory a global vector containing a copy of the data structures for each vertex v∈V. Indeed, during the execution of the update function, it is only possible to access the identifier of the neighbour vertex j contained in the data structure representing the edge e, but not its associated probability j.p 1 and j.d values, necessary for the computation of the probability i.p 2 (Figure 4). Therefore, the global vector in main memory is used just to access this information. We remark that this solution makes our implementation even faster and feasible in our AFP context, since the number of vertices is by far smaller than the number of edges, and thus there is no risk of running out of memory also with offtheshelf computers, even for very large biological networks involving hundreds of thousands or even millions of proteins.
Analyses
We applied our methods based on the local implementation of networkbased algorithms and secondary memorybased computation to the multispecies protein function prediction in bacteria and eukarya. In the remainder of the section we summarize the experimental setup and the characteristics of the data, and then we compare the empirical computational time required by secondary and primary memorybased implementations of network based algorithms for AFP.
Data description and experimental setup
We applied our methods to two multispecies networks of proteins: the first one (Bacterianet, Section “Bacterianet”) accounts 301 species of bacteria, and the second one (Eukaryanet, Section “Eukaryanet”) includes the proteomes of 13 Eukaryotic species.
Bacterianet
We constructed a multispecies bacteria network (Bacterianet), using the proteins proposed as part of a large scale experiment for a recent international challenge aimed at the evaluation of gene function prediction methods (CAFA2: [29]).
CAFA2 bacteria species and their proteins available in Swissprot (May 2013)
ID.  Species  n. proteins 

83333  Escherichia coli  4431 
224308  Bacillus subtilis  4188 
99287  Salmonella typhimurium  1771 
208964  Pseudomonas aeruginosa  1245 
321314  Salmonella enterica choleraesuis  882 
160488  Pseudomonas putida  693 
223283  Pseudomonas syringae  675 
85962  Helicobacter pylori  581 
170187  Streptococcus pneumoniae  502 
243273  Mycoplasma genitalium  483 
Public databases exploited for the construction of protein profiles
Database  Description 

Pfam [30]  Protein domain 
Protein superfamilies [31]  Structural and functional annotations 
PRINTS [32]  Motif fingerprints 
PROSITE [33]  Protein domains and families 
InterPro [34]  Integrated resource of protein families, domains and functional sites 
EggNOG [35]  Evolutionary genealogy of genes: Nonsupervised Orthologous Groups 
SMART [36]  Simple Modular Architecture Research Tool (database annotations) 
Swissprot  Manually curated keywords describing the function of the proteins 
at different degrees of abstraction 
where ${\stackrel{\u0304}{w}}_{\mathit{\text{ij}}}$ is the weight of the integrated network and ${w}_{\mathit{\text{ij}}}^{d}$ represents the weight associated to the edge (i,j) of the d ^{ t h } network (final phase of Figure 5).
Summary of the distribution of the number of positives across the 381 GO BP classes involved in the functional labelling of the 17638 proteins comprised in the bacterial multi species protein network
Min.  1st Qu.  Median  Mean  3rd Qu.  Max. 

21.0  31.0  53.0  135.4  131.0  2000.0 
The performance of the considered methods have been quantified both in terms of area under the receiving operating curve (AUC) and precision at different recall levels in a standard 5 folds stratified cross validation scheme. We compared the execution times required for the completion of each ranking task in primary memory (on a machine with 16 Gb of RAM) and in secondary memory (on two machines with 16 and 4 Gb of RAM). The machine with 16 Gb of RAM has been equipped with an i7 Intel core processor with 8 cores at 2.80 GHz, while the 4 Gb machine with an Intel i7 core processor with 4 cores at 1.90 GHz (both have been equipped with offtheshelf ATA hard disk drives). Both the machines run an Ubuntu 12.04 Linux operating system.
Eukaryanet
Selected species from the core region of the STRING protein networks database
NCBI taxon ID.  Species  n. proteins 

3218  Physcomitrella patens  10352 
3702  Arabidopsis thaliana  23576 
7227  Drosophila melanogaster  12845 
7739  Branchiostoma floridae  16418 
8364  Xenopus (Silurana) tropicalis  13678 
9031  Gallus gallus  13119 
9258  Ornithorhynchus anatinus  13333 
9606  Homo sapiens  20140 
9615  Canis lupus familiaris  16912 
10090  Mus musculus  20023 
13616  Monodelphis domestica  15409 
39947  Oryza sativa Japonica  13330 
69293  Gasterosteus aculeatus  13307 
This network includes proteins coming from 2 invertebrates (a lancelet of the genus Branchiostoma and the fruit fly), 3 plants (Arabidopsis thaliana, the rice Oryza sativa and the moss Physcomitrella patens), and 8 vertebrates including a frog, the chicken and 6 mammals. The total number of proteins in Eukaryanet is 202,442. This basic version of Eukaryanet is obtained by extracting from STRING all the interactions occurring between proteins of the selected core species. This led to an initial collection of 25,132,538 interactions. A simple connected components analysis revealed that this network is composed by 666 connected components of which only 13 composed by more than 30 vertices (and corresponding to the biggest connected components of the networks associated to each species). This “big” network is thus a collection of the protein networks of the selected species. In order to find a way to “connect” the core components, we extracted all the clusters of orthologous genes from the STRING website according to the following steps:

the cluster of orthologs ID obtained by the STRING team using the eggNOG database (we considered only clusters of type NOG: nonsupervised orthologous groups);

the filtering of each NOG cluster in order to remove the proteins not coming from the selected core species. Note that some NOGs are composed by more than 2,000 proteins, but after our filtering procedure each selected NOG is composed by no more than 10 proteins.
After these steps, we selected all the NOGs in which the number of proteins equals the number of species (i.e. NOG composed by 10 proteins coming from 10 species, NOG composed by 9 proteins coming from 9 species, and so on). We finally constructed an enriched version of the basic Eukaryanet network simply by including in Eukaryanet all the possible edges linking the members of the selected set of NOGs. Following this strategy we obtained a network composed by 25,155,631 edges (network density: 0.000613). In order to verify the impact of the additional 23,093 NOGs based edges on the connectivity of Eukaryanet, we repeated the connected components analysis and we found that this “NOGs augmented” version of Eukaryanet is composed by 552 connected components of which two (190,755 nodes (94.22%) and 10,233 (5.05%)) account for more than 99% of the 202,442 proteins composing the network.
As class labels for the proteins included in Eukaryanet we used the GO annotations available in STRING (version 9.05). The STRING website provides flat text files containing a mapping from GO annotations to STRING proteins and a STRING internal confidence score for each GO annotation, ranging from 1 (low confidence) to 5 (high confidence). While extracting the GO labels we considered only the annotations with confidence score 5. We then filtered out all the GO terms associated with less than 20 and more than 100 proteins (473 GO terms). We finally randomly selected from this set 50 GO terms irrespective of their GO division (Molecular function, Biological process and Cellular component). We then repeated all the test performed on Bacterianet on the bigger Eukaryanet network.
Results and discussion
We compared the runtime required by main memory and secondary memorybased implementations (Section “Secondary memorybased computation”) of the RW algorithm described in Section “Local implementation of networkbased algorithms”. Moreover, even if our main aim consists in showing that the combination of local implementation and secondary memorybased computation allows the analysis of big biological networks on small computers, we performed also a comparison of the performance achieved with singlespecies and multispecies networks of bacteria proteins to experimentally assess the impact of a multispecies approach to the prediction of protein functions.
Results with bacterianet
Empirical time complexity of the main and secondary memorybased implementations of network based algorithms for multispecies function prediction with the Bacterianet
16 Gb RAM machine  4 Gb RAM machine  

Algorithm  Main mem.  Neo4j  GraphChi  Main mem.  Neo4j  GraphChi 
RW  1 step  8.11s  27.92s  8.84s  –  208.27s  12.32s 
RW  2 steps  16.05s  54.36s  16.98s  –  408.57s  25.06s 
RW  3 steps  23.95s  81.18s  25.12s  –  621.92s  36.51s 
The results of the mainmemory algorithm with the notebook (4 Gb RAM machine) are not reported since on this task the main memory implementation of the algorithm fails, due to disk trashing, by which processor time is mainly used to continuously swap from main memory and the virtual memory on disk. On the contrary, the GraphChi implementation results only in a small increment of the computational time, mainly due to the larger time required to construct the shards when less RAM memory is available (Section “GraphChi: a diskbased system for graph processing”) and to the lower speed of the processor in the notebook machine.
Note that with the smaller machine the empirical computational time required by Neo4j increases of about one order of magnitude, while the GraphChi implementation introduces only a small increment of the required execution time (Table 5). This is particularly relevant when we consider the overall computational time required to predict the 381 GO terms: with the “small” machine Neo4j moves from about 3 hours to about one day with the 1step RW, and from about 7 hours to almost 3 days with the 3steps RW.
Bacterianet : average AUC, precision at 20% recall (P20R) and precision at 40% recall across 381 GO BP terms estimated through 5fold crossvalidation
Algorithm  AUC  P20R  P40R 

RW  1 step  0.8744  0.2264  0.1673 
RW  2 steps  0.8590  0.1318  0.0893 
RW  3 steps  0.8419  0.1064  0.0713 
Results with Eukaryanet
Eukaryanet : Average perterm empirical time complexity between Neo4j and GraphChi implementations
16 Gb RAM machine  4 Gb RAM machine  

Algorithm  Neo4j  GraphChi  Neo4j  GraphChi 
RW  1 step  189.60s  20.44s  2520.00s  21.46s 
RW  2 steps  367.82s  31.68s  4919.35s  33.19s 
RW  3 steps  549.84s  45.73s  7333.10s  46.69s 
It is worth noting that in this task involving a bigger net, the GrapChi implementation is significantly faster than the Neo4j implementation (Table 7). Moreover, the average computational time is in practice the same when the 4 Gb and the 16 Gb RAM machines run the GrapChi implementation of the RW algorithm, while we observe a relevant increment in computational time with Neo4j, as previously observed also with Bacterianet.
Eukaryanet : average AUC, precision at 20% recall (P20R) and precision at 40% recall across 50 GO terms estimated through 5fold crossvalidation
Algorithm  AUC  P20R  P40R 

RW  1 step  0.8601  0.1449  0.0943 
RW  2 steps  0.9667  0.1329  0.0929 
RW  3 steps  0.9598  0.0927  0.0785 
Experimental comparison between multispecies and singlespecies approaches
Comparison of the average AUC, precision at 20% recall (P20R) and precision at 40% recall between multispecies and singlespecies approaches with 301 species of bacteria
Multispecies approach  

Algorithm  AUC  P20R  P40R 
RW  1 step  0.8744  0.2264  0.1673 
RW  2 steps  0.8590  0.1318  0.0893 
RW  3 steps  0.8419  0.1064  0.0713 
Singlespecies approach  
Algorithm  AUC  P20R  P40R 
RW  1 step  0.8263  0.1801  0.1176 
RW  2 steps  0.8146  0.1059  0.0647 
RW  3 steps  0.8179  0.1009  0.0563 
These results can be explained, considering two characteristics of multispecies networks: 1) the number of nodes and the number of available annotated proteins; 2) the overall topology of the network.
Indeed in singlespecies nets either the reduced number of available proteins or the reduced number of annotated nodes can negatively affect the generalization capabilities achieved with random walks or any other learning algorithm, while in multispecies networks, by construction, more nodes and more annotated proteins from other species can be available.
Moreover in singlespecies networks usually the number of available functional connections (edges) between proteins can be reduced (for instance, since no sufficient data are available) and in many cases we may have highly disconnected networks, making very difficult the application of algorithms based on the propagation of the information between nodes. On the contrary, in the multispecies setting learning algorithms can enjoy a richer network topology by exploring connections not available in singlespecies nets: the evolutionary relationships between species assure that proteins not connected with other proteins of the same species, can in principle be connected with other homologous proteins in other species, thus enhancing the propagation of the information across the multispecies network.
Summarizing, our results show the feasibility of the “vertexcentric” algorithmic approach coupled with secondary memorybased technologies to process large multispecies protein networks with single offtheshelf computers. Moreover, our preliminary experiments show that in perspective we can also improve performances by constructing large multispecies networks, and by integrating heterogeneous sources of biomolecular and evolutionary information.
Conclusions
Our approach based on local implementations of networkbased algorithms and on novel secondary memorybased technologies provides a solution to the large main memory requirements induced by large multispecies protein networks, thus making possible the analysis of big networks using offtheshelf machines. Our results show that both graph DB technologies (i.e. Neo4j) and secondary memory based systems for graph processing (i.e. GraphChi) can be successfully applied to the analysis of large multispecies networks, even if the latter seems to be less sensitive to the amount of available primary memory, and more efficient for the implementation of networkbased algorithms for AFP. The local implementation strategy can be applied to other networkbased learning algorithms, ranging e.g. from simple guiltbyassociation methods (that are inherently local) [41, 42] to more complex label propagation methods [9, 10], kernelized graph algorithms [25, 43, 44] and the recently proposed parametrized Hopfield networks [45], but in principle any algorithm, that can be expressed according to a “vertexcentric” programming model, can be adapted to this framework.
In perspective, by exploiting orthologous genes and multiple genomic sources, multispecies prediction can be applied to annotate poorly annotated species and discover new functions for uncharacterized genes in model organisms. Indeed our proposed approach allows computational biologists to experiment with large multispecies networks using their own notebooks, but in perspective applications to huge networks including e.g. the proteomes available in SwissProt/TrEmbl could be performed using wellequipped standalone machines.
Our framework could be also adapted and devised to other relevant computational biology scenarios characterized by the construction and processing of large networks, such as in the context of the “Network medicine” [46], or in drug discovery and repositioning problems [47].
Availability of supporting data
The files containing the Bacterianet and Eukaryanet along with the files containing the labels used in our experiments are available from GigaDB [48]http://dx.doi.org/10.5524/100090. The content and format of each file is described in readme files available at the aforementioned database.
Endnote
^{a} For experimental annotation we considered all the available associations having GO evidence codes not included in the following list: IEA, ND, IC, NAS, TAS, ISS, ISO, ISA, ISM, IGC, IBA, IBD, IKR, IRD and RCA. A complete list of the GO evidence codes and their meanings is available at http://www.geneontology.org/GO.evidence.shtml.
Notes
Abbreviations
 AFP:

Automated function prediction
 BP:

Biological process
 CC:

Cellular component
 GO:

Gene ontology
 MF:

Molecular function
 RW:

Random walk.
Declarations
Acknowledgements
The authors would like to thank the reviewers for their useful comments and suggestions, and acknowledge partial support from the PRIN project “Automi e linguaggi formali: aspetti matematici e applicativi”, funded by the italian Ministry of University.
Authors’ Affiliations
References
 Friedberg I: Automated protein function predictionthe genomic challenge. Brief Bioinform. 2006, 7: 225242. 10.1093/bib/bbl004.View ArticlePubMedGoogle Scholar
 Gillis J, Pavlidis P: Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA). BMC Bioinformatics. 2013, 14 (Suppl 3): S1510.1186/1471210514S3S15.View ArticlePubMedPubMed CentralGoogle Scholar
 Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, BenHur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, NoksoKoivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B: A largescale evaluation of computational protein function prediction. Nat Methods. 2013, 10 (3): 221227. 10.1038/nmeth.2340.View ArticlePubMedPubMed CentralGoogle Scholar
 Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, Troyanskaya OG: IMP: a multispecies functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 2012, 40 (W1): W484—W490PubMedPubMed CentralGoogle Scholar
 Kuzniar A, van Ham RC, Pongor S, Leunissen JA: The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 2008, 24 (11): 539551. 10.1016/j.tig.2008.08.009.View ArticlePubMedGoogle Scholar
 Koonin EV: Orthologs, paralogs, and evolutionary genomics 1. Annu Rev Genet. 2005, 39: 309338. 10.1146/annurev.genet.39.073003.114725.View ArticlePubMedGoogle Scholar
 Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Rost B: Homologybased inference sets the bar high for protein function prediction. BMC Bioinformatics. 2013, 14 (Suppl 3): S710.1186/1471210514S3S7.View ArticlePubMedPubMed CentralGoogle Scholar
 Lovasz L: Random walks on graphs: a survey. Combinatorics, Paul Erdos is Eighty. 1993, 2: 146.Google Scholar
 Zhou D, Bousquet O, Lal NT, Weston J, Schölkopf B: Learning with local and global consistency. Advances in Neural Information Processing Systems 16. 2004, Cambridge: MIT Press, 321328.Google Scholar
 Bengio Y, Delalleau O, Le Roux N: Label propagation and quadratic Criterion. SemiSupervised Learning. Edited by: Zien A, Schölkopf B, Chapelle O. 2006, Cambridge: MIT Press, 193216.Google Scholar
 Liu W, Wang J, Chang SF: Robust and scalable graphbased Semisupervised learning. Proc IEEE. 2012, 100 (9): 26242638.View ArticleGoogle Scholar
 Foster J: Designing and Building Parallel Programs. 1995, Boston: Addison WesleyGoogle Scholar
 Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C: PowerGraph: Distributed graphparallel computation on natural graphs. OSDI’12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. 2012, Hollywood, CA: USENIX Association Berkeley, 1730.Google Scholar
 Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM: GraphLab: a new parallel framework for machine learning. Conference on Uncertainty in Artificial Intelligence (UAI). 2010, Catalina Island: AUAI PressGoogle Scholar
 Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G: Pregel: a system for largescale graph processing. Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010. 2010, Indianapolis, Indiana, USA, New York: ACM Press, 135146.View ArticleGoogle Scholar
 Kyrola A, Blelloch G, Guestrin C: GraphChi: largescale graph computation on just a PC. Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. 2012, CA, USA: Hollywood, CA, USA, OSDI’12: USENIX Association Berkeley, 3146.Google Scholar
 Webber J: A programmatic introduction to Neo4j. Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity. 2012, Tucson: ACM, 217218.Google Scholar
 Han WS, Lee S, Park K, Lee JH, Kim MS, Kim J, Yu H: TurboGraph: a fast parallel graph engine handling billionscale graphs in a single PC. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013, New York: ACM, 7785.View ArticleGoogle Scholar
 Robinson I, Webber J: Eifrem E: Graph Databases. 2013Google Scholar
 Karedla R, Love J, Wherry B: Caching strategies to improve disk system performance. Computer. 1994, 27: 3846.View ArticleGoogle Scholar
 Boldi P, Vigna S: The WebGraph framework I: compression techniques. In Proc. of the Thirteenth International World Wide Web Conference. 2003, New York: ACM Press, 595601.Google Scholar
 Have C: Jensen L: Are graph databases ready for bioinformatics?. Bioinformatics. 2013, 29 (24): 310710.1093/bioinformatics/btt549.View ArticlePubMedPubMed CentralGoogle Scholar
 Mostafavi S, Ray D, WardeFarley D, Grouios C, Morris Q: GeneMANIA: a realtime multiple association network integration algorithm for predicting gene function. Genome Biol. 2008, 9 (S4):
 Kohler S, Bauer S, Horn D, Robinson P: Walking the Interactome for prioritization of candiyear disease genes. Am J Human Genet. 2008, 82 (4): 948958.View ArticleGoogle Scholar
 Re M, Mesiti M, Valentini G: A fast ranking algorithm for predicting gene functions in biomolecular networks. IEEE ACM Trans Comput Biol Bioinform. 2012, 9 (6): 18121818.View ArticleGoogle Scholar
 Malewicz G, Austern M, Bik AJ, Dehnert J, Horn I, Leiser N, Czajkowski G: Pregel: a system for largescale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10. 2010, Indianapolis, Indiana, USA. New York: ACM Press, 135146.View ArticleGoogle Scholar
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin JM, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 2510.1038/75556.View ArticlePubMedPubMed CentralGoogle Scholar
 Angles R, Gutierrez C: Survey of graph database models. ACM Comput Surv. 2008, 40 (1): Article 1View ArticleGoogle Scholar
 Friedberg I, Linial M, Mooney S, Radivojac P: Critical assessment of function annotation experiment. 2013,http://biofunctionprediction.org,Google Scholar
 Finn RD, Mistry J, SchusterBöckler B, GriffithsJones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer ELL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34 (suppl 1): D247—D251PubMedGoogle Scholar
 Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001, 313 (4): 903919. 10.1006/jmbi.2001.5080.View ArticlePubMedGoogle Scholar
 Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, Uddin A, Zygouri C: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 2003, 31: 400402. 10.1093/nar/gkg030.View ArticlePubMedPubMed CentralGoogle Scholar
 Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, LangendijkGenevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res. 2006, 34 (suppl 1): D227—D230PubMedGoogle Scholar
 Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, LangendijkGenevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J: New developments in the InterPro database. Nucleic Acids Res. 2007, 35 (suppl 1): D224—D228PubMedPubMed CentralGoogle Scholar
 Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, Von Mering C, Doerks T, Jensen LJ, Bork P: eggNOG v2. 0: extending the evolutionary genealogy of genes with enhanced nonsupervised orthologous groups, species and functional annotations. Nucleic Acids Res. 2010, 38 (suppl 1): D190—D195PubMedGoogle Scholar
 Letunic I, Copley RR, Pils B, Pinkert S, Schultz J, Bork P: SMART 5: domains in the context of genomes and networks. Nucleic Acids Res. 2006, 34 (suppl 1): D257—D260PubMedGoogle Scholar
 Re M, Valentini G: Networkbased drug ranking and repositioning with respect to DrugBank therapeutic categories. IEEE/ACM Trans Comput Biol Bioinform. 2013, 10 (6): 13591371.View ArticlePubMedGoogle Scholar
 STRING database.http://stringdb.org,
 Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, Von Mering C, Jensen LJ: STRING v9. 1: proteinprotein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41 (D1): D808—D815View ArticlePubMedGoogle Scholar
 Von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Krüger B, Snel B, Bork P: STRING 7 recent developments in the integration and prediction of protein interactions. Nucleic Acids Res. 2007, 35 (suppl 1): D358—D362PubMedGoogle Scholar
 Marcotte E, Pellegrini M, Thompson M, Yeates T, Eisenberg D: A combined algorithm for genomewide prediction of protein function. Nature. 1999, 402: 8386. 10.1038/47048.View ArticlePubMedGoogle Scholar
 Bumgarner McDermottRJand, Samudrala R: Functional annotation from predicted protein interaction networks. Bioinformatics. 2005, 21 (15): 32173226. 10.1093/bioinformatics/bti514.View ArticlePubMedGoogle Scholar
 Lippert G, Ghahramani Z, Borgwardt K: Gene function prediction form synthetic leathality networks via ranking on demand. Bioinformatics. 2010, 26 (7): 912918. 10.1093/bioinformatics/btq053.View ArticlePubMedGoogle Scholar
 Re M, Valentini G: Cancer module genes ranking using kernelized score functions. BMC Bioinformatics. 2012, 13 (S14): S3View ArticlePubMedPubMed CentralGoogle Scholar
 Frasca M, Bertoni A, Re M, Valentini G: A neural network algorithm for semisupervised node label learning from unbalanced data. Neural Netw. 2013, 43: 8498.View ArticlePubMedGoogle Scholar
 Barabasi A, Gulbahce N, Loscalzo J: Network medicine: a networkbased approach to human disease. Nat Rev Genet. 2011, 12: 5668. 10.1038/nrg2918.View ArticlePubMedPubMed CentralGoogle Scholar
 Dudley J, Desphonde T, Butte A: Exploiting Drugdisease relationships for computational drug repositioning. Brief Bioinform. 2011, 12 (4): 303311. 10.1093/bib/bbr013.View ArticlePubMedPubMed CentralGoogle Scholar
 Mesiti M, Re M, Valentini G: Supporting materials from ‘Think globally and solve locally: secondary memorybased network learning for automated multispecies function prediction’ GigaScience Database. 2014,http://dx.doi.org/10.5524/100090,Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.