In recent years many efforts have been devoted to build automated tools for large scale automated function prediction of proteins (AFP) exploiting the knowledge generated by high throughput biotechnologies [1, 2]. As highlighted by a recent international challenge for the critical assessment of automated function prediction , scalability and heterogeneity of the available data represent two of the main challenges posed by AFP. Indeed on the one hand no single experimental method can fully characterize the multiplicity of the protein functions, and on the other hand the huge amount of data to be processed poses serious computational problems. The complexity of the problem is furthermore exacerbated by the different level of the functional annotation coverage in different organisms, thus making very difficult the effective transfer of the available functional knowledge from one organism to another.
Computational automated function prediction approaches can be useful for the integration of diverse types of data coming from multiple, often unrelated, proteomic and genomic pipelines. A recent example is represented by the Integrative multi-species prediction (IMP) web server  which integrates prior knowledge and data collections from multiple organisms for the generation of novel functional working hypotheses used in experimental follow-up. Despite its undoubted usefulness, IMP actually covers only seven model organisms, preventing its application to the prediction of the functions of proteins belonging to the proteomes of poorly annotated organisms.
Another popular approach for gene functional annotation transfer between species relies on the availability of a collection of orthology relationships across interspecies proteins, and on the usage of an evolutionary relationships network as a suitable medium for transferring functional annotations to the proteins of poorly annotated organisms . Even if orthology is an evolutionary concept, rather than a functional one, it can be used to link functionally equivalent genes across genomes and enables the functional inference of unknown proteins using one or more functionally characterized orthologs in other species [6, 7].
As noticed in , the accuracy of machine-learning algorithms for AFP tasks is negatively affected by the sparse coverage of experimental data and by the limited availability of prior functional knowledge. Consequently, these methods are often applied only to biological processes and pathways that are already well characterized for an organism. The construction of large scale multi species networks can be a solution to this problem. Following this approach, network based learning algorithms might benefit of the availability of a priori functional knowledge coming from well annotated species to effectively perform a functional transfer to the proteins of poorly annotated organisms.
Unfortunately this solution is only apparently simple, since the application of classical graph-based algorithms such as the ones based on random walks  or label propagation methods [9, 10] are often unfeasible with large multi-species networks, especially when only single off-the-shelf machines are available. These approaches, indeed, usually rely on an in-memory adjacency matrix representation of the graph network, scale poorly with the size of the graph , and may have time complexity that becomes quickly prohibitive. Performance optimization is usually realized by adopting an adjacency-list representation of the graph to take its sparsity into account, or by using parallel strategies for matrix multiplication . However, when the size of the graph becomes so high that is not possible to maintain it entirely in primary memory, either approaches based on parallel distributed computation [13–15], or secondary memory-based computation [16–18] can be considered. With distributed computation techniques, the graph is spread on different machines and the results are finally collected. However, as outlined in , a key issue of these approaches is the need to identify a cut of the graph in order to minimize the communication overhead among machines and their synchronization activities. With secondary memory-based computation, the graph is stored on the disk of a single machine and only limited portions of the graph are loaded in primary memory for computation. In this way, it is possible to overcome the lack of enough primary memory. The use of smart strategies for caching the portions of graph needed for computation , the minimization of the number of accesses to secondary memory , and the usage of compressed data structures for maintaining the graph in primary memory  are the main challenges for making the management of large graph networks in off-the-shelf machines comparable to distributed approaches.
In this work we propose a novel framework for scalable semi-supervised network-based learning of multi-species protein functions: on the one hand we adopt a “local learning strategy” to implement classical graph-based algorithms for protein function prediction, and on the other hand we apply secondary memory-based technologies to exploit the large disks available in ordinary off-the-shelf computers. The combination of these algorithmic and technological approaches makes feasible the analysis of large multi-species networks in ordinary computers with limited speed and primary memory and in perspective could enable the analysis of huge networks (e.g. the whole proteomes available in SwissProt), using well-equipped stand-alone machines.
Only very recently a paper has been devoted to the application of graph database technologies in bioinformatics , and to our knowledge this is the first work where secondary-memory based network analysis has been applied to multi-species function prediction using big biological networks with hundreds of thousands of proteins.
This paper is organized as follows. In the next section we introduce our proposed approach based on the local implementation of network-based algorithms and secondary memory-based computation for the multi-species AFP problem. In particular we discuss the characteristics of Neo4j, a database technology for graph querying and processing, and GraphChi, a disk-based system for graph processing. Then, we show their application to a multi-species network involving proteins of about 300 bacteria species, and to a network including 13 species of Eukaryotes with more than 200.000 proteins, using off-the-shelf notebook and desktop computers.