Molecular biology has in recent years seen an immense growth in experimental data, with next-generation sequencing (NGS) perhaps the largest contributor. With constantly increasing throughput, these technologies have transformed molecular biology into a data-intensive field that presents new challenges in storing and analyzing the huge volumes of data generated [1, 2]. As biological sequencing continues to grow exponentially, bioinformatics has emerged as a key discipline for managing and analyzing these data.
The computational power of desktop computers is insufficient for the analysis of today’s biological data sets, and scientists are dependent on high-performance computing (HPC) and large-scale storage infrastructures [4, 5]. As the price per sequenced base is decreasing faster than computers are increasing in computational power, it is not possible to simply wait for faster computers to resolve the situation. Bioinformatics tools for processing and analyzing NGS data are relatively new, and in many cases not well adapted for HPC. There are many specialized tools for different tasks, creating the need for frameworks that integrate such tools into easy-to-use pipelines [7–12].
Apart from computing power and software tools, a big challenge in molecular biology is how to store the generated data. Scientists are reluctant to discard raw data since improved algorithms may help extract further information from them in the near future. The steps of NGS analysis also generate large temporary files, and it is not uncommon for projects to require 5-10 times as much storage during the analysis phase as required by the initial raw data itself. With multiple compute nodes, as is common in HPC, comes the need to share data between the nodes, which adds further complexity. In addition, many journals require the final datasets to be made publicly available in order for manuscripts to be published [14, 15]. Long-term archiving of large amounts of data is not a trivial task, and it is evident that the NGS community is facing a storage problem.
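To make the storage estimate above concrete, the following is a minimal sketch of the arithmetic, not a description of any tool used at UPPNEX. The function name `peak_storage_range_tb`, its parameters, and the 2 TB example are hypothetical; it also assumes that the 5-10 times factor applies to the total storage needed during analysis relative to the raw data size.

```python
def peak_storage_range_tb(raw_tb, low_factor=5.0, high_factor=10.0):
    """Estimate the storage range (in TB) needed during the analysis phase.

    Assumption: analysis-phase storage is a simple multiple of the raw
    data size, using the 5-10x range quoted in the text as default factors.
    """
    if raw_tb < 0:
        raise ValueError("raw_tb must be non-negative")
    return raw_tb * low_factor, raw_tb * high_factor


# Hypothetical example: a project that starts with 2 TB of raw reads.
low, high = peak_storage_range_tb(2.0)
print(f"2 TB of raw data -> roughly {low:.0f} to {high:.0f} TB during analysis")
```

Under these assumptions, a project with 2 TB of raw reads would need on the order of 10 to 20 TB while the analysis is running, which illustrates why storage has to be planned for the analysis peak rather than for the size of the delivered data.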
A researcher who wants to use NGS technologies needs extensive IT and bioinformatics expertise, or access to specialists with these skills, as well as access to a high-performance infrastructure for analyzing and storing the data. However, the IT expertise needed to provide these solutions is not usually available to the average biology research group, which must then either recruit an expert or outsource the work.
In this paper we present a Swedish infrastructure, UPPNEX, aimed at meeting these challenges by providing a high-performance cluster and storage system equipped with an actively maintained bioinformatics software suite, as well as application experts to assist with bioinformatics analysis.
Next-generation sequencing in Sweden
Sweden has a long tradition in biological sciences, such as gene sequencing and methods development, and in recent years an active NGS community has emerged. Initially, several small sequencing platforms were formed around the larger universities of Sweden to serve nearby researchers. In 2010, Science for Life Laboratory (SciLifeLab) was founded as a cooperation between four universities in the Stockholm-Uppsala region of Sweden: Karolinska Institutet, Royal Institute of Technology, Stockholm University and Uppsala University. This initiative included large investments in NGS technologies and in the national sequencing platforms within SciLifeLab, which today consist of eight Illumina machines (HiSeq2000, HiSeq2500, MiSeq), ten from Applied Biosystems (Solid 5500xl, Solid 5500xl Wildfire, Ion Torrent, Ion Proton) and three from 454 Life Sciences (Roche) (GS-FLX). Apart from the 21 instruments owned by SciLifeLab, there are at least five other instruments available at the larger universities of Sweden. In addition to performing the actual sequencing, SciLifeLab also assists with data analysis and interpretation, either as a collaborative project or as fee-for-service. SciLifeLab bioinformaticians typically run the data through a standardized pipeline that carries out the most common preparatory steps of NGS analysis, such as cleaning up the data and aligning short reads to a reference genome. After this initial step, the researchers are free to continue with any custom pipelines or analyses, based on the prepared data.
In the early days of sequencing in Sweden, data and results were generally delivered to clients on external hard disks. This was cumbersome, and it became impractical as projects increased in both size and number. There was a clear need for an infrastructure that could deal with large quantities of data and provide HPC resources for analysis, tightly coupled with the data storage.
In order to tackle these growing challenges, a national resource for NGS analysis, “UPPMAX cluster and storage for next-generation sequencing” (UPPNEX), was established, enabled by a strategic grant in 2008 from the Knut and Alice Wallenberg Foundation (KAW) together with the Swedish National Infrastructure for Computing (SNIC). Formally, UPPNEX is a project at Uppsala Multidisciplinary Center for Advanced Computational Science (SNIC-UPPMAX), which is one of six SNIC centers in Sweden and Uppsala University’s resource for HPC, large-scale storage and related know-how. The objective of UPPNEX is to provide computing and storage resources for the NGS community of Sweden, together with an infrastructure of software, tools and user support. The services are provided free of charge for Swedish academia, and resources are allocated to projects on the basis of estimated requirements, with sequencing platforms having a higher priority. The sequencing platforms within SciLifeLab deliver data to UPPNEX projects, and over the last few years many prominent research projects involving NGS have been performed with UPPNEX resources [23–30]. Below, we describe the implementation of the infrastructure, outline the architectural choices and strategic decisions made when implementing UPPNEX, and follow up with current activities and lessons learned.