Figure 3 | GigaScience

Figure 3

From: A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data

Timing histograms for the stages of the Hadoop-based Crossbow pipeline: preprocessing (A)-(C), mapping (D) and SNP calling (E)-(F), for the Hadoop II cluster, Dataset S9, over several independent runs. The double maxima on some plots are due to the round-robin-like fashion in which Hadoop distributes jobs among the worker nodes: the full-size jobs are scheduled first, while the remaining data chunks form smaller peaks of shorter duration. A stage tends to scale better when the double peaks, if any, are closer together. From this perspective, the reason for the sub-linear scaling of the Crossbow SNP-calling stage becomes clear: the majority of jobs in the reducing phase (F) are very short, on the order of tens of seconds, while there are several much longer jobs.
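
The scaling argument in the caption can be illustrated with a toy model. The sketch below is not Hadoop's scheduler and uses no measured data: the job durations and the `makespan` helper are hypothetical, chosen only to mimic a mix of many short jobs and a few much longer ones. It assigns each job to whichever worker becomes free first and reports how the total wall-clock time changes as workers are added.

```python
import heapq


def makespan(durations, workers):
    """Toy greedy scheduling: each job goes to the worker that frees up first.

    Returns the wall-clock time at which the last worker finishes.
    This is an illustrative model, not Hadoop's actual scheduling logic.
    """
    free_at = [0.0] * workers          # time at which each worker becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Hypothetical durations in seconds: a few long jobs plus many short ones.
jobs = [600.0] * 4 + [20.0] * 200
serial_time = sum(jobs)

for workers in (1, 4, 16, 64):
    t = makespan(jobs, workers)
    print(f"workers={workers:3d}  makespan={t:7.1f} s  speedup={serial_time / t:5.2f}")
```

In this toy setting the speedup stops improving once the longest job alone determines the makespan, which is the kind of sub-linear behaviour the caption attributes to the reducing phase (F).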
