
Table 3 Timings for mapping and the ratio \(T_{\textit {alignment}}/T_{\textit {comm}}\) for the HPC and Hadoop I clusters for Dataset S as a function of the number of nodes involved

From: A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data

Hadoop I

Number of nodes (cores)   Mapping time \(T_{\textit {alignment}}\), minutes   \(\frac {T_{\textit {alignment}}}{T_{\textit {comm}}}\)
4 (28)                    293.5                                               1.71
6 (42)                    189.8                                               1.62
8 (56)                    136.0                                               1.62
16 (112)                   70.3                                               1.48
32 (224)                   39.3                                               1.66
40 (280)                   32.5                                               1.65

HPC random

Number of nodes (cores)   Mapping time \(T_{\textit {alignment}}\), minutes   \(\frac {T_{\textit {alignment}}}{T_{\textit {comm}}}\)
4 (64)                     74.4                                               3.89
10 (160)                   32.4                                               3.76
14 (224)                   22.7                                               3.77
18 (288)                   17.9                                               3.78
22 (352)                   14.5                                               3.79
26 (416)                   12.3                                               3.77
30 (480)                   10.7                                               3.73
34 (544)                    9.5                                               3.45
38 (608)                    8.5                                               3.16
42 (672)                    7.6                                               2.96
46 (736)                    7.0                                               2.55
50 (800)                    6.4                                               2.65
54 (864)                    5.9                                               2.34
58 (928)                    5.5                                               2.12
  1. For the ‘HPC random’ approach, data chunks first have to be copied to the local node disks, and the alignments (SAM files) are copied back, whereas Hadoop keeps all of the data inside HDFS and, hence, does not need data staging. However, Hadoop must ingest the data into HDFS and preprocess the reads before the actual mapping stage in order to operate in a MapReduce (MR) manner, resulting in what we term ‘communication costs’. Note that each HPC node has 16 cores, while each Hadoop node has seven cores (the eighth core is dedicated to running the virtual machine).
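Since the table reports only the mapping time \(T_{\textit {alignment}}\) and the ratio \(T_{\textit {alignment}}/T_{\textit {comm}}\), the communication cost itself can be recovered by division. A minimal sketch (not part of the paper's pipeline; row values are taken from Table 3 above):

```python
# Recover the communication time T_comm (data staging / HDFS ingest and
# read preprocessing) from the reported mapping time T_alignment and the
# ratio T_alignment / T_comm, for a few selected rows of Table 3.

rows = [
    # (cluster, nodes, cores, T_alignment in minutes, T_alignment / T_comm)
    ("Hadoop I",    4,  28, 293.5, 1.71),
    ("Hadoop I",   40, 280,  32.5, 1.65),
    ("HPC random",  4,  64,  74.4, 3.89),
    ("HPC random", 58, 928,   5.5, 2.12),
]

for cluster, nodes, cores, t_align, ratio in rows:
    t_comm = t_align / ratio  # minutes spent outside the mapping stage proper
    print(f"{cluster}: {nodes} nodes ({cores} cores) -> T_comm ~ {t_comm:.1f} min")
```

The shrinking ratio at high node counts for ‘HPC random’ (3.89 down to 2.12) shows the communication costs scaling worse than the alignment itself as more nodes are added.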