Table 2 Timings in seconds for the different pipeline stages when running Crossbow on an HPC node (16 CPU cores), the Hadoop I cluster (eight nodes, 56 CPU cores), and the Hadoop II cluster (eight nodes, 112 CPU cores) for datasets S1-S9

From: A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data

Stage           Platform                                S1    S2    S3    S4    S5    S6    S7    S8    S9
ingest to HDFS  Hadoop I,II                            106   236   472   606   862   974  1018  1244  1384
conversion      gz split (HPC)                         466   782  1094  1406  1728  2052  2390  2774  3090
                gz to bz2 conversion (Hadoop I,II)     211   423   633   842  1056  1264  1473  1685  1911
preprocess      HPC                                    406   630  1002  1235  1469  1810  2043  2283  2660
                Hadoop I                               560   891  1172  1672  1937  2271  2665  3011  3396
                Hadoop II                              537   685   892  1179  1414  1641  2091  2334  2613
map             HPC                                   1434  2857  4281  5775  7216  8627 10088 11432 13028
                Hadoop I                               707  1385  2060  3331  3398  4163  4761  5630  6276
                Hadoop II                              511   981  1459  2636  3023  3194  3361  4553  4766
                Hadoop II*                             486   955  1422  1882  2336  2812  3310  3771  4305
SNP call        HPC                                   1045  1698  2621  3553 10989 18993 16890 20785 21948
                Hadoop I                               666   994  1127  1423  1906  2287  2765  2982  3444
                Hadoop II                              661   965  1344  1364  1830  2450  2765  3029  3471
total time      HPC                                   3351  5967  8998 11968 21402 31554 31411 37274 40726
                Hadoop I                              2250  3929  5464  7848  9159 10959 12682 14558 16436
                Hadoop II                             2026  3289  4601  6607  7719  8903 10393 12845 14145
  1. The ‘Hadoop II*’ values were obtained by multiplying the average time per mapping job by the number of successful Hadoop mapping jobs, omitting the failed jobs. The errors are not shown.
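
The calculation described in the footnote can be sketched as follows. This is only an illustration under stated assumptions: the record format (duration in seconds plus a success flag), the function name `failure_free_map_time`, and the sample values are all hypothetical and do not come from the article, and the sketch assumes the average is taken over the successful jobs only.

```python
from statistics import mean


def failure_free_map_time(job_records):
    """Estimate map-stage time with failed jobs omitted.

    job_records: list of (duration_seconds, succeeded) tuples.
    This record format is a hypothetical one chosen for illustration.
    """
    # Keep only the durations of jobs that finished successfully.
    successful = [duration for duration, ok in job_records if ok]
    if not successful:
        raise ValueError("no successful mapping jobs")
    # Average time per successful job times the number of successful jobs.
    # Under this reading the product equals the sum of the successful durations.
    return mean(successful) * len(successful)


# Hypothetical job log; these numbers are NOT data from the table.
jobs = [(120.0, True), (130.0, True), (300.0, False), (110.0, True)]
estimate = failure_free_map_time(jobs)
print(f"estimated failure-free map time: {estimate:.0f} s")
```

If the average in the footnote was instead taken over a different set of jobs, the two factors would have to be supplied separately rather than derived from one list.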