GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists
© Kaplinski et al. 2015
Received: 17 April 2015
Accepted: 11 November 2015
Published: 3 December 2015
K-mer-based methods of genome analysis have attracted great interest because they do not require genome assembly and can be performed directly on sequencing reads. Many analysis tasks require one to compare k-mer lists from different sequences to find words that are either unique to a specific sequence or common to many sequences. However, no stand-alone k-mer analysis tool currently allows one to perform these algebraic set operations.
We have developed the GenomeTester4 toolkit, which contains a novel tool GListCompare for performing union, intersection and complement (difference) set operations on k-mer lists. We provide examples of how these general operations can be combined to solve a variety of biological analysis tasks.
GenomeTester4 can be used to simplify k-mer list manipulation for many biological analysis tasks.
KeywordsK-mers Sequence analysis Next-generation sequencing
Because of the rapid uptake and progress of next-generation sequencing techniques, both the time and cost required to sequence full or partial genomes has decreased dramatically. On the other hand, these new technologies have introduced new bioinformatic problems resulting from short read lengths, a large number of sequencing errors, and a huge amount of data that must be processed. Analysis of genomic data often requires either de novo assembly of the genome, mapping of the data to a reference genome, or homology searches from raw reads, all of which are time-consuming and can introduce additional errors.
In recent years, oligomer-frequency-based methods of genome analysis have attracted great interest because they do not require genome assembly and can be performed directly on sequencing reads [1, 2]. These methods have the potential to be both faster and less error-prone than traditional methods, yet have also proven to be useful for correcting sequencing errors during the initial step of mapping and assembly pipelines  and to detect overlapping reads from sequencing datasets . Oligomer-frequency-based methods should now be considered general tools for genomic analysis.
Oligomer frequency analysis is typically conducted by k-mers (oligomers of length k). The first step involves counting k-mers from raw sequencing reads or assembled sequences and is performed in an analogous manner for all subsequent k-mer analysis methods. Several k-mer counting programs have been developed in recent years, both as part of assembly tools or as separate programs. One of the fastest and most widely used k-mer counting tools is Jellyfish , which runs on several parallel CPU threads and operates on a lock-free hash table that eliminates waiting for concurrent data access from different threads. In addition to hashing, k-mer counters can also use more complex data structures that facilitate optimal counting for specific cases. For example, Tallymer  uses a suffix array and specializes in counting k-mers from large eukaryotic genomes with many repeated sequences. KMC2  and DSK  can run on computers with limited memory by writing k-mers into several small temporary tables that are combined onto disk storage. Turtle  uses a combination of Bloom filter and sort-and-compact to reduce cache misses while operating on large datasets.
The second step - k-mer analysis - is task-specific. Analysis of k-mer frequencies has been used to estimate the size of the genome , to detect de novo repeats , to measure gene expression , to find similar reads from different metagenomic samples  and to identify bacteria from sequencing reads using k-mer distributions or specific marker sequences . GSMer  can be used to find taxon-specific barcodes using k-mer counting, whereas Kraken  classifies bacteria by first adding phylogenetic information to every k-mer in its database and then estimating the origin of the bacteria by its k-mer content. The Khmer package implements many common k-mer operations for short-read sequencing .
Many k-mer analysis tasks would benefit from the use of mathematical set operations to find sub- and supersets of k-mers from different nucleotide sequences. Here we present the GenomeTester4 toolkit that aids in both the creation and the modification of k-mer lists. GenomeTester4 can generate lists of k-mer counts from nucleotide sequences and perform basic algebraic set operations - union, intersection and difference (complement) - on these lists. Currently there is no other public toolkit that provides this functionality. We describe the toolkit and provide examples of how to apply this functionality to perform specific biological analyses.
Overview and working principle of the GenomeTester4 toolkit
K-mer list format
The central data structure of all three programs is a binary k-mer list that consists of a small header and an array of k-mers together with their counts. K-mers are encoded as 64-bit unsigned integers with two bits representing each nucleotide. Counts are 32-bit unsigned integers. To reduce memory usage, the list contains only the canonical form of each k-mer. This means that one data entry represents both a k-mer and its reverse complement and is stored using the smaller of the two possible integer encodings. The maximum length of k-mers is thus 32 nucleotides and the maximum possible total count of any given k-mer and its reverse complement is 232 - 1. The full specification of the k-mer list format is given in Additional file 1.
GListMaker computes k-mer counts from nucleotide sequences stored in FASTA or FASTQ format. It uses temporary arrays to collect all k-mers from the input file during the reading phase. The arrays are then sorted and the adjacent instances of the same k-mer are counted during the collation phase. Multiple CPU threads can be used if there is more than one file of input sequences.
The union of two lists contains all entries that were present in either of the original lists, intersection contains only the entries that were present in both lists, and complement provides the entries that were present in the first but not in the second list. The precise definition of what counts as the presence of k-mer and what will be the final count in the output list can be specified by rules. By default, applying a union operation results in a list in which the final k-mer count is the sum of the counts from both initial lists, whereas calculating the intersection results in a default list where the smaller of the two counts is reported. Calculating the complement provides the count from the first list by default (Fig. 2).
GListQuery is used to query the counts of k-mers from k-mer lists. The input can be single k-mer sequences from the command line, lists of k-mers supplied in a text file, or other k-mer lists generated with GListMaker or GListCompare. Input k-mers can also be read from FASTA/FASTQ files with a sliding window. K-mers are looked up by binary search, resulting in O(logN) complexity (the working time is proportional to the logarithm of input list size).
GListQuery can also search for k-mers that differ from the query k-mer by up to a user-supplied number of mismatches. To accomplish this, it generates all possible mismatched k-mer versions of each query sequence and searches for each of them from the input list.
GListCompare and GListQuery do not read lists directly into RAM but instead memory-map these using the Linux mmap system function. This guarantees that only those parts of each file that are being used will be read from disk to RAM. Parts that are never used will not be read and those that were not used recently will be silently cleaned from memory without virtual memory swapping. This allows the system to use lists larger than the available amount of physical memory, albeit with a performance penalty if many look-ups are performed sequentially.
GListMaker uses a pool of CPU worker threads to speed up list generation. Each input file is processed in parallel if enough worker threads are available. The sorting and merging of temporary tables into the final k-mer list is also performed using separate worker threads.
The rules of set operations
To perform the basic set-algebraic operations of union, intersection and difference on k-mer lists, we made additional design decisions. The semantics of standard mathematical set operations are defined only by the presence or absence of specific elements in the set. In case of k-mer lists we have richer structure, as for each k-mer there is also the count of its occurrences.
To accommodate the occurrence count within k-mer lists, we extended these operations by defining the semantics of presence or absence together with a method to calculate the final count. For all operations, the k-mer is counted as being present in one of the input lists only if its count is equal to or above a minimum cutoff value. After determining the presence in the two individual lists, the standard set operation is then applied. If by the result of the operation the k-mer should be included in the output set, its final count is calculated by predetermined rule and the k-mer is written to the output list.
1 - set count value to 1
2 - set count value to 2
add - add up both counts
subtract - subtract the second count from the first
min - smaller of the counts
max - larger of the counts
first - count in the first list
second - count in the second list
To avoid the possibility of zero valued counts in the output list, union and difference operations implement a subset of these rules.
Data and resources used in examples
All bacterial genomes used were downloaded from the RefSeq microbial genomes ftp site  and were current as of 18 June 2014. Plasmids and sequences smaller than 0.5 Mbp (million base pairs) were excluded from the dataset.
Human genome build 37.1 was downloaded from the RefSeq genomes ftp site .
S. aureus and human chromosome 14 sequencing read datasets were downloaded from GAGE project .
Bos taurus genome build 6.1 were downloaded from RefSeq genomes ftp site .
All software performance tests were conducted on a CentOS 5.10 Linux server with 32 cores (2.27 GHz) and 512 GiB (gibibyte, 230 bytes) RAM.
By design, GListMaker rapidly processes genomic sequences that contain few repeats because it performs counting by sorting and all data is stored in a single list structure. However, it is not well optimized for counting large datasets of next-generation sequencing reads, which contain many repeated k-mers. Because all k-mers from a single input file are normally read into memory during the first step, very large input sequences may easily consume all available memory even if the number of unique k-mers in the input is small. This can be mitigated by using command-line arguments that specify the size of in-memory tables. Each k-mer requires 8 bytes of memory (single 64 bit integer) in the initial reading phase and each unique k-mer requires 12 bytes (one 64-bit and one 32-bit integer) during the collation phase. As GListMaker does not implement Bloom filter and keeps all k-mers are in memory, singleton sequencing errors consume both memory and disk space. For a more detailed analysis of memory usage, please refer to Additional file 2.
Comparison of the 32-mer counting speeds of GenomeTester4, Jellyfish 2.2.0, KMC 2.2 and DSK 2.0.7 with a single thread and 24 threads
E. coli K-12 strain MG1655 genome
H. sapiens genome
S. aureus sequencing reads (GAGE library 1)
H. sapiens chromosome 14 sequencing reads (GAGE library 1)
1 file, 4.7 Mbp
24 files, 3.0 Gbp
2 files, 86 Mbp
2 files, 2.55 Gbp
Comparison of the peak memory consumption of GenomeTester4, Jellyfish 2.2.0, KMC 2.2 and DSK 2.0.7 while counting 32-mers with 24 threads
H. sapiens genome
H. sapiens chromosome 14 sequencing reads (GAGE library 1)
24 files, 3.0 Gbp
2 files, 2.55 Gbp
The speed of GListQuery depends on the size of the list file being searched. Using a list of 32-mers from a single bacterial genome we obtained a speed of up to 35 million look-ups per minute. Using a 64 GiB union of all NCBI bacterial genomes, the look-up speed was roughly 2.7 million 32-mers per minute.
Because the list files are memory-mapped, more parts of the file will be read into RAM as the number of query sequences increases, which also speeds up subsequent look-ups. This guarantees a near-immediate start-up for the case where only a few look-ups are required because the entire file does not need to be read from disk, while also providing near-optimal speeds for a large number of look-ups.
We have found that many k-mer analysis tasks require one to combine or partition k-mer lists into sub- and supersets of k-mers. By implementing optimized set-algebraic operations within a single tool, GenomeTester4 allows researchers to significantly simplify k-mer analysis pipelines. Still, because most research questions cannot be implemented using only simple set operations on k-mer lists, we expect that both the initial and final steps of analysis will be performed using custom task-specific tools.
Example 1: Counting k-mers in a large set of sequences
Separate k-mer lists were created for each bacterial genome with GListMaker.
These lists were recursively combined pairwise with a script MakeUnion.pl (included in GenomeTester4 package) that uses the GListCompare union function.
GListCompare running times for the generation of different bacterial datasets
Number of unique k-mers
32-mer lists of all bacteria
2 h 8 m
Union list of all bacteria
2 h 30 m
Streptococcus genus specific
S. pneumoniae species specific
S. pneumoniae strain G54 specific
Example 2: Finding chromosome-specific repeats in a eukaryotic genome
Created separate 16-mer lists from X and Y chromosomes with GListMaker
Created a single 16-mer list from all autosomes
Created subsets of all k-mers whose count was at least 10 in the sex-chromosome lists using the GListCompare cutoff option
Subtracted the autosome list from repeated k-mer lists with GListCompare using the difference function with a cutoff value of 2
Subtracted the other sex-chromosome list from the repeated and unique k-mer lists using GListCompare with a cutoff value of 1
Creating a subset of all k-mers that occur at least 50 times in the chromosome Y list
Subtracting the list of k-mers of over 50 copies from the list of Y-specific k-mers using GListCompare difference function
In total, we identified 4878 X-specific and 137,868 Y-specific 16-mers.
The final steps in our analysis were performed using custom Perl scripts. First, we located the regions in chromosome that had significant over-representation of unique repeated k-mers. Then we grouped these regions by similarity using BLAST alignment. Finally, these regions were aligned against the full genome with BLAST. The more complete protocol with command used is outlined in Additional file 3.
Example 3: Finding group-specific k-mers to identify bacteria
The detection of bacteria at various phylogenetic levels is often required during medical diagnosis and in both epidemiological and ecological studies. To identify whether a certain bacterium belongs or is closely related to a predefined group of strains, one can find the k-mers that are unique to that group of strains and search for those from sequencing reads of the bacterium of interest. We used GenomeTester4 to generate lists of specific 32-mers from the genus Streptococcus, of S. pneumoniae species and of S. pneumoniae strain G54.
First, we created a union of all k-mers contained in all bacteria in the NCBI database, as described above. During this step we also obtained k-mer lists from all strains of interest.
Next, we found the intersection of all lists of the strains of interest. This list contains all k-mers that are present in all bacteria from the set. It is reasonable to expect that many of these k-mers will be present in any new strain as long as it is closely related to any known strain from this set. Because we want to use the difference from union operation for finding unique k-mers, the intersection operation must use the -sum rule (i.e. the counts in the intersection list are sum of the counts of source lists). Finally, we found the difference between the target list and the list of all bacterial genomes using the difference from union option of GListCompare.
The running times of GListMaker or GListCompare and the sizes of all lists used in this example are provided in Table 3. Although we used assembled genomes to generate target and non-target lists, the assembly process is probably not required for creating lists. One can compile these lists directly from sequencing reads to avoid the time-consuming process of assembly. Also, we expect that a much lower sequencing coverage is required to compile representative k-mer lists than the coverage required for genome assembly. The speed of the GenomeTester4 package allows one to perform this kind of analysis as a routine part of sequencing.
GenomeTester4 is a universal toolbox for creating and using k-mer lists from genomic sequences and its computational speed is competitive with other k-mer counting programs. This package is unique in its ability to perform fast set operations and list queries with a user-specified number of mismatches. These routines have proven to be useful in our research and, because of their universal nature, we expect that others may find them to be useful for many possible tasks that require k-mer counting and/or operations with k-mer lists. This makes GenomeTester4 a potentially valuable addition to many k-mer and sequence analysis toolkits.
Availability and requirements
Project name: GenomeTester4
Project (source code) home page: https://github.com/bioinfo-ut/GenomeTester4
Operating systems: Linux (64-bit)
Programming language: C
Other requirements (when recompiling): GCC version 4
License: GNU General Public License version 3.0 (GPLv3)
Any restrictions to use by non-academics: none
Availability of supporting data
The data sets supporting the results of this article are available in the GigaDB repository .
1 billion base pairs
gibibyte (230 bytes)
kibibyte (210 bytes)
1 million base pairs
mebibyte (220 bytes)
This work was funded by the EU ERDF through the Estonian Center of Excellence in Genomics, by Estonian IT Academy and grants SF0180026s09 and IUT34-11 from the Estonian Ministry of Education and Research.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46.View ArticlePubMedPubMed CentralGoogle Scholar
- Patro R, Mount SM, Kingsford C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol. 2014;32:462–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Kim EB, Fang X, Fushan AA, Huang Z, Lobanov AV, Han L, et al. Genome sequencing reveals insights into physiology and longevity of the naked mole rat. Nature. 2011;479:223–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Tu Q, He Z, Zhou J. Strain/species identification in metagenomes using genome-specific markers. Nucleic Acids Res. 2014;42:e67.View ArticlePubMedPubMed CentralGoogle Scholar
- Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–70.View ArticlePubMedPubMed CentralGoogle Scholar
- Kurtz S, Narechania A, Stein JC, Ware D. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics. 2008;9:517.View ArticlePubMedPubMed CentralGoogle Scholar
- Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics. 2015. doi:10.1093/bioinformatics/btv022.Google Scholar
- Rizk G, Lavenier D, Chikhi R. DSK: K-mer counting with very low memory usage. Bioinformatics. 2013;29:652–3.View ArticlePubMedGoogle Scholar
- Roy RS, Bhattacharya D, Schliep A. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics. 2014;30:1950–7.View ArticlePubMedGoogle Scholar
- Maillet N, Lemaitre C, Chikhi R, Lavenier D, Peterlongo P. Compareads: comparing huge metagenomic experiments. BMC Bioinformatics. 2012;13 Suppl 1:S10.View ArticlePubMedPubMed CentralGoogle Scholar
- Crusoe M, Edvenson G, Fish J, Howe A, McDonald E, Nahum J, et al. The khmer software package: enabling efficient sequence analysis. figshare. 2014. doi:10.6084/m9.figshare.979190.
- Tatusova T, Ciufo S, Fedorov B, O'Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42:D553–9.View ArticlePubMedGoogle Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Kaplinski, L; Lepamets, M; Remm, M. Supporting materials and software for “GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists”. GigaSci database. 2015. http://dx.doi.org/10.5524/100178