Draft genome of the leopard gecko, Eublepharis macularius
© The Author(s). 2016
Received: 6 August 2016
Accepted: 11 October 2016
Published: 26 October 2016
Geckos are among the most species-rich reptile groups and the sister clade to all other lizards and snakes. Geckos possess a suite of distinctive characteristics, including adhesive digits, nocturnal activity, hard, calcareous eggshells, and a lack of eyelids. However, one gecko clade, the Eublepharidae, appears to be the exception to most of these ‘rules’ and lacks adhesive toe pads, has eyelids, and lays eggs with soft, leathery eggshells. These differences make eublepharids an important component of any investigation into the underlying genomic innovations contributing to the distinctive phenotypes in ‘typical’ geckos.
We report high-depth genome sequencing, assembly, and annotation for a male leopard gecko, Eublepharis macularius (Eublepharidae). Illumina sequence data were generated from seven insert libraries (ranging from 170 bp to 20 kb), representing a raw sequencing depth of 136X from 303 Gb of data, reduced to 84X and 187 Gb after filtering. The assembled genome of 2.02 Gb was close to the 2.23 Gb estimated by k-mer analysis. Scaffold and contig N50 sizes of 664 and 20 kb, respectively, were comparable to those of the previously published Gekko japonicus genome. Repetitive elements accounted for 42 % of the genome. Gene annotation yielded 24,755 protein-coding genes, of which 93 % were functionally annotated. CEGMA and BUSCO assessment showed that our assembly captured 91 % (225 of 248) of the core eukaryotic genes, and 76 % of vertebrate universal single-copy orthologs.
Assembly of the leopard gecko genome provides a valuable resource for future comparative genomic studies of geckos and other squamate reptiles.
Keywords: Gekkota; Leopard gecko; Eublepharis macularius; Genome sequencing; Assembly
Sample collection and sequencing
Summary statistics of leopard gecko sequence data derived from paired-end sequencing of seven insert libraries using an Illumina HiSeq 2000 platform
Columns: Library insert size (bp); Read length (bp); raw data: Total bases (Gb), Sequencing depth (X); filtered data: Total bases (Gb), Sequencing depth (X)
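The relationship between total bases and sequencing depth can be checked with a short calculation. The sketch below assumes that depth was computed as total bases divided by the 2.23 Gb genome size estimated by k-mer analysis; this denominator is an assumption, chosen because it reproduces the reported 136X and 84X. The function name is illustrative.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return average sequencing depth (fold coverage) as bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Totals reported in this article. Using the k-mer estimate (2.23 Gb) as the
# denominator is an assumption, not something the text states explicitly.
GENOME_SIZE_GB = 2.23
raw_depth = sequencing_depth(303, GENOME_SIZE_GB)
filtered_depth = sequencing_depth(187, GENOME_SIZE_GB)

print(f"raw: {raw_depth:.0f}X, filtered: {filtered_depth:.0f}X")
# prints "raw: 136X, filtered: 84X"
```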
Statistics of genome size estimation by 17-mer analysis. The genome size was estimated according to the formula: Genome size = # Kmers/Peak of depth
Columns: Kmer length (bp); Peak of depth; Estimated genome size (bp); Data used (bp)
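The formula Genome size = # Kmers/Peak of depth can be illustrated on a toy data set. This is a minimal sketch, not the pipeline used for the leopard gecko: the function names, the tiny 16 bp "genome", and the min_depth threshold for ignoring low-frequency (likely erroneous) k-mers are all assumptions made for illustration, and k = 5 is used instead of 17 so the example stays small.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def peak_depth(kmer_counts, min_depth=2):
    """Return the k-mer depth shared by the largest number of distinct k-mers.

    Depths below min_depth are ignored, on the assumption that they mostly
    reflect sequencing errors (hypothetical threshold).
    """
    histogram = Counter(kmer_counts.values())
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mer depth at or above min_depth")
    return max(candidates, key=lambda d: (candidates[d], d))


def estimate_genome_size(total_kmers, peak):
    """Genome size = number of k-mers / peak of depth."""
    return total_kmers / peak


# Toy example: a 16 bp sequence whose 5-mers are all distinct, "sequenced" 3 times.
genome = "ACGTTGCATGGCTAAC"
reads = [genome] * 3
counts = count_kmers(reads, k=5)
total_kmers = sum(counts.values())
peak = peak_depth(counts)
estimate = estimate_genome_size(total_kmers, peak)
print(total_kmers, peak, estimate)
```

In this toy case the estimate equals the number of distinct k-mer positions in the sequence (length minus k plus 1), which approaches the genome length when the genome is much longer than k.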
Comparison of genome features between Eublepharis macularius and Gekko japonicus
Features compared: Assembled genome size (Gb); Scaffold N50 (kb); Contig N50 (kb); Repeat content (% of genome)
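For readers unfamiliar with the N50 statistic used in this comparison, the sketch below shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the example lengths are illustrative and are not taken from the leopard gecko assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical lengths (kb): total is 290, so half is 145; 80 + 70 = 150 covers it.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # prints 70
```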
Summary statistics of key parameters for 13 reptile genomes
Columns: Assembly size (Gb); Contig N50 (kb); Scaffold N50 (kb). Species listed include Green anole lizard, Chrysemys picta bellii (Western painted turtle), Green sea turtle, Python molurus bivittatus and Australian dragon lizard; sequencing technology entries include Sanger + NGS.
Estimation of genome completeness
Coverage of core eukaryotic genes (CEGs) in the gecko genome assessed by CEGMA. All CEGs were divided into four groups based on their degree of protein sequence conservation. Group 1 contains the least conserved CEGs and group 4 contains the most conserved
Summarized benchmarks in the BUSCO assessment
Benchmarks reported: Total BUSCO groups searched; Complete single-copy BUSCOs; Complete duplicated BUSCOs
Summary statistics of annotated repeats in the leopard gecko genome assembly
Columns: Total repeat length (bp); Percentage of genome
We combined homology-based, de novo, and transcriptome-based methods to predict protein-coding genes in the leopard gecko genome.
In the homology-based method, we downloaded the gene sets of Taeniopygia guttata, Homo sapiens, Anolis carolinensis, Pelodiscus sinensis and Xenopus tropicalis from the Ensembl database (release-73). We first aligned these homologous protein sequences to the leopard gecko genome assembly using TBLASTN with an E-value cutoff of 1e-5, and linked the BLAST hits into candidate gene loci with GenBlastA. We then extracted the genomic sequences of candidate loci, together with 3 kb flanking sequences, and used GeneWise to determine gene models. Finally, we filtered out pseudogenes that had only one exon with frame errors, as these loci were probably derived from retrotransposition.
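Two of the numeric choices in this step, the 1e-5 E-value cutoff and the 3 kb flanking sequence, can be sketched as follows. This is an illustration only and does not reproduce GenBlastA or GeneWise: the Hit record, its field names, the example hits, and the decision to treat the cutoff as inclusive and to clamp flanks at scaffold ends are all assumptions.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One protein-to-genome alignment (field names are illustrative)."""
    query: str
    scaffold: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    evalue: float


def filter_hits(hits: List[Hit], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits at or below the E-value cutoff (inclusive by assumption)."""
    return [hit for hit in hits if hit.evalue <= max_evalue]


def locus_with_flanks(start: int, end: int, scaffold_length: int,
                      flank: int = 3000) -> Tuple[int, int]:
    """Extend a locus by `flank` bp on both sides, clamped to the scaffold."""
    low, high = min(start, end), max(start, end)
    return max(1, low - flank), min(scaffold_length, high + flank)


# Hypothetical hits: only the first passes the 1e-5 cutoff.
hits = [
    Hit("protA", "scaffold1", 5000, 6000, 1e-30),
    Hit("protB", "scaffold1", 20000, 20400, 0.5),
]
kept = filter_hits(hits)
print([hit.query for hit in kept])            # prints ['protA']
print(locus_with_flanks(5000, 6000, 100000))  # prints (2000, 9000)
print(locus_with_flanks(2000, 9000, 10500))   # prints (1, 10500)
```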
In the de novo method, we randomly selected 1000 leopard gecko genes with intact open reading frames (ORFs) and the highest GeneWise scores from the homology-based gene set to train the Augustus gene prediction tool with default parameters. Augustus was then used to perform de novo gene prediction on repeat-masked genome sequences. Gene models with incomplete ORFs and small genes with a protein-coding length <150 bp were filtered out. Finally, a BLASTP search of predicted genes was performed against the SwissProt database. Genes with matches to SwissProt proteins containing any one of the following keywords were filtered out: transpose, transposon, retro-transposon, retrovirus, retrotransposon, reverse transcriptase, transposase, and retroviral.
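The three filters applied to the Augustus predictions can be sketched as a single predicate. The keyword list and the 150 bp threshold come from the text; everything else is an assumption made for illustration, including the function names, the definition of a complete ORF (ATG start, terminal stop codon, length divisible by three, no internal stop), and the use of a case-insensitive substring match on the SwissProt description.

```python
TE_KEYWORDS = (
    "transpose", "transposon", "retro-transposon", "retrovirus",
    "retrotransposon", "reverse transcriptase", "transposase", "retroviral",
)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_complete_orf(cds: str) -> bool:
    """Assumed definition: ATG start, terminal stop, in frame, no internal stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    if not cds.startswith("ATG") or cds[-3:] not in STOP_CODONS:
        return False
    internal = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    return all(codon not in STOP_CODONS for codon in internal)


def keep_gene(cds: str, swissprot_description: str = "",
              min_cds_length: int = 150) -> bool:
    """Return True if a predicted gene passes all three filters."""
    if not has_complete_orf(cds):
        return False
    if len(cds) < min_cds_length:
        return False
    description = swissprot_description.lower()
    return not any(keyword in description for keyword in TE_KEYWORDS)


# Hypothetical coding sequences: 156 bp (kept) and 36 bp (too short).
long_cds = "ATG" + "GCT" * 50 + "TAA"
short_cds = "ATG" + "GCT" * 10 + "TAA"
print(keep_gene(long_cds, "Serum albumin"))
print(keep_gene(short_cds, "Serum albumin"))
print(keep_gene(long_cds, "Retrovirus-related Pol polyprotein"))
```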
Transcriptome-based gene prediction was then performed using leopard gecko RNA-seq data from liver, salivary gland, scent gland, and skin tissues obtained from the NCBI database (accession numbers SRR629643, ERR216315, ERR216316, ERR216322, ERR216325, ERR216304 and ERR216306). TopHat (v1.3.3) was used to align the RNA-seq reads against the leopard gecko genome assembly to identify splice junctions, and Cufflinks (v2.2.1) was used to assemble transcripts from the aligned RNA-seq reads.
A Markov model was estimated from the 1000 high-quality genes previously used to train Augustus, using the trainGlimmerHMM tool included in the GlimmerHMM software package. The coding potential of each transcript assembled from the transcriptome data was then assessed using this Markov model. Transcripts with complete ORFs were extracted, and multiple isoforms from the same locus were collapsed by retaining the one with the longest ORF.
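Collapsing isoforms by retaining the longest ORF per locus can be sketched as below. The input layout (locus identifier, transcript identifier, ORF length) and the tie-breaking rule (the first transcript seen wins) are assumptions made for illustration; the identifiers are hypothetical.

```python
def collapse_isoforms(transcripts):
    """Keep one transcript per locus: the one with the longest ORF.

    transcripts: iterable of (locus_id, transcript_id, orf_length) tuples.
    Ties keep the first transcript encountered (an assumption).
    Returns a dict mapping locus_id to the retained transcript_id.
    """
    best = {}
    for locus_id, transcript_id, orf_length in transcripts:
        if locus_id not in best or orf_length > best[locus_id][1]:
            best[locus_id] = (transcript_id, orf_length)
    return {locus: transcript for locus, (transcript, _) in best.items()}


# Hypothetical assembled transcripts.
assembled = [
    ("locus1", "tx1.1", 900),
    ("locus1", "tx1.2", 1500),
    ("locus2", "tx2.1", 600),
    ("locus1", "tx1.3", 1500),
]
print(collapse_isoforms(assembled))
# prints {'locus1': 'tx1.2', 'locus2': 'tx2.1'}
```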
These non-redundant ORFs were then integrated with homology-based gene models to form the core gene set using a custom script. If a gene model with a higher priority overlapped with a model with a lower priority (overlapping length >100 bp), the latter was removed. If two gene models with the same priority overlapped, the one with a longer ORF was preferred.
Homology-based gene models not supported by transcriptome-based evidence but supported by homologous evidence from at least two species were added to the core gene set.
De novo-based gene models not supported by homology-based and transcriptome-based evidence were added to the core gene set where significant hits (BLASTP E-value <1e-5) for non-transposon proteins in the SwissProt database were obtained.
As a result of these steps, a total of 24,755 non-redundant protein-coding genes were annotated in the leopard gecko genome assembly.
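The overlap rule used when integrating gene models (remove the lower-priority model when the overlap exceeds 100 bp; prefer the longer ORF when priorities are equal) can be sketched as a greedy selection. This is not the authors' custom script: the GeneModel fields, the numeric priority values, the decision to ignore strand, and the greedy ordering are assumptions, and the text does not state which evidence type receives the higher priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    name: str
    scaffold: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    priority: int    # larger value = higher priority (hypothetical scale)
    orf_length: int


def overlap_length(a: GeneModel, b: GeneModel) -> int:
    """Number of overlapping bases between two models on the same scaffold."""
    if a.scaffold != b.scaffold:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def merge_gene_models(models, max_overlap=100):
    """Greedily keep models, highest priority first, then longest ORF first.

    A model is dropped if it overlaps an already kept model by more than
    max_overlap bp.
    """
    ranked = sorted(models, key=lambda m: (-m.priority, -m.orf_length))
    kept = []
    for model in ranked:
        if all(overlap_length(model, other) <= max_overlap for other in kept):
            kept.append(model)
    return kept


# Hypothetical models on one scaffold.
models = [
    GeneModel("rna1", "s1", 1000, 5000, priority=2, orf_length=1200),
    GeneModel("hom1", "s1", 4000, 8000, priority=1, orf_length=1500),
    GeneModel("hom2", "s1", 4950, 9000, priority=1, orf_length=900),
    GeneModel("hom3", "s1", 8500, 12000, priority=1, orf_length=2000),
]
core = merge_gene_models(models)
print([model.name for model in core])  # prints ['rna1', 'hom3']
```

Here hom1 is dropped because it overlaps the higher-priority rna1 by 1001 bp, and hom2 is dropped because it overlaps hom3 (same priority, longer ORF) by 501 bp; its 51 bp overlap with rna1 alone would have been tolerated.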
Functional annotation of protein-coding genes
Statistics for functional annotation
Number of genes annotated (percentage of the 24,755 predicted genes): 20,958 (84.66 %); 15,873 (64.12 %); 16,172 (65.33 %); 23,139 (93.47 %); 22,347 (90.27 %)
Availability and requirements
Project name: Leopard gecko genome annotation scripts
Project home page: https://github.com/gigascience/paper-xiong2016
Operating systems: Linux
Programming language: PERL
Other requirements: none
Any restrictions to use by non-academics: none
This work was funded by the China National GeneBank and supported by the Genome 10K (G10K) project. We thank the faculty and staff at BGI-Shenzhen who contributed to the sequencing of the leopard gecko genome, and R. Tremper for providing experimental animals.
Availability of supporting data
Supporting datasets are available at GigaDB. Raw sequencing reads have been deposited in the SRA (Sequence Read Archive) database under SRA ID SRA451060 and Bioproject ID PRJNA339626.
GZ and QL conceived and supervised the project. TG provided the leopard gecko samples. ZX, FL, LZ and JZ performed genome assembly, repeat annotation, gene annotation and gene function annotation. LK, CL and SL provided materials and advice. ZX and FL drafted the manuscript. QL revised the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- De Vosjoli P, Tremper R, Klingenberg R. The herpetoculture of leopard geckos: Advanced Visions Inc. 2005.
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1(1):1.
- Liu B, Yuan J, Yiu S-M, Li Z, Xie Y, Chen Y, Shi Y, Zhang H, Li Y, Lam T-W. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics. 2012;28(22):2870–4.
- Liu Y, Zhou Q, Wang Y, Luo L, Yang J, Yang L, Liu M, Li Y, Qian T, Zheng Y, et al. Gekko japonicus genome reveals evolution of adhesive toe pads and tail regeneration. Nat Commun. 2015;6:10033.
- Alfoldi J, Di Palma F, Grabherr M, Williams C, Kong L, Mauceli E, Russell P, Lowe CB, Glor RE, Jaffe JD, et al. The genome of the green anole lizard and a comparative analysis with birds and mammals. Nature. 2011;477(7366):587–91.
- Castoe TA, de Koning AP, Hall KT, Card DC, Schield DR, Fujita MK, Ruggiero RP, Degner JF, Daza JM, Gu W, et al. The Burmese python genome reveals the molecular basis for extreme adaptation in snakes. Proc Natl Acad Sci U S A. 2013;110(51):20645–50.
- Vonk FJ, Casewell NR, Henkel CV, Heimberg AM, Jansen HJ, McCleary RJ, Kerkkamp HM, Vos RA, Guerreiro I, Calvete JJ, et al. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proc Natl Acad Sci U S A. 2013;110(51):20651–6.
- Wan QH, Pan SK, Hu L, Zhu Y, Xu PW, Xia JQ, Chen H, He GY, He J, Ni XW, et al. Genome analysis and signature discovery for diving and sensory properties of the endangered Chinese alligator. Cell Res. 2013;23(9):1091–105.
- Chen H, He G, Hu L, Pan S, Wan Q, Xia J, Xu P, Zhu Y, He J, Ni X et al. Genomic data of the Chinese alligator (Alligator sinensis). Gigascience Database. 2014. http://dx.doi.org/10.5524/100077.
- Green RE, Braun EL, Armstrong J, Earl D, Nguyen N, Hickey G, Vandewege MW, St John JA, Capella-Gutierrez S, Castoe TA, et al. Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science. 2014;346(6215):1254449.
- Wang Z, Pascual-Anaya J, Zadissa A, Li W, Niimura Y, Huang Z, Li C, White S, Xiong Z, Fang D, et al. The draft genomes of soft-shell turtle and green sea turtle yield insights into the development and evolution of the turtle-specific body plan. Nat Genet. 2013;45(6):701–6.
- Georges A, Li Q, Lian J, O’Meally D, Deakin J, Wang Z, Zhang P, Fujita M, Patel HR, Holleley CE, et al. High-coverage sequencing and annotated assembly of the genome of the Australian dragon lizard Pogona vitticeps. Gigascience. 2015;4:45.
- Shaffer HB, Minx P, Warren DE, Shedlock AM, Thomson RC, Valenzuela N, Abramyan J, Amemiya CT, Badenhorst D, Biggar KK, et al. The western painted turtle genome, a model for the evolution of extreme physiological adaptations in a slowly evolving lineage. Genome Biol. 2013;14(3):1.
- Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23(9):1061–7.
- Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.
- Smit A, Hubley R, Green P. 2015 RepeatMasker Open-4.0. 2016.
- Kapitonov VV, Jurka J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat Rev Genet. 2008;9(5):411–2.
- Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21 suppl 1:i152–8.
- Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27(2):573.
- She R, Chu JS, Wang K, Pei J, Chen N. GenBlastA: enabling BLAST to identify homologous gene sequences. Genome Res. 2009;19(1):143–9.
- Birney E, Clamp M, Durbin R. GeneWise and genomewise. Genome Res. 2004;14(5):988–95.
- Keller O, Kollmar M, Stanke M, Waack S. A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics. 2011;27(6):757–63.
- UniProt C. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(Database issue):D204–12.
- Hargreaves AD, Swain MT, Logan DW, Mulley JF. Testing the Toxicofera: comparative transcriptomics casts doubt on the single, early evolution of the reptile venom system. Toxicon. 2014;92:140–56.
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562–78.
- Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20(16):2878–9.
- Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 2014;42(D1):D199–205.
- Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–40.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9.
- Xiong Z, Li F, Li Q, Zhou L, Gamble T, Zheng J, Kui L, Li C, Li S, Yang H et al. Supporting data for “Draft genome of the leopard gecko, Eublepharis macularius”. GigaScience Database 2016. http://dx.doi.org/10.5524/100246.