RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing
- Vikas Gupta†1, 2,
- April D Estrada†1,
- Ivory Blakley1,
- Rob Reid1,
- Ketan Patel1, 4,
- Mason D Meyer1,
- Stig Uggerhøj Andersen2,
- Allan F Brown3,
- Mary Ann Lila3 and
- Ann E Loraine1Email author
© Gupta et al.; licensee BioMed Central. 2015
Received: 2 September 2014
Accepted: 22 January 2015
Published: 13 February 2015
Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable the breeding of berry varieties with enhanced health benefits.
Toward this end, we annotated a previously sequenced draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was up-regulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. Homology-based annotation using Blast2GO and InterPro assigned Gene Ontology terms to around 15,000 genes. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes and transcriptional regulators. Analysis of RNA-seq alignments identified developmentally regulated alternative splicing, promoter use, and 3′ end formation.
We report genome sequence, gene models, functional annotations, and RNA-Seq expression data that provide an important new resource enabling high throughput studies in blueberry.
KeywordsRNA-Seq Genome Transcriptome Blueberry Alternative splicing Fruit ripening Metabolic pathways
A diet rich in blueberries can help protect against diabetes , cardiovascular disease, and age-related cognitive decline [2,3]. Molecular or biochemical mechanisms underlying these positive health benefits are not known, but most research has thus far focused on the antioxidant and anti-inflammatory properties of polyphenolic phytochemicals that accumulate as the fruit ripen. Blueberry fruit are an especially rich source of polyphenolic anthocyanin pigments, which give blueberries their distinctive color. Of these, malvidin, delphinidin, and peonidin are the most abundant by weight . The relative abundance of anthocyanin species can differ between genotypes [5,6], and in vivo research has shown that different anthocyanins affect biological systems in different ways [7-9], suggesting that berry varieties may offer distinct health benefits. Blueberries also contain relatively large amounts of quercetin , another polyphenolic that may have beneficial effects in Alzheimer’s disease  and inflammation-related disorders . Berries may also contain other as-yet undiscovered beneficial phytochemicals that could interact with anthocyanins or other compounds to potentiate biological efficacy [12,6]. Genomic studies that catalog the full genetic repertoire of blueberry could enable greater understanding of bioactive compounds, a necessary step toward developing new varieties bred for health benefits.
Blueberries are in the Cyanococcus section of family Ericaceae, genus Vaccinium, which also includes cranberry (V. macrocarpon), lingonberry (V. vitis-idaea), and more than 400 other species . Commercially harvested blueberry species in North America include lowbush (wild) blueberry Vaccinium angustifolium, a low, spreading shrub grown in managed stands in the Northern US and Canada, and the highbush blueberry species Vaccinium corymbosum and Vaccinium ashei, which are larger shrubs grown in orderly rows in orchards that require annual pruning to maintain productivity. Of the two highbush species, V. corymbosum is the most widely grown, while V. ashei is grown solely in the Southern US. V. corymbosum was first domesticated in the early 20th century by US Department of Agriculture scientist Fredrick Coville working with New Jersey farmer Elizabeth White, who recruited local pickers to locate wild berry plants with unusually large fruit. Coville’s breeding these early wild selections produced varieties suitable for commercial production, some of which are still grown today. Both lowbush and V. corymbosum highbush blueberries are deciduous and require a period of low temperatures during the winter season to induce flowering the following spring. To expand the range where highbush blueberries can be grown commercially, breeding programs have selected varieties with reduced chilling requirement, leading to development of sub-varieties called ‘southern highbush’ because they require fewer days of colder temperatures to trigger flowering. Ploidy levels of berry species range from diploid to hexaploid, and most varieties of highbush berry contain genetic material introduced from diverse genotypes and species, including V. darrowii Camp (evergreen blueberry) and V. arboreum Mar. (sparkleberry), as well as rabbiteye and lowbush blueberry. Although there is a great diversity across varieties, highbush blueberry plants within the same cultivar are highly uniform, as all are clones propagated from a single selection. Thus sequences collected from individuals from the same cultivar are expected to be highly homogenous with few differences between individuals.
Estimates based on flow cytometry predict that a haploid blueberry genome is around 600 million bases, five times the size of the Arabidopsis thaliana genome . In a related study, a draft genome assembly of a diploid northern highbush blueberry was generated using HiSeq Illumina reads ; the unassembled sequences are available from the Short Read Archive under accession SRA053499. This draft assembly consists of 225,479 contigs organized into 13,757 scaffolds with an N50 scaffold size of 145 kb, meaning that at least half of the sequence data is organized into scaffolds of 145 kb or larger. Plant genes are typically smaller than 2 kb, and intergenic regions are often smaller, which means that a 145 kb or larger contig could accommodate 50 or more genes. Although the genome assembly is still a work in progress, its large N50 make this draft assembly an important new resource for RNA-Seq analysis and gene discovery in blueberry.
To date, blueberry improvement efforts have focused on agronomic traits, such as ability to withstand mechanical harvesting, or consumer-focused traits, such as berry size, flavor, and mouth feel. Due to rising consumer interest in the health-protective effects of blueberries and other fruits and vegetables, breeding for nutritional and health-protective qualities may become practical in the near future. Breeding a more healthful berry will require more complete knowledge of genes encoding enzymes of secondary metabolism as well as their putative regulators. Toward this end, we performed high-throughput transcriptome sequencing (RNA-Seq) and differential gene expression analysis of five stages of berry development and ripening. Genome data, RNA-Seq expression profiles, and functional annotations have been made publicly available and will provide an important new resource for interpretation of high-throughput data from blueberry species.
Berry collection and RNA extraction
Blueberry samples were collected from the field from 4- or 5-year-old blueberry plants growing at the North Carolina Department of Agriculture Piedmont Research Station. Plants were labeled by row and position within the row; for example, plant 2-41 occupied position 41 within row 2. All samples intended for RNA extraction were flash-frozen on liquid nitrogen in the field immediately after collection and stored at -80°C until use. For RNA extraction, whole berry samples were ground to powder in a mortar and pestle with liquid nitrogen and total RNA was extracted using the Spectrum Total Plant RNA Kit (Sigma, St. Louis, USA). Extracted RNAs were treated with DNase I prior to library construction using RNAase-Free DNase (catalog number 79254) from Qiagen.
454 library construction and sequencing for May 2009 samples
Two libraries were prepared from samples of green and ripe fruit, respectively, from plants of the O’Neal variety of southern highbush blueberry (V. corymbosum). The green fruit library was prepared from a mix of unripe, green fruits of varying sizes harvested on 18 May 2009 from plant 2-41. The ripe fruit library was prepared from ripe fruits harvested on 15 June 2009 from plants 2-40, 2-41, and 2-42, also of the O’Neal variety. The libraries for sequencing were constructed using the SMART PCR cDNA synthesis kit from CloneTech, Mountain View, USA. The 3′ and 5′ primers used in first strand cDNA synthesis were aagcagtggtatcaacgcagagtact(30)VN and aagcagtggtatcaacgcagagtacgcggg, respectively, where V was a, g, or c and N was any nucleotide. The products of first strand cDNA synthesis were amplified using a poly(A) disruption PCR primer designed to introduce non-A bases in the poly(A) tail region of the cDNA, because homopolymeric sequences are difficult to sequence using the 454 technology. The poly(A) disruption primer sequence was attctagaggccgaggcggccgacatgt(4)gtct(4)gttctgt(3)ct(4)VN, where numbers in parentheses indicate the number of times the preceding base appeared in the sequence and V was a, g, or c. The sequence of the 5′ primer used for second strand cDNA synthesis was aagcagtggtatcaacgcagagt. The two libraries were sequenced in two sectors of the same plate on a 454-GS FLX Titanium sequencer (454 Life Sciences, Roche Diagnostics, USA) at the David H. Murdock Research Institute. Sequence data are available from the Short Read Archive under accession SRP039977.
454 library construction and sequencing for May 2010 samples
Green and ripe berries were harvested from O’Neal variety plant 2-42 on 29 April 2010 and 26 May 2010, respectively. For sequencing and library construction, samples of total RNA were sent to the North Carolina State University Genome Sciences Laboratory (GSL). Each sample was used to synthesize two libraries, which were sequenced on the same plate of a 454-GS FLX Titanium sequencer. Libraries were synthesized at the GSL following the protocol reported previously . Sequence data are available from the Short Read Archive under accession SRP039977.
Berry collection, library synthesis, and sequencing of berry development samples
Berries from five stages were selected from three plants (3-33, 2-41, and 2-42) of the O’Neal variety of southern highbush blueberry V. corymbosum. We designated the five stages as pads, cups, green, pink, and ripe. The ‘pads’ and ‘cups’ stages corresponded approximately to stages S1/S2 (pads) and S3/S4 (cups) described in . Green fruit were fully rounded green berries, pink berries were partially pigmented but still firm, and ripe berries were fully colored and soft. Samples were collected during the growing season of 2011. Pads were collected on 4 April, cups on 19 April, mature green fruit on 28 April, pink fruit on 20 May, and ripe fruit on 2 June. Following RNA extraction and DNAase I treatment (as described in the previous section) libraries were synthesized using the TruSeq A kit (catalog number FC-121-1001) from Illumina (San Diego, USA) following the manufacturer’s instructions. Libraries were synthesized using different TruSeq adapters to allow multiplexing, combined and then sequenced in three lanes with five libraries per lane using a HiSeq sequencer from Illumina. Sequence data are available from the Short Read Archive under accession SRP039977.
Berry collection, library synthesis, and sequencing of berry cultivars
Fully ripe and green berries from four berry cultivars (Pamlico, Lenoir, O’Neal and Ozark Blue) were harvested in 2009 from plants growing at the Piedmont Research Station (same field as above) and frozen on liquid nitrogen in the field. RNA was extracted as described in the preceding section. Libraries were synthesized using the mRNA-Seq Sample Preparation kit (catalog number RS-930-1001) from Illumina following manufacturer instructions. Libraries were sequenced using paired-end, 76 cycle sequencing at the UNC Chapel Hill Lineberger Cancer Research Center on a GAIIx sequencer from Illumina, San Diego, USA. Sequence data are available from the Short Read Archive under accession SRP039971.
Sequence processing and alignment
Prior to alignment, all sequences were trimmed to remove low-quality bases at the 5′ and 3′ ends of sequences. Single-end Hiseq reads (100 bp) from the berry fruit development series were trimmed to 85 bases to remove lower quality bases. Five bases were trimmed from the 3′ end and 10 bases were trimmed from the 5′ end of each read using the FASTX-Toolkit from Galaxy . Similarly, the 76 bases long paired-end GAIIx sequences were trimmed to 61 bases by removing ten bases on the 5′ end and three bases on the 3′ end of each sequence. For 454-generated sequences, ten bases were removed on the 5′ end only. The Illumina sequences were aligned onto the blueberry draft genome assembly  using TopHat2  and Bowtie2  using default parameters, except for the maximum intron size parameter, which was set to 6,000 bases consistent with typical intron size distributions for plant genes. 454 sequences were aligned onto the reference genome sequence using GMAP  with default parameters except for intron length, which was set to 6,000 bp.
Gene model generation, filtering, and protein sequence assignment
Cufflinks  was used to generate transcript models using paired-end and single-end Illumina sequences alignments with maximum intron set to 6,000 bases. Three ab initio gene-finding programs (Augustus , GlimmerHMM , and GeneMark ) were used to generate gene models from genomic sequence. Arabidopsis-trained parameters were used for GlimmerHMM and default parameters were used for Augustus and GeneMark. Because many of the resulting ab initio and Cufflinks-based gene models covered the same genomic regions, a step-wise filtering protocol (Additional file 1: Figure S1) was applied to reduce redundancy in the final gene set. First, all genes generated by Cufflinks were selected for inclusion in the final gene set, because these were predicted from the RNA-Seq expression data and thus the level of evidence supporting them was high. Any ab initio gene finder-predicted gene that overlapped one of the Cufflinks-predicted genes was eliminated from the candidate gene list. Next, all remaining candidate genes that were predicted by GeneMark were considered. If a candidate GeneMark gene had homology to a known protein (by BLASTX) or overlapped with an expressed sequence alignment, it was added to the final gene set and, as before, overlapping gene models were removed from the remaining candidate gene set. The process was repeated for genes predicted by Glimmer, Augustus, and GMAP. The selection order of priority for ab initio gene prediction programs was based on visual inspection of predicted gene models and consistency with alignments of full-length blueberry sequences available from GenBank. Protein coding sequences were detected in the gene models using the TAU program . Gene models were formatted into a BED-detail (BED14) file (V_corymbosum_scaffold_May_2013.bed) in which field 4 listed the transcript (gene model) name, field 13 listed the gene name, and field 14 listed the name of the program that generated the gene model. Note that many genes predicted by Cufflinks were associated with multiple transcripts, typically products of alternative splicing. Thus, in the case of Cufflinks-predicted genes, the gene field contained a name such as CUFF.11 while the gene model field contained a name such as CUFF.11.1, the first reported transcript generated from gene CUFF.11. Gene models in bed format were added to the project bitbucket repository  in subdirectory GeneModelAnalysis/results.
Functional annotation of gene models using BLASTX against the NCBI non-redundant protein database
Sequences of spliced transcripts were searched against the non-redundant protein database from NCBI (nr) using a database that was downloaded in June 2013. Virtual cDNA sequences obtained from the gene model annotations were searched against the nr database using BLASTX. The resulting matches were used for annotation if the e-value was 10-4 or better, the alignment length was at least 30 amino acids, and the percent identity was 30% or higher. For each gene model with at least one high-quality match meeting these criteria, an annotation string was generated containing the GenBank identifier and FASTA header for the best-matching sequence, information about the BLASTX-generated alignment, and the program that was used to generate the gene model. The annotation string was transferred onto the corresponding blueberry gene model by replacing field 14 in the BED-detail file described above. Field 14 data for gene models that did not have high-quality matches meeting the criteria described above were not modified. The modified gene model file was added to the bitbucket repository subdirectory titled BlastxAnalysis/results as V_corymbosum_scaffold_May_2013_wDescr.bed.gz.
Functional annotation of gene models using Blast2GO
Blast2Go version 2.7.0 (Build 05122013) was obtained via Java Web Start and used to associate Gene Ontology (GO) terms and Enzyme Commission (EC) numbers with individual gene models from blueberry. BLASTP and InterProScan search results were loaded into Blast2GO and the Blast2GO function Annotation > Perform Annotation Step menu was used to perform GO annotation. BLASTP results were obtained by searching predicted blueberry protein sequences against the nr protein database using an e-value cutoff of 10-3 and reporting a maximum of five ‘hit’ sequences per query. IPRScan version 4.8 was run using default parameter settings. Databases current as of July 2013 were searched. MySQL databases (‘b2 g_sep13’) from Blast2Go.com were downloaded and installed on a local server to enable faster processing. Blast2GO results were saved in plain text format (Additional file 2) and in Blast2GO format and added to the bitbucket in a subdirectory named Blast2GO.
Functional annotation of gene models using BLASTX against PlantCyc enzymes
FASTA-format sequence files containing amino acid sequences for enzymes in Plant Metabolic Network species-specific databases (e.g., AraCyc and GrapeCyc) were downloaded in July 2013 and combined into one database. The blueberry cDNAs were used as queries in a BLASTX search of the combined database. Hits to PlantCyc enzymes were then filtered so that hits with at least 60% subject coverage, 45% identity or higher, and e-value of 0.001 or less were retained. Details of how this was done are explained in Markdown file MakingAnnotationFiles.Rmd in the bitbucket repository for the paper. The best hit for each blueberry gene model was identified by e-value and used to generate annotation text, which was inserted into field 14 of the gene models bed file and saved to subdirectory PlantCyc/results as V_corymosum_scaffold_May2013_wDescrPwy.bed. To enable keyword searching in Integrated Genome Browser, this annotation text included the best matching PlantCyc enzyme and a list of PlantCyc pathway identifiers when available.
Building and filtering blueberry gene models
Summary of sequencing strategies and sequences obtained
Number of reads (millions)
Single-end Illumina, 100 bp reads
Five fruit stages from O’Neal cultivar
Paired-end Illumina, 76 bp reads
Ripe, unripe berries from O’Neal, Lenoir, Ozark blue, and Pamlico cultivars
Ripe and unripe berries from 2009, 2010 harvests, O’Neal cultivar
To generate berry gene models from the RNA-Seq data, the Illumina sequence reads were aligned onto the May 2013 reference blueberry genome using the spliced alignment program TopHat  and then merged into gene models using Cufflinks . This step produced 64,666 transcript models representing 57,925 genes. Of the multi-exon genes, 24% were predicted to generate multiple transcript variants due to alternative promoter use or alternative splicing. To assess how realistic this level of transcript variation was, we compared the frequency of alternative transcripts in blueberry with alternative transcription rates in Arabidopsis and soybean, using the genome annotations and assembly releases that were available in July 2013 for those species. The alternative transcription rates among multi-exon genes for soybean and Arabidopsis were 25% and 26% respectively; these rates were similar to blueberry, indicating that the frequency of transcript variation found in the blueberry gene models was reasonable and not likely to be an artifact of incorrect transcript model assembly.
Genes predicted by Cufflinks RNA-Seq analysis or ab initio gene-finding programs
Number of genes
454 Scaffolds (454 RNA-Seq)
Next, an open reading frame (ORF)-finding program (TAU ) was used to identify and annotate a conceptual translation for each gene model. Altogether, there were 57,079 gene models from 51,515 genes that could be annotated with a protein sequence of at least 30 residues, the expected lower range of protein sizes based on Arabidopsis annotations. The remaining transcript models for which no longer protein sequence could be predicted had a broad range of sizes and exon numbers, ranging from short, single-exon genes to longer models with multiple exons, suggesting that in many cases, the gene models were simply incomplete or the sequence itself contained insertions, deletions or other problems that prevented conceptual translation. Another possibility was that many of these genes were non-coding genes whose primary products were RNA and not protein. A third possibility was that many of these were actually pseudogenes. An analysis of Arabidopsis RNA-Seq data  found that many non-coding genes, including both pseudogenes and non-protein coding genes, were detectable in libraries that were prepared using the same protocol as with the blueberry RNA-Seq libraries, and so these latter two possibilities may have been correct in many cases.
Once the genes are loaded, users can search for genes by name or by annotation key word using the ‘Quick Search’ (top left) or ‘Advanced Search’ tab; the latter also enables searching for sequence motifs. Gene models are associated with descriptive text assembled from homology searches against BLAST databases from NCBI and PlantCyc (described below), but users can investigate individual gene models in greater detail using the BLAST features available through IGB. Right-clicking a gene model displays a menu that displays options to run BLASTX or BLASTP searches against the nr protein database at NCBI. Selecting one of these options opens a web browser window, which shows results from the search. In addition to the BLAST search feature, the right-click menu (also called a ‘context menu’) offers the option to show the sequence of a genome model in a separate window, which in turn enables selecting and copying the protein sequence. Using this feature, users can easily copy and paste the protein sequence into other web-based search and analysis tools, such as the InterproScan search tool, which can identify conserved motifs in protein sequences.
Users can also load the RNA-Seq datasets and use the observed pattern of expression to gain insights into the function of individual genes. To illustrate, the image shown in Figure 1 depicts a region from scaffold00001 showing gene models and RNA-Seq coverage graphs from five berry developmental stages. Because the overall read counts for each track were similar, one can visually estimate and compare expression levels for genes provided the coverage graphs are put onto the same scale, which is possible using the IGB ‘Graph’ tab. In this example, comparing graph heights between stages highlights ripening related expression of gene CUFF.187, which is homologous to genes in many other plant species but has no known function. The stage-specific expression of CUFF.187 demonstrates the power and usefulness of visualizing gene models alongside expression data within a visualization environment that also enables rapid exploration of genes and their functions.
Functional annotation of blueberry gene models and comparison with other plant species
Nonetheless, these results prompted us to explore relationships between blueberry and other plant genomes. For this, we used plant-specific RefSeq databases from GenBank, including databases for grape, tomato, Arabidopsis, and several other plant species where a close-to-complete, well-annotated proteome was available . To characterize relationships between blueberry and these other plants, we searched the blueberry transcripts against these close-to-complete plant RefSeq databases and identified the best scoring protein from each genome for each blueberry gene model. Figure 2B shows the distribution of percent identity scores obtained for the best-matching sequences from each species. Grape proteins had the highest median percent identity scores, followed by poplar, castor bean, and tomato. Blueberry and other Ericales species are part of the asterid clade of flowering plants, and are more closely related phylogenetically to tomato than to rosids Populus trichocarpa (poplar) and Ricinus communis (castor bean) . The BLAST results do not contradict this relationship but instead highlight how sequence similarity may reflect similarities in physical or biochemical characteristics. Grape and blueberry are deciduous, berry-producing plants with long generation times. Poplar is also a deciduous, woody plant with a long generation time. Biochemical or morphological similarities that could explain the similarity between blueberry and castor bean are less obvious, however. These results suggest that in-depth comparison between grape, blueberry, and castor bean seed transcriptomes could lead to new insights into developmental programs at work in berries of the three species.
Metabolic pathway annotation using PlantCyc enzyme database
The PlantCyc database is a collection of curated and computationally predicted enzymes, enzymatic reactions, and metabolic pathways for 18 plant species, including grape, Arabidopsis, cassava, poplar, and rice . The PlantCyc databases are closely tied with the Pathway Tools software package , a visualization and database system for pathways and biochemical reactions. At the time of this writing, the pathways database for V. vinifera (GrapeCyc version 3.0) was one of the most complete, containing annotations for 432 pathways. The similarity between grape and blueberry sequences suggested that comparing the blueberry sequences to enzymes in GrapeCyc and other PlantCyc databases would identify candidate genes encoding enzymes of primary or secondary metabolism. To identify enzymatic functions for blueberry genes, the BLASTX algorithm was used to search for matching PlantCyc sequences. To eliminate matches arising from alignments between domains that occur in many different sequences (e.g., ATP-binding cassette), only matches that covered at least 60% of the subject sequence with at least 45% identity were considered. Under these criteria, transcripts from more than 7,100 blueberry genes were found to match at least one PlantCyc enzyme sequence, and there were over 450 pathways from PlantCyc that had at least one enzyme matching a blueberry sequence. As before, grape proteins were typically the best matches for blueberry sequences.
In-depth analysis of ethylene biosynthetic pathway gene expression
Ethylene is a gaseous plant hormone that controls many aspects of plant development and physiology, especially ripening. In climacteric fruits such as tomato and banana, a burst of ethylene biosynthesis triggers ripening, and post-harvest treatment with exogenous ethylene can control ripening onset and progression. By contrast, blueberries do not appear to undergo a burst of ethylene synthesis prior to ripening, and post-harvest application of ethylene has far less effect on the ripening process. Nonetheless, ethylene can influence aspects of flowering and ripening, as shown by experiments with ethephon, a horticultural chemical that when applied to foliage is enzymatically converted to ethylene. When applied to blueberry plants during the harvest, ethephon accelerates and synchronizes ripening, and when applied in the fall (autumn), ethephon delays flowering the following spring and increases the number of flower buds (reviewed in ). This suggests that ethylene is important in flowering and fruit development but its role is likely to be very different than in climacteric fruits.
Ethylene is synthesized from L-methionine via three reactions that are catalyzed by SAM synthase (methionine adenosyltransferase), ACC synthase (1-aminocyclopropane-1-carboxylate synthase), and ACC oxidase (1-aminocyclopropane-1-carboxylate oxidase). In most plants, large multi-gene families encode all three enzymes. The PlantCyc database reports that the grape genome contains 34 SAM synthase genes, 20 ACC synthase genes, and 12 ACC oxidase genes. Consistent with this, the PlantCyc-based annotation of blueberry identified similarly large numbers of genes encoding ethylene biosynthetic genes, including 22 genes encoding SAM synthase, six genes for ACC synthase, and seven genes encoding ACC oxidase (Additional file 1: Figure S2). Interestingly, the expression patterns for SAM synthases were highly variable and at least one SAM synthase gene was highly expressed (more than 1,000 reads per kilobase of transcript per million reads mapped, RPKM) in each stage of berry fruit development and ripening. Expression of genes encoding ACC synthase, a key control point for ethylene and ripening in tomato, was also highly variable. Some genes encoding ACC synthases were expressed at very low levels (close to 0 RPKM), while others were expressed at 200 RPKM or higher, which was around the 95th percentile of gene expression as measured in RPKM. Two ACC oxidase genes were expressed at 150 RPKM or higher at each stage, while others were expressed at much lower levels. Interestingly, one ACC oxidase gene (CUFF.81159) was highly expressed during ripening and reached more than 3,000 RPKM in ripe fruit. The uniformly high expression (>150 RKPM) of ethylene biosynthesis genes at every stage, combined with the extremely high expression of ACC oxidase gene CUFF.81159, suggested that ethylene is produced throughout berry fruit development and that these levels are likely to peak in ripe fruit.
Anthocyanin biosynthetic pathways
Anthocyanins are the 3-O-glycosylated forms of anthocyanidins, which consist of a polyphenolic ring substituted with -H, -OH, and -OCH3 groups at different positions. The substituted group and its location on the polyphenolic ring determine the type of anthocyanidin and may also dictate aspects of biological activity. In blueberry, the anthocyanins containing malvidin, delphinidin, and peonidin aglycones are especially abundant by weight . Another level of diversity arises from the type of sugars attached, and these sugar groups may influence bioavailability in the mammalian digestive tract [40,41]. Thus it is of interest to identify enzymes that catalyze steps in anthocyanin biosynthesis and profile their expression pattern and relative abundance in ripe fruit.
The other highly expressed gene was CUFF.43605, which reached a peak of around 2000 RPKM in pink and ripe fruit, up from around 900 RPKM in the earlier stages. The best PlantCyc match for CUFF.43605 was a protein from Brassica rapa (Bra019350, 78% identity) annotated as a leucocyanidin oxygenase, also called anthocyanidin synthase (ANS) or leucocyanidin dioxygenase (LDOX). Homologous enzymes from Arabidopsis [42,43] and rice  convert leucoanthocyanidin to anthocyanidins, precursors for anthocyanins, but they can also catalyze formation of other bioactive flavonoid precursors, notably dihydroquercetin, the precursor of quercetin. Anthocyanins and quercetin are both abundant in berries, but quercetin has greater bioavailability and therefore may be a more potent bioactive compound in berry fruit [45,46]. If the preferred end product of the CUFF.43605 reaction is indeed dihydroquercetin, then its remarkably high expression likely has a positive effect on quercetin concentration in berry fruit. However, if its major end product is cyanidin, then it likely acts to decrease quercetin levels by consuming leucoanthocyanidin, the dihydroquercetin precursor. As with CUFF.20951, sequence and expression analysis alone is likely insufficient to distinguish these possibilities. Nonetheless, the high expression of CUFF.43605 makes this gene a fruitful candidate for investigating genetic control of anthocyanin and quercetin abundance in berries.
Other pathways - bixin and dhurrin
Bixin is the primary component of annatto, a commonly used food dye collected from the seeds of Bixa orellano, which grows in the tropics and is also known as the lipstick plant. Annatto is used in folk medicine and for body decoration and has also been investigated as a plant-based treatment for diabetes , cancer , and microbial infections . We found that four blueberry genes matched two of three annotated enzymes of the pathway (Additional file 1: Figure S3). Expression of all four genes peaked in the mature green stage of fruit development, around the time when seeds were developing. Grape seed contains bixin , and, as discussed previously, the grape and blueberry proteomes were remarkably similar. These results suggested that blueberries may contain a bixin-like compound.
Blueberry genes similar to enzymes from a potentially harmful pathway were also found. Blast analysis identified putative blueberry homologs for each of three biosynthetic enzymes involved in synthesis of dhurrin in sorghum [51-53]; the blueberry genes shared between 45 and 60% identity with their putative homologs from sorghum. Dhurrin, which reaches up to 10% dry weight in sorghum seedlings, is one of a large class of cyanogenic glycoside (CG) defense compounds plants synthesize from amino acid precursors as a form of chemical warfare against insects and other herbivores [54,55]. Stored as inactive glycosides, mechanical damage to cells (such as from chewing) activates endogenous glycosidases that remove the sugar group, triggering production of toxic hydrogen cyanide (HCN) due to instability of the sugar-free aglycone or to the activity of other catabolic enzymes. Interestingly, the putative blueberry homologs were most highly expressed in the green fruit stages (Additional file 1: Figure S3D), suggesting that green berries synthesize a cyanogenic glycoside that discourages insects and mammals from eating unripe berries.
Homology searches also identified putative berry homologs of enzymes involved in two CG catabolic pathways, one that removes the glycosyl group from the CG leading to production of HCN, and another pathway that detoxifies CGs by converting them to harmless byproducts. Seven genes had significant similarity to enzymes in the cyanogenic catabolic pathway, including four genes resembling dhurrinase, which deglycosylates dhurrin, and three genes that were similar to hydroxynitrile lyase, which catabolizes the dhurrin aglycone to HCN. Similar to the biosynthetic enzymes, these catabolic, cyanogenic enzymes were most highly expressed in unripe fruit. We also identified a blueberry candidate gene encoding nitrilase 4, an enzyme that detoxifies dhurrin and possibly other cyanogenic glycosides by converting them to aspartic acid and asparagine . Searching with nitrilase 4 from A. thaliana (AtNIT4, AT5G22300) identified blueberry gene CUFF.32314, which shared 83% identity with the Arabidopsis protein. Plant nitrilase 4 enzymes are highly conserved , typically sharing 60 to 70% identify at the amino acid level, suggesting that CUFF.32314 indeed encodes nitrilase 4. The putative blueberry nitrilase CUFF.32314 gene was highly expressed in young fruit and ripe fruit, reaching more than 70 RPKM in ripe berries, but had much lower expression (<40 RPKM) in mature green berries. Thus the expression profile of the candidate gene involved in CG detoxification was roughly complementary to profiles of candidate genes involved in CG synthesis and cyanide release. This suggests that immature green berries produce CG compounds that discourage herbivory, but that berry NAT4-like genes then detoxify these CGs as part of the ripening process.
Differential gene expression during blueberry development and ripening
Up- and down-regulated genes between sample types
However, the transition from fully rounded green fruit to pink fruit involved by far the largest number of differences. The chief physical difference between the two stages was color; pink berries had some reddish color indicating the onset of ripening, but other than this, green and pink berries were similar. Gene Ontology enrichment analysis identified around 40 terms as being significantly enriched among genes differentially expressed between the green and pink fruit. Two terms were related to photosynthesis and likely reflected down-regulation of photosynthetic functions, despite the fact that when the pink berries were collected, they typically were only partly pigmented and much of the berry surface was still green. Of 43 genes annotated to the term ‘photosynthesis,’ 27 were differentially expressed and all were down-regulated in pink fruit. Other significant terms were related to metabolic pathway functions, including the terms metabolic process, catalytic activity, catabolic process, hydrolase activity, carbohydrate metabolic process, lipid metabolic process, transferase activity, and biosynthetic process. There was also enrichment in functions related to transcriptional regulation of gene expression, including chromatin binding, nucleic acid binding, sequence-specific DNA binding transcription factor activity, and signal transduction.
Multi-dimensional scaling was used to cluster samples according to their overall similarity of gene expression pattern (Figure 5B). Samples from similar stages formed clusters. Pink and ripe fruit formed a cluster, cup and pad stages formed another cluster, and mature green fruit formed another cluster separate from the others. Interestingly, one ripe fruit sample (plant 2-41, P1) clustered slightly distant from the other ripe fruit samples. Visualization of RNA-Seq coverage graphs in IGB showed that some genes were up- or down-regulated in this sample relative to the other ripe fruit samples. Genes that were down-regulated in P1 relative to the other ripe berry samples had functions related to cell wall degradation or modification, whereas genes that were up-regulated had functions related to stress responses. This suggests that the P1 ripe berry sample was undergoing a stress relative to the other samples. Nonetheless, the possible presence of a possible confounding factor related to stress did not significantly reduce power to identify DE genes between samples types.
A time series clustering program (STEM ) was used to identify groups of genes that varied in concert over time. Scaled, averaged expression values for genes found to be DE between any two fruit stages were provided as inputs to the STEM program, which identified several gene expression profiles depicted graphically in Figure 5C. Two of the most statistically significant profiles had complementary patterns (clusters labeled 1 and 6). The ‘early high’ profile (number 1 in Figure 5C) contained genes with high expression in early fruit stages and lower expression in later stages. Gene Ontology enrichment analysis showed that this early high cluster contained an unusually large number of genes annotated with GO terms related to photosynthesis, catalytic activity, kinase activity, cell cycle, and DNA binding (Additional file 4). The ‘late high’ profile (number 6), contained genes with expression patterns complementary to the early high profile, with genes reaching their peak expression in the pink and ripe fruit stages. The genes in this cluster were enriched with GO terms related to transport, sugar metabolism, and catalytic activity, but had a lower than expected proportion of genes annotated to term DNA metabolic process. Only 1% of DNA metabolic process genes were in the late high cluster. These two complementary profiles highlight the diverse biological processes underway in ripe, fully mature fruit versus immature, rapidly growing fruit [57,59-61]. In the early stages, cell growth and cell division are underway, while in later stages, sugars and other metabolites are being imported into the fruit and cell growth happens largely through cell expansion, not cell division.
The next two most significant clusters also had complementary patterns. The ‘green high’ (number 0) cluster included around 1,200 genes that peaked in expression in the mature green stage. This cluster contained unusually many genes annotated with terms related to metabolic processes, transport, DNA binding, and secondary metabolism, suggesting that the green fruit stage represents a developmental transition and also is rich in production of secondary metabolites (Additional file 5). The ‘green low’ cluster (number 2) contained around 1,500 genes but was significantly enriched with only one term: catalytic activity. Interestingly, genes annotated with this term were unusually prevalent among all genes in all four profiles, suggesting that fruit development is rich in biosynthetic processes. However, the nature of these biosynthetic processes varies among stages, beginning with photosynthetic processes in the early stages and transitioning to secondary metabolism in the later stages. Figure 5D-G show RNA-Seq coverage graphs for genes exemplifying the four profiles.
Transcript variation during fruit development and ripening
As described above, nearly one quarter of multi-exon genes were associated with multiple gene models corresponding to putative products of alternative promoters, alternative splicing, or alternative polyadenylation sites. Any or all three of these mechanisms of alternative transcription can play a role in developmental regulation of gene expression. To test whether alternative transcription was developmentally regulated during berry fruit development and ripening, we used the the CuffDiff program, a companion program to Cufflinks, to identify genes whose patterns of splicing, promoter use, or polyadenylation changed during the development time series. Cuffdiff reported more than 700 genes as undergoing some type of differential transcript variation between pairs of conditions. To assess the results, a subset of the highest confidence CuffDiff results was selected at random and manually inspected using IGB. RNA-Seq read alignments were loaded into the viewer alongside the gene models and read support for alternative transcripts was compared between samples. In most cases involving alternative splice sites, the region that was different between alternative transcripts had small numbers of supporting reads, typically fewer than ten reads per splice junction. By contrast, examples of alternative promoter use (Additional file 1: Figure S4A) and alternative 3′ ends (Additional file 1: Figure S4B) had much stronger support. In these instances, typically there were dozens of reads supporting one or both alternatives in each sample. Thus, the CuffDiff analysis was able to identify differential use of alternative promoters and alternative transcription sites but was not as good at distinguishing differential splicing.
Comparison with previous sequencing studies of blueberry
Before now, several studies have been published that used Sanger [64,17], 454 , or Illumina sequencing  to characterize blueberry genes and profile gene expression changes. Two studies used de novo transcriptome assembly procedures to assemble transcript models, which they then annotated using similar methods to those described here. At the time of writing, only a subset of the data reported in the Illumina-based study was publicly available , but the others have made their raw sequence reads available as part of dbEST or the Short Read Archive. To provide a resource to researchers and also assess the completeness of the new transcriptome assembly reported here, we aligned publicly available EST, 454 and Illumina sequences onto the blueberry assembly and compared the alignments to the non-redundant set of gene models generated by Cufflinks and the ab initio gene prediction programs. Of the 22,415 Sanger ESTs from dbEST, 19,967 (89%) were aligned onto the draft genome assembly. A comparable percentage of sequences from the much larger 454 sequencing study  were also aligned. More than 85% of these 454-based ESTs aligned to the genome, and around 75% of the 454 ESTs overlapped over one or more berry gene models. Similar results were obtained for data from the Illumina-based study . We obtained around 15.5 million read pairs from the Short Read Archive using object number SRR942391. Around 55.4% were mapped as proper pairs while another 13.28% were mapped as singletons. Of the mapped reads on the genome, more than 95% overlapped a blueberry gene model. Thus the non-redundant gene models presented here encompass the vast majority of previously generated sequences from blueberry.
Correspondence between blueberry flavonoid biosynthesis ESTs and blueberry gene models
JK655064, JK655458, JK664855, JK666169
JK664803, JK666412, JK661182, JK663155
R2R3 MYB transcription factor
(JK663426, JK666734, JK667006, JK667133, JK667135, JK655710 not found)
In general, the RNA-Seq results were consistent with the EST-based study. As shown in Additional file 1: Figure S5, RNA-Seq based expression patterns for genes encoding VcANR (anthocyanidin reductase), VcDFR (dihydroflavonol reductase), and VcUFGT (UDP-Glc:flavonoid-3-O-glycosyltransferase) were similar to those reported previously. Interestingly, the gene corresponding to VcUFGT was CUFF.20951, which was also the most highly expressed gene in the anthocyanin biosynthesis pathway according to the RNA-Seq data (Figure 4A). Zifkin et al.  also identified ESTs encoding a Myb family transcription factor with biphasic expression and showed it was able to activate an ANR promoter from poplar, demonstrating it is a likely regulator of the PA pathway in blueberry. The EST reported for this gene (JK664730) mapped to two different genes in the blueberry genome assembly, only one of which (CUFF.51789) exhibited the previously reported biphasic expression pattern. The other gene (CUFF.14288) had an expression profile more similar to the ‘high early, low late’ cluster 1 in Figure 5C, suggesting that it may co-regulate PA biosynthetic genes in early fruit stages.
In recent years high-throughput sequencing technologies have been used to investigate the transcriptomes of numerous plant species, including many for which no reference genome sequence is available (reviewed in ). As sequencing accuracy improves and the cost of sequencing drops, more such projects will become feasible. However, making use of these new data can be challenging when a reference genome is not available to guide the assembly of sequence reads into contigs that are long enough to support functional annotation. Previously, we attempted de novo assembly of RNA-Seq reads, but technical difficulties related to the large amount of repetitive sequences in the data hindered these attempts (data not shown). Although the blueberry genome assembly used here is a draft, it was nonetheless complete enough to enable formation of high-quality gene models that could be annotated with GO terms, pathways, and protein homologies. Moreover, the gene models obtained were similar in structure and number to those of many other well-annotated transcriptomes, with one exception, which was that many apparently single-exon genes with no convincing homology to known protein sequences were obtained. Determination of the function of these single-exon, apparently non-protein-coding genes is beyond the scope of the current study, but it seems likely that many of these genes may represent long non-coding RNAs, pseudogenes, miRNAs, or other genes whose primary product is RNA and not protein. This possibility does not seem unlikely as other non-coding RNAs as well as pseudogenes have been observed to be expressed in other similarly prepared RNA-Seq libraries [29,67].
Interestingly, blueberry sequences had highest overall percent identity to sequences from the grape vine V. vinifera, a basal eudicot with a slow mutation rate. The next most similar species in terms of percent identity was the castor bean R. communis, followed by poplar, and then by tomato, the only other asterid species in the RefSeq database at the time this analysis was done. The 454-based transcriptome analysis published by Rowland and co-workers observed an almost identical result , as did another group that annotated a de novo transcriptome assembly for the tea plant C. sinensis, another Ericales species . These results seem to suggest that blueberry and perhaps even other Ericales species are closer to grape than to tomato; however, sequence similarity does not necessarily imply phylogenetic closeness, and a more likely explanation is that grape has diverged at a slower rate than tomato and so is more similar to the last common ancestor of grape and the Ericacae. It is also a woody, deciduous plant that makes berry-like fruit and so shares some morphological characteristics with blueberry. The same explanation may also apply to poplar, also a slow-evolving, woody, deciduous plant. However, neither of these explanations serve to explain why castor been, an herbaceous plant with a short generation time, appears to be more closely related to blueberry at the sequence level. More recently, phylogenetic analysis of mitochondrial and chloroplast sequences from cranberry (Vaccinium macrocarpon) confirmed that cranberry belongs in the asterid grouping and is clearly more closely related to other asterids than grape . It is beyond the scope of this paper to determine correct placement of blueberry relative to these other groups. Instead, the data provided here may provide a means for experts in phylogenetic relationships to investigate these and other questions related to the evolution of flowering plants.
Another important contribution is that for the first time, we provide necessary resources for performing high-throughput gene expression studies in blueberry. The gene models provided here are a key resource for processing and analysis of RNA-Seq expression data, and to maximize their usefulness, we made the RNA-Seq data analysis described here available as part of an open source repository of data files and source code . To complement the gene structures, we provided Gene Ontology and PlantCyc-based enzyme annotations for more than half the protein-coding gene models. The GO and enzyme annotations will provide critical resources for future studies, especially high-throughput studies such as RNA-Seq in which statistical analysis typically identifies many hundreds or even thousands of differentially expressed genes. As shown here, being able to identify categories of over- or under-represented in large gene lists enables deeper understanding of biological differences. Moreover, the pathway annotations may further enable interpretation as well as functional prediction. Prior analysis of the AraCyc database together with a compendium of microarray expression data showed that enzymes belonging to the same metabolic pathway are often highly co-expressed, a property that enables identification of missing players in metabolic or regulatory pathways through large-scale analysis of expression data [70,71]. This approach has been used in many studies that used co-expression to identify candidate genes, leading to a deeper understanding of genes involved in metabolic pathways. Therefore, one important long-term result from the RNA-Seq analysis presented here, along with the pathway predictions, is that we have laid the groundwork for further studies identifying other enzymes in high value pathways. This work may be particularly relevant to anthocyaninin biosynthesis, since different types of anthocyanins have different effects on mammalian biology, possibly due to differences in bioavailability. Another benefit of these data is that because the RNA-Seq data were from an autotetraploid blueberry, aligned onto a diploid genome sequence, it may be possible to use these data to investigate allele-specific gene expression, using SNPs or other polymorphisms to distinguish alleles. This and other studies may become possible thanks to the databases and datasets presented there, and our future work will focus on improving these resources to enable better understanding of metabolic pathways that are active in blueberry.
Analysis code availability
R Markdown files, python scripts, and shell scripts used in data analysis and data processing are version-controlled in a git-based, publicly accessible repository . The repository contains files documenting which subsections of the repository were used in the analyses described above. Note that we will continue to update and modify the code repository to meet the needs of users; however, readers interested in retrieving older versions of the repository that existed at the time of publication can do so using the git program, which is well documented at many sites.
The number of reads per gene was counted using the samtools view -c command. All single-mapping reads that overlapped a gene region were counted, and gene regions were defined as the smallest start and largest end position of transcripts annotated to the gene. For each comparison, the method ‘exactTest’ with tagwise dispersion from Bioconductor library edgeR  was used to identify differentially expressed genes. False discovery rate for each comparison was calculated using the p.adjust method in R, with option ‘BH’ for Benjamini and Hochberg. For clarity, all results reported here applied an FDR cutoff of 0.001. Code used to perform differential expression analysis is available in the project bitbucket repository; readers interested in exploring other cutoff options can obtain copies of source code files, edit parameters in the file, and re-run the analysis. Gene Ontology terms that were unusually enriched among differentially expressed genes were identified using GOSeq , also from Bioconductor.
Analysis of alternative splicing with CuffDiff
The non-redundant gene set was used as input to cuffcompare, which produced a file (cuffcmp.combined.gtf) suitable for use as input to cuffdiff. The cuffdiff was then run using the full set of alignments (both multi- and single-mapping) obtained for the blueberry development and ripening time course dataset with cuffcomp.combined.gtf file as reference gene models. For analysis of splicing differences, the file splicing.diff output by cuffdiff was used. It was observed that the most significant splicing differences reported by cuffdiff all had the same low p-value (10-5) and so these were selected for further analysis. A random subset of the splicing differences from genes that were annotated by the BLASTX step above was selected for visual inspection in IGB.
Analysis of alternative splicing with ArabiTag
The non-redundant gene set, BAM files with read alignments, and junction files produced by the FindJunctions program (Integrated Genome Browser tools package) were provided to the AltSpliceAnalysis software , which is based on the ArabiTag algorithm described in . The software identifies annotated alternative splicing events and then uses the RNA-Seq data to count reads supporting alternative splicing choices associated with each event. Data analysis code written in R was used to calculate the percentage of support (%S) for each variant, using gapped reads to support alternative donor or acceptor sites and non-gapped reads to support retention of introns. To identify differential splicing between conditions, a t-test was performed on splicing scores. Differential splicing was tabulated and highest confidence differences were manually inspected using Integrated Genome Browser.
Availability and requirements
Project name: BlueberryGenome.
Project home page:https://bitbucket.org/lorainelab/blueberrygenome.
Operating system(s): Platform independent.
Programming language: R, python.
Other requirements: RStudio, 0.98.507 or higher, R 3.0.3 or higher.
License: The MIT License.
Availability of supporting data
RNA-Seq data are freely available for visualization in Integrated Genome Browser, and analysis code is available from a publicly available git repository .
All sequence data are available in the Short Read Archive under accessions SRP039977 and SRP039971. Files containing alignments, RNA-Seq coverage graphs, and output from TopHat2 are available from a publicly accessible IGBQuickLoad site  configured to serve data for visualization in Integrated Genome Browser. Data files including gene models and related annotations are available from the git source code repository. The git source code repository also contains R code used to analyze data. Archived snapshots of these files are also available from the GigaScience repository, GigaDB .
Basic Local Alignment Search Tool
Basic Local Alignment Search Tool protein-protein search
Basic Local Alignment Search Tool search against protein databases using translated nucleotide query
Complementary deoxyribonucleic acid
Expressed Sequence Tag
Genomic Mapping and Alignment Program
Integrated Genome Browser
National Center for Biotechnology Information
Reads per kilobase of transcript per millions mapped
- SAM synthase:
Jim Ballington and the North Carolina Department of Agriculture Piedmont Research Station provided blueberry plants. Lisa J Rowland donated genomic DNA used in genomic sequencing. Shira Stav and Tyler Estrada helped collect berries used to prepare libraries for RNA sequencing. We thank Nate Watson, Ra’ad Ghabaraih, Adam Baxter, and Cory Brouwer for contributions to early de novo transcriptome assembly efforts not reported here. A grant (to AEL) from UNC General Administration provided financial support. Integrated Genome Browser software used in the study was supported by NIH grant 1R01RR032048-01 (to AEL). We also express our thanks to the reviewers whose suggestions enabled us to improve the manuscript.
- Babu PV, Liu D, Gilbert ER. Recent advances in understanding the anti-diabetic actions of dietary flavonoids. J Nutr Biochem. 2013;24(11):1777–89.View ArticlePubMedGoogle Scholar
- Del Rio D, Rodriguez-Mateos A, Spencer JP, Tognolini M, Borges G, Crozier A. Dietary (poly)phenolics in human health: structures, bioavailability, and evidence of protective effects against chronic diseases. Antioxid Redox Signal. 2013;18(14):1818–92.View ArticlePubMedPubMed CentralGoogle Scholar
- Vauzour D. Dietary polyphenols as modulators of brain functions: biological actions and molecular mechanisms underpinning their beneficial effects. Oxid Med Cell Longev. 2012;2012:914273.View ArticlePubMedPubMed CentralGoogle Scholar
- US Department of Agriculture. USDA database for the flavonoid content of selected foods, Release 3.1. http://www.ars.usda.gov/nutrientdata/flav. Accessed 14 Jan 2015.
- Yousef GG, Brown AF, Funakoshi Y, Mbeunkui F, Grace MH, Ballington JR, et al. Efficient quantification of the health-relevant anthocyanin and phenolic acid profiles in commercial cultivars and breeding selections of blueberries (Vaccinium spp.). J Agric Food Chem. 2013;61(20):4806–15.View ArticlePubMedGoogle Scholar
- Routray W, Orsat V. Blueberries and their anthocyanins: factors affecting biosynthesis and properties. Compr Rev Food Sci Food Saf. 2011;10(6):303–20.View ArticleGoogle Scholar
- Grace MH, Ribnicky DM, Kuhn P, Poulev A, Logendra S, Yousef GG, et al. Hypoglycemic activity of a novel anthocyanin-rich formulation from lowbush blueberry, Vaccinium angustifolium Aiton. Phytomedicine. 2009;16(5):406–15.View ArticlePubMedPubMed CentralGoogle Scholar
- Kay CD, Hooper L, Kroon PA, Rimm EB, Cassidy A. Relative impact of flavonoid composition, dose and structure on vascular function: a systematic review of randomised controlled trials of flavonoid-rich food products. Mol Nutr Food Res. 2012;56(11):1605–16.View ArticlePubMedGoogle Scholar
- Prior RL, Wu X. Anthocyanins: structural characteristics that result in unique metabolic patterns and biological activities. Free Radic Res. 2006;40(10):1014–28.View ArticlePubMedGoogle Scholar
- Ansari N, Khodagholi F. Natural products as promising drug candidates for the treatment of Alzheimer’s disease: molecular mechanism aspect. Curr Neuropharmacol. 2013;11(4):414–29.View ArticlePubMedPubMed CentralGoogle Scholar
- Spagnuolo C, Cerella C, Russo M, Chateauvieux S, Diederich M, Russo GL. Quercetin downregulates Mcl-1 by acting on mRNA stability and protein degradation. Br J Cancer. 2011;105(2):221–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Rimando AM, Kalt W, Magee JB, Dewey J, Ballington JR. Resveratrol, pterostilbene, and piceatannol in vaccinium berries. J Agric Food Chem. 2004;52(15):4713–9.View ArticlePubMedGoogle Scholar
- Galletta G, Ballington J. Blueberries, cranberries, and lingonberries. In: Janick J, Moore J, editors. Fruit breeding vol. II. Vine and small fruits. NY: Wiley; 1996. p. 1–107.Google Scholar
- Costich DE, Ortiz R, Meagher TR, Bruederle LP, Vorsa N. Determination of ploidy devel and nuclear DNA content in blueberry by flow cytometry. Theor Appl Genet. 1992;83:1001–6.Google Scholar
- Bian Y, Ballington J, Raja A, Brouwer C, Reid R, Burke M, et al. Patterns of simple sequence repeats in cultivated blueberries (Vaccinium section Cyanococcus spp.) and their use in revealing genetic diversity and population structure. Mol Breeding. 2014;34(2):675–89.View ArticleGoogle Scholar
- Rowland LJ, Alkharouf N, Darwish O, Ogden EL, Polashock JJ, Bassil NV, et al. Generation and analysis of blueberry transcriptome sequences from leaves, developing fruit, and flower buds from cold acclimation through deacclimation. BMC Plant Biol. 2012;12:46.View ArticlePubMedPubMed CentralGoogle Scholar
- Zifkin M, Jin A, Ozga JA, Zaharia LI, Schernthaner JP, Gesell A, et al. Gene expression and metabolite profiling of developing highbush blueberry fruit indicates transcriptional regulation of flavonoid metabolism and activation of abscisic acid metabolism. Plant Physiol. 2012;158(1):200–24.View ArticlePubMedGoogle Scholar
- Blankenberg D, Gordon A, Von Kuster G, Coraor N, Taylor J, Nekrutenko A. Manipulation of FASTQ data with Galaxy. Bioinformatics. 2010;26(14):1783–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21(9):1859–75.View ArticlePubMedGoogle Scholar
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7(3):562–78.View ArticlePubMedPubMed CentralGoogle Scholar
- Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004;32(Web Server issue):W309–12.View ArticlePubMedPubMed CentralGoogle Scholar
- Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20(16):2878–9.View ArticlePubMedGoogle Scholar
- Borodovsky M, Lomsadze A. Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr Protoc Bioinformatics. 2011;Chapter 4:Unit 4.6.1–10.PubMedGoogle Scholar
- Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, et al. Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2009;20(1):45–58.View ArticlePubMedGoogle Scholar
- Source code repository for blueberry RNA-Seq data analysis and genomic studies. http://www.bitbucket.org/lorainelab/blueberrygenome.
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Loraine AE, McCormick S, Estrada A, Patel K, Qin P. RNA-Seq of Arabidopsis pollen uncovers novel transcription and alternative splicing. Plant Physiol. 2013;162(2):1092–109.View ArticlePubMedPubMed CentralGoogle Scholar
- Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ. Understanding angiosperm diversification using small and large phylogenetic trees. Am J Bot. 2011;98(3):404–14.View ArticlePubMedGoogle Scholar
- Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007;449(7161):463–7.View ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2011;40(Database issue):D130–5.PubMedPubMed CentralGoogle Scholar
- Group TAP. An update of the angiosperm phlogeny group classification for the orders and families of flowering plants: APG III. Bot J Linn Soc. 2009;161(2):105–21.View ArticleGoogle Scholar
- Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005;21(18):3674–6.View ArticlePubMedGoogle Scholar
- Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012;40(Database issue):D306–12.View ArticlePubMedGoogle Scholar
- Mitsuda N, Ohme-Takagi M. Functional analysis of transcription factors in Arabidopsis. Plant Cell Physiol. 2009;50(7):1232–48.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang P, Dreher K, Karthikeyan A, Chi A, Pujar A, Caspi R, et al. Creation of a genome-wide metabolic pathway database for Populus trichocarpa using a new approach for reconstruction and curation of metabolic pathways for plants. Plant Physiol. 2010;153(4):1479–91.View ArticlePubMedPubMed CentralGoogle Scholar
- Karp PD, Paley SM, Krummenacker M, Latendresse M, Dale JM, Lee TJ, et al. Pathway Tools version 13.0: integrated software for pathway/genome informatics and systems biology. Brief Bioinform. 2009;11(1):40–79.View ArticlePubMedPubMed CentralGoogle Scholar
- Retamales JB, Hancock JF. Blueberries. Crop production science in horticulture series, vol 21. Wallingford, Oxfordshire, UK. Cambridge, MA: CABI; 2012.Google Scholar
- McGhie TK, Walton MC. The bioavailability and absorption of anthocyanins: towards a better understanding. Mol Nutr Food Res. 2007;51(6):702–13.View ArticlePubMedGoogle Scholar
- Yi W, Akoh CC, Fischer J, Krewer G. Absorption of anthocyanins from blueberry extracts by caco-2 human intestinal cell monolayers. J Agric Food Chem. 2006;54(15):5651–8.View ArticlePubMedGoogle Scholar
- Wilmouth RC, Turnbull JJ, Welford RW, Clifton IJ, Prescott AG, Schofield CJ. Structure and mechanism of anthocyanidin synthase from Arabidopsis thaliana. Structure. 2002;10(1):93–103.View ArticlePubMedGoogle Scholar
- Turnbull JJ, Nagle MJ, Seibel JF, Welford RW, Grant GH, Schofield CJ. The C-4 stereochemistry of leucocyanidin substrates for anthocyanidin synthase affects product selectivity. Bioorg Med Chem Lett. 2003;13(21):3853–7.View ArticlePubMedGoogle Scholar
- Reddy AM, Reddy VS, Scheffler BE, Wienand U, Reddy AR. Novel transgenic rice overexpressing anthocyanidin synthase accumulates a mixture of flavonoids leading to an increased antioxidant potential. Metab Eng. 2007;9(1):95–111.View ArticlePubMedGoogle Scholar
- Erlund I, Freese R, Marniemi J, Hakala P, Alfthan G. Bioavailability of quercetin from berries and the diet. Nutr Cancer. 2006;54(1):13–7.View ArticlePubMedGoogle Scholar
- Manach C, Williamson G, Morand C, Scalbert A, Remesy C. Bioavailability and bioefficacy of polyphenols in humans. I. Review of 97 bioavailability studies. Am J Clin Nutr. 2005;81(1 Suppl):230S–42.PubMedGoogle Scholar
- Takahashi N, Goto T, Taimatsu A, Egawa K, Katoh S, Kusudo T, et al. Bixin regulates mRNA expression involved in adipogenesis and enhances insulin sensitivity in 3T3-L1 adipocytes through PPARgamma activation. Biochem Biophys Res Commun. 2009;390(4):1372–6.View ArticlePubMedGoogle Scholar
- Tibodeau JD, Isham CR, Bible KC. Annatto constituent cis-bixin has selective antimyeloma effects mediated by oxidative stress and associated with inhibition of thioredoxin and thioredoxin reductase. Antioxid Redox Signal. 2010;13(7):987–97.View ArticlePubMedPubMed CentralGoogle Scholar
- Galindo-Cuspinera V, Rankin SA. Bioautography and chemical characterization of antimicrobial compound(s) in commercial water-soluble annatto extracts. J Agric Food Chem. 2005;53(7):2524–9.View ArticlePubMedGoogle Scholar
- Siva R, Doss FP, Kundu K, Satyanarayana VSV, Kumar V. Molecular characterization of bixin—An important industrial product. Ind Crop Prod. 2010;32(1):48–53.View ArticleGoogle Scholar
- Busk PK, Moller BL. Dhurrin synthesis in sorghum is regulated at the transcriptional level and induced by nitrogen fertilization in older plants. Plant Physiol. 2002;129(3):1222–31.View ArticlePubMedPubMed CentralGoogle Scholar
- Tattersall DB, Bak S, Jones PR, Olsen CE, Nielsen JK, Hansen ML, et al. Resistance to an herbivore through engineered cyanogenic glucoside synthesis. Science. 2001;293(5536):1826–8.View ArticlePubMedGoogle Scholar
- Krothapalli K, Buescher EM, Li X, Brown E, Chapple C, Dilkes BP, et al. Forward genetics by genome sequencing reveals that rapid cyanide release deters insect herbivory of Sorghum bicolor. Genetics. 2013;195(2):309–18.View ArticlePubMedPubMed CentralGoogle Scholar
- Vetter J. Plant cyanogenic glycosides. Toxicon. 2000;38(1):11–36.View ArticlePubMedGoogle Scholar
- Barceloux DG. Cyanogenic foods (cassava, fruit kernels, and cycad seeds). DM Dis Mon. 2009;55(6):336–52.View ArticlePubMedGoogle Scholar
- Jenrich R, Trompetter I, Bak S, Olsen CE, Moller BL, Piotrowski M. Evolution of heteromeric nitrilase complexes in Poaceae with new functions in nitrile metabolism. Proc Natl Acad Sci U S A. 2007;104(47):18848–53.View ArticlePubMedPubMed CentralGoogle Scholar
- Gillaspy G, Ben-David H, Gruissem W. Fruits: a developmental perspective. Plant Cell. 1993;5(10):1439–51.View ArticlePubMedPubMed CentralGoogle Scholar
- Ernst J, Bar-Joseph Z. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics. 2006;7:191.View ArticlePubMedPubMed CentralGoogle Scholar
- Birkhold KT, Koch K, Darnell RL. Carbon and nitrogen economy of developing rabbiteye blueberry fruit. JASHS. 1992;117(1):139–45.Google Scholar
- Castrejón ADR, Eichholz I, Rohn S, Kroh LW, Huyskens-Keil S. Phenolic profile and antioxidant activity of highbush blueberry (Vaccinium corymbosum L.) during fruit maturation and ripening. Food Chem. 2008;109(3):564–72.View ArticleGoogle Scholar
- Jaakola L, Maatta K, Pirttila AM, Torronen R, Karenlampi S, Hohtola A. Expression of genes involved in anthocyanin biosynthesis in relation to anthocyanin, proanthocyanidin, and flavonol levels during bilberry fruit development. Plant Physiol. 2002;130(2):729–39.View ArticlePubMedPubMed CentralGoogle Scholar
- English AC, Patel KS, Loraine AE. Prevalence of alternative splicing choices in Arabidopsis thaliana. BMC Plant Biol. 2010;10:102.View ArticlePubMedPubMed CentralGoogle Scholar
- Gulledge AA, Roberts AD, Vora H, Patel K, Loraine AE. Mining Arabidopsis thaliana RNA-seq data with Integrated Genome Browser reveals stress-induced alternative splicing of the putative splicing regulator SR45a. Am J Bot. 2012;99(2):219–31.View ArticlePubMedGoogle Scholar
- Dhanaraj AL, Alkharouf NW, Beard HS, Chouikha IB, Matthews BF, Wei H, et al. Major differences observed in transcript profiles of blueberry during cold acclimation under field and cold room conditions. Planta. 2007;225(3):735–51.View ArticlePubMedGoogle Scholar
- Li X, Sun H, Pei J, Dong Y, Wang F, Chen H, et al. De novo sequencing and comparative analysis of the blueberry transcriptome to discover putative genes related to antioxidants. Gene. 2012;511(1):54–61.View ArticlePubMedGoogle Scholar
- Strickler SR, Bombarely A, Mueller LA. Designing a transcriptome next-generation sequencing project for a nonmodel plant species. Am J Bot. 2012;99(2):257–66.View ArticlePubMedGoogle Scholar
- Bhargava A, Clabaugh I, To JP, Maxwell BB, Chiang YH, Schaller GE, et al. Identification of cytokinin-responsive genes using microarray meta-analysis and RNA-Seq in Arabidopsis. Plant Physiol. 2013;162(1):272–94.View ArticlePubMedPubMed CentralGoogle Scholar
- Shi CY, Yang H, Wei CL, Yu O, Zhang ZZ, Jiang CJ, et al. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds. BMC Genomics. 2011;12:131.View ArticlePubMedPubMed CentralGoogle Scholar
- Polashock J, Zelzion E, Fajardo D, Zalapa J, Georgi L, Bhattacharya D, et al. The American cranberry: first insights into the whole genome of a species adapted to bog habitat. BMC Plant Biol. 2014;14:165.View ArticlePubMedPubMed CentralGoogle Scholar
- Srinivasasainagendra V, Page GP, Mehta T, Coulibaly I, Loraine AE. CressExpress: a tool for large-scale mining of expression data from Arabidopsis. Plant Physiol. 2008;147(3):1004–16.View ArticlePubMedPubMed CentralGoogle Scholar
- Wei H, Persson S, Mehta T, Srinivasasainagendra V, Chen L, Page GP, et al. Transcriptional coordination of the metabolic network in Arabidopsis. Plant Physiol. 2006;142(2):762–74.View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.View ArticlePubMedGoogle Scholar
- Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010;11(2):R14.View ArticlePubMedPubMed CentralGoogle Scholar
- Alternative splicing analysis source code repository. http://www.bitbucket.org/lorainelab/altspliceanalysis.
- Blueberry IGBQuickLoad site. http://www.igbquickload.org/blueberry.
- Gupta V, Estrada AD, Blakley IC, Reid R, Patel K, Meyer MD, et al. Supporting material for RNA-Seq analysis and annotation of a draft blueberry genome assembly. GigaScience Database 2015. http://dx.doi.org/10.5524/100117
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.