Discovery and annotation of small proteins using genomics, proteomics, and computational approaches
Xiaohan Yang, Timothy J. Tschaplinski, Gregory B. Hurst, Sara Jawdy, Paul E. Abraham, Patricia K. Lankford, Rachel M. Adams, Manesh B. Shah, Robert L. Hettich, Erika Lindquist, Udaya C. Kalluri, Lee E. Gunter, Christa Pennacchio, and Gerald A. Tuskan
2011 March 02, Genome Res. 2011. 21: 634-641
Small proteins (10–200 amino acids [aa] in length) encoded by short open reading frames (sORFs) play important regulatory roles in various biological processes, including tumor progression, stress response, flowering, and hormone signaling. However, ab initio discovery of small proteins has been relatively overlooked. Recent advances in deep transcriptome sequencing make it possible to efficiently identify sORFs at the genome level. In this study, we obtained ∼2.6 million expressed sequence tag (EST) reads from Populus deltoides leaf transcriptome and reconstructed full-length transcripts from the EST sequences. We identified an initial set of 12,852 sORFs encoding proteins of 10–200 aa in length. Three computational approaches were then used to enrich for bona fide protein-coding sORFs from the initial sORF set: (1) coding-potential prediction, (2) evolutionary conservation between P. deltoides and other plant species, and (3) gene family clustering within P. deltoides. As a result, a high-confidence sORF candidate set containing 1469 genes was obtained. Analysis of the protein domains, non-protein-coding RNA motifs, sequence length distribution, and protein mass spectrometry data supported this high-confidence sORF set. In the high-confidence sORF candidate set, known protein domains were identified in 1282 genes (higher-confidence sORF candidate set), out of which 611 genes, designated as highest-confidence candidate sORF set, were supported by proteomics data. Of the 611 highest-confidence candidate sORF genes, 56 were new to the current Populus genome annotation. This study not only demonstrates that there are potential sORF candidates to be annotated in sequenced genomes, but also presents an efficient strategy for discovery of sORFs in species with no genome annotation yet available.
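The length criterion for the initial sORF set (proteins of 10–200 aa) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function name `find_sorfs`, the ATG-to-stop ORF definition, and the restriction to the three forward-strand frames are all assumptions made for illustration only.

```python
# Illustrative sketch only: scans the three forward reading frames of a
# transcript for ATG..stop ORFs and keeps those encoding 10-200 aa.
# The ORF definition and the function name are assumptions, not the
# authors' method.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, min_aa=10, max_aa=200):
    """Return (start, end, aa_length) tuples for candidate sORFs.

    start is the 0-based index of the ATG; end is the index just past
    the stop codon; aa_length counts codons from ATG up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                aa_len = (i - start) // 3
                if min_aa <= aa_len <= max_aa:
                    hits.append((start, i + 3, aa_len))
                start = None  # look for the next ORF in this frame
    return hits


# Toy transcript: one 12-codon ORF (ATG + 11 x GCT) followed by a stop.
toy = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
candidates = find_sorfs(toy)
```

A real analysis would also need to handle the reverse strand, incomplete ORFs at transcript ends, and overlapping candidates, none of which this sketch attempts.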
P. deltoides small protein-coding candidate genes enriched from transcription units. (A) Number of genes in different sORF candidate subsets. (B) Proportion of the sORF subsets having known protein domains detected by InterProScan. Subset A contains the sORF candidates with high protein-coding potential predicted using known proteins as training sequences. Subset B contains sORF candidates conserved between P. deltoides and at least one other plant species. Subset C contains sORF candidates clustered into families. (Initial) The initial sORF candidate set (Fig. 1). (AB) The intersection of Subsets A and B. (ABC) (i.e., the high-confidence sORF candidate set) The intersection of Subsets A, B, and C. The value in parentheses represents the number of sORFs in each individual subset.
Protein domain annotation of sORF candidates. (A) Venn diagram showing the number of sORF candidates in four different subsets and their intersections. (B) Proportion of the sORF subsets having protein mass spectrometry data support. The Initial set, Subsets A, B, C, AB, and ABC are as described in Figure 2. Subset D contains sORF candidates with known protein domains detected by InterProScan. (ABCD) (i.e., the higher-confidence sORF candidate set) The intersection of Subsets A, B, C, and D. The value in parentheses represents the number of sORFs in each individual subset.
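The subset logic described in these captions (AB, ABC, and ABCD as successive intersections, plus the proportion of a subset with supporting evidence) maps directly onto set operations. The sketch below uses made-up identifiers; the function names and the toy data are hypothetical and carry no values from the study.

```python
# Illustrative sketch only: sORF identifiers and function names are
# hypothetical; no data from the study are reproduced here.

def intersect_subsets(subset_a, subset_b, subset_c, subset_d):
    """Return the AB, ABC, and ABCD intersections as sets of sORF IDs."""
    ab = subset_a & subset_b    # high coding potential AND conserved
    abc = ab & subset_c         # ... AND clustered into a family
    abcd = abc & subset_d       # ... AND carrying a known protein domain
    return ab, abc, abcd


def proportion_supported(subset, supported):
    """Fraction of a subset that also appears in a supporting set."""
    if not subset:
        return 0.0
    return len(subset & supported) / len(subset)


# Toy identifiers standing in for sORF candidates.
A = {"s1", "s2", "s3", "s4"}
B = {"s2", "s3", "s4", "s5"}
C = {"s3", "s4", "s6"}
D = {"s4", "s7"}

ab, abc, abcd = intersect_subsets(A, B, C, D)
fraction_abc_with_domain = proportion_supported(abc, D)
```

The same `proportion_supported` helper applies to either kind of evidence named in the captions (known protein domains or mass spectrometry support), since both reduce to membership in a supporting set.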
Venn diagram showing the number of sORFs from the 611-sORF set with the highest confidence that were detected in P. deltoides leaf, phloem, and xylem tissue based on analysis of the trypsin-digested whole proteome using two-dimensional HPLC interfaced with tandem mass spectrometry.
Yang X, Tschaplinski TJ, Hurst GB, Jawdy S, Abraham PE, Lankford PK, Adams RM, Shah MB, Hettich RL, Lindquist E, Kalluri UC, Gunter LE, Pennacchio C, Tuskan GA. Discovery and annotation of small proteins using genomics, proteomics, and computational approaches. Genome Res. 2011 Apr;21(4):634-41. Epub 2011 Mar 2. PubMed PMID: 21367939; PubMed Central PMCID: PMC3065711.