TY - JOUR
T1 - A sequence alignment-independent method for protein classification
AU - Vries, John K.
AU - Munshi, Rajan
AU - Tobi, Dror
AU - Klein-Seetharaman, Judith
AU - Benos, Panayiotis V.
AU - Bahar, Ivet
N1 - Funding Information:
The software developed for n-gram classification by JKV was supported by the National Institute of Standards and Technology (Advanced Technology Program grant number 70NANBOH3058). The work of PVB was supported by NSF grant number MCB0316255 and intramural funds of the Center for Computational Biology and Bioinformatics, the University of Pittsburgh Cancer Institute, and the Department of Human Genetics. This research was also supported by Information Technology grant number 0225636 from the National Science Foundation.
PY - 2004
Y1 - 2004
N2 - Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (204) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
AB - Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (204) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
UR - http://www.scopus.com/inward/record.url?scp=22744437666&partnerID=8YFLogxK
U2 - 10.2165/00822942-200403020-00008
DO - 10.2165/00822942-200403020-00008
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
C2 - 15693739
AN - SCOPUS:22744437666
SN - 1175-5636
VL - 3
SP - 137
EP - 148
JO - Applied Bioinformatics
JF - Applied Bioinformatics
IS - 2-3
ER -