The USC Andrew and Erna Viterbi School of Engineering, USC Signal and Image Processing Institute, USC Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California

Technical Report USC-SIPI-385

“Prediction Modeling and Statistical Analysis for Amino Acid Substitutions”

by Hua Yang

December 2006

Classifying and predicting amino acid substitutions is important in pharmaceutical and pathological research. We proposed a novel feature set derived from the physicochemical properties of amino acids, the evolutionary profiles of proteins, and protein sequence information. A large set of human disease-associated data was collected and processed, together with unbiased experimental amino acid substitution data. Four machine learning methods (decision trees, support vector machines, Gaussian mixture models, and random forests) were used to classify neutral and deleterious substitutions, and a comparison with published results showed that our feature set yields higher classification accuracy than existing ones. We designed a simulated annealing bump hunting method to automatically extract interpretable rules for amino acid substitutions. The extracted rules are consistent with current biological knowledge or provide new insights for understanding substitutions. We also designed a Multiple Selection and Rule Voting (MS-RV) model, which integrates data partitioning and feature selection to predict and prioritize disease-associated mutations. For mutation data in the SwissProt database, its 10-fold cross-validation accuracy exceeds that of the support vector machine and random forests. We prioritized the substitutions inside thirty 10-Mb chromosomal regions related to monogenic diseases and analyzed the normalized ranks. The overall area under the ROC curve (AUC) is 86.6%. For polygenic disease-associated amino acid substitutions, we analyzed the mutations that cause Alzheimer's disease. Our method placed the disease-associated substitutions at the top ranks. The results indicate that the MS-RV model effectively prioritizes disease-associated amino acid substitutions. We also studied the unclassified mutations with high prediction scores and found evidence to support our conclusions.
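
As a rough illustration of the ranking evaluation mentioned in the abstract, the following Python sketch computes normalized ranks and an AUC value from prediction scores. It is not the report's implementation: the function names, the rank normalization (rank divided by the number of candidates, with rank 1 for the highest score), and the toy scores and labels are all assumptions made for this example.

```python
def normalized_ranks(scores):
    """Return rank / n for each score, where rank 1 is the highest score.

    Assumption: this normalization is illustrative; the report does not
    state its exact definition here.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])
    ranks = [0.0] * n
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position / n
    return ranks


def auc_from_scores(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Tied pairs count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease-associated, 0 = neutral.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_ranks = normalized_ranks(toy_scores)
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

With this pairwise definition, an AUC of 1.0 means every disease-associated substitution is scored above every neutral one, and 0.5 corresponds to a random ordering.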


This report is not currently available in PDF format for downloading. Contact the Signal and Image Processing Institute for information on its availability.