Active Learning for the Prediction of Asparagine/Aspartate Hydroxylation Sites on Proteins

Festus O. Iyuke, James R. Green, and William G. Willmore


Keywords: Active learning, support vector machines, hydroxylation site prediction, class imbalance


In this paper, we propose pool-based active learning with support vector machine (SVM) classifiers for the prediction of asparagine/aspartate (N/D) hydroxylation sites on proteins. Verifying hydroxylation sites on human proteins in wet-lab experiments is costly and sometimes time-consuming. Active learning can therefore be used to select the putative hydroxylation sites whose wet-lab validation would yield the most information. Using a dataset of N/D sites with known hydroxylation state, we demonstrate through simulations that active learning query strategies achieve higher classification performance with fewer labelled training instances for hydroxylation site prediction than traditional passive learning. The active learning query strategies (uncertainty, density-uncertainty, certainty) are shown to identify the most informative unlabelled instances for oracle annotation at each learning cycle. Our experimental results further show that active learning strategies are highly robust in the presence of class imbalance in the available unlabelled data.
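
The learning cycle described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the uncertainty strategy, uses a small linear SVM trained by subgradient descent as a stand-in for whatever SVM configuration the paper uses, and runs on a synthetic, imbalanced two-dimensional pool rather than real N/D site features. All names (train_linear_svm, active_learning, pool, truth) and all parameter values are hypothetical.

```python
import random

def decision(w, b, x):
    """Score of x under the linear model (w, b); the sign gives the class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, epochs=100, eta=0.05, lam=0.01, seed=0):
    """Fit a linear SVM by subgradient descent on the L2-regularised hinge loss.

    Labels must be in {-1, +1}. Stand-in for a full SVM library (assumption).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * decision(w, b, X[i]) < 1
            w = [wj * (1 - eta * lam) for wj in w]  # shrink step from the L2 penalty
            if violated:                            # hinge-loss subgradient step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def active_learning(pool, oracle, seed_idx, n_queries):
    """Pool-based active learning with uncertainty sampling.

    Each cycle retrains on the labelled set, then asks the oracle to label
    the unlabelled instance closest to the decision boundary.
    """
    labelled = list(seed_idx)
    labels = {i: oracle(i) for i in labelled}
    unlabelled = [i for i in range(len(pool)) if i not in labels]
    for _ in range(n_queries):
        w, b = train_linear_svm([pool[i] for i in labelled],
                                [labels[i] for i in labelled])
        # Most uncertain instance: smallest absolute decision value.
        query = min(unlabelled, key=lambda i: abs(decision(w, b, pool[i])))
        labels[query] = oracle(query)  # simulated wet-lab annotation
        labelled.append(query)
        unlabelled.remove(query)
    w, b = train_linear_svm([pool[i] for i in labelled],
                            [labels[i] for i in labelled])
    return w, b, labelled

# Synthetic imbalanced pool: 20 positives, 180 negatives (illustrative only).
rng = random.Random(42)
pos = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(20)]
neg = [[rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(180)]
pool = pos + neg
truth = [1] * 20 + [-1] * 180

# Seed with one labelled instance per class, then run 10 query cycles.
w, b, queried = active_learning(pool, lambda i: truth[i],
                                seed_idx=[0, 20], n_queries=10)
accuracy = sum((1 if decision(w, b, x) >= 0 else -1) == t
               for x, t in zip(pool, truth)) / len(pool)
print(f"labelled {len(queried)} of {len(pool)} instances, "
      f"pool accuracy {accuracy:.2f}")
```

In a simulation of the kind the abstract describes, the oracle simply looks up the known hydroxylation state. A passive-learning baseline would replace the min(...) query with a random choice from the unlabelled pool.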
