Biopython – Machine Learning Overview
Machine learning algorithms help analyze large data sets accurately in many fields. In bioinformatics, they make it practical to derive information from genetic data sets that are too large to analyze by hand.
Machine learning algorithms are broadly classified into three categories: supervised learning, unsupervised learning, and reinforcement learning. This article covers supervised learning only.
A supervised learning algorithm learns from a labeled training data set: it makes predictions iteratively, and its errors are corrected against the known answers, much as a teacher corrects a student. Mathematically, supervised learning seeks a function f such that Y = f(X), which predicts the output variable Y from the input data X.
Supervised learning problems fall into two types, and the method is chosen to match the problem: classification (the output value is a category) and regression (the output value is a real number). The following models use supervised learning to solve problems that arise in bioinformatics:
Logistic Regression:
Logistic regression determines the relationship between a dependent variable and one or more independent variables, where the dependent variable is categorical (binary in the simplest case). The model distinguishes K classes using a weighted sum of the predictor variables, and it yields the probability that an observation belongs to each class.
Biopython provides the Bio.LogisticRegression module for this type of operation. It currently supports only two classes (K = 2). In the operon prediction example, the two classes are OP (adjacent genes that belong to the same operon) and NOP (adjacent genes that belong to different operons). Predicting operon structure in this way is a first step toward understanding gene regulation (the various ways a cell increases or decreases gene products) in bacteria.
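The weighted-sum idea behind a two-class logistic regression can be sketched in plain Python. This is a minimal illustration, not the Bio.LogisticRegression API: it uses only the standard library, since that module may not be present in every Biopython release. The training values, the two features (a scaled intergene distance and an expression similarity score), and the function names train, predict_proba, and classify are all assumptions made up for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(beta, x):
    # Probability of class 1, from the weighted sum beta0 + beta1*x1 + ...
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def train(xs, ys, lr=0.5, epochs=2000):
    # Fit the weights by batch gradient ascent on the log-likelihood.
    beta = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - predict_proba(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / len(xs) for b, g in zip(beta, grad)]
    return beta

def classify(beta, x):
    # 1 stands for the OP class, 0 for the NOP class.
    return 1 if predict_proba(beta, x) >= 0.5 else 0

# Hypothetical toy data: [scaled intergene distance, expression similarity].
xs = [[0.1, 0.9], [0.2, 0.8], [0.0, 0.7], [0.3, 0.95],
      [1.5, 0.2], [2.0, 0.1], [1.2, 0.3], [1.8, 0.4]]
ys = [1, 1, 1, 1, 0, 0, 0, 0]

beta = train(xs, ys)
print("weights:", beta)
print("class for [0.1, 0.85]:", classify(beta, [0.1, 0.85]))
```

The learned weights play the role of the coefficients in the weighted sum, and predict_proba turns that sum into a class probability.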
Naive Bayes:
Naive Bayes is a collection of algorithms that all depend on Bayes' theorem, which gives the probability of an event based on prior knowledge of conditions related to it. The model combines previous data with new observations, and it assumes that the features are independent of each other given the class.
Biopython provides the Bio.NaiveBayes module for this. Because the Naive Bayes algorithm is considered a good fit for recommendation systems, research is ongoing into gene recommendation based on the Naive Bayes model.
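A Naive Bayes classifier for categorical features can be sketched as follows. It is an illustration in plain Python rather than a use of Bio.NaiveBayes. The feature values ("short"/"long", "high"/"low"), the class labels, and the function names train_nb and classify_nb are hypothetical, and add-one (Laplace) smoothing is an assumption chosen here to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    # Count how often each class occurs and how often each feature
    # value occurs within each class.
    class_counts = Counter(labels)
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            feature_counts[y][i][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def classify_nb(model, x):
    # Pick the class with the highest log posterior, treating the
    # features as independent given the class (the "naive" assumption).
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, v in enumerate(x):
            count = feature_counts[label][i][v]
            # Add-one smoothing over the observed values of feature i.
            score += math.log((count + 1) / (n + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: (distance bin, expression similarity bin).
samples = [("short", "high"), ("short", "high"), ("short", "low"),
           ("long", "low"), ("long", "low"), ("long", "high")]
labels = ["OP", "OP", "OP", "NOP", "NOP", "NOP"]

model = train_nb(samples, labels)
print(classify_nb(model, ("short", "high")))
```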
Markov Model & Maximum Entropy:
A Hidden Markov Model (a simple way to model sequential data) is used for genomic data analysis, for example to identify gene regions from a segment or sequence. Maximum entropy is used for biological modeling of gene sequences.
Both models are in active use in bioinformatics. Biopython supports them through the Bio.MaxEntropy, Bio.MarkovModel, and Bio.HMM.MarkovModel modules.
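To show how a Hidden Markov Model labels regions of a sequence, here is a small sketch of Viterbi decoding in plain Python. It does not use the Biopython modules. The two hidden states ("AT-rich" and "GC-rich"), every probability value, and the function name viterbi are assumptions invented for the illustration, and the input sequence is assumed to be non-empty.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    # Most likely hidden-state path for seq, computed in log space.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
               for s in states}]
    back = []
    for symbol in seq[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state from which to enter state s.
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical model parameters, chosen only for illustration.
STATES = ["AT-rich", "GC-rich"]
START_P = {"AT-rich": 0.5, "GC-rich": 0.5}
TRANS_P = {"AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
           "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9}}
EMIT_P = {"AT-rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "GC-rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

path = viterbi("AATTGGCCGG", STATES, START_P, TRANS_P, EMIT_P)
print(path)
```

The decoded path assigns one hidden state to each position, which is how such a model can mark where one kind of region ends and another begins.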
K-Nearest Neighbors:
This model stores all available training cases and then classifies a new data point according to the categories of its k nearest neighbors among them. It is used in statistical estimation and pattern recognition.
Biopython provides the Bio.kNN module for this type of operation. Classifying a gene pair (here, two adjacent genes) as belonging to the same operon or not is an example of a problem that this model can solve.
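The nearest-neighbor idea can be sketched in a few lines of plain Python. This is not the Bio.kNN API. The training points, the labels, the choice of Euclidean distance, and the function name knn_classify are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k=3):
    # Sort the stored cases by Euclidean distance to the query point.
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    # Majority vote among the k closest cases.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [scaled intergene distance, expression similarity].
train_x = [[0.1, 0.9], [0.2, 0.8], [0.0, 0.7], [0.3, 0.95],
           [1.5, 0.2], [2.0, 0.1], [1.2, 0.3], [1.8, 0.4]]
train_y = ["OP", "OP", "OP", "OP", "NOP", "NOP", "NOP", "NOP"]

print(knn_classify(train_x, train_y, [0.15, 0.85]))
```

An odd value of k avoids tied votes when there are only two classes.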