Choosing a suitable Machine Learning algorithm
Machine learning (ML) is the field of study that gives computers the ability to learn without being explicitly programmed. It is among the most exciting technologies in use today.
A machine learning algorithm is a procedure that adjusts its own parameters in response to feedback on its past predictions over a data set.
Who should read this article?
Anybody who wants to learn about the factors to keep in mind when selecting an algorithm for a machine learning model. This article highlights these factors in brief.
Widely used machine learning algorithms:
- Linear Regression: In its simplest form, it models the relationship between two continuous variables: one independent variable and one dependent variable.
- Logistic Regression: Logistic regression is one of the common methods to analyse the data and explain the relationship between one dependent binary variable and one or more independent variables of the nominal, ordinal, interval, or ratio level.
- KNN: K-nearest neighbours (KNN) can be used for both classification and regression problems. It predicts from the K training points that lie closest to the query point.
- K-means: K-means clustering is an unsupervised learning algorithm, used when the data is not labelled (that is, it has no predefined categories or groups). The aim of the algorithm is to find groups in the data set, with the number of groups given by the variable K.
- Support Vector Machines (SVM): It is a supervised machine learning algorithm that can be used for classification or regression tasks. It can use a technique called the kernel trick to implicitly map the data into a higher-dimensional space, where it finds an optimal boundary between the possible outputs.
- Random Forest: It can be used for both regression and classification tasks. By averaging the predictions of many decision trees, it is usually more accurate than a single tree. Some random forest implementations can handle missing values and maintain accuracy even when a significant proportion of the data is missing. Adding more trees does not make the forest overfit, because averaging over many trees reduces the variance of the individual trees.
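To make one entry from the list above concrete, here is a minimal sketch of KNN classification in plain Python. The function name `knn_predict` and the toy points are made up for illustration. A real project would more likely use an established library than this hand-rolled version.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Order training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Majority vote; on a tie, the label seen first among the neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (invented for this example): two small, well-separated clusters.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.8, 1.6)))
print(knn_predict(points, labels, (6.2, 6.4)))
```

Note that KNN has no training step in the usual sense: all the work happens at prediction time, which matters when prediction speed is a concern.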
The following factors should be taken into account when choosing an algorithm:
- The type of problem to be solved
- The available data (size of the training set)
- The accuracy of the model
- Time taken to train the model (training time)
- Number of parameters
- Number of features
Understanding the problem type: It is essential to understand what kind of problem we want to solve and what purpose the model must serve, because each algorithm is designed for a specific kind of task, such as classification or regression. We therefore need to choose the algorithm that best fits the task.
Types of machine learning tasks:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
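As a rough illustration of how the problem type narrows the choice, the sketch below maps a task to the algorithms described earlier in this article. The `CANDIDATES` table and the `shortlist` helper are hypothetical names introduced here, and the mapping only reflects the descriptions given above. Reinforcement learning is left out because none of the algorithms listed in this article addresses it.

```python
# Hypothetical lookup table built from the algorithm descriptions in this article.
CANDIDATES = {
    ("supervised", "classification"): [
        "Logistic Regression", "KNN", "SVM", "Random Forest",
    ],
    ("supervised", "regression"): [
        "Linear Regression", "KNN", "SVM", "Random Forest",
    ],
    ("unsupervised", "clustering"): ["K-means"],
}


def shortlist(has_labels, target_is_continuous=False):
    """Return candidate algorithms for a task described by two yes/no questions."""
    if not has_labels:
        # No labels available: only unsupervised methods apply.
        return CANDIDATES[("unsupervised", "clustering")]
    task = "regression" if target_is_continuous else "classification"
    return CANDIDATES[("supervised", task)]


print(shortlist(has_labels=True, target_is_continuous=True))
print(shortlist(has_labels=False))
```

A shortlist like this is only a starting point; the remaining factors decide between the candidates.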
Size of training set: When the training data set is too small, the resulting estimates are usually poor. An over-constrained model trained on insufficient data tends to underfit, whereas an under-constrained model is likely to overfit; in both cases performance suffers. The size of the training data set therefore plays a major role in deciding which algorithm to use. With a small training data set, low bias/high variance classifiers (such as k-nearest neighbours) are likely to overfit, so high bias/low variance classifiers (such as Naive Bayes) have an advantage.
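The effect can be seen on a tiny invented data set. The sketch below compares a 1-nearest-neighbour classifier (low bias, high variance) with a nearest-centroid classifier, which is used here as a simple stand-in for a high bias/low variance model; it is not Naive Bayes. The data, including the two deliberately mislabelled training points, is made up for illustration, so the numbers say nothing about real data sets.

```python
def one_nn(train, x):
    # Predict the label of the single closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]


def nearest_centroid(train, x):
    # Predict the class whose mean feature value is closest to x.
    means = {}
    for label in {lab for _, lab in train}:
        values = [v for v, lab in train if lab == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(means[label] - x))


def accuracy(model, train, data):
    # Fraction of (x, y) pairs in `data` that the model predicts correctly.
    return sum(model(train, x) == y for x, y in data) / len(data)


# One feature, two classes. The points at x=3 and x=8 carry noisy labels.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 0), (9, 1)]
held_out = [(2.9, 0), (3.2, 0), (7.8, 1), (8.3, 1), (1.5, 0), (6.5, 1)]

for name, model in [("1-NN", one_nn), ("nearest centroid", nearest_centroid)]:
    print(name,
          "train:", accuracy(model, train, train),
          "held-out:", round(accuracy(model, train, held_out), 2))
```

On this toy data, 1-NN reproduces every training label, noise included, and then repeats those mistakes on nearby held-out points, while the simpler model misclassifies the noisy training points but generalizes better.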
Accuracy: We use machine learning algorithms to make real-world decisions, and better model results lead to better decisions. The cost of errors can be very high, so it is essential to minimize that cost by improving model accuracy. The accuracy needed differs from one application to another. An approximation is often sufficient, and accepting one can greatly reduce processing time. Approximate methods also tend to be less prone to overfitting the training data set.
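Accuracy and the cost of errors are related but not identical, which the following sketch tries to show. The labels, the two sets of predictions, and the cost values are all invented for illustration; real costs depend entirely on the application.

```python
def fraction_correct(y_true, y_pred):
    # Plain accuracy: share of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def total_error_cost(y_true, y_pred, fp_cost, fn_cost):
    # Sum a separate (assumed) cost for false positives and false negatives.
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 1:
            cost += fp_cost
        elif t == 1 and p == 0:
            cost += fn_cost
    return cost


y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # one false negative
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # one false positive

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name,
          fraction_correct(y_true, preds),
          total_error_cost(y_true, preds, fp_cost=1.0, fn_cost=50.0))
```

Both hypothetical models are correct on nine of ten examples, yet under the assumed costs a missed positive is far more expensive than a false alarm, so the two are not equally good choices.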
Training time: The time taken to train a model varies from algorithm to algorithm. It also depends on the size of the data set and the accuracy we are aiming for.
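One simple way to compare training times is to measure them directly. The sketch below times a closed-form least-squares line fit on synthetic data of increasing size. The helper names `fit_line` and `timed_fit` are made up for this example, and the absolute timings will differ from machine to machine.

```python
import time


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def timed_fit(fit, xs, ys):
    """Run `fit` once and return (model, elapsed seconds)."""
    start = time.perf_counter()
    model = fit(xs, ys)
    return model, time.perf_counter() - start


for n in (1_000, 10_000, 100_000):
    xs = [float(i) for i in range(n)]
    ys = [2.0 * x + 1.0 for x in xs]  # synthetic data on an exact line
    (slope, intercept), seconds = timed_fit(fit_line, xs, ys)
    print(f"n={n:>7}  slope={slope:.3f}  time={seconds:.4f}s")
```

A single run is a noisy measurement; repeating the fit several times and taking the minimum or median gives a steadier comparison.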
Number of parameters: Parameters such as the error tolerance or the number of iterations affect how an algorithm behaves, and which ones are available depends on the algorithm. Algorithms with a large number of parameters usually need the most trial and error to find a good combination. Although having many parameters typically gives more flexibility, the training time and accuracy of such an algorithm can be very sensitive to getting the settings just right.
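The amount of trial and error grows quickly because the candidate settings multiply. The sketch below enumerates a small grid of settings; the parameter names and values are hypothetical and do not belong to any particular library.

```python
from itertools import product

# Hypothetical parameter grid for a tree-ensemble style algorithm.
grid = {
    "n_trees": [50, 100, 200],
    "max_depth": [4, 8, 16, None],
    "min_samples_leaf": [1, 5, 10],
}


def parameter_combinations(grid):
    """Return every combination of settings as a list of dicts."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


combos = parameter_combinations(grid)
print(len(combos))  # 3 * 4 * 3 = 36 candidate settings
print(combos[0])
```

Each combination means one more model to train and evaluate, so adding a fourth parameter with five values would already raise the count to 180.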
Number of features: In some data sets, the number of features is very large compared with the number of data points. This is often the case with NLP data sets, which consist mostly of text. With so many features, some learning algorithms take so long to train that they become impractical. A few algorithms, such as Support Vector Machines (SVM), are particularly well suited to this situation. Rules of thumb like these, drawn from past experience, do not hold in every situation, so we need a good understanding of the algorithms in order to choose the best one for a specific problem.
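A short sketch shows how text data ends up with more features than data points. It builds a bag-of-words count matrix in plain Python, with one feature per distinct word. The three sentences are made up for illustration.

```python
# Toy corpus (invented): each document is one data point.
docs = [
    "the model overfits the training data",
    "support vector machines handle sparse text features",
    "random forests average many decision trees",
]

# One feature per distinct word across the whole corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Row i holds the count of every vocabulary word in document i.
matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print("data points:", len(docs))
print("features:   ", len(vocabulary))
print("non-zero entries in first row:", sum(1 for c in matrix[0] if c))
```

Even with three short sentences the feature count is several times the number of data points, and most entries in each row are zero. Real corpora behave the same way on a much larger scale.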
Linearity: Another factor to consider is that many machine learning algorithms, such as linear regression, logistic regression, and even support vector machines, make use of linearity. They assume that the classes can be separated by a straight line (or its higher-dimensional analogue) or that the data follows a linear trend. When a problem can be approached with these algorithms, the work becomes relatively easy, because they are simple and relatively quick to train. However, they can reduce accuracy if the linear assumption does not suit the particular problem.
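The limitation can be demonstrated with a perceptron, used here as a minimal stand-in for a linear classifier. The sketch trains it on the logical AND function, which is linearly separable, and on XOR, which is not. The function names and the training settings are choices made for this example.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron rule on 2-feature inputs with 0/1 targets."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred  # 0 when correct, +1 or -1 when wrong
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def perceptron_accuracy(samples, w, b):
    hits = sum(
        (1 if w[0] * x1 + w[1] * x2 + b > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = perceptron_accuracy(AND, *train_perceptron(AND))
xor_acc = perceptron_accuracy(XOR, *train_perceptron(XOR))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

The perceptron learns AND perfectly, but no single straight line separates the XOR classes, so its accuracy on XOR stays below 100% no matter how long it trains. A non-linear method, or a linear method with suitable feature transformations, is needed in such cases.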