Pros & Cons of Popular Machine Learning Models

Thomas Bustos

Data Scientist | Data Engineer | ML Engineer

This blog post will help you understand why to use a specific model instead of another by walking through the pros and cons of each one. In this post we cover Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, K-Nearest Neighbours, K-Means, Principal Component Analysis, and Naive Bayes.

Linear Regression

Good

  • Simple to implement and efficient to train
  • Overfitting can be reduced by regularization
  • Performs well when the relationship between the features and the target is linear

Bad

  • Assumes that the observations are independent, which is rare in real life
  • Prone to noise and overfitting
  • Sensitive to outliers

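To make this concrete, here is a minimal sketch of fitting a linear model. This and the following examples use scikit-learn with synthetic or toy data, an assumed setup for illustration only:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: the target depends linearly on the features, plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_)  # estimated weights, close to [2.0, -1.0, 0.5]

# Regularized variant (Ridge) to reduce overfitting
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)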

Logistic Regression

Good

  • Less prone to overfitting, though it can still overfit on high-dimensional datasets
  • Efficient when the dataset has features that are linearly separable
  • Easy to implement and efficient to train

Bad

  • Should not be used when the number of observations is less than the number of features
  • Assumes a linear relationship between the features and the log-odds, which rarely holds in practice
  • Can only be used to predict discrete outcomes (classes), not continuous values

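A minimal sketch of logistic regression on synthetic data (assumed setup), showing that it predicts discrete classes backed by probabilities:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))        # discrete class labels
print(clf.predict_proba(X[:5]))  # the underlying class probabilities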

Support Vector Machine (SVM)

Good

  • Effective on high-dimensional data
  • Can work on small datasets
  • Can solve non-linear problems (with the right kernel)

Bad

  • Inefficient on large datasets
  • Requires picking the right kernel

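To make the kernel point concrete, a minimal sketch on the synthetic two-moons problem (assumed data): an RBF kernel solves a non-linear problem that a linear kernel cannot:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

# The RBF kernel should score noticeably higher here
print("linear:", linear_svm.score(X, y))
print("rbf:   ", rbf_svm.score(X, y))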

Decision Tree

Good

  • Can solve non-linear problems
  • Can work on high-dimensional data with good accuracy
  • Easy to visualize and explain

Bad

  • Prone to overfitting, which can often be mitigated by ensembles such as random forests
  • A small change in the data can lead to a large change in the structure of the optimal decision tree
  • Training can become computationally expensive on large datasets

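A minimal sketch on the Iris dataset (assumed for illustration): capping the depth is one simple guard against overfitting, and the fitted tree prints as human-readable rules, which is why trees are easy to explain:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Capping max_depth is a simple way to limit overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Print the learned decision rules as plain text
print(export_text(tree, feature_names=iris.feature_names))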

K-Nearest Neighbours (KNN)

Good

  • Requires no training phase; predictions come straight from the stored data (lazy learning)
  • Prediction is a simple O(n) scan over the stored training examples
  • Can be used for both classification and regression

Bad

  • Does not work well with large datasets
  • Sensitive to noisy data, missing values and outliers
  • Needs feature scaling
  • Choosing the right value of K is not obvious

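Because KNN needs feature scaling, the usual pattern is a pipeline that scales before classifying; a minimal sketch (assumed setup), where K is the n_neighbors parameter:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale first: KNN distances are meaningless across features on different scales
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)  # "training" mostly just stores the scaled examples
print(knn.predict(X[:5]))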

K-Means

Good

  • Simple to implement
  • Scales to large data sets
  • Guaranteed to converge (though only to a local optimum)
  • Easily adapts to new examples
  • Can be generalized to clusters of different shapes and sizes (e.g. elliptical clusters)

Bad

  • Sensitive to outliers
  • Choosing the value of k manually is difficult
  • Dependent on the initial values of the centroids
  • Effectiveness degrades as the number of dimensions increases

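A minimal sketch of K-Means on synthetic blobs (assumed data). The n_init parameter reruns the algorithm from several starting centroids, precisely because the result depends on the initial values:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init=10 reruns from 10 random initializations and keeps the best result
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the learned centroids
print(kmeans.predict(X[:5]))    # new examples are assigned to clusters cheaply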

Principal Component Analysis

Good

  • Removes correlated features (the resulting components are uncorrelated)
  • Improves performance by reducing the number of dimensions
  • Reduces overfitting

Bad

  • Principal components are less interpretable
  • Information loss
  • Must standardize data before implementing PCA

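A minimal sketch (assumed setup): standardize first, then fit PCA and check how much variance each component explains:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize before PCA, since the components chase directions of high variance
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance kept by each component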

Naive Bayes

Good

  • Training is fast
  • Better suited to categorical inputs
  • Easy to implement

Bad

  • Assumes that all features are independent, which rarely holds in real life
  • Suffers from the zero-frequency problem: a categorical value unseen in training gets zero probability (usually fixed with Laplace smoothing)
  • Its probability estimates can be unreliable in some cases

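A minimal sketch of multinomial Naive Bayes on toy count data (assumed for illustration). The alpha parameter is Laplace smoothing, the standard fix for the zero-frequency problem:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents x 3 vocabulary words
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 2, 3],
              [0, 1, 4]])
y = np.array([0, 0, 1, 1])  # document class labels

# alpha=1.0 applies Laplace smoothing so no unseen count yields zero probability
clf = MultinomialNB(alpha=1.0).fit(X, y)
print(clf.predict(np.array([[1, 0, 3]])))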

These articles are meant to remind you of key concepts on your journey to becoming a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for future posts, and feedback.

Have a nice day!
