This blog post is meant to help you decide when to use one model rather than another by understanding the pros and cons of each. In this post we cover Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, K Nearest Neighbours, K Means, Principal Component Analysis and Naive Bayes.
Linear Regression
Good
- Simple to implement and efficient to train
- Overfitting can be reduced by regularization
- Performs well when the relationship between the features and the target is linear
Bad
- Assumes independence between observations, which is rare in real life
- Sensitive to noise and prone to overfitting
- Sensitive to outliers
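To make this concrete, here is a minimal sketch of fitting a regularized linear model, assuming scikit-learn and a small synthetic dataset (both are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data with a roughly linear relationship plus a little noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Ridge adds L2 regularization, which helps reduce overfitting
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 on test set:", model.score(X_test, y_test))
```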
Logistic Regression
Good
- Less prone to overfitting, although it can still overfit in high-dimensional datasets
- Efficient when the dataset has features that are linearly separable
- Easy to implement and efficient to train
Bad
- Should not be used when the number of observations is smaller than the number of features
- Assumes a linear decision boundary (linearity in the log-odds), which is rare in practice
- Can only be used to predict discrete outcomes (classes)
Support Vector Machine (SVM)
Good
- Works well on high-dimensional data
- Can work on small datasets
- Can solve non-linear problems
Bad
- Inefficient on large datasets
- Requires picking the right kernel
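Here is a minimal sketch of an SVM solving a non-linear problem. The RBF kernel, the scaling step and the "two moons" toy dataset are illustrative choices, assuming scikit-learn:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A non-linearly separable "two moons" dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel lets the SVM handle the non-linear boundary;
# kernel choice and hyperparameters (C, gamma) usually need tuning
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
```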
Decision Tree
Good
- Can solve non-linear problems
- Can work on high-dimensional data with excellent accuracy
- Easy to visualize and explain
Bad
- Prone to overfitting, which can often be mitigated by using a random forest
- A small change in the data can lead to a large change in the structure of the optimal decision tree
- Calculations can get very complex
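Below is a minimal sketch of a decision tree next to a random forest, assuming scikit-learn and the iris toy dataset; the depth limit and number of trees are illustrative values:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A depth limit is one simple way to reduce overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # the fitted tree is easy to print and explain

# A random forest averages many trees and usually overfits less
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print("Forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```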
K Nearest Neighbour
Good
- Can make predictions without training
- Prediction time complexity is O(n) in the size of the training set
- Can be used for both classification and regression
Bad
- Does not work well with large datasets
- Sensitive to noisy data, missing values and outliers
- Needs feature scaling
- Choosing the correct value of K can be difficult
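A minimal sketch of KNN with feature scaling and a cross-validated search for K, assuming scikit-learn and the iris toy dataset; the candidate K values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling matters because KNN relies on distances;
# the grid search picks K by cross-validation instead of guessing it
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print("Best K:", search.best_params_["knn__n_neighbors"])
print("CV accuracy:", search.best_score_)
```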
K Means
Good
- Simple to implement
- Scales to large data sets
- Guarantees convergence
- Easily adapts to new examples
- Can be generalized to clusters of different shapes and sizes, such as elliptical clusters
Bad
- Sensitive to outliers
- Choosing the value of k manually is difficult
- Dependent on initial values
- Scales poorly as the number of dimensions increases
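A minimal sketch of K Means, assuming scikit-learn and a synthetic blob dataset; the silhouette score is used here only as one possible way to compare candidate values of k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# n_init runs the algorithm several times with different initial centroids,
# which reduces the sensitivity to initialization
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, "clusters -> silhouette:", round(silhouette_score(X, km.labels_), 3))
```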
Principal Component Analysis
Good
- Removes correlated features (the resulting components are uncorrelated)
- Improves training performance by reducing dimensionality
- Reduces overfitting
Bad
- Principal components are less interpretable
- Information loss
- Must standardize data before implementing PCA
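A minimal sketch of PCA with standardization, assuming scikit-learn and the breast cancer toy dataset; keeping 95% of the variance is an illustrative threshold:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Standardize first: PCA is sensitive to the scale of each feature
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print("Original features:", X.shape[1], "-> components:", X_reduced.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_.sum().round(3))
```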
Naive Bayes
Good
- Requires little training time
- Better suited for categorical inputs
- Easy to implement
Bad
- Assumes that all features are independent, which rarely holds in real life
- Zero-frequency problem: a category not seen in training is assigned zero probability unless smoothing (e.g. Laplace smoothing) is applied
- Its probability estimates can be unreliable in some cases
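A minimal sketch of a Naive Bayes classifier, assuming scikit-learn and the iris toy dataset; the Gaussian variant is an illustrative choice for continuous features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB assumes the features are independent given the class;
# for categorical or count features, CategoricalNB or MultinomialNB
# (with Laplace smoothing against the zero-frequency problem) fit better
nb = GaussianNB().fit(X_train, y_train)
print("Accuracy:", nb.score(X_test, y_test))
```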
These articles are meant to remind you of key concepts on your journey to becoming a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for future posts, and feedback.
Have a nice day!