When we train a classification model to predict fraudulent bank transactions, we might find that 99% of transactions are legitimate and only 1% are fraudulent. A classifier that labels every transaction as legitimate would reach 99% accuracy, yet it would never catch a single fraudulent transaction. The situation where one class is much more frequent than the other is called class imbalance, and it requires a different approach to assessing model performance.
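To make the 99% accuracy trap concrete, here is a minimal sketch. The data is made up for illustration (10,000 hypothetical transactions, 1% labeled as fraud), and the "model" is simply a rule that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical data: 10,000 transactions, 1% fraudulent (label 1)
n_transactions = 10_000
y_true = np.zeros(n_transactions, dtype=int)
y_true[:100] = 1

# A "classifier" that labels every transaction as legitimate (label 0)
y_pred = np.zeros(n_transactions, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2%}")
print(f"Recall on fraud: {fraud_recall:.2%}")
```

The accuracy looks excellent while the recall on the fraud class is zero, which is why accuracy alone is a poor guide on imbalanced data.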
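Most often we refer to the confusion matrix, as there are multiple metrics we can calculate from it. Below, tp, tn, fp and fn are the counts of true positives, true negatives, false positives and false negatives, with fraud as the positive class.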
Accuracy: (tp+tn)/(tp+tn+fp+fn)
Precision: tp/(tp+fp)
Note: Higher precision means fewer false positives among the predicted positives. For our classifier, it means not many legitimate transactions are predicted to be fraudulent.
Recall: tp/(tp+fn)
Note: Higher recall means a lower false negative rate. For our classifier, it means most fraudulent transactions are predicted correctly. Recall is also called sensitivity.
F1 Score: 2*(precision*recall)/(precision+recall)
Note: This metric is the harmonic mean of precision and recall, so it gives equal weight to both. It therefore factors in both the number of errors made by the model and the type of errors. The F1 score favors models with similar precision and recall, and it is a useful metric if we are seeking a model that performs reasonably well across both.
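The formulas above can be checked by hand on a small example. The counts below are hypothetical, chosen only for illustration, and the last lines compare the hand-computed values against scikit-learn on label arrays rebuilt from those counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical confusion-matrix counts for a fraud classifier
tp, fn = 80, 20      # fraud caught / fraud missed
fp, tn = 40, 9860    # legitimate flagged / legitimate passed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# Rebuild label arrays that produce exactly these counts
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn_, fp_, fn_, tp_ = confusion_matrix(y_true, y_pred).ravel()
f1_sklearn = f1_score(y_true, y_pred)
```

Accuracy is still above 99% here, while precision, recall and F1 give a much more honest picture of how the fraud class is handled.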
Logistic regression is used for classification problems and outputs probabilities. If the predicted probability p is greater than 0.5, the observation is labeled 1; otherwise it is labeled 0. Logistic regression produces a linear decision boundary. The default threshold in scikit-learn is 0.5.
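As a rough sketch of how the threshold works, we can take the predicted probabilities and apply the cutoff ourselves. The dataset here is synthetic (generated with make_classification), and the 0.2 threshold is an arbitrary value picked for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: roughly 10% of samples in the positive class
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Probability of the positive class for each test observation
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

# Apply the 0.5 cutoff by hand, then a lower, arbitrary cutoff
default_pred = (y_pred_probs > 0.5).astype(int)
low_threshold_pred = (y_pred_probs > 0.2).astype(int)

print("Predicted positives at 0.5:", default_pred.sum())
print("Predicted positives at 0.2:", low_threshold_pred.sum())
```

Lowering the threshold can only flag the same or more observations as positive, which tends to raise recall at the cost of precision.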
What happens if we vary this threshold?
We can use a Receiver Operating Characteristic (ROC) curve to visualize how different thresholds affect the true positive and false positive rates.
How do we quantify the model performance based on this?
We calculate the area under the ROC curve, a metric known as AUC. Scores range from zero to one, with one being ideal. A model that guesses randomly scores around 0.5, which corresponds to the dotted diagonal line on the plot. When the ROC curve sits above that dotted line, the model performs better than randomly guessing the class of each observation.
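To get a feel for the AUC scale, we can score a few made-up sets of predictions. Everything below is simulated for illustration: random scores stand in for a model that guesses, noisy scores for a partly informative model, and the true labels themselves for a perfect model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples = 5000
y_true = rng.integers(0, 2, size=n_samples)

# Scores unrelated to the labels: equivalent to random guessing
random_scores = rng.random(n_samples)
# Scores that carry some signal plus Gaussian noise
noisy_scores = y_true + rng.normal(0, 1, size=n_samples)
# Scores that separate the two classes perfectly
perfect_scores = y_true.astype(float)

auc_random = roc_auc_score(y_true, random_scores)
auc_noisy = roc_auc_score(y_true, noisy_scores)
auc_perfect = roc_auc_score(y_true, perfect_scores)

print(f"random:  {auc_random:.3f}")
print(f"noisy:   {auc_noisy:.3f}")
print(f"perfect: {auc_perfect:.3f}")
```

The random scores land close to 0.5, the perfect scores give exactly 1, and the noisy scores fall somewhere in between.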
Hyperparameter tuning is how we optimize models, for example by choosing a value for alpha in lasso or ridge regression, or n_neighbors for KNN. To choose good hyperparameters, we try several different values, fit a model with each of them separately, see how they perform and keep the best values. To avoid overfitting to the test set, we split the data and perform cross-validation on the training set only, withholding the test set for the final evaluation. One approach to hyperparameter tuning is called grid search, which tries every combination of the values we list. Grid search doesn't scale well as the number of hyperparameters and values grows, but we can use RandomizedSearchCV instead, which only tests a random sample of the combinations.
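The workflow described above might look like the following sketch. The dataset is synthetic, and the range of n_neighbors values, the number of folds and n_iter are all arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, KFold,
                                     RandomizedSearchCV, train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, split once; the test set is withheld until the end
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"n_neighbors": np.arange(1, 16)}

# Grid search: cross-validates every value in the grid on the training set
grid_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=kf)
grid_cv.fit(X_train, y_train)

# Randomized search: cross-validates only n_iter sampled values
random_cv = RandomizedSearchCV(KNeighborsClassifier(), param_grid,
                               cv=kf, n_iter=5, random_state=0)
random_cv.fit(X_train, y_train)

print("Grid search:      ", grid_cv.best_params_, grid_cv.best_score_)
print("Randomized search:", random_cv.best_params_, random_cv.best_score_)

# Final evaluation of the tuned model on the withheld test set
test_accuracy = grid_cv.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The randomized search fits a third as many candidates here, which is the trade-off: less computation in exchange for possibly missing the best value in the grid.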
Useful Imports in Python
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RandomizedSearchCV
Useful Functions in Python
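    # Assumes X_train, X_test, y_train, y_test and y_pred already exist, along with
    # a fitted LogisticRegression (logreg), a Ridge model (ridge) and a KFold splitter (kf)

    # Key metrics
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

    # ROC curve plot
    y_pred_probs = logreg.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
    plt.plot([0, 1], [0, 1], "k--")  # dotted diagonal: random guessing
    plt.plot(fpr, tpr)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.show()
    print(roc_auc_score(y_test, y_pred_probs))

    # Grid search: pass the model, the grid of values to tune over and the CV splitter
    # (np.arange(0.001, 1, 10) would yield a single value, so linspace is used for 10 alphas)
    param_grid = {"alpha": np.linspace(0.001, 1, 10), "solver": ["sag", "lsqr"]}
    ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)
    ridge_cv.fit(X_train, y_train)

    # Randomized search: n_iter sets how many parameter combinations are sampled
    ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
    ridge_cv.fit(X_train, y_train)
    print(ridge_cv.best_params_, ridge_cv.best_score_)

    # Parameters for RandomizedSearchCV with logistic regression
    # ("l1" needs a solver that supports it, such as "liblinear" or "saga")
    params = {"penalty": ["l1", "l2"],
              "tol": np.linspace(0.0001, 1.0, 50),
              "C": np.linspace(0.1, 1, 50),
              "class_weight": ["balanced", {0: 0.8, 1: 0.2}]}
    logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)
    logreg_cv.fit(X_train, y_train)
    print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
    print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))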
These articles are meant to remind you of key concepts during your quest to become a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for future posts and feedback.
Have a nice day!