Fine-Tuning your model (Classification Metrics, Logistic Regression, Cross Validation, Python Imports & Functions)

September 26, 2022
5:16 pm

Fundamentals, Machine Learning, Python

Thomas Bustos

Data Scientist | Data Engineer | ML Engineer

Suppose we train a classification model to predict fraudulent bank transactions, where 99% of transactions are legitimate and 1% are fraudulent. A classifier that labels every transaction as legitimate would have 99% accuracy but wouldn’t catch a single fraudulent transaction. The situation where one class is much more frequent than the other is called class imbalance. It requires a different approach to assessing model performance.
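To make the imbalance problem concrete, here is a minimal sketch with made-up labels (1,000 hypothetical transactions, 1% fraud) and a "model" that always predicts the majority class. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10 fraudulent (1) and 990 legitimate (0) transactions
y_true = np.array([1] * 10 + [0] * 990)

# A "classifier" that always predicts the majority class (legitimate)
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)  # share of fraud cases caught

print(accuracy)      # 0.99
print(fraud_recall)  # 0.0
```

The accuracy looks excellent, yet no fraudulent transaction is detected, which is why accuracy alone is not enough here.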

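Most often we refer to the confusion matrix, which counts true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn), as there are multiple metrics we can calculate from it.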

Accuracy: (tp+tn)/(tp+tn+fp+fn)

Precision: tp/(tp+fp)

Note: Higher precision means fewer false positives among the predicted positives. For the fraud classifier, it means that few legitimate transactions are predicted to be fraudulent.

Recall: tp/(tp+fn)

Note: Higher recall means a lower false negative rate. For the fraud classifier, it means that most fraudulent transactions are predicted correctly. Recall is also called sensitivity.

F1 Score: 2*(precision*recall)/(precision+recall)

Note: This metric is the harmonic mean of precision and recall and gives them equal weight, so it factors in both the number of errors made by the model and the type of errors. The F1 score favors models with similar precision and recall and is a useful metric if we are looking for a model that performs reasonably well on both.
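The four formulas above can be checked by hand. The sketch below uses hypothetical confusion-matrix counts (the numbers are invented for illustration) and applies each formula as written.

```python
# Hypothetical confusion-matrix counts for a fraud classifier
tp, fp, fn, tn = 8, 4, 2, 986

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 3))   # 0.994
print(round(precision, 3))  # 0.667
print(round(recall, 3))     # 0.8
print(round(f1, 3))         # 0.727
```

Accuracy is still above 99%, while precision and recall show that a third of the fraud alerts are false alarms and a fifth of the fraud cases are missed.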

Logistic regression is used for classification problems and outputs probabilities. If the probability p > 0.5, the observation is labeled 1; if p < 0.5, it is labeled 0. Logistic regression produces a linear decision boundary. The default threshold in scikit-learn is 0.5.

What happens if we vary this threshold?
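As a rough illustration of the thresholding described above, the sketch below fits a logistic regression on synthetic data (generated with `make_classification`, an assumption made only to keep the example self-contained) and compares `predict` with a manual 0.5 cut on `predict_proba`. The 0.3 threshold is an arbitrary value chosen for the comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
logreg = LogisticRegression().fit(X, y)

# Predicted probability of the positive class (label 1)
probs = logreg.predict_proba(X)[:, 1]

# predict() should match a manual 0.5 threshold on those probabilities
default_labels = logreg.predict(X)
manual_labels = (probs > 0.5).astype(int)
same = np.array_equal(default_labels, manual_labels)
print(same)

# A lower threshold labels at least as many observations as positive
lower_labels = (probs > 0.3).astype(int)
print(manual_labels.sum(), lower_labels.sum())
```

Lowering the threshold can only add positive predictions, which tends to raise recall at the cost of more false positives.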

We can use a Receiver Operating Characteristic (ROC) curve to visualize how different thresholds affect the true positive and false positive rates.

How do we quantify the model performance based on this?

We calculate the area under the ROC curve, a metric known as AUC. Scores range from zero to one, with one being ideal and 0.5 corresponding to random guessing. When the ROC curve is above the dotted diagonal line, the model performs better than randomly guessing the class of each observation.
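A small sketch of the two reference points for AUC, using tiny hand-made arrays (hypothetical scores, not output from a real model): scores that rank every positive above every negative, and scores that carry no information at all.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])

# Every positive gets a higher score than every negative
perfect_scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])

# The same score for every observation: no ability to rank
constant_scores = np.full(5, 0.5)

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_constant = roc_auc_score(y_true, constant_scores)

print(auc_perfect)   # 1.0
print(auc_constant)  # 0.5
```

A real model usually lands somewhere between these two values.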

Hyperparameter tuning means choosing the hyperparameter values that optimize a model, such as alpha in ridge or lasso regression, or n_neighbors for KNN. To find good values we try different hyperparameter values, fit each of them separately, see how they perform, and choose the best ones. To avoid overfitting to the test set, we split the data and perform cross-validation on the training set, withholding the test set for final evaluation. One approach to hyperparameter tuning is called grid search. Grid search doesn’t scale well, because the number of fits is the number of hyperparameter combinations multiplied by the number of folds, but we can use RandomizedSearchCV instead, which evaluates only a fixed number of randomly sampled combinations.
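The workflow described above could look roughly like this. It is a sketch on synthetic data, tuning `n_neighbors` for KNN; the dataset, the grid of 10 values, and `n_iter=4` are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, KFold,
                                     RandomizedSearchCV, train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, split so the test set is withheld for final evaluation
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"n_neighbors": np.arange(1, 11)}

# Grid search: cross-validates all 10 candidate values on the training set
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=kf)
grid.fit(X_train, y_train)

# Randomized search: cross-validates only 4 randomly sampled candidates
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid, cv=kf,
                          n_iter=4, random_state=42)
rand.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)

# Final evaluation of the tuned model on the withheld test set
test_accuracy = grid.score(X_test, y_test)
print(test_accuracy)
```

With 5 folds, the grid search runs 50 fits while the randomized search runs 20, which is the trade-off between exhaustiveness and cost.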

Useful Imports in Python

from sklearn.metrics import classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_curve

from sklearn.metrics import roc_auc_score

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import RandomizedSearchCV

import numpy as np

import matplotlib.pyplot as plt

Useful Functions in Python

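# Key metrics (assumes a fitted classifier `logreg` and held-out X_test, y_test)
y_pred = logreg.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))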

# ROC curve plot
y_pred_probs = logreg.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
plt.plot([0, 1], [0, 1], "k--")  # dotted diagonal: random guessing
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
print(roc_auc_score(y_test, y_pred_probs))

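# Use of grid search (assumes `ridge` is an estimator such as Ridge() and `kf` is a KFold splitter)
# np.linspace gives 10 evenly spaced alphas; np.arange(0.001, 1, 10) would yield a single value
param_grid = {"alpha": np.linspace(0.001, 1, 10), "solver": ["sag", "lsqr"]}
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)  # try every combination in param_grid, scored with cross-validation
ridge_cv.fit(X_train, y_train)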

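# Use of RandomizedSearchCV and metrics
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)  # n_iter: number of sampled combinations
ridge_cv.fit(X_train, y_train)  # the search must be fitted before reading best_params_
print(ridge_cv.best_params_, ridge_cv.best_score_)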

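# Params for RandomizedSearchCV with logistic regression and metrics
# Note: the "l1" penalty needs a solver that supports it, e.g. LogisticRegression(solver="liblinear")
params = {"penalty": ["l1", "l2"], "tol": np.linspace(0.0001, 1.0, 50), "C": np.linspace(0.1, 1, 50), "class_weight": ["balanced", {0: 0.8, 1: 0.2}]}
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)
logreg_cv.fit(X_train, y_train)
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))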

These articles are meant to remind you of key concepts during your quest to become a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for posts, and feedback.

Have a nice day!
