Preprocessing and Pipelines


Thomas Bustos

Data Scientist | Data Engineer | ML Engineer

This article is about preprocessing data and building pipelines. For a better understanding, it is recommended to read this article first: Fine-Tuning Your Model (Classification Metrics, Logistic Regression, Cross Validation, Python Imports & Functions).

How to preprocess data?

Scikit-learn requires numeric data and no missing values. In the real world this is rarely the case, so we need to preprocess the data. Categorical values are converted into numeric features called dummy variables, where 0 means an observation does not belong to a category and 1 means it does. To create dummy variables we can use OneHotEncoder() from scikit-learn or get_dummies() from pandas. Once we have dummy variables, we can build a model: split the dataset (train_test_split), separate it into k folds, load our model and use cross_val_score (we can set scoring equal to "neg_mean_squared_error", which returns the negative MSE; scikit-learn's cross-validation metrics assume that a higher score is better, so MSE is negated to respect this convention). We can then calculate the training RMSE by taking the square root and converting back to positive. Remember, to get a first idea of a model's performance you can compare the RMSE against the standard deviation of the target feature: if the RMSE is lower, the model is reasonably accurate.
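As a minimal sketch of the OneHotEncoder route (pd.get_dummies is shown in the Key Functions section below) and of the RMSE vs. standard-deviation sanity check, with made-up column names and values:

# OneHotEncoder alternative to pd.get_dummies (hypothetical data)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"categorical_variable": ["red", "blue", "red", "green"],
                   "target_variable": [10.0, 12.5, 9.8, 11.1]})
enc = OneHotEncoder(drop="first")  # drop one dummy column, like drop_first=True in pandas
dummies = enc.fit_transform(df[["categorical_variable"]]).toarray()
print(enc.get_feature_names_out(), dummies.shape)  # names and shape of the created dummy columns

# First idea of performance: compare the cross-validated RMSE with the target's spread
y = df["target_variable"].values
rmse = 1.0  # placeholder, e.g. np.sqrt(-cross_val_score(...)).mean()
print("reasonably accurate" if rmse < y.std() else "worse than the baseline spread")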

 

How to handle missing values?

When dealing with datasets in the real world, we will face missing data. A common approach is to remove missing observations when they account for less than 5% of all the data. Another way to deal with this is to impute values: we can use domain expertise to replace missing data with educated guesses, or use the mean, the median or another value. For categorical values, we typically use the most frequent value, the mode. It is important to impute AFTER splitting the data to avoid data leakage (leaking test-set information to the model). To apply this in Python we can use SimpleImputer from scikit-learn. The first step is to separate the numeric data, the categorical data and the target feature (two train_test_split calls, one for the categorical values and one for the numerical values). Imputers are known as transformers. We can also impute within a pipeline using Pipeline from sklearn: to build a pipeline we construct a list of steps, where each step is a tuple containing the step name specified as a string and an instantiated transformer or model.
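As a complement to the step-by-step code in the Key Functions section below (which imputes the numeric and categorical parts separately and recombines them with np.append), here is a hedged sketch of expressing the same split inside a single pipeline with ColumnTransformer; the column names are hypothetical:

# Imputation for numeric and categorical columns inside one pipeline (hypothetical column names)
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

num_cols = ["feature1", "feature2"]  # numeric columns (assumed)
cat_cols = ["categorical_var"]       # categorical column (assumed)

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_cols),  # mean imputation for numeric data
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),  # mode for categorical data
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("logreg", LogisticRegression())])
# pipeline.fit(X_train, y_train)  # imputation is fitted on the training data only, so nothing leaks from the test set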

 

How to center and scale the data?

Data imputation is one of several important preprocessing steps for machine learning. Let's now look at centering and scaling the data. The ranges of the features in our dataset often vary widely, and features on larger scales can disproportionately influence the model; models like KNN use distance explicitly when making predictions. This is why we want features to be on a similar scale, which we achieve by normalizing or standardizing (scaling and centering). To standardize the data we subtract the mean and divide by the standard deviation so that all features are centered around zero and have a variance of one. We can also subtract the minimum and divide by the range of the data so that the normalized dataset has a minimum of zero and a maximum of one. Alternatively, we can rescale the data so that it ranges from -1 to 1 instead. See the scikit-learn docs for further details.
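A quick sketch contrasting these three options on a made-up feature:

# Standardization vs. normalization on a toy column
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])
standardized = StandardScaler().fit_transform(X)                 # mean 0, variance 1
normalized = MinMaxScaler().fit_transform(X)                     # min 0, max 1
rescaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # rescaled to [-1, 1]
print(standardized.mean(), standardized.std())  # ~0.0, 1.0
print(normalized.min(), normalized.max())       # 0.0, 1.0
print(rescaled.min(), rescaled.max())           # -1.0, 1.0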

 

How do we decide which model to use in the first place?

It is clear that we will use different models based on the type of problem. The three main considerations are the size of the dataset, interpretability and flexibility.

Size of the dataset: The fewer the features, the simpler the model and the faster it trains. Some models also require large amounts of data to perform well.

Interpretability: Some models are easier to explain, which can be important for stakeholders.

Flexibility: A more flexible model may improve accuracy by making fewer assumptions about the data. For example, a KNN model is more flexible as it doesn't assume any linear relationships.

Regression model performance will mostly be evaluated with RMSE and R-squared (see the short sketch at the end of this section).

For classification model performance we'll use accuracy, the confusion matrix, precision, recall, F1-score and ROC AUC (see this post: Fine-Tuning Your Model (Classification Metrics, Logistic Regression, Cross Validation, Python Imports & Functions)).

One approach is therefore to select several models and a metric, then evaluate their performance without any form of hyperparameter tuning. Recall that the performance of some models, such as KNN, Linear Regression (plus Ridge and Lasso), Logistic Regression and Artificial Neural Networks, is affected by the scale of the data, so it is generally best to scale our data before evaluating models out of the box.
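As a quick illustration of the regression metrics mentioned above, here is a minimal sketch with made-up predictions (the classification workflow with scaling is covered in the Key Functions section below):

# RMSE and R-squared on hypothetical test predictions
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_test = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error
r2 = r2_score(y_test, y_pred)                       # proportion of variance explained
print(rmse, r2)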

 

Key Imports in Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

Key Functions in Python

# Dummy variables
dummies_dataset = pd.get_dummies(df["categorical_variable"], drop_first=True)
dataset_with_new_columns = pd.concat([df, dummies_dataset], axis=1)
dataset_with_new_columns = dataset_with_new_columns.drop("categorical_variable", axis=1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # k folds for cross-validation
model_cv = cross_val_score(model, X_train, y_train, cv=kf, scoring="neg_mean_squared_error")  # model is any instantiated regressor
print(np.sqrt(-model_cv))  # convert the negative MSE of each fold back to a positive RMSE

print(df.isna().sum().sort_values())  # number of missing values per feature
df = df.dropna(subset=["feature1", "feature2", "feature3"])  # drop rows with missing values in the selected columns only

X_cat = df["categorical_var"].values.reshape(-1, 1)
X_num = df.drop(["categorical_var", "target_variable"], axis=1).values
y = df["target_variable"].values
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2, random_state=12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=12)

# SimpleImputer for categorical data:
imp_cat = SimpleImputer(strategy="most_frequent")  # by default SimpleImputer expects np.nan to represent missing values
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat)

# SimpleImputer for numeric data:
imp_num = SimpleImputer()  # default strategy is "mean"
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# recombine the imputed numeric and categorical features for the training and test sets
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)

# Imputing with a Pipeline
df = df.dropna(subset=["feature1", "feature2", "feature3"])
df["categorical_variable"] = np.where(df["categorical_variable"] == "value", 1, 0)  # binary-encode the categorical column

steps = [("imputation", SimpleImputer()), ("logistic_regression", LogisticRegression())]
pipeline = Pipeline(steps) #then fit the model using pipeline

# StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use the scaler fitted on the training data

# Scaling in a pipeline
steps = [("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test))  # mean accuracy on the test set (classifier score, not R-squared)

# Cross-validation and scaling in a pipeline
steps = [("scaler", StandardScaler()), "knn", KNeighborsClassifier())]
pipeline = Pipeline(steps)
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train) 
y_pred = cv.predict(X_test)
print(cv.best_score_, cv.best_params_)

# Evaluating classification models
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(),
          "Decision Tree": DecisionTreeClassifier()}
results = []
for model in models.values():
    kf = KFold(n_splits=6, random_state=42, shuffle=True)
    cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)
plt.boxplot(results, labels=models.keys())
plt.show()
for name, model in models.items():
    model.fit(X_train_scaled, y_train)  # fit on the scaled training data to match the scaled test set
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_score))

# Putting it together: imputation, scaling and hyperparameter tuning in one pipeline

# Create steps
steps = [("imp_mean", SimpleImputer()), 
         ("scaler", StandardScaler()), 
         ("logreg", LogisticRegression())]

# Set up pipeline
pipeline = Pipeline(steps)
params = {"logreg__solver": ["newton-cg", "saga", "lbfgs"],
          "logreg__C": np.linspace(0.001, 1.0, 10)}

# Create the GridSearchCV object
tuning = GridSearchCV(pipeline, param_grid=params)
tuning.fit(X_train, y_train)
y_pred = tuning.predict(X_test)

# Compute and print performance
print("Tuned Logistic Regression Parameters: {}, Accuracy: {}".format(tuning.best_params_, tuning.score(X_test, y_test)))

These articles are meant to remind you of key concepts during your quest to become a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for future posts, and feedback.

Have a nice day!
