Linear regression minimizes an error function (also called a loss function or cost function) by fitting a coefficient a for each feature variable, plus an intercept b.
We calculate the residual sum of squares (RSS) so that positive and negative residuals do not cancel each other out. This type of linear regression is called Ordinary Least Squares (OLS), and it minimizes the RSS.
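As a quick illustration, here is a minimal sketch of how the RSS that OLS minimizes can be computed from a fitted model. The data here is a small hypothetical toy set, used only to make the example runnable:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature with a roughly linear relationship to y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# Residual sum of squares: squaring keeps positive and negative
# residuals from cancelling each other out
rss = np.sum((y - y_pred) ** 2)
print(reg.coef_, reg.intercept_, rss)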
R-squared quantifies the proportion of the variance in the target values that is explained by the features. We can use the .score method in Python to see this metric. Another way to assess linear regression performance is the mean squared error (MSE), which is measured in squared target units, and the root mean squared error (RMSE), which is back in the original target units. We can import mean_squared_error from sklearn.metrics.
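Here is a minimal sketch of both metrics on the same hypothetical toy data as above, just to show where they come from:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical toy data, for illustration only
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

r_squared = reg.score(X, y)          # share of variance in y explained by the model
mse = mean_squared_error(y, y_pred)  # in squared target units
rmse = mean_squared_error(y, y_pred, squared=False)  # RMSE, back in target units
# Note: newer scikit-learn versions also provide a separate root_mean_squared_error function
print(r_squared, mse, rmse)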
Because model performance depends on how we split up the data, a single split is not representative of the model's ability to generalize to unseen data, so we use cross-validation. We split the dataset into k folds and, on each run, use one fold as the test data and the others as training data, computing the metric of interest on each test fold. This process is called k-fold cross-validation (CV). The more folds we set, the more computationally expensive it gets, because the model is fitted and evaluated more times.
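A minimal cross-validation sketch, using hypothetical synthetic data from make_regression only so the example runs end to end:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Hypothetical synthetic regression data
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)

kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()

# One R-squared score per fold: each fold is used exactly once as the test set
cv_results = cross_val_score(reg, X, y, cv=kf)
print(cv_results)
print(np.mean(cv_results), np.std(cv_results))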
Regularization is used to penalize large coefficients and avoid overfitting. Ridge regression penalizes models for coefficients with large positive or negative values. Picking alpha in ridge is similar to picking k in KNN: alpha controls model complexity. When alpha equals 0 we are performing OLS, while a high alpha means that large coefficients are significantly penalized, which can lead to underfitting.
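A short sketch of how alpha changes ridge performance, again on hypothetical synthetic data (the alpha values are arbitrary examples):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression data
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# alpha close to 0 behaves like OLS; larger alphas shrink the coefficients more
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    print(alpha, ridge.score(X_test, y_test))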
Lasso regression is another type of regularized regression, where the loss function is the OLS loss function plus the absolute value of each coefficient multiplied by some constant alpha. Lasso regression can be used to assess feature importance because it tends to shrink the coefficients of less important features to exactly zero; the features whose coefficients are not shrunk to zero are the ones selected by lasso.
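A minimal sketch of lasso-based feature selection, using hypothetical synthetic data where only some features are informative (the alpha value is an arbitrary example):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: 5 features, only 2 of which actually drive y
X, y = make_regression(n_samples=100, n_features=5, n_informative=2,
                       noise=10, random_state=42)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Coefficients shrunk to exactly zero correspond to features lasso drops
print(lasso.coef_)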
Reminder of useful Imports (Python)
import matplotlib.pyplot as plt
import numpy as np  # needed for np.mean, np.std and np.quantile below
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error
Reminder of useful Functions (Python)
# Basic plotting (reshape is needed when X is a single feature)
X = X.reshape(-1, 1)
plt.scatter(X, y, color="blue")

# Splitting data
X = df.drop("output_column", axis=1).values
y = df["output_column"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Load model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Calculate predictions and metrics
y_pred = reg.predict(X_test)  # compare with y_test to plot the predictions against the real values
r_squared = reg.score(X_test, y_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)  # set squared=False so the square root of the MSE (RMSE) is returned

# Implement cross-validation and metrics to evaluate
kf = KFold(n_splits=6, shuffle=True, random_state=42)  # n_splits defaults to 5; shuffle shuffles the dataset before splitting into folds
cv_results = cross_val_score(reg, X, y, cv=kf)
print(np.mean(cv_results), np.std(cv_results))
print(np.quantile(cv_results, [0.025, 0.975]))

# Plot lasso coefficients (feature importance)
lasso = Lasso(alpha=0.1)  # example alpha value, tune it for your data
lasso_coef = lasso.fit(X, y).coef_
plt.bar(df.drop("output_column", axis=1).columns, lasso_coef)
plt.xticks(rotation=45)
plt.show()
These articles are meant to be useful on your quest to become a GREAT data scientist. Feel free to share your thoughts on the article, ideas for future posts, and feedback.
Have a nice day!