Welcome to this article about clustering for dataset exploration. We will soon post another article on visualization with hierarchical clustering and t-SNE, so stay tuned! And remember: at the end of our articles you can find the key imports and functions in Python for the topic.
What is Unsupervised Learning?
Unsupervised learning is a class of machine learning techniques for discovering patterns in data: for instance, finding the natural “clusters” of customers based on their purchase histories, or searching for patterns and correlations among these purchases and using them to express the data in a compressed form (e.g. clustering or dimensionality reduction). Unsupervised learning is defined in opposition to supervised learning: it is learning without labels, pure pattern discovery unguided by a prediction task. In this article we will look at K-means clustering.
What’s K-Means?
K-Means is an unsupervised ML model that finds a specified number of clusters in the samples. It’s implemented in the scikit-learn (“sklearn”) library. After we’ve trained a K-Means model, we can use it on new samples: K-Means remembers the mean of each cluster (the “centroids”) and assigns each new sample to the cluster whose centroid is nearest. We can use scatter plots to visualize clusters, where each point represents a sample, coloured by its cluster label, using the pyplot module of matplotlib.
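Here is a minimal sketch of that workflow, assuming synthetic 2-D data from scikit-learn’s make_blobs stands in for your real samples:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # stand-in data, assumed for illustration

# Generate 2-D toy samples with 3 natural clusters
samples, _ = make_blobs(n_samples=300, centers=3, random_state=42)

model = KMeans(n_clusters=3)      # ask for 3 clusters
model.fit(samples)                # learn the centroids
labels = model.predict(samples)   # assign each sample to a cluster

# Colour each point by its cluster label and mark the centroids with diamonds
plt.scatter(samples[:, 0], samples[:, 1], c=labels, alpha=0.5)
centroids = model.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D', s=50)
plt.show()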
How to evaluate the quality of the clustering, with and without label information?
To evaluate the quality of a clustering when the true labels are known, we can use cross tabulation with pandas. Cross tabulations provide great insight into which sort of samples ends up in which cluster. But in most datasets the samples are not labelled at all, so how do we evaluate a clustering without any labels?
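As an illustration of the labelled case (a sketch assuming the classic iris dataset, where the true species are known), a cross tabulation can be built like this:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris  # labelled dataset, assumed here just to illustrate

iris = load_iris()
samples = iris.data
species = [iris.target_names[t] for t in iris.target]

model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)  # fit and assign cluster labels in one step

# Rows = cluster labels, columns = true species; ideally each cluster maps to one species
df = pd.DataFrame({'labels': labels, 'species': species})
ct = pd.crosstab(df['labels'], df['species'])
print(ct)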
We need a way to measure the quality of a clustering that uses only the clusters and the samples themselves. A good clustering has tight clusters, meaning that the samples in each cluster are bunched together, not spread out. How spread out the samples within each cluster are is measured by the “inertia”: intuitively, it measures how far the samples are from their centroids, so lower values are better. The precise definition can be found in the scikit-learn documentation. The inertia of a k-means model is computed automatically when any of the fit methods is called, and is available afterwards as the inertia_ attribute. In fact, k-means aims to place the clusters in a way that minimizes the inertia. We can plot the inertia against different numbers of clusters to compare them.
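For instance (again with make_blobs as a stand-in for real data), the inertia is available right after fitting:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # stand-in data, an assumption

samples, _ = make_blobs(n_samples=300, centers=3, random_state=42)
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)  # lower inertia means tighter clusters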
What is the best number of clusters?
Remember that a good clustering has tight clusters (so low inertia), but not too many clusters. A good rule of thumb is to choose an elbow in the inertia plot, which is a point where the inertia begins to decrease more slowly.
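A common way to find that elbow (a sketch, with samples again assumed to be the stand-in array from above) is to fit one model per candidate number of clusters and plot the inertias:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # stand-in data, an assumption

samples, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k)
    model.fit(samples)
    inertias.append(model.inertia_)  # inertia for this number of clusters

plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.show()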
How to transform features for a better clustering?
Most of the time the data will have characteristics that aren’t optimized for models, for example features with very different variances (the spread of the values). You can see the fundamentals of statistics in the following blogpost: Fundamentals Of Statistics With Code And CheatSheet In Python. To give every feature a chance, the data needs to be transformed so that the features have equal variance. This can be achieved with the StandardScaler from scikit-learn, which transforms every feature to have mean 0 and variance 1. The resulting “standardized” features can be very informative. The APIs of StandardScaler and KMeans are similar, but there is an important difference: StandardScaler transforms the data, and so has a transform method, whereas KMeans assigns cluster labels to samples, which is done using the predict method. So in practice we standardize the data and THEN cluster it using KMeans. To combine both, as we’ve seen in the Preprocessing And Pipelines With Python Code article, we can use the pipeline module from sklearn. Other examples of the “preprocessing” step are MaxAbsScaler and Normalizer.
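Putting the two steps together (a sketch, again with make_blobs standing in for real data), standardization and clustering can be chained in a single pipeline:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # stand-in data, an assumption
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

samples, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# The scaler rescales every feature to mean 0 / variance 1;
# KMeans then clusters the standardized features
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3))
pipeline.fit(samples)                # fits the scaler, transforms, then fits KMeans
labels = pipeline.predict(samples)   # scales the samples, then assigns cluster labels
print(labels[:10])

Calling fit on the pipeline fits the scaler and the clusterer in sequence, so you never risk clustering unscaled data by accident.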
Key Imports in Python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
Key Functions in Python
# Train a KMeans model (samples is a 2D array: rows = samples, columns = features)
model = KMeans(n_clusters=3)
model.fit(samples)
labels = model.predict(samples)
print(labels)
new_labels = model.predict(new_samples)  # assign new samples to the existing clusters

# Plot the clustering with pyplot
xs = new_samples[:, 0]
ys = new_samples[:, 1]
plt.scatter(xs, ys, c=new_labels, alpha=0.5)  # alpha is for transparency
centroids = model.cluster_centers_
centroids_x = centroids[:, 0]
centroids_y = centroids[:, 1]
plt.scatter(centroids_x, centroids_y, marker='D', s=50)  # 'D' is a diamond marker, s is the size
plt.show()

# Align cluster labels and species
df = pd.DataFrame({'labels': labels, 'species': species})
print(df.head())

# Crosstab of labels and species
ct = pd.crosstab(df['labels'], df['species'])
print(ct)

# Inertia measures clustering quality (lower is better)
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)

# sklearn StandardScaler: mean 0, variance 1 for every feature
scaler = StandardScaler()
scaler.fit(samples)
samples_scaled = scaler.transform(samples)

# Combine standardization with KMeans in a pipeline
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
labels = pipeline.predict(samples)
These articles are made to remind you of key concepts during your quest to become a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for future posts, and feedback.
Have a nice day!