An important part of a data scientist's job is communicating insights to other people. Visualizations are an excellent way to share your findings, particularly with a non-technical audience. We’ll cover two unsupervised learning techniques for visualization: t-SNE and hierarchical clustering. It is recommended to read this article first: Clustering For Dataset Exploration With Python Code (Unsupervised Learning, K-Mean, Evaluate and Transform).
t-SNE: Creates a 2D map of any dataset and conveys useful information about the proximity of the samples to one another.
Hierarchical clustering: arranges samples into a hierarchy of clusters.
Hierarchical Clustering
Agglomerative hierarchical clustering starts with each sample in its own cluster, then repeatedly merges the closest clusters into bigger ones until only one big cluster remains. Divisive clustering works the other way around, starting from one cluster and splitting it.
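As a minimal sketch of the agglomerative process, here is SciPy's linkage function on a hypothetical toy dataset of four 2D points (the data is invented for illustration). Each row of the returned array records one merge, from the first (closest pair) to the last (the single big cluster):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two tight pairs of points, far apart from each other.
samples = np.array([[0.0, 0.0],
                    [0.0, 1.0],
                    [10.0, 0.0],
                    [10.0, 1.0]])

# Each row of `mergings` describes one agglomerative merge:
# [cluster_i, cluster_j, merge distance, size of the new cluster]
mergings = linkage(samples, method='complete')
print(mergings)
```

With four samples there are exactly three merges, and the last row describes the final cluster containing all four points.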
A dendrogram is read from the bottom up; vertical lines represent clusters.
Hierarchical clustering is not only a visualization tool. We can recover the cluster labels at any intermediate stage and use them, for example, in cross-tabulations. To do so, we choose a height on the dendrogram.
The y-axis of the dendrogram encodes the distance between merging clusters. The height on the dendrogram specifies the maximum distance between merging clusters: hierarchical clustering stops merging when all clusters are at least this far apart. The distance between two clusters is measured using a “linkage method”. We can use “complete” linkage, where the distance between clusters is the maximum of the distances between their samples. This is specified by the method parameter. There are many other linkage methods, and these different methods give different hierarchical clusterings. Note that SciPy cluster labels start at 1, not at 0 as they do in scikit-learn.
In complete linkage, the distance between clusters is the distance between the furthest points of the clusters. In single linkage, the distance between clusters is the distance between the closest points of the clusters.
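The difference between the two linkage methods can be seen on a hypothetical one-dimensional example (three points at 0, 2 and 3, invented for illustration). Both methods first merge the closest pair {2, 3}, but the distance of the final merge differs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three points on a line: 0, 2 and 3.
samples = np.array([[0.0], [2.0], [3.0]])

# Single linkage: distance between the closest points of the clusters.
# The final merge of {0} with {2, 3} happens at distance 2 (0 to 2).
single = linkage(samples, method='single')

# Complete linkage: distance between the furthest points of the clusters.
# The final merge of {0} with {2, 3} happens at distance 3 (0 to 3).
complete = linkage(samples, method='complete')

print(single[-1, 2])    # 2.0
print(complete[-1, 2])  # 3.0
```

On larger datasets this difference changes the shape of the dendrogram and, therefore, the cluster labels extracted at a given height.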
t-SNE
t-SNE stands for “t-distributed stochastic neighbor embedding”. It maps samples from their high-dimensional space into a 2- or 3-dimensional space so they can be visualized. While some distortion is inevitable, t-SNE does a great job of approximately representing the distances between the samples. For this reason, t-SNE is an invaluable visual aid for understanding a dataset. For t-SNE, the samples are in a two-dimensional array and there is a list giving the species of each sample.
There are two aspects that deserve special attention: the fit_transform method and the learning rate. The fit_transform method simultaneously fits the model and transforms the data. However, t-SNE does not have separate fit and transform methods. This means you can’t extend a t-SNE map to include new samples; instead, you have to start over each time. The second thing to notice is the learning rate, which makes the use of t-SNE more complicated than some other techniques. You may need to try different learning rates for different datasets. It is clear when you’ve made a bad choice, however, because all the samples appear bunched together in the scatter plot. Normally it’s enough to try a few values between 50 and 200.
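Both points can be sketched in a few lines, here on a hypothetical synthetic dataset of two Gaussian blobs (the data and the choice of learning rates are invented for illustration). Each learning rate requires a fresh call to fit_transform, since a fitted t-SNE model cannot embed new samples:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Synthetic dataset: 60 samples with 5 features, forming two blobs.
samples = np.vstack([rng.normal(0, 1, (30, 5)),
                     rng.normal(5, 1, (30, 5))])

# t-SNE must be refit from scratch for every setting: there is no
# separate transform() to apply an existing map to new samples.
for lr in [50, 100, 200]:
    model = TSNE(learning_rate=lr, random_state=1)
    embedding = model.fit_transform(samples)
    print(lr, embedding.shape)  # each run produces a fresh 2D map
```

Plotting each embedding side by side makes it easy to spot a bad learning rate, where all the samples collapse into one clump.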
Key Imports in Python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import pandas as pd
from sklearn.manifold import TSNE
Key Functions in Python
# Hierarchical clustering with SciPy
mergings = linkage(samples, method='complete')
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()

# Extracting cluster labels using fcluster
mergings = linkage(samples, method='complete')
labels = fcluster(mergings, 15, criterion='distance')
print(labels)
pairs = pd.DataFrame({'labels': labels, 'countries': country_names})
print(pairs.sort_values('labels'))
print(pd.crosstab(pairs['labels'], pairs['countries']))

# t-SNE implementation
model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:, 0]
ys = transformed[:, 1]
plt.scatter(xs, ys, c=species)
plt.show()
These articles are meant to remind you of key concepts on your quest to become a GREAT data scientist ;). Feel free to share your thoughts on the article, ideas for posts, and feedback.
Have a nice day!