Chapter 9 - Clustering#

Why is clustering analysis important?#

Clustering is a common technique in systems biology and bioinformatics that is used to group together biological entities that have similar characteristics or functions. This can be useful for identifying patterns and relationships within large datasets, and for making predictions about the behavior of biological systems. Clustering can help researchers to understand the underlying mechanisms of biological processes, and to identify potential targets for therapeutic interventions. It can also be used to identify subpopulations within a population of cells, which can be useful for understanding the diversity and complexity of biological systems.

What is clustering analysis?#

Clustering analysis is a statistical method used in bioinformatics to group together biological entities that have similar characteristics or functions. This is typically done using algorithms that can analyze large datasets and identify patterns and relationships within the data. Clustering analysis can be used to identify subpopulations within a population of cells, or to group together genes or proteins that have similar functions. This can help researchers to understand the underlying mechanisms of biological processes, and to identify potential targets for therapeutic interventions. Clustering analysis can be applied to many different types of data, including gene expression data, protein sequence data, and imaging data.

Clustering algorithms#

Some of the most common clustering algorithms include k-means clustering, hierarchical clustering, Gaussian mixture model clustering, spectral clustering, and affinity propagation.

K-means clustering is a simple and widely used clustering algorithm that groups data points into a specified number of clusters based on their similarity. This algorithm works by iteratively assigning each data point to the cluster whose center it is closest to, and then recalculating the center of each cluster to better reflect the data points that are assigned to it.

Hierarchical clustering is a type of clustering algorithm that creates a hierarchy of clusters, with each cluster being composed of smaller subclusters. This algorithm can be used to identify clusters at different levels of granularity and can handle data with varying degrees of similarity between points.

Gaussian mixture model (GMM) clustering is a probabilistic clustering algorithm that assumes the data are generated from a mixture of Gaussian distributions, with each cluster corresponding to one Gaussian component. This allows the algorithm to model overlapping and elliptical clusters of different sizes and orientations, and to quantify the uncertainty of each cluster assignment.
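As a minimal sketch of GMM clustering, the following uses scikit-learn's GaussianMixture on a small synthetic dataset (the data and parameter values here are illustrative, not taken from this chapter):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated groups of points (illustrative data):
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [8, 8], [8, 9], [9, 8], [9, 9]])

# Fit a mixture of two Gaussian components:
gmm = GaussianMixture(n_components=2, random_state=0)
labels = gmm.fit_predict(X)

# Unlike hard clustering, GMM also reports the probability that each
# point belongs to each component:
probs = gmm.predict_proba(X)
```

Because the model is probabilistic, `predict_proba` exposes the assignment uncertainty that k-means cannot express.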

Spectral clustering is a graph-based clustering algorithm that uses the eigenvectors of a similarity matrix to identify clusters in the data. This algorithm can handle data with complex structures and can identify clusters that are not well-separated in the original space.
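A minimal sketch of spectral clustering with scikit-learn's SpectralClustering, again on illustrative synthetic data; the RBF affinity shown here is one common choice for building the similarity matrix:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Two well-separated groups of points (illustrative data):
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [8, 8], [8, 9], [9, 8], [9, 9]])

# Build an RBF similarity matrix internally, then cluster the data
# using the eigenvectors of the resulting graph Laplacian:
sc = SpectralClustering(n_clusters=2, affinity='rbf', random_state=0)
labels = sc.fit_predict(X)
```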

Affinity propagation is a clustering algorithm that uses messages passed between data points to identify clusters in the data. This algorithm can handle data with many different types of patterns and relationships and can identify clusters of different sizes and shapes.
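A minimal sketch of affinity propagation with scikit-learn's AffinityPropagation (illustrative data; note that, unlike k-means, the number of clusters is not specified in advance):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Two well-separated groups of points (illustrative data):
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [8, 8], [8, 9], [9, 8], [9, 9]])

# The number of clusters emerges from the message passing between points:
ap = AffinityPropagation(random_state=0)
labels = ap.fit_predict(X)

# Each cluster is represented by an "exemplar", an actual data point:
exemplars = ap.cluster_centers_indices_
```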

Each of these algorithms has its own strengths and limitations, and the choice of which algorithm to use will depend on the specific data and the goals of the analysis.

The steps of the hierarchical clustering algorithm#

The steps involved in hierarchical clustering are as follows:

  1. The algorithm starts by considering each data point as a separate cluster.

  2. It then calculates the distance between each pair of clusters, using a metric such as Euclidean distance.

  3. The algorithm then merges the two closest clusters together, based on the calculated distances, to form a new, larger cluster.

  4. This process is repeated until all the data points are grouped into a single cluster, or until some other stopping criterion is met.

  5. The resulting hierarchy of clusters can be represented using a dendrogram, which shows the sequence of cluster merges and the distances between the clusters at each step.

  6. The dendrogram can then be used to determine the optimal number of clusters, based on the desired level of granularity and the structure of the data.

Overall, hierarchical clustering is a simple and effective way to group together data points that have similar characteristics or functions. It can be used to identify patterns and relationships within large datasets, and to understand the underlying mechanisms of biological processes.

The steps of the k-means clustering algorithm#

The steps involved in k-means clustering are as follows:

  1. The algorithm starts by selecting a specified number of data points at random, called the “cluster centers” or “centroids”.

  2. It then assigns each data point to the cluster whose center is closest to it, based on a metric such as Euclidean distance.

  3. The algorithm then calculates the new center of each cluster, based on the data points that are assigned to it.

  4. This process is repeated until the cluster centers converge, or until some other stopping criterion is met.

  5. The resulting clusters can be used to group together data points with similar characteristics or functions.

Overall, k-means clustering is a simple and widely used clustering algorithm that can be applied to many different types of data. It is particularly useful for identifying well-defined and well-separated clusters in the data. However, it can be sensitive to the initial selection of cluster centers and may not work well for data with complex or non-linear structures.

Implementing hierarchical clustering in Python#

The following Python code demonstrates how to perform hierarchical clustering using SciPy:


from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Example data:
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]

# Perform hierarchical clustering using the single linkage method:
Z = linkage(X, method='single')

# Plot the dendrogram:
plt.figure()
dendrogram(Z)
plt.show()

This will generate a dendrogram that shows the hierarchy of the clusters. The y-axis represents the distance between the two clusters being merged, and the x-axis represents the data points.

You can also use other linkage methods, such as average, complete, or ward, by specifying the method parameter. The choice of linkage method will depend on the characteristics of your data and the desired properties of the clusters: for example, single linkage tends to produce elongated, chained clusters, while complete and ward linkage favor compact ones.
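The dendrogram shows the full merge hierarchy; to obtain a flat cluster assignment, you can cut the tree with SciPy's fcluster. A sketch, reusing the example data above (the cut distance of 1.2 is an illustrative choice):

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Same example data as above:
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]
Z = linkage(X, method='single')

# Cut the tree at distance 1.2: merges above this height are undone,
# and every remaining subtree becomes one flat cluster:
labels = fcluster(Z, t=1.2, criterion='distance')
print(labels)
```

With single linkage, the four unit-spaced points merge at distance 1 and so stay together, while the two outlying points only join at larger distances and end up as singleton clusters.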

Implementing k-means clustering in Python#

The following Python code demonstrates how to perform k-means clustering using scikit-learn:


from sklearn.cluster import KMeans

# Example data:
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]

# Initialize the KMeans model:
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Fit the model to the data:
kmeans.fit(X)

# Predict the cluster labels for the data:
labels = kmeans.predict(X)

# Print the cluster labels:
print(labels)

This will print the cluster labels for each data point, e.g. [0 0 0 0 1 1], indicating that the first four data points fall in one cluster and the last two in another. Note that the numeric labels themselves are arbitrary: a different initialization may swap 0 and 1 while producing the same grouping.

You can also specify other parameters of the KMeans model, such as the init parameter, which determines the method for initializing the cluster centers (e.g. 'k-means++' or 'random'), or the n_init parameter, which determines the number of times the KMeans algorithm will be run with different centroid seeds.
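A short sketch combining these parameters (the specific values are illustrative):

```python
from sklearn.cluster import KMeans

# Same example data as above:
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [3, 3]]

# 'k-means++' spreads the initial centroids apart; n_init=10 runs the
# algorithm ten times with different seeds and keeps the best result;
# random_state makes the run reproducible:
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

Running the algorithm several times and keeping the lowest-inertia solution is the standard way to reduce the sensitivity to initialization noted above.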

Clustergrammer#

Clustergrammer is a web-based tool developed by the Ma’ayan Lab for visualizing and analyzing clustering results in systems biology and bioinformatics. It is designed to help researchers to explore and understand the relationships and patterns within large datasets of biological entities, such as genes or proteins. Clustergrammer uses interactive heatmap visualizations to show the clustering results and provides a range of tools and features for filtering, zooming, and comparing the data. It is available as a standalone tool, or as a library that can be integrated into other software applications.

Authors: Avi Ma’ayan, ChatGPT, Heesu Kim

Maintainers: Avi Ma’ayan, Heesu Kim

Version: 0.1

License: CC-BY-NC-SA 4.0