Cluster Analysis in R (Clustering)

Clustering is a way of dividing data points that share attributes into groups (clusters). Clustering falls under unsupervised learning: a data set is provided without labels, and the algorithm tries to assign each data point to a group based on how similar or dissimilar the points are to each other.

Example use cases

  • Natural Language Processing (NLP)
  • Computer vision
  • Customer/Market segmentation
  • Stock markets

Let us try to understand clustering with a graphical illustration.

[Figure: R cluster analysis illustration]

The above 2D graph shows how data points are grouped into four distinct clusters.

We are going to discuss two major types of clustering in R:

  1. Hierarchical clustering
  2. k-means clustering

Hierarchical clustering in R

Hierarchical clustering separates data points into different groups based on some measure of similarity.


For this example we will use the popular iris data set in R. The iris data set contains the measurements in centimeters of the variables sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.

Ideally, because of the presence of a label this data set falls in the supervised category, but for the sake of this lesson I will remove the label variable, making the data set suitable for unsupervised learning.


  1. Load the data set
  2. Create a scatter plot
  3. Normalize the data
  4. Calculate Euclidean distance
  5. Create a dendrogram


Load the data set

iris_data <- iris

head(iris_data) # View first 6 observations
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now let’s remove the Species label variable:

iris_data$Species <- NULL


Create a scatter plot

Sepal.Width VS Sepal.Length


library(ggplot2) # load ggplot2 for plotting

ggplot(iris_data, aes(Sepal.Width, Sepal.Length)) + geom_point()
[Figure: scatter plot of Sepal.Width vs Sepal.Length]

Petal.Width VS Petal.Length

ggplot(iris_data, aes(Petal.Width, Petal.Length)) + geom_point()
[Figure: scatter plot of Petal.Width vs Petal.Length]


Normalize the data

iris_data <- scale(iris_data)

head(iris_data) # View first 6 observations of the normalized data
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
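As a quick sanity check (a small sketch added here, not part of the original walkthrough), scale() standardizes each column to mean 0 and standard deviation 1, which we can verify directly:

```r
# Each column of the scaled matrix should have mean ~0 and sd exactly 1
round(colMeans(iris_data), 10)
apply(iris_data, 2, sd)
```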


Calculate Euclidean distance

Eu_dist <- dist(iris_data, method = "euclidean")
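For intuition, the Euclidean distance between two observations is the square root of the sum of squared differences across the four measurements. Here is an illustrative check (not in the original) comparing the dist() result for the first two flowers against a manual calculation:

```r
# Manual Euclidean distance between observations 1 and 2
manual <- sqrt(sum((iris_data[1, ] - iris_data[2, ])^2))

# Compare with the corresponding entry of the dist object
abs(as.matrix(Eu_dist)[1, 2] - manual) < 1e-12
```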

Compute hierarchical clustering using the hclust function in R:

h_clust <- hclust(Eu_dist) # uses the default "complete" linkage method


Create a dendrogram

We can use the color_branches function from the dendextend package to color the branches according to the number of groups/clusters we would like to see.


library(dendextend) # provides color_branches

h_clust$labels <- " " # blank out the leaf labels so the dendrogram stays readable

plot(color_branches(as.dendrogram(h_clust), k = 3))
[Figure: dendrogram showing the clustered data]

View how many observations were assigned to each cluster:

table(cutree(h_clust, k = 3))
##  1  2  3 
## 49 24 77
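To see which species the misassigned points came from, we can cross-tabulate the cluster assignments against the Species column of the original iris data (a diagnostic sketch; the cluster numbers themselves are arbitrary):

```r
# Rows are the hierarchical clusters, columns are the true species
table(cluster = cutree(h_clust, k = 3), species = iris$Species)
```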

Since our original data set had three classes of Species (50 observations each), we would expect a good model to divide the data evenly into three distinct clusters. The hierarchical clustering model above assigned a few data points to the wrong groups.

k-means clustering in R

k-means clustering is a cluster analysis technique similar to the hierarchical clustering discussed above. The main difference is that in k-means the number of groups/clusters must be specified before building the model, whereas hierarchical clustering requires no prior knowledge of the number of clusters.
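When the number of clusters is not known in advance, a common heuristic (not covered in the original example) is the elbow method: fit k-means for a range of k values and look for the bend in the total within-cluster sum of squares.

```r
# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(iris_data, centers = k, nstart = 20)$tot.withinss
})

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For the iris data the curve flattens noticeably around k = 3, which matches the three species.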

Build model

kmeans_model <- kmeans(x = iris_data, centers=3, nstart = 20)

Let’s walk through the code above:

  • The kmeans function is a base R function for computing k-means clustering; it takes several arguments, three of which were used above
    • the x argument is the data (a matrix or data frame) to cluster
    • the centers argument is the number of clusters/groups that we want
    • the nstart argument is the number of random starting configurations to try; the best result is kept
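To illustrate why nstart matters (an illustrative sketch, not part of the original walkthrough): k-means can converge to a poor local optimum from an unlucky random start, and nstart = 20 keeps the best of 20 random initializations.

```r
set.seed(123) # for reproducibility

one_start  <- kmeans(iris_data, centers = 3, nstart = 1)
many_start <- kmeans(iris_data, centers = 3, nstart = 20)

# The multi-start fit is typically at least as good (lower tot.withinss)
c(one_start$tot.withinss, many_start$tot.withinss)
```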

View summary of the model

summary(kmeans_model)

##              Length Class  Mode   
## cluster      150    -none- numeric
## centers       12    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

See how well the data points were clustered using the Sepal.Length and Sepal.Width variables:

plot(iris$Sepal.Length, iris$Sepal.Width, col = kmeans_model$cluster)
[Figure: scatter plot showing three clusters of Sepal.Width vs Sepal.Length]

The above plot shows that the data points were decently clustered.
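As with the hierarchical model, we can cross-tabulate the k-means assignments against the true species to quantify how decent the clustering is (a diagnostic sketch added here):

```r
# Rows are the k-means clusters, columns are the true species
table(cluster = kmeans_model$cluster, species = iris$Species)
```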

Anthony Aigboje Akhonokhue

A statistician/data-analyst/data-scientist for hire. See my Resume for more info.
