Cluster Analysis in R (Clustering)

Clustering is a way of dividing data points that share attributes into groups (clusters). Clustering falls under unsupervised learning: a data set is provided without labels, and the algorithm tries to assign each data point to a group based on how similar or dissimilar the points are to one another.

Example use cases

  • Natural Language Processing (NLP)
  • Computer vision
  • Customer/Market segmentation
  • Stock markets

Let us try to understand clustering with a graphical illustration.

[Figure: 2D scatter plot of data points grouped into four clusters]

The 2D graph above shows how data points are grouped into four distinct clusters.

We are going to discuss two major types of clustering in R:

  1. Hierarchical clustering
  2. k-means clustering

Hierarchical clustering in R

Hierarchical clustering separates data points into groups based on a measure of similarity, producing a tree of nested clusters (a dendrogram).

EXAMPLE

For this example, we will make use of the popular iris data set in R. The iris data set contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Because it includes a label, this data set would normally fall into the supervised category, but for the sake of this lesson I will remove the label variable, making the data set suitable for unsupervised learning.

STEPS

  1. Load the data set
  2. Create a scatter plot
  3. Normalize the data
  4. Calculate Euclidean distance
  5. Create a dendrogram

STEP 1

Load the data set

iris_data <- iris # copy the built-in iris data set

head(iris_data) # View first 6 observations
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now let’s remove the Species label variable:

iris_data$Species <- NULL # drop the label column

STEP 2

Create a scatter plot

Sepal.Width vs. Sepal.Length

require(ggplot2)

ggplot(iris_data, aes(Sepal.Width, Sepal.Length)) + geom_point()
[Figure: scatter plot of Sepal.Width vs. Sepal.Length]

Petal.Width vs. Petal.Length

ggplot(iris_data, aes(Petal.Width, Petal.Length)) + geom_point()
[Figure: scatter plot of Petal.Width vs. Petal.Length]

STEP 3

Normalize the data

iris_data <- scale(iris_data) # center each column and divide by its standard deviation

head(iris_data) # View first 6 observations of the normalized data
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
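For reference, scale centers each column on its mean and divides it by its standard deviation. A minimal sketch of the equivalent manual computation for a single column (Sepal.Length, chosen here for illustration):

manual_scaled <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)

head(manual_scaled) # should match the Sepal.Length column of the scaled data above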

STEP 4

Calculate Euclidean distance

Eu_dist <- dist(iris_data, method = "euclidean")
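The dist function returns all pairwise distances as a dist object. As a quick sanity check (a minimal sketch, assuming iris_data is the scaled matrix from Step 3), the Euclidean distance between the first two observations can be computed by hand and compared against the corresponding entry of Eu_dist:

sqrt(sum((iris_data[1, ] - iris_data[2, ])^2)) # by-hand Euclidean distance
as.matrix(Eu_dist)[1, 2]                       # same entry from the dist output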

Compute the hierarchical clustering using the hclust function in R:

h_clust <- hclust(Eu_dist) 
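Note that hclust uses complete linkage by default. Other linkage methods can be supplied through the method argument; as a sketch, Ward's method (an alternative not used in this walkthrough) tends to produce compact, similarly sized clusters:

h_clust_ward <- hclust(Eu_dist, method = "ward.D2") # Ward's linkage on Euclidean distances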

STEP 5

Create a dendrogram

We can use the color_branches function from the dendextend package to color the branches according to the number of groups/clusters we would like to see.

require(dendextend)

h_clust$labels <- " " # blank out the row labels so the dendrogram is not cluttered

plot(color_branches(as.dendrogram(h_clust), k = 3))
[Figure: dendrogram with branches colored into three clusters]

View how many observations were assigned to each cluster:

table(cutree(h_clust, k = 3))
## 
##  1  2  3 
## 49 24 77

Our original data set had three categories of data points, i.e., three Species classes with 50 observations each, so we would expect a good model to split the data evenly into three distinct clusters of 50. The hierarchical clustering model above clearly assigned some data points to the wrong groups (clusters of 49, 24, and 77).
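One way to verify this is to cross-tabulate the cluster assignments against the original Species labels (a minimal check, assuming the row order of iris_data still matches the original iris data frame):

table(Cluster = cutree(h_clust, k = 3), Species = iris$Species)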

k-means clustering in R

k-means clustering is a cluster analysis technique similar to the hierarchical clustering discussed above. The main difference is that in k-means clustering the number of groups/clusters to be computed must be known before building the model, whereas hierarchical clustering requires no prior knowledge of the number of clusters.
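Since k must be chosen up front, a common heuristic for picking it is the elbow method: run k-means over a range of k values and plot the total within-cluster sum of squares, looking for the point where the curve levels off. A minimal sketch (the seed and the range 1 to 10 are arbitrary choices for illustration):

set.seed(123) # arbitrary seed for reproducibility
wss <- sapply(1:10, function(k) kmeans(iris_data, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")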

Build model

kmeans_model <- kmeans(x = iris_data, centers = 3, nstart = 20)

Let’s walk through the code above:

  • The kmeans function is a base R function for computing k-means clustering; it takes several arguments, three of which were used above (a reproducibility sketch follows this list)
    • the x argument is the data frame or matrix to be clustered
    • the centers argument is the number of clusters/groups that we want
    • the nstart argument is the number of random starting sets that should be chosen
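Because kmeans starts from randomly chosen centers, repeated runs can differ. A minimal reproducibility sketch (the seed value is arbitrary, chosen only for illustration): fix the seed before the call, and rely on nstart > 1 so the best of several random starts is kept.

set.seed(42) # arbitrary seed; any fixed value makes the run repeatable
kmeans_model <- kmeans(x = iris_data, centers = 3, nstart = 20)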

View a summary of the model:

summary(kmeans_model)
##              Length Class  Mode   
## cluster      150    -none- numeric
## centers       12    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
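The components listed in the summary can be accessed directly with the $ operator; for example, the cluster sizes and the fitted centers (expressed on the scaled variables):

kmeans_model$size    # number of observations assigned to each cluster
kmeans_model$centers # cluster centers on the scaled variables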

See how well the data points were clustered using the Sepal.Length and Sepal.Width variables (here we plot the original, unscaled iris values and color each point by its assigned cluster):

plot(iris$Sepal.Length, iris$Sepal.Width, col = kmeans_model$cluster)
[Figure: scatter plot of Sepal.Width vs. Sepal.Length showing three clusters]

The plot above shows that the data points were decently clustered.
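As with the hierarchical model, we can cross-tabulate the k-means assignments against the original Species labels to quantify how well the clusters recover the species (again assuming the row order of iris_data still matches the original iris data frame):

table(Cluster = kmeans_model$cluster, Species = iris$Species)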
