# Cluster analysis in R (Clustering)

Clustering is a way of dividing data points with shared attributes into groups (clusters). It falls under unsupervised learning: the data set is provided without labels, and the algorithm assigns each data point to a group based on how similar or dissimilar it is to the other points.

Example use cases

• Natural Language Processing (NLP)
• Computer vision
• Customer/Market segmentation
• Stock markets

Let us try to understand clustering with a graphical illustration

The above 2D graph shows how data points are grouped into four distinct clusters.

We are going to discuss two major types of clustering in R:

1. Hierarchical clustering
2. k-means clustering

## Hierarchical clustering in R

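Hierarchical clustering separates data points into groups based on a measure of similarity (for example, the distance between points). It builds a tree of nested clusters, called a dendrogram, which can then be cut into the desired number of groups.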
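To make the idea concrete, here is a minimal sketch on five made-up one-dimensional points. The values and the names `toy`, `toy_dist`, `toy_clust` and `toy_groups` are invented for illustration; only `dist`, `hclust` and `cutree` come from R itself.

```r
# Five one-dimensional toy points (made-up values, for illustration only)
toy <- c(a = 1, b = 2, c = 6, d = 7, e = 15)

# Pairwise Euclidean distances between the points
toy_dist <- dist(toy)

# Agglomerative clustering: repeatedly merge the two closest groups
toy_clust <- hclust(toy_dist)

# Each row of $merge is one merge step; negative numbers refer to single points
toy_clust$merge

# Cut the tree into 3 groups; nearby points (a and b, c and d) share a group
toy_groups <- cutree(toy_clust, k = 3)
toy_groups
```

Points that sit close together are merged first, and the isolated point `e` stays on its own until the very end.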

### EXAMPLE

For this example we will use the popular iris data set in R. It contains the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Strictly speaking, the presence of a label puts this data set in the supervised category, but for the sake of this lesson I will remove the label variable, which makes the data set suitable for unsupervised learning.

STEPS

1. Load the data
2. Create a scatter plot
3. Normalize the data
4. Calculate Euclidean distance
5. Create a dendrogram
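Steps 3 and 4 in the list above can be checked by hand. The sketch below assumes that "normalize" means centering each column and dividing by its standard deviation (which is what `scale` does by default) and compares manual calculations with the built-in functions. The variable names are illustrative.

```r
x <- as.matrix(iris[, 1:4])

# Step 3 by hand: subtract each column's mean, then divide by its standard deviation
centered <- sweep(x, 2, colMeans(x), "-")
manual_scaled <- sweep(centered, 2, apply(x, 2, sd), "/")
auto_scaled <- scale(x)
all.equal(manual_scaled, auto_scaled, check.attributes = FALSE)

# Step 4 by hand: Euclidean distance between observations 1 and 2
manual_dist <- sqrt(sum((auto_scaled[1, ] - auto_scaled[2, ])^2))
auto_dist <- as.matrix(dist(auto_scaled, method = "euclidean"))[1, 2]
all.equal(manual_dist, auto_dist)
```

Normalizing first matters because Euclidean distance would otherwise be dominated by whichever variable has the largest spread.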

STEP 1

iris_data <- iris

head(iris_data) # View first 6 observations

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa


Now let’s remove the Species label variable

iris_data$Species <- NULL

STEP 2

Create a scatter plot

Sepal.Width VS Sepal.Length

library(ggplot2)

ggplot(iris_data, aes(Sepal.Width, Sepal.Length)) + geom_point()

Petal.Width VS Petal.Length

ggplot(iris_data, aes(Petal.Width, Petal.Length)) + geom_point()

STEP 3

Normalize the data

iris_data <- scale(iris_data)

head(iris_data) # View first 6 observations of the normalized data

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667

STEP 4

Calculate Euclidean distance

Eu_dist <- dist(iris_data, method = "euclidean")

Compute hierarchical clustering using the hclust function in R

h_clust <- hclust(Eu_dist)

STEP 5

Create a dendrogram

We can use the color_branches function from the dendextend package to color the branches according to the number of groups/clusters we would like to see

library(dendextend)

h_clust$labels <- " "

plot(color_branches(as.dendrogram(h_clust), k = 3))


View how many observations were assigned to each cluster

table(cutree(h_clust, k = 3))

##
##  1  2  3
## 49 24 77


Our original data set had three categories of data points, i.e. three classes of Species (50 each), so we would expect a good model to group the data evenly into three distinct clusters. The hierarchical clustering model above produced clusters of 49, 24 and 77 observations instead, which means a noticeable number of data points were assigned to the wrong groups.
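One way to see where the mismatches occur is to cross-tabulate the cluster assignments against the species labels that were set aside earlier. The sketch below rebuilds the model so it runs on its own; `iris_scaled`, `clusters` and `comparison` are illustrative names.

```r
# Rebuild the model from the steps above so this snippet runs on its own
iris_scaled <- scale(iris[, 1:4])
h_clust <- hclust(dist(iris_scaled, method = "euclidean"))
clusters <- cutree(h_clust, k = 3)

# Rows are cluster numbers, columns are the true species
comparison <- table(Cluster = clusters, Species = iris$Species)
comparison
```

A cluster whose row has large counts in more than one species column is mixing species together.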

## k-means clustering in R

k-means clustering is a cluster analysis technique similar to the hierarchical clustering discussed above. The main difference is that in k-means the number of groups/clusters must be chosen before the model is built, whereas hierarchical clustering builds the full tree first and the number of clusters is chosen afterwards by cutting it.

#### Build model

kmeans_model <- kmeans(x = iris_data, centers=3, nstart = 20)


Let’s walk through the code above

• The kmeans function ships with R (in the stats package) and performs k-means clustering. It takes several arguments, three of which were used above
• the x argument is the numeric data to cluster (here the scaled iris_data matrix)
• the centers argument is the number of clusters/groups that we want
• the nstart argument is the number of random starting sets of centers to try; the best result is kept
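Because k-means starts from randomly chosen centers, results can differ from run to run. The sketch below compares a single random start with 20 starts. The seed value 42 is arbitrary and the variable names are illustrative; `tot.withinss` is the total within-cluster sum of squares reported by `kmeans`.

```r
iris_scaled <- scale(iris[, 1:4])

set.seed(42)  # arbitrary seed, only so the run can be repeated
single_start <- kmeans(iris_scaled, centers = 3, nstart = 1)

set.seed(42)
many_starts <- kmeans(iris_scaled, centers = 3, nstart = 20)

# Total within-cluster sum of squares: lower means tighter clusters
c(single = single_start$tot.withinss, many = many_starts$tot.withinss)
```

Trying several starts usually makes it less likely that the algorithm settles on a poor solution, at the cost of extra computation.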

View summary of the model

summary(kmeans_model)

##              Length Class  Mode
## cluster      150    -none- numeric
## centers       12    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
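The summary only lists the components of the fitted object, so it is often more informative to look at the components directly. A minimal sketch, rebuilding the model so it runs on its own (the seed is arbitrary and `explained` is an illustrative name):

```r
iris_scaled <- scale(iris[, 1:4])
set.seed(1)  # arbitrary seed for repeatability
kmeans_model <- kmeans(x = iris_scaled, centers = 3, nstart = 20)

kmeans_model$size     # number of observations in each cluster
kmeans_model$centers  # cluster centers, in scaled units

# Share of the total sum of squares that lies between clusters
explained <- kmeans_model$betweenss / kmeans_model$totss
explained
```

A ratio closer to 1 suggests the clusters are well separated relative to their internal spread.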


See how well the data points were clustered using the Sepal.Length and Sepal.Width variables

plot(iris$Sepal.Length, iris$Sepal.Width, col = kmeans_model$cluster)


The above plot shows that the data points were decently clustered. 
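A plot of two variables gives only a partial view, so a numeric check can help. As a sketch, the k-means assignments can be cross-tabulated against the original species labels. The seed and the name `kmeans_vs_species` are illustrative, and cluster numbers are arbitrary, so they will not line up with species in any fixed order.

```r
iris_scaled <- scale(iris[, 1:4])
set.seed(1)  # arbitrary seed for repeatability
kmeans_model <- kmeans(x = iris_scaled, centers = 3, nstart = 20)

# Rows are k-means cluster numbers, columns are the true species
kmeans_vs_species <- table(Cluster = kmeans_model$cluster, Species = iris$Species)
kmeans_vs_species
```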