Visualizing and interpreting correlation in R
Correlation analysis is a statistical analysis for understanding the relationship/association between continuous variables. Essentially what we are interestedin in a correlation analysis is a value called the correlation coefficient. This value ranges from 1 to +1, it tells us how strong or weak the relationship is between a pair of variables.
Generally:

A value of 1 translates to a perfect negative correlation between the two variables
 a value between 0.1 and 0.49 implies that there is a weak negative correlation between the two variables(Low negative correlation)
 a value between 0.5 and 0.99 implies that there is a strong negative correlation between the two variables(High negative correlation)

A value of +1 translates to a perfect positive correlation between the two variables
 a value between +0.1 and +0.49 implies that there is a weak positive correlation between the two variables(Low negative correlation)
 a value between +0.5 and +0.99 implies that there is a strong positive correlation between the two variables(High negative correlation)

A value of 0 translates to no correlation between the two variables(No relationship)
THREE MAJOR METHODS OF CORRELATION COEFFICIENT CAN BE COMPUTED IN BASE R

Pearson's productmoment correlation
Pearson’s productmoment correlation is a parametric measure of the linear association between two continuous random variables
ASSUMPTIONS:
 The two variables are continuous
 The variable values are pairs(there are two values, two measurements for each sample case)
 The relationship between the variables are approximately linear
 The variables are normally distributed
 There exist no outliers

Spearman's rank correlation
Spearman’s rank correlation is a nonparametric measure of the monotonic association between two continuous random variables
ASSUMPTIONS:
 Both variables are ordinal and/or continuous
 The relationship between the variables is monotonic
 The variable values are pairs(there are two values, two measurements for each sample case)

Kendall's rank correlation
kendall’s rank correlation is a nonparametric measure of the association, based on concordance or discordance of two continuous random variables.
HYPOTHESIS TESTING
When performing correlation analysis in R we can also test for hypothesis as follows:
 Ho (Null hypothesis): correlation between the two variables is equal to Zero
 H1 (Alternative hypothesis): correlation between the two variables is not equal to Zero
REJECTION REGION
Reject the Null hypothesis when the pvalue is smaller than 0.05
Reject the Alternative hypothesis when the pvalue is greater than 0.05
EXAMPLE
For this example we will make use of the popular mtcars dataset that comes with R. The dataset contains information about 32 cars, including their weight, fuel efficiency (in miles_per_gallon), speed, Rear axle ratio, etc… you can use the help(mtcars)
function in r to know more about the dataset.
Our aim is to analyze the correlation between the Miles_per_gallon(mpg) variable and few other continuous variables in the dataset.
Load required packages and dataset
require(tidyverse)
require(corrplot)
data("mtcars")
See what the data looks like
mtcars %>% head() %>% knitr::kable()
mpg  cyl  disp  hp  drat  wt  qsec  vs  am  gear  carb  

Mazda RX4  21.0  6  160  110  3.90  2.620  16.46  0  1  4  4 
Mazda RX4 Wag  21.0  6  160  110  3.90  2.875  17.02  0  1  4  4 
Datsun 710  22.8  4  108  93  3.85  2.320  18.61  1  1  4  1 
Hornet 4 Drive  21.4  6  258  110  3.08  3.215  19.44  1  0  3  1 
Hornet Sportabout  18.7  8  360  175  3.15  3.440  17.02  0  0  3  2 
Valiant  18.1  6  225  105  2.76  3.460  20.22  1  0  3  1 
To compute the Pearson, Speaman or kendall correlation coefficient test in R we use the cor.test
function. We can indicate the method we want to use with the method parameter, by default the pearson’s method is used.
Compute Pearson correlation test for MPG vs DISP variables
cor.test(mtcars$dis, mtcars$mpg, method = "pearson")
##
## Pearson's productmoment correlation
##
## data: mtcars$dis and mtcars$mpg
## t = 8.7472, df = 30, pvalue = 9.38e10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9233594 0.7081376
## sample estimates:
## cor
## 0.8475514
We can see that we get a lot of important information like the pvalue, hypothesis, confidence interval, degreeoffreedom, and correlation coefficient when we use the cor.test
function.
A correlation coefficient value of 0.8475514 means that there is strong negative correlation between the MPG and DISP variable. We can simply use the cor
function if we are only interested in the value of the correlation coefficient between the variables. i.e cor(mtcars$mpg, mtcars$disp, method = "pearson")
Visualize the variables
We can also visualize the variables to understand the association
mtcars %>% ggplot(aes(disp, mpg)) + geom_point() + geom_smooth(se = FALSE)
Obviously as the disp increases the mpg decreases and the relationship is fairly linear.
Compute Spearman’s rank correlation test for MPG vs DISP variables
cor.test(mtcars$dis, mtcars$mpg, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: mtcars$dis and mtcars$mpg
## S = 10415, pvalue = 6.37e13
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9088824
We get a higher correlation coefficient in the Spearman’s test
NOTE: It is always advisable to first verify which of the assumptions are satisfied before going ahead to compute any of these tests.
Compute Pearson correlation test for MPG vs DRAT variables
cor.test(mtcars$drat, mtcars$mpg, method = "pearson")
##
## Pearson's productmoment correlation
##
## data: mtcars$drat and mtcars$mpg
## t = 5.096, df = 30, pvalue = 1.776e05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4360484 0.8322010
## sample estimates:
## cor
## 0.6811719
We get a correlation coefficient value of 0.6811719 indicating that there is a strong correlation between the mpg and drat variables
Visualize the variables
We can also visualize the variables to understand the association
mtcars %>% ggplot(aes(drat, mpg)) + geom_point() + geom_smooth(se = FALSE)
We can see that the mpg increases as drat increases and the relationhip is fairly linear
COMPUTING AND VISUALIZING THE CORRELATION OF MULTIPLE VARIABLES
it is also possible to compute correlation coefficient for multiple variables, each pairs combined to form a matrix. This can also be achieved with the cor
function, this time instead of supplying the function with two continuous variables we supply an entire data frame with several continuous variables.
Correlation matrix for the mtcars dataframe
cor(mtcars) %>%
knitr::kable()
mpg  cyl  disp  hp  drat  wt  qsec  vs  am  gear  carb  

mpg  1.0000000  0.8521620  0.8475514  0.7761684  0.6811719  0.8676594  0.4186840  0.6640389  0.5998324  0.4802848  0.5509251 
cyl  0.8521620  1.0000000  0.9020329  0.8324475  0.6999381  0.7824958  0.5912421  0.8108118  0.5226070  0.4926866  0.5269883 
disp  0.8475514  0.9020329  1.0000000  0.7909486  0.7102139  0.8879799  0.4336979  0.7104159  0.5912270  0.5555692  0.3949769 
hp  0.7761684  0.8324475  0.7909486  1.0000000  0.4487591  0.6587479  0.7082234  0.7230967  0.2432043  0.1257043  0.7498125 
drat  0.6811719  0.6999381  0.7102139  0.4487591  1.0000000  0.7124406  0.0912048  0.4402785  0.7127111  0.6996101  0.0907898 
wt  0.8676594  0.7824958  0.8879799  0.6587479  0.7124406  1.0000000  0.1747159  0.5549157  0.6924953  0.5832870  0.4276059 
qsec  0.4186840  0.5912421  0.4336979  0.7082234  0.0912048  0.1747159  1.0000000  0.7445354  0.2298609  0.2126822  0.6562492 
vs  0.6640389  0.8108118  0.7104159  0.7230967  0.4402785  0.5549157  0.7445354  1.0000000  0.1683451  0.2060233  0.5696071 
am  0.5998324  0.5226070  0.5912270  0.2432043  0.7127111  0.6924953  0.2298609  0.1683451  1.0000000  0.7940588  0.0575344 
gear  0.4802848  0.4926866  0.5555692  0.1257043  0.6996101  0.5832870  0.2126822  0.2060233  0.7940588  1.0000000  0.2740728 
carb  0.5509251  0.5269883  0.3949769  0.7498125  0.0907898  0.4276059  0.6562492  0.5696071  0.0575344  0.2740728  1.0000000 
Notice how the leading diagonal is 1 all through, this means that there is always a perfect positive correlation when a variable is paired with itself.
we can tell how much each variable is correlated with another by observing the correlation matrix, but some people may not be too comfortable seeing just numbers, it may be hard for them to interpret results is this form.
Another good way to compute a correlation matrix is by making an heatmap. With heatmap it is easy to tell which variables have high and low correlation without having to be looking at numbers
Visualize the correlation of the mtcars dataframe using the corrplot
function from the corrplot package
corrplot(cor(mtcars), method = "pie")
Each pie from the plot above shows how strong or weak correlation between pairs is, more filled pies indicates strong relationship between the pairs and less filled pie indicates weak relationship between the pairs. The color of each pairs tells us the sign/direction of the relationship i.e
Negative
or Positive
With plots like the one above one would not need to be struggling to understand numbers. Mere looking at the plot one can tell how much each pairs correlates with one another compared to the rest.
The method parameter in the corrplot
function allows us to choose the visualization method of correlation matrix to be used. It currently supports the seven methods below:
 circle
 square
 ellipse
 number
 pie
 shade
 color
for example we can use the “square” method as follows:
corrplot(cor(mtcars), method = "square")
and so on…….