9 Correlations

Correlations explore how two or more variables are related to each other. A correlation analysis assesses whether changes in one variable vary systematically with changes in another. Unlike analyses that distinguish independent and dependent variables (IV/DV), a correlation assumes no dependence relationship between the variables. Remember: correlation does not imply causation! Most often, correlations are used to examine how the variables in a data set relate to each other, usually with a focus on the variable(s) of interest.

This section explores the three most common correlation tests, one parametric (Pearson) and two non-parametric (Spearman and Kendall). All three tests compute a correlation coefficient that ranges between -1 and 1. The closer the value is to either extreme (-1 or 1), the stronger the relationship.

  • < 0 - indicates a negative correlation, meaning that as the value of x increases, the value of y decreases.
  • 0 - indicates no association.
  • > 0 - indicates a positive correlation, meaning that as the value of x increases, the value of y increases as well.
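The sign interpretation above can be illustrated with a minimal sketch. The vectors `x`, `y_up`, and `y_down` are made-up toy data (not part of the mtcars example), chosen so that the relationships are perfectly linear.

```r
# Toy data (illustrative values only, not taken from mtcars)
x      <- 1:10
y_up   <- 2 * x + 1    # increases as x increases
y_down <- 20 - 3 * x   # decreases as x increases

# cor() computes the Pearson coefficient by default
r_up   <- cor(x, y_up)     # perfectly linear, increasing
r_down <- cor(x, y_down)   # perfectly linear, decreasing

print(c(positive = r_up, negative = r_down))
```

Because both toy relationships are exactly linear, the coefficients sit at the two extremes of the range; real data almost always fall somewhere in between.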

Null hypothesis (H0): There is no correlation between the two variables. In this case the population correlation coefficient (which, depending on the test, is r, \(\varrho\), or \(\tau\)) is zero, so the sample estimate is expected to be close to zero.

The data set used in the examples below is called mtcars and is included in R's built-in example datasets. The data, covering 11 variables describing 32 cars, were extracted from the 1974 Motor Trend US magazine (Table 9.1).

Table 9.1: Description of variables in the mtcars dataset

Variable   Description
mpg        Miles/gallon (US)
cyl        Number of cylinders
disp       Displacement (in cubic inches)
hp         Horsepower
drat       Rear axle ratio
wt         Weight (in 1000 lb)
qsec       1/4 mile time
vs         Engine (0 = V-shaped, 1 = inline)
am         Transmission (0 = automatic, 1 = manual)
gear       Number of forward gears
carb       Number of carburetors

The first few rows of the data set are shown in Table 9.2.

Table 9.2: First few rows of the mtcars dataset

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
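A listing like Table 9.2 can be produced with a short sketch such as the one below. The object name `my.cor` is taken from the test output shown later in this chapter; that it is a plain copy of `mtcars` is an assumption, since the original code that created it is not shown.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Dimensions: number of cars (rows) and variables (columns)
print(dim(my.cor))

# First six rows, as in Table 9.2
print(head(my.cor))
```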

The question asked is: Is there any relationship between mpg and wt? That is, is there any correlation between the car’s weight and its fuel efficiency?

9.1 Pearson Correlations Test

The Pearson correlation test is a parametric test used for two interval or ratio variables. The test requires the following assumptions to be met:

  • The data are bivariate normal;
  • The relationship between variables is linear.

Before running the test, we verify the assumptions for the Pearson Correlations test. Note that outliers can distort the results of the Pearson Correlations test, so the sample should be checked for outliers before the test is run.
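One simple way to screen for outliers is sketched below. The chapter does not say which method was used, so the choice of the boxplot rule (points more than 1.5 times the interquartile range beyond the hinges) is an assumption, as is treating `my.cor` as a copy of `mtcars`.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# boxplot.stats() returns, in $out, the values lying beyond
# 1.5 * IQR from the lower and upper hinges
out_mpg <- boxplot.stats(my.cor$mpg)$out
out_wt  <- boxplot.stats(my.cor$wt)$out

print(out_mpg)
print(out_wt)
```

Any flagged values deserve a closer look before the Pearson test is interpreted; whether to keep, transform, or remove them is a judgment call that depends on the study.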

To test normality we use the Shapiro-Wilk test.
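The output that follows can be reproduced with a sketch along these lines, using `shapiro.test()` from base R. As before, `my.cor` is assumed to be a copy of `mtcars`.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Shapiro-Wilk test, run separately on each variable
sw_mpg <- shapiro.test(my.cor$mpg)
sw_wt  <- shapiro.test(my.cor$wt)

print(sw_mpg)
print(sw_wt)
```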


    Shapiro-Wilk normality test

data:  my.cor$mpg
W = 0.95, p-value = 0.1

    Shapiro-Wilk normality test

data:  my.cor$wt
W = 0.94, p-value = 0.09

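The results show that the distribution of each variable is not significantly different from the normal distribution (both p-values are > 0.05), so the normality assumption is not rejected. Note that these tests check each variable separately, which is a necessary but not sufficient condition for bivariate normality.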

Besides the Shapiro-Wilk test, data normality can be analyzed using histograms, q-q plots, and the values of skewness and kurtosis for each variable. These alternative ways of studying normality are demonstrated for other analyses.
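As a brief illustration of the graphical checks, the sketch below draws a histogram and a normal q-q plot for mpg using base R graphics only. It assumes `my.cor` is a copy of `mtcars`; skewness and kurtosis are left out because they require an add-on package.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Two plots side by side; restore the previous layout afterwards
op <- par(mfrow = c(1, 2))

# Histogram: look for a roughly symmetric, bell-shaped distribution
h <- hist(my.cor$mpg, main = "Histogram of mpg", xlab = "Miles/gallon")

# Normal q-q plot: points close to the reference line suggest normality
qqnorm(my.cor$mpg, main = "Normal Q-Q plot of mpg")
qqline(my.cor$mpg, col = "red")

par(op)
```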

The linearity assumption can be visualized by generating a scatter plot representation with one variable on the X axis and the other variable on the Y axis.
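A scatter plot of this kind can be sketched as follows. Placing weight on the X axis and mpg on the Y axis, and drawing the regression line with `lm()` and `abline()`, are assumptions about how the chapter's figure was produced.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Scatter plot with weight on the X axis and fuel efficiency on the Y axis
plot(my.cor$wt, my.cor$mpg,
     xlab = "Weight (1000 lb)", ylab = "Miles/gallon",
     pch = 19)

# Least-squares regression line of mpg on wt, drawn in red
fit <- lm(mpg ~ wt, data = my.cor)
abline(fit, col = "red")
```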

Figure 9.1: Scatterplot of miles/gallon (mpg) vs. weight (wt)


In the scatter plot (Figure 9.1), the points follow the red regression line reasonably closely, so the assumption of linearity appears to hold. Should the points show a different pattern (e.g., a curve), the relationship between the two variables would not be linear, and a different correlation test should be used to analyze it.

With the assumptions verified, let’s run the Pearson Correlations test.
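The output that follows can be reproduced with `cor.test()` from base R, as in the sketch below (again assuming `my.cor` is a copy of `mtcars`).

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Pearson product-moment correlation test (two-sided by default)
res_pearson <- cor.test(my.cor$mpg, my.cor$wt, method = "pearson")
print(res_pearson)

# Individual pieces of the result can be extracted by name
print(res_pearson$estimate)   # correlation coefficient r
print(res_pearson$conf.int)   # 95 percent confidence interval
print(res_pearson$p.value)    # p-value
```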


    Pearson's product-moment correlation

data:  my.cor$mpg and my.cor$wt
t = -9.6, df = 30, p-value = 1e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338 -0.7441
sample estimates:
    cor 
-0.8677 

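The p-value < 0.05 indicates a significant correlation between mpg (fuel efficiency) and wt (car's weight). The coefficient (r = -0.87) shows that the relationship is strong and negative: heavier cars tend to have lower fuel efficiency.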

9.2 Spearman Correlations Test

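The Spearman correlation test is a non-parametric test used for two interval, ratio, or ordinal variables.

Because the Spearman Correlations test is computed on ranks, it does not require normally distributed data or a linear relationship (only a monotonic one), so it can be run directly.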
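The output that follows can be reproduced with the sketch below. Setting `exact = FALSE` is an assumption added here: the data contain tied values, so R cannot compute an exact p-value and would otherwise issue a warning before falling back to the same approximation.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Spearman rank correlation test; exact = FALSE requests the
# asymptotic p-value, which avoids the warning about ties
res_spearman <- cor.test(my.cor$mpg, my.cor$wt,
                         method = "spearman", exact = FALSE)
print(res_spearman)
```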


    Spearman's rank correlation rho

data:  my.cor$mpg and my.cor$wt
S = 10000, p-value = 1e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
    rho 
-0.8864 

Based on the computed p-value < 0.05, it can be concluded that the two variables are significantly correlated. The coefficient (rho = -0.89) indicates a strong negative relationship.

9.3 Kendall Correlations Test

The Kendall correlation test is a non-parametric test used for two interval, ratio, or ordinal variables.

Because the Kendall Correlations test is also rank-based, it does not require normally distributed data or a linear relationship, so it can be run directly.
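The output that follows can be reproduced with the sketch below. As with the Spearman test, `exact = FALSE` is an assumption added here to avoid the warning R issues when the data contain ties.

```r
# Assumption: my.cor is an unmodified copy of the built-in mtcars data frame
my.cor <- mtcars

# Kendall rank correlation test; exact = FALSE requests the
# normal approximation, which avoids the warning about ties
res_kendall <- cor.test(my.cor$mpg, my.cor$wt,
                        method = "kendall", exact = FALSE)
print(res_kendall)
```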


    Kendall's rank correlation tau

data:  my.cor$mpg and my.cor$wt
z = -5.8, p-value = 7e-09
alternative hypothesis: true tau is not equal to 0
sample estimates:
    tau 
-0.7278 

Based on the computed p-value < 0.05, it can be concluded that the two variables are significantly correlated. The coefficient (tau = -0.73) again indicates a negative relationship between weight and fuel efficiency.