9 Correlations
Correlations explore how two or more variables are related to each other. A correlation analysis assesses whether changes in one variable vary systematically with changes in another. There is no dependence relationship between the variables (i.e., no IV/DV distinction). Most often, correlations are used to examine how the variables in a data set relate to each other, usually with a focus on the variable(s) of interest. Remember, correlation does not imply causation!
This section explores the three most common correlation tests, one parametric (Pearson) and two non-parametric (Spearman and Kendall). All three tests compute a correlation coefficient that can range between -1 and 1. The closer the value is to the extreme (-1 or 1) the stronger the relationship is.
- < 0 - indicates a negative correlation, meaning that as the value of x increases, the value of y decreases.
- 0 - indicates no (linear or monotonic) association.
- > 0 - indicates a positive correlation, meaning that as the value of x increases, the value of y increases as well.
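The three cases above can be illustrated with a small sketch. The vectors below are made-up toy data (not part of mtcars), and the variable names are chosen only for this illustration.

```r
# Toy data (hypothetical, for illustration only)
x <- -5:5

y_pos  <- 2 * x + 1   # increases as x increases
y_neg  <- -3 * x      # decreases as x increases
y_none <- x^2         # symmetric around 0, so no linear trend in x

# cor() computes the Pearson coefficient by default
r_pos  <- cor(x, y_pos)    # perfect positive correlation
r_neg  <- cor(x, y_neg)    # perfect negative correlation
r_none <- cor(x, y_none)   # no linear association

print(c(positive = r_pos, negative = r_neg, none = r_none))
```

Note that in the last case y is fully determined by x, yet the coefficient is zero: a coefficient of zero rules out a linear trend, not every possible relationship.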
Null hypothesis (H0): There is no correlation between the two variables. Under H0, the population correlation coefficient (which, depending on the test, is denoted r, \(\varrho\), or \(\tau\)) is zero, so the sample estimate is expected to be close to zero.
The data set used in the examples below is called mtcars and is one of R's built-in example data sets. The data, covering 11 variables describing 32 cars, were extracted from the 1974 Motor Trend US magazine (Table 9.1).
Table 9.1: Description of variables in the mtcars dataset
Variable | Description |
---|---|
mpg | Miles/gallon (US) |
cyl | Number of cylinders |
disp | Displacement (in cubic inches) |
hp | Horsepower |
drat | Rear axle ratio |
wt | Weight (in 1000 lb) |
qsec | 1/4 mile time |
vs | Engine (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
The first few rows of the data set are shown in Table 9.2.
Table 9.2: First few rows of the mtcars dataset
Car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
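A minimal sketch of how the data might be loaded and inspected in R. The test output later in this section refers to a data frame named my.cor; how that object was created is not shown, so the assignment below is an assumption.

```r
# mtcars ships with base R (package 'datasets')
data(mtcars)

# Assumption: my.cor is simply a copy of mtcars
my.cor <- mtcars

# First six rows, as in Table 9.2
print(head(my.cor))

# Number of observations and variables
print(dim(my.cor))
```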
The question asked is: Is there any relationship between mpg and wt? That is, is there any correlation between the car’s weight and its fuel efficiency?
9.1 Pearson Correlations Test
Parametric test used for two interval or ratio variables. The test requires the following assumptions to be met:
- Data is bi-variate normal;
- The relationship between variables is linear.
Before running the test, we verify the assumptions for the Pearson Correlations test. Note that outliers can distort the results of the Pearson Correlations test; the sample should therefore be checked for outliers before the test is run.
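One possible way to screen for outliers is the boxplot rule (values more than 1.5 times the interquartile range beyond the hinges). This is only a sketch of one common convention, not necessarily the method intended here; my.cor is assumed to be a copy of mtcars.

```r
# Assumption: my.cor is a copy of the built-in mtcars data
my.cor <- mtcars

# boxplot.stats() returns, in $out, the values lying beyond
# 1.5 * IQR from the lower and upper hinges
out.mpg <- boxplot.stats(my.cor$mpg)$out
out.wt  <- boxplot.stats(my.cor$wt)$out

print(out.mpg)
print(out.wt)
```

Any values flagged this way deserve a closer look before the test results are interpreted.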
To test normality we use the Shapiro-Wilk test.
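The check might be run as sketched below, using shapiro.test() from base R. The object name my.cor matches the output that follows, but its definition as a copy of mtcars is an assumption.

```r
# Assumption: my.cor is a copy of the built-in mtcars data
my.cor <- mtcars

# Shapiro-Wilk test for each variable separately
sw.mpg <- shapiro.test(my.cor$mpg)
sw.wt  <- shapiro.test(my.cor$wt)

print(sw.mpg)
print(sw.wt)
```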
Shapiro-Wilk normality test
data: my.cor$mpg
W = 0.95, p-value = 0.1
Shapiro-Wilk normality test
data: my.cor$wt
W = 0.94, p-value = 0.09
The results show that the distribution of each variable is not significantly different from the normal distribution (both p-values are > 0.05), so the data are consistent with the assumption of normality. Note that normality of each variable on its own is necessary, but not sufficient, for bivariate normality.
Besides the Shapiro-Wilk test, data normality can be analyzed using histograms, q-q plots, and the values of skewness and kurtosis for each data set. These alternative ways of studying a data set’s normality are exemplified for other analyses.
The linearity assumption can be visualized by generating a scatter plot representation with one variable on the X axis and the other variable on the Y axis.
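A scatter plot with a fitted regression line can be drawn with base R graphics, as in the sketch below. Placing wt on the X axis and mpg on the Y axis is an assumption, as are the axis labels; my.cor is assumed to be a copy of mtcars.

```r
# Assumption: my.cor is a copy of the built-in mtcars data
my.cor <- mtcars

# Scatter plot: weight on the X axis, fuel efficiency on the Y axis
plot(my.cor$wt, my.cor$mpg,
     xlab = "Weight (1000 lb)",
     ylab = "Miles per gallon",
     pch = 19)

# Fit a simple linear regression and overlay it as a red line
fit <- lm(mpg ~ wt, data = my.cor)
abline(fit, col = "red")
```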
In the scatter plot (Figure 9.1), the assumption of linearity appears to hold because the points follow the red regression line reasonably closely. Should the points show a different pattern (e.g., a curve), the relationship between the two variables is not linear, and other correlation tests should be used to analyze it.
With the assumptions verified, let’s run the Pearson Correlations test.
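A minimal sketch of the call, using cor.test() from base R with method = "pearson" (the default). As before, my.cor is assumed to be a copy of mtcars.

```r
# Assumption: my.cor is a copy of the built-in mtcars data
my.cor <- mtcars

# Pearson product-moment correlation test (two-sided by default)
res.pearson <- cor.test(my.cor$mpg, my.cor$wt, method = "pearson")
print(res.pearson)

# Individual components can be extracted from the result
print(res.pearson$estimate)   # correlation coefficient r
print(res.pearson$p.value)    # p-value
print(res.pearson$conf.int)   # 95 percent confidence interval
```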
Pearson's product-moment correlation
data: my.cor$mpg and my.cor$wt
t = -9.6, df = 30, p-value = 1e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338 -0.7441
sample estimates:
cor
-0.8677
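The p-value < 0.05 suggests that there is a significant correlation between mpg (fuel efficiency) and wt (car’s weight). The coefficient r = -0.87 is close to -1, indicating a strong negative relationship: heavier cars tend to have lower fuel efficiency.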
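For reference, the t statistic and degrees of freedom in the Pearson output are tied to the coefficient through the standard relationship below. The sample size n = 32 is inferred from df = 30; the arithmetic uses the rounded r from the output, so the result is approximate.

```latex
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad df = n - 2
\]
\[
t = \frac{-0.8677\,\sqrt{32-2}}{\sqrt{1-(-0.8677)^{2}}}
  \approx \frac{-4.753}{0.4971}
  \approx -9.56, \qquad df = 30
\]
```

This agrees with the reported t = -9.6 after rounding.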
9.2 Spearman Correlations Test
Non-parametric test used for two interval, ratio, or ordinal type variables.
Because the Spearman Correlations test makes no assumptions about the distribution of the data (it works on ranks and assesses whether the relationship is monotonic), it can be run directly.
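A sketch of the call with cor.test() and method = "spearman", assuming my.cor is a copy of mtcars. The data contain tied values, so an exact p-value cannot be computed; setting exact = FALSE requests the approximate p-value explicitly and avoids the corresponding warning.

```r
# Assumption: my.cor is a copy of the built-in mtcars data
my.cor <- mtcars

# Spearman rank correlation test
res.spearman <- cor.test(my.cor$mpg, my.cor$wt,
                         method = "spearman", exact = FALSE)
print(res.spearman)

# Spearman's rho equals the Pearson correlation of the ranks
rho.by.ranks <- cor(rank(my.cor$mpg), rank(my.cor$wt))
print(rho.by.ranks)
```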
Spearman's rank correlation rho
data: my.cor$mpg and my.cor$wt
S = 10000, p-value = 1e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.8864
Based on the computed p-value < 0.05, it can be concluded that the two variables are significantly correlated with each other. The coefficient rho = -0.89 indicates a strong negative relationship.
9.3 Kendall Correlations Test
Non-parametric test used for two interval, ratio, or ordinal type variables.
Because the Kendall Correlations test makes no assumptions about the distribution of the data (it is based on the ordering of pairs of observations), it can be run directly.
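A sketch of the call with cor.test() and method = "kendall", assuming my.cor is a copy of mtcars. As with Spearman, the tied values in the data rule out an exact p-value, so exact = FALSE is set explicitly.

```r
# Assumption: my.cor is a copy of the built-in mtcars data
my.cor <- mtcars

# Kendall rank correlation test
res.kendall <- cor.test(my.cor$mpg, my.cor$wt,
                        method = "kendall", exact = FALSE)
print(res.kendall)

# The coefficient alone is also available through cor()
tau.direct <- cor(my.cor$mpg, my.cor$wt, method = "kendall")
print(tau.direct)
```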
Kendall's rank correlation tau
data: my.cor$mpg and my.cor$wt
z = -5.8, p-value = 7e-09
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.7278
Based on the computed p-value < 0.05, it can be concluded that the two variables are significantly correlated with each other. The coefficient tau = -0.73 again indicates a negative relationship.