# 9 Correlations

Correlations explore how two or more variables are related to each other. A correlation analysis assesses whether changes in one variable vary systematically with changes in another. No dependence relationship is assumed between the variables (i.e., there is no IV/DV distinction). Most often, correlations are used to examine how the variables in a data set relate to each other, usually with a focus on the variable(s) of interest. *Remember*, correlation does not imply causation!

This section explores the three most common correlation tests, one *parametric* (Pearson) and two *non-parametric* (Spearman and Kendall). All three tests compute a correlation coefficient that can range between -1 and 1. The closer the value is to the extreme (-1 or 1) the stronger the relationship is.

- < 0 - indicates a negative correlation, meaning that as the value of *x* increases, the value of *y* decreases.
- 0 - indicates no association.
- > 0 - indicates a positive correlation, meaning that as the value of *x* increases, the value of *y* increases as well.
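The sign convention above can be illustrated with a small sketch. The vectors below are made-up toy data (not part of *mtcars*), chosen only so that each case is easy to see; `cor()` computes the Pearson coefficient by default.

```r
# Toy data, hypothetical and for illustration only
x      <- c(1, 2, 3, 4, 5)
y_up   <- 2 * x + 1          # rises as x rises
y_down <- 10 - 3 * x         # falls as x rises
y_flat <- c(2, 1, 3, 1, 2)   # no linear trend with x

r_up   <- cor(x, y_up)    # positive coefficient
r_down <- cor(x, y_down)  # negative coefficient
r_flat <- cor(x, y_flat)  # coefficient at (numerically) zero

print(c(up = r_up, down = r_down, flat = r_flat))
```

Because `y_up` and `y_down` are exact linear functions of `x`, their coefficients sit at the extremes of the range; real data almost never does.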

*Null hypothesis (\(H_0\))*: There is no correlation between the two variables. In this case the true correlation coefficient (*r*, \(\varrho\), or \(\tau\), depending on the test) is zero, so the sample estimate is expected to be close to zero.

The data set used in the examples below is called *mtcars* and is one of R's built-in example datasets. The data, covering 11 variables describing 32 cars, were extracted from the 1974 *Motor Trend* US magazine (Table 9.1).

Table 9.1: Description of variables in the mtcars dataset

| Variable | Description |
|---|---|
| mpg | Miles/gallon (US) |
| cyl | Number of cylinders |
| disp | Displacement (in cubic inches) |
| hp | Horsepower |
| drat | Rear axle ratio |
| wt | Weight (in 1000 lb) |
| qsec | 1/4 mile time (in seconds) |
| vs | Engine (0 = V-shaped, 1 = in-line) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |

The first few rows of the data set are shown in Table 9.2.

Table 9.2: First few rows of the mtcars dataset

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
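A minimal sketch of how the data could be loaded and previewed. The test output later in this chapter refers to a data frame named `my.cor`; that it is a plain copy of `mtcars` is an assumption made here for illustration.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs downloading.
# Assumption: my.cor is an unmodified copy of mtcars.
my.cor <- mtcars

head(my.cor)  # first six rows, as in Table 9.2
dim(my.cor)   # number of cars (rows) and variables (columns)
```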

The question asked is: *Is there any relationship between mpg and wt*? That is, is there any correlation between the car’s weight and its fuel efficiency?

## 9.1 Pearson Correlations Test

Parametric test used for two *interval* or *ratio* variables. The test requires the following assumptions to be met:

- Data is bi-variate normal;
- The relationship between variables is linear.

Before running the test, we verify the assumptions for the *Pearson Correlations* test. Note that outliers can distort the results of the Pearson correlations test; therefore, the sample should be checked for outliers before the test is run.

To test normality we use the *Shapiro-Wilk* test, applied to each variable separately.

```
Shapiro-Wilk normality test
data: my.cor$mpg
W = 0.95, p-value = 0.1
```

```
Shapiro-Wilk normality test
data: my.cor$wt
W = 0.94, p-value = 0.09
```

The results show that the distribution of data for both variables is not significantly different from the normal distribution (both *p*-values are > 0.05), so the assumption of normality is considered met.
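The normality checks above could be reproduced with base R's `shapiro.test()`. This is a sketch that assumes `my.cor` is a copy of `mtcars`; the printed values carry more digits than the rounded output shown above.

```r
my.cor <- mtcars  # assumption: my.cor is a copy of mtcars

# Shapiro-Wilk test for each variable separately
sw_mpg <- shapiro.test(my.cor$mpg)
sw_wt  <- shapiro.test(my.cor$wt)
print(sw_mpg)
print(sw_wt)

# TRUE means normality is not rejected at the 0.05 level
c(mpg = sw_mpg$p.value, wt = sw_wt$p.value) > 0.05
```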

Besides the Shapiro-Wilk test, data normality can be analyzed using *histograms*, *q-q plots*, and the values of *skewness* and *kurtosis* for each data set. These alternative ways of studying a data set's normality are exemplified for other analyses.
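As a rough sketch of the numeric alternatives, skewness and excess kurtosis can be computed from sample moments in base R. The helper functions `skewness()` and `excess_kurtosis()` below are hypothetical names defined here, not functions from a package, and other estimators with small-sample corrections give slightly different values.

```r
# Moment-based sample skewness: 0 for a perfectly symmetric sample
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: approximately 0 for normally distributed data
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

my.cor <- mtcars  # assumption: my.cor is a copy of mtcars
sapply(my.cor[, c("mpg", "wt")], skewness)
sapply(my.cor[, c("mpg", "wt")], excess_kurtosis)
```

For the graphical checks, base R offers `hist()` for histograms and `qqnorm()` with `qqline()` for q-q plots.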

The linearity assumption can be checked visually with a scatter plot, placing one variable on the X axis and the other on the Y axis.

Looking at the scatter plot (Figure 9.1), the assumption of linearity seems to hold because the points follow the red regression line reasonably closely. Should the pattern of points show a different trend (e.g., a curve), the relationship between the two variables is not linear and another correlation test should be used to analyze it.
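A scatter plot of this kind could be drawn with base graphics as sketched below. Which variable goes on which axis, the axis labels, and the title are assumptions; the original figure may have been produced differently.

```r
my.cor <- mtcars  # assumption: my.cor is a copy of mtcars

# Weight on the X axis, fuel efficiency on the Y axis
plot(my.cor$wt, my.cor$mpg,
     xlab = "Weight (1000 lb)",
     ylab = "Miles per gallon",
     main = "mpg vs. wt")

# Least-squares regression line, drawn in red
fit <- lm(mpg ~ wt, data = my.cor)
abline(fit, col = "red")
```

If the points fan out or bend away from the line, the rank-based tests in Sections 9.2 and 9.3 are the safer choice.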

With the assumptions verified, let’s run the Pearson Correlations test.

```
Pearson's product-moment correlation
data: my.cor$mpg and my.cor$wt
t = -9.6, df = 30, p-value = 1e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338 -0.7441
sample estimates:
cor
-0.8677
```

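The *p*-value < 0.05 suggests that there is a significant correlation between *mpg* (fuel efficiency) and *wt* (car's weight). The coefficient (*r* = -0.87) indicates a strong negative relationship: heavier cars tend to have lower fuel efficiency.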
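The output above is consistent with a call to base R's `cor.test()`. The sketch below assumes `my.cor` is a copy of `mtcars` and shows how the individual pieces of the result can be extracted.

```r
my.cor <- mtcars  # assumption: my.cor is a copy of mtcars

# Pearson is the default method; it is named here for clarity
pearson <- cor.test(my.cor$mpg, my.cor$wt, method = "pearson")
print(pearson)

pearson$estimate  # correlation coefficient r
pearson$conf.int  # 95% confidence interval for r
pearson$p.value   # p-value of the two-sided test
```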

## 9.2 Spearman Correlations Test

Non-parametric test used for two *interval*, *ratio*, or *ordinal* type variables.

Because the *Spearman Correlations* test does not assume normality or linearity (it works on ranks and requires only that the relationship be monotonic), it can be run directly.

```
Spearman's rank correlation rho
data: my.cor$mpg and my.cor$wt
S = 10000, p-value = 1e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.8864
```

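Based on the computed *p*-value < 0.05 it can be concluded that the two variables are significantly correlated to each other. The negative coefficient (\(\varrho\) = -0.89) again indicates that *mpg* decreases as *wt* increases.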
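A sketch of the corresponding `cor.test()` call, assuming `my.cor` is a copy of `mtcars`. Because the data contain tied values, an exact *p*-value cannot be computed; `exact = FALSE` is set here (an assumption about the original call) to request the approximation explicitly and avoid the warning about ties.

```r
my.cor <- mtcars  # assumption: my.cor is a copy of mtcars

spearman <- cor.test(my.cor$mpg, my.cor$wt,
                     method = "spearman",
                     exact = FALSE)  # ties present, so use the approximation
print(spearman)

spearman$estimate  # Spearman's rho
spearman$p.value
```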

## 9.3 Kendall Correlations Test

Non-parametric test used for two *interval*, *ratio*, or *ordinal* type variables.

Because the *Kendall Correlations* test does not assume normality or linearity (it is based on concordant and discordant pairs of observations), it can be run directly.

```
Kendall's rank correlation tau
data: my.cor$mpg and my.cor$wt
z = -5.8, p-value = 7e-09
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.7278
```

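Based on the computed *p*-value < 0.05 it can be concluded that the two variables are significantly correlated to each other. The coefficient (\(\tau\) = -0.73) is negative, in agreement with the Pearson and Spearman results.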
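A sketch of the corresponding `cor.test()` call, assuming `my.cor` is a copy of `mtcars`. As with Spearman, tied values rule out an exact *p*-value, so `exact = FALSE` (an assumption about the original call) requests the normal approximation, which is what the *z* statistic above reflects.

```r
my.cor <- mtcars  # assumption: my.cor is a copy of mtcars

kendall <- cor.test(my.cor$mpg, my.cor$wt,
                    method = "kendall",
                    exact = FALSE)  # ties present, so use the normal approximation
print(kendall)

kendall$estimate  # Kendall's tau
kendall$p.value
```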