Correlation
The linear correlation coefficient, \(r\), measures the strength and direction of the linear relationship between two variables. It always lies in the interval \([-1;1]\). When \(r=-1\) there is perfect negative correlation, when \(r=0\) there is no linear correlation, and when \(r=1\) there is perfect positive correlation.
| Positive, strong | Positive, fairly strong | Positive, weak |
| \(r\approx \text{0.9}\) | \(r\approx \text{0.7}\) | \(r\approx \text{0.4}\) |
| Negative, fairly strong | Negative, weak | No correlation |
| \(r\approx -\text{0.7}\) | \(r\approx -\text{0.4}\) | \(r=0\) |
The linear correlation coefficient \(r\) can be calculated using the formula
\[r=b\dfrac{\sigma_{x}}{\sigma_{y}}\]
- where \(b\) is the gradient of the least squares regression line,
- \(\sigma_{x}\) is the standard deviation of the \(x\)-values and
- \(\sigma_{y}\) is the standard deviation of the \(y\)-values.
This is known as Pearson's product moment correlation coefficient. It is much easier to find on a calculator: follow the procedure for the regression equation, then go on to find \(r\).
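To make the formula concrete, here is a minimal Python sketch. The function names and the small data sets are illustrative assumptions, not part of the lesson. It computes the least squares gradient \(b\), the two standard deviations (dividing by \(n\)) and then \(r\). The data are perfectly linear, so \(r\) should come out as \(1\) and \(-1\), up to floating-point rounding.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def population_std(values):
    # Standard deviation with n in the denominator.
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def least_squares_gradient(xs, ys):
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def correlation_from_gradient(xs, ys):
    # r = b * sigma_x / sigma_y; undefined if all y-values are equal.
    b = least_squares_gradient(xs, ys)
    return b * population_std(xs) / population_std(ys)


# Illustrative data: points lying exactly on a straight line.
xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 1 for x in xs]     # rising line
ys_down = [10 - 3 * x for x in xs]  # falling line

r_up = correlation_from_gradient(xs, ys_up)
r_down = correlation_from_gradient(xs, ys_down)
print(round(r_up, 6), round(r_down, 6))
```

The sign of \(r\) always matches the sign of the gradient \(b\), because both standard deviations are positive.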
In general:
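| Positive | Strength | Negative |
| \(r=0\) | no correlation | \(r=0\) |
| \(0 < r < \text{0.25}\) | very weak | \(-\text{0.25} < r < 0\) |
| \(\text{0.25} < r < \text{0.5}\) | weak | \(-\text{0.5} < r < -\text{0.25}\) |
| \(\text{0.5} < r < \text{0.75}\) | moderate | \(-\text{0.75} < r < -\text{0.5}\) |
| \(\text{0.75} < r < \text{0.9}\) | strong | \(-\text{0.9} < r < -\text{0.75}\) |
| \(\text{0.9} < r < 1\) | very strong | \(-1 < r < -\text{0.9}\) |
| \(r=1\) | perfect correlation | \(r=-1\) |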
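The bands in this table can be expressed as a small lookup. The sketch below is only one possible reading: the function name is hypothetical, and a value that falls exactly on a boundary (for example \(r = \text{0.25}\)) is placed in the stronger band, which is an assumption the table itself does not settle.

```python
def describe_correlation(r):
    """Return a verbal description of r using the bands in the table."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie in the interval [-1; 1]")
    size = abs(r)
    if size == 0:
        return "no correlation"
    if size == 1:
        strength = "perfect"
    elif size < 0.25:
        strength = "very weak"
    elif size < 0.5:
        strength = "weak"
    elif size < 0.75:
        strength = "moderate"
    elif size < 0.9:
        strength = "strong"
    else:
        strength = "very strong"
    # Assumption: boundary values such as 0.25 fall into the stronger band.
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.7))
print(describe_correlation(-0.4))
```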
Correlation does not imply causation! Just because two variables are correlated does not mean that they are causally linked, i.e. if A and B are correlated, that does not mean A causes B, or vice versa. This is a common mistake made by many people, especially journalists looking for their next juicy story.
For example, ice cream sales and shark attacks are correlated. This does not mean that the sale of ice cream is somehow causing more shark attacks. Instead, a simpler explanation is that the warmer it is, the more likely people are to buy ice cream and the more likely people are to go to the beach as well, thus increasing the likelihood of a shark attack.
Example
Question
A cardiologist wanted to test the relationship between resting heart rate and the peak heart rate during exercise. Heart rate is measured in beats per minute (bpm). The following set of data was generated from 12 study participants after they had run on a treadmill at \(\text{10}\) \(\text{km/h}\) for 10 minutes.
| Resting heart rate | 48 | 56 | 90 | 65 | 75 | 78 | 80 | 72 | 82 | 76 | 68 | 62 |
| Peak heart rate | 138 | 136 | 180 | 150 | 151 | 161 | 155 | 154 | 175 | 158 | 145 | 155 |
- Draw a scatter plot of the data. Use resting heart rate as your \(x\)-variable.
- Use your calculator to determine the equation of the line of best fit.
- Estimate the peak heart rate during exercise of a person with a resting heart rate of \(\text{70}\) \(\text{bpm}\).
- Without using your calculator, find the correlation coefficient, \(r\). Confirm your answer using your calculator.
- What can you conclude regarding the relationship between resting heart rate and peak heart rate during exercise?
Draw the scatter plot
- Choose a suitable scale for the axes.
- Draw the axes.
- Plot the points.
Calculate the equation of the line of best fit
As you learnt previously, use your calculator to determine the values for \(a\) and \(b\).
\(a = \text{86.75}\)
\(b = \text{0.96}\)
Therefore, the equation for the line of best fit is \(y = \text{86.75} + \text{0.96}x\)
Calculate the estimated value for \(y\)
If \(x = 70\), using our equation, the estimated value for \(y\) is: \[y= \text{86.75} + \text{0.96} \times 70 = \text{153.95}\]
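If you want to check the calculator values without a calculator, the fit and the estimate can be reproduced with a short Python sketch. The function and variable names are illustrative; the formulas are the standard least squares ones, \(b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^{2}}\) and \(a = \bar{y} - b\bar{x}\).

```python
# Data from the question.
resting = [48, 56, 90, 65, 75, 78, 80, 72, 82, 76, 68, 62]
peak = [138, 136, 180, 150, 151, 161, 155, 154, 175, 158, 145, 155]


def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(resting, peak)
estimate = a + b * 70  # uses the unrounded coefficients

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"estimate at x = 70: {estimate:.2f}")
```

Because this sketch keeps the unrounded coefficients, its estimate is about \(\text{153.87}\). The value \(\text{153.95}\) above comes from using \(a\) and \(b\) rounded to two decimal places.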
Calculate the correlation coefficient
The formula for \(r\) is:
\[r=b\dfrac{\sigma_{x}}{\sigma_{y}}\]
We already know the value of \(b\) (and you know how to calculate \(b\) by hand from worked example 5), so we only need to determine \(\sigma_{x}\) and \(\sigma_{y}\). The formula for the standard deviation is:
\[\sigma_{x}= \sqrt{\cfrac{\sum\limits_{i=1}^{n}(x_i - \bar{x})^{2}}{n}}\]
The same formula, with \(y\) in place of \(x\), gives \(\sigma_{y}\). First, determine \(\bar{x}\) and \(\bar{y}\), then complete a table like the one below.
\begin{align*} \bar{x} &= \cfrac{\sum\limits_{i=1}^{n}x_i}{n} = 71 \\ \bar{y} &= \cfrac{\sum\limits_{i=1}^{n}y_i}{n} = \text{154.83} \text{ (rounded to two decimal places)} \end{align*}
| Resting heart rate (\(x\)) | Peak heart rate (\(y\)) | \((x-\bar{x})^{2}\) | \((y-\bar{y})^{2}\) |
| 48 | 138 | 529 | \(\text{283.25}\) |
| 56 | 136 | 225 | \(\text{354.57}\) |
| 90 | 180 | 361 | \(\text{633.53}\) |
| 65 | 150 | 36 | \(\text{23.33}\) |
| 75 | 151 | 16 | \(\text{14.67}\) |
| 78 | 161 | 49 | \(\text{38.07}\) |
| 80 | 155 | 81 | \(\text{0.03}\) |
| 72 | 154 | 1 | \(\text{0.69}\) |
| 82 | 175 | 121 | \(\text{406.83}\) |
| 76 | 158 | 25 | \(\text{10.05}\) |
| 68 | 145 | 9 | \(\text{96.63}\) |
| 62 | 155 | 81 | \(\text{0.03}\) |
| \(\sum=852\) | \(\sum=\text{1 858}\) | \(\sum=\text{1 534}\) | \(\sum=\text{1 861.68}\) |
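From the column totals, the remaining steps can be sketched as follows. This assumes the standard deviation is the square root of the sum of squared deviations divided by \(n = 12\), uses the rounded gradient \(b = \text{0.96}\), and rounds intermediate values to two decimal places.

\begin{align*} \sigma_{x} &= \sqrt{\cfrac{\text{1 534}}{12}} \approx \text{11.31} \\ \sigma_{y} &= \sqrt{\cfrac{\text{1 861.68}}{12}} \approx \text{12.46} \\ r &= b\cfrac{\sigma_{x}}{\sigma_{y}} \approx \text{0.96} \times \cfrac{\text{11.31}}{\text{12.46}} \approx \text{0.87} \end{align*}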
Confirm your answer using your calculator
Once you know the method for finding the equation of the line of best fit on your calculator, finding the value of \(r\) takes only a few more key presses. After you have entered all your \(x\) and \(y\) values into your calculator, in STAT mode:
- on a SHARP calculator: press [RCL] then [r] (the same key as [\(\div\)])
- on a CASIO calculator: press [SHIFT] then [STAT], [5], [3] then [\(=\)]
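If no calculator is to hand, the same check can be done in Python. This sketch does not use the formula from this lesson. It uses the equivalent, standard form \(r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^{2} \sum (y-\bar{y})^{2}}}\), so it serves as an independent confirmation. The function name is illustrative.

```python
from math import sqrt

# Data from the question.
resting = [48, 56, 90, 65, 75, 78, 80, 72, 82, 76, 68, 62]
peak = [138, 136, 180, 150, 151, 161, 155, 154, 175, 158, 145, 155]


def pearson_r(xs, ys):
    """Pearson's r from the sums of squared deviations and cross products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(resting, peak)
print(f"r = {r:.2f}")
```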
Comment on the correlation coefficient
\[r = \text{0.87}\]
Therefore, there is a strong, positive, linear relationship between resting heart rate and peak heart rate during exercise. This means that the higher your resting heart rate, the higher your peak heart rate during exercise is likely to be.