Correlation coefficient: formulas, calculation, interpretation, example

Jonah Lester

In statistics, the correlation coefficient is an indicator that measures the tendency of two quantitative variables X and Y to have a linear or proportional relationship.

Generally, the variables X and Y are two characteristics of the same population. For example, X could be a person's height and Y that person's weight.

Figure 1. Correlation coefficient for four data pairs (X, Y). Source: F. Zapata.

In this case, the correlation coefficient would indicate whether or not there is a trend towards a proportional relationship between the height and weight of a given population.

Pearson's linear correlation coefficient is denoted by the lowercase letter r, and its minimum and maximum values are -1 and +1, respectively.

A value of r = +1 would indicate that the pairs (X, Y) are perfectly aligned and that when X grows, Y grows in the same proportion. If instead r = -1, the pairs would also be perfectly aligned, but in this case when X increases, Y decreases in the same proportion.

Figure 2. Different values of the linear correlation coefficient. Source: Wikimedia Commons.

On the other hand, a value of r = 0 would indicate that there is no linear correlation between the variables X and Y, while a value of r = +0.8 would indicate that the pairs (X, Y) tend to cluster on either side of a certain straight line.
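The limiting values of r described above can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `pearson_r` and made-up data: points on a rising line, points on a falling line, and points on a parabola, for which there is a clear relationship but no linear one.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the (N - 1) factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative values only, not taken from the article.
xs = [-2, -1, 0, 1, 2]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_down = pearson_r(xs, [-3 * x + 4 for x in xs])  # points on a falling line
r_curve = pearson_r(xs, [x ** 2 for x in xs])     # points on a parabola

print(r_up, r_down, r_curve)
```

The parabola case illustrates that r = 0 rules out a linear trend only; the variables can still be related in a non-linear way.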

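The formula to calculate the correlation coefficient r is as follows:

$$r = \frac{S_{xy}}{S_x \, S_y}$$

where $S_{xy}$ is the covariance of X and Y, and $S_x$ and $S_y$ are the standard deviations of X and Y, respectively.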

How to calculate the correlation coefficient?

The linear correlation coefficient is a statistical quantity available in scientific calculators, most spreadsheets, and statistical programs.

However, it is useful to know how the formula that defines it is applied, so a detailed calculation on a small data set is worked through here.

As stated in the previous section, the correlation coefficient is the covariance Sxy divided by the product of the standard deviation Sx of the variable X and the standard deviation Sy of the variable Y.

Covariance and variance

The covariance Sxy is:

$$S_{xy} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N-1}$$

Here the sum runs over the N pairs of data $(X_i, Y_i)$, from i = 1 to N, and $\bar{X}$ and $\bar{Y}$ are the arithmetic means of the data $X_i$ and $Y_i$, respectively.

The standard deviation of the variable X is the square root of the variance of the data set $X_i$, with i from 1 to N:

$$S_x = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1}}$$

Similarly, the standard deviation of the variable Y is the square root of the variance of the data set $Y_i$, with i from 1 to N:

$$S_y = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
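The definitions of covariance and standard deviation given above translate almost directly into code. The sketch below is one possible rendering, not a reference implementation; the function names `sample_covariance` and `sample_std` and the input values are assumptions chosen for illustration.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def sample_covariance(xs, ys):
    """Sxy: sum of products of deviations, divided by N - 1."""
    mean_x, mean_y = mean(xs), mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def sample_std(values):
    """Sx: square root of the sample variance (N - 1 in the denominator)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

# Arbitrary illustrative values, not taken from the article.
demo_x = [2, 4, 6]
demo_y = [1, 3, 8]
demo_cov = sample_covariance(demo_x, demo_y)  # (6 + 0 + 8) / 2
demo_std = sample_std(demo_x)                 # sqrt((4 + 0 + 4) / 2)
print(demo_cov, demo_std)
```

Dividing by N - 1 rather than N matches the formulas in this section; either choice gives the same r, because the factor cancels in the ratio.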

Illustrative case

To show in detail how to calculate the correlation coefficient, we will take the following set of four pairs of data:

(X, Y): (1, 1); (2, 3); (3, 6) and (4, 7).

First we calculate the arithmetic mean for X and Y, as follows:

$$\bar{X} = \frac{1 + 2 + 3 + 4}{4} = 2.5$$

$$\bar{Y} = \frac{1 + 3 + 6 + 7}{4} = 4.25$$

Then the remaining parameters are calculated:

Covariance Sxy

$$S_{xy} = \frac{(1 - 2.5)(1 - 4.25) + (2 - 2.5)(3 - 4.25) + (3 - 2.5)(6 - 4.25) + (4 - 2.5)(7 - 4.25)}{4 - 1}$$

$$S_{xy} = \frac{(-1.5)(-3.25) + (-0.5)(-1.25) + (0.5)(1.75) + (1.5)(2.75)}{3} = \frac{10.5}{3} = 3.5$$

Standard deviation Sx

$$S_x = \sqrt{\frac{(-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2}{4 - 1}} = \sqrt{\frac{5}{3}} \approx 1.29$$

Standard deviation Sy

$$S_y = \sqrt{\frac{(-3.25)^2 + (-1.25)^2 + (1.75)^2 + (2.75)^2}{4 - 1}} = \sqrt{\frac{22.75}{3}} \approx 2.75$$

Correlation coefficient r

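$$r = \frac{S_{xy}}{S_x \, S_y} = \frac{3.5}{1.291 \times 2.754} \approx 0.98$$

The standard deviations are carried to three decimals here (1.291 and 2.754) so that intermediate rounding does not affect the second decimal of r.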
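The hand calculation can be reproduced in a few lines. This is a minimal sketch using Python's standard `statistics` module for the means and sample standard deviations; the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# The four data pairs of the illustrative case.
xs = [1, 2, 3, 4]
ys = [1, 3, 6, 7]
n = len(xs)

mean_x, mean_y = mean(xs), mean(ys)  # 2.5 and 4.25

# Covariance: sum of products of deviations, divided by N - 1.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# stdev() is the sample standard deviation (N - 1 in the denominator).
sx = stdev(xs)
sy = stdev(ys)

r = sxy / (sx * sy)
print(f"Sxy = {sxy:.2f}, Sx = {sx:.2f}, Sy = {sy:.2f}, r = {r:.2f}")
# prints: Sxy = 3.50, Sx = 1.29, Sy = 2.75, r = 0.98
```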

Interpretation

In the data set of the previous case, a strong linear correlation is observed between the variables X and Y, which is manifested both in the scatter plot (shown in Figure 1) and in the correlation coefficient, which yielded a value quite close to unity.

The closer the correlation coefficient is to 1 or -1, the more sense it makes to fit the data to a straight line obtained by linear regression.

Linear regression

The regression line is obtained by the method of least squares, in which the parameters of the line are found by minimizing the sum of the squared differences between the estimated Y value and the observed Yi over the N data points.

The parameters a and b of the regression line y = a + bx, obtained by the least squares method, are:

* $b = S_{xy} / S_x^2$ for the slope

* $a = \bar{Y} - b\,\bar{X}$ for the intersection of the regression line with the Y-axis.

Recall that $S_{xy}$ is the covariance defined above and $S_x^2$ is the variance, that is, the square of the standard deviation defined above. $\bar{X}$ and $\bar{Y}$ are the arithmetic means of the data X and Y, respectively.
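As an illustration, the slope and intercept formulas can be applied to the four data pairs of the illustrative case. The sketch below also compares the sum of squared errors of the fitted line with that of two slightly perturbed lines; the helper `sse` is a hypothetical name introduced here, and the comparison is only a spot check, not a proof of minimality.

```python
# The four data pairs of the illustrative case.
xs = [1, 2, 3, 4]
ys = [1, 3, 6, 7]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Covariance and variance of X, both with N - 1 in the denominator.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)

b = sxy / var_x          # slope
a = mean_y - b * mean_x  # intercept with the Y-axis

def sse(intercept, slope):
    """Sum of squared differences between the line's estimate and each Yi."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))

print(f"y = {a:.2f} + {b:.2f} x")
print(sse(a, b), sse(a, b - 0.1), sse(a + 0.1, b))
```

For this data set the slope comes out close to 2.1 and the intercept close to -1, and both perturbed lines give a larger sum of squared errors than the fitted one.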

Example

The correlation coefficient is used to determine whether there is a linear correlation between two variables. It is applicable when the variables under study are quantitative and, furthermore, are assumed to follow a normal distribution.

An illustrative example follows. One measure of the degree of obesity is the body mass index, which is obtained by dividing a person's weight in kilograms by the square of that person's height in meters.

Suppose we want to know whether there is a strong correlation between the body mass index and the concentration of HDL cholesterol in the blood, measured in millimoles per liter. For this purpose, a study of 533 people was carried out; it is summarized in the following graph, in which each point represents the data of one person.
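The body mass index definition above amounts to a one-line computation. A minimal sketch follows, assuming a hypothetical function name `body_mass_index` and made-up weight and height values that do not come from the study.

```python
def body_mass_index(weight_kg, height_m):
    """Weight in kilograms divided by the square of the height in meters."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m (illustrative values only).
bmi = body_mass_index(70.0, 1.75)
print(f"BMI = {bmi:.1f} kg/m^2")
```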

Figure 3. Study of BMI and HDL cholesterol in 533 patients. Source: Aragonese Institute of Health Sciences (IACS).

Careful observation of the graph reveals a certain linear trend (not very marked) between the concentration of HDL cholesterol and the body mass index. The quantitative measure of this trend is the correlation coefficient, which in this case turned out to be r = -0.276. The negative sign indicates that HDL cholesterol tends to decrease as the body mass index increases.

References

  1. González C. General Statistics. Retrieved from: tarwi.lamolina.edu.pe
  2. IACS. Aragonese Institute of Health Sciences. Retrieved from: ics-aragon.com
  3. Salazar C. and Castillo S. Basic principles of statistics. (2018). Retrieved from: dspace.uce.edu.ec
  4. Superprof. Correlation coefficient. Retrieved from: superprof.es
  5. USAC. Descriptive statistics manual. (2011). Retrieved from: statistics.ingenieria.usac.edu.gt
  6. Wikipedia. Pearson's correlation coefficient. Retrieved from: es.wikipedia.com
