Coefficient of determination formulas, calculation, interpretation, examples

Alexander Pearson

The coefficient of determination is a number between 0 and 1 that measures how closely the points (X, Y) of a two-variable data set follow the regression curve fitted to them. More precisely, it is the fraction of the variance of Y that the fit explains.

It is also known as goodness of fit and is denoted by R². It is calculated as the quotient between the variance of the values Ŷi estimated by the regression model and the variance of the observed values Yi corresponding to each Xi in the data.

R² = Sŷ / Sy

Figure 1. Correlation coefficient for four data pairs. Source: F. Zapata.

If 100% of the data are on the line of the regression function, then the coefficient of determination will be 1.

By contrast, if for a given data set and fitting function R² turns out to be 0.5, the fit explains only 50% of the variance of the data.

Similarly, when the regression model returns values of R² lower than 0.5, the chosen fitting function does not describe the data satisfactorily, and another fitting function should be sought.

And when the covariance, or equivalently the correlation coefficient, tends to zero, the variables X and Y show no linear relationship, and R² for a linear fit will also tend to zero.

Article index

  • 1 How to calculate the coefficient of determination?
    • 1.1 Illustrative case
  • 2 Interpretation
  • 3 Examples
    • 3.1 - Example 1
    • 3.2 - Example 2
    • 3.3 - Example 3
    • 3.4 Fit comparison
    • 3.5 Conclusions
  • 4 References

How to calculate the coefficient of determination?

As stated in the previous section, the coefficient of determination is the quotient between two variances:

-The variance Sŷ of the values Ŷi estimated by the regression function.

-The variance Sy of the observed values Yi corresponding to each Xi of the N data pairs.

Stated mathematically:

R² = Sŷ / Sy

From this formula it follows that R² represents the proportion of the variance explained by the regression model. Alternatively, R² can be calculated using the following formula, which is equivalent to the previous one when the fit is obtained by least squares:

R² = 1 - (Sε / Sy)

Here Sε is the variance of the residuals εi = Ŷi - Yi, while Sy is the variance of the Yi values of the data. The values Ŷi are obtained by applying the regression function, that is, Ŷi = f (Xi).
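The reason the two expressions agree can be written in one line. This is a sketch that assumes the fit is a least squares fit that is linear in its parameters and includes a constant term (as a straight line or a polynomial does). Under that assumption the total variance splits into an explained part and a residual part:

```latex
S_y = S_{\hat{y}} + S_{\varepsilon}
\quad\Longrightarrow\quad
R^2 = \frac{S_{\hat{y}}}{S_y} = 1 - \frac{S_{\varepsilon}}{S_y}
```

For other kinds of fit this decomposition is not guaranteed, and the two formulas can give different numbers.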

The variance of the data set Yi, with i from 1 to N is calculated in this way:

Sy = [Σ (Yi - Ȳ)²] / (N - 1), where Ȳ is the arithmetic mean of the Yi values

And then proceed in a similar way for Sŷ or for Sε.
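The two formulas above can be sketched in Python using only the standard library. The function names and the small data set are hypothetical and chosen for illustration; the fitted values come from the least squares line y = 1.1x + 1.1 for that data.

```python
from statistics import variance  # sample variance, N - 1 denominator


def r_squared_explained(y, y_hat):
    """R² as the ratio Sŷ / Sy of the two sample variances."""
    return variance(y_hat) / variance(y)


def r_squared_residual(y, y_hat):
    """R² as 1 - Sε / Sy, with residuals εi = Ŷi - Yi."""
    residuals = [fit - obs for obs, fit in zip(y, y_hat)]
    return 1 - variance(residuals) / variance(y)


# Hypothetical data, not taken from the article.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 5.0]
y_hat = [1.1 * xi + 1.1 for xi in x]  # least squares line for this data

r2_a = r_squared_explained(y, y_hat)
r2_b = r_squared_residual(y, y_hat)
print(f"Sŷ/Sy = {r2_a:.4f}, 1 - Sε/Sy = {r2_b:.4f}")
```

Both functions return the same value here because the fitted values come from a least squares line with an intercept; with an arbitrary fitting function they need not coincide.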

Illustrative case

To show in detail how the coefficient of determination is calculated, we will take the following set of four data pairs:

(X, Y): (1, 1); (2, 3); (3, 6) and (4, 7).

A linear regression fit is proposed for this data set, which is obtained using the least squares method:

f (x) = 2.1 x - 1 

Applying this fitting function gives the pairs:

(X, Ŷ): (1, 1.1); (2, 3.2); (3, 5.3) and (4, 7.4).

Then we calculate the arithmetic mean for X and Y:

X̄ = (1 + 2 + 3 + 4) / 4 = 2.5

Ȳ = (1 + 3 + 6 + 7) / 4 = 4.25

Variance Sy

Sy = [(1 - 4.25)² + (3 - 4.25)² + (6 - 4.25)² + (7 - 4.25)²] / (4 - 1) =

= [(-3.25)² + (-1.25)² + (1.75)² + (2.75)²] / 3 = 7.583

Variance Sŷ

Sŷ = [(1.1 - 4.25)² + (3.2 - 4.25)² + (5.3 - 4.25)² + (7.4 - 4.25)²] / (4 - 1) =

= [(-3.15)² + (-1.05)² + (1.05)² + (3.15)²] / 3 = 7.35

Coefficient of determination R²

R² = Sŷ / Sy = 7.35 / 7.58 = 0.97
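The hand calculation can be checked with a short script. This is a minimal sketch using the closed-form least squares expressions for a straight line; the variable names are illustrative.

```python
from statistics import mean, variance

# Data pairs from the illustrative case.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 6.0, 7.0]

x_bar, y_bar = mean(xs), mean(ys)

# Least squares slope and intercept of the line f(x) = slope * x + intercept.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Fitted values Ŷi = f(Xi).
y_hat = [slope * x + intercept for x in xs]

s_y = variance(ys)         # sample variance of the data (N - 1 denominator)
s_y_hat = variance(y_hat)  # sample variance of the fitted values
r2 = s_y_hat / s_y

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"Sy = {s_y:.3f}, Sŷ = {s_y_hat:.3f}, R² = {r2:.2f}")
```

The script recovers the line f (x) = 2.1 x - 1 and the same variances and coefficient of determination as the calculation by hand.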

Interpretation

The coefficient of determination for the illustrative case in the previous section turned out to be 0.97. In other words, the linear fit given by the function:

 f (x) = 2.1x - 1

explains 97% of the variance of the data from which it was obtained using the least squares method.

In addition to the coefficient of determination, there is the linear correlation coefficient, also known as Pearson's coefficient. This coefficient, denoted r, is calculated by the following relationship:

r = Sxy / (√Sx · √Sy)

Here the numerator is the covariance between the variables X and Y, while the denominator is the product of the standard deviations of X and Y, that is, the square roots of the variances Sx and Sy.

Pearson's coefficient can take values between -1 and +1. When it tends to +1 there is a direct linear correlation between X and Y. If it tends to -1 instead, there is a linear correlation in which Y decreases as X increases. Finally, when it is close to 0 there is no linear correlation between the two variables.

It should be noted that the coefficient of determination coincides with the square of Pearson's coefficient only when it has been calculated from a linear fit; this equality does not hold for non-linear fits.
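The equality between r² and R² for a linear fit can be checked numerically on the four data pairs of the illustrative case. The sketch below computes the covariance with the same N - 1 denominator used for the variances; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 6.0, 7.0]
n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)

# Sample covariance Sxy (N - 1 denominator, like the variances).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Pearson's coefficient: covariance over the product of standard deviations.
r = s_xy / (sqrt(variance(xs)) * sqrt(variance(ys)))

# R² of the least squares line, as the ratio of variances Sŷ / Sy.
slope = s_xy / variance(xs)
intercept = y_bar - slope * x_bar
y_hat = [slope * x + intercept for x in xs]
r2 = variance(y_hat) / variance(ys)

print(f"r = {r:.4f}, r² = {r * r:.4f}, R² = {r2:.4f}")
```

For this data r is positive and close to 1, and its square agrees with the coefficient of determination of the straight-line fit.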

Examples

- Example 1

A group of high school students set out to determine an empirical law for the period of a pendulum as a function of its length. To do so, they measured the time of one pendulum oscillation for different lengths, obtaining the following values:

Length (m)    Period (s)
0.1           0.6
0.4           1.31
0.7           1.78
1             1.93
1.3           2.19
1.6           2.66
1.9           2.77
3             3.62

Make a scatter plot of the data and perform a linear fit by regression. Also show the regression equation and its coefficient of determination.

Solution

Figure 2. Solution graph for exercise 1. Source: F. Zapata.

A fairly high coefficient of determination (95%) is obtained, which might suggest that the linear fit is optimal. However, taken together, the points tend to curve downward, a feature that the linear model does not capture.
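The linear fit of this example can be reproduced without a spreadsheet. The sketch below uses only the Python standard library; the helper names `linear_fit` and `r_squared` are illustrative, and the residuals follow the convention εi = Ŷi - Yi used earlier.

```python
from statistics import mean

# Measurements from Example 1: pendulum length (m) and period (s).
lengths = [0.1, 0.4, 0.7, 1.0, 1.3, 1.6, 1.9, 3.0]
periods = [0.6, 1.31, 1.78, 1.93, 2.19, 2.66, 2.77, 3.62]


def linear_fit(xs, ys):
    """Return (slope, intercept) of the least squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(ys, fitted):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = mean(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = linear_fit(lengths, periods)
fitted = [slope * x + intercept for x in lengths]
r2 = r_squared(periods, fitted)
residuals = [f - y for y, f in zip(periods, fitted)]  # εi = Ŷi - Yi

print(f"T = {slope:.3f} L + {intercept:.3f}, R² = {r2:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

The residuals make the curvature visible: the line overestimates the period at the shortest and longest lengths and underestimates it in between, which is the pattern a straight line produces when the data bend downward.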

- Example 2

For the same data as in Example 1, make a scatter plot. This time, unlike in Example 1, fit the data by regression using a power function.

Figure 3. Solution graph for exercise 2. Source: F. Zapata.

Also show the fit function and its coefficient of determination R².

Solution

The power function has the form f (x) = A x^B, where A and B are constants determined by the least squares method.

The previous figure shows the power function and its parameters, as well as the coefficient of determination, which has a very high value of 99%. Notice that the data follow the curvature of the trend line.
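One common way to fit a function of the form f (x) = A x^B is to take logarithms, so that ln T = ln A + B ln L becomes a straight line, and then apply ordinary least squares to the transformed data. The sketch below assumes that approach; the article does not say which procedure produced the figure, and the coefficient of determination computed here refers to the straight line in log-log space.

```python
from math import exp, log
from statistics import mean

# Measurements from Example 1: pendulum length (m) and period (s).
lengths = [0.1, 0.4, 0.7, 1.0, 1.3, 1.6, 1.9, 3.0]
periods = [0.6, 1.31, 1.78, 1.93, 2.19, 2.66, 2.77, 3.62]

# Transform to logarithms: ln T = ln A + B ln L.
log_l = [log(x) for x in lengths]
log_t = [log(y) for y in periods]
u_bar, v_bar = mean(log_l), mean(log_t)

suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(log_l, log_t))
suu = sum((u - u_bar) ** 2 for u in log_l)
svv = sum((v - v_bar) ** 2 for v in log_t)

b = suv / suu                 # exponent B (slope in log-log space)
a = exp(v_bar - b * u_bar)    # constant A (exponential of the intercept)

# For a straight-line fit, R² equals the squared correlation coefficient.
r2_log = suv ** 2 / (suu * svv)

print(f"T = {a:.3f} * L^{b:.3f}, R² (log-log) = {r2_log:.3f}")
```

With this data the exponent comes out close to 0.5 and the constant close to 2, which is what a square-root law for the pendulum would predict.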

- Example 3

Using the same data from Example 1 and Example 2, perform a second-degree polynomial fit. Show the graph, the fitted polynomial, and the corresponding coefficient of determination R².

Solution

Figure 4. Solution graph for exercise 3. Source: F. Zapata.

With the second-degree polynomial fit, the trend line follows the curvature of the data well. Also, the coefficient of determination is above that of the linear fit and below that of the power fit.
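A second-degree least squares fit can be sketched by building the normal equations for the coefficients of c0 + c1 x + c2 x² and solving the resulting 3 × 3 system. The helper functions below are illustrative, written with the standard library only; a numerical library would normally do this step.

```python
# Measurements from Example 1: pendulum length (m) and period (s).
lengths = [0.1, 0.4, 0.7, 1.0, 1.3, 1.6, 1.9, 3.0]
periods = [0.6, 1.31, 1.78, 1.93, 2.19, 2.66, 2.77, 3.62]


def solve_linear_system(a, b):
    """Solve a · x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def quadratic_fit(xs, ys):
    """Least squares coefficients (c0, c1, c2) of c0 + c1·x + c2·x²."""
    power_sums = [sum(x ** p for x in xs) for p in range(5)]
    a = [[power_sums[j + k] for k in range(3)] for j in range(3)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(3)]
    return solve_linear_system(a, b)


c0, c1, c2 = quadratic_fit(lengths, periods)
fitted = [c0 + c1 * x + c2 * x * x for x in lengths]

y_bar = sum(periods) / len(periods)
ss_res = sum((y - f) ** 2 for y, f in zip(periods, fitted))
ss_tot = sum((y - y_bar) ** 2 for y in periods)
r2 = 1 - ss_res / ss_tot

print(f"T = {c0:.3f} + {c1:.3f} L + {c2:.3f} L², R² = {r2:.3f}")
```

The coefficient of L² comes out negative, reflecting the downward curvature of the data, and R² cannot be lower than that of the straight line because the line is a special case of the parabola with c2 = 0.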

Fit comparison

Of the three fits shown, the one with the highest coefficient of determination is the power fit (Example 2).

The power fit agrees with the physical theory of the pendulum, which establishes that the period of a pendulum is proportional to the square root of its length, the constant of proportionality being 2π / √g, where g is the acceleration of gravity.

Not only does the power fit have the highest coefficient of determination, but its exponent and constant of proportionality also match the physical model.
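The theoretical values can be computed directly for comparison with the measurements. This sketch assumes g = 9.81 m/s², a value the article does not state, and uses the small-oscillation formula T = 2π √(L / g); the function name is illustrative.

```python
from math import pi, sqrt

G = 9.81  # m/s², assumed value of the acceleration of gravity


def pendulum_period(length_m, g=G):
    """Small-oscillation period of a simple pendulum, T = 2π √(L / g)."""
    return 2 * pi * sqrt(length_m / g)


# Constant of proportionality A in T = A · L^0.5.
theory_constant = 2 * pi / sqrt(G)

# Measurements from Example 1: pendulum length (m) and period (s).
lengths = [0.1, 0.4, 0.7, 1.0, 1.3, 1.6, 1.9, 3.0]
periods = [0.6, 1.31, 1.78, 1.93, 2.19, 2.66, 2.77, 3.62]

print(f"2π/√g = {theory_constant:.3f}")
for length, measured in zip(lengths, periods):
    predicted = pendulum_period(length)
    print(f"L = {length:.1f} m: measured {measured:.2f} s, theory {predicted:.2f} s")
```

The theoretical constant is about 2.006 s/√m with an exponent of 0.5, so a power fit whose parameters land near those values is consistent with the physical model.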

Conclusions

-A regression fit determines the parameters of the function intended to explain the data by means of the least squares method. This method consists of minimizing the sum of the squared differences between the fitted value Ŷi and the data value Yi at each Xi of the data.

-As we have seen, the most common fitting function is the straight line, but it is not the only one: fits can also be polynomial, power, exponential, logarithmic and others.

-In any case, the coefficient of determination depends on the data and on the type of fit, and is an indication of the goodness of the fit applied.

-Finally, the coefficient of determination indicates the percentage of the total variability of the Y values of the data that is explained by the Ŷ values of the fit for the given X.

References

  1. González C. General Statistics. Recovered from: tarwi.lamolina.edu.pe
  2. IACS. Aragonese Institute of Health Sciences. Recovered from: ics-aragon.com
  3. Salazar C. and Castillo S. Basic principles of statistics. (2018). Recovered from: dspace.uce.edu.ec
  4. Superprof. Determination coefficient. Recovered from: superprof.es
  5. USAC. Descriptive statistics manual. (2011). Recovered from: statistics.ingenieria.usac.edu.gt.
  6. Wikipedia. Determination coefficient. Recovered from: es.wikipedia.com.
