Chi-square (χ²) distribution, how to calculate it, examples

Alexander Pearson

The chi-square test (χ², where χ is the Greek letter “chi”) is used to determine the behavior of a certain variable and also to find out whether two or more variables are statistically independent.

To check the behavior of a variable, the test to be performed is called the chi-square goodness-of-fit test. To find out whether two or more variables are statistically independent, the test is called the chi-square test of independence, also called the contingency test.

Figure 1. Hypothesis tests using chi square

These tests are part of statistical decision theory, in which a population is studied and decisions are made about it by analyzing one or more samples taken from it. This requires making certain assumptions about the variables, called hypotheses, which may or may not be true.

There are several tests to contrast these conjectures and determine which are valid, within a certain margin of confidence, including the chi-square test, which can be applied to compare two or more populations.

As we will see, two types of hypotheses are usually raised about some population parameter in two samples: the null hypothesis, denoted H₀ (the samples are independent), and the alternative hypothesis, denoted H₁ (the samples are correlated), which is its opposite.

Article index

  • 1 When is the chi-square test used?
    • 1.1 Conditions to apply it
  • 2 Chi square distribution
    • 2.1 Degrees of freedom
    • 2.2 Formulation of hypotheses
  • 3 How is the chi-square statistic calculated?
    • 3.1 Acceptance criteria for H₀
  • 4 Calculation example
  • 5 References

When is the chi-square test used?

The chi square test is applied to variables that describe qualities, such as sex, marital status, blood group, eye color and preferences of various types.

The test is used when you want to:

-Check whether a distribution is appropriate to describe a variable, which is called goodness of fit. With the chi-square test it is possible to know whether there are significant differences between the selected theoretical distribution and the observed frequency distribution.

-Know whether two variables X and Y are statistically independent. This is known as the test of independence.

Since it is applied to qualitative or categorical variables, the chi-square test is widely used in social sciences, management, and medicine.

Conditions to apply it

There are two important requirements to apply it correctly:

-The data must be grouped in frequencies.

-The sample has to be large enough for the chi-square distribution to be valid; otherwise the value of the statistic is overestimated, which leads to rejecting the null hypothesis when it should not be rejected.

The general rule is that a frequency with a value less than 5 in the grouped data should not be used. If there is more than one frequency less than 5, they must be combined into a single class whose frequency is greater than 5.
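The pooling rule above can be sketched in a few lines. This is only an illustration: the function name `merge_small_classes`, the threshold argument and the blood-group counts are hypothetical and do not come from the article.

```python
def merge_small_classes(labels, freqs, minimum=5):
    """Pool every class whose frequency is below `minimum` into one class."""
    kept_labels, kept_freqs = [], []
    pooled_labels, pooled_total = [], 0
    for label, freq in zip(labels, freqs):
        if freq < minimum:
            # Too small to stand alone: set it aside for pooling.
            pooled_labels.append(label)
            pooled_total += freq
        else:
            kept_labels.append(label)
            kept_freqs.append(freq)
    if pooled_labels:
        # Append the pooled classes as a single combined class.
        kept_labels.append("+".join(pooled_labels))
        kept_freqs.append(pooled_total)
    return kept_labels, kept_freqs


# Hypothetical blood-group counts, invented for illustration only.
labels = ["O", "A", "B", "AB"]
freqs = [45, 40, 4, 3]
merged_labels, merged_freqs = merge_small_classes(labels, freqs)
print(merged_labels, merged_freqs)
```

If the pooled class still falls below the threshold, the rule stated above would leave it out of the test; the sketch does not make that decision automatically.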

Chi square distribution

χ² is a continuous probability distribution. In fact there is a family of curves, depending on a parameter k called the degrees of freedom of the random variable.

Its properties are:

-The area under the curve is equal to 1.

-The values of χ² are positive.

-The distribution is asymmetric, that is, it is skewed, with a long tail to the right.

Figure 2. Chi-square distribution for k degrees of freedom. Source: Wikimedia Commons.
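The first two properties can be checked numerically. The sketch below uses the standard density of the chi-square distribution, which the article does not write out, and a simple midpoint rule; the function names, the upper limit of 100 and the step count are arbitrary choices made for illustration.

```python
import math


def chi2_pdf(x, k):
    """Density of the chi-square distribution with k degrees of freedom."""
    if x <= 0:
        return 0.0  # the variable only takes positive values
    log_pdf = ((k / 2 - 1) * math.log(x) - x / 2
               - (k / 2) * math.log(2) - math.lgamma(k / 2))
    return math.exp(log_pdf)


def area_under_curve(k, upper=100.0, steps=100_000):
    """Midpoint-rule estimate of the area under the density on (0, upper)."""
    width = upper / steps
    return sum(chi2_pdf((i + 0.5) * width, k) for i in range(steps)) * width


for k in (2, 4, 10):
    print(k, round(area_under_curve(k), 4))
```

The estimate is only reliable when the upper limit is far beyond k, and the midpoint rule is less accurate for k = 1, where the density is unbounded near zero.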

Degrees of freedom

As the degrees of freedom increase, the chi-square distribution tends towards normality, as can be seen from the figure.
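One way to quantify this trend is the skewness of the distribution, which for k degrees of freedom equals √(8/k), a standard result that the article does not state. A minimal sketch, with a hypothetical function name:

```python
import math


def chi2_skewness(k):
    """Skewness of a chi-square distribution with k degrees of freedom."""
    return math.sqrt(8 / k)


# The skewness shrinks toward 0 (the value for a normal curve) as k grows.
for k in (1, 5, 20, 100):
    print(k, round(chi2_skewness(k), 3))
```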

For a given distribution, the degrees of freedom are determined through the contingency table, which is the table where the observed frequencies of the variables are recorded.

If a table has f rows and c columns, the value of k is:

k = (f - 1) ⋅ (c - 1)
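As a small illustration of this formula, the sketch below counts rows and columns of a table stored as a list of rows. The function name and the 3 × 4 table are hypothetical; the cell values are placeholders.

```python
def degrees_of_freedom(table):
    """k = (f - 1) * (c - 1) for a table given as a list of rows."""
    f = len(table)      # number of rows
    c = len(table[0])   # number of columns
    return (f - 1) * (c - 1)


# Hypothetical 3 x 4 table of observed frequencies (placeholder values).
example = [[10, 12, 8, 9], [7, 11, 13, 6], [9, 5, 14, 10]]
print(degrees_of_freedom(example))  # (3 - 1) * (4 - 1) = 6
```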

Formulation of hypotheses

When the chi-square test is a goodness-of-fit test, the following hypotheses are formulated:

-H₀: the variable X has a probability distribution f(x) with the specific parameters y₁, y₂, …, yₚ

-H₁: X has another probability distribution.

The probability distribution assumed in the null hypothesis can be, for example, the known normal distribution, and the parameters would be the mean μ and the standard deviation σ.

In addition, the null hypothesis is evaluated at a certain level of significance, that is, the probability of the error committed by rejecting it when it is true.

Usually this level is set to 1%, 5% or 10%, and the lower it is, the smaller the risk of rejecting a true null hypothesis.

And if the chi-square test of contingency is used, which, as we have said, serves to verify the independence between two variables X and Y, the hypotheses are:

-H₀: variables X and Y are independent.

-H₁: X and Y are dependent.

Again, it is necessary to specify a level of significance to know the measure of the error when making the decision.

How is the chi-square statistic calculated?

The chi-square statistic is calculated as follows:

χ² = Σ (Fo − Fe)² / Fe

The summation is carried out from the first class, i = 1, to the last one, i = k, with one term for each class.

In this formula:

-Fo is an observed frequency (it comes from the data obtained).

-Fe is the expected or theoretical frequency (it needs to be calculated from the data).
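The sum of squared differences between observed and expected frequencies, each divided by the expected frequency, can be sketched as a short function. The name `chi_square_statistic` and the die-roll counts are hypothetical and serve only to illustrate a goodness-of-fit calculation.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, compared with a uniform distribution
# that predicts 10 rolls per face.
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
statistic = chi_square_statistic(observed, expected)
print(statistic)  # (4 + 4 + 1 + 1 + 0 + 0) / 10 = 1.0
```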

To accept or reject the null hypothesis, we calculate χ² for the observed data and compare it with a value called the critical chi-square, which depends on the degrees of freedom k and the level of significance α:

χ²critical = χ²(k; α)

If, for example, we want to carry out the test with a significance level of 1%, then α = 0.01; if it is going to be 5%, then α = 0.05, and so on. We define p, the cumulative area used to look up the value in the table, as:

p = 1 - α

These critical chi-square values are read from tables containing the cumulative area. For example, for k = 1, which represents 1 degree of freedom, and α = 0.05, which gives p = 1 - 0.05 = 0.95, the value of χ² is 3.841.

Figure 3. Table of values of the chi-square distribution. Source: F. Zapata.
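When no table is at hand, the critical value can be approximated numerically. The sketch below is one possible approach, not the way the table in the figure was built: it evaluates the cumulative area with a series expansion of the incomplete gamma function and then searches by bisection for the point where that area equals p = 1 - α. The function names are hypothetical.

```python
import math


def chi2_cdf(x, k):
    """Cumulative area of the chi-square distribution up to x (series form)."""
    if x <= 0:
        return 0.0
    a, z = k / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    # Add terms of the series until they no longer change the sum.
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_critical(k, alpha):
    """Value of chi-square that leaves an area alpha in the right tail."""
    p = 1 - alpha
    low, high = 0.0, 1.0
    while chi2_cdf(high, k) < p:   # widen the bracket until it contains p
        high *= 2
    for _ in range(100):           # bisection on the cumulative area
        mid = (low + high) / 2
        if chi2_cdf(mid, k) < p:
            low = mid
        else:
            high = mid
    return (low + high) / 2


print(round(chi2_critical(1, 0.05), 3))  # table value quoted in the text: 3.841
```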

Acceptance criteria for H₀

The criterion for accepting H₀ is:

-If χ² < χ²critical, H₀ is accepted; otherwise it is rejected (see Figure 1).

Calculation example

In the following application the chi square test will be used as a test of independence.

Suppose that the researchers want to know if the preference for black coffee is related to the gender of the person, and specify the answer with a significance level of α = 0.05.

For this, a sample of 100 people was interviewed and their responses were recorded in a contingency table.

Step 1

Establish the hypotheses:

-H₀: gender and preference for black coffee are independent.
-H₁: the taste for black coffee is related to the gender of the person.

Step 2

Calculate the expected frequencies for the distribution, for which the totals added in the last row and in the right column of the table are required. Each cell in the red box has an expected value Fe, which is calculated by multiplying the total of its row F by the total of its column C and dividing by the sample total N:

Fe = (F x C) / N

The results are as follows for each cell:

-C1: (36 x 47) / 100 = 16.92
-C2: (64 x 47) / 100 = 30.08
-C3: (36 x 53) / 100 = 19.08
-C4: (64 x 53) / 100 = 33.92
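The four products above can be reproduced from the marginal totals alone. In the sketch below, treating 36 and 64 as row totals and 47 and 53 as column totals is an assumption based on the order of the factors in the formula, since the table itself is not reproduced in the text; the function name is hypothetical.

```python
def expected_frequencies(row_totals, column_totals):
    """Expected value of each cell: row total * column total / sample total."""
    n = sum(row_totals)
    if n != sum(column_totals):
        raise ValueError("row and column totals must add up to the same N")
    return [[f * c / n for c in column_totals] for f in row_totals]


# ASSUMPTION: 36 and 64 are the row totals, 47 and 53 the column totals.
row_totals = [36, 64]
column_totals = [47, 53]
expected = expected_frequencies(row_totals, column_totals)
for row in expected:
    print([round(value, 2) for value in row])
```

With this layout, C1 and C3 fall in the first row and C2 and C4 in the second.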

Step 3

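Next, the chi-square statistic is calculated for this distribution with the formula given earlier, adding, for each of the four cells, the squared difference between the observed and the expected frequency divided by the expected frequency. The result, used in Step 5, is χ² = 2.9005.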
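The calculation of this step can be sketched as follows. The observed counts are not reproduced in the text, so the values used here are an assumption: they are integer counts consistent with the marginal totals (36, 64 and 47, 53) and with the statistic reported in Step 5. The row and column layout is likewise hypothetical.

```python
# ASSUMPTION: observed counts inferred from the marginal totals and the
# reported statistic; they do not appear in the article.
observed = [[21, 15],
            [26, 38]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * column_totals[j] / n  # expected frequency
        chi_square += (fo - fe) ** 2 / fe          # contribution of this cell

print(round(chi_square, 4))
```

Computed without intermediate rounding, the sum comes out slightly below 2.9005; rounding each of the four terms to four decimals before adding them gives 2.9005.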

Step 4

Determine χ²critical, knowing that the data are recorded in f = 2 rows and c = 2 columns; therefore, the number of degrees of freedom is:

k = (2-1) ⋅ (2-1) = 1.

This means that we must look in the table shown above for the value of χ²(k; α) = χ²(1; 0.05), which is:

χ²critical = 3.841

Step 5

Compare the values and decide:

χ² = 2.9005

χ²critical = 3.841

Since χ² < χ²critical, the null hypothesis is accepted and it is concluded that the preference for black coffee is not related to the gender of the person, at a significance level of 5%.
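As a cross-check of this decision, the comparison can also be stated in terms of the right-tail area beyond the computed statistic. The article does not use this formulation; the sketch relies on the fact that, for 1 degree of freedom only, this area equals erfc(√(χ²/2)). The function name is hypothetical.

```python
import math


def right_tail_area_one_df(chi_square):
    """Area to the right of chi_square for 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


alpha = 0.05
tail_area = right_tail_area_one_df(2.9005)  # statistic from Step 5
print(round(tail_area, 4))

# A tail area larger than alpha corresponds to chi-square < critical value.
print("accept H0" if tail_area > alpha else "reject H0")
```

Both formulations lead to the same decision, because the critical value is by definition the point whose right-tail area equals α.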

References

  1. Chi Square Test for Independence. Recovered from: saylordotorg.github.io.
  2. Med Wave. Statistics applied to health sciences: the chi-square test. Recovered from: medwave.cl.
  3. Probabilities and Statistics. Chi-square goodness-of-fit test. Recovered from: probayestadistica.com.
  4. Triola, M. 2012. Elementary Statistics. 11th edition. Addison Wesley.
  5. UNAM. Chi square test. Recovered from: asesorias.cuautitlan2.unam.mx.
