Hypergeometric distribution formulas, equations, model

Basil Manning

The hypergeometric distribution is a discrete probability distribution, suitable for calculating probabilities in random experiments with two possible outcomes. It applies to finite populations, typically small ones, from which elements are drawn without replacement, so the probability of each outcome does not stay constant from one draw to the next.

Therefore, when an element of the population is chosen to determine whether it has a certain characteristic (true or false), that same element cannot be chosen again.

Figure 1. In a bolt population like this, there are surely defective specimens. Source: Pixabay.

Consequently, the next element chosen is more likely to give a true result if the previous element gave a false one. This means that the probability varies as elements are removed from the population.

The main applications of the hypergeometric distribution are quality control in processes with small populations and the calculation of probabilities in games of chance.

As for the mathematical function that defines the hypergeometric distribution, it depends on three parameters, which are:

- Number of population elements (N)

- Sample size (m) 

- Number of events in the entire population with a favorable (or unfavorable) result of the studied characteristic (n).

Article index

  • 1 Formulas and equations
    • 1.1 Important statistical variables
  • 2 Model and properties 
    • 2.1 Main properties of the hypergeometric distribution
    • 2.2 Approximation by the binomial distribution
  • 3 Examples
    • 3.1 Example 1
    • 3.2 Example 2
  • 4 Solved exercises
    • 4.1 Exercise 1
    • 4.2 Exercise 2
    • 4.3 Exercise 3
  • 5 References

Formulas and equations

The formula for the hypergeometric distribution gives the probability P that x favorable cases of a certain characteristic occur in the sample. Written in terms of combinatorial numbers, where C (a, b) is the number of ways of choosing b elements out of a, it is:

P (N, n, m; x) = C (n, x) * C (N-n, m-x) / C (N, m)

In the above expression N, n and m are parameters and x is the variable itself.

-Total population is N.

-Number of positive results of a certain binary characteristic with respect to the total population is n.

-Quantity of sample items is m.

In this case, X is a random variable that takes the value x, and P (x) indicates the probability of occurrence of x favorable cases of the characteristic studied.
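As a minimal sketch, the formula can be evaluated in Python with the standard-library function math.comb. The function name hypergeom_pmf and its argument order are assumptions chosen here to mirror the notation P (N, n, m; x); they do not come from any library.

```python
from math import comb


def hypergeom_pmf(N, n, m, x):
    """Probability of x favorable cases in a sample of m elements drawn
    without replacement from a population of N with n favorable cases.

    Illustrative helper; the name and argument order are assumptions.
    """
    # Values of x outside the possible range have probability zero.
    if x < max(0, m - (N - n)) or x > min(n, m):
        return 0.0
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)


# Box of 500 screws with 5 defective ones, sample of 60, none defective.
print(round(hypergeom_pmf(500, 5, 60, 0), 4))  # 0.5263
```

The probabilities over all possible values of x add up to 1, which is a quick way to check an implementation like this one.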

Important statistical variables

Other statistical variables for the hypergeometric distribution are:

- Mean μ = m * n / N

- Variance σ ^ 2 = m * (n / N) * (1-n / N) * (N-m) / (N-1)

- Standard deviation σ, which is the square root of the variance.
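The closed-form expressions above can be cross-checked against a direct computation from the probabilities P (x). The following sketch does this in Python; the function names are illustrative assumptions, and the values N = 500, n = 5, m = 60 are taken from Example 1 of this article.

```python
from math import comb, sqrt


def hypergeom_stats(N, n, m):
    """Mean, variance and standard deviation from the closed-form formulas."""
    p = n / N
    mean = m * p
    variance = m * p * (1 - p) * (N - m) / (N - 1)
    return mean, variance, sqrt(variance)


def stats_from_pmf(N, n, m):
    """Mean and variance obtained by summing over every possible x."""
    xs = range(max(0, m - (N - n)), min(n, m) + 1)
    probs = [comb(n, x) * comb(N - n, m - x) / comb(N, m) for x in xs]
    mean = sum(x * p for x, p in zip(xs, probs))
    variance = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))
    return mean, variance


print(hypergeom_stats(500, 5, 60))
print(stats_from_pmf(500, 5, 60))
```

Both computations should agree up to floating-point rounding, with a mean of 60 * 5 / 500 = 0.6 defective screws per sample.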

Model and properties 

To arrive at the model of the hypergeometric distribution, we start from the probability of obtaining x favorable cases in a sample of size m. This sample contains elements that have the property under study and elements that do not.

Remember that n represents the number of favorable cases in the total population of N elements. Then the probability would be calculated like this:

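P (x) = (# of ways to get x favorable cases) * (# of ways to get m-x unfavorable cases) / (total # of ways to select m elements)

Expressing the above in the form of combinatorial numbers, we arrive at the following probability distribution model:

P (N, n, m; x) = C (n, x) * C (N-n, m-x) / C (N, m)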
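The counting argument behind the model can be checked by brute force on a small population: list every possible sample, count those with exactly x favorable elements, and compare with the combinatorial expression. The sketch below assumes a hypothetical population of 10 elements with 4 favorable ones and a sample of 3; the function names are illustrative.

```python
from itertools import combinations
from math import comb


def pmf_by_enumeration(N, n, m, x):
    """Fraction of all samples of size m with exactly x favorable elements."""
    # Elements are labeled 0..N-1; labels below n are the favorable ones.
    matching = 0
    total = 0
    for sample in combinations(range(N), m):
        total += 1
        if sum(1 for element in sample if element < n) == x:
            matching += 1
    return matching / total


def pmf_by_formula(N, n, m, x):
    """Same probability from the combinatorial numbers."""
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)


# Hypothetical small population: N = 10, n = 4, m = 3.
for x in range(4):
    print(x, pmf_by_enumeration(10, 4, 3, x), pmf_by_formula(10, 4, 3, x))
```

Enumeration is only practical for small populations, since the number of samples grows very quickly, but it makes the origin of the formula easy to see.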

Main properties of the hypergeometric distribution

They are as follows:

- The sample must always be small, even if the population is large.

- The elements of the sample are extracted one by one, without incorporating them back into the population.

- The property to study is binary, that is, it can only take two values: 1 or 0, or equivalently true or false.

- In each extraction step, the probability changes depending on the previous results.

Approximation using the binomial distribution

Another property of the hypergeometric distribution is that it can be approximated by the binomial distribution, denoted Bi, as long as the population N is large and more than 10 times the sample size m. In this case:

P (N, n, m; x) ≈ Bi (m, n / N, x)

Applicable as long as N is large and N > 10m.
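To get a feeling for how close the approximation is, the two distributions can be compared numerically. The sketch below uses a hypothetical population (N = 10000, n = 300, m = 50) that comfortably satisfies N > 10m; the function names are illustrative assumptions.

```python
from math import comb


def hypergeom_pmf(N, n, m, x):
    """Exact hypergeometric probability of x favorable cases."""
    if x < max(0, m - (N - n)) or x > min(n, m):
        return 0.0
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)


def binom_pmf(m, p, x):
    """Binomial probability of x successes in m independent trials."""
    return comb(m, x) * p ** x * (1 - p) ** (m - x)


# Hypothetical values chosen so that N is much larger than 10 * m.
N, n, m = 10_000, 300, 50
for x in range(5):
    exact = hypergeom_pmf(N, n, m, x)
    approx = binom_pmf(m, n / N, x)
    print(x, round(exact, 4), round(approx, 4))
```

The closer m is to N, the more the removal of each element changes the remaining proportions, and the worse the binomial approximation becomes.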

Examples

Example 1

Suppose a machine produces screws and the accumulated data indicate that 1% come out defective. Then in a box of N = 500 screws the expected number of defective ones is:

n = 500 * 1/100 = 5

Probabilities using the hypergeometric distribution

Suppose that from that box (that is, from that population) we take a sample of m = 60 screws.

The probability that no screw (x = 0) in the sample is defective is 52.63%. This result is obtained with the hypergeometric distribution function:

P (500, 5, 60; 0) = 0.5263

The probability that x = 3 screws in the sample are defective is: P (500, 5, 60; 3) = 0.0129.

On the other hand, the probability that x = 4 screws of the sixty of the sample are defective is: P (500, 5, 60; 4) = 0.0008.

Finally, the probability that x = 5 screws in that sample are defective is: P (500, 5, 60; 5) ≈ 0 (to four decimal places).

But if you want to know the probability that there are 3 or more defective screws in that sample, then you have to obtain the cumulative probability, adding:

P (3) + P (4) + P (5) = 0.0129 + 0.0008 + 0 = 0.0137.
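The individual and cumulative probabilities of this example can be reproduced with a few lines of Python. This is a sketch under the article's values (N = 500, n = 5, m = 60); the helper names hypergeom_pmf and prob_at_least are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(N, n, m, x):
    """Exact hypergeometric probability of x favorable cases."""
    if x < max(0, m - (N - n)) or x > min(n, m):
        return 0.0
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)


def prob_at_least(N, n, m, k):
    """Cumulative probability P(X >= k), summing from k to the largest x."""
    return sum(hypergeom_pmf(N, n, m, x) for x in range(k, min(n, m) + 1))


# Individual probabilities quoted in the example.
for x in (0, 3, 4, 5):
    print(x, round(hypergeom_pmf(500, 5, 60, x), 4))

# Cumulative probability P (3) + P (4) + P (5).
print(round(prob_at_least(500, 5, 60, 3), 4))
```

Because the code adds the terms before rounding, its cumulative result may differ in the last decimal place from the sum of the rounded values shown above.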

This example is illustrated in figure 2, obtained with GeoGebra, free software widely used in schools, institutes and universities.

Figure 2. Example of hypergeometric distribution. Prepared by F. Zapata with GeoGebra.

Example 2

A Spanish deck has 40 cards, of which 10 belong to the suit of golds and the remaining 30 do not. Suppose that 7 cards are drawn at random from that deck, without returning them to the deck.

If X is the number of golds present in the 7 cards drawn, then the probability that there will be x golds in a 7-card draw is given by the hypergeometric distribution P (40,10,7; x).

To see how this works, the probability of having 4 golds in a 7-card draw is calculated with the formula of the hypergeometric distribution and the following values:

P (40, 10, 7; 4) = C (10, 4) * C (30, 3) / C (40, 7) = 210 * 4060 / 18643560 = 0.0457

And the result is: 4.57% probability.

But if you want to know the probability of getting 4 or more golds, then you have to add:

P (4) + P (5) + P (6) + P (7) = 5.20%
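A different way to check these figures is to simulate the draws. The following sketch deals many 7-card hands from a 40-card deck with 10 golds and counts how often exactly 4 golds appear. It is a Monte Carlo estimate, not the exact formula, so the result only approaches 4.57% as the number of trials grows. The function name, the number of trials and the seed are arbitrary assumptions.

```python
import random


def simulate_golds(trials, target, seed=0):
    """Estimate the probability of `target` golds in a 7-card hand."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    deck = [1] * 10 + [0] * 30  # 1 marks a gold card, 0 any other card
    hits = 0
    for _ in range(trials):
        hand = rng.sample(deck, 7)  # drawing without replacement
        if sum(hand) == target:
            hits += 1
    return hits / trials


print(simulate_golds(100_000, 4))
```

The call to sample is what makes this a hypergeometric experiment: each card can appear only once in a hand, so the chance of a gold changes with every card drawn.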

Solved exercises

The following set of exercises is intended to illustrate the concepts presented in this article and help the reader assimilate them. It is important that readers try to solve them on their own before looking at the solution.

Exercise 1

A condom factory has found that out of every 1,000 condoms produced by a certain machine, 5 come out defective. For quality control, 100 condoms are randomly selected and the lot is rejected if at least one is defective. Answer:

a) What is the probability that the lot will be discarded?

b) Is this quality control criterion efficient?

Solution

In this case, very large combinatorial numbers will appear. Calculation is difficult unless a suitable software package is available.

But since the population is large and ten times the size of the sample, it is possible to approximate the hypergeometric distribution by the binomial distribution:

P (1000, 5, 100; x) ≈ Bi (100, 5/1000, x) = Bi (100, 0.005, x) = C (100, x) * 0.005 ^ x * (1-0.005) ^ (100-x)

In the above expression C (100, x) is a combinatorial number. Then the probability of there being at least one defective is calculated like this:

P (x> = 1) = 1 - Bi (0) = 1- 0.6058 = 0.3942

It is a good approximation when compared with the value obtained by applying the hypergeometric distribution: 0.4102

It can be said that, with a probability of about 40%, the batch would be discarded, which is not very efficient.

But by being a little less demanding in the quality control process and discarding the batch only if there are two or more defectives in the sample of 100, the probability of discarding the batch would fall to about 8%.
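With software, the large combinatorial numbers are no obstacle, so the rejection probabilities can also be obtained exactly from the hypergeometric distribution. The sketch below does this for the two criteria discussed (reject with at least 1, or at least 2, defectives in the sample); the function names are illustrative assumptions.

```python
from math import comb


def hypergeom_pmf(N, n, m, x):
    """Exact hypergeometric probability of x defective units in the sample."""
    if x < max(0, m - (N - n)) or x > min(n, m):
        return 0.0
    return comb(n, x) * comb(N - n, m - x) / comb(N, m)


def rejection_probability(N, n, m, threshold):
    """Probability that the sample has at least `threshold` defectives."""
    accepted = sum(hypergeom_pmf(N, n, m, x) for x in range(threshold))
    return 1 - accepted


# Exercise values: 1000 units, 5 defective, sample of 100.
for threshold in (1, 2):
    print(threshold, round(rejection_probability(1000, 5, 100, threshold), 4))
```

Python integers have arbitrary precision, so the combinatorial numbers are computed exactly and only the final division is rounded to a floating-point value.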

Exercise 2

A plastic block machine works in such a way that out of every 10 pieces, one comes out deformed. In a sample of 5 pieces, what is the probability that exactly one piece is defective?

Solution

Population: N = 10

Number n of defectives for every N: n = 1

Sample size: m = 5

P (10, 1, 5; 1) = C (1,1) * C (9,4) / C (10,5) = 1 * 126/252 = 0.5

Therefore there is a 50% probability that, in a sample of 5, exactly one piece will be deformed.
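For small cases like this one, the result can be kept as an exact fraction instead of a decimal. The sketch below uses Python's fractions module for that purpose; the function name hypergeom_pmf_exact is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf_exact(N, n, m, x):
    """Hypergeometric probability as an exact fraction."""
    if x < max(0, m - (N - n)) or x > min(n, m):
        return Fraction(0)
    return Fraction(comb(n, x) * comb(N - n, m - x), comb(N, m))


# Exercise values: N = 10 pieces, n = 1 deformed, sample of m = 5.
print(hypergeom_pmf_exact(10, 1, 5, 1))  # 1/2
```

The fraction 126/252 is reduced automatically to 1/2, which matches the 50% obtained above.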

Exercise 3

In a meeting of young high school graduates there are 7 ladies and 6 gentlemen. Among the girls, 4 study humanities and 3 science. In the boy group, 1 studies humanities and 5 science. Calculate the following:

a) If three girls are chosen at random, what is the probability that they all study humanities?

b) If three attendees are chosen at random, regardless of gender, what is the probability that all three study science or all three study humanities?

c) Now select two friends at random and let x be the random variable "number of those who study humanities" among the two chosen. Determine the mean or expected value of x and the variance σ ^ 2.

Solution a

Population is the total number of girls: N = 7. Those who study humanities are n = 4, of the total. The random sample of girls will be m = 3.

In this case, the probability that all three are humanities students is given by the hypergeometric function:

P (N = 7, n = 4, m = 3, x = 3) = C (4, 3) C (3, 0) / C (7, 3) = 0.1143

So there is an 11.4% probability that three girls chosen at random all study humanities.

Solution b

The values to use now are:

-Population: N = 7 + 6 = 13

-Number studying humanities: n = 4 + 1 = 5

-Sample size: m = 3

-Number of friends in the sample studying humanities: x

According to this, x = 3 means that all three study humanities, while x = 0 means that none do, so all three study science. The probability that all three study the same thing is given by the sum:

P (13, 5, 3; x = 0) + P (13, 5, 3; x = 3) = C (8, 3) / C (13, 3) + C (5, 3) / C (13, 3) = 56/286 + 10/286 = 0.1958 + 0.0350 = 0.2308

Then there is a 23% probability that three meeting attendees, chosen at random, study the same thing.

Solution c

Here we have the following values:

N = 7 + 6 = 13 is the total population of friends, n = 4 + 1 = 5 is the number of them studying humanities, and the sample size is m = 2.

The expected value is:

E (x) = m * (n / N) = 2 * (5/13) = 0.7692

And the variance:

σ (x) ^ 2 = m * (n / N) * (1-n / N) * (N-m) / (N-1) = 2 * (5/13) * (1-5/13) * (13-2) / (13-1) = 2 * (5/13) * (8/13) * (11/12) = 0.4339

References

  1. Discrete probability distributions. Retrieved from: biplot.usal.es
  2. Statistics and probability. Hypergeometric distribution. Retrieved from: projectdescartes.org
  3. CDPYE-UGR. Hypergeometric distribution. Retrieved from: ugr.es
  4. GeoGebra. Classic GeoGebra, probability calculator. Retrieved from: geogebra.org
  5. Probafacil. Solved problems of hypergeometric distribution. Retrieved from: probafacil.com
  6. Minitab. Hypergeometric distribution. Retrieved from: support.minitab.com
  7. University of Vigo. Main discrete distributions. Retrieved from: anapg.webs.uvigo.es
  8. Vitutor. Statistics and combinatorics. Retrieved from: vitutor.net
  9. Weisstein, Eric W. Hypergeometric Distribution. Retrieved from: mathworld.wolfram.com
  10. Wikipedia. Hypergeometric distribution. Retrieved from: es.wikipedia.com
