Grouped data: examples and solved exercise

Robert Johnston

Grouped data are data that have been classified into categories or classes, using the frequency of each class as the criterion. This is done to simplify the handling of large amounts of data and to make their trends easier to identify.

Once organized into these classes by their frequencies, the data make up a frequency distribution, from which useful information is extracted through its characteristics.

Figure 1. With the grouped data, it is possible to construct graphs and calculate statistical parameters that describe trends. Source: Pixabay.

Here is a simple example of grouped data:

Suppose that the height of 100 female students, selected from all the basic physics courses of a university, is measured and the following results are obtained:

Height (cm)    Frequency
155 - 159      6
160 - 164      14
165 - 169      47
170 - 174      28
175 - 179      5
Total          100

The results obtained were divided into 5 classes, which appear in the left column.

The first class, 155 to 159 cm, has 6 students; the second, 160 to 164 cm, has 14; and the third, 165 to 169 cm, has the largest number of members, 47. Next comes the 170 to 174 cm class with 28 students, and finally the 175 to 179 cm class with only 5.

The number of members of each class is precisely its frequency, or absolute frequency. Adding all the frequencies gives the total number of data, which in this example is 100.

Characteristics of the frequency distribution

Frequency

As we have seen, the frequency of a class is the number of data values that fall within it. To facilitate the calculation of properties of the distribution, such as the mean and the variance, the following quantities are defined:

-Cumulative frequency: obtained by adding the frequency of a class to the cumulative frequency of the previous class. The first cumulative frequency equals the frequency of the first class, and the last equals the total number of data.

-Relative frequency: calculated by dividing the absolute frequency of each class by the total number of data. Multiplying it by 100 gives the percentage relative frequency.

-Cumulative relative frequency: the sum of the relative frequency of a class and the cumulative relative frequency of the previous class. The last cumulative relative frequency must equal 1.

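For our example, the frequencies look like this:

Height (cm)    Frequency    Cumulative    Relative    Cumulative relative
155 - 159      6            6             0.06        0.06
160 - 164      14           20            0.14        0.20
165 - 169      47           67            0.47        0.67
170 - 174      28           95            0.28        0.95
175 - 179      5            100           0.05        1.00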
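These quantities can be reproduced with a short Python sketch. The variable names are illustrative choices, not part of the original article; the class limits and frequencies are those of the height example.

```python
from itertools import accumulate

# Class limits (cm) and absolute frequencies from the height example.
classes = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]
freqs = [6, 14, 47, 28, 5]

n = sum(freqs)                               # total number of data
cum_freqs = list(accumulate(freqs))          # cumulative frequency
rel_freqs = [f / n for f in freqs]           # relative frequency
cum_rel_freqs = [c / n for c in cum_freqs]   # cumulative relative frequency

# Print one row per class.
for (lo, hi), f, cf, rf, crf in zip(classes, freqs, cum_freqs,
                                    rel_freqs, cum_rel_freqs):
    print(f"{lo}-{hi} cm  f={f:3d}  cum={cf:3d}  rel={rf:.2f}  cum_rel={crf:.2f}")
```

The cumulative relative frequency is computed here from the cumulative frequency rather than by summing relative frequencies, which avoids accumulating floating-point rounding errors.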

Limits

The extreme values of each class or interval are called class limits. As we can see, each class has a lower limit and an upper limit. For example, the first class in the study of heights has a lower limit of 155 cm and an upper limit of 159 cm.

This example has clearly defined limits. However, it is also possible to define open limits, in which one end of the class is left unspecified: for example, "height less than 160 cm", "height less than 165 cm" and so on.

Boundaries

Height is a continuous variable, so it can be considered that the first class actually starts at 154.5 cm, since by rounding this value to the nearest integer, we get 155 cm.

This class covers all values up to 159.5 cm, because after this, the heights are rounded to 160.0 cm. A height of 159.7 cm already belongs to the following class.

The actual class boundaries for this example are, in cm:

154.5 - 159.5
159.5 - 164.5
164.5 - 169.5
169.5 - 174.5
174.5 - 179.5

Amplitude

The width, or amplitude, of a class is obtained by subtracting its lower boundary from its upper boundary. For the first interval of our example we have 159.5 cm - 154.5 cm = 5 cm.

The reader can verify that for the other intervals of the example the amplitude is also 5 cm. However, it should be noted that distributions can be constructed with intervals of different amplitude.
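As a minimal sketch, the boundaries and widths of the example can be derived from the class limits. It assumes, as the text does, that heights are recorded to the nearest centimetre, so each boundary lies half a unit beyond the stated limit; the names used are illustrative.

```python
# Class limits (cm), recorded to the nearest integer.
limits = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]

# Assumption: measurements are rounded to the nearest centimetre,
# so the true boundary is half a unit outside each limit.
half_unit = 0.5

boundaries = [(lo - half_unit, hi + half_unit) for lo, hi in limits]

# Width (amplitude) = upper boundary - lower boundary.
widths = [upper - lower for lower, upper in boundaries]

print(boundaries)
print(widths)
```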

Class mark

It is the midpoint of the interval, obtained by averaging the upper limit and the lower limit.

For our example, the first class mark is (155 + 159) / 2 = 157 cm. The reader can see that the remaining class marks are: 162, 167, 172 and 177 cm.

Determining the class marks is important, as they are necessary to find the arithmetic mean and the variance of the distribution.
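The class marks of the example can be checked with a few lines of Python. This is only an illustrative sketch with hypothetical variable names.

```python
# Class limits (cm) from the height example.
limits = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]

# Class mark = midpoint of the interval.
class_marks = [(lo + hi) / 2 for lo, hi in limits]

print(class_marks)  # [157.0, 162.0, 167.0, 172.0, 177.0]
```

Averaging the class boundaries instead of the limits gives the same midpoints, since both ends shift by the same half unit in opposite directions.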

Measures of central tendency and dispersion for grouped data

The most commonly used measures of central tendency are the mean, the median and the mode. They describe the tendency of the data to cluster around a certain central value.

Mean

It is one of the main measures of central tendency. For grouped data, the arithmetic mean can be calculated using the formula:

$$\bar{X} = \frac{\sum_{i=1}^{g} F_i \, m_i}{n}$$

Where:

-$\bar{X}$ is the mean

-$F_i$ is the frequency of class $i$

-$m_i$ is the class mark of class $i$

-$g$ is the number of classes

-$n$ is the total number of data
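The mean of grouped data can be sketched in Python as a frequency-weighted average of the class marks. The function name `grouped_mean` is a hypothetical helper introduced here for illustration.

```python
def grouped_mean(freqs, marks):
    """Mean of grouped data: sum of frequency * class mark, divided by n."""
    n = sum(freqs)  # total number of data
    return sum(f * m for f, m in zip(freqs, marks)) / n


# Frequencies and class marks (cm) from the height example.
freqs = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

mean = grouped_mean(freqs, marks)
print(f"mean = {mean:.1f} cm")
```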

Median

For the median, the interval that contains observation number n / 2 must be identified. In our example, this is observation number 50, because there are 100 data points in total. It lies in the 165-169 cm interval.

Then you have to interpolate to find the numerical value that corresponds to that observation, for which the following formula is used:

$$\text{Median} = B_M + \frac{\frac{n}{2} - F_{BM}}{F_m} \cdot c$$

Where:

-$c$ = width of the interval where the median is found

-$B_M$ = lower boundary of the interval to which the median belongs

-$F_m$ = number of observations contained in the median interval

-$n / 2$ = half of the total data

-$F_{BM}$ = total number of observations before the median interval
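The interpolation can be sketched as follows. The function `grouped_median` is a hypothetical helper, and it assumes that all classes share the same width, as in the height example.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Median of grouped data by linear interpolation.

    Assumes every class has the same width.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("the frequency table is empty")
    half = n / 2
    before = 0  # observations before the current class (F_BM)
    for lower, f in zip(lower_boundaries, freqs):
        # The median class is the first whose cumulative frequency reaches n / 2.
        if f > 0 and before + f >= half:
            return lower + (half - before) / f * width
        before += f


# Lower boundaries (cm) and frequencies from the height example.
lower_boundaries = [154.5, 159.5, 164.5, 169.5, 174.5]
freqs = [6, 14, 47, 28, 5]

median = grouped_median(lower_boundaries, freqs, width=5)
print(f"median = {median:.2f} cm")
```

Note that the median class is found from the cumulative frequencies, not from the size of the class itself.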

Mode

For the mode, the modal class is identified: the one that contains the most observations. Its class mark is taken as the mode.
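A minimal Python sketch of this rule, using illustrative names, picks the class with the highest frequency and reports its class mark. If two classes tie for the highest frequency, this sketch simply returns the first of them, which is a simplification.

```python
# Frequencies and class marks (cm) from the height example.
freqs = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

# Index of the modal class: the class with the highest frequency.
modal_index = max(range(len(freqs)), key=lambda i: freqs[i])

mode = marks[modal_index]
print(f"mode = {mode} cm")  # mode = 167 cm
```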

Variance and standard deviation

Variance and standard deviation are measures of dispersion. If we denote the variance by $s^2$ and the standard deviation, which is the square root of the variance, by $s$, then for grouped data we have respectively:

$$s^2 = \frac{\sum_{i=1}^{g} F_i \, (m_i - \bar{X})^2}{n - 1}$$

$$s = \sqrt{s^2}$$
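The following sketch computes both quantities for grouped data. It assumes the sample form of the variance, with n - 1 in the denominator, which is the form used in the solved exercise of this article; `grouped_variance` is a hypothetical helper name.

```python
import math


def grouped_variance(freqs, marks):
    """Sample variance of grouped data (denominator n - 1)."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, marks)) / n
    # Squared deviation of each class mark, weighted by its frequency.
    squared = sum(f * (m - mean) ** 2 for f, m in zip(freqs, marks))
    return squared / (n - 1)


# Frequencies and class marks (cm) from the height example.
freqs = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

variance = grouped_variance(freqs, marks)
std_dev = math.sqrt(variance)
print(f"variance = {variance:.2f} cm^2, standard deviation = {std_dev:.2f} cm")
```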

Solved exercise

For the distribution of heights of female university students proposed at the beginning, calculate the values of:

a) Mean

b) Median

c) Mode

d) Variance and standard deviation.

Figure 2. When dealing with a large number of values, such as the heights of a large group of students, it is preferable to group the data into classes. Source: Pixabay.

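Solution a

Let's build the following table to facilitate the calculations:

Height (cm)    F_i    m_i (cm)    F_i x m_i
155 - 159      6      157         942
160 - 164      14     162         2268
165 - 169      47     167         7849
170 - 174      28     172         4816
175 - 179      5      177         885
Total          100                16760

Substituting values and carrying out the summation directly:

$$\bar{X} = \frac{6 \times 157 + 14 \times 162 + 47 \times 167 + 28 \times 172 + 5 \times 177}{100} \text{ cm} = \frac{16760}{100} \text{ cm} = 167.6 \text{ cm}$$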

Solution b

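The interval to which the median belongs is 165-169 cm, because it contains observation number n / 2 = 50: the two previous classes accumulate only 20 observations, while the cumulative frequency up to this class is 67.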

Let's identify each of these values in the example, with the help of Table 2:

c = 5 cm (see the amplitude section)

B_M = 164.5 cm

F_m = 47

n / 2 = 100/2 = 50

F_BM = 20

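Substituting in the formula:

$$\text{Median} = 164.5 \text{ cm} + \frac{50 - 20}{47} \times 5 \text{ cm} \approx 167.7 \text{ cm}$$

Solution c

The interval that contains the most observations is 165-169 cm, whose class mark is 167 cm. The mode is therefore 167 cm.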

Solution d

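We expand the previous table by adding two additional columns:

Height (cm)    F_i    m_i (cm)    F_i x m_i    (m_i - X)^2    F_i x (m_i - X)^2
155 - 159      6      157         942          112.36         674.16
160 - 164      14     162         2268         31.36          439.04
165 - 169      47     167         7849         0.36           16.92
170 - 174      28     172         4816         19.36          542.08
175 - 179      5      177         885          88.36          441.80
Total          100                16760                       2114.00

We apply the formula:

$$s^2 = \frac{\sum_{i=1}^{g} F_i \, (m_i - \bar{X})^2}{n - 1}$$

And we develop the summation:

$$s^2 = \frac{6 \times 112.36 + 14 \times 31.36 + 47 \times 0.36 + 28 \times 19.36 + 5 \times 88.36}{99} = \frac{2114}{99} \approx 21.35 \text{ cm}^2$$

Therefore:

$$s = \sqrt{21.35 \text{ cm}^2} \approx 4.6 \text{ cm}$$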
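As a cross-check of solutions a and d, the grouped data can be expanded into 100 individual values, each set to its class mark, and passed to Python's standard `statistics` module. This is an illustrative check rather than part of the original solution; it works because `statistics.variance` also divides by n - 1.

```python
import statistics

# Frequencies and class marks (cm) from the height example.
freqs = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

# Repeat each class mark as many times as its frequency.
expanded = [m for f, m in zip(freqs, marks) for _ in range(f)]

mean = statistics.mean(expanded)
variance = statistics.variance(expanded)  # sample variance (n - 1)
std_dev = statistics.stdev(expanded)      # sample standard deviation

print(f"n = {len(expanded)}")
print(f"mean = {mean:.1f} cm")
print(f"variance = {variance:.2f} cm^2")
print(f"standard deviation = {std_dev:.1f} cm")
```

The median is left out of this check on purpose: the median of the expanded values is a class mark, whereas the grouped formula interpolates inside the median class.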
