Grouped data are data that have been classified into categories or classes, with the number of observations in each class (its frequency) as the criterion. This is done to simplify the handling of large amounts of data and to reveal their trends.
Once organized into these classes by their frequencies, the data make up a frequency distribution, from which useful information, such as central tendency and dispersion, can be extracted.
Here is a simple example of grouped data:
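Suppose that the heights of 100 female students, selected from all the basic physics courses of a university, are measured and the following results are obtained:

| Height (cm) | Number of students |
|---|---|
| 155-159 | 6 |
| 160-164 | 14 |
| 165-169 | 47 |
| 170-174 | 28 |
| 175-179 | 5 |
| Total | 100 |

The results were divided into 5 classes, which appear in the left column.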
The first class, 155-159 cm, has 6 students; the second, 160-164 cm, has 14; and the third, 165-169 cm, has the largest number of members: 47. Next comes the 170-174 cm class with 28 students and finally the 175-179 cm class with only 5.
The number of members of each class is precisely its frequency, or absolute frequency, and adding all the frequencies gives the total number of data, which in this example is 100.
As we have seen, the frequency is the number of observations that fall in a class. To facilitate the calculation of properties of the distribution, such as the mean and the variance, the following quantities are defined:
-Cumulative frequency: obtained by adding the frequency of a class to the cumulative frequency of the previous class. The first cumulative frequency equals the frequency of the first class, and the last equals the total number of data.
-Relative frequency: calculated by dividing the absolute frequency of each class by the total number of data. Multiplying it by 100 gives the percentage relative frequency.
-Cumulative relative frequency: the sum of the relative frequency of a class and the cumulative relative frequency of the previous class. The last cumulative relative frequency must equal 1.
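For our example, the frequencies look like this:

| Class (cm) | Frequency | Cumulative frequency | Relative frequency | Cumulative relative frequency |
|---|---|---|---|---|
| 155-159 | 6 | 6 | 0.06 | 0.06 |
| 160-164 | 14 | 20 | 0.14 | 0.20 |
| 165-169 | 47 | 67 | 0.47 | 0.67 |
| 170-174 | 28 | 95 | 0.28 | 0.95 |
| 175-179 | 5 | 100 | 0.05 | 1.00 |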
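The derived frequencies follow mechanically from the absolute frequencies. The snippet below is a minimal sketch of that calculation; the variable names are illustrative, and the last class is taken as 175-179 cm, consistent with its class mark of 177 cm.

```python
from itertools import accumulate

# Classes and absolute frequencies from the height example.
classes = ["155-159", "160-164", "165-169", "170-174", "175-179"]
freq = [6, 14, 47, 28, 5]

n = sum(freq)                             # total number of data
cum_freq = list(accumulate(freq))         # cumulative frequency
rel_freq = [f / n for f in freq]          # relative frequency
cum_rel_freq = [c / n for c in cum_freq]  # cumulative relative frequency

# Print one row per class: class, f, cumulative f, relative f, cumulative relative f.
for row in zip(classes, freq, cum_freq, rel_freq, cum_rel_freq):
    print("{:8} {:4d} {:4d} {:6.2f} {:6.2f}".format(*row))
```

The last cumulative frequency equals the total number of data and the last cumulative relative frequency equals 1, which is a quick check that no class was skipped.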
The extreme values of each class or interval are called class limits. As we can see, each class has a lower limit and an upper limit. For example, the first class in the study of heights has a lower limit of 155 cm and an upper limit of 159 cm.
This example has clearly defined limits. However, it is also possible to define open limits, by stating only one bound instead of exact values: "height less than 160 cm", "height less than 165 cm" and so on.
Height is a continuous variable, so it can be considered that the first class actually starts at 154.5 cm, since by rounding this value to the nearest integer, we get 155 cm.
This class covers all values up to 159.5 cm, because after this, the heights are rounded to 160.0 cm. A height of 159.7 cm already belongs to the following class.
The actual class boundaries for this example are, in cm: 154.5-159.5, 159.5-164.5, 164.5-169.5, 169.5-174.5 and 174.5-179.5.
The width of a class, also called its amplitude, is obtained by subtracting its lower boundary from its upper boundary. For the first interval of our example we have 159.5 cm - 154.5 cm = 5 cm.
The reader can verify that for the other intervals of the example the amplitude is also 5 cm. However, it should be noted that distributions can be constructed with intervals of different amplitudes.
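As a rough sketch, the boundaries and widths can be derived from the class limits in a few lines. It assumes heights were recorded to the nearest centimetre, so each boundary lies half a unit beyond the corresponding limit; the names `limits` and `half_unit` are illustrative.

```python
# Class limits (cm) of the height example; the last class is taken as 175-179.
limits = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]

# Assumption: measurements are rounded to the nearest 1 cm, so the true
# boundaries extend half a unit below the lower limit and above the upper limit.
half_unit = 0.5
boundaries = [(lo - half_unit, hi + half_unit) for lo, hi in limits]

# Width (amplitude) of each class: upper boundary minus lower boundary.
widths = [upper - lower for lower, upper in boundaries]

print(boundaries)
print(widths)
```

With these boundaries, the upper boundary of each class coincides with the lower boundary of the next, so every possible height belongs to exactly one class.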
The class mark is the midpoint of the interval and is obtained by averaging the upper limit and the lower limit.
For our example, the first class mark is (155 + 159) / 2 = 157 cm. The reader can see that the remaining class marks are: 162, 167, 172 and 177 cm.
Determining the class marks is important, as they are necessary to find the arithmetic mean and the variance of the distribution.
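The class marks of the example can be checked with a short sketch like the one below, where `limits` is an illustrative name for the list of class limits.

```python
# Class limits (cm) of the height example; the last class is taken as 175-179.
limits = [(155, 159), (160, 164), (165, 169), (170, 174), (175, 179)]

# Class mark: average of the lower and upper limit of each class.
class_marks = [(lo + hi) / 2 for lo, hi in limits]

print(class_marks)  # [157.0, 162.0, 167.0, 172.0, 177.0]
```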
The most commonly used measures of central tendency are the mean, the median and the mode. They describe the tendency of the data to cluster around a certain central value.
The arithmetic mean is one of the main measures of central tendency. For grouped data, it can be calculated using the formula:

$$\bar{X} = \frac{\sum_{i=1}^{g} f_i \, m_i}{n}$$

Where:
-$\bar{X}$ is the mean
-$f_i$ is the frequency of class $i$
-$m_i$ is the class mark of class $i$
-$g$ is the number of classes
-$n$ is the total number of data
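In code, the grouped mean is simply the frequency-weighted average of the class marks. The function name `grouped_mean` below is hypothetical, chosen only for this sketch.

```python
def grouped_mean(freq, marks):
    """Mean of grouped data: sum of frequency x class mark, divided by n."""
    n = sum(freq)  # total number of data
    return sum(f * m for f, m in zip(freq, marks)) / n

# Frequencies and class marks (cm) of the height example.
freq = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

mean = grouped_mean(freq, marks)
print(mean)  # 167.6
```

Because each observation is represented by its class mark, the result approximates the mean of the raw heights rather than reproducing it exactly.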
For the median, the interval where the n / 2 observation is found must be identified. In our example, this observation is number 50, because there are a total of 100 data points. This observation is in the range 165-169 cm.
Then we interpolate within that interval to find the numerical value that corresponds to that observation, using the formula:

$$\text{Median} = B_M + \frac{\frac{n}{2} - F_{BM}}{F_m} \, c$$

Where:
-$c$ = width of the interval where the median is found
-$B_M$ = the lower boundary of the interval to which the median belongs
-$F_m$ = number of observations contained in the median interval
-$n/2$ = half of the total number of data
-$F_{BM}$ = total number of observations before the median interval
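The interpolation can be sketched as follows. The function `grouped_median` and its parameter names are illustrative, and the sketch assumes that all classes share the same width and that the median class is the first one whose cumulative frequency reaches n/2.

```python
from itertools import accumulate

def grouped_median(lower_boundaries, freq, width):
    """Interpolated median of grouped data with equal-width classes."""
    half = sum(freq) / 2                 # n / 2
    cum = list(accumulate(freq))         # cumulative frequencies
    # Median class: first class whose cumulative frequency reaches n / 2.
    k = next(i for i, c in enumerate(cum) if c >= half)
    before = cum[k - 1] if k > 0 else 0  # observations before the median class
    return lower_boundaries[k] + (half - before) / freq[k] * width

# Lower class boundaries (cm), frequencies and class width of the height example.
lower_boundaries = [154.5, 159.5, 164.5, 169.5, 174.5]
freq = [6, 14, 47, 28, 5]

median = grouped_median(lower_boundaries, freq, width=5)
print(round(median, 2))
```

For the height example this selects the 165-169 cm class, with 20 observations before it and 47 inside it.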
For the mode, the modal class is identified: it is the class that contains the most observations, and its class mark is taken as the mode.
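A minimal sketch of that lookup, using illustrative variable names, picks the index of the largest frequency and reads off the corresponding class and class mark. If two classes tie, `max` returns the first one.

```python
# Classes, frequencies and class marks (cm) of the height example.
classes = ["155-159", "160-164", "165-169", "170-174", "175-179"]
freq = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

# Index of the modal class: the class with the largest frequency.
k = max(range(len(freq)), key=freq.__getitem__)
modal_class, mode = classes[k], marks[k]

print(modal_class, mode)  # 165-169 167
```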
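Variance and standard deviation are measures of dispersion. If we denote the variance by $s^2$ and the standard deviation, which is the square root of the variance, by $s$, for grouped data we have respectively:

$$s^2 = \frac{\sum_{i=1}^{g} f_i \,(m_i - \bar{X})^2}{n - 1}$$

and

$$s = \sqrt{\frac{\sum_{i=1}^{g} f_i \,(m_i - \bar{X})^2}{n - 1}}$$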
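The sketch below computes both quantities from the frequencies and class marks. It uses the sample form, dividing by n - 1, which matches the divisor of 99 used in the worked example of this article; the function name `grouped_variance` is hypothetical.

```python
import math

def grouped_variance(freq, marks):
    """Sample variance of grouped data (divides by n - 1)."""
    n = sum(freq)
    mean = sum(f * m for f, m in zip(freq, marks)) / n
    # Frequency-weighted sum of squared deviations of the class marks.
    squared_dev = sum(f * (m - mean) ** 2 for f, m in zip(freq, marks))
    return squared_dev / (n - 1)

# Frequencies and class marks (cm) of the height example.
freq = [6, 14, 47, 28, 5]
marks = [157, 162, 167, 172, 177]

variance = grouped_variance(freq, marks)  # in cm^2
std_dev = math.sqrt(variance)             # in cm
print(round(variance, 2), round(std_dev, 1))
```

Dividing by n instead of n - 1 would give the population variance, which differs only slightly for a sample of 100 observations.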
For the distribution of heights of female university students proposed at the beginning, calculate the values of:
a) Mean
b) Median
c) Mode
d) Variance and standard deviation.
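Let's build the following table to facilitate the calculations:

| Class (cm) | Frequency $f_i$ | Class mark $m_i$ (cm) | $f_i \, m_i$ (cm) |
|---|---|---|---|
| 155-159 | 6 | 157 | 942 |
| 160-164 | 14 | 162 | 2268 |
| 165-169 | 47 | 167 | 7849 |
| 170-174 | 28 | 172 | 4816 |
| 175-179 | 5 | 177 | 885 |
| Total | 100 | | 16760 |

a) Mean. Substituting values and carrying out the summation directly:
$\bar{X}$ = (6 × 157 + 14 × 162 + 47 × 167 + 28 × 172 + 5 × 177) / 100 cm = 16760 / 100 cm = 167.6 cm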
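b) Median. The interval to which the median belongs is 165-169 cm, because it is the first interval in which the cumulative frequency reaches n / 2 = 50: only 20 observations lie in the classes before it, while 67 have been accumulated by its end.
Let's identify each of the values in the formula for this example, with the help of the frequency table (Table 2):
c = 5 cm (see the amplitude section)
BM = 164.5 cm
Fm = 47
n / 2 = 100 / 2 = 50
FBM = 20
Substituting in the formula:
Median = 164.5 cm + [(50 - 20) / 47] × 5 cm ≈ 167.7 cm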
c) Mode. The interval that contains the most observations is 165-169 cm, whose class mark is 167 cm. The mode is therefore 167 cm.
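d) Variance and standard deviation. We expand the previous table by adding two additional columns, using the mean $\bar{X}$ = 167.6 cm:

| Class (cm) | $f_i$ | $m_i$ (cm) | $(m_i - \bar{X})^2$ (cm²) | $f_i \,(m_i - \bar{X})^2$ (cm²) |
|---|---|---|---|---|
| 155-159 | 6 | 157 | 112.36 | 674.16 |
| 160-164 | 14 | 162 | 31.36 | 439.04 |
| 165-169 | 47 | 167 | 0.36 | 16.92 |
| 170-174 | 28 | 172 | 19.36 | 542.08 |
| 175-179 | 5 | 177 | 88.36 | 441.80 |
| Total | 100 | | | 2114.00 |

We apply the formula:

$$s^2 = \frac{\sum_{i=1}^{g} f_i \,(m_i - \bar{X})^2}{n - 1}$$

And we develop the summation:
$s^2$ = (6 × 112.36 + 14 × 31.36 + 47 × 0.36 + 28 × 19.36 + 5 × 88.36) / 99 cm² = 2114 / 99 cm² ≈ 21.35 cm²
Therefore:
$s$ = √(21.35 cm²) ≈ 4.6 cm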