The mean provides an estimate of the central tendency for grouped data. Since raw data points are not available, we rely on the midpoints of each interval. The general formula for the mean of grouped data is:
\( \text{Mean} \, ( \overline{x} ) = \dfrac{\sum (m_i \cdot n_i)}{N} \)
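where \(m_i\) is the midpoint of the \(i\)-th class interval, \(n_i\) is the frequency of that class, and \(N = \sum n_i\) is the total number of observations.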
Calculating the midpoints: For each class interval, the midpoint is computed as:
\( m_i = \dfrac{\text{Lower Limit} + \text{Upper Limit}}{2} \)
Once you have identified the midpoint for each class interval, multiply each midpoint by its corresponding frequency. Sum all these products to arrive at \(\sum (m_i \cdot n_i)\). Finally, divide the total by the overall number of observations (\(N\)).
Consider the sample grouped data presented in the following table:
Class Interval | Frequency (n) | Midpoint (m) | Product \(m \times n\) |
---|---|---|---|
10 - 20 | 5 | 15 | 75 |
20 - 30 | 10 | 25 | 250 |
30 - 40 | 15 | 35 | 525 |
40 - 50 | 10 | 45 | 450 |
50 - 60 | 5 | 55 | 275 |
Total | 45 | | 1575 |
Here, the total frequency \(N = 45\) and \(\sum (m_i \cdot n_i) = 1575\). Hence, the estimated mean is:
\( \overline{x} = \dfrac{1575}{45} = 35 \)
This result indicates that the central value of the data distribution is near 35, which is the best estimate given the aggregated information.
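The midpoint method can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `grouped_mean` and the representation of classes as `(lower, upper)` tuples are assumptions made for this example.

```python
from typing import Sequence, Tuple


def grouped_mean(intervals: Sequence[Tuple[float, float]],
                 freqs: Sequence[int]) -> float:
    """Estimate the mean of grouped data from class midpoints."""
    if len(intervals) != len(freqs):
        raise ValueError("intervals and freqs must have the same length")
    total = sum(freqs)  # N, the total number of observations
    if total == 0:
        raise ValueError("total frequency must be positive")
    # m_i = (lower limit + upper limit) / 2
    midpoints = [(lo + hi) / 2 for lo, hi in intervals]
    # sum of m_i * n_i over all classes
    weighted_sum = sum(m * n for m, n in zip(midpoints, freqs))
    return weighted_sum / total


# Data from the table above.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [5, 10, 15, 10, 5]
print(grouped_mean(intervals, freqs))  # 35.0
```

The printed value matches the hand calculation \(1575 / 45 = 35\).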
The median represents the middle value of the dataset. For grouped data, this is estimated by identifying the median class and then using interpolation to approximate the middle value.
\( \text{Median} \, (M) = L + \left( \dfrac{\frac{N}{2} - CF}{f} \right) \times h \)
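where \(L\) is the lower limit of the median class, \(N\) is the total number of observations, \(CF\) is the cumulative frequency of the class preceding the median class, \(f\) is the frequency of the median class, and \(h\) is the class width.
Consider the following cumulative frequency table derived from grouped data: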
Class Interval | Frequency (f) | Cumulative Frequency (CF) |
---|---|---|
10 - 20 | 5 | 5 |
20 - 30 | 10 | 15 |
30 - 40 | 15 | 30 |
40 - 50 | 10 | 40 |
50 - 60 | 5 | 45 |
In this case, the total number of observations \(N = 45\). Thus, \(N/2 = 22.5\). Review the cumulative frequency distribution to identify the median class. The class interval 30 – 40 has a cumulative frequency of 30 and is the first interval where the cumulative frequency exceeds 22.5.
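With the median class determined as 30 – 40, we need the following values: \(L = 30\), \(CF = 15\) (the cumulative frequency of the preceding class, 20 – 30), \(f = 15\), and \(h = 10\).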
Applying these values to the formula:
\( \text{Median} = 30 + \left( \dfrac{22.5 - 15}{15} \right) \times 10 \)
Calculation details:
\( \text{Median} = 30 + \left( \dfrac{7.5}{15} \right) \times 10 = 30 + 0.5 \times 10 = 30 + 5 = 35 \)
Thus, the estimated median is 35, meaning that roughly half of the observations lie below 35 and half above, based on the grouped values.
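The interpolation step can be expressed as a short Python sketch. The function name `grouped_median` is hypothetical, and the code assumes the classes are listed in ascending order. It selects the first class whose cumulative frequency reaches \(N/2\); for contiguous classes this gives the same value as selecting the first class that strictly exceeds it.

```python
from typing import Sequence, Tuple


def grouped_median(intervals: Sequence[Tuple[float, float]],
                   freqs: Sequence[int]) -> float:
    """Estimate the median of grouped data by linear interpolation."""
    total = sum(freqs)  # N
    if total == 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    cum_before = 0  # CF: cumulative frequency of the classes before this one
    for (lo, hi), f in zip(intervals, freqs):
        if f > 0 and cum_before + f >= half:
            # Median = L + ((N/2 - CF) / f) * h
            return lo + (half - cum_before) / f * (hi - lo)
        cum_before += f
    raise ValueError("median class not found")  # not reachable for valid input


# Data from the cumulative frequency table above.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [5, 10, 15, 10, 5]
print(grouped_median(intervals, freqs))  # 35.0
```

Tracking the running total inside the loop avoids building a separate cumulative frequency column.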
The mode is the value that appears most frequently in the dataset. For grouped data, the mode is typically not a single data point; instead, it is represented by the modal class – the class interval with the highest frequency. To estimate the mode with greater accuracy, a formula is used that adjusts for the frequencies of the neighboring classes.
The formula for estimating the mode is:
\( \text{Mode} = L + \left( \dfrac{f_m - f_{m-1}}{(2f_m - f_{m-1} - f_{m+1})} \right) \times h \)
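where \(L\) is the lower limit of the modal class, \(f_m\) is the frequency of the modal class, \(f_{m-1}\) and \(f_{m+1}\) are the frequencies of the classes immediately before and after it, and \(h\) is the class width.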
Suppose we have the following group frequencies:
Class Interval | Frequency |
---|---|
10 - 20 | 5 |
20 - 30 | 10 |
30 - 40 | 15 |
40 - 50 | 10 |
50 - 60 | 5 |
The modal class in this dataset is 30 - 40, since it has the highest frequency (\( f_m = 15 \)). From the table, the frequency of the class preceding the modal class is \( f_{m-1} = 10 \) and the frequency of the class succeeding it is also \( f_{m+1} = 10 \). The width of the class interval \( h \) is \(10\) (calculated as \(40 - 30\)).
Using these values:
\( \text{Mode} = 30 + \left( \dfrac{15 - 10}{2 \times 15 - 10 - 10} \right) \times 10 \)
Simplify the fraction:
\( \text{Mode} = 30 + \left( \dfrac{5}{30 - 20} \right) \times 10 = 30 + \left( \dfrac{5}{10} \right) \times 10 \)
\( \text{Mode} = 30 + 0.5 \times 10 = 30 + 5 = 35 \)
Hence, the estimated mode is 35. Because the two neighboring classes have equal frequencies, the estimate falls exactly at the midpoint of the modal class; with unequal neighbors, the formula would shift it toward the neighbor with the higher frequency.
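A minimal Python sketch of the mode formula follows. The function name `grouped_mode` is hypothetical. Two choices in the code are assumptions, not part of the formula as stated: when several classes tie for the highest frequency, the first one is used, and a missing neighbor (when the modal class is the first or last class) is treated as having frequency 0. The sketch also assumes equal class widths.

```python
from typing import Sequence, Tuple


def grouped_mode(intervals: Sequence[Tuple[float, float]],
                 freqs: Sequence[int]) -> float:
    """Estimate the mode of grouped data from the modal class and its neighbors."""
    if not freqs or len(intervals) != len(freqs):
        raise ValueError("intervals and freqs must be non-empty and equal in length")
    # Index of the modal class (first class with the highest frequency).
    m = max(range(len(freqs)), key=lambda i: freqs[i])
    f_m = freqs[m]
    # Assumption: a missing neighbor counts as frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    denom = 2 * f_m - f_prev - f_next
    if denom == 0:
        raise ValueError("mode is undefined: modal class does not stand out")
    lo, hi = intervals[m]
    # Mode = L + ((f_m - f_{m-1}) / (2 f_m - f_{m-1} - f_{m+1})) * h
    return lo + (f_m - f_prev) / denom * (hi - lo)


# Data from the frequency table above.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [5, 10, 15, 10, 5]
print(grouped_mode(intervals, freqs))  # 35.0
```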
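The calculations above provide estimates that are essential in summarizing large datasets presented in grouped form. Each measure gives us a different perspective: the mean is the general average, the median is the middle value, and the mode marks the most frequent value.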
Note that, due to the nature of grouped data, these estimated measures rely heavily on the assumption that the data is uniformly distributed within the intervals. If the raw data were available, the actual mean, median, or mode could differ slightly from these estimates wherever that assumption does not hold.
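The effect of this assumption can be seen with a small experiment. The raw values below are invented purely for illustration: each one sits near the lower limit of its class, so the midpoint-based estimate overshoots the true mean.

```python
from statistics import mean

# Hypothetical raw observations (invented for illustration): every value lies
# near the lower limit of its class instead of being spread evenly.
raw = [10, 11, 12, 21, 22, 31]
intervals = [(10, 20), (20, 30), (30, 40)]

# Count how many raw values fall in each half-open class [lower, upper).
freqs = [sum(lo <= x < hi for x in raw) for lo, hi in intervals]

# Midpoint-based estimate, as it would be computed from the grouped table alone.
grouped_estimate = sum((lo + hi) / 2 * n
                       for (lo, hi), n in zip(intervals, freqs)) / sum(freqs)
actual_mean = mean(raw)

print("frequencies:", freqs)
print("grouped estimate:", round(grouped_estimate, 3))
print("actual mean:", round(actual_mean, 3))
```

Here the grouped estimate is larger than the actual mean because every observation lies below its class midpoint.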
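Measure | Formula | Key Steps | Usage |
---|---|---|---|
Mean | \( \overline{x} = \dfrac{\sum (m_i \cdot n_i)}{N} \) | Compute the midpoints; multiply each by its frequency; sum the products; divide by \(N\) | General average; sensitive to extreme values |
Median | \( M = L + \left( \dfrac{\frac{N}{2} - CF}{f} \right) \times h \) | Build the cumulative frequencies; identify the median class; interpolate within it | Middle value; robust in skewed data |
Mode | \( \text{Mode} = L + \left( \dfrac{f_m - f_{m-1}}{(2f_m - f_{m-1} - f_{m+1})} \right) \times h \) | Identify the modal class; adjust using the neighboring frequencies | Most frequent value; useful in multi-modal distributions |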
When applying these formulas to real-world data, it is important to consider the following aspects:
Grouped data assumes that the values within each interval are evenly distributed. If the actual data is not uniformly distributed within a group, the calculated mean, median, and mode might only serve as approximations.
The width of the class intervals (\(h\)) has a direct impact on the accuracy of both the median and mode calculations. Consistent class intervals tend to yield more reliable measures, while varying widths may require additional adjustments or considerations.
In a skewed dataset, the mean may be pulled toward the tail of the distribution, whereas the median remains a more accurate indicator of the central value. The mode, meanwhile, indicates the peak frequency and can help identify the most common range within the data.
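The pull of the tail on the mean can be checked with a right-skewed frequency table. The frequencies below are hypothetical, chosen only so that most observations fall in the lowest classes. The helper names `grouped_mean` and `grouped_median` are likewise assumptions for this sketch.

```python
def grouped_mean(intervals, freqs):
    """Midpoint-based mean estimate for grouped data."""
    return sum((lo + hi) / 2 * n for (lo, hi), n in zip(intervals, freqs)) / sum(freqs)


def grouped_median(intervals, freqs):
    """Interpolated median estimate for grouped data (ascending classes)."""
    half = sum(freqs) / 2
    cum_before = 0  # cumulative frequency before the current class
    for (lo, hi), f in zip(intervals, freqs):
        if f > 0 and cum_before + f >= half:
            return lo + (half - cum_before) / f * (hi - lo)
        cum_before += f
    raise ValueError("median class not found")


intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
skewed_freqs = [20, 12, 7, 4, 2]  # hypothetical right-skewed frequencies

mean_est = grouped_mean(intervals, skewed_freqs)
median_est = grouped_median(intervals, skewed_freqs)
print("mean estimate:", round(mean_est, 2))
print("median estimate:", round(median_est, 2))
print("mean pulled above median:", mean_est > median_est)
```

With these frequencies the mean estimate lies above the median estimate, which is the pattern expected when the tail extends toward higher values.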
These measures are widely used in statistical analysis for summarizing large datasets, whether in academic research, business analytics, or survey data interpretation. The estimation techniques enable analysts to perform reliable calculations even in the absence of individual data points.