Unit 2: Measures of Central Tendency & Dispersion
        
        
            1. Measures of Central Tendency (Averages)
            A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set. They are also known as "averages."
            
            Mathematical Averages
            
            1. Arithmetic Mean (AM or Mean)
            The sum of all observations divided by the number of observations.
            
                Ungrouped Data: x̄ = (x1 + x2 + ... + xn) / n = (Σx) / n
                
                Grouped Data: x̄ = (Σf * x) / (Σf) = (Σf * x) / N
                
                (where x = midpoint of class, f = frequency, N = total frequency)
            
            
                - Pros: Easy to calculate, rigidly defined, uses all data points.
- Cons: Highly affected by extreme values (outliers).
2. Geometric Mean (GM)
            The n-th root of the product of n observations. Used for averaging ratios, percentages, or growth rates.
            
                Ungrouped Data: GM = (x1 * x2 * ... * xn)^(1/n)
                
                Using Logs: log(GM) = (Σ log(x)) / n  =>  GM = Antilog[ (Σ log(x)) / n ]
            
            
                - Pros: Less affected by outliers than AM, good for multiplicative data.
- Cons: Cannot be calculated if any value is zero or negative.
3. Harmonic Mean (HM)
            The reciprocal of the arithmetic mean of the reciprocals of the observations. Used for averaging rates and speeds.
            
                Ungrouped Data: HM = n / (Σ (1/x))
                
                Grouped Data: HM = N / (Σ (f/x))
            
            
                - Pros: Gives more weight to smaller values.
- Cons: Cannot be calculated if any value is zero; harder to interpret than the AM.
                Note: For any set of positive numbers: AM ≥ GM ≥ HM.
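                The three mathematical averages above can be computed with the standard-library statistics module; a minimal Python sketch (Python 3.8+ for geometric_mean), with made-up positive data:
                
                import statistics
                
                data = [4, 8, 16]                      # hypothetical positive observations
                
                am = statistics.mean(data)             # (4 + 8 + 16) / 3 ≈ 9.33
                gm = statistics.geometric_mean(data)   # (4 * 8 * 16)^(1/3) = 8.0
                hm = statistics.harmonic_mean(data)    # 3 / (1/4 + 1/8 + 1/16) ≈ 6.86
                
                assert am >= gm >= hm                  # AM ≥ GM ≥ HM for positive data
                print(am, gm, hm)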
            
            Positional / Partitional Averages
            
            1. Median (Md)
            The middle value of a dataset that has been arranged in order (ascending or descending).
            
                Ungrouped Data:
                
                - If n is odd: Median = Value of the ((n+1)/2)-th item.
                
                - If n is even: Median = Average of the (n/2)-th and ((n/2) + 1)-th items.
                
                Grouped Data: Median = L + [ ( (N/2) - cf ) / f ] * h
                
                (L = lower boundary of median class, N = total frequency, cf = cumulative frequency *before* median class, f = frequency of median class, h = class width)
            
            
                - Pros: Not affected by outliers (resistant measure), can be calculated for open-ended classes.
- Cons: Does not use all data points.
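                A short Python sketch of the grouped-data median formula above; the class boundaries and frequencies are invented for illustration.
                
                # Median = L + ((N/2 - cf) / f) * h
                boundaries = [0, 10, 20, 30, 40]   # class boundaries (uniform width h = 10)
                freqs      = [5, 8, 12, 5]         # class frequencies
                N = sum(freqs)                     # total frequency = 30
                
                # Median class: the first class whose cumulative frequency reaches N/2.
                cum, i = 0, 0
                while cum + freqs[i] < N / 2:
                    cum += freqs[i]
                    i += 1
                
                L, f, cf = boundaries[i], freqs[i], cum
                h = boundaries[i + 1] - boundaries[i]
                median = L + ((N / 2 - cf) / f) * h
                print(median)                      # 20 + ((15 - 13) / 12) * 10 ≈ 21.67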
2. Mode (Mo)
            The value that appears most frequently in a dataset.
            
                - A dataset can be unimodal (1 mode), bimodal (2 modes), or multimodal (>2 modes).
                Grouped Data: Mode = L + [ (f1 - f0) / (2*f1 - f0 - f2) ] * h
                
                (L = lower boundary of modal class, f1 = freq of modal class, f0 = freq of pre-modal class, f2 = freq of post-modal class, h = class width)
            
            
                - Pros: Easy to find, not affected by outliers, the only average for nominal data.
- Cons: May not exist or may not be unique; it is not rigidly defined.
                Empirical Relationship: For a moderately skewed distribution:
                
                Mean - Mode ≈ 3 * (Mean - Median)  or  Mode ≈ 3*Median - 2*Mean
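                A corresponding Python sketch for the grouped-data mode formula, reusing the same hypothetical frequency distribution as the median example above.
                
                # Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h
                boundaries = [0, 10, 20, 30, 40]
                freqs      = [5, 8, 12, 5]
                
                i  = freqs.index(max(freqs))                     # modal class = highest frequency
                L  = boundaries[i]                               # lower boundary of modal class
                h  = boundaries[i + 1] - boundaries[i]           # class width
                f1 = freqs[i]                                    # frequency of modal class
                f0 = freqs[i - 1] if i > 0 else 0                # frequency of preceding class
                f2 = freqs[i + 1] if i + 1 < len(freqs) else 0   # frequency of following class
                
                mode = L + ((f1 - f0) / (2 * f1 - f0 - f2)) * h
                print(mode)   # 20 + (4 / 11) * 10 ≈ 23.64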
            
        
        
        
            2. Partition Values
            Values that divide an ordered dataset into a number of equal parts.
            
            1. Quartiles (Q)
            Divide the data into 4 equal parts.
            
                - Q1 (First/Lower Quartile): 25% of data is below it.
- Q2 (Second Quartile): 50% of data is below it. Q2 is the Median.
- Q3 (Third/Upper Quartile): 75% of data is below it.
2. Deciles (D)
            Divide the data into 10 equal parts (D1, D2, ... D9).
            
            3. Percentiles (P)
            Divide the data into 100 equal parts (P1, P2, ... P99).
            
            
                Note: Q1 = P25,  Q2 = D5 = P50 = Median,  Q3 = P75.
            
            
            
                Formula for Grouped Data (Percentile 'k'):
                
                Pk = L + [ ( (k*N/100) - cf ) / f ] * h
                
                (To find Q1, use k=25. To find Q3, use k=75. To find D4, use k=40, etc.)
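                For ungrouped data, the quartile/percentile relationships above can be checked quickly with NumPy (assumed installed); note that numpy.percentile's default linear interpolation can differ slightly from the (n+1)-based textbook rule for small samples.
                
                import numpy as np
                
                data = [12, 15, 17, 20, 22, 25, 28, 30, 35]   # hypothetical ordered data
                
                q1, q2, q3 = np.percentile(data, [25, 50, 75])
                print(q1, q2, q3)                  # Q1 = P25, Q2 = P50, Q3 = P75
                print(q2 == np.median(data))       # True: the 50th percentile is the median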
            
        
        
        
            3. Measures of Dispersion (Variability)
            Measures that describe the spread, scatter, or variation of data points in a dataset. A low dispersion means data is clustered tightly around the center.
            Absolute Measures of Dispersion
            (Expressed in the same units as the data)
            1. Range
            The simplest measure. The difference between the largest and smallest observation.
            Range = Largest Value (L) - Smallest Value (S)
            
                - Pros: Easy to calculate.
- Cons: Based on only two values, highly affected by outliers.
2. Inter-Quartile Range (IQR)
            The range of the middle 50% of the data. It is a resistant measure of spread.
            IQR = Q3 - Q1
            3. Quartile Deviation (QD) or Semi-Interquartile Range
            Half of the Inter-Quartile Range.
            QD = (Q3 - Q1) / 2
            
                - Pros: Not affected by outliers.
- Cons: Ignores 50% of the data (the extremes).
4. Mean Deviation (MD)
            The arithmetic mean of the absolute deviations of the observations from a measure of central tendency (mean, median, or mode).
            
                MD (from mean): (Σ |x - x̄|) / n
                
                MD (from median): (Σ |x - Md|) / n
            
            
                - Pros: Uses all data points.
- Cons: The absolute values ignore the signs of the deviations, which makes further algebraic treatment difficult.
5. Variance and Standard Deviation (SD)
            The most important and widely used measures of dispersion.
            
                Variance (σ² or s²): The average of the squared deviations from the mean.
            
            
                Population Variance: σ² = (Σ (x - μ)²) / N
                
                Sample Variance: s² = (Σ (x - x̄)²) / (n - 1)  (Note the 'n-1' for unbiased estimate)
            
            
                Standard Deviation (σ or s): The square root of the variance.
            
            
                SD (σ or s) = sqrt(Variance)
                
                Computational Formula: s = sqrt[ ( (Σx²) - ( (Σx)² / n ) ) / (n - 1) ]
            
            
                - Pros: Uses all data, mathematically sound, basis for many other statistical methods.
- Cons: Affected by outliers (due to squaring).
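                A small Python sketch comparing the definitional and computational forms of the sample variance above; the data values are arbitrary.
                
                import math
                
                x = [2, 4, 4, 4, 5, 5, 7, 9]
                n = len(x)
                mean = sum(x) / n
                
                s2_def  = sum((xi - mean) ** 2 for xi in x) / (n - 1)             # Σ(x - x̄)² / (n - 1)
                s2_comp = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)  # (Σx² - (Σx)²/n) / (n - 1)
                s = math.sqrt(s2_def)                                             # sample standard deviation
                
                print(s2_def, s2_comp, s)   # both variances ≈ 4.571, s ≈ 2.138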
        
            4. Coefficient of Variation (Relative Dispersion)
            Absolute measures (like SD) cannot be used to compare the variability of two different datasets if they have different units (e.g., heights vs. weights) or different means.
            We use a relative measure, the Coefficient of Variation (CV).
            
                Coefficient of Variation (CV): The ratio of the standard deviation to the mean, usually expressed as a percentage.
            
            
                CV = (Standard Deviation / Mean) * 100
                
                CV = (s / x̄) * 100
            
            
                - It is a unit-free measure.
- A lower CV means the data is more consistent or less variable.
- A higher CV means the data is less consistent or more variable.
                Exam Tip: A common question is: "Team A has a mean score of 80 with SD=5. Team B has a mean score of 50 with SD=4. Which team is more consistent?"
                
                - CV(A) = (5 / 80) * 100 = 6.25%
                
                - CV(B) = (4 / 50) * 100 = 8%
                
                - Answer: Team A is more consistent because its CV is lower.
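                The exam-tip comparison works out as below; a trivial Python helper, with the team figures taken from the example.
                
                def cv(sd, mean):
                    """Coefficient of variation as a percentage: (SD / mean) * 100."""
                    return sd / mean * 100
                
                print(cv(5, 80))   # Team A: 6.25%
                print(cv(4, 50))   # Team B: 8.0%  -> Team A (lower CV) is more consistent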
            
        
        
        
            5. Graphical Representation of Measures
            
            1. Ogives (Cumulative Frequency Curves)
            An ogive is a graph of a cumulative frequency distribution. It is used to graphically locate partition values (Median, Quartiles, etc.).
            
                - Less Than Ogive:
                    - Plot Upper Class Boundaries on the X-axis.
                    - Plot Less Than Cumulative Frequencies on the Y-axis.
                    - The curve rises from left to right.
                - More Than Ogive:
                    - Plot Lower Class Boundaries on the X-axis.
                    - Plot More Than Cumulative Frequencies on the Y-axis.
                    - The curve falls from left to right.
                Finding the Median: The Median is the X-coordinate of the intersection point of the "Less Than" and "More Than" ogives.
                
                Alternatively, on a "Less Than" ogive, find the N/2 value on the Y-axis, draw a horizontal line to the curve, and then a vertical line down to the X-axis. This X-value is the Median.
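                Reading the median off a "less than" ogive amounts to linear interpolation at N/2; a Python sketch using NumPy and the same hypothetical distribution as earlier.
                
                import numpy as np
                
                upper_boundaries = [10, 20, 30, 40]
                less_than_cf     = [5, 13, 25, 30]   # "less than" cumulative frequencies
                N = less_than_cf[-1]
                
                # Interpolate the boundary (x) at which the cumulative frequency (y) reaches N/2.
                median = np.interp(N / 2, less_than_cf, upper_boundaries)
                print(median)   # ≈ 21.67, matching the grouped-data formula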
            
            2. Box Plot (Box-and-Whisker Plot)
            A graphical summary of a distribution based on five numbers: Minimum, Q1, Median (Q2), Q3, and Maximum.
            
                - A box is drawn from Q1 to Q3.
- A vertical line is drawn inside the box at the Median.
- "Whiskers" extend from the box to the Minimum and Maximum values (or to the last non-outlier point).
- Outliers are often plotted as individual points.
A box plot clearly shows the center (Median), spread (IQR/box length), and skewness (position of median in the box) of the data.
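                A minimal matplotlib sketch (library assumed available); the data are arbitrary and include one large value that should appear as an outlier point.
                
                import matplotlib.pyplot as plt
                
                data = [12, 15, 17, 20, 22, 25, 28, 30, 35, 70]
                
                plt.boxplot(data, vert=False)   # box from Q1 to Q3, line at the median,
                                                # whiskers to the last non-outlier points,
                                                # outliers drawn as individual points
                plt.title("Box-and-whisker plot")
                plt.show()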
            
        
        
            6. Moments
            Moments are a set of statistical parameters used to describe the characteristics (shape, center, spread) of a distribution.
            
            1. Raw Moments (μ'r) - Moments about Origin (Zero)
            The r-th raw moment is the arithmetic mean of the r-th power of the observations.
            
                μ'r = (Σ x^r) / n   (Ungrouped)
                
                μ'r = (Σ f * x^r) / N   (Grouped)
            
            
                - The first raw moment (r=1) is the Arithmetic Mean: μ'1 = (Σx) / n = x̄
2. Central Moments (μr) - Moments about the Mean
            The r-th central moment is the arithmetic mean of the r-th power of the deviations from the mean.
            
                μr = (Σ (x - x̄)^r) / n   (Ungrouped)
                
                μr = (Σ f * (x - x̄)^r) / N   (Grouped)
            
            
                - The first central moment (r=1) is always zero: μ1 = (Σ(x - x̄)) / n = 0
- The second central moment (r=2) is the Variance: μ2 = (Σ(x - x̄)²) / n = σ²
- The third central moment (r=3), μ3, is used to measure skewness.
- The fourth central moment (r=4), μ4, is used to measure kurtosis.
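                A Python sketch of raw and central moments for ungrouped data, checking that μ1 = 0 and that μ2 equals the population variance; the data values are arbitrary.
                
                def raw_moment(x, r):
                    return sum(xi ** r for xi in x) / len(x)        # μ'r = Σ x^r / n
                
                def central_moment(x, r):
                    m = sum(x) / len(x)
                    return sum((xi - m) ** r for xi in x) / len(x)  # μr = Σ (x - x̄)^r / n
                
                x = [2, 4, 4, 4, 5, 5, 7, 9]
                print(raw_moment(x, 1))       # 5.0 -> the arithmetic mean
                print(central_moment(x, 1))   # 0.0 (always)
                print(central_moment(x, 2))   # 4.0 -> population variance σ²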
        
            7. Sheppard's Corrections for Moments
            When calculating moments from grouped data (a continuous frequency distribution), we assume all values in a class are at the midpoint. This introduces a "grouping error."
            Sheppard's corrections adjust the calculated central moments (μr) to get a more accurate estimate, assuming the distribution is continuous and tapers off to zero at both ends.
            
  Let 'h' be the uniform class width.
                
                Corrected μ1 = μ1 = 0 (No change)
                
                Corrected μ2 = μ2 - (h² / 12)
                
                Corrected μ3 = μ3 (No change)
                
                Corrected μ4 = μ4 - (h² / 2) * μ2 + (7 * h^4 / 240)
            
            
                Exam Tip: You usually only need to remember the correction for the second moment (variance). Corrected Variance = Calculated Variance - (h²/12).
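                A one-line Python helper for the correction highlighted in the exam tip; the numbers passed in are hypothetical.
                
                def sheppard_corrected_variance(mu2, h):
                    """Corrected μ2 = calculated μ2 - h²/12 (uniform class width h)."""
                    return mu2 - h ** 2 / 12
                
                print(sheppard_corrected_variance(50.0, 10))   # 50 - 100/12 ≈ 41.67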
            
        
        
        
            8. Measures of Skewness and Kurtosis
            
            Skewness (Shape)
            Measures the asymmetry or lack of symmetry of a distribution.
            
                - Symmetrical Distribution: The "bell" shape is identical on both sides of the center (Mean = Median = Mode).
                - Positively Skewed (Skewed to the Right): The "tail" of the distribution is longer on the right side (typically Mean > Median > Mode).
                - Negatively Skewed (Skewed to the Left): The "tail" of the distribution is longer on the left side (typically Mean < Median < Mode).
            Measures of Skewness:
            
                - Karl Pearson's Coefficient (Skp):
                    Skp = (Mean - Mode) / Standard Deviation
                    (Approximate form) Skp = 3 * (Mean - Median) / Standard Deviation
- Bowley's Coefficient (Skb): (Based on quartiles)
                    Skb = (Q3 + Q1 - 2*Median) / (Q3 - Q1) 
- Moment-based Coefficient (β1 and γ1):
                    
                        β1 = (μ3)² / (μ2)³
                        
                        γ1 = μ3 / (μ2)^(3/2) = ±sqrt(β1), taking the sign of μ3
 (If γ1 > 0, positive skew. If γ1 < 0, negative skew. If γ1 = 0, symmetrical)
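                A Python sketch of the moment-based skewness coefficient γ1 defined above (population moments); the sample data are arbitrary.
                
                def gamma1(x):
                    n = len(x)
                    m = sum(x) / n
                    mu2 = sum((xi - m) ** 2 for xi in x) / n
                    mu3 = sum((xi - m) ** 3 for xi in x) / n
                    return mu3 / mu2 ** 1.5
                
                print(gamma1([2, 4, 4, 4, 5, 5, 7, 9]))   # ≈ 0.656 > 0: positively skewed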
Kurtosis (Peakedness)
            Measures the peakedness or flatness of a distribution compared to the normal distribution.
            
                - Leptokurtic: More peaked, sharper peak, and heavier/fatter tails. (Kurtosis > 3)
- Mesokurtic: The "normal" bell shape. (Kurtosis = 3)
- Platykurtic: Flatter, more rounded peak, and lighter/thinner tails. (Kurtosis < 3)
Measures of Kurtosis (β2 and γ2):
            
                β2 = μ4 / (μ2)²
                
                γ2 = β2 - 3  (This is "excess kurtosis")
            
            
                - If β2 = 3 (or γ2 = 0), it is Mesokurtic.
- If β2 > 3 (or γ2 > 0), it is Leptokurtic.
- If β2 < 3 (or γ2 < 0), it is Platykurtic.
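                A matching Python sketch for β2 and the excess kurtosis γ2; again the data values are arbitrary.
                
                def kurtosis(x):
                    n = len(x)
                    m = sum(x) / n
                    mu2 = sum((xi - m) ** 2 for xi in x) / n
                    mu4 = sum((xi - m) ** 4 for xi in x) / n
                    beta2 = mu4 / mu2 ** 2
                    return beta2, beta2 - 3                    # (β2, γ2)
                
                beta2, gamma2 = kurtosis([2, 4, 4, 4, 5, 5, 7, 9])
                print(beta2, gamma2)   # β2 ≈ 2.78 < 3 (γ2 < 0): platykurtic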