Unit 2: Measures of Central Tendency & Dispersion
        
        
            1. Measures of Central Tendency (Averages)
            A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set.
            
            1. Arithmetic Mean (AM or Mean)
            The sum of all observations divided by the number of observations.
            
                Ungrouped Data: x̄ = (x1 + x2 + ... + xn) / n = (Σx) / n
                
                Grouped Data: x̄ = Σ(f * x) / Σf = Σ(f * x) / N
                
                (where x = midpoint of class, f = frequency, N = total frequency)
            
            
                - Pros: Easy to calculate, rigidly defined, uses all data points.
- Cons: Highly affected by extreme values (outliers).
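
            The two mean formulas above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names and the sample numbers are made up for the example.

```python
def mean_ungrouped(values):
    """Arithmetic mean of raw observations: (sum of x) / n."""
    return sum(values) / len(values)


def mean_grouped(midpoints, freqs):
    """Grouped mean: sum(f * x) / sum(f), where x is each class midpoint."""
    total_freq = sum(freqs)  # N
    return sum(f * x for f, x in zip(freqs, midpoints)) / total_freq


# Illustrative data only (not from the notes).
marks = [4, 8, 15, 16, 23, 42]
print(mean_ungrouped(marks))                  # 108 / 6 = 18.0

# Classes 0-10, 10-20, 20-30 have midpoints 5, 15, 25.
print(mean_grouped([5, 15, 25], [2, 5, 3]))   # 160 / 10 = 16.0

# One extreme value pulls the mean sharply upward.
print(mean_ungrouped(marks + [1000]))         # 1108 / 7
```

            The last line illustrates the "Cons" point: a single outlier moves the mean far away from the bulk of the data.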
2. Median (Md)
            The middle value of a dataset that has been arranged in order (ascending or descending).
            
                Ungrouped Data:
                
                - If n is odd: Median = Value of the ((n+1)/2)-th item.
                
                - If n is even: Median = Average of the (n/2)-th and ((n/2) + 1)-th items.
                
                Grouped Data: Median = L + [ ( (N/2) - cf ) / f ] * h
                
                (L = lower boundary of median class, N = total frequency, cf = cumulative frequency *before* median class, f = frequency of median class, h = class width)
            
            
                - Pros: Not affected by outliers, can be calculated for open-ended classes.
- Cons: Does not use all data points.
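
            A short Python sketch of both median rules, assuming equal-width classes given by their lower boundaries. The function names and the frequency table are hypothetical.

```python
def median_ungrouped(values):
    """Middle item for odd n, average of the two middle items for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # the ((n+1)/2)-th item, 1-based
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_grouped(lower_bounds, freqs, h):
    """Median = L + ((N/2 - cf) / f) * h, assuming equal class width h."""
    N = sum(freqs)
    cf = 0  # cumulative frequency before the current class
    for L, f in zip(lower_bounds, freqs):
        if f > 0 and cf + f >= N / 2:            # first class reaching N/2 is the median class
            return L + ((N / 2 - cf) / f) * h
        cf += f
    raise ValueError("empty frequency table")


print(median_ungrouped([7, 1, 3]))       # 3
print(median_ungrouped([7, 1, 3, 5]))    # (3 + 5) / 2 = 4.0

# Classes 0-10, 10-20, 20-30, 30-40 with frequencies 5, 10, 20, 5 (N = 40).
# Median class is 20-30 (cf = 15, f = 20): 20 + ((20 - 15) / 20) * 10 = 22.5
print(median_grouped([0, 10, 20, 30], [5, 10, 20, 5], 10))
```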
3. Mode (Mo)
            The value that appears most frequently in a dataset.
            
                Grouped Data: Mode = L + [ (f1 - f0) / (2*f1 - f0 - f2) ] * h
                
                (L = lower boundary of modal class, f1 = freq of modal class, f0 = freq of pre-modal class, f2 = freq of post-modal class, h = class width)
            
            
                - Pros: Not affected by outliers, the only average for nominal data.
- Cons: May not exist or may not be unique.
                Empirical Relationship: For a moderately skewed distribution:
                
                Mean - Mode ≈ 3 * (Mean - Median)  or  Mode ≈ 3*Median - 2*Mean
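
            The mode formulas can be sketched as below. The helper names and data are illustrative assumptions; `statistics.multimode` is the standard-library call (Python 3.8+) used here for ungrouped data, and the grouped formula is undefined when the modal class and both neighbours have the same frequency.

```python
import statistics


def mode_grouped(lower_bounds, freqs, h):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h for equal-width classes."""
    i = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                   # pre-modal class (0 if none)
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0      # post-modal class (0 if none)
    return lower_bounds[i] + (f1 - f0) * h / (2 * f1 - f0 - f2)


def empirical_mode(mean, median):
    """Mode ~ 3*Median - 2*Mean (moderately skewed data only)."""
    return 3 * median - 2 * mean


# Ungrouped data can have more than one mode.
print(statistics.multimode([2, 3, 3, 5, 5, 7]))              # [3, 5]

# Modal class 20-30: f1 = 20, f0 = 10, f2 = 5 -> 20 + 10 * 10 / 25 = 24.0
print(mode_grouped([0, 10, 20, 30], [5, 10, 20, 5], 10))

print(empirical_mode(mean=16, median=15))                    # 45 - 32 = 13
```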
            
            4. Geometric Mean (GM)
            The n-th root of the product of n observations. Used for averaging ratios, percentages, or growth rates.
            
                Ungrouped Data: GM = (x1 * x2 * ... * xn)^(1/n)
                
                Using Logs: log(GM) = (Σ log(x)) / n  =>  GM = Antilog[ (Σ log(x)) / n ]
            
            
                - Pros: Less affected by outliers than AM.
- Cons: Not usable if any value is zero (the product, and hence the GM, collapses to 0) or negative.
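
            The log form of the GM is how it is usually computed in practice, since the raw product can overflow for long series. A minimal sketch, with a hypothetical function name and made-up growth factors (the standard library also offers `statistics.geometric_mean`):

```python
import math


def geometric_mean(values):
    """GM = antilog(mean of logs); requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# A 25% rise followed by a 20% fall: growth factors 1.25 and 0.80.
factors = [1.25, 0.80]
gm = geometric_mean(factors)
am = sum(factors) / len(factors)

print(gm)   # about 1.0: no net change, since 1.25 * 0.80 = 1
print(am)   # 1.025: the arithmetic mean wrongly suggests 2.5% average growth
```

            This is why the GM, rather than the AM, is used for averaging growth rates.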
5. Harmonic Mean (HM)
            The reciprocal of the arithmetic mean of the reciprocals of the observations. Used for averaging rates and speeds.
            
                Ungrouped Data: HM = n / (Σ (1/x))
                
                Grouped Data: HM = N / (Σ (f/x))
            
            
                - Pros: Gives more weight to smaller values.
- Cons: Cannot be calculated if any value is zero.
                Note: For any set of positive numbers: AM ≥ GM ≥ HM, with equality only when all the values are equal.
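
            A small Python check of the HM formula and of the ordering AM ≥ GM ≥ HM. The speeds are an invented example: a trip covered at 40 km/h one way and 60 km/h back over the same distance.

```python
import math


def harmonic_mean(values):
    """HM = n / sum(1/x); undefined if any value is zero."""
    return len(values) / sum(1 / v for v in values)


speeds = [40, 60]   # km/h over two equal distances (illustrative)

am = sum(speeds) / len(speeds)
gm = math.exp(sum(math.log(v) for v in speeds) / len(speeds))
hm = harmonic_mean(speeds)

print(am)   # 50.0
print(gm)   # about 48.99
print(hm)   # about 48.0, the true average speed for the round trip
print(am >= gm >= hm)
```

            The HM is the right average here because equal distances, not equal times, are spent at each speed.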
            
        
        
        
            2. Partition Values
            Values that divide an ordered dataset into a number of equal parts. The Median is a partition value (it divides the data into 2 equal parts).
            
            1. Quartiles (Q)
            Divide the data into 4 equal parts (Q1, Q2, Q3).
            
            2. Deciles (D)
            Divide the data into 10 equal parts (D1, D2, ... D9).
            
            3. Percentiles (P)
            Divide the data into 100 equal parts (P1, P2, ... P99).
            
            
                Note: Q1 = P25,  Q2 = D5 = P50 = Median,  Q3 = P75.
            
            
            
                Formula for Grouped Data (Percentile 'k'):
                
                Pk = L + [ ( (k*N/100) - cf ) / f ] * h
                
                (To find Q1, use k=25. To find Q3, use k=75. To find D4, use k=40, etc.)
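
            Because quartiles and deciles are special cases of percentiles, one function covers all partition values. A sketch assuming equal-width classes; the function name and the frequency table are hypothetical.

```python
def percentile_grouped(k, lower_bounds, freqs, h):
    """Pk = L + ((k*N/100 - cf) / f) * h for equal-width classes."""
    N = sum(freqs)
    target = k * N / 100          # position of the k-th percentile
    cf = 0                        # cumulative frequency before the current class
    for L, f in zip(lower_bounds, freqs):
        if f > 0 and cf + f >= target:
            return L + ((target - cf) / f) * h
        cf += f
    raise ValueError("k must be between 0 and 100 and the table non-empty")


# Classes 0-10, 10-20, 20-30, 30-40 with frequencies 5, 10, 20, 5 (N = 40).
bounds, freqs, width = [0, 10, 20, 30], [5, 10, 20, 5], 10

q1 = percentile_grouped(25, bounds, freqs, width)   # 10 + ((10 - 5) / 10) * 10 = 15.0
q2 = percentile_grouped(50, bounds, freqs, width)   # 20 + ((20 - 15) / 20) * 10 = 22.5
q3 = percentile_grouped(75, bounds, freqs, width)   # 20 + ((30 - 15) / 20) * 10 = 27.5
print(q1, q2, q3)
```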
            
        
        
        
            3. Measures of Dispersion (Variability)
            Measures that describe the spread or scatter of data points in a dataset.
            1. Range
            The simplest measure. The difference between the largest and smallest observation.
            Range = Largest Value (L) - Smallest Value (S)
            2. Quartile Deviation (QD)
            Also known as the Semi-Interquartile Range. It measures the spread of the middle 50% of the data.
            QD = (Q3 - Q1) / 2
            
                - Pros: Not affected by outliers.
- Cons: Ignores the lowest 25% and the highest 25% of the data.
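
            Range and QD for ungrouped data can be sketched as below. Textbooks differ on how quartiles of raw data are located; this example assumes the default "exclusive" method of the standard-library `statistics.quantiles`, which is one convention among several. The data values are made up.

```python
import statistics


def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


def quartile_deviation(values):
    """QD = (Q3 - Q1) / 2, using statistics.quantiles for Q1 and Q3."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2


data = [3, 5, 7, 8, 12, 13, 21]

print(value_range(data))          # 21 - 3 = 18
print(quartile_deviation(data))   # Q1 = 5, Q3 = 13 -> (13 - 5) / 2 = 4.0

# Replacing the largest value with an outlier changes the range but not the QD.
print(value_range(data[:-1] + [500]), quartile_deviation(data[:-1] + [500]))
```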
3. Mean Deviation (MD)
            The arithmetic mean of the absolute deviations of the observations from a measure of central tendency (usually the median or mean).
            
                MD (from median): (Σ |x - Median|) / n
            
            
                - Pros: Uses all data points.
- Cons: Taking absolute values discards the signs of the deviations, which makes it awkward to handle algebraically.
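
            A minimal sketch of the mean deviation about a chosen centre. The function name and data are illustrative assumptions.

```python
import statistics


def mean_deviation(values, center):
    """MD = sum(|x - center|) / n."""
    return sum(abs(x - center) for x in values) / len(values)


data = [1, 2, 3, 4, 10]
median = statistics.median(data)   # 3
mean = statistics.mean(data)       # 4

print(mean_deviation(data, median))   # (2 + 1 + 0 + 1 + 7) / 5 = 2.2
print(mean_deviation(data, mean))     # (3 + 2 + 1 + 0 + 6) / 5 = 2.4
```

            The mean deviation is smallest when taken about the median, which is one reason the median is the usual choice of centre.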
4. Variance and Standard Deviation (SD)
            The most important and widely used measures of dispersion.
            
                Variance (s²): The average of the squared deviations from the mean. For a sample, the sum of squared deviations is divided by n - 1 rather than n.
            
            
                Sample Variance: s² = (Σ (x - x̄)²) / (n - 1)
            
            
                Standard Deviation (s): The square root of the variance.
            
            
                SD (s) = sqrt(Variance)
            
            
                - Pros: Mathematically sound, basis for many other statistical methods.
- Cons: Affected by outliers (due to squaring).
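
            The sample variance and SD formulas can be sketched as follows. The divisor-n version is included only for contrast and is labelled as the population form; function names and data are made up for the example.

```python
import math


def sample_variance(values):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Divisor-n variance, shown for comparison with the sample formula."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]        # mean = 5, sum of squared deviations = 32

s2 = sample_variance(data)             # 32 / 7, about 4.571
s = math.sqrt(s2)                      # about 2.138
print(s2, s)

print(population_variance(data))       # 32 / 8 = 4.0
```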
        
            4. Coefficient of Variation (Relative Dispersion)
            Standard deviations of two datasets cannot be compared directly when the datasets use different units (e.g., heights in cm vs. weights in kg) or have very different means.
            In such cases we use a relative measure, the Coefficient of Variation (CV).
            
                Coefficient of Variation (CV): The ratio of the standard deviation to the mean, expressed as a percentage.
            
            
                CV = (Standard Deviation / Mean) * 100
                
                CV = (s / x̄) * 100
            
            
                - It is a unit-free measure.
- A lower CV means the data is more consistent (less variable).
- A higher CV means the data is less consistent (more variable).
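
            A short sketch of the CV comparison, using the sample standard deviation from the standard library (`statistics.stdev`, divisor n - 1). The two datasets are invented so that they share the same SD but have different means.

```python
import statistics


def coefficient_of_variation(values):
    """CV = (s / mean) * 100, with s the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100


a = [10, 20, 30]       # mean 20,  s = 10
b = [100, 110, 120]    # mean 110, s = 10

print(coefficient_of_variation(a))   # 50.0
print(coefficient_of_variation(b))   # about 9.09
```

            Both datasets have the same standard deviation, yet b is far more consistent relative to its mean, which only the CV reveals.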
        
            5. Moments
            Moments are statistical parameters used to describe the characteristics of a distribution (center, spread, shape).
            
            1. Raw Moments (μ'r) - Moments about Origin (Zero)
            The r-th raw moment is the arithmetic mean of the r-th power of the observations.
            
                μ'r = (Σ x^r) / n
            
            
                - The first raw moment (r=1) is the Arithmetic Mean: μ'1 = x̄
2. Central Moments (μr) - Moments about the Mean
            The r-th central moment is the arithmetic mean of the r-th power of the deviations from the mean.
            
                μr = (Σ (x - x̄)^r) / n
            
            
                - The first central moment (r=1) is always zero: μ1 = 0
- The second central moment (r=2) is the variance computed with divisor n: μ2 = σ² (the sample variance s² uses n - 1 instead)
- The third central moment (r=3), μ3, is used to measure skewness.
- The fourth central moment (r=4), μ4, is used to measure kurtosis.
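
            Raw and central moments can be sketched directly from their definitions (both use divisor n). Function names and data are illustrative assumptions.

```python
def raw_moment(values, r):
    """r-th raw moment: mean of x^r."""
    return sum(x ** r for x in values) / len(values)


def central_moment(values, r):
    """r-th central moment: mean of (x - mean)^r."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


data = [1, 2, 3, 4, 10]            # mean = 4, deviations = -3, -2, -1, 0, 6

print(raw_moment(data, 1))         # 4.0, the arithmetic mean
print(raw_moment(data, 2))         # 130 / 5 = 26.0
print(central_moment(data, 1))     # 0.0
print(central_moment(data, 2))     # 50 / 5 = 10.0
print(central_moment(data, 3))     # 180 / 5 = 36.0
print(central_moment(data, 4))     # 1394 / 5, about 278.8
```

            As a consistency check, the second central moment equals the second raw moment minus the square of the first: 26 - 4² = 10.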
        
            6. Measures of Skewness and Kurtosis
            
            Skewness (Shape - Asymmetry)
            Measures the asymmetry or lack of symmetry of a distribution.
            
                - Symmetrical Distribution: Mean = Median = Mode. Skewness = 0.
- Positively Skewed (Right-skewed): Long tail to the right. Typically Mean > Median > Mode.
- Negatively Skewed (Left-skewed): Long tail to the left. Typically Mean < Median < Mode.
Moment-based Coefficient (β1 and γ1):
            
                β1 = (μ3)² / (μ2)³
                
                γ1 = ±sqrt(β1) = μ3 / (μ2)^1.5  (γ1 takes the sign of μ3)
            
            (If γ1 > 0, positive skew. If γ1 < 0, negative skew. If γ1 = 0, symmetrical)
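
            A minimal sketch of the moment-based skewness coefficients, using divisor-n central moments. The function names and the two small datasets are made up; the second dataset is just the mirror image of the first.

```python
def central_moment(values, r):
    """r-th central moment: mean of (x - mean)^r."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


def skewness(values):
    """Return (beta1, gamma1) computed from the central moments."""
    mu2 = central_moment(values, 2)
    mu3 = central_moment(values, 3)
    beta1 = mu3 ** 2 / mu2 ** 3
    gamma1 = mu3 / mu2 ** 1.5        # keeps the sign of mu3
    return beta1, gamma1


right_tailed = [1, 2, 3, 4, 10]              # mu2 = 10, mu3 = 36
left_tailed = [-x for x in right_tailed]     # mirror image, mu3 = -36

print(skewness(right_tailed))   # beta1 about 1.296, gamma1 about +1.138
print(skewness(left_tailed))    # same beta1, gamma1 about -1.138
```

            β1 is the same for both datasets because it squares μ3, so only γ1 indicates the direction of the skew.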
            
            Kurtosis (Peakedness)
            Measures the peakedness or flatness of a distribution compared to the Normal distribution.
            
                - Leptokurtic: More peaked, sharper peak, and heavier tails.
- Mesokurtic: The "normal" bell shape.
- Platykurtic: Flatter, more rounded peak, and lighter tails.
Measures of Kurtosis (β2 and γ2):
            
                β2 = μ4 / (μ2)²
                
                γ2 = β2 - 3  (This is "excess kurtosis")
            
            
                - If β2 = 3 (or γ2 = 0), it is Mesokurtic.
- If β2 > 3 (or γ2 > 0), it is Leptokurtic.
- If β2 < 3 (or γ2 < 0), it is Platykurtic.
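
            The kurtosis measures and the three-way classification can be sketched as below, using divisor-n central moments. Function names, the tolerance, and both datasets are illustrative assumptions; with real data γ2 is almost never exactly 0, so a tolerance is needed to call anything mesokurtic.

```python
def central_moment(values, r):
    """r-th central moment: mean of (x - mean)^r."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


def kurtosis(values):
    """Return (beta2, gamma2) where gamma2 = beta2 - 3 is the excess kurtosis."""
    mu2 = central_moment(values, 2)
    mu4 = central_moment(values, 4)
    beta2 = mu4 / mu2 ** 2
    return beta2, beta2 - 3


def classify(gamma2, tol=1e-9):
    """Label a distribution from its excess kurtosis (tol is an arbitrary choice)."""
    if abs(gamma2) <= tol:
        return "mesokurtic"
    return "leptokurtic" if gamma2 > 0 else "platykurtic"


flat = [1, 2, 3, 4, 5]                 # mu2 = 2,  mu4 = 6.8  -> beta2 = 1.7
peaked = [-10] + [0] * 8 + [10]        # mu2 = 20, mu4 = 2000 -> beta2 = 5.0

for name, data in (("flat", flat), ("peaked", peaked)):
    beta2, gamma2 = kurtosis(data)
    print(name, beta2, gamma2, classify(gamma2))
```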