Unit 4: Correlation and Regression
        
        
            1. Bivariate Data and Scatter Diagram
            
            Bivariate Data
            Data that involves two different variables, where we are interested in the relationship between them. Each observation consists of a pair of values (x, y).
            
                - Example: Height (x) and Weight (y) of students.
                - Example: Advertising Expenditure (x) and Sales (y) of a product.
            Scatter Diagram (or Scatter Plot)
            The simplest way to visualize bivariate data. It's a graph where each (x, y) pair is plotted as a single point on a 2D plane.
            The pattern of the points helps us identify the type (linear, non-linear) and strength of the relationship.
            
                - Positive Correlation: Points drift upwards from left to right (as x increases, y increases).
                - Negative Correlation: Points drift downwards from left to right (as x increases, y decreases).
                - No Correlation: Points are randomly scattered with no clear pattern.
                - Strength: Tightly packed points mean strong correlation; loosely scattered points mean weak correlation.
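            A minimal sketch of how such a plot can be drawn in Python with matplotlib (the height/weight numbers below are made up for illustration):

                import matplotlib.pyplot as plt

                # Made-up bivariate data: height (cm) and weight (kg) of students.
                height = [150, 155, 160, 165, 170, 175, 180]
                weight = [50, 53, 55, 60, 63, 68, 72]

                plt.scatter(height, weight)          # one point per (x, y) pair
                plt.xlabel("Height (cm)")
                plt.ylabel("Weight (kg)")
                plt.title("Scatter diagram: positive correlation")
                plt.show()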
        
            2. Correlation and Regression (Introduction)
            
            Correlation
            Correlation measures the strength and direction of a *linear* relationship between two quantitative variables. It tells us *if* there is a relationship and *how strong* it is. It results in a single number, the correlation coefficient (r).
            
            Regression
            Regression goes one step further. If a correlation exists, regression describes that relationship with a mathematical equation (a line). This equation can then be used for prediction.
            
            
                Correlation vs. Causation: A strong correlation does NOT imply that one variable *causes* the other. It only means they move together. (e.g., ice cream sales and drowning incidents are correlated, but both are caused by a third variable: hot weather).
            
        
        
        
            3. Karl Pearson's Coefficient of Correlation (r)
            Also known as the "product-moment correlation coefficient." It is a numerical measure of the strength and direction of the linear relationship.
            
            Properties of 'r':
            
                - Range: 'r' always lies between -1 and +1.
                        - r = +1: Perfect positive linear correlation.
                        - r = -1: Perfect negative linear correlation.
                        - r = 0: No linear correlation.
                - Symmetrical: rxy = ryx.
                - Independent of Change of Origin and Scale: adding or subtracting a constant, or multiplying or dividing by a positive constant, leaves 'r' unchanged (multiplying by a negative constant reverses its sign). A quick numerical check follows this list.
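            The check below is a minimal sketch using NumPy's corrcoef (the sample data is made up):

                import numpy as np

                x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
                y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

                r_before = np.corrcoef(x, y)[0, 1]

                # Change of origin (add a constant) and scale (positive multiplier).
                r_after = np.corrcoef(10 * x + 7, 0.5 * y - 3)[0, 1]

                print(r_before, r_after)   # the two values are identical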
            Formula for 'r':
            
                r = [ n(Σxy) - (Σx)(Σy) ] / sqrt[ [n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²] ]
            
            
                Exam Tip: To use this formula, create a table with 5 columns: x, y, x², y², xy. Then, find the sum (Σ) of each column and plug the values into the formula along with 'n' (the number of pairs).
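            The same table-based computation, written as a small Python function; a minimal sketch, with made-up sample data:

                def pearson_r(x, y):
                    """Karl Pearson's r via the raw-sums formula above."""
                    n = len(x)
                    sum_x, sum_y = sum(x), sum(y)
                    sum_x2 = sum(a ** 2 for a in x)
                    sum_y2 = sum(b ** 2 for b in y)
                    sum_xy = sum(a * b for a, b in zip(x, y))
                    num = n * sum_xy - sum_x * sum_y
                    den = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
                    return num / den

                print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))   # ≈ 0.7746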
            
        
        
        
            4. Spearman's Rank Correlation Coefficient (R)
            This coefficient measures the strength of association between two variables when the data is ordinal (ranked).
            
            Procedure:
            
                - Assign ranks (Rx) to the x-values (from 1 to n).
                - Assign ranks (Ry) to the y-values (from 1 to n).
                - Calculate the difference in ranks for each pair: d = Rx - Ry.
                - Calculate Σd².
            Formula (when ranks are not tied):
            
                R = 1 - [ ( 6 * Σd² ) / ( n * (n² - 1) ) ]
            
            Formula (when ranks are tied):
            If two or more items have the same value, assign them the average rank. A correction factor (CF) must be used.
            
                CF = Σ [ m * (m² - 1) / 12 ] 
                
                (where 'm' is the number of times an item is repeated, summed for all ties in both x and y)
                
                Corrected R = 1 - [ ( 6 * (Σd² + CF) ) / ( n * (n² - 1) ) ]
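            The whole procedure, including average ranks for ties and the correction factor, sketched in Python (the helper names and marks data are my own):

                from collections import Counter

                def average_ranks(values):
                    """Rank 1..n; tied items get the average of their ranks."""
                    first = {}
                    for i, v in enumerate(sorted(values), start=1):
                        first.setdefault(v, i)       # smallest rank in each tie group
                    counts = Counter(values)
                    return [first[v] + (counts[v] - 1) / 2 for v in values]

                def spearman_r(x, y):
                    n = len(x)
                    rx, ry = average_ranks(x), average_ranks(y)
                    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
                    # CF: m(m² - 1)/12 for every tie group in x and in y.
                    cf = sum(m * (m ** 2 - 1) / 12
                             for c in (Counter(x), Counter(y))
                             for m in c.values() if m > 1)
                    return 1 - (6 * (sum_d2 + cf)) / (n * (n ** 2 - 1))

                # Note the tie at 75 in x: both items get rank 2.5.
                print(spearman_r([75, 75, 85, 90, 60], [70, 80, 85, 95, 65]))   # 0.95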
            
        
        
        
            5. Regression Lines and Coefficients
            A "line of regression" is the line of best fit for the data. There are two lines:
            1. Regression Line of Y on X
            Used to predict Y given X.
            
            Equation: (Y - ȳ) = byx * (X - x̄)
            
            byx is the regression coefficient of Y on X (the slope).
            
                byx = r * (σy / σx) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]
            
            2. Regression Line of X on Y
            Used to predict X given Y.
            
            Equation: (X - x̄) = bxy * (Y - ȳ)
            
            bxy is the regression coefficient of X on Y.
            
                bxy = r * (σx / σy) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]
            
            Properties of Regression Coefficients:
            
                - The correlation coefficient 'r' is the geometric mean of the two regression coefficients: r² = byx * bxy.
                - 'r', byx, and bxy all have the same sign.
                - If one regression coefficient is numerically greater than 1, the other must be numerically less than 1 (otherwise r² = byx * bxy would exceed 1). A quick check of these properties follows this list.
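            The check below is a minimal Python sketch, reusing the made-up data from the Pearson example above:

                def regression_coefficients(x, y):
                    """byx and bxy from the raw-sums formulas above."""
                    n = len(x)
                    sum_x, sum_y = sum(x), sum(y)
                    sum_xy = sum(a * b for a, b in zip(x, y))
                    sum_x2 = sum(a ** 2 for a in x)
                    sum_y2 = sum(b ** 2 for b in y)
                    num = n * sum_xy - sum_x * sum_y
                    return num / (n * sum_x2 - sum_x ** 2), num / (n * sum_y2 - sum_y ** 2)

                byx, bxy = regression_coefficients([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
                print(byx * bxy)   # 0.6, which equals r² since r ≈ 0.7746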
        
            6. Angle Between Two Regression Lines
            The two regression lines intersect at (x̄, ȳ). The angle (θ) between them gives an idea of the correlation strength.
            
                tan(θ) = [ (1 - r²) / |r| ] * [ (σx * σy) / (σx² + σy²) ]    (θ is the acute angle between the lines)
            
            
            Key Insights:
            
                - If r = 0: tan(θ) = ∞, so θ = 90°. The lines are perpendicular (uncorrelated).
                - If r = +1 or -1: tan(θ) = 0, so θ = 0°. The two lines are coincident (the same line).
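            A small Python helper for the angle, treating r = 0 as the perpendicular case explicitly (a sketch with made-up inputs):

                import math

                def regression_angle(r, sigma_x, sigma_y):
                    """Acute angle (degrees) between the two regression lines."""
                    if r == 0:
                        return 90.0                  # lines are perpendicular
                    tan_theta = ((1 - r ** 2) / abs(r)) * (sigma_x * sigma_y) / (sigma_x ** 2 + sigma_y ** 2)
                    return math.degrees(math.atan(tan_theta))

                print(regression_angle(0.6, 2.0, 3.0))   # ≈ 26.2°
                print(regression_angle(1.0, 2.0, 3.0))   # 0.0 (the lines coincide)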
        
            7. Principle of Least Squares and Fitting a Straight Line
            
            Principle of Least Squares
            This is the method used to find the "best-fit" line (the regression line). It finds the line that minimizes the sum of the squares of the vertical errors (residuals) between the observed data points (y) and the values predicted by the line (ŷ).
            Fitting a Straight Line (y = a + bx)
            To find the parameters 'a' (intercept) and 'b' (slope) for the line of best fit, we use the Principle of Least Squares. This generates two "Normal Equations" which we can solve simultaneously.
            
            
                Normal Equations for a Straight Line:
                
                    - Σy = n*a + b*(Σx)
                    - Σxy = a*(Σx) + b*(Σx²)
                How to solve:
                1. From your data, calculate: n, Σx, Σy, Σxy, Σx²
                2. Plug these 5 values into the two normal equations.
                3. You now have two simultaneous linear equations with two unknowns (a, b). Solve for 'a' and 'b'.
            
            
                Note: The value 'b' found here is identical to the regression coefficient byx.
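            A minimal sketch of the whole procedure in Python; the closed forms for 'b' and 'a' below come from eliminating one unknown between the two normal equations (sample data is made up):

                def fit_line(x, y):
                    """Least-squares fit of y = a + b*x via the normal equations."""
                    n = len(x)
                    sum_x, sum_y = sum(x), sum(y)
                    sum_xy = sum(a * b for a, b in zip(x, y))
                    sum_x2 = sum(a ** 2 for a in x)
                    # Eliminating 'a' between  Σy = n*a + b*Σx  and
                    # Σxy = a*Σx + b*Σx²  gives:
                    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
                    a = (sum_y - b * sum_x) / n
                    return a, b

                a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
                print(a, b)   # a = 2.2, b = 0.6 (b matches byx from section 5)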