Unit 4: Correlation and Regression

Table of Contents

1. Bivariate Data and Scatter Diagram
2. Correlation and Regression (Introduction)
3. Karl Pearson's Coefficient of Correlation (r)
4. Spearman's Rank Correlation Coefficient (R)
5. Regression Lines and Coefficients
6. Angle Between Two Regression Lines
7. Principle of Least Squares and Fitting a Straight Line

Bivariate Data

Data involving two different variables, where we are interested in the relationship between them. Each observation consists of a pair of values (x, y).

Scatter Diagram (or Scatter Plot)

The simplest way to visualize bivariate data. It's a graph where each (x, y) pair is plotted as a single point on a 2D plane.

The pattern of the points helps us identify the type (linear, non-linear) and strength of the relationship.


2. Correlation and Regression (Introduction)

Correlation

Correlation measures the strength and direction of a *linear* relationship between two quantitative variables. It tells us *if* there is a relationship and *how strong* it is. It results in a single number, the correlation coefficient (r).

Regression

Regression goes one step further. If a correlation exists, regression describes that relationship with a mathematical equation (a line). This equation can then be used for prediction.

Correlation vs. Causation: A strong correlation does NOT imply that one variable *causes* the other. It only means they move together. (e.g., ice cream sales and drowning incidents are correlated, but both are caused by a third variable: hot weather).

3. Karl Pearson's Coefficient of Correlation (r)

Also known as the "product-moment correlation coefficient." It is a numerical measure of the strength and direction of the linear relationship.

Properties of 'r':

  1. r always lies between -1 and +1 (-1 ≤ r ≤ +1).
  2. r = +1 indicates a perfect positive linear relationship, r = -1 a perfect negative one, and r = 0 no linear relationship.
  3. r is a pure number (it has no units) and is symmetric: r(x, y) = r(y, x).
  4. r is independent of change of origin and scale.

Formula for 'r':

r = [ n(Σxy) - (Σx)(Σy) ] / sqrt[ [n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²] ]

Exam Tip: To use this formula, create a table with 5 columns: x, y, x², y², xy. Then, find the sum (Σ) of each column and plug the values into the formula along with 'n' (the number of pairs).
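The five-column exam-tip table maps directly onto code. A minimal Python sketch (the function name `pearson_r` is mine, not a standard API):

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's product-moment correlation coefficient.

    Builds the same five sums as the exam-tip table
    (Σx, Σy, Σx², Σy², Σxy) and plugs them into the formula.
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# A perfectly linear example: y = 2x + 1, so r is exactly 1.
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # → 1.0
```

Note the single pass over the raw data: no means or deviations are needed, which is why this form of the formula is convenient for hand calculation.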

4. Spearman's Rank Correlation Coefficient (R)

This coefficient measures the strength of association between two variables when the data is ordinal (ranked), or when quantitative data has been converted to ranks.

Procedure:

  1. Assign ranks (Rx) to the x-values (from 1 to n).
  2. Assign ranks (Ry) to the y-values (from 1 to n).
  3. Calculate the difference in ranks for each pair: d = Rx - Ry.
  4. Calculate Σd².

Formula (when ranks are not tied):

R = 1 - [ ( 6 * Σd² ) / ( n * (n² - 1) ) ]
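The four-step procedure can be sketched in Python for data without ties (function and helper names are illustrative):

```python
def spearman_R(xs, ys):
    """Spearman's rank correlation for data without tied values.

    Follows the four-step procedure: rank each series 1..n,
    take d = Rx - Ry, then apply R = 1 - 6Σd² / (n(n² - 1)).
    """
    n = len(xs)

    def ranks(vals):
        # Rank 1 goes to the largest value here; any consistent
        # direction (largest-first or smallest-first) gives the same R.
        order = sorted(vals, reverse=True)
        return [order.index(v) + 1 for v in vals]

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d2) / (n * (n * n - 1))

# Identical orderings give R = 1; opposite orderings give R = -1.
print(spearman_R([10, 20, 30], [1, 2, 3]))  # → 1.0
```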

Formula (when ranks are tied):

If two or more items have the same value, assign them the average rank. A correction factor (CF) must be used.

CF = Σ [ m * (m² - 1) / 12 ]
(where 'm' is the number of times an item is repeated, summed for all ties in both x and y)

Corrected R = 1 - [ ( 6 * (Σd² + CF) ) / ( n * (n² - 1) ) ]
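With ties, the average-rank rule and the correction factor change the computation. A sketch under the same conventions (`spearman_tied` is an illustrative name):

```python
from collections import Counter

def spearman_tied(xs, ys):
    """Spearman's R with average ranks for ties, using the
    correction factor CF = Σ m(m² - 1)/12 summed over all
    tie groups in both series."""
    n = len(xs)

    def avg_ranks(vals):
        # Average rank of a value = mean of the first and last
        # positions it would occupy in a sorted ordering (1-based).
        s = sorted(vals)
        return [(s.index(v) + 1 + s.index(v) + s.count(v)) / 2
                for v in vals]

    def cf(vals):
        return sum(m * (m * m - 1) / 12
                   for m in Counter(vals).values() if m > 1)

    rx, ry = avg_ranks(xs), avg_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    total = d2 + cf(xs) + cf(ys)
    return 1 - (6 * total) / (n * (n * n - 1))

# The two tied 2's in x each get the average rank (2 + 3) / 2 = 2.5,
# and CF contributes 2(2² - 1)/12 = 0.5 for that tie group.
print(spearman_tied([1, 2, 2, 4], [1, 2, 3, 4]))  # → 0.9
```

When there are no ties, CF = 0 and the function reduces to the plain formula.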

5. Regression Lines and Coefficients

A "line of regression" is the line of best fit for the data. There are two lines:

1. Regression Line of Y on X

Used to predict Y given X.
Equation: (Y - ȳ) = byx * (X - x̄)

byx is the regression coefficient of Y on X (the slope).

byx = r * (σy / σx) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]

2. Regression Line of X on Y

Used to predict X given Y.
Equation: (X - x̄) = bxy * (Y - ȳ)

bxy is the regression coefficient of X on Y.

bxy = r * (σx / σy) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]
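Both coefficients come from the same raw sums as Pearson's r. A small Python sketch (function names are mine) that also illustrates the identity byx * bxy = r²:

```python
def regression_coefficients(xs, ys):
    """byx (slope of Y on X) and bxy (slope of X on Y),
    computed from the same raw sums used for Pearson's r."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy          # shared numerator
    byx = num / (n * sxx - sx * sx)  # slope of Y on X
    bxy = num / (n * syy - sy * sy)  # slope of X on Y
    return byx, bxy

def predict_y(x, xs, ys):
    """Line of Y on X: (Y - ȳ) = byx (X - x̄)."""
    byx, _ = regression_coefficients(xs, ys)
    return sum(ys) / len(ys) + byx * (x - sum(xs) / len(xs))

xs, ys = [1, 2, 3, 4], [2, 4, 5, 9]
byx, bxy = regression_coefficients(xs, ys)
# byx * bxy equals r², so sqrt(byx * bxy) recovers |r|;
# here byx = 44/20 = 2.2 and bxy = 44/104.
```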

Properties of Regression Coefficients:

  1. r is the geometric mean of the two coefficients: byx * bxy = r², so r = ±sqrt(byx * bxy).
  2. Both coefficients have the same sign, which is also the sign of r.
  3. If one coefficient is greater than 1 in absolute value, the other must be less than 1, since byx * bxy = r² ≤ 1.
  4. Regression coefficients are independent of change of origin, but not of change of scale.


6. Angle Between Two Regression Lines

The two regression lines intersect at (x̄, ȳ). The angle (θ) between them gives an idea of the correlation strength.

tan(θ) = | (1 - r²) / r | * [ (σx * σy) / (σx² + σy²) ]
(take the absolute value so that θ is the acute angle between the lines)
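As a quick numerical check of the formula, a small Python sketch (taking the absolute value so θ is the acute angle; `angle_between_lines` is an illustrative name):

```python
import math

def angle_between_lines(r, sigma_x, sigma_y):
    """Acute angle θ (in degrees) between the two regression lines,
    from tan(θ) = |(1 - r²)/r| · σxσy / (σx² + σy²)."""
    if r == 0:
        return 90.0  # lines are perpendicular when r = 0
    t = abs((1 - r * r) / r) * (sigma_x * sigma_y) / (sigma_x ** 2 + sigma_y ** 2)
    return math.degrees(math.atan(t))

print(angle_between_lines(1.0, 2.0, 3.0))  # → 0.0 (lines coincide)
```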

Key Insights:

  1. If r = ±1, tan(θ) = 0, so θ = 0°: the two regression lines coincide (perfect correlation).
  2. As r → 0, tan(θ) → ∞, so θ → 90°: the lines become perpendicular (no linear correlation).
  3. The smaller the angle between the lines, the stronger the correlation.


7. Principle of Least Squares and Fitting a Straight Line

Principle of Least Squares

This is the method used to find the "best-fit" line (the regression line). It finds the line that minimizes the sum of the squares of the vertical errors (residuals) between the observed data points (y) and the values predicted by the line (ŷ).

Fitting a Straight Line (y = a + bx)

To find the parameters 'a' (intercept) and 'b' (slope) for the line of best fit, we use the Principle of Least Squares. This generates two "Normal Equations" which we can solve simultaneously.

Normal Equations for a Straight Line:
  1. Σy = n*a + b*(Σx)
  2. Σxy = a*(Σx) + b*(Σx²)

How to solve:
  1. From your data, calculate: n, Σx, Σy, Σxy, Σx².
  2. Plug these 5 values into the two normal equations.
  3. You now have two simultaneous linear equations in two unknowns (a, b). Solve for 'a' and 'b'.

Note: The value 'b' found here is identical to the regression coefficient byx.
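The whole procedure (build the sums, solve the normal equations) fits in a few lines of Python. `fit_line` is an illustrative name, and solving the 2×2 system by elimination (Cramer's rule) is one of several equivalent ways:

```python
def fit_line(xs, ys):
    """Fit y = a + bx by solving the two normal equations:
        Σy  = n·a + b·Σx
        Σxy = a·Σx + b·Σx²
    via Cramer's rule on the 2×2 system."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx              # determinant of the system
    a = (sy * sxx - sx * sxy) / det      # intercept
    b = (n * sxy - sx * sy) / det        # slope (equals byx)
    return a, b

# Data lying exactly on y = 1 + 2x is recovered perfectly.
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # → (1.0, 2.0)
```

Note that the expression for 'b' is exactly the byx formula from Section 5, confirming the closing remark above.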