Unit 4: Correlation and Regression

Table of Contents

1. Bivariate Data and Scatter Diagram
2. Correlation and Regression (Introduction)
3. Karl Pearson's Coefficient of Correlation (r)
4. Spearman's Rank Correlation Coefficient (R)
5. Regression Lines and Coefficients
6. Angle Between Two Regression Lines
7. Principle of Least Squares and Fitting a Straight Line

Bivariate Data

Data involving two different variables, where we are interested in the relationship between them. Each observation consists of a pair of values (x, y).

Scatter Diagram (or Scatter Plot)

The simplest way to visualize bivariate data. It's a graph where each (x, y) pair is plotted as a single point on a 2D plane.

The pattern of the points helps us identify the type (linear, non-linear) and strength of the relationship.


2. Correlation and Regression (Introduction)

Correlation

Correlation measures the strength and direction of a *linear* relationship between two quantitative variables. It tells us *if* there is a relationship and *how strong* it is. It results in a single number, the correlation coefficient (r).

Regression

Regression goes one step further. If a correlation exists, regression describes that relationship with a mathematical equation (a line). This equation can then be used for prediction.

Correlation vs. Causation: A strong correlation does NOT imply that one variable *causes* the other. It only means they move together. (e.g., ice cream sales and drowning incidents are correlated, but both are caused by a third variable: hot weather).

3. Karl Pearson's Coefficient of Correlation (r)

Also known as the "product-moment correlation coefficient." It is a numerical measure of the strength and direction of the linear relationship.

Properties of 'r':

  1. r always lies between -1 and +1 (-1 ≤ r ≤ +1).
  2. r = +1 indicates a perfect positive linear relationship, r = -1 a perfect negative one, and r = 0 no linear relationship.
  3. r is a pure number (it has no units) and is symmetric: r(x, y) = r(y, x).
  4. r is independent of change of origin and scale.

Formula for 'r':

r = [ n(Σxy) - (Σx)(Σy) ] / sqrt[ [n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²] ]

Exam Tip: To use this formula, create a table with 5 columns: x, y, x², y², xy. Then, find the sum (Σ) of each column and plug the values into the formula along with 'n' (the number of pairs).
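The five-column exam-tip table maps directly onto code. A minimal Python sketch (the function name `pearson_r` is mine, not a standard API):

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's product-moment correlation coefficient.

    Builds the same five sums as the exam-tip table
    (Σx, Σy, Σx², Σy², Σxy) and plugs them into the formula.
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# A perfectly linear example: y = 2x + 1, so r is exactly 1.
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # → 1.0
```

Note the single pass over the raw data: no means or deviations are needed, which is why this form of the formula is convenient for hand calculation.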

4. Spearman's Rank Correlation Coefficient (R)

This coefficient measures the strength of association between two variables when the data is ordinal (ranked), or when quantitative data has been converted to ranks.

Procedure:

  1. Assign ranks (Rx) to the x-values (from 1 to n).
  2. Assign ranks (Ry) to the y-values (from 1 to n).
  3. Calculate the difference in ranks for each pair: d = Rx - Ry.
  4. Calculate Σd².

Formula (when ranks are not tied):

R = 1 - [ ( 6 * Σd² ) / ( n * (n² - 1) ) ]
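The four-step procedure can be sketched in Python for data without ties (function and helper names are illustrative):

```python
def spearman_R(xs, ys):
    """Spearman's rank correlation for data without tied values.

    Follows the four-step procedure: rank each series 1..n,
    take d = Rx - Ry, then apply R = 1 - 6Σd² / (n(n² - 1)).
    """
    n = len(xs)

    def ranks(vals):
        # Rank 1 goes to the largest value here; any consistent
        # direction (largest-first or smallest-first) gives the same R.
        order = sorted(vals, reverse=True)
        return [order.index(v) + 1 for v in vals]

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d2) / (n * (n * n - 1))

# Identical orderings give R = 1; opposite orderings give R = -1.
print(spearman_R([10, 20, 30], [1, 2, 3]))  # → 1.0
```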

Formula (when ranks are tied):

If two or more items have the same value, assign them the average rank. A correction factor (CF) must be used.

CF = Σ [ m * (m² - 1) / 12 ]
(where 'm' is the number of times an item is repeated, summed for all ties in both x and y)

Corrected R = 1 - [ ( 6 * (Σd² + CF) ) / ( n * (n² - 1) ) ]
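With ties, the average-rank rule and the correction factor change the computation. A sketch under the same conventions (`spearman_tied` is an illustrative name):

```python
from collections import Counter

def spearman_tied(xs, ys):
    """Spearman's R with average ranks for ties, using the
    correction factor CF = Σ m(m² - 1)/12 summed over all
    tie groups in both series."""
    n = len(xs)

    def avg_ranks(vals):
        # Average rank of a value = mean of the first and last
        # positions it would occupy in a sorted ordering (1-based).
        s = sorted(vals)
        return [(s.index(v) + 1 + s.index(v) + s.count(v)) / 2
                for v in vals]

    def cf(vals):
        return sum(m * (m * m - 1) / 12
                   for m in Counter(vals).values() if m > 1)

    rx, ry = avg_ranks(xs), avg_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    total = d2 + cf(xs) + cf(ys)
    return 1 - (6 * total) / (n * (n * n - 1))

# The two tied 2's in x each get the average rank (2 + 3) / 2 = 2.5,
# and CF contributes 2(2² - 1)/12 = 0.5 for that tie group.
print(spearman_tied([1, 2, 2, 4], [1, 2, 3, 4]))  # → 0.9
```

When there are no ties, CF = 0 and the function reduces to the plain formula.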

5. Regression Lines and Coefficients

A "line of regression" is the line of best fit for the data. There are two lines:

1. Regression Line of Y on X

Used to predict Y given X.
Equation: (Y - ȳ) = byx * (X - x̄)

byx is the regression coefficient of Y on X (the slope).

byx = r * (σy / σx) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]

2. Regression Line of X on Y

Used to predict X given Y.
Equation: (X - x̄) = bxy * (Y - ȳ)

bxy is the regression coefficient of X on Y.

bxy = r * (σx / σy) = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]
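Both coefficients come from the same raw sums as Pearson's r. A small Python sketch (function names are mine) that also illustrates the identity byx * bxy = r²:

```python
def regression_coefficients(xs, ys):
    """byx (slope of Y on X) and bxy (slope of X on Y),
    computed from the same raw sums used for Pearson's r."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy          # shared numerator
    byx = num / (n * sxx - sx * sx)  # slope of Y on X
    bxy = num / (n * syy - sy * sy)  # slope of X on Y
    return byx, bxy

def predict_y(x, xs, ys):
    """Line of Y on X: (Y - ȳ) = byx (X - x̄)."""
    byx, _ = regression_coefficients(xs, ys)
    return sum(ys) / len(ys) + byx * (x - sum(xs) / len(xs))

xs, ys = [1, 2, 3, 4], [2, 4, 5, 9]
byx, bxy = regression_coefficients(xs, ys)
# byx * bxy equals r², so sqrt(byx * bxy) recovers |r|;
# here byx = 44/20 = 2.2 and bxy = 44/104.
```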

Properties of Regression Coefficients:

  1. r is the geometric mean of the two coefficients: byx * bxy = r², so r = ±sqrt(byx * bxy).
  2. Both coefficients have the same sign, which is also the sign of r.
  3. If one coefficient is greater than 1 in absolute value, the other must be less than 1, since byx * bxy = r² ≤ 1.
  4. Regression coefficients are independent of change of origin, but not of change of scale.


6. Angle Between Two Regression Lines

The two regression lines intersect at (x̄, ȳ). The angle (θ) between them gives an idea of the correlation strength.

tan(θ) = | (1 - r²) / r | * [ (σx * σy) / (σx² + σy²) ]
(take the absolute value so that θ is the acute angle between the lines)
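As a quick numerical check of the formula, a small Python sketch (taking the absolute value so θ is the acute angle; `angle_between_lines` is an illustrative name):

```python
import math

def angle_between_lines(r, sigma_x, sigma_y):
    """Acute angle θ (in degrees) between the two regression lines,
    from tan(θ) = |(1 - r²)/r| · σxσy / (σx² + σy²)."""
    if r == 0:
        return 90.0  # lines are perpendicular when r = 0
    t = abs((1 - r * r) / r) * (sigma_x * sigma_y) / (sigma_x ** 2 + sigma_y ** 2)
    return math.degrees(math.atan(t))

print(angle_between_lines(1.0, 2.0, 3.0))  # → 0.0 (lines coincide)
```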

Key Insights:

  1. If r = ±1, tan(θ) = 0, so θ = 0°: the two regression lines coincide (perfect correlation).
  2. As r → 0, tan(θ) → ∞, so θ → 90°: the lines become perpendicular (no linear correlation).
  3. The smaller the angle between the lines, the stronger the correlation.


7. Principle of Least Squares and Fitting a Straight Line

Principle of Least Squares

This is the method used to find the "best-fit" line (the regression line). It finds the line that minimizes the sum of the squares of the vertical errors (residuals) between the observed data points (y) and the values predicted by the line (ŷ).

Fitting a Straight Line (y = a + bx)

To find the parameters 'a' (intercept) and 'b' (slope) for the line of best fit, we use the Principle of Least Squares. This generates two "Normal Equations" which we can solve simultaneously.

Normal Equations for a Straight Line:
  1. Σy = n*a + b*(Σx)
  2. Σxy = a*(Σx) + b*(Σx²)

How to solve:
  1. From your data, calculate: n, Σx, Σy, Σxy, Σx².
  2. Plug these 5 values into the two normal equations.
  3. You now have two simultaneous linear equations in two unknowns (a, b). Solve for 'a' and 'b'.

Note: The value 'b' found here is identical to the regression coefficient byx.
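The whole procedure (build the sums, solve the normal equations) fits in a few lines of Python. `fit_line` is an illustrative name, and solving the 2×2 system by elimination (Cramer's rule) is one of several equivalent ways:

```python
def fit_line(xs, ys):
    """Fit y = a + bx by solving the two normal equations:
        Σy  = n·a + b·Σx
        Σxy = a·Σx + b·Σx²
    via Cramer's rule on the 2×2 system."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx              # determinant of the system
    a = (sy * sxx - sx * sxy) / det      # intercept
    b = (n * sxy - sx * sy) / det        # slope (equals byx)
    return a, b

# Data lying exactly on y = 1 + 2x is recovered perfectly.
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # → (1.0, 2.0)
```

Note that the expression for 'b' is exactly the byx formula from Section 5, confirming the closing remark above.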