Unit 4: Correlation and Regression

1. Bivariate Data and Scatter Diagram

Bivariate Data

Data that involves two different variables, where we are interested in the relationship between them. Each observation consists of a pair of values (x, y).

Scatter Diagram (or Scatter Plot)

The simplest way to visualize bivariate data. It's a graph where each (x, y) pair is plotted as a single point on a 2D plane.

The pattern of the points helps us identify the type (linear, non-linear) and strength of the relationship.


2. Karl Pearson's Coefficient of Correlation (r)

Also known as the "product-moment correlation coefficient." It is a numerical measure of the strength and direction of the linear relationship between two quantitative variables.

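Properties of 'r':

- Range: -1 ≤ r ≤ +1.
- Sign: r > 0 means y tends to increase as x increases; r < 0 means y tends to decrease as x increases; r = 0 means no linear relationship (a non-linear one may still exist).
- Magnitude: The closer |r| is to 1, the stronger the linear relationship. r = ±1 means all points lie exactly on a straight line.
- Symmetry: The correlation of x with y equals the correlation of y with x.
- Independence: 'r' is independent of change of origin and change of scale, and it has no units.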

Formulas for 'r':

1. Covariance Method

r = Cov(x, y) / (σx * σy)

Where:
- Cov(x, y) = ( Σ[(x - x̄)(y - ȳ)] ) / n (Covariance of x and y)
- σx = sqrt( (Σ(x - x̄)²) / n ) (Standard deviation of x)
- σy = sqrt( (Σ(y - ȳ)²) / n ) (Standard deviation of y)
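As a rough illustration, the covariance method can be sketched in Python as below. The function name `pearson_r_cov` and the five-point data set are made up for this example and do not come from the unit.

```python
import math

def pearson_r_cov(xs, ys):
    """Pearson's r via r = Cov(x, y) / (sd_x * sd_y), dividing by n throughout."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (sd_x * sd_y)

# Hypothetical sample data (illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r_cov(xs, ys), 4))  # 0.7746
```

Hand check for this data: Cov(x, y) = 6/5 = 1.2, σx² = 2, σy² = 1.2, so r = 1.2 / sqrt(2.4) ≈ 0.7746.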

2. Raw Data (Computational) Formula

This is the most practical formula for calculations.

r = [ n(Σxy) - (Σx)(Σy) ] / sqrt[ [n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²] ]
Exam Tip: To use this formula, create a table with 5 columns: x, y, x², y², xy. Then, find the sum (Σ) of each column and plug the values into the formula along with 'n' (the number of pairs).
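The five-column table from the tip can be mimicked in a few lines of Python. This is only a sketch: `pearson_r_raw` is a hypothetical helper name and the data is invented for illustration.

```python
import math

def pearson_r_raw(xs, ys):
    """Pearson's r from the column sums: n, Σx, Σy, Σx², Σy², Σxy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical sample data (illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Working table with the five columns: x, y, x^2, y^2, xy
print(f"{'x':>4}{'y':>4}{'x^2':>5}{'y^2':>5}{'xy':>5}")
for x, y in zip(xs, ys):
    print(f"{x:>4}{y:>4}{x * x:>5}{y * y:>5}{x * y:>5}")

# Column sums here: Σx = 15, Σy = 20, Σx² = 55, Σy² = 86, Σxy = 66
print(round(pearson_r_raw(xs, ys), 4))  # 0.7746
```

With these sums, r = (5*66 - 15*20) / sqrt((5*55 - 15²) * (5*86 - 20²)) = 30 / sqrt(50 * 30) ≈ 0.7746.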

3. Spearman's Rank Correlation Coefficient (ρ or R)

This coefficient measures the strength and direction of the monotonic relationship between two variables. It is used when:

  1. The data is ordinal (ranked), like "best," "second best."
  2. The data is quantitative, but we suspect outliers or a non-linear (but still monotonic) relationship.

It is essentially Pearson's 'r' calculated on the *ranks* of the data, not the values themselves.

Formula (when ranks are not tied):

R = 1 - [ ( 6 * Σd² ) / ( n * (n² - 1) ) ]

Where:
- d = Difference between the ranks of a pair: Rx - Ry
- n = Number of pairs of observations
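A minimal Python sketch of the no-ties formula follows. The helper names and the marks are hypothetical, and rank 1 is assumed to go to the largest value (ranking from the smallest instead gives the same R, as long as both variables are ranked the same way).

```python
def ranks_no_ties(values):
    """Rank 1 = largest value. Assumes every value is distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """R = 1 - 6*Σd² / (n*(n² - 1)), valid only when there are no tied ranks."""
    n = len(xs)
    rank_x = ranks_no_ties(xs)
    rank_y = ranks_no_ties(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Hypothetical marks given by two judges to five contestants
judge_x = [90, 80, 70, 60, 50]   # ranks: 1, 2, 3, 4, 5
judge_y = [75, 85, 55, 65, 45]   # ranks: 2, 1, 4, 3, 5
print(round(spearman_no_ties(judge_x, judge_y), 4))  # 0.8
```

Here d = -1, 1, -1, 1, 0, so Σd² = 4 and R = 1 - 24/120 = 0.8.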

Formula (when ranks are tied):

If two or more items have the same value, we assign them the average rank. (e.g., if 3 items are tied for 5th, they all get rank (5+6+7)/3 = 6).

When ties occur, the formula must be corrected:

R = 1 - [ ( 6 * (Σd² + CF) ) / ( n * (n² - 1) ) ]

Where CF is the Correction Factor:
CF = Σ [ m * (m² - 1) / 12 ]
- 'm' is the number of times an item is repeated (tied). You sum this for *all* tied groups in *both* x and y.
Example of CF: If in x, one value repeats 2 times (m=2) and in y, one value repeats 3 times (m=3):
CF = [ 2*(2²-1)/12 ] + [ 3*(3²-1)/12 ] = [ 2*3/12 ] + [ 3*8/12 ] = 0.5 + 2 = 2.5
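The tied-rank procedure (average ranks plus the correction factor) might be sketched in Python as below. All function names and the data are hypothetical; the data was chosen so that x has one pair of tied values (m = 2) and y has one triple (m = 3), matching the CF example above.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of their positions."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1       # first position occupied by v
        count = order.count(v)           # how many times v occurs (m)
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

def correction_factor(values):
    """Σ m*(m² - 1)/12 over the groups of equal values (untied values add 0)."""
    return sum(m * (m ** 2 - 1) / 12 for m in Counter(values).values())

def spearman_tied(xs, ys):
    """R = 1 - 6*(Σd² + CF) / (n*(n² - 1)), with CF summed over both x and y."""
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2
                 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n ** 2 - 1))

# Hypothetical data: 60 appears twice in x, 40 appears three times in y
xs = [50, 60, 60, 70, 80]   # ranks: 5, 3.5, 3.5, 2, 1
ys = [30, 40, 40, 40, 50]   # ranks: 5, 3, 3, 3, 1
print(correction_factor(xs) + correction_factor(ys))  # 2.5
print(round(spearman_tied(xs, ys), 4))                # 0.8
```

Here Σd² = 0.25 + 0.25 + 1 = 1.5 and CF = 2.5, so R = 1 - 6*4/120 = 0.8.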

4. Regression: Lines of Regression

If correlation shows a relationship exists, regression describes that relationship with an equation. This equation can be used for prediction.

A "line of regression" is the line of best fit for the data (found using the Principle of Least Squares from Unit 3). In bivariate analysis, there are two regression lines.

1. Regression Line of Y on X

This line is used to predict the value of Y, given a value of X.

Equation: (Y - ȳ) = byx * (X - x̄)

Here, byx is the regression coefficient of Y on X (the slope).

byx = Cov(x, y) / σx²
byx = r * (σy / σx)
byx = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ] (Note: The numerator is the same as in 'r'; the denominator is the x-factor from the denominator of 'r', without the sqrt.)

2. Regression Line of X on Y

This line is used to predict the value of X, given a value of Y.

Equation: (X - x̄) = bxy * (Y - ȳ)

Here, bxy is the regression coefficient of X on Y.

bxy = Cov(x, y) / σy²
bxy = r * (σx / σy)
bxy = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]
Note: Both regression lines pass through the point of means (x̄, ȳ), so this point is where they intersect.
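To make the two lines concrete, here is a small Python sketch that computes both regression coefficients from the column sums and uses each line for prediction. The function names and the data are assumptions made for illustration only.

```python
def regression_coefficients(xs, ys):
    """Return (byx, bxy) using the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    byx = numerator / (n * sum_x2 - sum_x ** 2)
    bxy = numerator / (n * sum_y2 - sum_y ** 2)
    return byx, bxy

def predict_y(x_value, xs, ys):
    """Y on X: Y = ȳ + byx * (X - x̄)."""
    byx, _ = regression_coefficients(xs, ys)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return mean_y + byx * (x_value - mean_x)

def predict_x(y_value, xs, ys):
    """X on Y: X = x̄ + bxy * (Y - ȳ)."""
    _, bxy = regression_coefficients(xs, ys)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return mean_x + bxy * (y_value - mean_y)

# Hypothetical sample data (illustration only); x̄ = 3, ȳ = 4
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
byx, bxy = regression_coefficients(xs, ys)
print(round(byx, 4), round(bxy, 4))      # 0.6 1.0
print(round(predict_y(6, xs, ys), 2))    # 5.8  (predict Y when X = 6)
print(round(predict_x(5, xs, ys), 2))    # 4.0  (predict X when Y = 5)
```

Note that each prediction uses its own line: the Y on X line is not simply rearranged to predict X.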

5. Properties of Regression Coefficients

These properties are extremely important for exam questions.

  1. Geometric Mean: The correlation coefficient 'r' is the geometric mean of the two regression coefficients (byx and bxy).
    r² = byx * bxy => r = ± sqrt(byx * bxy)
  2. Sign: 'r', byx, and bxy all have the same sign. If byx is positive and bxy is positive, 'r' must be positive.
  3. Magnitude: If one regression coefficient is greater than 1 (numerically), the other *must* be less than 1 (numerically). (Their product, r², cannot exceed 1).
  4. Independence: Regression coefficients are *not* independent of change of scale, but they *are* independent of change of origin. (This is different from 'r'!).
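Property 4 can be checked numerically. The sketch below is illustrative only: the helper name, the data, and the particular shift and scale values are all made up. It shifts the origin, then rescales x, and compares the coefficients.

```python
def regression_coefficients(xs, ys):
    """Return (byx, bxy) using the raw-sum formulas."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    byx = numerator / (n * sum(x * x for x in xs) - sum(xs) ** 2)
    bxy = numerator / (n * sum(y * y for y in ys) - sum(ys) ** 2)
    return byx, bxy

# Hypothetical sample data (illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

original = regression_coefficients(xs, ys)

# Change of origin: add a constant to every x and every y
shifted = regression_coefficients([x + 100 for x in xs], [y + 50 for y in ys])

# Change of scale: multiply every x by 10
scaled = regression_coefficients([10 * x for x in xs], ys)

print([round(b, 4) for b in original])  # [0.6, 1.0]
print([round(b, 4) for b in shifted])   # [0.6, 1.0]   unchanged by the shift
print([round(b, 4) for b in scaled])    # [0.06, 10.0] changed by the scaling
```

In the scaled case byx is divided by 10 and bxy is multiplied by 10, so their product (r² = 0.6) stays the same, which is why 'r' is unaffected by the change of scale.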
Exam Tip: A classic question: "The two regression coefficients are 1.5 and 0.8. Find 'r'."
- Answer: r² = 1.5 * 0.8 = 1.2. This is impossible, as r² cannot be > 1. The data is inconsistent.
Another: "The two regression coefficients are -0.9 and -0.4. Find 'r'."
- Answer: r² = (-0.9) * (-0.4) = 0.36.
- r = ± sqrt(0.36) = ±0.6.
- Since both coefficients are negative, 'r' must also be negative.
- r = -0.6
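The reasoning in these two exam questions can be wrapped in a short Python check. This is a sketch only; `r_from_coefficients` is a hypothetical helper, not a standard library function.

```python
import math

def r_from_coefficients(byx, bxy):
    """Recover r from the two regression coefficients, or raise if inconsistent."""
    if byx * bxy < 0:
        raise ValueError("byx and bxy must have the same sign")
    r_squared = byx * bxy
    if r_squared > 1:
        raise ValueError(f"r² = {r_squared:.2f} > 1, so the data is inconsistent")
    r = math.sqrt(r_squared)
    # r takes the common sign of the two coefficients
    return -r if byx < 0 else r

print(round(r_from_coefficients(-0.9, -0.4), 4))  # -0.6

try:
    r_from_coefficients(1.5, 0.8)
except ValueError as err:
    print("Rejected:", err)
```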

6. Angle Between Two Regression Lines

The two regression lines (Y on X, X on Y) intersect at the point (x̄, ȳ). The angle (θ) between them gives an idea of the correlation strength.

tan(θ) = [ (1 - r²) / |r| ] * [ (σx * σy) / (σx² + σy²) ]   (acute angle, r ≠ 0)

An equivalent form using the slopes of the two lines in the X-Y plane (m1 = byx for Y on X, m2 = 1/bxy for X on Y):
tan(θ) = | (m1 - m2) / (1 + m1 * m2) |
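As a quick numerical check that the two forms agree, here is a Python sketch. The function names and input values are hypothetical; the first function uses |r| so that it returns the acute angle, and it assumes r ≠ 0 and bxy ≠ 0.

```python
import math

def tan_theta_from_r(r, sd_x, sd_y):
    """tan(θ) = ((1 - r²)/|r|) * (σx*σy / (σx² + σy²)); requires r != 0."""
    return (1 - r ** 2) / abs(r) * (sd_x * sd_y) / (sd_x ** 2 + sd_y ** 2)

def tan_theta_from_slopes(byx, bxy):
    """tan(θ) = |(m1 - m2) / (1 + m1*m2)| with m1 = byx, m2 = 1/bxy."""
    m1 = byx
    m2 = 1 / bxy
    return abs((m1 - m2) / (1 + m1 * m2))

# Hypothetical summary values: σx² = 2, σy² = 1.2, byx = 0.6, bxy = 1.0,
# so r = sqrt(byx * bxy) = sqrt(0.6)
r = math.sqrt(0.6)
sd_x = math.sqrt(2)
sd_y = math.sqrt(1.2)

t1 = tan_theta_from_r(r, sd_x, sd_y)
t2 = tan_theta_from_slopes(0.6, 1.0)
print(round(t1, 4), round(t2, 4))             # 0.25 0.25
print(round(math.degrees(math.atan(t1)), 2))  # 14.04 (angle in degrees)
```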

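Key Insights:

- If r = ±1 (perfect correlation), tan(θ) = 0, so θ = 0°: the two regression lines coincide.
- If r = 0 (no linear correlation), tan(θ) is undefined and θ = 90°: the lines are perpendicular, namely Y = ȳ and X = x̄.
- The stronger the correlation, the smaller the angle between the lines; the weaker the correlation, the wider the angle.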


7. Coefficient of Determination (r²)

Coefficient of Determination (r²): The square of the correlation coefficient (r). It represents the proportion of the total variance in the dependent variable (Y) that is explained or accounted for by the independent variable (X).

Coefficient of Non-Determination (k²): This is the unexplained portion.
k² = 1 - r²
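The "proportion of variance explained" reading can be verified directly: fit the Y on X line, then compare the leftover (residual) variation with the total variation in Y. The Python sketch below does this on made-up data; the function name is hypothetical.

```python
def coefficient_of_determination(xs, ys):
    """r² computed as 1 - (residual sum of squares / total sum of squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    byx = s_xy / s_xx
    # Predicted Y values from the regression line of Y on X
    predicted = [mean_y + byx * (x - mean_x) for x in xs]
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return 1 - ss_residual / ss_total

# Hypothetical sample data (illustration only); Pearson's r here is sqrt(0.6)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_squared = coefficient_of_determination(xs, ys)
print(round(r_squared, 4))      # 0.6  (60% of the variance in Y is explained)
print(round(1 - r_squared, 4))  # 0.4  (k², the unexplained 40%)
```

For this data the total sum of squares is 6 and the residual sum of squares is 2.4, so r² = 1 - 2.4/6 = 0.6.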


8. Other Correlation Concepts

Concept: Intra-class Correlation Coefficient

This coefficient measures the correlation *within* a class or group. It is used when you have data in groups (e.g., test scores of siblings in different families) and you want to see how similar items *within* the same group are, compared to items from different groups.

It assesses the homogeneity within groups. A high intra-class correlation means items in the same group are very similar.

Concept: Correlation Ratio (η²)

Pearson's 'r' only measures linear relationships. What if the relationship is a strong curve (e.g., a U-shape)? Pearson's 'r' might be 0, which is misleading.

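The Correlation Ratio (η², "eta-squared") is a measure of association that can detect non-linear relationships. It is never negative (0 ≤ η² ≤ 1). It is the proportion of the total variance in Y that is explained by the mean of Y at each value (or class) of X: η² = 1 - (within-group variance / total variance). η² is never smaller than r², and the two are equal when the group means of Y lie on a straight line.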
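The U-shape case can be illustrated with a small Python sketch. The data is invented so that each x value has two y observations, and η² is computed as the between-group sum of squares divided by the total sum of squares (groups are the distinct x values). Function names are hypothetical.

```python
import math
from collections import defaultdict

def pearson_r_raw(xs, ys):
    """Pearson's r from the raw-sum formula."""
    n = len(xs)
    numerator = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    denominator = math.sqrt((n * sum(x * x for x in xs) - sum(xs) ** 2)
                            * (n * sum(y * y for y in ys) - sum(ys) ** 2))
    return numerator / denominator

def correlation_ratio_squared(xs, ys):
    """η² of Y on X: variation of the group means of Y over total variation of Y."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    grand_mean = sum(ys) / len(ys)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    return ss_between / ss_total

# Hypothetical U-shaped data: y is roughly x², two observations per x value
xs = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
ys = [4, 5, 1, 2, 0, 1, 1, 2, 4, 5]
print(round(pearson_r_raw(xs, ys), 4))              # 0.0
print(round(correlation_ratio_squared(xs, ys), 4))  # 0.918
```

Pearson's 'r' reports no linear relationship, while η² = 28/30.5 ≈ 0.918 shows that x accounts for most of the variation in y.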