Unit 4: Correlation and Regression

1. Bivariate Data and Scatter Diagram
2. Karl Pearson's Coefficient of Correlation (r)
3. Spearman's Rank Correlation Coefficient
4. Regression: Lines of Regression
5. Properties of Regression Coefficients
6. Angle Between Two Regression Lines
7. Coefficient of Determination (r²)
8. Other Correlation Concepts

1. Bivariate Data and Scatter Diagram

Bivariate Data

Data that involves two different variables, where we are interested in the relationship between them. Each observation consists of a pair of values (x, y).

Example: Height (x) and Weight (y) of students.
Example: Advertising Expenditure (x) and Sales (y) of a product.

Scatter Diagram (or Scatter Plot)

The simplest way to visualize bivariate data. It's a graph where each (x, y) pair is plotted as a single point on a 2D plane.

The pattern of the points helps us identify the type (linear, non-linear) and strength of the relationship.

Perfect Positive Correlation: Points form a perfect straight line sloping upwards.
High Positive Correlation: Points are tightly packed around a straight line sloping upwards.
Low Positive Correlation: Points are loosely scattered around a line sloping upwards.
Perfect Negative Correlation: Points form a perfect straight line sloping downwards.
High/Low Negative Correlation: Points are tightly (high) or loosely (low) scattered around a straight line sloping downwards.
No Correlation (Zero Correlation): Points are randomly scattered with no clear pattern.

2. Karl Pearson's Coefficient of Correlation (r)

Also known as the "product-moment correlation coefficient." It is a numerical measure of the strength and direction of the linear relationship between two quantitative variables.

Properties of 'r':

Range: 'r' always lies between -1 and +1.
- r = +1: Perfect positive linear correlation.
- r = -1: Perfect negative linear correlation.
- r = 0: No linear correlation (there might still be a non-linear relationship).
Symmetrical: The correlation between x and y is the same as between y and x. (r_xy = r_yx).
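Independent of Change of Origin and Scale: If you add/subtract a constant to x or y (change of origin) or multiply/divide x or y by a positive constant (change of scale), 'r' does not change. If exactly one of the two variables is multiplied by a negative constant, only the sign of 'r' flips; its magnitude stays the same. This is a very important property.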
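The symmetry and change-of-origin/scale properties can be checked numerically. The sketch below is only an illustration: the height and weight figures are made up, and pearson_r is a helper name chosen here that computes the product-moment coefficient from deviations about the means.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and weights (kg) of five students
heights = [150, 160, 165, 170, 180]
weights = [52, 58, 60, 67, 75]

r_original = pearson_r(heights, weights)

# Symmetry: swapping the roles of x and y
r_swapped = pearson_r(weights, heights)

# Change of origin and (positive) scale: u = (x - 165) / 5, v = (y - 60) / 2
u = [(h - 165) / 5 for h in heights]
v = [(w - 60) / 2 for w in weights]
r_coded = pearson_r(u, v)

# Negative scale factor on one variable: same magnitude, opposite sign
r_flipped = pearson_r([-h for h in heights], weights)
```

Here r_original, r_swapped and r_coded agree up to rounding, while r_flipped has the same size with the opposite sign.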

Formulas for 'r':

1. Covariance Method

r = Cov(x, y) / (σ_x * σ_y)

Where:
- Cov(x, y) = ( Σ[(x - x̄)(y - ȳ)] ) / n (Covariance of x and y)
- σ_x = sqrt( (Σ(x - x̄)²) / n ) (Standard deviation of x)
- σ_y = sqrt( (Σ(y - ȳ)²) / n ) (Standard deviation of y)
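As a rough sketch of the covariance method, the function below follows the three definitions literally (dividing by n throughout). The data set and the function name pearson_r_cov are illustrative assumptions, not part of the notes.

```python
import math

def pearson_r_cov(xs, ys):
    """r = Cov(x, y) / (sigma_x * sigma_y), with every sum divided by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (sigma_x * sigma_y)

# Hypothetical data: x-bar = 3, y-bar = 4, Cov = 1.2, sigma_x^2 = 2, sigma_y^2 = 1.2
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r_cov(xs, ys)  # 1.2 / sqrt(2 * 1.2), roughly 0.7746
```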

2. Raw Data (Computational) Formula

This is the most practical formula for calculations.

r = [ n(Σxy) - (Σx)(Σy) ] / sqrt[ [n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²] ]

Exam Tip: To use this formula, create a table with 5 columns: x, y, x², y², xy. Then, find the sum (Σ) of each column and plug the values into the formula along with 'n' (the number of pairs).
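The five-column table from the exam tip can be mirrored directly in code. This is a minimal sketch on made-up data; the variable names are chosen here for illustration.

```python
import math

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# One row per pair, with columns: x, y, x^2, y^2, xy
table = [(x, y, x * x, y * y, x * y) for x, y in zip(xs, ys)]

# Column totals: sum(x), sum(y), sum(x^2), sum(y^2), sum(xy)
sum_x, sum_y, sum_x2, sum_y2, sum_xy = (sum(col) for col in zip(*table))
n = len(table)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
```

For this data the totals are 15, 20, 55, 86 and 66, so r = 30 / sqrt(50 * 30), roughly 0.7746, the same value the covariance method gives for the same pairs.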

3. Spearman's Rank Correlation Coefficient (ρ or R)

This coefficient measures the strength and direction of the monotonic relationship between two variables. It is used when:

The data is ordinal (ranked), like "best," "second best."
The data is quantitative, but we suspect outliers or a non-linear (but still monotonic) relationship.

It is essentially Pearson's 'r' calculated on the *ranks* of the data, not the values themselves.

Formula (when ranks are not tied):

R = 1 - [ ( 6 * Σd² ) / ( n * (n² - 1) ) ]

Where:
- d = Difference between the ranks of a pair: R_x - R_y
- n = Number of pairs of observations
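A short sketch of the untied formula, assuming the ranks are already given (here, hypothetical rankings of five contestants by two judges). The function name is illustrative.

```python
def spearman_no_ties(rank_x, rank_y):
    """R = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when no ranks are tied."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

# Hypothetical ranks awarded by two judges
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

# d = -1, 1, -1, 1, 0 so sum(d^2) = 4 and R = 1 - 24/120 = 0.8
rho = spearman_no_ties(judge_a, judge_b)
```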

Formula (when ranks are tied):

If two or more items have the same value, we assign them the average rank. (e.g., if 3 items are tied for 5th, they all get rank (5+6+7)/3 = 6).

When ties occur, the formula must be corrected:

R = 1 - [ ( 6 * (Σd² + CF) ) / ( n * (n² - 1) ) ]

Where CF is the Correction Factor:
CF = Σ [ m * (m² - 1) / 12 ]
- 'm' is the number of times an item is repeated (tied). You sum this for *all* tied groups in *both* x and y.

Example of CF: If in x, one value repeats 2 times (m=2) and in y, one value repeats 3 times (m=3):
CF = [ 2*(2²-1)/12 ] + [ 3*(3²-1)/12 ] = [ 2*3/12 ] + [ 3*8/12 ] = 0.5 + 2 = 2.5
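The tied-rank procedure can be sketched as follows. The helper names and the data are assumptions made for illustration, and ranking from the largest value (rank 1) downwards is one common convention, not something the notes prescribe. The data is chosen so that x has one value repeated twice and y has one value repeated three times, reproducing CF = 2.5.

```python
from collections import Counter

def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the average rank."""
    positions = {}
    for position, value in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def correction_factor(values):
    """Sum of m * (m^2 - 1) / 12 over every group of m tied values."""
    return sum(m * (m ** 2 - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    cf = correction_factor(xs) + correction_factor(ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n ** 2 - 1))

# Hypothetical scores: 40 appears twice in x, 5 appears three times in y
xs = [50, 40, 40, 30, 20]
ys = [7, 5, 5, 5, 9]

cf_total = correction_factor(xs) + correction_factor(ys)  # 0.5 + 2 = 2.5
rho = spearman_with_ties(xs, ys)
```

With these values the ranks are (1, 2.5, 2.5, 4, 5) and (2, 4, 4, 4, 1), Σd² = 21.5, and R = 1 - 6(21.5 + 2.5)/120 = -0.2.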

4. Regression: Lines of Regression

If correlation shows a relationship exists, regression describes that relationship with an equation. This equation can be used for prediction.

A "line of regression" is the line of best fit for the data (found using the Principle of Least Squares from Unit 3). In bivariate analysis, there are two regression lines.

1. Regression Line of Y on X

This line is used to predict the value of Y, given a value of X.

Equation: (Y - ȳ) = b_yx * (X - x̄)

Here, b_yx is the regression coefficient of Y on X (the slope).

b_yx = Cov(x, y) / σ_x²
b_yx = r * (σ_y / σ_x)
b_yx = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ] (Note: The numerator is the same as in 'r'; the denominator is the x-factor from the denominator of 'r', without the sqrt)

2. Regression Line of X on Y

This line is used to predict the value of X, given a value of Y.

Equation: (X - x̄) = b_xy * (Y - ȳ)

Here, b_xy is the regression coefficient of X on Y.

b_xy = Cov(x, y) / σ_y²
b_xy = r * (σ_x / σ_y)
b_xy = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]

Note: Both regression lines pass through the point (x̄, ȳ), which is the mean of x and the mean of y.
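To make the two lines concrete, here is a minimal sketch that computes both regression coefficients from the raw sums and uses each line for its own kind of prediction. The data and all function names are hypothetical.

```python
def regression_coefficients(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) using the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    b_yx = numerator / (n * sum_x2 - sum_x ** 2)
    b_xy = numerator / (n * sum_y2 - sum_y ** 2)
    return sum_x / n, sum_y / n, b_yx, b_xy

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar, b_yx, b_xy = regression_coefficients(xs, ys)

def predict_y(x):
    """Line of Y on X: Y - y_bar = b_yx * (X - x_bar)."""
    return y_bar + b_yx * (x - x_bar)

def predict_x(y):
    """Line of X on Y: X - x_bar = b_xy * (Y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)

y_at_6 = predict_y(6)  # 4 + 0.6 * (6 - 3) = 5.8
x_at_3 = predict_x(3)  # 3 + 1.0 * (3 - 4) = 2.0
```

For this data b_yx = 30/50 = 0.6 and b_xy = 30/30 = 1.0, and both lines return the other mean when evaluated at (x̄, ȳ) = (3, 4).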

5. Properties of Regression Coefficients

These properties are extremely important for exam questions.

Geometric Mean: The correlation coefficient 'r' is the geometric mean of the two regression coefficients (b_yx and b_xy).
r² = b_yx * b_xy => r = ± sqrt(b_yx * b_xy)
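Sign: 'r', b_yx, and b_xy all have the same sign (the sign of Cov(x, y)). If both regression coefficients are positive, 'r' is positive; if both are negative, 'r' is negative. The two coefficients can never have opposite signs.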
Magnitude: If one regression coefficient is greater than 1 (numerically), the other *must* be less than 1 (numerically). (Their product, r², cannot exceed 1).
Independence: Regression coefficients are *not* independent of change of scale, but they *are* independent of change of origin. (This is different from 'r'!) For example, if u = (x - a)/h and v = (y - b)/k, then b_yx = (k/h) * b_vu and b_xy = (h/k) * b_uv.

Exam Tip: A classic question: "The two regression coefficients are 1.5 and 0.8. Find 'r'."
- Answer: r² = 1.5 * 0.8 = 1.2. This is impossible, as r² cannot be > 1. The data is inconsistent.
Another: "The two regression coefficients are -0.9 and -0.4. Find 'r'."
- Answer: r² = (-0.9) * (-0.4) = 0.36.
- r = ± sqrt(0.36) = ±0.6.
- Since both coefficients are negative, 'r' must also be negative.
- r = -0.6
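The reasoning in these two exam questions can be written as a small checker. This is a sketch only; the function name and the error messages are choices made here.

```python
import math

def r_from_regression_coefficients(b_yx, b_xy):
    """Recover r from the two regression coefficients, rejecting inconsistent input."""
    if b_yx * b_xy < 0:
        raise ValueError("inconsistent data: coefficients have opposite signs")
    r_squared = b_yx * b_xy
    if r_squared > 1:
        raise ValueError("inconsistent data: r^2 would exceed 1")
    magnitude = math.sqrt(r_squared)
    # r takes the common sign of the two coefficients
    return -magnitude if b_yx < 0 else magnitude

r_negative = r_from_regression_coefficients(-0.9, -0.4)  # about -0.6

try:
    r_from_regression_coefficients(1.5, 0.8)  # product is 1.2
    inconsistent = False
except ValueError:
    inconsistent = True
```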

6. Angle Between Two Regression Lines

The two regression lines (Y on X, X on Y) intersect at the point (x̄, ȳ). The angle (θ) between them gives an idea of the correlation strength.

tan(θ) = [ (1 - r²) / |r| ] * [ (σ_x * σ_y) / (σ_x² + σ_y²) ]   (using |r| gives the acute angle, whatever the sign of r)

A simpler form using the slopes (m1 = b_yx, m2 = 1/b_xy):
tan(θ) = | (m1 - m2) / (1 + m1 * m2) |

Key Insights:

If r = 0: tan(θ) = ∞ (infinity), so θ = 90°. The lines are perpendicular: both regression coefficients are zero, so the lines reduce to Y = ȳ and X = x̄. This makes sense, as the variables are uncorrelated.
If r = +1 or -1: tan(θ) = 0, so θ = 0°. The two lines are coincident (they become the same line). This means perfect correlation.
The closer 'r' is to 0, the larger the angle. The closer 'r' is to ±1, the smaller the angle.
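As a numerical check that the two expressions for tan(θ) agree, the sketch below computes the acute angle both ways. The summary values and function names are hypothetical, and |r| is used in the first formula so that the result is the acute angle.

```python
import math

def angle_from_r(r, sigma_x, sigma_y):
    """Acute angle (degrees) between the regression lines, from r and the SDs."""
    if r == 0:
        return 90.0  # tan(theta) is infinite: the lines are perpendicular
    tan_theta = ((1 - r ** 2) / abs(r)) * (sigma_x * sigma_y) / (sigma_x ** 2 + sigma_y ** 2)
    return math.degrees(math.atan(tan_theta))

def angle_from_slopes(b_yx, b_xy):
    """Same angle from the slopes m1 = b_yx and m2 = 1 / b_xy (both nonzero)."""
    m1, m2 = b_yx, 1 / b_xy
    return math.degrees(math.atan(abs((m1 - m2) / (1 + m1 * m2))))

# Hypothetical summary values
r, sigma_x, sigma_y = 0.5, 2.0, 3.0
b_yx = r * sigma_y / sigma_x  # 0.75
b_xy = r * sigma_x / sigma_y  # 1/3

theta_from_r = angle_from_r(r, sigma_x, sigma_y)
theta_from_slopes = angle_from_slopes(b_yx, b_xy)
```

Both routes give tan(θ) = 2.25/3.25, and the function returns 0° for r = ±1 and 90° for r = 0, matching the key insights above.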

7. Coefficient of Determination (r²)

Coefficient of Determination (r²): The square of the correlation coefficient (r). It represents the proportion of the total variance in the dependent variable (Y) that is explained or accounted for by the independent variable (X).

Range: 0 ≤ r² ≤ 1 (since -1 ≤ r ≤ +1).
Example: If r = 0.8, then r² = 0.64.
Interpretation: This means 64% of the variation in Y can be explained by the linear relationship with X. The remaining 36% (1 - r²) is unexplained variation, due to other factors or random error.

Coefficient of Non-Determination (k²): This is the unexplained portion.
k² = 1 - r²
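The "explained variation" reading of r² can be verified on a small example: fit the line of Y on X, measure the leftover (unexplained) squared error, and compare 1 minus its share of the total with r². The data and names below are illustrative assumptions.

```python
def determination(xs, ys):
    """Return (r_squared, explained_share) for the regression of Y on X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)  # total variation in Y

    b_yx = sxy / sxx
    predicted = [mean_y + b_yx * (x - mean_x) for x in xs]
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))

    r_squared = sxy ** 2 / (sxx * syy)
    explained_share = 1 - unexplained / syy
    return r_squared, explained_share

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_squared, explained_share = determination(xs, ys)
k_squared = 1 - r_squared  # coefficient of non-determination
```

Here the total variation is 6 and the unexplained part is 2.4, so both r² and the explained share come out at 0.6, and k² = 0.4.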

8. Other Correlation Concepts

Concept: Intra-class Correlation Coefficient

This coefficient measures the correlation *within* a class or group. It is used when you have data in groups (e.g., test scores of siblings in different families) and you want to see how similar items *within* the same group are, compared to items from different groups.

It assesses the homogeneity within groups. A high intra-class correlation means items in the same group are very similar.

Concept: Correlation Ratio (η²)

Pearson's 'r' only measures linear relationships. What if the relationship is a strong curve (e.g., a U-shape)? Pearson's 'r' might be close to 0 even though Y depends strongly on X, which is misleading.

The Correlation Ratio (η², "eta-squared") is a measure of association that can detect non-linear relationships. It is never negative (0 ≤ η² ≤ 1). It compares the variation of the group means of Y (one group for each value or class of X) with the total variation in Y.

If the relationship is perfectly linear, η² = r².
If the relationship is non-linear, η² > r².
It measures the proportion of variance in Y explained by X, regardless of whether the relationship is linear.
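A small U-shaped example shows the contrast between η² and r². The sketch assumes the usual definition of the correlation ratio of Y on X (between-group sum of squares divided by total sum of squares, with one group per distinct x value), which the notes do not spell out. The data and function names are made up.

```python
from collections import defaultdict

def correlation_ratio(xs, ys):
    """eta^2 of Y on X: between-group sum of squares / total sum of squares."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    grand_mean = sum(ys) / len(ys)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((y - grand_mean) ** 2 for y in ys)
    return between / total

def pearson_r_squared(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical U-shaped data: y is roughly x^2, two observations per x value
xs = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
ys = [4.1, 3.9, 1.1, 0.9, 0.1, -0.1, 0.9, 1.1, 3.9, 4.1]

eta_squared = correlation_ratio(xs, ys)  # 28 / 28.1, roughly 0.996
r_squared = pearson_r_squared(xs, ys)    # essentially 0
```

The linear measure sees almost nothing, while η² reports that nearly all of the variation in Y is accounted for by X.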