Unit 4: Regression and Curve Fitting

1. Regression: Types of Regression (Lines)
2. Regression Coefficients and their Properties
3. Angle Between Two Regression Lines
4. Principle of Least Squares
5. Fitting of Linear, Polynomials, and Exponential Curves
6. Coefficient of Determination (r²)

1. Regression: Types of Regression (Lines)

If correlation shows a relationship exists, regression describes that relationship with an equation. This equation can be used for prediction.

A "line of regression" is the line of best fit for the data. In bivariate analysis there are two regression lines, depending on which variable is being predicted.

1. Regression Line of Y on X

This line is used to predict the value of Y, given a value of X.

Equation: (Y - ȳ) = b_yx * (X - x̄)

Here, b_yx is the regression coefficient of Y on X (the slope). It represents the average change in Y for a one-unit change in X.

2. Regression Line of X on Y

This line is used to predict the value of X, given a value of Y.

Equation: (X - x̄) = b_xy * (Y - ȳ)

Here, b_xy is the regression coefficient of X on Y. It represents the average change in X for a one-unit change in Y.

Note: Both regression lines always pass through the point (x̄, ȳ), which is the mean of x and the mean of y.
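
The two line equations can be sketched in code as a rough illustration. The function names and the summary values (x̄ = 3, ȳ = 4, b_yx = 0.6, b_xy = 1.0) are made up for this example and are not from the notes.

```python
def predict_y_from_x(x, x_bar, y_bar, b_yx):
    """Regression line of Y on X: (Y - y_bar) = b_yx * (X - x_bar)."""
    return y_bar + b_yx * (x - x_bar)


def predict_x_from_y(y, x_bar, y_bar, b_xy):
    """Regression line of X on Y: (X - x_bar) = b_xy * (Y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)


# Illustrative (assumed) summary values, not taken from real data.
x_bar, y_bar = 3.0, 4.0
b_yx, b_xy = 0.6, 1.0

print(predict_y_from_x(6.0, x_bar, y_bar, b_yx))    # roughly 5.8
print(predict_x_from_y(6.0, x_bar, y_bar, b_xy))    # 5.0
print(predict_y_from_x(x_bar, x_bar, y_bar, b_yx))  # 4.0, the line passes through the means
```

Note that the two functions are not inverses of each other unless the correlation is perfect, which is why the line must be chosen according to the variable being predicted.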

2. Regression Coefficients and their Properties

The coefficients b_yx and b_xy are the slopes of the two regression lines.

Formulas for Coefficients:

b_yx = Cov(x, y) / σ_x² = r * (σ_y / σ_x)
b_xy = Cov(x, y) / σ_y² = r * (σ_x / σ_y)

Computational Formulas:
b_yx = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]
b_xy = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]
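
As a minimal sketch, the computational formulas can be evaluated directly from the raw sums. The function name and the five data points are assumptions chosen for illustration.

```python
def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) from the computational (raw-sum) formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # Both coefficients share the same numerator, n*Σxy - Σx*Σy.
    numerator = n * sum_xy - sum_x * sum_y
    b_yx = numerator / (n * sum_x2 - sum_x ** 2)
    b_xy = numerator / (n * sum_y2 - sum_y ** 2)
    return b_yx, b_xy


# Made-up sample data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b_yx, b_xy = regression_coefficients(xs, ys)
print(b_yx, b_xy)  # 0.6 1.0
```

For this data, Σx = 15, Σy = 20, Σxy = 66, Σx² = 55, Σy² = 86, so the shared numerator is 30 and the denominators are 50 and 30.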

Properties of Regression Coefficients:

Geometric Mean: The correlation coefficient 'r' is the geometric mean of the two regression coefficients.
r² = b_yx * b_xy => r = ± sqrt(b_yx * b_xy)
Sign: 'r', b_yx, and b_xy all have the same sign.
Magnitude: If one regression coefficient is greater than 1 in absolute value, the other *must* be less than 1 in absolute value (as their product, r², cannot exceed 1).

Exam Tip: A classic question: "The two regression coefficients are 1.6 and 0.9. Is this possible?"
- Answer: No. r² = 1.6 * 0.9 = 1.44, which is > 1. This is impossible.
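
The check used in this exam tip can be sketched as a small helper. The function names are hypothetical; the rules encoded are the sign and magnitude properties listed above.

```python
import math


def coefficients_are_consistent(b_yx, b_xy):
    """Check whether two values could be a valid pair of regression coefficients."""
    # If the covariance is zero, both coefficients are zero together.
    if b_yx == 0 or b_xy == 0:
        return b_yx == 0 and b_xy == 0
    # Same sign (product > 0) and r² = b_yx * b_xy must not exceed 1.
    return 0 < b_yx * b_xy <= 1


def implied_r(b_yx, b_xy):
    """Return r = ±sqrt(b_yx * b_xy), taking the sign of the coefficients."""
    if not coefficients_are_consistent(b_yx, b_xy):
        raise ValueError("not a valid pair of regression coefficients")
    return math.copysign(math.sqrt(b_yx * b_xy), b_yx)


print(coefficients_are_consistent(1.6, 0.9))   # False, product is 1.44 > 1
print(coefficients_are_consistent(-0.5, 0.8))  # False, signs differ
print(coefficients_are_consistent(0.6, 1.0))   # True
```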

3. Angle Between Two Regression Lines

The two regression lines intersect at (x̄, ȳ). The angle (θ) between them indicates the strength of the correlation: the smaller the angle, the stronger the linear correlation.

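tan(θ) = [ (1 - r²) / |r| ] * [ (σ_x * σ_y) / (σ_x² + σ_y²) ]

Using |r| gives the acute angle. With r itself, a negative r makes tan(θ) negative, which corresponds to the obtuse angle between the same two lines.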

Key Insights:

If r = 0: tan(θ) → ∞, so θ = 90°. Both coefficients are zero, so the lines are Y = ȳ and X = x̄, which are perpendicular. The variables are linearly uncorrelated.
If r = +1 or -1: tan(θ) = 0, so θ = 0°. The two lines are coincident (they become the same line). This means perfect correlation.
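
A rough sketch of the angle calculation follows. It assumes the acute angle is wanted, so it uses the absolute value of r, and it handles r = 0 separately because the formula divides by r. The function name and the sample inputs are illustrative.

```python
import math


def angle_between_regression_lines(r, sigma_x, sigma_y):
    """Acute angle (in degrees) between the two regression lines."""
    if r == 0:
        return 90.0  # tan(theta) is unbounded, so the lines are perpendicular
    tan_theta = ((1 - r ** 2) / abs(r)) * (
        (sigma_x * sigma_y) / (sigma_x ** 2 + sigma_y ** 2)
    )
    return math.degrees(math.atan(tan_theta))


print(angle_between_regression_lines(0.0, 1.0, 1.0))  # 90.0
print(angle_between_regression_lines(1.0, 2.0, 3.0))  # 0.0
print(angle_between_regression_lines(0.5, 1.0, 1.0))  # about 36.87
```

In the last call, tan(θ) = (0.75 / 0.5) * (1 / 2) = 0.75, which is the angle of a 3-4-5 triangle.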

4. Principle of Least Squares

This is the fundamental method used to find the "best-fit" line (the regression line) for a set of data points.

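Principle: The line of best fit is the one that minimizes the sum of the squares of the errors (residuals). For the line of Y on X these are the vertical deviations of the points from the line; for the line of X on Y they are the horizontal deviations.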

Residual (Error): e_i = (Observed y_i) - (Predicted ŷ_i)
Goal: Minimize the Sum of Squared Errors (SSE).

Minimize: SSE = Σ (e_i)² = Σ (y_i - ŷ_i)²

For a straight line ŷ = a + bx, we use calculus (partial derivatives w.r.t. 'a' and 'b') to find the values that minimize SSE. This process generates the "Normal Equations."
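
To make the idea of "minimizing SSE" concrete, the sketch below compares the SSE of one candidate line with that of another for the same points. The data and both candidate lines are made up; the line a = 2.2, b = 0.6 happens to be the least-squares line for this data, and the other is an arbitrary nearby line.

```python
def sum_squared_errors(xs, ys, a, b):
    """SSE = Σ (y_i - ŷ_i)² for the candidate line ŷ = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sse_least_squares = sum_squared_errors(xs, ys, a=2.2, b=0.6)
sse_other_line = sum_squared_errors(xs, ys, a=2.0, b=0.7)

print(sse_least_squares)  # about 2.4
print(sse_other_line)     # about 2.55, larger than the least-squares SSE
```

Any other choice of 'a' and 'b' gives an SSE at least as large as that of the least-squares line.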

5. Fitting of Linear, Polynomials, and Exponential Curves

Using the Principle of Least Squares, we can derive the Normal Equations needed to fit specific curves to data.

1. Fitting a Linear Equation (Straight Line)

Equation: y = a + bx

Normal Equations:

Σy = n*a + b*(Σx)

Σxy = a*(Σx) + b*(Σx²)

Solve these two simultaneous equations for 'a' and 'b'.
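
One possible way to solve the two normal equations is by elimination, sketched below. The function name and data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Solve the normal equations for y = a + b*x and return (a, b)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Eliminating 'a' from the two normal equations gives b.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Back-substitute into Σy = n*a + b*Σx to get a.
    a = (sum_y - b * sum_x) / n
    return a, b


# Made-up sample data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
print(a, b)  # about 2.2 and 0.6, i.e. y = 2.2 + 0.6x
```

The slope 'b' obtained this way is the same quantity as the regression coefficient b_yx.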

2. Fitting a Polynomial (Parabola / Quadratic)

Equation: y = a + bx + cx²

Normal Equations:

Σy = n*a + b*(Σx) + c*(Σx²)

Σxy = a*(Σx) + b*(Σx²) + c*(Σx³)

Σx²y = a*(Σx²) + b*(Σx³) + c*(Σx⁴)

Solve these three simultaneous equations for 'a', 'b', and 'c'.
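
The three normal equations form a 3×3 linear system. The sketch below solves it with Cramer's rule, which is one convenient choice for hand-sized problems and not the only option. The function names and data are illustrative; the data lies exactly on y = 1 + 2x + 3x² so the result is easy to verify.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def fit_parabola(xs, ys):
    """Solve the normal equations for y = a + b*x + c*x² and return (a, b, c)."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x ** 2 * y for x, y in zip(xs, ys))

    coeffs = [[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]]
    rhs = [sy, sxy, sx2y]
    d = det3(coeffs)

    solution = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        replaced = [row[:] for row in coeffs]
        for row in range(3):
            replaced[row][col] = rhs[row]
        solution.append(det3(replaced) / d)
    return tuple(solution)


# Made-up data lying exactly on y = 1 + 2x + 3x².
xs = [0, 1, 2, 3, 4]
ys = [1, 6, 17, 34, 57]

print(fit_parabola(xs, ys))  # (1.0, 2.0, 3.0)
```

At least three distinct x values are needed; otherwise the determinant is zero and the system has no unique solution.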

3. Fitting an Exponential Curve

Equation: y = a * b^x

This is not linear in 'a' and 'b'. Assuming y > 0 (with a > 0 and b > 0), we transform it by taking the logarithm of both sides.

log(y) = log(a) + x * log(b)

Now, let Y = log(y), A = log(a), and B = log(b).
The equation becomes a straight line: Y = A + Bx

We use the normal equations for a straight line, but with Y instead of y:

Normal Equations (Exponential):

ΣY = n*A + B*(Σx) => Σ(log y) = n*log(a) + log(b)*(Σx)

ΣxY = A*(Σx) + B*(Σx²) => Σ(x log y) = log(a)*(Σx) + log(b)*(Σx²)

Solve for A and B, then find a = antilog(A) and b = antilog(B). With base-10 logarithms, a = 10^A and b = 10^B.
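
The log transformation can be sketched as follows, assuming base-10 logarithms and strictly positive y values. The function name and data are made up; the data lies exactly on y = 2 * 3^x so the recovered constants are easy to check. Note that this procedure minimizes squared error in log(y), not in y itself.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * b**x by a straight-line fit of log10(y) on x; return (a, b)."""
    n = len(xs)
    log_ys = [math.log10(y) for y in ys]  # requires every y > 0

    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    sum_log_y = sum(log_ys)
    sum_x_log_y = sum(x * ly for x, ly in zip(xs, log_ys))

    # Normal equations for Y = A + B*x, with Y = log10(y).
    B = (n * sum_x_log_y - sum_x * sum_log_y) / (n * sum_x2 - sum_x ** 2)
    A = (sum_log_y - B * sum_x) / n

    # Antilog (base 10) to recover a and b.
    return 10 ** A, 10 ** B


# Made-up data lying exactly on y = 2 * 3**x.
xs = [0, 1, 2, 3]
ys = [2, 6, 18, 54]

a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))  # 2.0 3.0
```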

6. Coefficient of Determination (r²)

Coefficient of Determination (r²): The square of the correlation coefficient (r). It represents the proportion of the total variance in the dependent variable (Y) that is explained or accounted for by the linear relationship with the independent variable (X).

Range: 0 ≤ r² ≤ 1 (since it's a square).
Example: If r = 0.9, then r² = 0.81.
Interpretation: This means 81% of the variation in Y is explained by the linear relationship with X. The remaining 19% (1 - r²) is unexplained variation, due to other factors or random error.
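
The "proportion of variance explained" reading can be checked numerically. The sketch below, with an assumed function name and made-up data, computes r² as 1 - SSE/SST for the least-squares line of Y on X.

```python
def coefficient_of_determination(xs, ys):
    """Return r² = 1 - SSE/SST for the least-squares line of Y on X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    sst = sum((y - y_bar) ** 2 for y in ys)  # total variation in Y

    b = s_xy / s_xx          # slope (b_yx)
    a = y_bar - b * x_bar    # intercept, since the line passes through the means

    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    return 1 - sse / sst


# Made-up sample data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(coefficient_of_determination(xs, ys))  # about 0.6
```

For this data SST = 6 and SSE = 2.4, so r² = 0.6: about 60% of the variation in Y is explained by the line and 40% is not. The same value is obtained as the product b_yx * b_xy = 0.6 * 1.0.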