Unit 4: Regression and Curve Fitting

Table of Contents

1. Regression: Types of Regression (Lines)
2. Regression Coefficients and their Properties
3. Angle Between Two Regression Lines
4. Principle of Least Squares
5. Fitting of Linear, Polynomial, and Exponential Curves
6. Coefficient of Determination (r²)

1. Regression: Types of Regression (Lines)

If correlation shows a relationship exists, regression describes that relationship with an equation. This equation can be used for prediction.

A "line of regression" is the line of best fit for the data. In bivariate analysis, there are two main types (lines) of linear regression.

1. Regression Line of Y on X

This line is used to predict the value of Y, given a value of X.

Equation: (Y - ȳ) = byx * (X - x̄)

Here, byx is the regression coefficient of Y on X (the slope). It represents the average change in Y for a one-unit change in X.

2. Regression Line of X on Y

This line is used to predict the value of X, given a value of Y.

Equation: (X - x̄) = bxy * (Y - ȳ)

Here, bxy is the regression coefficient of X on Y. It represents the average change in X for a one-unit change in Y.

Note: Both regression lines always pass through the point (x̄, ȳ), which is the mean of x and the mean of y.
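Both lines, and the fact that they share the point (x̄, ȳ), can be sketched in a few lines of Python. This is a minimal sketch; the sample data is illustrative, not from the text.

```python
# Minimal sketch: both regression lines from a small sample
# (the data values are illustrative).

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Deviation sums used by both slopes
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)

byx = sxy / sxx   # slope of the line of Y on X
bxy = sxy / syy   # slope of the line of X on Y

def predict_y(x):
    """Y on X: (Y - ybar) = byx * (X - xbar)."""
    return ybar + byx * (x - xbar)

def predict_x(y):
    """X on Y: (X - xbar) = bxy * (Y - ybar)."""
    return xbar + bxy * (y - ybar)

# Both lines pass through (xbar, ybar):
# predict_y(xbar) equals ybar, and predict_x(ybar) equals xbar.
```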

2. Regression Coefficients and their Properties

The coefficients byx and bxy are the slopes of the two regression lines.

Formulas for Coefficients:

byx = Cov(x, y) / σx² = r * (σy / σx)
bxy = Cov(x, y) / σy² = r * (σx / σy)

Computational Formulas:

byx = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]
bxy = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]

Properties of Regression Coefficients:

  1. Geometric Mean: The correlation coefficient 'r' is the geometric mean of the two regression coefficients.
    r² = byx * bxy => r = ± sqrt(byx * bxy)
  2. Sign: 'r', byx, and bxy all have the same sign.
  3. Magnitude: If one regression coefficient is greater than 1, the other *must* be less than 1 (as their product, r², cannot exceed 1).

Exam Tip: A classic question: "The two regression coefficients are 1.6 and 0.9. Is this possible?"
- Answer: No. r² = 1.6 * 0.9 = 1.44, which is > 1. This is impossible.
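The geometric-mean property can be checked numerically with the computational formulas above. A minimal sketch (the sample data is illustrative):

```python
import math

# Minimal sketch: r recovered as the geometric mean of the two
# regression coefficients (the sample data is illustrative).

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)

# Computational formulas from the text
byx = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
bxy = (n * sxy - sx * sy) / (n * syy - sy ** 2)

# r takes the common sign of byx and bxy, and byx * bxy = r² <= 1
r = math.copysign(math.sqrt(byx * bxy), byx)
```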

3. Angle Between Two Regression Lines

The two regression lines intersect at (x̄, ȳ). The angle (θ) between them indicates the strength of the correlation.

tan(θ) = [ (1 - r²) / |r| ] * [ (σx * σy) / (σx² + σy²) ]

(|r| is used so that θ is the acute angle between the lines.)

Key Insights:

  1. If r = ±1 (perfect correlation), tan(θ) = 0, so θ = 0°: the two lines coincide.
  2. If r = 0 (no correlation), tan(θ) → ∞, so θ = 90°: the two lines are perpendicular.
  3. In general, the smaller the angle between the lines, the stronger the correlation.

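Evaluating the formula is straightforward. A minimal sketch, with illustrative values of r, σx, and σy:

```python
import math

# Minimal sketch: angle between the two regression lines, using
# tan(theta) = ((1 - r**2) / |r|) * (sx * sy / (sx**2 + sy**2)).
# The values of r, sx, sy below are illustrative.

r = math.sqrt(0.6)        # correlation coefficient
sx = math.sqrt(2.0)       # sigma_x
sy = math.sqrt(1.2)       # sigma_y

tan_theta = ((1 - r ** 2) / abs(r)) * (sx * sy / (sx ** 2 + sy ** 2))
theta_deg = math.degrees(math.atan(tan_theta))

# Sanity check against the key insights: r = ±1 would give
# tan(theta) = 0 (lines coincide); r -> 0 drives theta toward 90°.
```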
4. Principle of Least Squares

This is the fundamental method used to find the "best-fit" line (the regression line) for a set of data points.

Principle: The line of best fit is the one that minimizes the sum of the squares of the vertical errors (residuals).
Minimize: SSE = Σ (ei)² = Σ (yi - ŷi)²

For a straight line ŷ = a + bx, we use calculus (partial derivatives w.r.t. 'a' and 'b') to find the values that minimize SSE. This process generates the "Normal Equations."


5. Fitting of Linear, Polynomial, and Exponential Curves

Using the Principle of Least Squares, we can derive the Normal Equations needed to fit specific curves to data.

1. Fitting a Linear Equation (Straight Line)

Equation: y = a + bx

Normal Equations:
  1. Σy = n*a + b*(Σx)
  2. Σxy = a*(Σx) + b*(Σx²)

Solve these two simultaneous equations for 'a' and 'b'.
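The two normal equations form a 2×2 linear system, which can be solved in closed form. A minimal sketch (the data points are illustrative and chosen to lie exactly on y = 1 + 2x):

```python
# Minimal sketch: fitting y = a + b*x by solving the two normal
# equations directly (illustrative data lying on y = 1 + 2x).

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations:
#   sy  = n*a  + b*sx
#   sxy = a*sx + b*sxx
# Solved by Cramer's rule on the 2x2 system:
det = n * sxx - sx * sx
a = (sy * sxx - sx * sxy) / det
b = (n * sxy - sx * sy) / det
```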

2. Fitting a Polynomial (Parabola / Quadratic)

Equation: y = a + bx + cx²

Normal Equations:
  1. Σy = n*a + b*(Σx) + c*(Σx²)
  2. Σxy = a*(Σx) + b*(Σx²) + c*(Σx³)
  3. Σx²y = a*(Σx²) + b*(Σx³) + c*(Σx⁴)

Solve these three simultaneous equations for 'a', 'b', and 'c'.
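The three normal equations form a 3×3 linear system. A minimal sketch that builds the system from power sums and solves it with a small Gaussian elimination (the data is illustrative, chosen to lie exactly on y = 1 + 2x + x²):

```python
# Minimal sketch: fitting y = a + b*x + c*x**2 by solving the three
# normal equations (illustrative data lying on y = 1 + 2x + x**2).

def solve3(M, v):
    """Solve a 3x3 linear system M @ sol = v by Gaussian elimination."""
    A = [row[:] + [rhs] for row, rhs in zip(M, v)]
    for i in range(3):
        p = max(range(i, 3), key=lambda k: abs(A[k][i]))  # partial pivot
        A[i], A[p] = A[p], A[i]
        for k in range(i + 1, 3):
            f = A[k][i] / A[i][i]
            A[k] = [u - f * w for u, w in zip(A[k], A[i])]
    sol = [0.0] * 3
    for i in (2, 1, 0):  # back substitution
        sol[i] = (A[i][3] - sum(A[i][j] * sol[j] for j in range(i + 1, 3))) / A[i][i]
    return sol

xs = [-2, -1, 0, 1, 2]
ys = [1, 0, 1, 4, 9]
n = len(xs)
S = lambda p: sum(x ** p for x in xs)                       # Σx^p
Sy = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))    # Σx^p * y

# Normal equations in matrix form:
M = [[n,    S(1), S(2)],
     [S(1), S(2), S(3)],
     [S(2), S(3), S(4)]]
v = [Sy(0), Sy(1), Sy(2)]
a, b, c = solve3(M, v)
```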

3. Fitting an Exponential Curve

Equation: y = a * bˣ

This is not linear. We must transform it by taking the logarithm.

log(y) = log(a) + x * log(b)

Now, let Y = log(y), A = log(a), and B = log(b).
The equation becomes a straight line: Y = A + Bx

We use the normal equations for a straight line, but with Y instead of y:

Normal Equations (Exponential):
  1. ΣY = n*A + B*(Σx) => Σ(log y) = n*log(a) + log(b)*(Σx)
  2. ΣxY = A*(Σx) + B*(Σx²) => Σ(x log y) = log(a)*(Σx) + log(b)*(Σx²)

Solve for A and B, then find a = antilog(A) and b = antilog(B).
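The whole transform-fit-antilog procedure can be sketched briefly. This version uses natural logarithms (so the antilog is exp); the data is illustrative, chosen to lie exactly on y = 2 * 3ˣ:

```python
import math

# Minimal sketch: fitting y = a * b**x via the log transform
# (illustrative data lying exactly on y = 2 * 3**x).

xs = [0, 1, 2, 3]
ys = [2, 6, 18, 54]
Ys = [math.log(y) for y in ys]    # Y = log y, so Y = A + B*x

n = len(xs)
sx, sY = sum(xs), sum(Ys)
sxx = sum(x * x for x in xs)
sxY = sum(x * Y for x, Y in zip(xs, Ys))

# Straight-line normal equations in (x, Y):
det = n * sxx - sx * sx
A = (sY * sxx - sx * sxY) / det
B = (n * sxY - sx * sY) / det

a = math.exp(A)   # antilog(A); exp because natural logs were used
b = math.exp(B)   # antilog(B)
```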


6. Coefficient of Determination (r²)

Coefficient of Determination (r²): The square of the correlation coefficient (r). It represents the proportion of the total variance in the dependent variable (Y) that is explained or accounted for by the linear relationship with the independent variable (X).
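For a fitted straight line, this "explained proportion" can be computed directly as 1 - SSE/SST, which matches r² from the correlation coefficient. A minimal sketch with illustrative data:

```python
# Minimal sketch: r**2 as explained variance, computed as
# 1 - SSE/SST for a fitted straight line (illustrative data).

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Least-squares slope of Y on X (deviation form)
byx = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
      / sum((x - xbar) ** 2 for x in xs)

yhat = [ybar + byx * (x - xbar) for x in xs]        # fitted values
sse = sum((y - f) ** 2 for y, f in zip(ys, yhat))   # unexplained variation
sst = sum((y - ybar) ** 2 for y in ys)              # total variation

r_squared = 1 - sse / sst   # proportion of variance explained
```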