If correlation shows a relationship exists, regression describes that relationship with an equation. This equation can be used for prediction.
A "line of regression" is the line of best fit for the data. In bivariate analysis, there are two lines of linear regression: the regression line of Y on X and the regression line of X on Y.
The regression line of Y on X is used to predict the value of Y, given a value of X.
Equation: (Y - ȳ) = byx * (X - x̄)
Here, byx is the regression coefficient of Y on X (the slope). It represents the average change in Y for a one-unit change in X.
The regression line of X on Y is used to predict the value of X, given a value of Y.
Equation: (X - x̄) = bxy * (Y - ȳ)
Here, bxy is the regression coefficient of X on Y. It represents the average change in X for a one-unit change in Y.
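The coefficients byx and bxy are the slopes of the two regression lines, each measured against its own independent variable: byx is the change in Y per unit of X, and bxy is the change in X per unit of Y. Both have the same sign as the correlation coefficient r, and their product equals r².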
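The two regression lines intersect at (x̄, ȳ). The angle (θ) between them indicates the strength of the correlation: the lines coincide (θ = 0°) when r = ±1 and are perpendicular (θ = 90°) when r = 0, so a smaller angle means a stronger linear relationship.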
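As a rough illustration of the two regression lines, the sketch below computes byx and bxy and uses each line for prediction. It assumes the standard deviation-from-mean formulas byx = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and bxy = Σ(x - x̄)(y - ȳ) / Σ(y - ȳ)², which are not stated in the text above. The data values and the function names are made up for the example.

```python
def regression_lines(xs, ys):
    """Return (x_bar, y_bar, byx, bxy) for paired samples xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    byx = s_xy / s_xx  # regression coefficient of Y on X
    bxy = s_xy / s_yy  # regression coefficient of X on Y
    return x_bar, y_bar, byx, bxy


# Illustrative data (an assumption, not taken from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
x_bar, y_bar, byx, bxy = regression_lines(xs, ys)


def predict_y(x):
    """Line of Y on X: (Y - y_bar) = byx * (X - x_bar)."""
    return y_bar + byx * (x - x_bar)


def predict_x(y):
    """Line of X on Y: (X - x_bar) = bxy * (Y - y_bar)."""
    return x_bar + bxy * (y - y_bar)


print("byx =", byx, "bxy =", bxy)
print("Predicted Y at X = 6:", predict_y(6))
print("Predicted X at Y = 3:", predict_x(3))
print("byx * bxy (equals r squared):", byx * bxy)
```

Both prediction functions return the mean of the other variable when given a mean as input, which reflects the fact that the two lines cross at (x̄, ȳ).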
The principle of least squares is the fundamental method used to find the "best-fit" line (the regression line) for a set of data points.
Principle: The line of best fit is the one that minimizes the sum of the squares of the vertical errors (residuals).
For a straight line ŷ = a + bx, the sum of squared errors is SSE = Σ(y - ŷ)². Setting the partial derivatives of SSE with respect to 'a' and 'b' equal to zero gives the values that minimize it. This process generates the "Normal Equations."
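The minimization step can be written out as follows. This is a sketch of the usual derivation, with the observations indexed i = 1, ..., n and the function name S chosen here for the quantity being minimized.

```latex
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
\quad \Longrightarrow \quad
\sum y = n a + b \sum x

\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0
\quad \Longrightarrow \quad
\sum x y = a \sum x + b \sum x^2
```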
Using the Principle of Least Squares, we can derive the Normal Equations needed to fit specific curves to data.
Equation (straight line): y = a + bx
Normal Equations:
- Σy = n*a + b*(Σx)
- Σxy = a*(Σx) + b*(Σx²)
Solve these two simultaneous equations for 'a' and 'b'.
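A minimal sketch of solving the two normal equations for a straight line is shown below. It eliminates 'a' by hand (Cramer's rule on the 2x2 system). The function name `fit_line` and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Solve the straight-line normal equations for (a, b) in y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of the system:
    #   sum_y  = n*a     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_xx
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / det
    a = (sum_y - b * sum_x) / n
    return a, b


# Illustrative data (an assumption, not taken from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

For this data the sums are n = 5, Σx = 15, Σy = 20, Σx² = 55, Σxy = 66, so the equations are 20 = 5a + 15b and 66 = 15a + 55b, giving b = 0.6 and a = 2.2.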
Equation (second-degree parabola): y = a + bx + cx²
Normal Equations:
- Σy = n*a + b*(Σx) + c*(Σx²)
- Σxy = a*(Σx) + b*(Σx²) + c*(Σx³)
- Σx²y = a*(Σx²) + b*(Σx³) + c*(Σx⁴)
Solve these three simultaneous equations for 'a', 'b', and 'c'.
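The three normal equations for the parabola form a 3x3 linear system. The sketch below builds that system from the power sums and solves it with a small Gaussian elimination routine written in plain Python. The helper names `solve_linear_system` and `fit_parabola` and the sample data are assumptions for illustration; in practice a linear algebra library would typically be used instead.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Build the augmented matrix so row operations apply to both sides.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_parabola(xs, ys):
    """Return (a, b, c) for y = a + b*x + c*x^2 from the three normal equations."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [
        [n, s1, s2],   # sum(y)     = n*a  + b*s1 + c*s2
        [s1, s2, s3],  # sum(x*y)   = a*s1 + b*s2 + c*s3
        [s2, s3, s4],  # sum(x^2*y) = a*s2 + b*s3 + c*s4
    ]
    a, b, c = solve_linear_system(matrix, [sy, sxy, sx2y])
    return a, b, c


# Illustrative data generated from y = 1 + 2x + 3x^2 (an assumption for the example).
xs = [0, 1, 2, 3, 4]
ys = [1, 6, 17, 34, 57]
a, b, c = fit_parabola(xs, ys)
print("a =", round(a, 6), "b =", round(b, 6), "c =", round(c, 6))
```

Because the sample points lie exactly on a parabola, the fit should recover the generating coefficients up to floating-point rounding.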
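Equation (exponential curve): y = a * b^x
This is not linear in x, so we transform it by taking the logarithm of both sides (this requires y, a, and b to be positive):
log(y) = log(a) + x*log(b)
Now, let Y = log(y), A = log(a), and B = log(b).
The equation becomes a straight line: Y = A + Bx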
We use the normal equations for a straight line, but with Y instead of y:
Normal Equations (Exponential):
- ΣY = n*A + B*(Σx) => Σ(log y) = n*log(a) + log(b)*(Σx)
- ΣxY = A*(Σx) + B*(Σx²) => Σ(x log y) = log(a)*(Σx) + log(b)*(Σx²)
Solve for A and B, then find a = antilog(A) and b = antilog(B).
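The log-transform procedure can be sketched as follows. The example assumes base-10 logarithms (any base works as long as the antilog uses the same base), and the function name `fit_exponential` and the sample data are made up for illustration.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on Y = log10(y) = A + B*x."""
    if any(y <= 0 for y in ys):
        raise ValueError("all y values must be positive to take logarithms")
    log_ys = [math.log10(y) for y in ys]

    n = len(xs)
    sum_x = sum(xs)
    sum_log_y = sum(log_ys)
    sum_xx = sum(x * x for x in xs)
    sum_x_log_y = sum(x * ly for x, ly in zip(xs, log_ys))

    # Straight-line normal equations with Y = log10(y) in place of y.
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are identical; the fit is undefined")
    big_b = (n * sum_x_log_y - sum_x * sum_log_y) / det
    big_a = (sum_log_y - big_b * sum_x) / n

    # Antilog step: a = 10**A, b = 10**B.
    return 10 ** big_a, 10 ** big_b


# Illustrative data generated from y = 2 * 3**x (an assumption for the example).
xs = [0, 1, 2, 3]
ys = [2, 6, 18, 54]
a, b = fit_exponential(xs, ys)
print("a =", round(a, 6), "b =", round(b, 6))
```

Note that this minimizes the squared errors in log(y), not in y itself, so the result can differ slightly from a direct nonlinear fit when the data are noisy.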
Coefficient of Determination (r²): The square of the correlation coefficient (r). It represents the proportion of the total variance in the dependent variable (Y) that is explained or accounted for by the linear relationship with the independent variable (X).
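To make the "proportion of variance explained" reading concrete, the sketch below computes r² in two ways: as the square of the correlation coefficient, and as 1 - SSE/SST for the least-squares line of Y on X, where SST = Σ(y - ȳ)² is the total sum of squares. The two agree for simple linear regression. The function names and data are assumptions for illustration.

```python
def deviation_sums(xs, ys):
    """Return (x_bar, y_bar, s_xy, s_xx, s_yy) using deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return x_bar, y_bar, s_xy, s_xx, s_yy


def r_squared(xs, ys):
    """Square of the correlation coefficient r."""
    _, _, s_xy, s_xx, s_yy = deviation_sums(xs, ys)
    return s_xy ** 2 / (s_xx * s_yy)


def explained_fraction(xs, ys):
    """1 - SSE/SST for the least-squares line of Y on X."""
    x_bar, y_bar, s_xy, s_xx, s_yy = deviation_sums(xs, ys)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / s_yy  # s_yy is the total sum of squares (SST)


# Illustrative data (an assumption, not taken from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print("r squared         :", r_squared(xs, ys))
print("explained fraction:", explained_fraction(xs, ys))
```

For this data both values come out near 0.6, meaning about 60% of the variation in Y is accounted for by the linear relationship with X.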