Unit 4: Curve Fitting (Regression)

Table of Contents

1. Fitting of a Straight Line to Data (Linear Regression)

This is the process of finding the "line of best fit" that describes the linear relationship between an independent variable (X) and a dependent variable (Y). The goal is to find the equation y = a + bx.

Method: Using a Scatter Plot Trendline

This is the most common method in Excel (as covered in Unit 2).

  1. Create a Scatter (X, Y) plot of your data.
  2. Select the chart, click the "+" icon (Chart Elements).
  3. Check Trendline.
  4. In the "Format Trendline" options, ensure Linear is selected.
  5. Check the boxes for "Display Equation on chart" and "Display R-squared value on chart".

The equation displayed (e.g., y = 1.5x + 10.2) is the fitted straight line.


2. Fitting of Higher-Order Polynomials

Sometimes a relationship is not a straight line; it's a curve. We can fit a polynomial (e.g., quadratic, cubic) to model this curve.

Method: Using a Scatter Plot Trendline

  1. Create a Scatter (X, Y) plot.
  2. Add a Trendline (click "+" icon).
  3. Go to More Options...
  4. In the "Format Trendline" pane, select Polynomial.
  5. In the Order box, type 2 (for quadratic) or 3 (for cubic), etc.
  6. Check "Display Equation on chart" and "Display R-squared value on chart".

Excel will display the equation for the best-fit curve (e.g., y = 0.5x² - 2x + 4).


3. Interpretation of the Fitted Models

Getting the equation is only half the battle. You must interpret what it means.

Interpreting a Linear Model (y = a + bx)

Interpreting a Polynomial Model (e.g., y = a + bx + cx²)

Interpretation is more complex because the relationship is not constant. The slope depends on the value of X. It is often used to find an optimal point (a maximum or minimum).


4. R-squared (Coefficient of Determination)

R-squared (R²): A statistical measure (from 0 to 1) that represents the proportion of the variance in the dependent variable (Y) that is explained by the independent variable(s) (X) in the model.

Interpretation:

Note: R-squared is the square of the correlation coefficient (r) *only* in simple linear regression (one X variable).

5. Checking Assumptions of Regression Analysis

Linear regression models are only valid if certain assumptions about the data are met. The most important ones can be checked in Excel by analyzing the residuals (errors).

Residual = (Actual Y) - (Predicted Y)

How to get Residuals in Excel:

  1. Go to Data > Data Analysis > Regression.
  2. Input your Y Range and X Range.
  3. In the "Residuals" section, check the box for Residuals and Residual Plots.
  4. Click OK. Excel will create a new table with predicted Y values and their residuals.

Key Assumptions to Check:

  1. Linearity: The underlying relationship must be linear.
    • Check: Look at the Scatter Plot. Do the points look like a straight line, not a curve?
  2. Independence of Errors: The residuals should be random.
    • Check: Look at the Residual Plot (Residuals vs. X). The points should look like random scatter (no pattern, no U-shape, no funnel shape).
  3. Homoscedasticity (Constant Variance): The spread of the residuals should be consistent across all values of X.
    • Check: On the Residual Plot, look for a funnel shape (fanning out). If you see one, this assumption is violated (this is called "heteroscedasticity").

6. Finding the Equation of the Best Fit

This involves comparing different models (e.g., Linear, Polynomial, Exponential) and choosing the one that "best fits" the data.

The Primary Criterion: R-squared (R²)

The model with the highest R-squared value is generally considered the "best fit" because it explains the most variation in the data.

Procedure:

  1. Create a Scatter Plot of your X and Y data.
  2. Test Model 1 (Linear):
    • Add a Linear trendline.
    • Display the Equation and R-squared value. Write down the R² (e.g., R² = 0.82).
  3. Test Model 2 (Polynomial Order 2):
    • Add a Polynomial (Order 2) trendline.
    • Display its Equation and R-squared. Write down the R² (e.g., R² = 0.91).
  4. Test Model 3 (Exponential):
    • Add an Exponential trendline.
    • Display its Equation and R-squared. Write down the R² (e.g., R² = 0.88).
  5. Compare:
    • R² (Linear) = 0.82
    • R² (Poly-2) = 0.91
    • R² (Expo) = 0.88

Conclusion: The Polynomial (Order 2) model is the "best fit" because it has the highest R-squared (0.91). Its equation is the equation of the best fit for this dataset.