Unit 3: Bivariate Data and Correlation

1. Bivariate Data and Scatter Diagram
2. Karl Pearson's Coefficient of Correlation (r)
3. Spearman's Rank Correlation Coefficient

1. Bivariate Data and Scatter Diagram

Bivariate Data

Bivariate data consists of observations on two variables recorded together, where we are interested in the relationship between them. Each observation is a pair of values (x, y).

Example: Height (x) and Weight (y) of students.
Example: Advertising Expenditure (x) and Sales (y) of a product.

Scatter Diagram (or Scatter Plot)

The simplest way to visualize bivariate data. It's a graph where each (x, y) pair is plotted as a single point on a 2D plane.

The pattern of the points helps us identify the type (linear, non-linear), direction (positive, negative), and strength of the relationship.

Perfect Positive Correlation: Points form a perfect straight line sloping upwards.
High Positive Correlation: Points are tightly packed around a straight line sloping upwards.
Low Positive Correlation: Points are loosely scattered around a line sloping upwards.
Perfect Negative Correlation: Points form a perfect straight line sloping downwards.
High/Low Negative Correlation: Same as positive, but sloping downwards.
No Correlation (Zero Correlation): Points are randomly scattered with no clear pattern.

2. Karl Pearson's Coefficient of Correlation (r)

Also known as the "product-moment correlation coefficient." It is a numerical measure of the strength and direction of the linear relationship between two quantitative variables.

Properties of 'r':

Range: 'r' always lies between -1 and +1, inclusive (-1 ≤ r ≤ +1).
- r = +1: Perfect positive linear correlation.
- r = -1: Perfect negative linear correlation.
- r = 0: No linear correlation (there might still be a non-linear relationship).
Symmetrical: The correlation between x and y is the same as between y and x. (r_xy = r_yx).
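Independent of Change of Origin and Scale: If you add/subtract a constant to x or y, or multiply/divide them by positive constants, 'r' does not change. If exactly one of the two variables is multiplied or divided by a negative constant, the magnitude of 'r' stays the same but its sign flips.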
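These properties can be checked numerically. The sketch below is only an illustration: the helper `pearson_r` and the small data sets are made up for this example and are not part of the notes. It checks symmetry, the effect of a change of origin and scale, the sign flip under a negative scale factor, and a case where r = 0 even though y depends exactly on x.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from sums of deviations (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)          # symmetry: same value as r_xy

# Change of origin and scale with positive factors: u = (x - 3) / 2, v = 10y + 7.
u = [(xi - 3) / 2 for xi in x]
v = [10 * yi + 7 for yi in y]
r_uv = pearson_r(u, v)          # unchanged

# Negative scale factor on one variable only: the sign of r flips.
w = [-2 * xi for xi in x]
r_wy = pearson_r(w, y)

# Exact non-linear relationship y = x^2 on symmetric x values: r is 0.
px = [-2, -1, 0, 1, 2]
py = [xi ** 2 for xi in px]
r_parabola = pearson_r(px, py)

print(r_xy, r_yx, r_uv, r_wy, r_parabola)
```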

Formulas for 'r':

1. Covariance Method

r = Cov(x, y) / (σ_x * σ_y)

Where:
- Cov(x, y) = ( Σ[(x - x̄)(y - ȳ)] ) / n (Covariance of x and y)
- σ_x = sqrt( (Σ(x - x̄)²) / n ) (Standard deviation of x)
- σ_y = sqrt( (Σ(y - ȳ)²) / n ) (Standard deviation of y)
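As a minimal sketch of the covariance method, the snippet below computes Cov(x, y), σ_x and σ_y with divisor n, exactly as defined above, and then divides. The data values are made up for illustration, and the variable names are this example's own.

```python
import math

# Made-up illustrative data (not from the notes).
x = [2, 4, 6, 8]
y = [3, 7, 5, 9]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Covariance and standard deviations, all with divisor n.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
sigma_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sigma_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)

r = cov_xy / (sigma_x * sigma_y)
print(f"Cov = {cov_xy}, sigma_x = {sigma_x:.4f}, sigma_y = {sigma_y:.4f}, r = {r:.4f}")
```

For this data Cov(x, y) = 4 and σ_x = σ_y = sqrt(5), so r = 4 / 5 = 0.8.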

2. Raw Data (Computational) Formula

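This is usually the most practical formula for hand calculation, because it works directly with the raw values and avoids computing deviations from the means.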

r = [ n(Σxy) - (Σx)(Σy) ] / sqrt[ [n(Σx²) - (Σx)²] * [n(Σy²) - (Σy)²] ]

Exam Tip: To use this formula, create a table with 5 columns: x, y, x², y², xy. Then, find the sum (Σ) of each column and plug the values into the formula along with 'n' (the number of pairs).
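The table-based procedure in the tip can be sketched as follows. The data is made up for illustration; the code builds the five columns, totals them, and substitutes the totals into the raw data formula.

```python
import math

# Made-up illustrative data (not from the notes).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Build the five-column table row by row: x, y, x^2, y^2, xy.
table = [(xi, yi, xi * xi, yi * yi, xi * yi) for xi, yi in zip(x, y)]

# Column totals.
sum_x, sum_y, sum_x2, sum_y2, sum_xy = (sum(col) for col in zip(*table))

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Print the table, the totals, and the result.
widths = (4, 4, 5, 5, 5)
print("".join(f"{h:>{w}}" for h, w in zip(("x", "y", "x^2", "y^2", "xy"), widths)))
for row in table:
    print("".join(f"{v:>{w}}" for v, w in zip(row, widths)))
print("sums:", sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(f"r = {r:.4f}")
```

For this data the totals are Σx = 15, Σy = 20, Σx² = 55, Σy² = 86 and Σxy = 66, so r = 30 / sqrt(1500) ≈ 0.775.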

3. Spearman's Rank Correlation Coefficient (ρ or R)

This coefficient measures the strength and direction of the monotonic relationship (a relationship that consistently increases or decreases, but not necessarily in a straight line) between two variables.

It is used when:

The data is ordinal (ranked), like "best," "second best."
The quantitative data has significant outliers.

It is simply Pearson's 'r' calculated on the *ranks* of the data, not the values themselves.

Formula (when ranks are not tied):

R = 1 - [ ( 6 * Σd² ) / ( n * (n² - 1) ) ]

Where:
- d = Difference between the ranks of a pair: R_x - R_y
- n = Number of pairs of observations
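A minimal sketch of the untied formula, assuming made-up scores from two hypothetical judges and the convention that rank 1 goes to the highest score. The helper `ranks_desc` is this example's own and only works when there are no ties.

```python
def ranks_desc(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

# Made-up illustrative scores for five contestants.
judge_x = [86, 72, 95, 60, 78]
judge_y = [80, 75, 90, 65, 70]
n = len(judge_x)

rank_x = ranks_desc(judge_x)   # [2, 4, 1, 5, 3]
rank_y = ranks_desc(judge_y)   # [2, 3, 1, 5, 4]

# d = R_x - R_y for each pair, then the sum of squared differences.
d = [rx - ry for rx, ry in zip(rank_x, rank_y)]
sum_d2 = sum(di ** 2 for di in d)

R = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(rank_x, rank_y, sum_d2, R)
```

Here Σd² = 2, so R = 1 - 12 / 120 = 0.9.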

Formula (when ranks are tied):

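If two or more items have the same value, we assign each of them the average of the ranks they would otherwise occupy. (e.g., if 3 items are tied for 5th, they occupy positions 5, 6 and 7, so they all get rank (5+6+7)/3 = 6, and the next item gets rank 8).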
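The average-rank rule can be sketched as below. The helper `average_ranks` and the data are made up for illustration, with rank 1 assumed to go to the largest value.

```python
def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of their positions."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1      # first position the value occupies
        count = order.count(v)          # size of the tied group
        ranks.append(first + (count - 1) / 2)
    return ranks

# Three values tied for 5th place occupy positions 5, 6 and 7.
scores = [90, 85, 80, 75, 60, 60, 60, 50]
ranks = average_ranks(scores)
print(ranks)
```

The three tied scores of 60 each receive rank 6, and the final score receives rank 8.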

When ties occur, a Correction Factor (CF) must be added to Σd².

CF = Σ [ m * (m² - 1) / 12 ]
- 'm' is the number of observations that share the same value in one tied group. You compute m(m² - 1)/12 for each tied group and sum over *all* tied groups in *both* x and y.

Corrected Formula:
R = 1 - [ ( 6 * (Σd² + CF) ) / ( n * (n² - 1) ) ]
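Putting the tied-rank steps together, the sketch below assigns average ranks, computes Σd² and CF, and applies the corrected formula as stated above. The data and the helper names (`average_ranks`, `tie_correction`) are made up for illustration.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of their positions."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 + (order.count(v) - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over the tied groups of one variable."""
    return sum(m * (m ** 2 - 1) / 12 for m in Counter(values).values() if m > 1)

# Made-up illustrative data: one tied pair in x (20, 20) and one in y (15, 15).
x = [10, 20, 20, 30, 40]
y = [5, 15, 10, 15, 25]
n = len(x)

rank_x = average_ranks(x)      # [5.0, 3.5, 3.5, 2.0, 1.0]
rank_y = average_ranks(y)      # [5.0, 2.5, 4.0, 2.5, 1.0]

sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
cf = tie_correction(x) + tie_correction(y)

R = 1 - (6 * (sum_d2 + cf)) / (n * (n ** 2 - 1))
print(sum_d2, cf, R)
```

Here Σd² = 1.5 and CF = 0.5 + 0.5 = 1, so R = 1 - (6 × 2.5) / 120 = 0.875.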