Unit 1: Statistical Methods

1. Statistical Methods: Definition, Scope, and Limitations
2. Concepts of Statistical Population and Sample
3. Types of Data
4. Scales of Measurement
5. Collection of Data
6. Presentation of Data: Classification and Tabulation
7. Frequency Distributions and Graphical Representations

1. Statistical Methods: Definition, Scope, and Limitations

Definition of Statistics

Statistics is a branch of science that deals with the collection, organization, presentation, analysis, and interpretation of data to make effective decisions.

Collection: Gathering data from various sources (surveys, experiments, etc.).
Organization: Arranging data in a systematic way (e.g., editing, classifying).
Presentation: Displaying data in an understandable format (e.g., tables, graphs).
Analysis: Using statistical tools to process the data (e.g., calculating mean, correlation).
Interpretation: Drawing meaningful conclusions from the analyzed data.

Scope of Statistics

Statistics is used in almost every field:

Business and Economics: Market research, quality control, financial forecasting.
Science: Designing experiments, testing hypotheses (e.g., in medicine, biology).
Government: Policy making, census data, economic planning.
Social Sciences: Analyzing survey data, studying social trends.
Technology: Data mining, machine learning, artificial intelligence.

Limitations of Statistics

Deals with aggregates, not individuals: Statistics provides insights about a group, not a single person or object.
Deals only with quantitative data: It cannot directly study qualitative phenomena like honesty or beauty, though they can sometimes be quantified indirectly (e.g., through rating scales).
Statistical laws are true only "on average": They are not universally exact like laws of physics. They are laws of probability.
Prone to misuse: Data can be manipulated to present a misleading picture. "There are three kinds of lies: lies, damned lies, and statistics."
Results are only as good as the data: If the data collected is biased or flawed, the conclusions will also be flawed (Garbage In, Garbage Out - GIGO).

Exam Tip: Be prepared to define statistics and list its five stages (Collection to Interpretation). The limitations are a very common short-answer question.

2. Concepts of Statistical Population and Sample

Population

Population (or Universe): The entire group of individuals, items, or objects of interest in a statistical study.

Finite Population: A population where the number of units is countable and finite.
- Example: Number of students in your college.
- Example: Number of cars manufactured by a company in a year.
Infinite Population: A population where the number of units is theoretically infinite or so large that it is considered infinite for practical purposes.
- Example: All possible rolls of a die (the die can, in principle, be rolled indefinitely, so the population of rolls has no upper limit).
- Example: The population of stars in the sky.

Sample

Sample: A subset or a part of the population selected to represent the characteristics of the whole population.

We study samples because it is often too costly, time-consuming, or impossible to study the entire population. The process of selecting a sample is called sampling.

Example: To find the average height of all students in your college (population), you select 100 students (sample) and measure their heights.

Parameter vs. Statistic

This is a crucial distinction:

Parameter: A numerical measure that describes a characteristic of the population. (e.g., population mean 'μ', population standard deviation 'σ'). Parameters are usually unknown.
Statistic: A numerical measure that describes a characteristic of the sample. (e.g., sample mean 'x̄', sample standard deviation 's'). We use statistics to estimate unknown parameters.

Mnemonic: Parameter for Population. Statistic for Sample.
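
The distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "population" of 2,000 student heights is simulated (the mean of 168 cm and spread of 9 cm are made-up values), whereas in a real study the population values, and therefore the parameters, would usually be unknown.

```python
import random
import statistics

# Hypothetical population: heights (cm) of 2,000 students, simulated for illustration.
random.seed(42)
population = [random.gauss(168, 9) for _ in range(2000)]

# Parameters describe the whole population.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)   # population SD (divides by N)

# A simple random sample of 100 students, drawn without replacement.
sample = random.sample(population, 100)

# Statistics describe the sample and are used to estimate the parameters.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)            # sample SD (divides by n - 1)

print(f"Parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

Note that `pstdev` is used for the population and `stdev` for the sample; the sample mean will typically be close to, but not exactly equal to, the population mean.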

3. Types of Data

Data can be broadly classified into two main types:

1. Qualitative (or Categorical) Data

Data that represents characteristics or attributes. It cannot be measured numerically but can be sorted into categories.

Example: Gender (Male, Female, Other)
Example: Eye Color (Blue, Brown, Green)
Example: Blood Type (A, B, AB, O)

2. Quantitative (or Numerical) Data

Data that is numerical and represents a measurable quantity.

a) Discrete Data

Data that can only take specific, distinct values (usually integers). It is "counted." There are gaps between possible values.

Example: Number of children in a family (0, 1, 2, 3... but not 2.5)
Example: Number of cars passing a toll booth in an hour.
Example: Number of correct answers on a quiz.

b) Continuous Data

Data that can take any value within a given range. It is "measured." There are no gaps between possible values (though our measurements are limited by our tools).

Example: Height of a student (e.g., 170.5 cm, 170.51 cm...)
Example: Temperature of a room.
Example: Time taken to run a race.

Common Mistake: Don't assume that discrete data must be whole numbers. Shoe size (e.g., 7, 7.5, 8, 8.5) is discrete because it can only take specific values, not *any* value between 7 and 9. Money is also technically discrete (you can't have $10.502), but it's often treated as continuous because the number of possible values is so large.

Other Data Classifications

a) Cross-Sectional Data

Data collected on different subjects (people, firms, countries) at the same point in time or over the same period.

Example: The stock prices of 100 different companies on January 1, 2025.
Example: The GDP of all European countries in the year 2024.

b) Time Series Data

Data collected on the same subject or variable over a period of time, usually at regular intervals.

Example: The stock price of *one* company (e.g., Google) recorded daily for a year.
Example: The monthly rainfall in a city from 2010 to 2020.
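
As a rough sketch of how the two layouts differ in practice, the snippet below stores each kind of data in a plain Python structure. The company names, dates, and prices are hypothetical values chosen only for illustration.

```python
# Cross-sectional data: many subjects, one point in time (hypothetical prices).
cross_section = {
    "Company A": 101.5,
    "Company B": 47.2,
    "Company C": 250.0,
}

# Time series data: one subject, many points in time (hypothetical prices).
time_series = [
    ("2025-01-01", 101.5),
    ("2025-01-02", 102.1),
    ("2025-01-03", 100.8),
]

# A typical cross-sectional question compares subjects with each other.
highest = max(cross_section, key=cross_section.get)

# A typical time-series question looks at change from one period to the next.
prices = [price for _, price in time_series]
changes = [round(later - earlier, 2) for earlier, later in zip(prices, prices[1:])]

print("Highest price on the day:", highest)
print("Day-to-day changes:", changes)
```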

4. Scales of Measurement

These scales (or levels) describe the nature of information within the values assigned to variables. They are hierarchical (each level up adds more properties).

1. Nominal Scale

The simplest scale. Data consists of categories or names only. There is no natural order or ranking.

Properties: Identity (e.g., 'Male' is different from 'Female').
Operations: Counting (frequency), finding the mode.
Examples: Jersey numbers, zip codes, eye color, marital status.

2. Ordinal Scale

Data can be categorized and these categories have a natural order or rank. However, the *differences* between the ranks are not meaningful or uniform.

Properties: Identity + Order.
Operations: Count, mode, median, rank correlation.
Examples: Customer satisfaction (Poor, Fair, Good, Excellent), educational level (High School, Bachelor's, Master's), grades (A, B, C).
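
A minimal sketch of why the median and mode work on ordinal data: only the order of the categories is used, never the distance between them. The seven survey responses below are hypothetical.

```python
from statistics import median_low, mode

# Ordered categories, from lowest to highest rank.
levels = ["Poor", "Fair", "Good", "Excellent"]
rank = {label: position for position, label in enumerate(levels)}

# Hypothetical survey responses.
responses = ["Good", "Poor", "Excellent", "Good", "Fair", "Good", "Fair"]

# The median needs only the ordering: convert to ranks, take the middle one,
# then translate back to the category label.
ranks = [rank[r] for r in responses]
median_label = levels[median_low(ranks)]

# The mode needs only identity (it would work on nominal data too).
mode_label = mode(responses)

print("Median:", median_label)
print("Mode:", mode_label)
```

The mean is deliberately not computed: averaging the rank numbers would assume the gap between "Poor" and "Fair" equals the gap between "Good" and "Excellent", which an ordinal scale does not guarantee.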

3. Interval Scale

Data is numerical, ordered, and the differences between values are meaningful and uniform. However, there is no true zero point (zero is arbitrary and doesn't mean "absence").

Properties: Identity + Order + Meaningful Differences.
Operations: Count, mode, median, mean, standard deviation. Addition and subtraction are meaningful.
Examples: Temperature in Celsius or Fahrenheit (0°C doesn't mean "no heat"), calendar years (Year 0 is arbitrary), IQ scores.

4. Ratio Scale

The highest level of measurement. It has all the properties of the interval scale, plus a true zero point, which indicates the "absence" of the quantity.

Properties: Identity + Order + Meaningful Differences + True Zero.
Operations: All statistical operations, including multiplication and division (we can form ratios).
Examples: Height, weight, age, income, distance, number of customers. (0 kg means "no weight," $0 means "no money").

Exam Tip: A classic question is "Differentiate between Interval and Ratio scales." The key answer is the true zero point. Ask yourself: "Does 0 mean the absence of the thing?" If yes, it's Ratio. If no, it's Interval.
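
The true-zero test can be checked numerically. The sketch below (with illustrative values) shows that a ratio formed on an interval scale changes when the unit changes, while a ratio formed on a ratio scale does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: temperature in Celsius vs. Fahrenheit.
warm_c, cool_c = 20.0, 10.0
ratio_celsius = warm_c / cool_c                                   # 2.0
ratio_fahrenheit = (celsius_to_fahrenheit(warm_c)
                    / celsius_to_fahrenheit(cool_c))              # 68 / 50 = 1.36

# The two "ratios" disagree, so "20°C is twice as hot as 10°C" is not meaningful.
print(ratio_celsius, ratio_fahrenheit)

# Ratio scale: weight in kilograms vs. pounds (both have a true zero).
KG_TO_LB = 2.20462
heavy_kg, light_kg = 80.0, 40.0
ratio_kg = heavy_kg / light_kg
ratio_lb = (heavy_kg * KG_TO_LB) / (light_kg * KG_TO_LB)

# The ratios agree in every unit, so "80 kg is twice as heavy as 40 kg" is meaningful.
print(ratio_kg, ratio_lb)
```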

5. Collection of Data

1. Primary Data

Data collected for the first time by the researcher, specifically for the purpose of the study. It is original, raw data.

Major Sources (Methods of Collection):

Direct Personal Interview: Face-to-face interview with respondents.
- Pros: High accuracy, can use non-verbal cues, doubts can be clarified.
- Cons: Expensive, time-consuming, potential for interviewer bias.
Indirect Oral Investigation: Interviewing third parties or witnesses who are close to the situation. (Used when the direct respondent is unavailable or unwilling).
Questionnaires: A set of written questions given to respondents.
- Mailed Questionnaire: Sent by post or email. (Pros: Wide coverage, low cost. Cons: Low response rate, doubts can't be clarified).
- Questionnaires filled in by enumerators: The investigator visits respondents and fills in the questionnaire on their behalf. (Combines features of interview and questionnaire).
Observation: The researcher observes and records behavior or events as they happen. (e.g., observing traffic patterns).
Experiments: Controlled studies to determine cause-and-effect. (e.g., a medical drug trial).

2. Secondary Data

Data that has already been collected by someone else for some other purpose, but is used by the researcher for their current study. It is "second-hand" data.

Major Sources:

Published Sources:
- Government publications (e.g., Census reports, economic surveys).
- International publications (e.g., World Bank, IMF reports).
- Journals, magazines, newspapers.
- Research papers and university reports.
Unpublished Sources:
- Internal company records, databases.
- Unpublished theses, private records.

Precautions in Using Secondary Data

Before using secondary data, you must check its:

Reliability: Who collected the data? What was their reputation? Were the methods sound?
Suitability: Does the data fit your research purpose? The original purpose might be different.
Adequacy: Is the data sufficient for your study? Is the sample size large enough? Is it up-to-date?

6. Presentation of Data: Classification and Tabulation

After collection, raw data is unorganized and hard to understand. We must organize it.

Classification

The process of sorting data into groups or classes based on their common characteristics.

Geographical: By location (e.g., sales by state).
Chronological: By time (e.g., population by year).
Qualitative: By attribute (e.g., students by gender).
Quantitative: By numerical value (e.g., people by income group).

Tabulation

The systematic arrangement of classified data into rows and columns with a title and headings.

Main Parts of a Statistical Table:

Table Number: For identification (e.g., "Table 1.1").
Title: A clear and concise description of the table's contents.
Headnote: A brief note below the title explaining the unit of measurement (e.g., "in '000s" or "in USD").
Stubs: The headings for the rows (usually on the left).
Captions: The headings for the columns.
Body: The main part of the table containing the numerical data.
Footnote: To clarify any specific item in the table.
Source Note: To indicate the source of the data (especially for secondary data).

Example of a Table:

Table 1: Student Enrollment by Gender and Course, 2025
Course	Male	Female	Total
Statistics	50	70	120
Economics	80	60	140
Total	130	130	260
Source: College Admission Records
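
The row and column totals in a table like this can be verified mechanically. The following sketch stores the body of Table 1 as a nested dictionary (one possible representation, chosen here for illustration) and recomputes the totals.

```python
# Body of Table 1: stubs (courses) map to captions (gender) and cell values.
enrollment = {
    "Statistics": {"Male": 50, "Female": 70},
    "Economics": {"Male": 80, "Female": 60},
}

# Row totals: sum across the columns of each row.
row_totals = {course: sum(counts.values()) for course, counts in enrollment.items()}

# Column totals: sum down each column.
column_totals = {}
for counts in enrollment.values():
    for gender, count in counts.items():
        column_totals[gender] = column_totals.get(gender, 0) + count

# The grand total must be the same whether summed by rows or by columns.
grand_total = sum(row_totals.values())

print(f"{'Course':<12}{'Male':>8}{'Female':>8}{'Total':>8}")
for course, counts in enrollment.items():
    print(f"{course:<12}{counts['Male']:>8}{counts['Female']:>8}{row_totals[course]:>8}")
print(f"{'Total':<12}{column_totals['Male']:>8}{column_totals['Female']:>8}{grand_total:>8}")
```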

7. Frequency Distributions and Graphical Representations

Frequency Distribution

A table that organizes data into classes (or groups) and shows the number of observations (frequency) that fall into each class.

1. Discrete Frequency Distribution

Used for discrete data. We list each distinct value and its corresponding frequency.

Example: Number of children in 20 families: 0, 1, 2, 2, 1, 3, 0, 1, 1, 2, 3, 2, 1, 0, 1, 2, 2, 1, 1, 0

Discrete Frequency Distribution
Number of Children (x)	Tally Marks	Frequency (f)
0	||||	4
1	||||/ |||	8
2	||||/ |	6
3	||	2
Total		20
Note: ||||/ stands for a bundle of five (four strokes crossed by a fifth).
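
The same distribution can be built programmatically. A minimal sketch using `collections.Counter` from the Python standard library on the 20 observations listed above:

```python
from collections import Counter

# Number of children in 20 families (the raw data from the example).
children = [0, 1, 2, 2, 1, 3, 0, 1, 1, 2, 3, 2, 1, 0, 1, 2, 2, 1, 1, 0]

# Counter tallies how many times each distinct value occurs.
frequency = Counter(children)

print("x\tf")
for value in sorted(frequency):
    print(f"{value}\t{frequency[value]}")
print(f"Total\t{sum(frequency.values())}")
```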

2. Continuous Frequency Distribution

Used for continuous data (or discrete data with a wide range). Data is grouped into class intervals.

Class Limits: The lowest (Lower Limit) and highest (Upper Limit) values a class can have. (e.g., 10-19).
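Class Boundaries: The true limits of a class, obtained by splitting the gap between adjacent classes equally so that no gaps remain. (e.g., the classes 10-19 and 20-29 have a gap of 1, so the first class becomes 9.5 - 19.5).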
Class Width: The difference between the upper and lower class boundaries.
Class Mark (Midpoint): (Lower Limit + Upper Limit) / 2.
Cumulative Frequency (CF): The sum of frequencies up to a certain class (Less than CF) or from that class onwards (More than CF).
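
These definitions can be worked through on a small example. The class limits and frequencies below are hypothetical, and the sketch assumes the gap between adjacent classes is the same throughout.

```python
from itertools import accumulate

# Hypothetical class limits and frequencies.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]
frequencies = [3, 7, 6, 4]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]            # 20 - 19 = 1

# Class boundaries: extend each limit by half the gap.
boundaries = [(lower - gap / 2, upper + gap / 2) for lower, upper in class_limits]

# Class width: upper boundary minus lower boundary.
widths = [upper - lower for lower, upper in boundaries]

# Class mark: midpoint of the class limits.
marks = [(lower + upper) / 2 for lower, upper in class_limits]

# Less-than CF accumulates from the first class downwards;
# more-than CF accumulates from the last class upwards.
less_than_cf = list(accumulate(frequencies))
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

for row in zip(class_limits, boundaries, widths, marks, frequencies, less_than_cf, more_than_cf):
    print(*row, sep="\t")
```

For the first class this gives boundaries 9.5 and 19.5, a width of 10, and a class mark of 14.5, matching the figures quoted in the definitions.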

Graphical Representations

1. Histogram

A graph of a continuous frequency distribution. It consists of adjacent rectangles.

The X-axis represents the class boundaries.
The Y-axis represents the frequency (or frequency density).
The area of each rectangle is proportional to the frequency of that class.
If class widths are equal, the height of the rectangle is proportional to the frequency.
There are no gaps between the bars.
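
When class widths are unequal, the bar height should be the frequency density (frequency divided by class width) so that area stays proportional to frequency. A small sketch with hypothetical classes:

```python
# Hypothetical class boundaries with unequal widths, and their frequencies.
boundaries = [(0, 10), (10, 20), (20, 40), (40, 80)]
frequencies = [5, 8, 12, 8]

# Frequency density = frequency / class width; this is the bar height.
densities = [f / (upper - lower) for f, (lower, upper) in zip(frequencies, boundaries)]

# Area of each bar = height * width, which recovers the frequency.
areas = [d * (upper - lower) for d, (lower, upper) in zip(densities, boundaries)]

# Crude text rendering: one '#' per 0.1 units of density.
for (lower, upper), density in zip(boundaries, densities):
    print(f"{lower:>3}-{upper:<3} {'#' * round(density * 10)}  (density {density:.1f})")
```

Here the classes 10-20 and 40-80 have the same frequency (8), but the wider class gets a much shorter bar, which is what keeps the picture honest.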

2. Frequency Polygon

A line graph representing a frequency distribution.

It is drawn by plotting the class marks (midpoints) on the X-axis against the frequencies on the Y-axis.
The points are then joined by straight lines.
The polygon is "closed" by joining the first and last points to hypothetical class marks at either end with zero frequency.
A frequency polygon can also be drawn by joining the midpoints of the tops of the rectangles in a histogram.
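
The points to plot can be computed as follows. This is a sketch with hypothetical classes of equal width; the two extra points with zero frequency are the ones that "close" the polygon.

```python
# Hypothetical class boundaries (equal width) and frequencies.
boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5)]
frequencies = [3, 7, 6, 4]

# Common class width (assumes all classes are equally wide).
width = boundaries[0][1] - boundaries[0][0]

# Class marks: midpoints of each class.
marks = [(lower + upper) / 2 for lower, upper in boundaries]

# Close the polygon with one hypothetical class mark on each side, at zero frequency.
points = (
    [(marks[0] - width, 0)]
    + list(zip(marks, frequencies))
    + [(marks[-1] + width, 0)]
)

for x, y in points:
    print(f"({x}, {y})")
```

Joining these points in order with straight lines gives the frequency polygon.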

3. Frequency Curve

A smoothed version of a frequency polygon. It is drawn as a freehand curve through the points of a frequency polygon.

It gives a better idea of the shape of the distribution (e.g., normal, skewed).

Exam Tip: Know the key difference: A Histogram uses class *boundaries* on the X-axis and has adjacent bars. A Bar Graph (used for categorical data) uses category *names* and has gaps between the bars. A Frequency Polygon uses class *midpoints*.