Imagine you’re the quality control manager at a smartphone factory producing 10,000 phones daily. How do you ensure quality without testing every single phone? Or consider a political poll predicting election results from just 1,500 voters out of millions. How can such small samples reveal meaningful truths about vast populations?
The Central Question of Statistics
How can we draw reliable conclusions about large groups (populations) by studying only small subsets (samples)?
This chapter explores the mathematical foundation that makes statistical inference possible: the theory of sampling distributions.
Before diving into formal definitions, let’s explore these concepts through a scenario that might be very familiar to you.
Example: The SSR Card Drop Rate Investigation
The Setup: Your favorite mobile game claims that the legendary SSR cards have a 1% drop rate. But you and your classmates suspect the game company might be lying — the drop rates seem lower than advertised.
The Investigation:
Population: All possible cards that could be drawn from the game’s virtual pool — an infinite collection where 1% should be SSR cards
Parameter: The game company’s claimed true SSR drop rate: $p = 0.01$ (1%)
Sample: 10 classmates collectively perform $n = 200$ card draws
Statistic: You observe $0$ SSR cards out of 200 draws, giving sample drop rate: $\hat{p} = \frac{0}{200} = 0$
The Key Question: Since $\hat{p} = 0 < 0.01$, does this prove the game company is deceiving players? Or could this difference be just random luck?
The Statistical Challenge: To answer this question, we need to understand how $\hat{p}$ varies from sample to sample. This is the heart of sampling distribution theory!
This investigation perfectly illustrates why we need to study sampling distributions. Let’s now build the formal vocabulary to analyze such problems systematically.
Now that we’ve seen these concepts in action, let’s define them precisely:
Definition: Sampling Unit
A sampling unit is an individual member of a population that can be selected for inclusion in a sample.
Definition: Sampling Frame
A sampling frame is a practical list of sampling units used to represent a population. It serves as the actual source from which samples are drawn.
Example: Student Life at University
Scenario: The university wants to study students’ daily screen time.
Population: All 25,000 students enrolled at the university
Sampling Unit: Each individual student
Sampling Frame: The university’s student database (24,800 students with valid contact info)
Sample: 500 students randomly selected from the database
Statistics:
Sample mean screen time: $\bar{X} = 6.2$ hours
Sample range: $R = 9.1$ hours
Sample median: $m = 6.0$ hours
Key Insight: Notice that the sampling frame (24,800) is smaller than the population (25,000) due to missing contact information. This is common in practice!
Example: In-Class Exercise
Identify the five key concepts for the following scenario:
Scenario: The GCGS canteen wants to determine whether the food served at the new window is popular among students.
Identify:
Population: ____________
Sampling unit: ____________
A realistic sampling frame: ____________
A practical sample size: ____________
Relevant statistics: ____________
Now let’s focus on the most crucial concept: what exactly makes something a “statistic”?
The Golden Rule of Statistics
A statistic is ONLY a number calculated from sample data.
Key Requirements:
Must be calculated using only observed sample values
Cannot involve any unknown population parameters
Different samples will produce different statistic values
Let’s test your understanding with concrete examples:
Example: What Is and Isn’t a Statistic — Practice Problems
Suppose we have a sample of 5 students’ heights: $X_1 = 160, X_2 = 165, X_3 = 170, X_4 = 155, X_5 = 175$ (all in cm). The population mean height $\mu$ and variance $\sigma^2$ are unknown.
Identify which of the following are statistics:
$\bar{X} = \frac{160 + 165 + 170 + 155 + 175}{5} = 165$ cm. Answer: _____
$s^2 = \frac{(160-165)^2 + (165-165)^2 + \cdots + (175-165)^2}{5-1} = 62.5$. Answer: _____
$X_1 - \mu = 160 - \mu$ (where $\mu$ is the unknown population mean). Answer: _____
$\frac{\bar{X} - \mu}{\sigma/\sqrt{5}}$ (the standardized sample mean). Answer: _____
The number of students taller than 170 cm, which equals 1. Answer: _____
$2\bar{X} + 10 = 2(165) + 10 = 340$. Answer: _____
$\sum_{i=1}^5 (X_i - \mu)^2$ (sum of squared deviations from the population mean). Answer: _____
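One informal way to apply the golden rule is to ask whether a program that sees only the sample could produce the number. The sketch below, which uses Python's standard `statistics` module and variable names chosen purely for illustration, computes a few quantities from the five heights alone. Anything that needs $\mu$ or $\sigma$ cannot be written this way, because those values are not available to the program.

```python
import statistics

# The observed sample: five students' heights in cm.
heights_cm = [160, 165, 170, 155, 175]

# Each quantity below uses only the observed values.
sample_mean = statistics.mean(heights_cm)
sample_variance = statistics.variance(heights_cm)  # divides by n - 1
count_above_170 = sum(h > 170 for h in heights_cm)

print(sample_mean, sample_variance, count_above_170)
```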
Example: Real-World Statistics in Action
Let’s revisit our mobile game investigation with this new understanding:
Scenario: 10 students perform 200 card draws total, observing 3 SSR cards.
Which of these are statistics?
The observed number of SSR cards: 3
Sample SSR rate: $\hat{p} = \frac{3}{200} = 0.015$
The true population SSR rate: $p = 0.01$
The difference between our estimate and the claimed rate: $\hat{p} - 0.01 = 0.005$
The probability of getting exactly 3 SSRs if $p = 0.01$
The Critical Insight: Statistics are our “messengers” — they carry information from the sample to help us learn about the unknown population. But they’re imperfect messengers because they vary from sample to sample!
Remember our SSR investigation? In the revisited scenario we observed $\hat{p} = 0.015$ from 200 draws, which is higher than the claimed 0.01. But before drawing any conclusion about the company’s claim, we need to understand: how much should $\hat{p}$ vary due to random sampling alone?
Example: Class Activity — Simulating the SSR Drop Rate
The Setup:
We’ll assume the game company is telling the truth: $p = 0.01$ (1% SSR rate).
Step 1 — Simulation:
Use a random number generator (such as random.org) to simulate one sample of 200 card draws, and repeat the whole procedure 20 times (for example, once per student):
Generate 200 random integers between $1$ and $100$
Count how many are equal to $1$ (these represent SSR cards, since each integer has a 1% chance of being 1)
Calculate your personal $\hat{p} = \frac{\text{SSR count}}{200}$
Step 2 — Collect Data:
Collect all 20 $\hat{p}$ values.
Step 3 — Create the Distribution:
Count frequencies and create a histogram:
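| $\hat{p}$ value | Count | Frequency |
|---|---|---|
| 0.000 | | |
| 0.005 | | |
| 0.010 | | |
| 0.015 | | |
| 0.020 | | |
| 0.025 | | |
| 0.030 | | |

(With 200 draws, $\hat{p}$ can only take values that are multiples of $\frac{1}{200} = 0.005$. Add rows if larger values occur.)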
Step 4 — Draw your conclusion:
Draw your conclusion based on the distribution you created. How suspicious is the rate announced by the game company?
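If a random number website is not at hand, the four steps of this activity can be sketched in a few lines of Python. This is only an illustration under the same assumption as the activity (the company is honest, so each draw is an integer from 1 to 100 and the value 1 counts as an SSR). The function name `simulate_p_hat` and the seed are arbitrary choices, not part of the original activity.

```python
import random
from collections import Counter

def simulate_p_hat(rng, n_draws=200):
    """Simulate one sample of card draws and return the sample SSR rate."""
    # Each draw is an integer from 1 to 100; the value 1 stands for an SSR card.
    ssr_count = sum(rng.randint(1, 100) == 1 for _ in range(n_draws))
    return ssr_count / n_draws

rng = random.Random(2024)  # arbitrary fixed seed so the run is repeatable

# Steps 1 and 2: twenty samples of 200 draws each, one p-hat per sample.
p_hats = [simulate_p_hat(rng) for _ in range(20)]

# Step 3: tally the values and print a rough text histogram.
frequency = Counter(p_hats)
for value in sorted(frequency):
    print(f"{value:.3f}  {'#' * frequency[value]}")
```

Changing the seed or raising the number of samples from 20 to a few thousand gives a smoother picture of how much $\hat{p}$ moves around when the claimed rate is true.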
The Statistical Revelation: Sampling distributions allow us to quantify sampling error and determine whether an observed statistic represents a rare event or normal variation.
This experiment demonstrates the revolutionary insight: instead of thinking of $\hat{p}$ (or any statistic) as just a number, we recognize it as a random variable with its own distribution.
Definition: Sampling Distribution
The sampling distribution of a statistic is the probability distribution that specifies all possible values the statistic can take and the probability of each value occurring, across all possible samples of the same size from the same population.
The Key Insight: Every time you take a sample, your statistic will be different. The sampling distribution tells you how these different values are distributed and helps you distinguish between “normal variation” and “something unusual is happening.”
Now that we’ve experienced sampling distributions through simulation, let’s see how to construct them mathematically. We’ll use a different discrete example to build our theoretical understanding.
Example: The Mystery Treasure Box — Advanced Statistical Construction
Gaming Scenario: You’re playing an adventure game where treasure boxes contain different valuable coins. Market research shows the treasure drop rates, and you want to understand the risk patterns when opening multiple boxes.
Population: A vast collection of treasure boxes with the following contents:
Population Distribution:
| Item | Value (coins) | Probability | Rarity |
|---|---|---|---|
| Gold Coin | 100 | 0.1 | Legendary |
| Silver Coin | 50 | 0.3 | Rare |
| Copper Coin | 10 | 0.6 | Common |
Research Question: Instead of studying average value (which is predictable), let’s investigate something more exciting: “What’s the distribution of the most valuable item found when opening 2 boxes?”
This statistic $M = \max(X_1, X_2)$ represents the “best luck” in a small treasure hunt!
Step 1: Enumerate All Possible Outcomes
For 2 boxes, we have $3^2 = 9$ possible combinations:

| Sample | $(X_1, X_2)$ | Probability | Max Value $M$ |
|---|---|---|---|
| 1 | (10, 10) | $0.6 \times 0.6 = 0.36$ | 10 |
| 2 | (10, 50) | $0.6 \times 0.3 = 0.18$ | 50 |
| 3 | (10, 100) | $0.6 \times 0.1 = 0.06$ | 100 |
| 4 | (50, 10) | $0.3 \times 0.6 = 0.18$ | 50 |
| 5 | (50, 50) | $0.3 \times 0.3 = 0.09$ | 50 |
| 6 | (50, 100) | $0.3 \times 0.1 = 0.03$ | 100 |
| 7 | (100, 10) | $0.1 \times 0.6 = 0.06$ | 100 |
| 8 | (100, 50) | $0.1 \times 0.3 = 0.03$ | 100 |
| 9 | (100, 100) | $0.1 \times 0.1 = 0.01$ | 100 |

Step 2: Construct the Sampling Distribution of the Maximum $M$

| Max Value $M$ | Samples | Probability | Gaming Interpretation |
|---|---|---|---|
| 10 | {1} | 0.36 | “Unlucky run” |
| 50 | {2, 4, 5} | $0.18 + 0.18 + 0.09 = 0.45$ | “Decent find” |
| 100 | {3, 6, 7, 8, 9} | $0.06 + 0.03 + 0.06 + 0.03 + 0.01 = 0.19$ | “Jackpot!” |
Statistical Insights:
$P(\text{Best item is just copper}) = 0.36$: more than a 1/3 chance of disappointment!
$P(\text{Find at least one gold coin}) = 0.19$: almost double the single-box gold rate (0.1), since $1 - 0.9^2 = 0.19$
$E(M) = 10 \times 0.36 + 50 \times 0.45 + 100 \times 0.19 = 45.1$ coins
Gaming Strategy Implication: Even though gold has only a 10% drop rate per box, there is a 19% chance that gold is your best item across 2 boxes. This shows why maximum statistics behave very differently from means!
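The enumeration in Steps 1 and 2 is mechanical, so it can also be handed to a short program. The sketch below is one possible way to do it, using exact fractions to avoid rounding. The helper name `sampling_distribution` is an illustrative choice, and the code assumes independent draws from the treasure box population given in this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Population distribution: coin value -> probability.
population = {10: Fraction(6, 10), 50: Fraction(3, 10), 100: Fraction(1, 10)}

def sampling_distribution(statistic, n):
    """Enumerate every ordered sample of size n and total the probability
    attached to each value of the statistic."""
    dist = defaultdict(Fraction)
    for sample in product(population, repeat=n):
        prob = Fraction(1)
        for value in sample:
            prob *= population[value]  # independent draws multiply
        dist[statistic(sample)] += prob
    return dict(dist)

max_dist = sampling_distribution(max, 2)
expected_max = sum(value * prob for value, prob in max_dist.items())

for value, prob in sorted(max_dist.items()):
    print(value, float(prob))
print("E(M) =", float(expected_max))
```

Passing a different function in place of `max` (for example `min`, or a function returning the sample mean) gives the sampling distribution of that statistic from the same population.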
Example: In-Class Exercise — Card Game Five-Draw Pity System
A popular card game uses the classic “Five-Draw Pity System”: if you don’t get a legendary card in your first 4 draws, the 5th draw is guaranteed to be legendary!
Normal Draw Rates:
| Card Type | Probability |
|---|---|
| Legendary | 0.2 |
| Epic | 0.8 |
Pity System Rule: If draws 1, 2, 3, AND 4 are all Epic, then draw 5 is automatically Legendary.
Research Question: What’s the sampling distribution of the “first legendary position” statistic $L$?
Step 1: List the key scenarios and calculate probabilities:

| Case | Pattern | Probability Calculation | Statistic $L$ |
|---|---|---|---|
| 1 | (L, *, *, *, *) | $P(X_1 = L) = 0.2$ | $L =$ ___ |
| 2 | (E, L, *, *, *) | $P(X_1 = E) \times P(X_2 = L) =$ ___ | $L =$ ___ |
| 3 | (E, E, L, *, *) | ___ | $L =$ ___ |
| 4 | (E, E, E, L, *) | ___ | $L =$ ___ |
| 5 | (E, E, E, E, L) | ___ | $L =$ ___ |

Step 2: Construct the sampling distribution of $L$:

| $L$ Value | Probability | Gaming Experience |
|---|---|---|
| 1 | 0.2 | “Instant jackpot!” |
| 2 | ___ | “Quick luck” |
| 3 | ___ | “Third time’s charm” |
| 4 | ___ | “Almost triggered pity” |
| 5 | ___ | “Pity system saves you” |
Step 3: Answer these strategic questions:
What’s the probability of needing the pity system?
What’s the expected position of the first legendary?
How does this compare to 5 independent draws with no pity system?
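After filling in the tables by hand, a simulation offers a rough check. The sketch below is an assumption-laden illustration rather than the game's real mechanics: it supposes each normal draw is independently Legendary with probability 0.2 and that the fifth draw is forced to be Legendary when the first four are all Epic. The function name and the seed are arbitrary.

```python
import random
from collections import Counter

def first_legendary_position(rng, p_legendary=0.2, pity_draw=5):
    """Return the position of the first Legendary card in one five-draw run."""
    for position in range(1, pity_draw):
        if rng.random() < p_legendary:
            return position
    # Draws 1 to 4 were all Epic, so the pity rule forces draw 5.
    return pity_draw

rng = random.Random(7)  # arbitrary fixed seed so the run is repeatable
n_trials = 100_000
counts = Counter(first_legendary_position(rng) for _ in range(n_trials))

# Relative frequencies approximate the sampling distribution of L.
estimated = {pos: counts[pos] / n_trials for pos in range(1, 6)}
estimated_mean = sum(pos * freq for pos, freq in estimated.items())

for pos, freq in estimated.items():
    print(pos, round(freq, 3))
print("estimated E(L):", round(estimated_mean, 2))
```

The simulated frequencies will not match the exact probabilities to every decimal place, but they should land close enough to reveal an arithmetic slip in the hand calculation.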
Example: In-Class Exercise — Package Delivery Service
A delivery company has two types of packages: small (containing 100 items) and large (containing 200 items). The company maintains a 4:3 ratio of small to large packages in their warehouse.
Let $Z$ represent the number of items in a randomly selected package.
Find the distribution of $Z$.
Calculate $E(Z)$ and $\text{Var}(Z)$.
Now consider selecting a random sample of 3 packages with item counts $Z_1, Z_2, Z_3$.
List all possible combinations of package types in your sample.
Find the sampling distribution of the sample mean $\bar{Z}$.
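Once the exercise has been attempted by hand, the result can be checked by brute-force enumeration. The sketch below assumes the 4:3 ratio translates into probabilities $\frac{4}{7}$ and $\frac{3}{7}$ and that the three packages are selected independently (reasonable for a large warehouse). Variable names are illustrative only.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Distribution of Z implied by the 4:3 small-to-large ratio.
package_dist = {100: Fraction(4, 7), 200: Fraction(3, 7)}

# Mean and variance of a single package: Var(Z) = E(Z^2) - E(Z)^2.
mean_z = sum(z * p for z, p in package_dist.items())
var_z = sum(z**2 * p for z, p in package_dist.items()) - mean_z**2

# Enumerate all 2^3 ordered samples and collect the sample mean of each.
sample_mean_dist = defaultdict(Fraction)
for sample in product(package_dist, repeat=3):
    prob = Fraction(1)
    for z in sample:
        prob *= package_dist[z]
    sample_mean_dist[Fraction(sum(sample), 3)] += prob

print("E(Z) =", mean_z, " Var(Z) =", var_z)
for z_bar, prob in sorted(sample_mean_dist.items()):
    print(z_bar, prob)
```

A useful sanity check on any hand-built table is that its probabilities add up to 1 and that the expected value of $\bar{Z}$ equals $E(Z)$.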
Example: In-Class Exercise — Grading System
A teacher uses a mixed grading system where homework counts for 50% and quiz scores count for 50% of the final grade, so the final score is calculated as:
$$Y = \frac{H + Q}{2}$$
where $H$ is the homework score and $Q$ is the quiz score of the student.
Population Information:
Homework scores: 70% of students get 8 points, 30% get 6 points
Quiz scores: 60% of students get 10 points, 40% get 8 points
Homework and quiz scores are independent
List all possible values of $Y$.
Find the distribution of $Y$.
The teacher randomly selects $3$ students and records their scores as $Y_1, Y_2, Y_3$.
Calculate the sampling distribution of the sample range.
Example: In-Class Exercise — HeyTea Market Research
Real-World Scenario: HeyTea is considering opening a new store near our school. They hired you as student consultants to conduct market research!
Question: “What percentage of students would buy milk tea at least once a week if we opened here?”
Mission: Design and analyze a sampling study to help HeyTea make this important business decision.
Step 1: Define Your Study Framework
Population: ____________
Parameter of Interest: Let $p$ = true proportion of students who would buy milk tea weekly. HeyTea needs $p \geq 0.40$ (40%) to make the store profitable.
Sampling Frame: ____________
Sample Size: Your team decides to survey $n = 50$ randomly selected students.
Step 2: Business Implications
If your class found $\hat{p}_{\text{class}} = 0.36$ (36%), should HeyTea immediately conclude the market isn’t viable?
What does sampling variability tell us about making business decisions from limited data?
Statistical Challenge: If the true population proportion is actually $p = 0.4$ (exactly at the profitability threshold), what’s the probability that a sample of 50 students would give $\hat{p} \leq 0.36$, potentially leading HeyTea to make the wrong business decision?
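The Statistical Challenge can be turned into a counting problem: with $n = 50$, the event $\hat{p} \leq 0.36$ is the same as at most 18 students saying yes. The sketch below evaluates that probability under the assumption that the number of yes answers follows a Binomial(50, 0.4) distribution, which treats the 50 responses as independent. The helper `binomial_cdf` is written here for illustration and is not a library function.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

n = 50
p_true = 0.40
max_yes = 18  # p-hat <= 0.36 means at most 0.36 * 50 = 18 "yes" answers

prob_misleading_sample = binomial_cdf(max_yes, n, p_true)
print(f"P(p-hat <= 0.36 | p = 0.4) = {prob_misleading_sample:.4f}")
```

Rerunning the calculation with a larger `n` (and the matching cutoff count) shows how the chance of a misleading sample changes with sample size.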
Now we can return to our original question with the proper theoretical framework and an appropriate statistical model. Since we’re dealing with rare events (a low probability per draw and a large number of draws), the Poisson distribution is a good approximation.
Example: Understanding the SSR Count Distribution
The Smarter Approach: Instead of analyzing proportions, let’s directly study the count of SSR cards!
Our Statistic: $X$ = number of SSR cards in 200 draws
The Mathematical Foundation: If the company’s claim is true ($p = 0.01$), then:
$$X \sim \text{Binomial}(n = 200,\ p = 0.01) \approx \text{Poisson}(\lambda = np = 200 \times 0.01 = 2)$$
This makes intuitive sense: we expect about 2 SSR cards on average in 200 draws.
What This Distribution Looks Like:
| $X$ (SSR Count) | $P(X = x)$ | Interpretation |
|---|---|---|
| 0 | $e^{-2} \cdot \frac{2^0}{0!} = 0.135$ | “Terrible luck” (our original observation) |
| 1 | $e^{-2} \cdot \frac{2^1}{1!} = 0.271$ | “Below average” |
| 2 | $e^{-2} \cdot \frac{2^2}{2!} = 0.271$ | “Exactly expected” |
| 3 | $e^{-2} \cdot \frac{2^3}{3!} = 0.180$ | “Above average” (the revisited scenario) |
| 4 | $e^{-2} \cdot \frac{2^4}{4!} = 0.090$ | “Lucky streak” |
| 5+ | $\approx 0.053$ | “Extremely lucky” |
The Critical Question: How unusual is observing 0 SSR cards if the company is honest?
Calculating the Evidence:
$$P(X = 0) = 0.135$$
Interpretation: There’s about a 13.5% chance of getting 0 SSR cards even when the true rate is 1%. This is not rare!
The “More Extreme” Evidence:
What if we wanted to test in both directions (the true rate could be higher OR lower than claimed)? Note that $P(X \geq 4)$ includes the whole upper tail: $0.090 + 0.053 = 0.143$.
$$P(|X - 2| \geq 2) = P(X = 0 \text{ or } X \geq 4) = 0.135 + 0.143 = 0.278$$
About 27.8% of samples would deviate this much from the expected value of 2!
Conclusion: Our observation of 0 SSR cards provides weak evidence against the company’s claim. It’s well within the realm of random variation.
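The probabilities used in this example can be reproduced, and the quality of the Poisson approximation checked, with a few lines of Python. The sketch below relies only on the standard `math` module. The function names are illustrative, and the code assumes the company's claim $p = 0.01$ is true.

```python
from math import comb, exp, factorial

n, p = 200, 0.01
lam = n * p  # Poisson rate: lambda = np = 2

def binomial_pmf(x):
    """Exact P(X = x) for X ~ Binomial(200, 0.01)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x):
    """Approximate P(X = x) using Poisson(2)."""
    return exp(-lam) * lam**x / factorial(x)

# Side-by-side comparison for small counts.
for x in range(5):
    print(f"x={x}: binomial={binomial_pmf(x):.4f}  poisson={poisson_pmf(x):.4f}")

# One-sided and two-sided tail probabilities under the Poisson model.
p_zero = poisson_pmf(0)
p_four_or_more = 1 - sum(poisson_pmf(x) for x in range(4))
two_sided = p_zero + p_four_or_more
print(f"P(X = 0) = {p_zero:.4f}")
print(f"P(X = 0 or X >= 4) = {two_sided:.4f}")
```

The two columns agree closely for every count shown, which is why the simpler Poisson formula is an acceptable stand-in for the exact binomial here.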
What we’ve just done is the foundation of statistical hypothesis testing — the subject of our next chapter!
The Process We Followed:
Null Hypothesis: Assume the company is honest: “True SSR rate = 1%”
Choose Right Statistic: Count of SSR cards: $X = 0$ (better than proportion for rare events)
Find Sampling Distribution: Under null hypothesis, $X \sim \text{Poisson}(2)$
Calculate p-value: $P(X \leq 0) = P(X = 0) = 0.135$ (probability of a result at least as extreme as ours, in the direction of a lower rate)
Make Decision: 13.5% is quite high → insufficient evidence to reject company’s claim
Why This Approach is Powerful:
Objective: We use precise probability calculations instead of subjective judgment
Calibrated: We quantify exactly how unusual our observation is
Fair: We give the company the “benefit of the doubt” (assume innocence first)
Systematic: The same process works for any claim about any population parameter
Coming Next Chapter — Formal Hypothesis Testing:
How to set up null and alternative hypotheses systematically
Decision rules: When is evidence “strong enough” to reject a claim?
One-tailed vs two-tailed tests: Directional vs non-directional claims
The Revolution: We’ve moved from “That seems suspicious…” to “There’s a 13.5% chance of this happening by coincidence.” This precision transforms business decisions, scientific conclusions, and public policy!
Statistical Wisdom: You’ve now experienced the evolution from “gut feeling” → “precise probability” → “rational decision.” This is the essence of scientific thinking!
Example: WST02/01/June18/4
The volume of milk, $M$ litres, in cartons produced by a dairy, has distribution $N(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are unknown. A random sample of 12 cartons is taken and the volume of milk in each carton is measured ($M_1$, $M_2$, …, $M_{12}$). A statistic $X$ is based on this sample.
Explain what is meant by “a random sample” in this case. (1 mark)
State the population in this case. (1 mark)
Write down the distribution of $\displaystyle \frac{M_{12} - \mu}{\sigma}$ (1 mark)
Explain what you understand by the sampling distribution of $X$. (1 mark)
State, giving a reason, which of the following is not a statistic based on this sample. (2 marks)
$\displaystyle 3M_1 + \frac{2M_{11}}{6}$
$\displaystyle \sum_{i=1}^{12} \left( \frac{M_i - \mu}{\sigma} \right)^2$
$\displaystyle \sum_{i=1}^{12}(2M_i - 3)$
Example: WST02/01/Jan15/3
Explain what you understand by
a statistic, (1 mark)
a sampling distribution (1 mark)
A factory stores screws in packets. A small packet contains 100 screws and a large packet contains 200 screws. The factory keeps small and large packets in the ratio 4:3 respectively.
Find the mean and the variance of the number of screws in the packets stored at the factory. (3 marks)
A random sample of 3 packets is taken from the factory and $Y_1$, $Y_2$ and $Y_3$ denote the number of screws in each of these packets.
List all the possible samples (2 marks)
Find the sampling distribution of $\bar{Y}$ (4 marks)
Example: WST02/01/June15/5
A bag contains a large number of counters with 35% of the counters having a value of 6 and 65% of the counters having a value of 9.
A random sample of size 2 is taken from the bag and the value of each counter is recorded as $X_1$ and $X_2$ respectively.
The statistic $Y$ is calculated using the formula
$$Y = \frac{2X_1 + X_2}{3}$$
List all the possible values of $Y$. (2 marks)
Find the sampling distribution of $Y$. (5 marks)
Find $E(Y)$. (2 marks)