S2 Chapter 6: Sampling Distributions

From Single Numbers to Patterns: Understanding the Nature of Statistical Investigation

Imagine you’re the quality control manager at a smartphone factory producing 10,000 phones daily. How do you ensure quality without testing every single phone? Or consider a political poll predicting election results from just 1,500 voters out of millions. How can such small samples reveal meaningful truths about vast populations?

This chapter explores the mathematical foundation that makes statistical inference possible — the theory of sampling distributions.

1. The Language of Sampling

1.1 A Real-World Mystery: The Mobile Game Investigation

Before diving into formal definitions, let’s explore these concepts through a scenario that might be very familiar to you.

The Setup: Your favorite mobile game claims that the legendary SSR cards have a 1% drop rate. But you and your classmates suspect the game company might be lying — the drop rates seem lower than advertised.

The Investigation:

Population: All possible cards that could be drawn from the game’s virtual pool — an infinite collection where 1% should be SSR cards
Parameter: The game company’s claimed true SSR drop rate: $p = 0.01$ (1%)
Sample: 10 classmates collectively perform $n = 200$ card draws
Statistic: You observe $0$ SSR cards out of 200 draws, giving sample drop rate: $\hat{p} = \frac{0}{200} = 0$

The Key Question: Since $\hat{p} = 0 < 0.01$, does this prove the game company is deceiving players? Or could this difference be just random luck?

The Statistical Challenge: To answer this question, we need to understand how $\hat{p}$ varies from sample to sample — this is the heart of sampling distribution theory!

This investigation perfectly illustrates why we need to study sampling distributions. Let’s now build the formal vocabulary to analyze such problems systematically.

1.2 Building Our Vocabulary — The Five Fundamental Concepts

Now that we’ve seen these concepts in action, let’s define them precisely:

1.3 Real-World Examples: Connecting Concepts to Life

1.4 Statistics: The Bridge Between Sample and Population

Now let’s focus on the most crucial concept: what exactly makes something a “statistic”?

Let’s test your understanding with concrete examples:

Suppose we have a sample of 5 students’ heights: $X_1 = 160, X_2 = 165, X_3 = 170, X_4 = 155, X_5 = 175$ (all in cm). The population mean height $\mu$ and variance $\sigma^2$ are unknown.

Identify which of the following are statistics:

$\bar{X} = \frac{160 + 165 + 170 + 155 + 175}{5} = 165$ cm — Answer: _____
$s^2 = \frac{(160-165)^2 + (165-165)^2 + \cdots + (175-165)^2}{5-1} = 62.5$ — Answer: _____
$X_1 - \mu = 160 - \mu$ (where $\mu$ is the unknown population mean) — Answer: _____
$\frac{\bar{X} - \mu}{\sigma/\sqrt{5}}$ (the standardized sample mean) — Answer: _____
The number of students taller than 170 cm = 1 — Answer: _____
$2\bar{X} + 10 = 2(165) + 10 = 340$ — Answer: _____
$\sum_{i=1}^5 (X_i - \mu)^2$ (sum of squared deviations from population mean) — Answer: _____

The Critical Insight: Statistics are our “messengers” — they carry information from the sample to help us learn about the unknown population. But they’re imperfect messengers because they vary from sample to sample!
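The arithmetic in the height example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and it reproduces only the numerical values already shown in the list above.

```python
from statistics import mean, variance

# Heights (cm) of the five sampled students from the example above.
heights = [160, 165, 170, 155, 175]

sample_mean = mean(heights)                       # x-bar
sample_variance = variance(heights)               # s^2, divides by n - 1
taller_than_170 = sum(h > 170 for h in heights)   # strictly taller than 170 cm
shifted_mean = 2 * sample_mean + 10               # 2 * x-bar + 10

print("sample mean:", sample_mean)
print("sample variance:", sample_variance)
print("students taller than 170 cm:", taller_than_170)
print("2 * mean + 10:", shifted_mean)
```

Note that `statistics.variance` uses the $n - 1$ divisor, matching the formula for $s^2$ in the list.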

2. The Revolutionary Concept: Sampling Distributions

2.1 Returning to Our Mobile Game Mystery

Remember our SSR investigation? We observed $\hat{p} = 0$ from 200 draws, which is lower than the claimed 0.01. But before concluding the game company is lying, we need to understand: How much should $\hat{p}$ vary due to random sampling?

2.2 Discovering Sampling Distributions Through Simulation

The Setup:

We’ll assume the game company is telling the truth: $p = 0.01$ (1% SSR rate).

Step 1 — Simulation: Use a random number generator (such as random.org) to simulate drawing 200 cards 20 times:

Generate 200 random integers between $1$ and $100$
Count how many are $= 1$ (these represent SSR cards)
Calculate your personal $\hat{p} = \frac{\text{SSR count}}{200}$

Step 2 — Collect Data: Collect all 20 $\hat{p}$ values.

Step 3 — Create the Distribution: Count frequencies and create a histogram:

$\hat{p}$ Value	Count	Frequency
0.000
0.005
0.010
0.015
0.020
0.025
0.030

Step 4 — Draw Your Conclusion: Based on the distribution you created, how suspicious is the drop rate announced by the game company?

The Statistical Revelation: Sampling distributions allow us to quantify sampling error and determine whether an observed statistic represents a rare event or normal variation.
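Steps 1 to 3 can also be carried out in code instead of by hand. The sketch below is one possible version, assuming Python's built-in `random` module stands in for the random number generator; the function names and the seed value are illustrative choices, not part of the original activity.

```python
import random
from collections import Counter

def simulate_p_hat(n_draws=200, rng=random):
    """One sample: draw integers from 1 to 100 and treat a 1 as an SSR card."""
    ssr_count = sum(1 for _ in range(n_draws) if rng.randint(1, 100) == 1)
    return ssr_count / n_draws

def simulate_many(n_samples=20, n_draws=200, seed=None):
    """Repeat the sampling n_samples times and return every p-hat value."""
    rng = random.Random(seed)
    return [simulate_p_hat(n_draws, rng) for _ in range(n_samples)]

# Assumed seed, used only so the run is reproducible.
p_hats = simulate_many(n_samples=20, n_draws=200, seed=2024)

# Text histogram: one '#' per simulated sample with that p-hat value.
for value, count in sorted(Counter(p_hats).items()):
    print(f"{value:.3f}  {count:2d}  {'#' * count}")
```

Changing the seed, or raising `n_samples` well beyond 20, shows how the shape of the histogram settles down as more samples are collected.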

2.3 From Intuition to Theory

This experiment demonstrates the revolutionary insight: instead of thinking of $\hat{p}$ (or any statistic) as just a number, we recognize it as a random variable with its own distribution.

The Key Insight: Every time you take a sample, your statistic will be different. The sampling distribution tells you how these different values are distributed and helps you distinguish between “normal variation” and “something unusual is happening.”

2.4 Mathematical Analysis: From Simulation to Theory

Now that we’ve experienced sampling distributions through simulation, let’s see how to construct them mathematically. We’ll use a different discrete example to build our theoretical understanding.

Gaming Scenario: You’re playing an adventure game where treasure boxes contain different valuable coins. Market research shows the treasure drop rates, and you want to understand the risk patterns when opening multiple boxes.

Population: A vast collection of treasure boxes with the following contents:

Population Distribution:

Item	Value (coins)	Probability	Rarity
Gold Coin	100	0.1	Legendary
Silver Coin	50	0.3	Rare
Copper Coin	10	0.6	Common

Research Question: Instead of studying average value (which is predictable), let’s investigate something more exciting: “What’s the distribution of the most valuable item found when opening 2 boxes?”

This statistic $M = \max(X_1, X_2)$ represents the “best luck” in a small treasure hunt!

Step 1: Enumerate All Possible Outcomes. For 2 boxes, we have $3^2 = 9$ possible combinations:

Sample	$(X_1, X_2)$	Probability	Max Value $M$
1	(10, 10)	$0.6 \times 0.6 = 0.36$	10
2	(10, 50)	$0.6 \times 0.3 = 0.18$	50
3	(10, 100)	$0.6 \times 0.1 = 0.06$	100
4	(50, 10)	$0.3 \times 0.6 = 0.18$	50
5	(50, 50)	$0.3 \times 0.3 = 0.09$	50
6	(50, 100)	$0.3 \times 0.1 = 0.03$	100
7	(100, 10)	$0.1 \times 0.6 = 0.06$	100
8	(100, 50)	$0.1 \times 0.3 = 0.03$	100
9	(100, 100)	$0.1 \times 0.1 = 0.01$	100

Step 2: Construct the Sampling Distribution of the Maximum $M$

Max Value $M$	Samples	Probability	Gaming Interpretation
10	{1}	0.36	“Unlucky run”
50	{2, 4, 5}	$0.18 + 0.18 + 0.09 = 0.45$	“Decent find”
100	{3, 6, 7, 8, 9}	$0.06 + 0.03 + 0.06 + 0.03 + 0.01 = 0.19$	“Jackpot!”

Statistical Insights:

$P(\text{Best item is just copper}) = 0.36$ — More than 1/3 chance of disappointment!
$P(\text{Find at least one gold coin}) = 0.19$: nearly double the single-box gold rate (0.1)
$E(M) = 10 \times 0.36 + 50 \times 0.45 + 100 \times 0.19 = 45.1$ coins

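Gaming Strategy Implication: Gold has only a 10% drop rate per box, yet the chance that gold is your best item in 2 boxes is 19%, since $1 - 0.9^2 = 0.19$. Likewise, $E(M) = 45.1$ coins exceeds the single-box mean of $0.6 \times 10 + 0.3 \times 50 + 0.1 \times 100 = 31$ coins. This shows why maximum statistics behave very differently from means!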
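The enumeration in Steps 1 and 2 can be automated. The following sketch lists every ordered pair of boxes, multiplies the probabilities (assuming the boxes are independent, as in the table), and groups the results by the value of the statistic. The function names are hypothetical helpers introduced for illustration.

```python
from itertools import product

# Population distribution from the treasure-box table: coin value -> probability.
POPULATION = {10: 0.6, 50: 0.3, 100: 0.1}

def sampling_distribution(statistic, n_boxes=2, population=POPULATION):
    """Enumerate every ordered sample of n_boxes values and add up the
    probability attached to each value of the statistic."""
    distribution = {}
    for sample in product(population, repeat=n_boxes):
        probability = 1.0
        for value in sample:
            probability *= population[value]  # boxes are assumed independent
        key = statistic(sample)
        distribution[key] = distribution.get(key, 0.0) + probability
    return distribution

def expected_value(distribution):
    """Mean of a discrete distribution given as value -> probability."""
    return sum(value * prob for value, prob in distribution.items())

max_distribution = sampling_distribution(max)
for m in sorted(max_distribution):
    print(f"P(M = {m:>3}) = {max_distribution[m]:.2f}")
print(f"E(M) = {expected_value(max_distribution):.1f}")
```

Passing a different function, for example one that returns the sample mean, gives that statistic's sampling distribution from the same enumeration.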

A popular card game uses the classic “Five-Draw Pity System”: if you don’t get a legendary card in your first 4 draws, the 5th draw is guaranteed to be legendary!

Normal Draw Rates:

Card Type	Probability
Legendary	0.2
Epic	0.8

Pity System Rule: If draws 1, 2, 3, AND 4 are all Epic, then draw 5 is automatically Legendary.

Research Question: What’s the sampling distribution of the “first legendary position” statistic $L$?

Step 1: List the key scenarios and calculate probabilities:

Case	Pattern	Probability Calculation	Statistic $L$
1	(L, *, *, *, *)	$P(X_1 = L) = 0.2$	$L =$ ___
2	(E, L, *, *, *)	$P(X_1 = E) \times P(X_2 = L) =$ ___	$L =$ ___
3	(E, E, L, *, *)	___	$L =$ ___
4	(E, E, E, L, *)	___	$L =$ ___
5	(E, E, E, E, L)	___	$L =$ ___

Step 2: Construct the sampling distribution of $L$:

$L$ Value	Probability	Gaming Experience
1	0.2	“Instant jackpot!”
2	___	“Quick luck”
3	___	“Third time’s charm”
4	___	“Almost triggered pity”
5	___	“Pity system saves you”

Step 3: Answer these strategic questions:

What’s the probability of needing the pity system?
What’s the expected position of the first legendary?
How does this compare to 5 independent draws with no pity system?
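Once you have filled in the tables by hand, a simulation is one way to check your answers. The sketch below plays the five-draw pity system many times and reports how often each value of $L$ occurs. The function names, the number of trials, and the seed are assumptions made for illustration.

```python
import random
from collections import Counter

def first_legendary_position(rng, p_legendary=0.2, pity_draw=5):
    """Return the position of the first Legendary card under the pity rule."""
    for position in range(1, pity_draw):
        if rng.random() < p_legendary:
            return position
    # Draws 1 to 4 were all Epic, so draw 5 is automatically Legendary.
    return pity_draw

def estimate_distribution(n_trials=100_000, seed=0):
    """Estimate P(L = 1), ..., P(L = 5) by relative frequency."""
    rng = random.Random(seed)
    counts = Counter(first_legendary_position(rng) for _ in range(n_trials))
    return {pos: counts[pos] / n_trials for pos in range(1, 6)}

estimated = estimate_distribution()
for position, share in estimated.items():
    print(f"L = {position}: {share:.3f}")

# Estimated expected position of the first Legendary card.
print("estimated E(L):", sum(pos * share for pos, share in estimated.items()))
```

The simulated frequencies should be close to, but not exactly equal to, the exact probabilities you calculate in Steps 1 and 2.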

Real-World Scenario: HeyTea is considering opening a new store near our school. They hired you as student consultants to conduct market research!

Question: “What percentage of students would buy milk tea at least once a week if we opened here?”

Mission: Design and analyze a sampling study to help HeyTea make this important business decision.

Step 1: Define Your Study Framework

Population: ____________
Parameter of Interest: Let $p$ = true proportion of students who would buy milk tea weekly. HeyTea needs $p \geq 0.40$ (40%) to make the store profitable.
Sampling Frame: ____________
Sample Size: Your team decides to survey $n = 50$ randomly selected students.

Step 2: Business Implications

If your class found $\hat{p}_{class} = 0.36$ (36%), should HeyTea immediately conclude the market isn’t viable?
What does sampling variability tell us about making business decisions from limited data?

Statistical Challenge: If the true population proportion is actually $p = 0.4$ (exactly at the profitability threshold), what’s the probability that a sample of 50 students would give $\hat{p} \leq 0.36$, potentially leading HeyTea to make the wrong business decision?
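One way to attack the Statistical Challenge is to count "yes" answers rather than work with the proportion. The sketch below assumes each surveyed student independently says yes with probability $p$, so the count follows a binomial distribution; this modelling assumption and the helper name `binomial_cdf` are not stated in the exercise itself.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p = 50, 0.4

# p-hat <= 0.36 with n = 50 means at most 18 "yes" answers (0.36 * 50 = 18).
max_yes = 18

probability = binomial_cdf(max_yes, n, p)
print(f"P(p-hat <= 0.36) = P(X <= {max_yes}) = {probability:.4f}")
```

Re-running with a larger `n` (and the matching `max_yes`) shows how the risk of a misleading sample changes with sample size.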

3. Solving Our Mobile Game Mystery

Now we can return to our original question with the proper theoretical framework!

3.1 Deeper Analysis: Using the Right Distribution

Let’s approach this problem with a model suited to rare events. Since the probability is low and the sample is large, the Poisson distribution is a good approximation!

The Smarter Approach: Instead of analyzing proportions, let’s directly study the count of SSR cards!

Our Statistic: $X =$ number of SSR cards in 200 draws

The Mathematical Foundation: If the company’s claim is true ($p = 0.01$), then:

$X \sim \text{Binomial}(n = 200, p = 0.01) \approx \text{Poisson}(\lambda = np = 200 \times 0.01 = 2)$

This makes intuitive sense: we expect about 2 SSR cards on average in 200 draws.

What This Distribution Looks Like:

$X$ (SSR Count)	$P(X = x)$	Interpretation
0	$e^{-2} \cdot \frac{2^0}{0!} = 0.135$	“Terrible luck” (our observation!)
1	$e^{-2} \cdot \frac{2^1}{1!} = 0.271$	“Below average”
2	$e^{-2} \cdot \frac{2^2}{2!} = 0.271$	“Exactly expected”
3	$e^{-2} \cdot \frac{2^3}{3!} = 0.180$	“Above average”
4	$e^{-2} \cdot \frac{2^4}{4!} = 0.090$	“Lucky streak”
5+	$\approx 0.053$	“Extremely lucky”
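The table values can be reproduced directly from the Poisson formula, and the quality of the approximation can be judged by comparing against the exact binomial probabilities. This is a minimal sketch using only the standard library; the helper names are illustrative.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 200, 0.01
lam = n * p  # expected number of SSR cards in 200 draws

print(" x   Poisson   Binomial")
for x in range(5):
    print(f"{x:2d}   {poisson_pmf(x, lam):.3f}     {binomial_pmf(x, n, p):.3f}")

# Everything from 5 upward is the complement of the first five rows.
poisson_tail = 1 - sum(poisson_pmf(x, lam) for x in range(5))
print(f"5+   {poisson_tail:.3f}")
```

The two columns agree closely here because $n$ is large and $p$ is small, which is the situation in which the Poisson approximation is normally used.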

The Critical Question: How unusual is observing 0 SSR cards if the company is honest?

Calculating the Evidence:

$P(X = 0) = 0.135$

Interpretation: There’s about a 13.5% chance of getting 0 SSR cards even when the true rate is 1%. This is not rare!

The “More Extreme” Evidence: What if we wanted to test in both directions (the true rate could be lower OR higher than claimed)?

$P(|X - 2| \geq 2) = P(X = 0 \text{ or } X \geq 4) = 0.135 + (0.090 + 0.053) = 0.278$

About 27.8% of samples would deviate this much from the expected value of 2!

Conclusion: Our observation of 0 SSR cards provides weak evidence against the company’s claim. It’s well within the realm of random variation.
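The tail probabilities behind this conclusion can be computed rather than read from the table. The sketch below evaluates the one-sided probability $P(X = 0)$ and the two-sided probability $P(|X - 2| \geq 2)$ under $\text{Poisson}(2)$. Note that $P(X \geq 4)$ covers every count from 4 upward, so it is obtained as $1 - P(X \leq 3)$. The helper names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))

lam = 2.0      # expected SSR count if the claimed 1% rate is true
observed = 0   # SSR cards actually seen in 200 draws

lower_tail = poisson_cdf(observed, lam)   # P(X <= 0), which equals P(X = 0)
upper_tail = 1 - poisson_cdf(3, lam)      # P(X >= 4)
two_sided = lower_tail + upper_tail       # P(|X - 2| >= 2)

print(f"P(X = 0)        = {lower_tail:.3f}")
print(f"P(X >= 4)       = {upper_tail:.3f}")
print(f"P(|X - 2| >= 2) = {two_sided:.3f}")
```

Changing `observed` and `lam` lets the same calculation be reused for a different number of draws or a different claimed drop rate.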

4. Preview: The World of Hypothesis Testing

What we’ve just done is the foundation of statistical hypothesis testing — the subject of our next chapter!

The Process We Followed:

Null Hypothesis: Assume the company is honest: “True SSR rate = 1%”
Choose Right Statistic: Count of SSR cards: $X = 0$ (better than proportion for rare events)
Find Sampling Distribution: Under null hypothesis, $X \sim \text{Poisson}(2)$
Calculate p-value: $P(X \leq 0) = P(X = 0) = 0.135$ (probability of our evidence or stronger)
Make Decision: 13.5% is quite high → insufficient evidence to reject company’s claim

Why This Approach is Powerful:

Objective: We use precise probability calculations instead of subjective judgment
Calibrated: We quantify exactly how unusual our observation is
Fair: We give the company the “benefit of the doubt” (assume innocence first)
Systematic: The same process works for any claim about any population parameter

Coming Next Chapter — Formal Hypothesis Testing:

How to set up null and alternative hypotheses systematically
Decision rules: When is evidence “strong enough” to reject a claim?
One-tailed vs two-tailed tests: Directional vs non-directional claims

The Revolution: We’ve moved from “That seems suspicious…” to “There’s a 13.5% chance of this happening by coincidence.” This precision transforms business decisions, scientific conclusions, and public policy!

Statistical Wisdom: You’ve now experienced the evolution from “gut feeling” → “precise probability” → “rational decision.” This is the essence of scientific thinking!

Homework Exercises

The volume of milk, $M$ litres, in cartons produced by a dairy has distribution $N(\mu, \sigma^2)$, where $\mu$ and $\sigma$ are unknown. A random sample of 12 cartons is taken and the volume of milk in each carton is measured ($M_1$, $M_2$, …, $M_{12}$). A statistic $X$ is based on this sample.

Explain what is meant by “a random sample” in this case. (1)
State the population in this case. (1)
Write down the distribution of $\displaystyle \frac{M_{12} - \mu}{\sigma}$. (1)
Explain what you understand by the sampling distribution of $X$. (1)
State, giving a reason, which of the following is not a statistic based on this sample. (2)
1. $\displaystyle 3M_1 + \frac{2M_{11}}{6}$
2. $\displaystyle \sum_{i=1}^{12} \left( \frac{M_i - \mu}{\sigma} \right)^2$
3. $\displaystyle \sum_{i=1}^{12}(2M_i - 3)$