
An Intuitive Journey Through Statistics & Hypothesis Testing


From data to decision in a coffee shop with two baristas

TL;DR

• Data → Mean → Standard deviation → Standard error → A/B comparison → z-score → Hypothesis testing → Decision

🟦 1. The Data — Customers Waiting for Coffee

Imagine you own a small coffee shop.

You have two baristas: Anna and Ben.

Every morning, customers queue up, and you start timing how long each customer waits. Anna's customers might wait 25s, 30s, 28s, 29s, …; Ben's might wait 32s, 35s, 30s, 29s, … (the full sample appears in the code at the end).

At this point, the data look messy: just a jumble of numbers with no obvious pattern.

You’re not doing statistics yet.
You’re just observing the world.

💬 Intuition:

Raw data is just reality recorded. It’s noisy and ugly – and that’s fine.


🟩 2. The Mean — “Typical” Wait Time

You’d like to compress all this into something like:

“How long does a customer typically wait with Anna?”
“How long with Ben?”

That’s where the mean comes in.

Suppose after timing many customers, you get: Anna averages about 27.5 seconds, Ben about 33 seconds (these are the means of the sample data in the code at the end).

The mean is the balancing point of all the data.

If you imagine every wait time as a weight on a number line, the mean is where the plank balances.

💡 Why the mean?

Because it’s the best single guess of “central tendency” when you care about squared error (which we’ll get to soon).


🟧 3. Standard Deviation — How Wild is the Experience?

Two baristas could have the same average, but very different consistency.

To capture this, we use standard deviation (SD).

💬 Intuition:

SD answers: “How much, on average, do individual wait times wiggle around the mean?”

🧮 Side Note: Variance and Squared Error

Variance is the average squared deviation from the mean.

Why squared? Partly so that positive and negative deviations don't cancel out, and partly because squaring ties the mean to a deeper optimization property:

💡 What does “the mean is the point that minimizes the sum of squared errors” actually mean?

👉 If you pick any number to represent your data, the mean is the number that gives you the smallest total squared difference from all data points.

This is an optimization statement.

Let’s break it down using developer mental models.

Imagine you have a list of numbers:

25, 30, 28, 29

And you want one single value to represent them.

Let’s call that value M.

But how do you choose M?

🔧 Step 1 — Think of M as a “central” value

You want M to be close to all the values.

So you measure how far M is from each one:

distance = each value - M

But distances can be negative (e.g., 25 - 28 = -3), so we square them to make them always positive.

So we compute:

(25 - M)²
(30 - M)²
(28 - M)²
(29 - M)²

And then add them up.

This gives us a score:

score(M) = (25 - M)² + (30 - M)² + (28 - M)² + (29 - M)²

🧠 Step 2 — Try different values for M

Try M = 10 → far from all numbers → huge score
Try M = 100 → even worse
Try M = 27 → getting closer
Try M = 28 → even better
Try M = 29 → also good, but slightly worse
Try M = 40 → bad again

It turns out that the one single value that gives the smallest score is always the mean.
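Here is a minimal sketch of that search in NumPy (the four values and the name M come from the example above; the brute-force list of candidates is just for illustration):

import numpy as np

data = np.array([25, 30, 28, 29], dtype=float)

def score(M):
    # Total squared distance between the candidate M and every data point
    return ((data - M) ** 2).sum()

for M in [10, 27, 28, 29, 40, 100]:
    print(f"M = {M:>3} -> score = {score(M):7.1f}")

# The minimiser is exactly the mean: 28.0, with score(28.0) = 14.0
print("mean =", data.mean())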

🎨 Developer-Friendly Metaphor

Think of each data point as pulling on a point M with a string.

If M is too far left → right-side numbers pull harder

If M is too far right → left-side numbers pull harder

Squaring the distances makes longer pulls much stronger

The mean is the exact point where all the pulls balance.

It’s the optimal compromise.

OK, I hope that clears it up, now…

Standard deviation is just the square root of the variance, bringing the spread back to the original units (seconds).
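As a quick sketch (same four wait times; note this uses the population form of the variance, dividing by N, while the full snippet at the end uses the sample form, ddof=1):

import numpy as np

waits = np.array([25, 30, 28, 29], dtype=float)
deviations = waits - waits.mean()     # how far each point sits from the mean
variance = (deviations ** 2).mean()   # average squared deviation: 3.5
sd = np.sqrt(variance)                # back to seconds: ≈ 1.87
print(variance, sd)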


🟨 4. Standard Error — How Uncertain is Our Mean?

We now know each barista's mean wait time and the standard deviation around it.

But here’s the deeper question:

“How uncertain am I about that average?”

If you only timed 5 customers, you’re not very sure.
If you timed 500 customers, you’re more confident.

This is what standard error (SE) measures:

SE = SD / √N

Where:

• SD is the sample standard deviation of the wait times
• N is the number of customers you timed

💬 Intuition:
SE is the uncertainty of your mean.
More data → √N grows → SE shrinks → your estimate sharpens.
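A small simulation makes this concrete (simulated wait times; the true mean of 28s and SD of 3s are assumptions chosen just for the demo):

import numpy as np

rng = np.random.default_rng(42)

for n in [5, 50, 500]:
    sample = rng.normal(loc=28, scale=3, size=n)  # hypothetical wait times
    se = sample.std(ddof=1) / np.sqrt(n)          # SE = SD / √N
    print(f"N = {n:>3}: mean = {sample.mean():5.2f}s, SE = {se:.3f}s")

The estimated mean barely moves, but its uncertainty shrinks roughly as 1/√N.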


🟥 5. Comparing Anna vs Ben — A/B Logic

Now we move from describing one barista to comparing two.

We define:

δ = mean(Anna) - mean(Ben)

Interpretation:

• δ < 0 → Anna's customers wait less: Anna is faster
• δ > 0 → Ben is faster
• δ ≈ 0 → no clear difference

But δ is based on sample data, so it's noisy too.

Every measured mean has uncertainty, so the difference does too.

We need its standard error:

SE_δ = √(SE_Anna² + SE_Ben²)

This comes from the rule:

If two estimates are independent,
variance of the difference = sum of their variances.

Since SE² = variance of the estimate, their squared SEs add.

💬 Intuition:
SE_δ tells us how noisy this comparison is. If SE_δ is tiny, even a small δ might be meaningful.
If SE_δ is huge, you might be seeing a random fluke.
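To sanity-check the variance-addition rule, here is a quick simulation (the SDs of 2 and 3 are arbitrary choices for the demo):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 2, size=100_000)  # variance ≈ 4
y = rng.normal(0, 3, size=100_000)  # variance ≈ 9

# Variance of the difference ≈ sum of the individual variances
print(np.var(x - y))  # ≈ 13.0 = 4 + 9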


🟪 6. Z-Score — Measuring Difference in Units of Noise

Now we compress everything (the entire comparison) into a single number:

z = δ / SE_δ

This is the z-score.

It answers:

“How many standard errors away from zero is this observed difference?”

🧮 Sidebar: Why does z follow a bell curve?

Because δ is built from averages of many independent observations.

By the Central Limit Theorem (CLT),
those averages are approximately normally distributed,
no matter what the original data looked like (within reasonable conditions).

Dividing by SE_δ standardizes it, so that

z ~ N(0, 1) (approximately)

under the assumption “true δ = 0”.
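A quick way to see the CLT in action (simulated, heavily skewed wait times; the exponential distribution and the sample size of 50 are arbitrary choices for the demo):

import numpy as np

rng = np.random.default_rng(1)

# 10,000 simulated mornings, each timing 50 customers from a skewed distribution
raw = rng.exponential(scale=28, size=(10_000, 50))
means = raw.mean(axis=1)  # one sample mean per morning

# Raw data: skewed (median far below the mean). Sample means: nearly symmetric.
print("raw:   median =", np.median(raw).round(1), " mean =", raw.mean().round(1))
print("means: median =", np.median(means).round(1), " mean =", means.mean().round(1))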


🟫 7. Hypothesis Testing — Turning Evidence Into Decisions

We now have:

Hypothesis testing wraps this into a decision rule.

🧩 The game:

  1. Null hypothesis (H₀):
    Anna and Ben have the same true mean wait time

  2. Alternative (H₁):
    Anna and Ben differ (or even directionally: Anna is faster)

  3. Compute:
    z = δ / SE_δ

  4. Ask:
    Under H₀, with z ~ N(0, 1),
    how probable is a z as extreme as the one we observed?

  5. Decision:

    • If that probability is small (say, < 5%),
      act as if the difference is real,
      and route more customers to the faster barista.
    • If not, you don’t have enough evidence yet.

⚠️ Important:
You never get certainty.
You get a bet with a known error rate: a controlled risk of being wrong (e.g. at most a 5% chance of declaring a difference where none exists).
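As a sketch of step 5 in code (scipy.stats is not used in the snippet at the end; the z value here is the one that snippet produces):

from scipy.stats import norm

z = -3.67      # example value: the z-score from the snippet below
alpha = 0.05   # our tolerated false-positive rate

# Two-sided p-value: probability of a z at least this extreme under H₀
p = 2 * norm.sf(abs(z))

if p < alpha:
    print(f"p = {p:.5f} < {alpha}: act as if the difference is real")
else:
    print(f"p = {p:.5f}: not enough evidence yet")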


🧭 8. The Whole Journey

Here’s the full progression from noisy data to decision:

Data → Mean → Standard deviation → Standard error → A/B comparison (δ) → z-score → Hypothesis testing → Decision

💻 Example Code Snippets

import numpy as np

# Example wait times in seconds
anna = np.array([25, 30, 28, 29, 27, 26, 31, 24], dtype=float)
ben  = np.array([32, 35, 30, 29, 40, 33, 31, 34], dtype=float)

def describe(x):
    mean = x.mean()
    sd = x.std(ddof=1)          # sample standard deviation (ddof=1)
    se = sd / np.sqrt(len(x))   # standard error of the mean: SD / √N
    return mean, sd, se

anna_mean, anna_sd, anna_se = describe(anna)
ben_mean, ben_sd, ben_se = describe(ben)

delta = anna_mean - ben_mean                # observed difference δ
se_delta = np.sqrt(anna_se**2 + ben_se**2)  # SE_δ: squared SEs add (independence)
z = delta / se_delta                        # the difference in units of noise

print("Anna:", anna_mean, anna_sd, anna_se)
print("Ben :", ben_mean, ben_sd, ben_se)
print("delta:", delta)
print("SE_delta:", se_delta)
print("z-score:", z)

