Knowledge · Reference

Testing Methodology

Six principles that separate testing programs that compound into product knowledge from ones that produce ambiguous noise.

Strategy before tactics

Connect activities to business objectives before you pick what to change.

It's tempting to start with "what content should we optimise?" — but that rarely compounds. Anchor first to a business objective (KBO), then identify where in the journey you'd move it, then design the test.

A program without strategy produces a list of activities. A program with strategy produces a steady drumbeat of business outcomes.

Hypothesis before variant

If you can't write the statement, you don't have a test — you have a guess.

Skipping the hypothesis is the most common reason post-test debriefs go in circles. The discipline of writing "For [audience], we believe that [change] will result in [outcome] because [reasoning]" forces you to name the mechanism you're actually testing.

Without it, any result is interpretable as a win.

Pick metrics before launch

Primary, secondary, guardrails — all committed before traffic flows.

Picking the metric after seeing the data is how teams accidentally lie to themselves. The right discipline: write down your primary metric, 1–3 secondaries, and the guardrails that would block roll-out even if you win on primary.

If you can't commit to a primary metric ahead of time, the hypothesis isn't sharp enough yet.

Power your test

Agree on baseline, MDE, and run length up front — and respect the math.

An underpowered test that comes back flat doesn't tell you the change had no effect. It tells you nothing: the sample wasn't big enough to detect a real difference of the size you cared about.

Set MDE based on what change is worth acting on, not what change you secretly hope for.

Don't peek

The sample-size calc assumes a single read at the end of the run.

Checking every day and calling a test early when something looks significant is the single fastest way to ship a placebo. Sequential checks inflate the false-positive rate well above the confidence level you'd expect.

If you genuinely need interim checks, use a sequential testing method that accounts for them; otherwise wait for the calculated duration.

Define winning up front

What lift, on which metric, sustained how long, justifies roll-out?

"Statistically significant" is not the same as "worth the engineering cost." A 0.3% lift on a low-cost change may ship; the same lift on something that adds latency or maintenance burden may not. Decide ahead of time so the result reads itself.