Primer · Statistics · 7 min read

Statistically significant — what it really means

The p-value is one of the most misunderstood concepts in medical literature. A clarification in five points.

Published: 2026-05-21

Almost every headline about a new study ends with "statistically significant", as if that were a seal of quality. Most readers take it to mean "the effect is real" or "the effect is large". Both readings are wrong: p-values measure something else.

1. What the p-value actually says

A p-value is the probability of observing the result (or a more extreme one), ASSUMING the null hypothesis is true. The null hypothesis is typically: "there is no effect".

So a p-value of 0.03 means: if the substance did not actually work, a result at least as extreme as the one observed would still turn up by chance in about 3% of cases. That is a conditional probability, not a direct measure of how likely it is that the effect is real.
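To make the definition concrete, here is a minimal sketch with invented numbers: 60 responders out of 100 patients, tested against a null hypothesis of a 50% response rate. The scenario, the function names and the one-sided binomial test are assumptions chosen for illustration, not taken from any study. The p-value is calculated exactly and then approximated by simulating many trials in which the null hypothesis is true.

```python
import math
import random


def binomial_tail_p(n: int, k_obs: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= k_obs) when X ~ Binomial(n, p_null)."""
    return sum(
        math.comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(k_obs, n + 1)
    )


def simulated_tail_p(n: int, k_obs: int, p_null: float = 0.5,
                     runs: int = 20_000, seed: int = 1) -> float:
    """Share of simulated null-hypothesis trials at least as extreme as k_obs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # One simulated trial in which the null hypothesis is true.
        responders = sum(rng.random() < p_null for _ in range(n))
        if responders >= k_obs:
            hits += 1
    return hits / runs


# Hypothetical data: 60 of 100 patients respond; the null says 50% respond.
exact = binomial_tail_p(100, 60)
simulated = simulated_tail_p(100, 60)
print(f"exact one-sided p     = {exact:.4f}")
print(f"simulated one-sided p = {simulated:.4f}")
```

Both numbers come out at just under 3%. Both answer the question "how often would chance alone produce this?", and neither answers "how likely is it that the treatment works?".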

What the p-value is NOT

The p-value is NOT the probability that the null hypothesis is true. It is NOT the probability that your hypothesis is correct. It is NOT the effect size. It is NOT a measure of reproducibility.

2. The 0.05 threshold is a convention

p < 0.05 as "significant" goes back to Ronald Fisher in the 1920s. It is an arbitrary convention, not a law of nature. p = 0.049 and p = 0.051 represent practically the same strength of evidence, yet one is celebrated and the other ignored.
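How close the two sides of the threshold are becomes visible when p-values are converted back into test statistics. The sketch below assumes a two-sided z-test, which is an illustrative choice, and the helper name is hypothetical.

```python
from statistics import NormalDist


def z_for_two_sided_p(p: float) -> float:
    """|z| that yields the given two-sided p-value under a standard normal."""
    return NormalDist().inv_cdf(1 - p / 2)


for p in (0.049, 0.050, 0.051):
    print(f"p = {p:.3f}  ->  |z| = {z_for_two_sided_p(p):.3f}")
```

The three z-values differ only in the second decimal place. The underlying data are almost the same; only the label changes.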

Good practice in medical statistics increasingly calls for p-values to be reported together with effect sizes and confidence intervals, not as the sole claim.

3. Effect size vs. significance

A tiny, clinically irrelevant difference can be statistically significant if the sample is large enough. A 0.02-percentage-point reduction in HbA1c with p < 0.001 is statistically impressive and clinically irrelevant.

Conversely, a 15% reduction in a clinically important endpoint can have p = 0.12 because the study was too small. "Not significant" does not mean "no effect"; it means the data are not sufficient to rule out chance.
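The interplay of effect size and sample size can be sketched with a simple two-sample z-test. All numbers below are hypothetical: a standard deviation of 1.0 percentage point, equal group sizes and a known variance are assumptions made for illustration, and the second scenario is not a reconstruction of the p = 0.12 example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p-value of a two-sample z-test with equal arms and known SD."""
    standard_error = sd * sqrt(2 / n_per_arm)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect, huge study: 0.02 points difference, 100,000 patients per arm.
p_tiny_effect = two_sided_p(mean_diff=0.02, sd=1.0, n_per_arm=100_000)

# Large effect, small study: 0.5 points difference, 20 patients per arm.
p_large_effect = two_sided_p(mean_diff=0.5, sd=1.0, n_per_arm=20)

print(f"tiny effect, n = 100000 per arm: p = {p_tiny_effect:.6f}")
print(f"large effect, n = 20 per arm:    p = {p_large_effect:.3f}")
```

Under these assumptions the clinically negligible difference is "highly significant", while the difference that is 25 times larger is not. The p-value reflects the sample size as much as the effect.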

4. Multiple tests and p-hacking

When a study tests 20 different endpoints and none of the effects is real, one of them is still expected, on average, to reach p < 0.05 purely by chance. Such a hit is not proof of anything: it is what chance alone predicts.
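The arithmetic behind the 20-endpoint example can be checked directly. The sketch assumes that all 20 null hypotheses are true and that the tests are independent; real endpoints are often correlated, so treat the numbers as a rough guide. The names are illustrative.

```python
import random

ALPHA = 0.05   # significance threshold
N_TESTS = 20   # number of endpoints tested

# Average number of false positives per study.
expected_false_positives = N_TESTS * ALPHA

# Probability that at least one endpoint comes out "significant".
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS


def simulate_share_with_hit(runs: int = 50_000, seed: int = 7) -> float:
    """Share of simulated studies with at least one chance finding."""
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(runs):
        # Under a true null hypothesis a p-value is uniform on [0, 1].
        if any(rng.random() < ALPHA for _ in range(N_TESTS)):
            studies_with_hit += 1
    return studies_with_hit / runs


print(f"expected false positives:   {expected_false_positives:.2f}")
print(f"P(at least one p < 0.05):   {p_at_least_one:.3f}")
print(f"simulated share of studies: {simulate_share_with_hit():.3f}")
```

Under these assumptions roughly two out of three such studies contain at least one "significant" endpoint even though nothing works.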

p-hacking is the practice of adjusting analyses until a significant p-value emerges: analysing different subgroups, trying different statistical tests, removing individual outliers. Pre-registered study protocols and "intention-to-treat" analyses are meant to prevent this.

5. Confidence intervals as the better measure

Rather than just "significant yes/no", a 95% confidence interval expresses effect size AND precision in a single statement. HR 0.80 with 95% CI 0.72–0.90 is more informative than "HR 0.80, p < 0.001":

Point estimate: the single best estimate of the effect (HR 0.80 = 20% reduction).
Interval width: how certain this estimate is (0.72–0.90 = quite narrow = precise).
Whether 1.0 is included: signals whether the effect is statistically significant (1.0 NOT included = significant).

A wide CI means a lot of uncertainty. "HR 0.80 with 95% CI 0.35–1.80" means the data are compatible with anything between a 65% reduction and an 80% increase. That is practically no information.
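A reported confidence interval also lets the reader recover the approximate p-value. The sketch below assumes the interval was built with the usual normal approximation on the log hazard ratio scale (a Wald interval). This is common, but it is an assumption about how the numbers were produced, and the function name is hypothetical.

```python
from math import log
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def summarize_hr(hr: float, ci_low: float, ci_high: float) -> dict:
    """Back out standard error, z and two-sided p from an HR and its 95% CI."""
    # On the log scale the interval spans 2 * Z95 standard errors.
    se_log_hr = (log(ci_high) - log(ci_low)) / (2 * Z95)
    z = log(hr) / se_log_hr
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    excludes_one = ci_high < 1 or ci_low > 1
    return {"se_log_hr": se_log_hr, "z": z, "p": p, "excludes_one": excludes_one}


narrow = summarize_hr(0.80, 0.72, 0.90)
wide = summarize_hr(0.80, 0.35, 1.80)

for label, result in (("narrow CI", narrow), ("wide CI", wide)):
    print(f"{label}: SE = {result['se_log_hr']:.3f}, "
          f"p = {result['p']:.4f}, excludes 1.0: {result['excludes_one']}")
```

The point estimate is the same in both cases. The narrow interval corresponds to a p-value well below 0.001, the wide one to a p-value far above 0.05. The interval carries the p-value's information plus the size and precision of the effect.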

What you can do as a reader

When an article reports only p-values without showing effect sizes and CIs, be sceptical. A serious study reports both. And when someone says "X is significant", the follow-up question is worth asking: significant, but how large?