P-Values, Confidence Intervals, and Statistical vs Practical Significance
Key Takeaways
- A p-value speaks to compatibility between the observed data and a statistical model, not to effect size or real-world importance.
- Confidence intervals add information about precision and the range of effect sizes still compatible with the data.
- Statistical significance can occur for trivially small effects, especially in large samples, while practically important effects can fail to reach significance in small studies.
- In longevity research, interpretation is stronger when p-values, effect sizes, confidence intervals, endpoint relevance, and study design are read together.
Who This Is Useful For
This page is useful for readers who see a result described as “statistically significant” and want to know whether that actually means the finding matters. It is especially relevant when reading biomarker studies, intervention trials, and observational papers in which the reported p-value receives more attention than the effect size or the width of the confidence interval. [1] [2] [3]
P-values, confidence intervals, and effect sizes answer different questions. A p-value summarizes how incompatible the observed data are with a specified model, often a null hypothesis of no effect. A confidence interval shows a range of parameter values that remain reasonably compatible with the data under the chosen model. Neither measure, by itself, decides whether a result is important in practice. [1] [2] [4]
This distinction matters because the phrase “statistically significant” is often heard as a synonym for “meaningful” or “proven.” Statistical guidance from the American Statistical Association and related commentaries has repeatedly argued that such interpretations are too strong. Context, design quality, effect magnitude, and decision relevance still matter. [1] [5] [6]
What a P-Value Does and Does Not Tell You
A p-value is the probability, under a specified statistical model, of obtaining results at least as extreme as those observed. It is not the probability that the null hypothesis is true, and it is not the probability that the result happened “by chance” in some general everyday sense. Those are common but incorrect interpretations. [1] [2] [3]
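To make the definition concrete, here is a minimal Python sketch using SciPy's Welch t-test on simulated data; the group sizes, means, and spreads are arbitrary assumptions chosen only for illustration.

```python
# Minimal sketch of what a p-value is: the probability, under the null
# model, of a test statistic at least as extreme as the one observed.
# All numbers here are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical biomarker values in two groups of 20 people each.
control = rng.normal(loc=50.0, scale=10.0, size=20)
treated = rng.normal(loc=55.0, scale=10.0, size=20)

# Two-sided Welch t-test: p answers "how incompatible are these data
# with a model in which the two group means are equal?"
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Note what p does NOT report: the probability that the null is true,
# or the size of the difference in meaningful units.
```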
A small p-value can arise when the underlying effect is large, but it can also arise when the effect is small and the sample is large enough to estimate that small difference precisely. Conversely, a larger p-value can reflect a genuinely small or absent effect, but it can also reflect limited information and wide uncertainty. This is why p-values alone do not tell readers what the effect size means in practice. [2] [8] [9]
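A short simulation makes the sample-size point tangible. In this sketch the true group difference is fixed at a trivially small 0.05 standard deviations; only the number of participants per group changes. The effect size and sample sizes are arbitrary assumptions.

```python
# Sketch: the same tiny true effect yields a large p-value in a small
# sample and a very small p-value in a huge one. Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_shift = 0.05  # trivially small effect, in standard-deviation units

for n in (50, 200_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_shift, 1.0, size=n)
    _, p = stats.ttest_ind(b, a)
    print(f"n per group = {n:>7}: p = {p:.2g}")

# Typical output: p usually well above 0.05 at n = 50, but p far below
# 0.05 at n = 200,000, even though the underlying effect never changed.
```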
What Confidence Intervals Add
Confidence intervals add two kinds of information that a p-value compresses away: the estimated size of the effect and the precision of that estimate. A narrow interval suggests greater precision; a wide interval signals that the data remain compatible with a broad range of effects, sometimes including both trivial and important values. [4] [10] [11]
Intervals also make it easier to ask whether the data are consistent with effects that would matter in a substantive sense. That is one reason methodologists have long argued that interval estimates are more informative than binary declarations of significance alone. [4] [10] [12]
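The following sketch shows how the same point estimate can carry very different amounts of information. The `ci95` helper and all numbers are hypothetical; it uses the standard normal-approximation interval, estimate ± 1.96 × standard error.

```python
# Sketch: identical point estimates, very different precision.
# The estimates and standard errors are invented.

def ci95(estimate, standard_error):
    """Normal-approximation 95% confidence interval."""
    half_width = 1.96 * standard_error
    return estimate - half_width, estimate + half_width

# Same estimated effect, very different uncertainty.
print(ci95(2.0, 0.5))  # (1.02, 2.98): effect pinned down fairly well
print(ci95(2.0, 2.0))  # (-1.92, 5.92): compatible with harm, with
                       # nothing, and with a large benefit
```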
Statistical Significance vs Practical Significance
| Question | Statistical Significance | Practical Significance |
|---|---|---|
| Core concern | Whether the data are sufficiently incompatible with a specified null model under preset assumptions | Whether the estimated effect is large enough to matter in a scientific, clinical, or policy context |
| Typical summary | P-value or a threshold such as p < 0.05 | Effect size, absolute difference, risk difference, or a domain-specific threshold such as a minimal important difference |
| Can be misleading when | Large samples make tiny effects look “significant” or small samples hide important effects | The effect looks meaningful but the estimate is too imprecise or biased to support strong conclusions |
| Best read with | Confidence intervals, design quality, multiplicity, and prior plausibility | Confidence intervals, endpoint relevance, baseline risk, and the decision context |
1. Significant Does Not Necessarily Mean Important
In large datasets, very small deviations from a null value can produce low p-values. That can be useful for detecting subtle patterns, but it does not mean the observed effect is large enough to matter for health, function, or decision-making. Effect-size interpretation therefore cannot be replaced by a significance threshold. [1] [8] [9]
Clinical and applied fields often address this by comparing observed effects against a minimal clinically important difference or a similar decision threshold. The same logic applies more broadly in longevity research: a biomarker shift can be statistically detectable without being large enough to imply a meaningful change in ageing trajectories or health outcomes. [7] [8] [13]
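A sketch of that comparison logic follows, with a hypothetical minimal important difference (MID) and invented study results; real thresholds are domain-specific and must come from the relevant clinical literature.

```python
# Sketch: reading a confidence interval against a minimal important
# difference (MID). All values are hypothetical.
mid = 5.0                          # smallest change assumed to matter
estimate, lo, hi = 1.2, 0.9, 1.5   # precise but tiny estimated effect

statistically_significant = not (lo <= 0.0 <= hi)  # CI excludes zero
clearly_important = lo >= mid      # even the low end clears the MID

print(statistically_significant)   # True: detectable difference
print(clearly_important)           # False: real, but too small to matter
```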
2. Non-Significant Does Not Necessarily Mean No Important Effect
A non-significant result can reflect absence of a meaningful effect, but it can also reflect limited precision. When the confidence interval is wide, the data may still be compatible with effects that are scientifically or clinically important. Treating every non-significant result as evidence of no effect is therefore a common interpretive error. [2] [11] [12]
This problem is especially relevant in small proof-of-concept studies, early geroscience trials, and studies using noisy biomarkers. In those settings, the better question is often not whether p crossed 0.05, but what range of plausible effects remains after accounting for uncertainty. [10] [13] [14]
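A small simulation illustrates the precision problem. Here the true effect is assumed to be a meaningful 0.5 standard deviations, yet with only 15 participants per arm most simulated trials come out "non-significant." All parameters are invented for illustration.

```python
# Sketch: a real, meaningful effect that small trials usually miss.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, effect, trials = 15, 0.5, 2000  # assumed true effect of 0.5 SD

significant = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    _, p = stats.ttest_ind(b, a)
    significant += p < 0.05

print(f"share of trials with p < 0.05: {significant / trials:.0%}")
# Roughly a quarter to a third of runs reach p < 0.05: most of these
# underpowered trials are "non-significant" despite a genuine effect.
```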
3. Confidence Intervals Help Separate Precision From Magnitude
Two studies can report similar p-values yet imply very different levels of certainty about the size of an effect. The study with the narrower interval is estimating the effect more precisely, while the study with the wider interval leaves more room for competing interpretations. Looking at the interval therefore helps distinguish statistical detectability from evidential sharpness. [2] [4] [10]
Intervals are also useful for comparing whether the data remain compatible with trivial effects, moderate effects, or very large effects. That is often more informative for real-world interpretation than a binary label such as “significant” or “not significant.” [5] [6] [10]
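The sketch below contrasts two hypothetical studies with essentially the same p-value but very different interval widths, using the normal approximation; the estimates and standard errors are invented.

```python
# Sketch: same p-value, very different evidential sharpness.
from scipy import stats

def summarize(estimate, se):
    z = estimate / se
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return p, (estimate - 1.96 * se, estimate + 1.96 * se)

print(summarize(1.0, 0.4))   # p ≈ 0.012, CI ≈ (0.22, 1.78)
print(summarize(10.0, 4.0))  # p ≈ 0.012, CI ≈ (2.16, 17.84)
# Identical p-values, but the second interval is compatible with
# effects ranging from modest to very large.
```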
4. Why This Matters in Longevity Research
Longevity research often works with indirect outcomes such as biomarker changes, functional measures, or short-term disease-risk markers rather than lifespan itself. In that setting, a statistically significant result can still leave open a bigger interpretive question: whether the observed change is large enough, durable enough, and mechanistically relevant enough to matter for ageing-related outcomes. [13] [14]
This is one reason geroscience frameworks emphasize endpoint selection, replication, and triangulation across biomarkers, function, and clinical outcomes. Statistical significance may be part of the evidence, but it is not the whole evidential argument. [13] [14]
What This Does Not Mean
- It does not mean p-values are useless; it means they answer a narrower question than many readers assume. [1] [2]
- It does not mean confidence intervals solve bias, confounding, or poor measurement on their own. [2] [10]
- It does not mean every practically interesting effect should be trusted if it is non-significant. [11] [12]
- It does not mean statistical thresholds have no role; it means threshold crossing should not dominate interpretation. [5] [6]
Practical Interpretation Examples
- If a very large cohort finds a tiny but significant biomarker difference: the result may be statistically detectable while still being too small to matter much biologically or clinically. [8] [13]
- If a small trial reports p = 0.07 with a wide interval: that does not establish no effect; it often means the estimate is too imprecise for a firm conclusion. [2] [11]
- If an interval spans both trivial benefit and meaningful benefit: the study has not yet pinned down practical significance, even if the point estimate looks promising. [4] [10]
- If one study is significant and another is not: that difference alone does not prove the studies disagree, because the underlying effect estimates may still be compatible within their uncertainty ranges, as the sketch after this list illustrates. [12]
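A compact sketch of that last point, in the spirit of Gelman and Stern [12]: testing the difference between a "significant" and a "non-significant" study directly can show that the two estimates are statistically compatible. All numbers are invented.

```python
# Sketch: "significant" vs "non-significant" does not mean the two
# studies disagree. Estimates and standard errors are hypothetical.
from scipy import stats

est1, se1 = 2.5, 1.0   # study 1: z = 2.5, p ≈ 0.01 ("significant")
est2, se2 = 1.5, 1.0   # study 2: z = 1.5, p ≈ 0.13 ("not significant")

# Test the difference between the two estimates directly.
diff = est1 - est2
se_diff = (se1**2 + se2**2) ** 0.5
p_diff = 2 * stats.norm.sf(abs(diff / se_diff))
print(f"p for the between-study difference = {p_diff:.2f}")  # ≈ 0.48
# The two studies' estimates are entirely compatible with each other.
```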
Summary
P-values, confidence intervals, and practical significance should not be collapsed into a single idea. P-values address model-data compatibility, confidence intervals convey both the estimated effect size and the precision of that estimate, and practical significance asks whether the effect is large enough to matter. In longevity research, where endpoints are often indirect and effect sizes can be modest, interpreting all three together gives a more credible picture than relying on significance labels alone. [1] [10] [13]
References
1. Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician. https://doi.org/10.1080/00031305.2016.1154108
2. Greenland, S., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology. https://pmc.ncbi.nlm.nih.gov/articles/PMC4877414/
3. Goodman, S. N. (2008). A dirty dozen: twelve p-value misconceptions. Seminars in Hematology. https://pubmed.ncbi.nlm.nih.gov/18582619/
4. Gardner, M. J., & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal. https://www.bmj.com/content/292/6522/746
5. Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature. https://www.nature.com/articles/d41586-019-00857-9
6. Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”. The American Statistician. https://doi.org/10.1080/00031305.2019.1583913
7. Jaeschke, R., Singer, J., & Guyatt, G. H. (1989). Measurement of health status: ascertaining the minimal clinically important difference. Controlled Clinical Trials. https://pubmed.ncbi.nlm.nih.gov/2646486/
8. Sullivan, G. M., & Feinn, R. (2012). Using effect size, or why the P value is not enough. Journal of Graduate Medical Education. https://pmc.ncbi.nlm.nih.gov/articles/PMC3444174/
9. Sterne, J. A. C., & Smith, G. D. (2001). Sifting the evidence: what’s wrong with significance tests? BMJ. https://www.bmj.com/content/322/7280/226
10. Ho, J., Tumkaya, T., Aryal, S., Choi, H., & Claridge-Chang, A. (2019). Moving beyond P values: data analysis with estimation graphics. Nature Methods. https://www.nature.com/articles/s41592-019-0470-3
11. Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. BMJ. https://www.bmj.com/content/311/7003/485
12. Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician. https://doi.org/10.1198/000313006X152649
13. Justice, J. N., et al. (2016). Frameworks for proof-of-concept clinical trials of interventions that target fundamental aging processes. Journals of Gerontology Series A. https://pmc.ncbi.nlm.nih.gov/articles/PMC5055651/
14. Kritchevsky, S. B., et al. (2024). Biomarkers of aging and the translational geroscience challenge. Nature Aging. https://www.nature.com/articles/s43587-024-00615-8
This content is provided for educational purposes only and does not constitute medical advice.