
Understanding P-Values in Statistics

CalculatorGlobe Team February 23, 2026 13 min read Math

P-values are arguably the most used — and most misunderstood — concept in statistics. They appear in scientific papers, medical studies, business reports, and news headlines. A result is "statistically significant" when the p-value falls below a threshold, typically 0.05. But what does that actually mean? And more importantly, what does it not mean?

This guide breaks down p-values from the ground up, explains the hypothesis testing framework, walks through calculation examples, addresses the most dangerous misconceptions, and provides guidance on using p-values responsibly as part of a broader statistical toolkit.

What Is a P-Value?

A p-value is the probability of observing results at least as extreme as what you found in your data, assuming the null hypothesis is true. In simpler terms, it answers the question: "If there really is no effect or no difference, how surprising is my data?"

A small p-value (close to 0) means your data would be very unlikely under the null hypothesis, suggesting something other than chance is at work. A large p-value (close to 1) means your data is consistent with the null hypothesis — it does not prove the null hypothesis is true, but it fails to provide evidence against it.

Analogy: The Courtroom

Think of the null hypothesis as "innocent until proven guilty." The p-value measures the strength of evidence against innocence. A very small p-value (strong evidence) leads you to reject innocence (reject the null). A large p-value (weak evidence) means you cannot reject innocence — but it does not prove innocence. Just as "not guilty" is different from "innocent," failing to reject the null hypothesis is different from proving it true.

The Hypothesis Testing Framework

P-values only make sense within the context of hypothesis testing. Here is the framework:

  1. State the null hypothesis (H₀). This is the default assumption of no effect, no difference, or no relationship. For example: "The new drug has no effect on blood pressure" or "There is no difference in conversion rates between website designs A and B."
  2. State the alternative hypothesis (H₁ or Hₐ). This is what you are trying to find evidence for. For example: "The new drug lowers blood pressure" or "Design B has a different conversion rate than Design A."
  3. Choose a significance level (α). This is the threshold below which you will reject the null hypothesis. The convention is α = 0.05, meaning you accept a 5% risk of incorrectly rejecting a true null hypothesis (Type I error).
  4. Collect data and calculate a test statistic. Depending on your data type and research question, this might be a t-statistic, z-statistic, chi-square statistic, or F-statistic.
  5. Calculate the p-value. Using the test statistic and its probability distribution, determine how likely you are to observe a result this extreme (or more extreme) if H₀ is true.
  6. Make a decision. If p ≤ α, reject H₀ (statistically significant). If p > α, fail to reject H₀ (not statistically significant).

How P-Values Are Calculated

The p-value is computed from the test statistic using the appropriate probability distribution. The general process involves three steps: calculate the test statistic from your data, determine which probability distribution applies, and find the area in the tail(s) of that distribution beyond your test statistic.

Example: One-Sample Z-Test

A factory claims its light bulbs last an average of 1,000 hours. You test 64 bulbs and find a mean life of 980 hours with a known population standard deviation of 80 hours. Does this provide evidence that the true mean is different from 1,000?

Step 1: State hypotheses

H₀: μ = 1,000 hours, H₁: μ ≠ 1,000 hours (two-tailed)

Step 2: Calculate the test statistic

z = (x̄ - μ₀) / (σ / √n)

z = (980 - 1000) / (80 / √64)

z = -20 / 10 = -2.0

Step 3: Find the p-value

P(Z ≤ -2.0) = 0.0228 (area in left tail)

Two-tailed p-value = 2 × 0.0228 = 0.0456

Step 4: Decision

Since p = 0.0456 < 0.05, we reject the null hypothesis. There is statistically significant evidence that the true mean bulb life differs from 1,000 hours at the 5% significance level.
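The calculation above can be reproduced in a few lines of Python. This is a sketch using only the standard library; `phi` implements the standard normal CDF via `math.erf`. Note that the unrounded two-tailed p-value is about 0.0455; the 0.0456 above comes from doubling the tail area after rounding it to 0.0228.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Claimed mean, sample mean, population SD, sample size
mu0, xbar, sigma, n = 1000, 980, 80, 64

z = (xbar - mu0) / (sigma / sqrt(n))   # (980 - 1000) / (80 / 8) = -2.0
p_two_tailed = 2 * phi(-abs(z))        # area in both tails beyond |z|

print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.4f}")   # z = -2.00, p = 0.0455
```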

One-Tailed vs Two-Tailed Tests

The choice between one-tailed and two-tailed tests affects how the p-value is calculated and interpreted:

Two-Tailed Test

Tests for a difference in either direction.

H₁: μ ≠ μ₀

P-value = probability in both tails combined

Use when: You want to detect any difference, regardless of direction. This is the default choice for most research.

One-Tailed Test

Tests for a difference in one specific direction.

H₁: μ > μ₀ or H₁: μ < μ₀

P-value = probability in one tail only

Use when: You have a strong prior reason to test only one direction and an effect in the other direction is irrelevant.

For the same test statistic, a one-tailed p-value is exactly half the two-tailed p-value, provided the observed effect lies in the predicted direction. In the light bulb example above, the one-tailed p-value (testing whether bulbs last less than 1,000 hours) would be 0.0228 instead of 0.0456.
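The halving relationship can be checked directly, reusing z = −2.0 from the light bulb example (standard library only):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = -2.0                          # test statistic from the light bulb example
p_one_tailed = phi(z)             # H1: mu < mu0, left tail only
p_two_tailed = 2 * phi(-abs(z))   # H1: mu != mu0, both tails

print(f"one-tailed p = {p_one_tailed:.4f}")   # one-tailed p = 0.0228
print(f"two-tailed p = {p_two_tailed:.4f}")   # two-tailed p = 0.0455
```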

Try Our P-Value Calculator

Enter your test statistic and choose your test type to instantly calculate the p-value with visual probability representation.

Use Calculator

Interpreting P-Values Correctly

Correct interpretation of p-values requires understanding what they do and do not tell you. A p-value provides evidence about the compatibility of your data with the null hypothesis, but it is just one piece of information in a larger analytical picture.

What a P-Value Tells You

  • The probability of observing your data (or more extreme) if the null hypothesis is true
  • How compatible your data is with the null hypothesis
  • A continuous measure of evidence — smaller values represent stronger evidence against H₀

What a P-Value Does NOT Tell You

  • The probability that the null hypothesis is true or false
  • The probability that your results are due to chance
  • The size or practical importance of the effect
  • Whether the result will replicate in another study
  • Whether your study was well-designed or your data trustworthy

The American Statistical Association issued a formal statement in 2016 emphasizing that p-values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone, and that a p-value threshold alone should not determine whether a scientific finding is considered real.

Common Misconceptions About P-Values

These misconceptions are pervasive even among trained researchers. Understanding them is essential for responsible data analysis.

  • Misconception: "p = 0.03 means there is a 3% chance the null hypothesis is true." This confuses the probability of the data given H₀ with the probability of H₀ given the data. These are fundamentally different quantities. Computing the probability that H₀ is true requires Bayesian methods and prior probabilities.
  • Misconception: "p = 0.03 means there is a 97% chance the alternative hypothesis is true." This follows from the same error. The p-value cannot be subtracted from 1 to get the probability that H₁ is true. Hypothesis testing provides evidence, not proof.
  • Misconception: "A non-significant p-value proves there is no effect." Failure to reject the null hypothesis means the data does not provide sufficient evidence against it. The effect might exist but be too small for your sample to detect. This is the distinction between "absence of evidence" and "evidence of absence."
  • Misconception: "A smaller p-value means a larger effect." P-values are influenced by both effect size and sample size. With a large enough sample, even a tiny, meaningless effect can produce a very small p-value. A p-value of 0.001 does not necessarily indicate a more important finding than p = 0.04.
  • Misconception: "If p > 0.05 but p < 0.10, the result is 'marginally significant.'" This phrase creates a false middle ground. The data either provides sufficient evidence to reject H₀ at your predetermined alpha level or it does not. Treating borderline results as partially significant encourages selective reporting.
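The fourth misconception above (that a smaller p-value means a larger effect) can be demonstrated analytically. In this sketch, the same small mean difference (an illustrative 0.5 units against a standard deviation of 10) slides from clearly non-significant to highly significant purely because the sample grows:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

effect, sigma = 0.5, 10.0   # a tiny difference relative to the spread

for n in (100, 1_000, 10_000):
    z = effect / (sigma / sqrt(n))   # same effect, shrinking standard error
    p = 2 * phi(-abs(z))
    print(f"n = {n:>6}: z = {z:.2f}, two-tailed p = {p:.4f}")
```

The effect size never changes; only the standard error shrinks, which is why p-values cannot be read as measures of importance.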

Practical Examples

Example 1: Natalie Tests a New Teaching Method

Natalie teaches two sections of the same chemistry course. Section A (32 students) uses her new interactive method, and Section B (35 students) uses the traditional lecture format. On the final exam, Section A averages 78.5 and Section B averages 74.2. She conducts a two-sample t-test.

Test statistic: t = 2.18, df = 65

Two-tailed p-value: 0.033

Since p = 0.033 < 0.05, Natalie finds a statistically significant difference in exam scores. However, the 4.3-point difference on a 100-point exam is a modest effect. She considers whether this improvement justifies the extra preparation time the interactive method requires, illustrating the distinction between statistical and practical significance.
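Natalie's p-value can be reproduced from the t statistic and the pooled degrees of freedom (df = 32 + 35 − 2 = 65). This sketch assumes SciPy is available:

```python
from scipy.stats import t as t_dist

t_stat, df = 2.18, 65                          # df = n1 + n2 - 2
p_two_tailed = 2 * t_dist.sf(abs(t_stat), df)  # area in both tails

print(f"two-tailed p = {p_two_tailed:.3f}")    # roughly 0.033
```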

Example 2: Omar Evaluates an Ad Campaign

Omar runs a digital ad campaign. He shows Ad A to 5,000 visitors (142 clicks, 2.84% CTR) and Ad B to 5,000 visitors (168 clicks, 3.36% CTR). He uses a two-proportion z-test to determine if Ad B performs significantly better.

z = 1.50

Two-tailed p-value: 0.134

Since p = 0.134 > 0.05, the difference is not statistically significant. Omar cannot confidently conclude that Ad B outperforms Ad A based on this data. He decides to continue the test with more traffic rather than prematurely switching to Ad B. With 20,000 visitors per ad instead of 5,000, the same proportional difference would likely become significant.
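Omar's test can be reproduced from the raw counts with a pooled two-proportion z-test (standard library only):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

clicks_a, n_a = 142, 5000
clicks_b, n_b = 168, 5000

p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)            # pooled click rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # pooled standard error

z = (p_b - p_a) / se
p_two_tailed = 2 * phi(-abs(z))

print(f"z = {z:.2f}, two-tailed p = {p_two_tailed:.3f}")   # z = 1.50, p = 0.134
```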

Example 3: Dr. Patel Analyzes a Drug Trial

Dr. Patel conducts a clinical trial comparing a new cholesterol medication to placebo. With 500 patients per group, the treatment group shows an average LDL reduction of 12 mg/dL more than placebo.

t = 4.82, df = 998

p-value < 0.001

The result is highly statistically significant. But Dr. Patel also evaluates the clinical significance: a 12 mg/dL additional reduction is meaningful for patients at high cardiovascular risk. She reports both the p-value and the 95% confidence interval for the difference (roughly 7.1 to 16.9 mg/dL, as implied by t = 4.82 and the 12 mg/dL point estimate) to give a complete picture of the treatment effect. The confidence interval shows the effect remains clinically relevant across its entire range.

Try Our Z-Score Calculator

Convert between z-scores and p-values, and find critical values for any significance level.

Use Calculator

P-Value Reference Table

This table provides a quick reference for interpreting p-values and the corresponding z-scores and evidence strength. Remember that these interpretations are guidelines, not rigid rules.

| P-Value Range    | Two-Tailed z (absolute value) | Evidence Against H₀ | Conventional Decision          | Common Notation               |
|------------------|-------------------------------|---------------------|--------------------------------|-------------------------------|
| p > 0.10         | z < 1.645                     | Weak or none        | Fail to reject H₀              | NS (not significant)          |
| 0.05 < p ≤ 0.10  | 1.645 ≤ z < 1.96              | Suggestive          | Fail to reject H₀ at α = 0.05  | NS (borderline)               |
| 0.01 < p ≤ 0.05  | 1.96 ≤ z < 2.576              | Moderate            | Reject H₀                      | * (significant)               |
| 0.001 < p ≤ 0.01 | 2.576 ≤ z < 3.291             | Strong              | Reject H₀                      | ** (highly significant)       |
| p ≤ 0.001        | z ≥ 3.291                     | Very strong         | Reject H₀                      | *** (very highly significant) |
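The two-tailed z cutoffs in the table can be reproduced with the standard library's inverse normal CDF:

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

for p in (0.10, 0.05, 0.01, 0.001):
    z_crit = std_normal.inv_cdf(1 - p / 2)   # two-tailed critical value
    print(f"p = {p}: z = {z_crit:.3f}")
```

This prints 1.645, 1.960, 2.576, and 3.291, matching the table.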

Beyond P-Values: Better Statistical Practices

The statistical community increasingly advocates for supplementing or replacing p-values with more informative approaches. Here are the most important complementary tools:

  • Effect sizes. Measures like Cohen's d, odds ratios, and correlation coefficients quantify the magnitude of an effect independently of sample size. An effect size of d = 0.2 is small, d = 0.5 is medium, and d = 0.8 is large. Report effect sizes alongside p-values so readers can judge practical importance.
  • Confidence intervals. A 95% confidence interval provides both the point estimate and its precision. If the interval for a treatment difference is (0.5, 15.2), you know the effect is statistically significant (it excludes zero) and you can see the range of plausible effect sizes. This is far more informative than reporting p < 0.05 alone.
  • Pre-registration. Publishing your hypotheses, sample size, and analysis plan before collecting data prevents p-hacking and selective reporting. Pre-registered studies are more credible because they demonstrate that the analysis was planned, not data-driven.
  • Bayesian methods. Bayesian analysis directly calculates the probability of hypotheses given the data (what most people incorrectly think p-values tell them). It combines prior knowledge with observed data to produce posterior probabilities. Bayesian approaches also handle the "evidence for the null" problem that frequentist methods cannot address.
  • Replication. A single significant result is just one piece of evidence. True scientific confidence comes from multiple studies finding consistent results. Focus on whether effects replicate rather than whether any individual p-value crosses 0.05.
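As a sketch of the first two practices, the following computes Cohen's d (with a pooled standard deviation) and a normal-approximation 95% confidence interval for the mean difference. The two samples and the simple z-based interval are illustrative only:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

group_a = [78, 82, 75, 80, 85, 79, 77, 83]   # hypothetical scores
group_b = [72, 76, 70, 74, 78, 73, 71, 75]

n_a, n_b = len(group_a), len(group_b)
diff = mean(group_a) - mean(group_b)

# Pooled standard deviation for Cohen's d
s_pool = sqrt(((n_a - 1) * stdev(group_a) ** 2 +
               (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2))
d = diff / s_pool

# Normal-approximation 95% CI for the mean difference
se = s_pool * sqrt(1 / n_a + 1 / n_b)
z_crit = NormalDist().inv_cdf(0.975)
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"difference = {diff:.2f}, d = {d:.2f}, "
      f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Reporting the difference, its standardized effect size, and its interval together conveys far more than a lone p-value.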

Common Mistakes to Avoid

  • Treating p-values as binary. Results do not flip from "false" to "true" at p = 0.05. A p-value of 0.04 provides slightly more evidence against H₀ than p = 0.06, not a categorically different conclusion. Report exact p-values and interpret them on a continuous scale.
  • Multiple comparisons without correction. If you test 20 hypotheses at α = 0.05, you expect one false positive by chance alone. When running multiple tests, apply corrections like Bonferroni (divide α by the number of tests) or Benjamini-Hochberg (controls the false discovery rate) to maintain overall error rates.
  • Confusing a large p-value with evidence for the null. A p-value of 0.40 does not mean the null hypothesis is true. It means your data is consistent with the null, but it might also be consistent with many alternative hypotheses. Use equivalence testing or Bayesian methods if you want to provide evidence that an effect is negligibly small.
  • Ignoring assumptions of the test. P-values from parametric tests (t-tests, ANOVA) assume specific conditions about your data: independence, normality, equal variances. Violating these assumptions can make p-values unreliable. Check assumptions before reporting results, and use non-parametric alternatives when assumptions are violated.
  • HARKing (Hypothesizing After Results are Known). Formulating your hypothesis after seeing the data and then testing it on the same data is circular reasoning. It dramatically inflates false positive rates. If exploratory analysis reveals an interesting pattern, treat it as a hypothesis to be tested on new data.
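The Bonferroni correction mentioned above is simple to apply. Here is a sketch with hypothetical p-values from 20 tests:

```python
alpha, m = 0.05, 20
p_values = [0.001, 0.012, 0.030, 0.048] + [0.20] * 16   # hypothetical results

bonferroni_alpha = alpha / m   # 0.05 / 20 = 0.0025
significant = [p for p in p_values if p <= bonferroni_alpha]

print(f"uncorrected: {sum(p <= alpha for p in p_values)} significant")   # 4
print(f"Bonferroni:  {len(significant)} significant")                    # 1
```

Three of the four "significant" uncorrected results fail the corrected threshold, which is exactly the protection against chance findings that the correction provides.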

Try Our Confidence Interval Calculator

Build confidence intervals to complement your p-value analysis with effect size estimates.

Use Calculator

Frequently Asked Questions

What does a p-value less than 0.05 mean?

A p-value less than 0.05 means that if the null hypothesis were true, there would be less than a 5% chance of observing results as extreme as (or more extreme than) what you actually found. At the conventional 0.05 significance level, this is considered statistically significant, meaning you would reject the null hypothesis. However, statistical significance does not automatically mean practical importance. A very large study can produce a small p-value for a trivially small effect.

Is the p-value the probability that my hypothesis is true?

No, this is one of the most widespread misconceptions about p-values. A p-value is not the probability that the null hypothesis is true or that the alternative hypothesis is true. It only tells you the probability of observing your specific data (or more extreme data) assuming the null hypothesis is true. The probability that a hypothesis is true requires Bayesian methods, which incorporate prior beliefs about the hypothesis alongside the observed data.

Why is 0.05 the standard significance threshold?

The 0.05 threshold is a convention, not a natural law. It was popularized by statistician Ronald Fisher in the 1920s as a convenient benchmark. Fisher suggested that a 1-in-20 chance was a reasonable cutoff for identifying results worth further investigation. Many fields now recognize that rigidly applying 0.05 is problematic and advocate for reporting exact p-values alongside effect sizes and confidence intervals, letting readers judge significance in context rather than relying on a binary threshold.

What is the difference between statistical and practical significance?

Statistical significance means the observed effect is unlikely due to random chance alone, based on a chosen alpha level. Practical significance means the effect is large enough to matter in the real world. A drug that lowers blood pressure by 0.5 mmHg might be statistically significant with 10,000 participants, but clinically meaningless because the effect is too small to benefit patients. Always report effect sizes alongside p-values so readers can assess both statistical and practical significance.

Is a p-value of exactly 0.05 statistically significant?

This is a gray area that highlights the limitations of using a rigid cutoff. Some researchers treat p = 0.05 as significant, others do not. The best practice is to avoid treating 0.05 as a sharp boundary. Report the exact p-value and let readers judge. A result with p = 0.049 is practically indistinguishable from p = 0.051 in terms of evidence strength, and framing them as categorically different undermines the continuous nature of statistical evidence.

How does sample size affect p-values?

Larger samples produce smaller p-values for the same effect size because they reduce sampling variability, making it easier to distinguish real effects from noise. This is why large studies can find statistically significant but practically meaningless effects. Conversely, small studies may fail to reach significance even when a meaningful effect exists. This relationship between sample size and p-values is why reporting effect sizes and confidence intervals alongside p-values provides a more complete picture of your results.

What is p-hacking?

P-hacking refers to practices that manipulate data analysis to achieve a statistically significant p-value. Examples include testing many variables and only reporting those with p less than 0.05, removing outliers selectively, stopping data collection when significance is first reached, or trying multiple statistical tests until one gives a desired result. P-hacking inflates false positive rates far beyond the nominal 5% and is a major contributor to the replication crisis in science. Pre-registering your analysis plan before collecting data is the most effective prevention.

Should I use a one-tailed or a two-tailed test?

Use a two-tailed test unless you have a strong theoretical reason to predict the direction of the effect before looking at the data. Two-tailed tests are more conservative and widely accepted because they test for effects in both directions. One-tailed tests have more statistical power in the predicted direction but cannot detect effects in the opposite direction. Some reviewers view one-tailed tests with suspicion if they suspect the choice was made after seeing the data to achieve a smaller p-value.


CalculatorGlobe Team

Content & Research Team

The CalculatorGlobe team creates in-depth guides backed by authoritative sources to help you understand the math behind everyday decisions.


Last updated: February 23, 2026