P-Value Calculator — Free Online Hypothesis Testing Tool
Determine statistical significance by calculating p-values from test statistics with support for two-tailed, left-tailed, and right-tailed tests.
Example Output
Test Statistic: z = 2.1500 (two-tailed)
Φ(z) = 0.984222
P-Value: 0.031555
Interpretation: Strong evidence against the null hypothesis.
How to Use the P-Value Calculator
- Enter your test statistic: Input the z-score or test statistic from your hypothesis test. This value typically comes from a z-test, and it represents how far your observed result is from the null hypothesis in standard units.
- Select the test type: Choose "Two-Tailed" when testing for any difference from the null hypothesis (most common). Choose "Left-Tailed" when testing specifically whether a parameter is less than the hypothesized value. Choose "Right-Tailed" when testing whether a parameter is greater than the hypothesized value.
- Read the p-value: The p-value appears prominently at the top of the results panel, calculated to six decimal places. A smaller p-value indicates stronger evidence against the null hypothesis.
- Check significance at common levels: The results automatically compare your p-value against three common significance levels (α = 0.01, 0.05, and 0.10), clearly indicating whether you should reject or fail to reject the null hypothesis at each level.
The calculator uses the standard normal (z) distribution for probability calculations. All results update in real time as you modify inputs. The interpretation guide provides a plain-language summary of the evidence strength.
P-Value Formulas
Two-Tailed P-Value
p = 2 × min(Φ(z), 1 - Φ(z))
Left-Tailed P-Value
p = Φ(z)
Right-Tailed P-Value
p = 1 - Φ(z)
Variables Explained
- p: The p-value — the probability of observing a test statistic at least as extreme as the one calculated, under the null hypothesis.
- z: The test statistic (z-score), representing how many standard deviations the observed result is from the null hypothesis value.
- Φ(z): The cumulative distribution function (CDF) of the standard normal distribution, giving P(Z ≤ z).
- α (alpha): The significance level — the threshold below which the p-value is considered statistically significant. Common values are 0.05, 0.01, and 0.10.
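As a quick sketch, all three formulas can be implemented with nothing beyond Python's standard-library error function, using the identity Φ(z) = ½(1 + erf(z/√2)). The function names here are illustrative, not the calculator's own code:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Phi(z) = P(Z <= z), via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(z, tail="two"):
    """P-value for a z test statistic; tail is 'two', 'left', or 'right'."""
    if tail == "two":
        return 2.0 * min(phi(z), 1.0 - phi(z))
    if tail == "left":
        return phi(z)
    if tail == "right":
        return 1.0 - phi(z)
    raise ValueError("tail must be 'two', 'left', or 'right'")

# The z = 2.15 two-tailed case shown at the top of the page (≈ 0.031555)
print(p_value(2.15, "two"))
```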
Step-by-Step Example
A researcher conducts a two-tailed z-test and obtains a test statistic of z = 2.35. Find the p-value and determine significance.
- Identify the test statistic: z = 2.35
- Find the CDF value: Φ(2.35) = 0.9906
- For a two-tailed test: p = 2 × min(0.9906, 1 - 0.9906) = 2 × 0.0094 = 0.0188
- Compare to α = 0.05: Since 0.0188 < 0.05, the result is statistically significant
- Compare to α = 0.01: Since 0.0188 > 0.01, the result is not significant at this stricter level
At the 5% significance level, we reject the null hypothesis. The observed test statistic of 2.35 would occur only about 1.88% of the time if the null hypothesis were true, providing strong evidence against it.
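The worked example can be checked in a few lines of Python, reusing the same erf-based CDF (variable names are illustrative):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = 2.35
p = 2.0 * min(phi(z), 1.0 - phi(z))  # two-tailed p-value, ≈ 0.0188
for alpha in (0.05, 0.01):
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {verdict}")
```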
Practical Examples
Example 1: Angela's A/B Test for Website Conversion
Angela is testing whether a new website design (B) increases conversion rates compared to the original (A). The original has a 3.2% conversion rate. After 10,000 visitors to design B, she observes a 3.8% conversion rate. Her z-test gives a test statistic of z = 2.45. She uses a right-tailed test since she is testing for an increase.
- Test statistic: z = 2.45 (right-tailed)
- P-value: 1 - Φ(2.45) = 1 - 0.9929 = 0.0071
- Significant at α = 0.05: Yes
- Significant at α = 0.01: Yes
With a p-value of 0.0071, the improvement in conversion rate is highly statistically significant. Angela can confidently implement design B: an improvement this large would occur less than 1% of the time through random variation alone if the new design were actually no better than the original.
Example 2: Professor Wang's Drug Trial Analysis
Professor Wang is testing whether a new medication lowers cholesterol more than a placebo. The z-test comparing the treatment group to the placebo group yields z = -1.82. He uses a left-tailed test (testing for a decrease in cholesterol).
- Test statistic: z = -1.82 (left-tailed)
- P-value: Φ(-1.82) = 0.0344
- Significant at α = 0.05: Yes
- Significant at α = 0.01: No
The p-value of 0.034 is below the 0.05 threshold, suggesting the drug does reduce cholesterol more than the placebo. However, it does not pass the stricter 0.01 threshold. Professor Wang may recommend a larger follow-up study to confirm the effect with greater confidence, especially given the medical implications of the finding.
Example 3: Rachel's Quality Assurance Testing
Rachel tests whether the average weight of cereal boxes differs from the labeled 500g. Her sample of 50 boxes yields a test statistic of z = 0.87. She uses a two-tailed test since the boxes could be over or under weight.
- Test statistic: z = 0.87 (two-tailed)
- P-value: 2 × (1 - 0.8078) = 2 × 0.1922 = 0.3844
- Significant at α = 0.05: No
- Significant at α = 0.10: No
With a p-value of 0.384, Rachel fails to reject the null hypothesis. The average box weight does not significantly differ from 500g. She reports that the production line is operating within acceptable tolerances, and no corrective action is needed. For a more detailed look at the data spread, she might use our confidence interval calculator.
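All three examples can be verified with one small Python helper — an illustrative sketch built on the standard library's erf, not the calculator's own implementation:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value(z, tail):
    """P-value for a z statistic; tail is 'two', 'left', or 'right'."""
    return {"left": phi(z),
            "right": 1.0 - phi(z),
            "two": 2.0 * min(phi(z), 1.0 - phi(z))}[tail]

examples = [("A/B test",   2.45, "right"),
            ("drug trial", -1.82, "left"),
            ("QA check",    0.87, "two")]
for name, z, tail in examples:
    p = p_value(z, tail)
    print(f"{name}: p = {p:.4f}, significant at 0.05: {p < 0.05}")
```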
P-Value Interpretation Reference Table
| P-Value Range | Evidence Against H₀ | Common Notation | Typical Decision |
|---|---|---|---|
| p < 0.001 | Extremely strong | *** | Reject H₀ |
| 0.001 ≤ p < 0.01 | Very strong | ** | Reject H₀ |
| 0.01 ≤ p < 0.05 | Strong | * | Reject H₀ |
| 0.05 ≤ p < 0.10 | Moderate (marginal) | . | Context-dependent |
| p ≥ 0.10 | Weak or none | ns | Fail to reject H₀ |
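The table above translates directly into a small lookup function. This Python sketch (names illustrative) returns the evidence wording and star notation for a given p-value:

```python
def evidence_label(p):
    """Map a p-value to the evidence category and notation from the table above."""
    if p < 0.001:
        return "extremely strong", "***"
    if p < 0.01:
        return "very strong", "**"
    if p < 0.05:
        return "strong", "*"
    if p < 0.10:
        return "moderate (marginal)", "."
    return "weak or none", "ns"

print(evidence_label(0.0188))  # the step-by-step example falls in the "strong" band
```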
Tips and Complete Guide
Understanding Type I and Type II Errors
A Type I error (false positive) occurs when you reject a true null hypothesis. The probability of this is α, your significance level. A Type II error (false negative) occurs when you fail to reject a false null hypothesis. The probability of this is β, and statistical power (1 - β) is the probability of correctly detecting a real effect. Reducing α increases the risk of Type II errors, creating a fundamental tradeoff. Choose α based on which error is more costly in your specific context.
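The tradeoff can be made concrete for a right-tailed z test: if the true effect is δ standard errors, power is 1 − Φ(z₁₋α − δ). A Python sketch using the standard library's NormalDist; the effect size δ = 2.8 is a hypothetical choice, not from the text:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def power_one_sided(delta, alpha):
    """Power of a right-tailed z test when the true effect is delta SE units.
    Power = P(Z > z_crit - delta) = 1 - Phi(z_crit - delta)."""
    z_crit = nd.inv_cdf(1.0 - alpha)
    return 1.0 - nd.cdf(z_crit - delta)

# Tightening alpha (fewer Type I errors) lowers power (more Type II errors)
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha}: power = {power_one_sided(2.8, alpha):.3f}")
```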
Effect Size and Practical Significance
A p-value alone does not tell you how large or important an effect is. A statistically significant result (p < 0.05) with a tiny effect size may be practically meaningless, especially with large samples. Always report effect sizes alongside p-values. Cohen's d measures the standardized difference between means: d = 0.2 is small, d = 0.5 is medium, and d = 0.8 is large. With our sample size calculator, you can determine how many observations you need to detect a specific effect size.
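Cohen's d is the difference between the two group means divided by their pooled standard deviation. A minimal Python sketch (all numbers below are hypothetical):

```python
from math import sqrt

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d: standardized mean difference using the pooled SD."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical groups: a 5-point gain with SD ~10 is a "medium" effect (d = 0.5)
print(round(cohens_d(105, 100, 10, 10, 50, 50), 2))
```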
Multiple Testing and Correction Methods
When testing multiple hypotheses simultaneously (such as comparing 10 groups), the probability of at least one false positive increases dramatically. With 20 independent tests at α = 0.05, there is about a 64% chance of at least one false positive. The Bonferroni correction divides α by the number of tests (0.05 / 20 = 0.0025). The Holm-Bonferroni method is a less conservative alternative that maintains better power while controlling the family-wise error rate.
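Both the inflation and the corrections are easy to compute. This Python sketch (illustrative, not a validated statistics library) reproduces the 64% figure and applies Holm-Bonferroni to a hypothetical set of p-values:

```python
def familywise_error(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni: step down through the sorted p-values, comparing the
    k-th smallest against alpha / (m - k). Returns a reject flag per test."""
    m = len(p_values)
    rejected = [False] * m
    order = sorted(range(m), key=lambda i: p_values[i])
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

print(round(familywise_error(0.05, 20), 2))  # about 0.64, as noted above
print(holm_bonferroni([0.001, 0.01, 0.03, 0.04]))  # hypothetical p-values
```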
Common Mistakes to Avoid
- Interpreting p-value as the probability H₀ is true: A p-value of 0.03 does not mean there is a 3% chance the null hypothesis is true. It means there is a 3% chance of observing data this extreme if the null hypothesis were true. This distinction is fundamental but often confused.
- Treating p = 0.049 and p = 0.051 as fundamentally different: Statistical significance is not a binary cliff. A p-value of 0.049 is essentially the same strength of evidence as 0.051. Avoid the temptation to celebrate one while dismissing the other. Report exact p-values and let readers judge the evidence.
- Using one-tailed tests to get smaller p-values: Choosing a one-tailed test after seeing the data direction is a form of p-hacking. The decision between one-tailed and two-tailed tests must be made before data analysis based on your research question, not based on what makes the result significant.
- Confusing "not significant" with "no effect": Failing to reject the null hypothesis does not prove the null hypothesis is true. It only means you lack sufficient evidence to reject it. The difference could be real but too small for your sample to detect. Absence of evidence is not evidence of absence.
- Ignoring effect size with large samples: With 100,000 observations, virtually any non-zero difference becomes statistically significant. Always pair your p-value with an effect size measure to assess practical importance.
Frequently Asked Questions
What is a p-value?
A p-value is the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true. In simpler terms, it answers: 'If nothing special is happening (null hypothesis is true), how likely is it that I would see data this unusual?' A small p-value (typically below 0.05) suggests the observed results are unlikely under the null hypothesis, providing evidence against it. A large p-value means the results are consistent with what we would expect by chance.
What is the difference between a one-tailed and a two-tailed test?
A two-tailed test checks for differences in either direction (is the value significantly higher OR lower than expected?). A one-tailed test only checks one direction. For a two-tailed test, the p-value is doubled because both tails of the distribution are considered. Use a two-tailed test when you do not have a specific directional hypothesis. Use a one-tailed test only when you have a strong a priori reason to expect the effect in one specific direction, such as testing whether a new drug performs better (not just differently) than a placebo.
What significance level should I use?
The most common significance level is α = 0.05 (5%), meaning you accept a 5% chance of a false positive. For medical or safety-critical research, stricter levels like α = 0.01 or α = 0.001 are used to reduce false positives. For exploratory research or preliminary studies, α = 0.10 may be acceptable. The choice of significance level should be made before conducting the test and should reflect the consequences of a Type I error (false positive) in your specific context.
Does a p-value of 0.05 mean there is a 5% chance the null hypothesis is true?
No, this is a very common misconception. A p-value of 0.05 means there is a 5% probability of observing data this extreme if the null hypothesis were true. It does NOT mean there is a 5% probability that the null hypothesis is true or a 95% probability that the alternative hypothesis is true. The p-value is a statement about data probability given a hypothesis, not about hypothesis probability given data. The latter requires Bayesian analysis with a prior probability.
How do p-values relate to confidence intervals?
P-values and confidence intervals convey the same information differently. A two-sided p-value less than 0.05 corresponds to a 95% confidence interval that does not include the null hypothesis value. For example, if testing whether a mean equals 0 with α = 0.05, a p-value below 0.05 means the 95% confidence interval for the mean does not contain 0. Confidence intervals provide more information because they show the range of plausible values, while p-values only indicate whether a specific value is plausible.
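This duality can be demonstrated numerically. In the Python sketch below (the estimate and standard error are hypothetical), the two-sided test at α = 0.05 and the 95% interval reach the same verdict:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def two_sided_p(estimate, null_value, se):
    """Two-sided p-value for testing estimate against null_value."""
    z = (estimate - null_value) / se
    return 2.0 * (1.0 - nd.cdf(abs(z)))

def ci95(estimate, se):
    """95% confidence interval for the estimate."""
    z = nd.inv_cdf(0.975)  # about 1.96
    return estimate - z * se, estimate + z * se

# Hypothetical: sample mean 1.1, standard error 0.5, testing mean = 0
p = two_sided_p(1.1, 0.0, 0.5)
lo, hi = ci95(1.1, 0.5)
print(p < 0.05, lo > 0)  # significant exactly when the interval excludes 0
```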
Can a result be statistically significant but practically unimportant?
Yes, absolutely. With a large enough sample size, even tiny differences can become statistically significant. For example, a study with 100,000 participants might find that a new teaching method improves test scores by 0.1 points (p = 0.01). While statistically significant, this improvement is too small to be educationally meaningful. Always consider effect size (the magnitude of the difference) alongside p-values. Common effect size measures include Cohen's d, correlation coefficient r, and odds ratios.
What is p-hacking and how can I avoid it?
P-hacking refers to manipulating data analysis to achieve statistically significant results. Techniques include testing many variables and only reporting significant ones, removing outliers selectively, stopping data collection when p < 0.05, or trying many statistical tests until one is significant. This inflates false positive rates far beyond the nominal α level. To avoid p-hacking, pre-register your hypotheses and analysis plan, report all tests conducted (not just significant ones), and use corrections for multiple comparisons such as Bonferroni or Holm-Bonferroni methods.
Related Calculators
Z-Score Calculator
Calculate z-scores and normal distribution probabilities.
Confidence Interval Calculator
Calculate confidence intervals for population parameter estimates.
Sample Size Calculator
Determine the required sample size for your study.
Statistics Calculator
Comprehensive descriptive statistics for any dataset.
Percentage Calculator
Calculate percentages and proportions for everyday math.
Grade Calculator
Calculate weighted grades across assignments and courses.
Disclaimer: This calculator is for informational and educational purposes only. Results are estimates and may not reflect exact values.
Last updated: February 23, 2026
Sources
- Khan Academy — Hypothesis Testing and P-Values: khanacademy.org
- NIST/SEMATECH e-Handbook — Critical Values and P-Values: nist.gov
- American Statistical Association — Statement on P-Values: amstat.org