In recent years, statisticians and researchers have continued to vigorously sound the alarm on the use and abuse of p-values in clinical studies and statistical modeling in general. Look no further than the official statement of the American Statistical Association (ASA), “The ASA’s Statement on p-Values: Context, Process, and Purpose,”1 published just two years ago in response to the ever more heated debate over the confirmatory role of p-values in quantitative science and the validity of statistical inference. While many in the scientific community have generated discussions and commentaries on the misuse of p-values, the ASA’s policy statement succinctly synthesizes “several widely agreed upon principles underlying the proper use and interpretation of the p-value.” The statement puts forth six principles that aim to guide practitioners in their search for statistically significant effects, ameliorate the problems of false discoveries and irreproducible results, and thus strengthen the application of the scientific method.
P-values are the crux of the null hypothesis significance testing (NHST) framework in the frequentist approach to drawing conclusions from statistical studies. In particular, the 5% significance level has historically served as the gold standard for the threshold to reject or not reject a null hypothesis of no effect. It is interesting to note that the 5% level is an arbitrary dividing line with no deeper scientific foundation. The British statistician Ronald A. Fisher, who popularized the p-value, adopted the use of p ≤ 5% as a convenient benchmark, stating in his 1926 paper:2 “If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty, or one in a hundred. Personally, the writer prefers to set a low standard of significance at the 5% point and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.” While Fisher admitted his preferences to the world, by no means did he advocate the 5% threshold as the mantra of statistical inference; rather, he recommended it as a sanity check on the researcher’s scientific hypothesis in light of the available data.
The set-up of NHST is simple – we come up with a scientific hypothesis we would like to test, collect data, calculate a sample statistic and draw a conclusion about the plausibility of the hypothesis. To formalize the conclusion, we calculate the p-value of our sample statistic, which is simply the probability of observing a result at least as extreme as the statistic our data produced, assuming the null hypothesis and the model’s other assumptions hold. Using the “magical” 5% level of significance, if the p-value is less than 5%, we declare the result statistically significant and treat it as evidence against the null. Even though I snuck in the word “simply,” this p-value definition is about as clear as mud, so an example will set the stage and make discussing the misconceptions of the p-value more concrete.
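Before the example, the definition itself can be made concrete with a few lines of code. The sketch below simulates the sampling distribution of a mean under a null hypothesis and counts how often a result at least as extreme as the observed one arises. Every number in it (sample size, observed mean, the normal null model) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical experiment: n = 25 observations produced a sample mean of 0.8.
# The null says each observation comes from a normal distribution with
# mean 0 and standard deviation 1 (all numbers made up for illustration).
n, observed_mean = 25, 0.8

# Simulate the sampling distribution of the mean under the null.
null_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)

# One-sided p-value: the fraction of simulated null results at least as
# extreme as the one we observed.
p_value = np.mean(null_means >= observed_mean)
print(f"simulated one-sided p-value: {p_value:.5f}")
```

The p-value is nothing more mysterious than that fraction: how rare our result would be in a world where the null is true.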
Assume we are interested in testing whether consuming chocolate on a daily basis leads to weight loss3 (I should be so lucky). We collect data on two groups of participants – a treatment group consuming chocolate and a control group abstaining from it. We use the before-and-after weight measurements of the two groups to build a model to assess the effect of chocolate on weight loss. The null hypothesis (the null) of no effect is the devil’s advocate, which competes against the alternative hypothesis that daily consumption of chocolate reduces weight. We can think of this study as a simple regression model where the response variable is the change in weight, while the predictor is an indicator of whether or not the participant consumed chocolate. If a true effect exists, we would expect a negative regression coefficient (β<0) on the chocolate indicator, meaning consuming chocolate is indeed associated with weight reduction. A p-value of, say, 5% means that, if the null were true, there would be a 5% chance of obtaining an estimated β at least as small as the one from our sample. Stated differently, the p-value tells us how likely a β this negative would be if chocolate had no effect on weight. The smaller the p-value, the less likely our data under the null and its underlying assumptions, giving us reason to doubt the null and some confidence that chocolate consumption is one possible explanation for weight loss.
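A minimal sketch of what such an analysis could look like, using entirely synthetic data – the group sizes, effect size and noise level are all invented, and `scipy.stats.linregress` stands in for whatever regression software a real study would use:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the chocolate study: 40 participants per group,
# weight change in kg (negative = weight loss). Effect size, sample size
# and noise level are all invented for illustration.
n = 40
control = rng.normal(loc=0.0, scale=2.0, size=n)      # no chocolate
treatment = rng.normal(loc=-2.0, scale=2.0, size=n)   # daily chocolate

# Simple regression of weight change on a 0/1 chocolate indicator;
# the slope is the estimated beta for the chocolate effect.
x = np.concatenate([np.zeros(n), np.ones(n)])
y = np.concatenate([control, treatment])
result = stats.linregress(x, y)

print(f"estimated beta:    {result.slope:.2f}")
print(f"two-sided p-value: {result.pvalue:.4f}")
```

Because we simulated a real effect into the data, the fitted β comes out negative with a small p-value; with no built-in effect, the p-value would be uniformly distributed between 0 and 1 across repeated studies.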
Now, here is the million-dollar question: Provided our model’s β<0 is significant at the 5% level, and we reject the null (β=0), can we then conclude the null is wrong and the observed effect of chocolate on weight loss is real? Absolutely not! As the ASA’s statement points out, one misconception is that rejecting the null based on a statistically significant p-value means the null is wrong. We can never know for sure if our scientific conclusion is correct – we can only be led to believe the results we observe are not explainable by chance alone, and that chocolate consumption is one possible explanation for weight loss. All we can know is how rare our results are in the face of the null, as measured by the p-value – we assume the null is true, and the less likely the results, the more incompatible the data with the null. The p-value is not a statement about the probability the null is true, and just because we observe a particular effect in our data does not mean the effect is real or that it will even replicate in further studies; it could still have happened by chance alone or as a result of other conditions. After all, a conclusion does not immediately become “true” on one side of the 5% divide and “false” on the other.
Another point to keep in mind is that statistically significant results are not necessarily practically significant. In our example, a negative β would indicate chocolate consumption is associated with weight loss, but if the reduction amounts to only a few grams, spending money on chocolate as a weight loss solution would be nothing more than a wasteful proposition. Statistically significant results are especially common in big data studies, no matter the effect size. On the opposite end of the spectrum, if a study fails to achieve the status of statistical significance, this does not mean the effect is non-existent either; it could be that we do not have enough data, the variability is too large to tame, or the effect size is too small to detect. Regardless of the case, absence of evidence is not evidence of absence.
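A quick simulation illustrates the big-data point. With a million participants per group (all numbers hypothetical), a true mean difference of only 0.01 kg – about 10 grams, practically meaningless for a weight-loss product – still clears the 5% bar comfortably:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical big-data study: one million participants per group, and a
# true mean difference of only 0.01 kg (10 grams) in weight change.
n = 1_000_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=-0.01, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
diff = group_a.mean() - group_b.mean()
print(f"p-value: {p_value:.6f}   mean difference: {diff:.4f} kg")
```

The p-value measures incompatibility with the null, not the size or importance of the effect – with enough data, almost any nonzero effect becomes statistically significant.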
P-values also depend on the modeler’s degrees of freedom: the choice of statistical model, which observations to include, the treatment of missing data and influential observations, the predictors included in the model, the transformations applied, and so on. This means p-values should be interpreted carefully, and only in the context of the specific model that produced them.
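One such degree of freedom is whether to keep or drop an influential observation. The synthetic sketch below fits the same simple regression twice – once with and once without a single extreme point – and the slope’s p-value changes dramatically between the two defensible modeling choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic small study: a weak trend plus noise.
x = np.arange(15, dtype=float)
y = 0.1 * x + rng.normal(scale=1.0, size=15)

# The same data plus one influential observation far from the rest.
x_full = np.append(x, 40.0)
y_full = np.append(y, 12.0)

with_point = stats.linregress(x_full, y_full)
without_point = stats.linregress(x, y)

print(f"with influential point:    p = {with_point.pvalue:.4f}")
print(f"without influential point: p = {without_point.pvalue:.4f}")
```

Neither choice is "the" correct one in the abstract; the point is that the reported p-value is conditional on every such decision the modeler made along the way.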
P-values have recently taken a lot of heat, but that does not mean they are obsolete or useless. The ASA statement’s mission was to respond to the scientific community’s concerns over the misuse of p-values and the resulting issues with the reproducibility and replicability of scientific conclusions. These issues simply indicate that doing science is tricky, and that modelers who rely on p-values in their research should work even harder to render their findings convincing, rather than blindly relying on mechanical rules like the 5% level to justify their scientific claims.
While it is true nothing can replace subject-area expertise and critical thinking, there are several standards of statistical practice that help mitigate the potential fallout from over-reliance on NHST and guard against false-positive conclusions. Ideally, scientific results should be corroborated through verification, re-analysis, replication and reproduction. Verification subjects the same data to the same method. This is akin to a technical peer review that confirms the original study is free from errors or bias. With re-analysis, we challenge the same data with a different statistical method of analysis. If we are able to confirm the same results using a different technique, this bolsters the validity of our findings. In replication, we collect new data but apply the same method of analysis as the original study. This is similar to applying cross-validation techniques for assessing a model’s predictive power, showing the results generalize to new data. Finally, in a reproduction, we attempt to answer the scientific question by combining new data with a new statistical method for the ultimate test of whether our original conclusions reflect real effects and are not just artifacts of a specific data sample or statistical technique. If we are able to jump through these hoops, the 5% significance level can be our friend, not our enemy!
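As one concrete example of re-analysis, a two-group comparison originally run as a parametric regression could be challenged with a nonparametric permutation test. The sketch below uses synthetic data (same invented set-up as the chocolate example): if the null of no effect holds, the group labels are arbitrary, so reshuffling them shows how often a difference at least as extreme arises by chance alone:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic two-group data (invented sizes, means and spreads).
n = 40
control = rng.normal(0.0, 2.0, size=n)
treatment = rng.normal(-2.0, 2.0, size=n)
observed_diff = treatment.mean() - control.mean()

# Permutation test: under the null, labels are exchangeable, so we
# reshuffle them and recompute the group difference many times.
pooled = np.concatenate([control, treatment])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    perm_diff = pooled[n:].mean() - pooled[:n].mean()
    if perm_diff <= observed_diff:   # one-sided: at least as big a loss
        count += 1

p_value = count / n_perm
print(f"one-sided permutation p-value: {p_value:.4f}")
```

If the permutation test agrees with the original parametric analysis, we gain confidence the finding is not an artifact of one technique’s distributional assumptions.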
1. Wasserstein, R. L. and Lazar, N. A. (2016). The ASA’s Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129-133.
2. Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture, 33, 503-513.
Radost Wenman, FCAS, MAAA, is a Consulting Actuary with Pinnacle Actuarial Resources, Inc. in the San Francisco, California office. She holds a Master of Science degree in Statistics and a Bachelor of Science degree in Mathematics from Stanford University. Radost has over 10 years of experience in the capacity of a pricing actuary in the personal lines segment. In this role, she has developed home and auto pricing solutions through the design and implementation of advanced predictive models.