Lab Manager | Run Your Lab Like a Business

Calling Time on 'Statistical Significance' in Science Research

An editorial states that scientists should stop using the term "statistically significant" in their research

by Taylor and Francis Group

Scientists should stop using the term "statistically significant" in their research, urges an editorial in a special issue of The American Statistician published today.

The issue, Statistical Inference in the 21st Century: A World Beyond P<0.05, calls for an end to the practice of using a probability value (p-value) of less than 0.05 as strong evidence against a null hypothesis or a value greater than 0.05 as strong evidence favoring a null hypothesis. Instead, p-values should be reported as continuous quantities and described in language stating what the value means in the scientific context.
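
As a minimal sketch of what reporting a p-value as a continuous quantity might look like in practice, the Python snippet below prints an effect estimate, an interval, and the exact p-value instead of a "significant" or "not significant" label. It assumes a simple normal approximation, and the function names, the numbers, and the unit are hypothetical illustrations, not taken from the special issue.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for the null hypothesis of zero effect (normal approximation)."""
    z = abs(estimate / std_error)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z))


def interval(estimate: float, std_error: float, level: float = 0.95) -> tuple[float, float]:
    """Interval estimate at the given level under the same normal approximation."""
    z_crit = STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    return estimate - z_crit * std_error, estimate + z_crit * std_error


def report(estimate: float, std_error: float, unit: str) -> str:
    """Describe the result with its size and uncertainty, without a significance label."""
    p = two_sided_p(estimate, std_error)
    low, high = interval(estimate, std_error)
    return (f"estimated effect {estimate:.2f} {unit} "
            f"(95% interval {low:.2f} to {high:.2f}), p = {p:.3f}")


# Hypothetical numbers, chosen only for illustration.
print(report(1.8, 1.0, "mmHg"))
# prints: estimated effect 1.80 mmHg (95% interval -0.16 to 3.76), p = 0.072
```

Under a 0.05 cutoff this hypothetical result would be labeled "not significant" and might go unreported. Stated in full, it shows a positive estimate with an interval that runs from slightly below zero to well above it, which the reader can then weigh in the scientific context.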

Containing 43 papers by statisticians from around the world, the special issue is expected to lead to a major rethinking of statistical inference by initiating a process that ultimately moves statistical science—and science itself—into a new age.

In the issue's editorial, Dr. Ronald Wasserstein, executive director of the American Statistical Association (ASA); Dr. Allen Schirm, retired from Mathematica Policy Research; and Professor Nicole Lazar of the University of Georgia said: "Based on our review of the articles in this special issue and the broader literature, we conclude that it is time to stop using the term 'statistically significant' entirely.

"No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical non-significance lead to the association or effect being improbable, absent, false, or unimportant.

"For the integrity of scientific publishing and research dissemination, therefore, whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight."

Articles in the special issue suggest alternatives and complements to p-values and highlight the need for widespread reform of editorial, educational, and institutional practices (see the author quotes below).

While there is no single solution to replacing the outsized role that statistical significance has come to play in science, solid principles for the use of statistics do exist, say the editorial's authors.

"The statistical community has not yet converged on a simple paradigm for the use of statistical inference in scientific research—and in fact it may never do so," they acknowledge. "A one-size-fits-all approach to statistical inference is an inappropriate expectation. Instead, we recommend scientists conducting statistical analysis of their results should adopt what we call the ATOM model: Accept uncertainty, be Thoughtful, be Open, be Modest."

This ASA special issue builds on the highly influential ASA Statement on P-Values and Statistical Significance, which has had more than 293,000 downloads and 1,700 citations, an average of more than 10 citations per week since its release in 2016.

Author quotes

Need for change

"Considerable social change is needed in academic institutions, in journals, and among funding and regulatory agencies. We suggest partnering with science reform movements and reformers within disciplines, journals, funding agencies and regulators to promote and reward 'reproducible' science and diminish the impact of statistical significance on publication, funding and promotion."—Goodman

"Evaluation of manuscripts for publication should be 'results-blind'. That is, manuscripts should be assessed for suitability for publication based on the substantive importance of the research without regard to their reported results."—Locascio

"Everything should be published in some form if whatever we measured made sense before we obtained the data because it was connected in a potentially useful way to some research question. Journal editors should be proud of their exhaustive methods sections and base their decisions about the suitability of a study for publication on the quality of its materials and methods rather than on results and conclusions; the quality of the presentation of the latter should only be judged after it is determined that the study is valuable based on its materials and methods."—Amrhein et al.

"Reproduction of research should be encouraged by giving byline status to researchers who reproduce studies. We would like to see digital versions of papers dynamically updated to display 'Reproduced by...' below the original research authors' names or 'Not yet reproduced' until it is reproduced."—Hubbard and Carriquiry

"An important role for statistics in research is the summary and accumulation of information. If replications do not find the same results, this is not necessarily a crisis, but is part of a natural process by which science evolves. The goal of scientific methodology should be to direct this evolution toward ever more accurate descriptions of the world and how it works, not toward ever more publication of inferences, conclusions, or decisions."- Amrhein et al.

Alternatives and complements to p-values

"A number of factors should no longer be subordinate to 'p<0.05'. These include relevant prior evidence, plausibility of mechanism, study design and data quality, and the real-world costs and benefits that determine what effects are scientifically important. The scientific context of the study matters and this should guide its interpretation."—McShane et al.

"Words like 'significance' in conjunction with p-values and 'confidence' with interval estimates mislead users into overconfident claims. We propose researchers think of p-values as measuring the compatibility between hypotheses and data, and interpret interval estimates as 'compatibility intervals' rather than 'confidence intervals'."—Amrhein et al.

"Continuous p-values should only be used in conjunction with the 'false positive risk (FPR)', which answers the question: If you observe a 'significant' p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? "—Colquhoun