Trends in P Value, Confidence Interval, and Power Analysis Reporting in Health Professions Education Research Reports

A Systematic Appraisal

Abbott, Eduardo F. MD; Serrano, Valentina P. MD, MSc; Rethlefsen, Melissa L. MSLS; Pandian, T.K. MD, MPH; Naik, Nimesh D. MD; West, Colin P. MD, PhD; Pankratz, V. Shane PhD; Cook, David A. MD, MHPE

doi: 10.1097/ACM.0000000000001773
Research Reports

Purpose To characterize reporting of P values, confidence intervals (CIs), and statistical power in health professions education research (HPER) through manual and computerized analysis of published research reports.

Method The authors searched PubMed, Embase, and CINAHL in May 2016 for comparative research studies. For manual analysis of abstracts and main texts, they randomly sampled 250 HPER reports published in 1985, 1995, 2005, and 2015, and 100 biomedical research reports published in 1985 and 2015. Automated computerized analysis of abstracts included all HPER reports published 1970–2015.

Results In the 2015 HPER sample, P values were reported in 69/100 abstracts and 94 main texts. CIs were reported in 6 abstracts and 22 main texts. Most P values (≥ 77%) were ≤ .05. Across all years, 60/164 two-group HPER studies had ≥ 80% power to detect a between-group difference of 0.5 standard deviations. From 1985 to 2015, the proportion of HPER abstracts reporting a CI did not change significantly (odds ratio [OR] 2.87; 95% CI 1.04, 7.88), whereas that of main texts reporting a CI increased (OR 1.96; 95% CI 1.39, 2.78). Comparison with biomedical studies revealed similar reporting of P values, but more frequent use of CIs in biomedicine. Automated analysis of 56,440 HPER abstracts found 14,867 (26.3%) reporting a P value, 3,024 (5.4%) reporting a CI, and increased reporting of P values and CIs from 1970 to 2015.

Conclusions P values are ubiquitous in HPER, CIs are rarely reported, and most studies are underpowered. Most reported P values would be considered statistically significant.

E.F. Abbott is a research fellow, Mayo Clinic Multidisciplinary Simulation Center, Mayo Clinic College of Medicine, Rochester, Minnesota, and adjunct instructor of internal medicine, Department of Internal Medicine, Escuela de Medicina, Pontificia Universidad Catolica de Chile, Santiago, Chile.

V.P. Serrano is a research fellow, Knowledge and Evaluation Research Unit, Division of Endocrinology, Diabetes, Metabolism and Nutrition, Mayo Clinic, Rochester, Minnesota, and assistant professor, Department of Nutrition, Diabetes and Metabolism, Escuela de Medicina, Pontificia Universidad Catolica de Chile, Santiago, Chile.

M.L. Rethlefsen is deputy director and associate librarian, Spencer S. Eccles Health Sciences Library, and section director, Systematic Review Core, Population Health Research Foundation for Discovery, Center for Clinical and Translational Science, University of Utah, Salt Lake City, Utah.

T.K. Pandian is a postgraduate year six resident, Department of Surgery, Mayo Clinic College of Medicine, Rochester, Minnesota.

N.D. Naik is a postgraduate year four resident, Department of Surgery, Mayo Clinic College of Medicine, Rochester, Minnesota.

C.P. West is professor of medicine, professor of biostatistics, and professor of medical education; associate program director, Internal Medicine Residency Program; and consultant, Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minnesota.

V.S. Pankratz is professor of internal medicine, University of New Mexico Health Sciences Center, Albuquerque, New Mexico.

D.A. Cook is professor of medicine and professor of medical education; research chair, Mayo Clinic Multidisciplinary Simulation Center; director of research, Office of Applied Scholarship and Education Science; and consultant, Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minnesota; ORCID: http://orcid.org/0000-0003-2383-4633.

Supplemental digital content for this article is available at http://links.lww.com/ACADMED/A458.

An AM Rounds blog post on this article is available at academicmedicineblog.org.

Funding/support: None reported.

Other disclosures: None reported.

Ethical approval: Reported as not applicable.

Correspondence should be addressed to David A. Cook, Division of General Internal Medicine, Mayo Clinic College of Medicine, Mayo 17, 200 First St. SW, Rochester, MN 55905; telephone: (507) 284-2269; e-mail: cook.david33@mayo.edu; Twitter: @CookMedEd.

Planning and reporting statistical analyses have generated long-standing discussions in many fields of science.1–6 P values are frequently misunderstood and misinterpreted,7–10 and there is evidence that studies with “statistically significant” P values are more likely to be published than studies with “nonsignificant” results11–13 and that authors selectively report statistically significant findings.14–16 Weighing the magnitude of reported associations can elucidate the importance of such associations beyond statistical significance.17 This is facilitated by reporting measures of effect size1 such as Cohen’s d, variance explained (R2), risk differences, and risk ratios. Experts have further recommended that confidence intervals (CIs) be reported to overcome some of the limitations of P values, especially in the interpretation of results relative to clinically relevant thresholds.17–20 P values and CIs are tightly linked with statistical power, and power analyses (i.e., sample size justifications) are advocated as a routine part of study planning to optimize the likelihood that statistically nonsignificant results can be interpreted with confidence.19

Empiric studies of biomedical research indicate that P values are reported so frequently, and are so often statistically significant, that their utility is compromised.9,14 In contrast, CIs,9,21 effect sizes,21 and power analyses22–25 are reported infrequently and incompletely, even though guidelines such as the Consolidated Standards of Reporting Trials (CONSORT) and Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statements specifically recommend reporting them.17,26

Limited research in health professions education research (HPER) has identified deficiencies in the reporting of P values,27,28 CIs,27,28 power analyses,28,29 and statistical power,30–32 but never with a focused or extended exploration of these issues. As such, the scope of these reporting problems in HPER is not fully understood.

To address this knowledge gap, we sought to determine the prevalence and temporal evolution of reporting P values, CIs, effect sizes, and power analyses in HPER, and to compare these findings with reporting in concurrent biomedical research.

Method

Overview

We analyzed HPER publications in two ways: (1) in-depth manual coding of a random sample of HPER studies published at 10-year intervals from 1985 through 2015, and (2) automated text mining (i.e., computerized data extraction/coding) of the abstracts of all HPER studies published from 1970 through 2015. For comparison, we manually coded a random sample of biomedical research studies published in 1985 and 2015. Although this was not a systematic review, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards33 to ensure complete reporting.

Manual coding of abstracts and full text

Search and eligibility.

To identify studies for manual coding, on May 7, 2016, a research librarian (M.L.R.) with expertise in systematic reviews searched (a) the PubMed (including MEDLINE), Embase, and CINAHL databases for comparative HPER studies published in 1985, 1995, 2005, and 2015; and (b) the “Core Clinical Journals” subset of MEDLINE for comparative biomedical research studies published in 1985 and 2015. (For full search strategies, see Supplemental Digital Appendixes 1-A and 1-B at http://links.lww.com/ACADMED/A458.) We selected a random sample for each time point by randomly ordering the identified studies and then reviewing these for inclusion until we achieved the desired sample size (for sample sizes, see “Data analysis,” below). The number of comparative HPER studies published in 1985 fell short of our target, so we extended that search to include studies published in 1986.

We included only peer-reviewed journal publications that contained an abstract, because many of the elements of interest in our analysis pertained to the abstract. For HPER articles we required that participants be health professionals in a learner or teacher role, while for biomedical research articles we required that participants be human patients or volunteers. We limited both the HPER and biomedical research samples to comparative studies, defined as observational or interventional studies with (a) at least two groups or (b) one group with measurement before and after an event (i.e., pretest–posttest studies).

We based initial inclusion decisions on the title and abstract, and then referred to full-text reports for final decisions. All inclusion decisions were made by two reviewers working independently (interrater agreement: single-rater intraclass correlation coefficient [ICC] 0.80), with disagreements resolved by consensus.

Data extraction.

We reviewed prior studies of research reporting9,16,21–25 and expert guidelines17 to identify important statistical and study design elements to extract from the abstract (a, b, and e, below) and the main text (all except e), including:

  • a. P values: total number of P values, distribution of P values (including “NS” [not significant]), study data reported along with P values;
  • b. CIs: total number of CIs, CIs around effect size estimate;
  • c. Power analyses: alpha error, beta error, estimated variance or event rate, clinically important threshold and its justification, target sample size;
  • d. Effect sizes: type (e.g., absolute or standardized mean difference, correlation coefficient, risk ratio);
  • e. Statements of significance: statistical or clinical/educational, justification; and
  • f. Other study features: participants, study design, country of origin.

Two reviewers, working independently, reviewed each abstract and full text (both HPER and biomedical articles) to extract information using a standardized form. Interrater agreement was ICC > 0.50 for all items except four infrequently reported items (standardized mean difference, clinical significance, and justification of significance using effect sizes and CIs), for which ICC was low (< 0.2) but raw agreement was > 95%. All disagreements were resolved by consensus.

Data analysis.

Our primary analysis tested for changes in reporting of P values and CIs in HPER from 1985 to 2015. We evaluated changes over time in the presence/absence and number (counts) of P values and CIs. For presence/absence, we report odds ratios (ORs) per decade from logistic regression models; for counts, we report ratios of counts per decade from negative binomial regression models. In secondary analyses, we evaluated changes over time in the percentage of P values ≤ .05 (of all reported P values), the mean largest P value, and the number of studies reporting a power analysis. We expected that authors would selectively report statistically significant P values in the abstract, and we evaluated this prediction within years by comparing the percentage of P values ≤ .05 in the abstract versus main text using the paired t test. We calculated each study’s minimum and maximum P values on a log10 scale (substituting .00001 for P values reported as 0), averaged these across studies, and back-transformed the means to native units for reporting.
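
To make these models concrete, the sketch below fits the same two model types, logistic regression for presence/absence and negative binomial regression for counts, in Python with statsmodels. It is an illustration only: the data are simulated and the variable names (decade, has_p, n_p) are hypothetical, and it does not reproduce the authors' SAS analysis.

```python
# Minimal sketch of the per-decade regression models described above.
# Simulated data and variable names are assumptions for illustration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
years = rng.choice([1985, 1995, 2005, 2015], size=250)
df = pd.DataFrame({
    "decade": (years - 1985) / 10.0,         # time expressed in decades
    "has_p": rng.integers(0, 2, size=250),   # any P value reported? (0/1)
    "n_p": rng.poisson(5, size=250),         # number of P values reported
})

# Logistic regression: exp(coefficient) is the odds ratio (OR) per decade
# for reporting at least one P value.
logit_fit = smf.logit("has_p ~ decade", data=df).fit(disp=False)
or_per_decade = np.exp(logit_fit.params["decade"])

# Negative binomial regression: exp(coefficient) is the ratio of counts (RC)
# per decade (the dispersion parameter is fixed at its default here; a full
# analysis would estimate it).
nb_fit = smf.glm("n_p ~ decade", data=df,
                 family=sm.families.NegativeBinomial()).fit()
rc_per_decade = np.exp(nb_fit.params["decade"])

print(f"OR per decade: {or_per_decade:.2f}; RC per decade: {rc_per_decade:.2f}")
```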

We compared HPER and biomedical research published in 1985 and 2015, evaluating within-year differences in reporting frequency or means, using chi-square tests or general linear models.

We estimated the experiment-wide error rate (the probability of finding at least one P value < .05 if all null hypotheses in the study are true) as 1 − 0.95^c, where c is the total number of hypothesis tests. Using the number of participants completing each study, we estimated each HPER study’s power to detect mean differences of 0.2, 0.5, and 0.8 standard deviations (SDs), corresponding to Cohen’s34 thresholds for small, moderate, and large effects, respectively.
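
The two calculations in this paragraph can be sketched as follows; the number of hypothesis tests and the group size are assumed values for illustration, not figures drawn from the included studies.

```python
# Sketch of the experiment-wide Type I error and two-group power calculations.
# The values of c and n_per_group are hypothetical.
from statsmodels.stats.power import TTestIndPower

c = 15                                 # assumed number of hypothesis tests
experiment_wide_error = 1 - 0.95 ** c  # probability of >=1 P < .05 if all nulls are true
print(f"Experiment-wide Type I error for {c} tests: {experiment_wide_error:.2f}")

n_per_group = 30                       # assumed number of completers per group
analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):              # Cohen's small, moderate, large effects
    power = analysis.power(effect_size=d, nobs1=n_per_group, ratio=1.0, alpha=0.05)
    print(f"Power to detect d = {d}: {power:.2f}")
```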

We estimated power for our study for the primary comparison (change in HPER reporting over time) and for the comparison between HPER and biomedical studies. With 100 HPER studies and 50 biomedical studies from 1985 and from 2015, we estimated > 80% power to detect a difference in proportions of 0.13 within HPER studies and to detect a difference of 0.2 between HPER and biomedical studies, with alpha = 0.05. We set the sample sizes for 1995 and 2005 at 25 HPER studies each. We used SAS 9.4 (SAS Institute, Cary, North Carolina) for all analyses. Applying a Bonferroni correction for 30 statistical comparisons (17 P values and 13 CIs) suggested 0.0017 as a conservative threshold of statistical significance to control the experiment-wide Type I error rate at ≤ 0.05. We did not conduct any unreported analyses.
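
A rough sketch of this sample size reasoning appears below. The Bonferroni threshold follows directly from the text; the power calculation assumes a baseline CI-reporting proportion of 0.05 rising to 0.18 (a difference of 0.13), which is an illustrative assumption rather than a value taken from the study, so the result only approximates the authors' estimate.

```python
# Sketch of the Bonferroni threshold and an approximate power calculation for a
# 0.13 difference in proportions with 100 studies per year. Baseline proportions
# are assumptions for illustration.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha_adjusted = 0.05 / 30                 # Bonferroni correction for 30 comparisons
print(f"Adjusted significance threshold: {alpha_adjusted:.4f}")  # ~0.0017

effect = proportion_effectsize(0.18, 0.05) # assumed proportions differing by 0.13
power = NormalIndPower().power(effect_size=effect, nobs1=100, ratio=1.0, alpha=0.05)
print(f"Approximate power: {power:.2f}")   # roughly 0.85 under these assumptions
```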

Automated text mining of abstracts: Search, data extraction, and analysis

To identify studies for automated text mining, on May 7, 2016, the reference librarian searched the PubMed (including MEDLINE), Embase, and CINAHL databases for HPER studies published 1970–2015. We designed this search to be more specific for HPER than the manual coding search because the results would not be individually evaluated for inclusion (for the search strategy, see Supplemental Digital Appendix 1-C at http://links.lww.com/ACADMED/A458).

We created text-mining software that used a regular expression search string to identify P values (following the procedure outlined by Chavalarias et al9—namely, “P” followed by an expression of equality or inequality and a numeric value or the text term “not significant” or “NS”) and CIs (“confidence interval” or “CI” followed by a numeric value) in abstracts. Details of this search string are contained in Supplemental Digital Appendix 1-D at http://links.lww.com/ACADMED/A458. To test this software, we applied it to the abstracts of the 300 manually coded comparative HPER and biomedical studies published in 1985 and 2015, and found near-perfect agreement with the manually extracted data (ICC ≥ 0.93).
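
To illustrate the kind of pattern matching involved, the sketch below shows simplified Python regular expressions for detecting P values and CIs in abstract text. These patterns are approximations written for this illustration; the actual search string used in the study is provided in Supplemental Digital Appendix 1-D.

```python
# Simplified, illustrative regular expressions for P values and CIs in abstracts.
# These approximate, but do not reproduce, the study's actual search string.
import re

P_VALUE = re.compile(
    r"\bp\s*[=<>≤≥]\s*(?:0?\.\d+|not\s+significant|NS)\b",
    re.IGNORECASE,
)
CI = re.compile(
    r"\b(?:95%\s*)?(?:confidence\s+interval|CI)\b[^.;]{0,20}?\d",
    re.IGNORECASE,
)

abstract = ("Scores improved after training (P < .001; 95% CI 0.3, 0.9), "
            "but retention did not differ (P = NS).")
print(P_VALUE.findall(abstract))   # ['P < .001', 'P = NS']
print(bool(CI.search(abstract)))   # True: 'CI' followed by a numeric value
```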

For each year, we determined the number of abstracts reporting at least one P value or CI, the number of P values and CIs reported in each abstract, the lowest and highest P values, and the number of exact P expressions. We used logistic and negative binomial regression to evaluate changes in P value and CI reporting per decade.

Results

Manual coding of abstracts and full text

Description of studies.

We included 250 reports of HPER (100 each in 1985 and 2015, and 25 each in 1995 and 2005) and 100 reports of biomedical research (50 each in 1985 and 2015) (Table 1). The “1985” sample included all 93 eligible studies published in 1985 and a random sample of 7 studies published in 1986. For all other time points, the included studies reflected a random sample of studies published that year. The HPER samples included 9 articles written in languages other than English, including German (n = 5), Spanish (n = 2), Italian (n = 1), and Chinese (n = 1); these were translated prior to data extraction. We excluded 1 potentially eligible study for which translation (Russian) was not available.

Key information for the included studies is reported in Table 2 and in Supplemental Digital Appendix 2–eTable 1 (available at http://links.lww.com/ACADMED/A458). Of the 250 HPER studies, 159 (63.6%) involved physicians at some career stage (medical students, residents, or practicing physicians), and 56 (22.4%) involved nurses or nursing students. One hundred thirty-five (54.0%) originated from the United States, 52 (20.8%) from Europe, and 24 (9.6%) from Canada. Seventy-seven (30.8%) involved participants from more than one institution. Eighty-six (34.4%) were single-group studies, 164 (65.6%) had a separate comparison group, and 59 (23.6%) were randomized experiments. Only 18 (7.2%) identified a primary outcome in the main text Method section. Most (164 [65.6%]) did not report a measure of effect size; among the 86 (34.4%) that did, the most common measures were the mean difference (44 [51.2%]), correlation coefficient or R2 (37 [43.0%]), and standardized mean difference (10 [11.6%]).

In the search and selection process for comparative studies, we first identified all relevant studies published since 1970, and then selected studies with abstracts published in the target years (1985, 1995, etc.). During this process, we incidentally noted a temporal pattern in the proportion of reports with abstracts (see Supplemental Digital Appendix 2–eTable 2 at http://links.lww.com/ACADMED/A458). Prior to 1975, < 1% of HPER publications and < 7% of biomedical publications had abstracts. For HPER publications, this rose in 1975 to 25% and remained between 30% and 60% through 2002. From 2003 to 2015, more than 70% of HPER publications had abstracts. By contrast, 51% to 66% of biomedical publications had abstracts from 1975 to 2015.

Current state of HPER reporting.

Of the 100 HPER reports in our 2015 sample, 69 (69.0%) reported at least one P value in the abstract, and 94 (94.0%) reported at least one P value in the main text (Table 3). On average, there were 3.0 (95% CI 2.3, 3.6) P values in the abstract, 8.4 (6.7, 10.2) P values in the main text, and an additional 10.1 (5.8, 14.4) P values in tables. The mean percentage of P values ≤ .05 was 86.9% in abstracts and 76.6% in main texts. Comparisons within reports showed that the mean percentage of P values ≤ .05 was 13 percentage points higher in abstracts (95% CI 7, 19; P < .0001). The average largest P value in a given abstract was .016 (95% CI .009, .031); in other words, even the least statistically significant P value was, on average, < .05. By contrast, the average largest P value in a given main text was .084 (.05, .14).

Six (6.0%) of the 2015 HPER studies reported a CI in the abstract, and 22 (22.0%) reported a CI in the main text. Five of the 6 abstracts (83.3%) and 10 of the 22 main texts (45.5%) reported a CI around an effect size.

Twelve (12.0%) of the 2015 HPER studies reported a prospective power analysis, and 2 (2.0%) reported a retrospective power analysis. We further evaluated the presence of four elements of a complete power analysis in the 12 studies reporting a prospective analysis: alpha level (defined in 9 [75.0%]), beta level (11 [91.7%]), an estimate of outcome variance or event rate (6 [50.0%]), and a threshold of educational (clinical) significance (10 [83.3%]). Only 4 studies (33.3%) reported all four elements. Six (50.0%) provided a justification for the clinical significance threshold. All 12 met the target sample for enrollment, and 9 (75.0%) met that target in the final analysis.

We found statements regarding the “significance” of results or lack thereof in 64 (64.0%) abstracts; 57 abstracts (57.0%) claimed a significant association, while 21 (21.0%) claimed absence of association (i.e., nonsignificance). The type of significance (statistical or clinical/educational) was specified in only 9 abstracts (9.0%). Justifications, when present, nearly always referenced P values rather than the magnitude of effect or a CI.

Changes in HPER reporting over time.

Our primary analyses explored changes over time in the number of P values and CIs reported. The proportion of abstracts reporting at least one P value increased from 1985 to 2015, with the odds that a P value was reported increasing twofold per decade (OR 2.00 [95% CI 1.62, 2.46]; P < .0001). The number of P values reported per abstract increased more slowly (13% relative increase per decade) and did not reach our adjusted threshold of statistical significance (ratio of counts [RC] 1.13 [95% CI 1.00, 1.28]; P = .04). Changes over time in the reporting of P values and the number of P values in the main text were small in magnitude and not statistically significant (respectively, OR 0.87 [95% CI 0.59, 1.29]; P = .50; and RC 1.04 [95% CI 0.95, 1.14]; P = .38).

The odds of reporting a CI in an abstract increased more than twofold per decade, but did not reach statistical significance (OR 2.87 [95% CI 1.04, 7.88]; P = .04), whereas the proportion of main texts reporting a CI did increase significantly (OR 1.96 [95% CI 1.39, 2.78]; P = .0001). The number of CIs per report remained essentially unchanged for abstracts (RC 1.08 [95% CI 0.35, 3.32]; P = .89) and for main texts (RC 1.00 [95% CI 0.53, 1.86]; P = .99).

In our secondary analyses, we explored changes over time in the distribution of P values, considering that a higher proportion of P values ≤ .05 could indicate selective reporting of statistically significant P values. The proportion of P values ≤ .05 decreased marginally from 1985 to 2015 both in abstracts (OR 0.94 per decade [95% CI 0.91, 0.97]; P = .0006) and in main texts (OR 0.95 [95% CI 0.92, 0.98]; P = .0018). The small rise in largest P value did not reach statistical significance for abstracts (RC 1.28 [i.e., 28%] per decade [95% CI 0.87, 1.88]; P = .20) or main texts (RC 1.34 [95% CI 1.09, 1.65]; P = .005). We also explored the reporting of power analyses, and we found a statistically significant increase in reporting (OR 1.66 per decade [95% CI 1.14, 2.43]; P = .008).

Actual study power and Type I error.

Looking across all years, we estimated the actual power among the 164 HPER studies with a separate comparison group. Only 7 (4.3%) had ≥ 80% power to detect effects of “small” magnitude (Cohen’s d ≥ 0.2), while 60 (36.6%) had power to detect “moderate” effects (Cohen’s d ≥ 0.5) and 114 (69.5%) had power to detect “large” effects (Cohen’s d ≥ 0.8).

At least one hypothesis test (P value, nonnumeric P expression, or explicit comparison of CIs) was reported in 235/250 HPER studies (94.0%). Among these 235, the average number of hypothesis tests was 24.8 (median 15; range 1–434). The median experiment-wide Type I error (i.e., the probability that at least one P value will be < .05 if all null hypotheses are true) was 0.54 (range 0.05–1.0). Only 8 studies reported adjusting the threshold of statistical significance using, for example, a Bonferroni adjustment.

Comparison with biomedical reporting.

Figure 1 shows the reporting of P values, CIs, and power analyses for HPER in comparison with biomedical research in 2015. A similar proportion of reports in each cohort contained at least one P value, and the number of P values per article and the percentage of P values ≤ .05 were likewise similar. By contrast, biomedical research reports more often had data associated with P values and more often reported at least one CI. Power analyses were similarly infrequent in both cohorts. Table 3 contains additional information on reporting in biomedical research in 1985 and 2015.

Automated text-mining analysis of abstracts

For automated text mining we used a slightly different search strategy, which identified 56,440 comparative HPER studies published 1970–2015. The number of publications with abstracts increased from 3 in 1970 to 3,574 in 2015 (Figure 2 shows results at five-year intervals; for all years, see Supplemental Digital Appendix 2–eTable 3 at http://links.lww.com/ACADMED/A458). The first P value was reported in 1974 and the first CI in 1979.

When the 56,440 abstracts were analyzed with automated text mining, we found 14,867 (26.3%) reporting at least one P value and 3,024 (5.4%) reporting at least one CI. From 1974 to 2015, the percentage of abstracts with P values increased from 6.3% to 39.2%, with the odds of reporting at least one P value increasing 1.97-fold per decade (OR 1.97 [95% CI 1.90, 2.03]; P < .0001). CIs similarly increased in prevalence over time, from 0.6% of abstracts in 1979 to 8.6% in 2015 (OR 2.09 per decade [95% CI 1.96, 2.24]; P < .0001).

We extracted 37,874 numeric P values from abstracts. The most commonly reported P values were .001 (n = 8,979 [23.7%]); .05 (n = 5,842 [15.4%]); .01 (n = 3,560 [9.4%]); .0001 (n = 2,402 [6.3%]); and .02 (n = 1,146 [3.0%]). These five P values collectively accounted for 57.9% (n = 21,929) of all reported P values, suggesting strong clustering around specific values representing common P value thresholds.

Among the 14,867 abstracts reporting a P value, 13,853 (93.2%) reported at least one P value ≤ .05. The percentage of P values ≤ .05 decreased from 100% in 1974 to 84.5% in 2015 (OR 0.96 per decade [95% CI 0.95, 0.97]; P < .0001). The largest reported P value averaged .029, with within-year averages fluctuating from .0037 to .30 (see Supplemental Digital Appendix 2–eFigures 1A and 1B at http://links.lww.com/ACADMED/A458).

Among the 39,217 P expressions (both numeric and nonnumeric), 18,952 (48.3%) were exact P values (P = x); 19,277 (49.2%) were “less than” inequalities (P < x), and 988 (2.5%) were “greater than” inequalities (P > x). Over time, the percentage of exact P values has increased steadily, and it was 51.6% in 2015 (see Supplemental Digital Appendix 2–eTable 3 at http://links.lww.com/ACADMED/A458).

Discussion

In this study, we manually extracted data from 250 HPER reports published at 10-year intervals from 1985 to 2015 and, for comparison, from 100 biomedical research reports published in 1985 and 2015. We also used automated text mining to analyze the abstracts of 56,440 HPER reports published 1970–2015. In 2015, P values were reported in nearly 40% of HPER abstracts with prevalence increasing over time (as shown in our text-mining analysis of all HPER publications), and in over 90% of main texts (as found in our manual coding of comparative studies). Most reported P values were in the range nominally considered statistically significant (P ≤ .05); hence, even the least significant P value reported in abstracts would usually be considered statistically significant. Moreover, the proportion of P values ≤ .05 was higher in abstracts than in main texts, suggesting selective reporting of results. P values were rarely accompanied by study data, such that reporting and interpretation of results focused merely on the statistical significance or lack thereof rather than the magnitude of association. By contrast, CIs were reported only infrequently in abstracts or main texts.

Few studies identified a primary outcome measure, and nearly all reported numerous hypothesis tests. Yet adjustments for multiple hypothesis testing were rare, resulting in a high probability of rejecting at least one null hypothesis even if all were true (median experiment-wide Type I error > 50%). Few studies reported a power analysis, and only about one-third of studies were powered to detect moderate effects. Only one-third reported an effect size, and even fewer reported standardized effect sizes.

HPER was comparable to biomedical research in the reporting of P values and power analyses, but lagged in the reporting of CIs and of data along with P values.

Limitations and strengths

We included only comparative studies in this analysis, so our findings may not generalize to other quantitative designs. We also do not know the reporting quality of articles without abstracts, which comprised a large proportion of studies published prior to 1990 and were not eligible for inclusion in this study. For the manual coding analysis, we included all eligible HPER studies published in 1985; for all other cohorts, the included studies represented a random sample and thus may not have fully captured the reporting patterns in a given year. However, these random samples characterized the overall state of the field, and the automated text mining of abstracts of HPER studies published 1970–2015 reflected an inclusive albeit less detailed snapshot of reporting.

Strengths include our systematic approach to identifying, selecting, and abstracting data from manually reviewed HPER and biomedical articles, and the complementary automated text mining of all abstracts published over 46 years.

Integration with prior work

Although prior work in HPER has touched on the frequency of the reporting of P values,27,28 CIs,27,28 and power analyses,28,29 ours is the first study to probe these issues deeply. Our findings generally align with those from biomedical research and other fields—namely, that statistically significant P values are overreported,9,10,14,21 CIs are underused,9,35,36 power analyses are infrequently performed,9,22–25 and claims regarding “significance” abound.37

Our data extend previous reviews characterizing the topics, participants, and study designs in focused fields of HPER.29,38 Our unselected picture of current and past HPER efforts documents an explosion of HPER activity over the past 30 years. We reviewed 292 articles published in 2015 to identify 100 comparative studies; by extrapolating this finding to the 5,639 publications identified in the initial search (Table 1), we estimate that approximately 1,931 comparative research studies were published in 2015. This approximation underestimates the total number of HPER studies (e.g., those with noncomparative or qualitative designs). Randomized trials in 2015 accounted for over half of comparative studies with two or more groups, and most studies originated outside the United States. However, a large minority of studies had no comparison group, sample sizes were small, and most research focused on physician education.

Implications

Chavalarias et al9 noted, “The pervasive presence of P values less than .05 suggests that this threshold has lost its discriminatory ability for separating false from true hypotheses.” Noting the frequent reporting of small P values, we emphasize that smaller P values do not indicate a more significant association; rather, the magnitude of association is indicated by the effect size.1,30,39 However, P values are not inherently bad. Problems arise when they are misinterpreted, selectively reported, “hacked” (i.e., study conditions, data collection, and data analyses manipulated to achieve a target P value6,14), or reported without reference to study data (e.g., estimates of effect size and uncertainty). Additionally, testing of multiple independent hypotheses inflates experiment-wide Type I error. Error estimates based on the number of reported statistical analyses will underestimate the true Type I error if authors selectively report only “interesting” or “significant” analyses. Moreover, selective reporting presents a distorted picture of the evidence. We recommend that researchers plan hypothesis tests thoughtfully and in advance, only conduct tests they intend to report, account for all conducted tests, and report summaries of study data (estimates of effect size and their variance1,39) in addition to statistical test results. Statements about “significance” should clarify the type (e.g., statistical, educational) and provide supporting information.

Power analyses were infrequent and nearly always incomplete, which may in part account for the overall low statistical power of these studies. We recommend that researchers estimate sample size requirements in advance, taking advantage of prior work to estimate the variance of key outcome measures and the threshold of clinical/educational significance.40 Once data have been collected, CIs are preferable over retrospective power analyses in guiding interpretation.19

Finally, CIs were reported infrequently despite repeated requests for their increased use18–20 (especially around the effect size35,41). CIs guide interpretations of the precision and impact of observed results by offering “a range of values that are considered to be plausible for the population”18 and providing information about “what differences are and are not statistically ruled out.”19 CIs around the effect size (e.g., the mean difference, standardized mean difference, correlation coefficient, risk ratio) are particularly useful, as compared with CIs around group means or frequencies. We recommend increased use of CIs in reporting and interpreting research.
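
As a concrete illustration of this recommendation, the short sketch below (using simulated scores, not data from this study) reports a between-group mean difference together with its 95% CI rather than a P value alone.

```python
# Illustrative calculation of a 95% CI around a between-group mean difference,
# using simulated test scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(70, 10, size=40)        # simulated scores, control group
intervention = rng.normal(75, 10, size=40)   # simulated scores, intervention group

diff = intervention.mean() - control.mean()
se = np.sqrt(intervention.var(ddof=1) / intervention.size
             + control.var(ddof=1) / control.size)
dof = control.size + intervention.size - 2   # approximate degrees of freedom
t_crit = stats.t.ppf(0.975, dof)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se
print(f"Mean difference {diff:.1f} (95% CI {ci_low:.1f} to {ci_high:.1f})")
```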

Acknowledgments:

The authors thank Doug Cook for his help in creating and validating the automated text mining software and search strings.

References

1. Cohen J. The earth is round (p < .05). Am Psychol. 1994;49:997–1003.
2. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2:e124.
3. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999;130:995–1004.
4. Trafimow D, Marks M. Editorial. Basic Appl Soc Psych. 2015;37(1):1–2.
5. Carver RP. The case against statistical significance testing. Harv Educ Rev. 1978;48:378–99.
6. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22:1359–1366.
7. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: Context, process, and purpose. Am Stat. 2016;70:129–133.
8. Nuzzo R. Scientific method: Statistical errors. Nature. 2014;506:150–152.
9. Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of reporting P values in the biomedical literature, 1990–2015. JAMA. 2016;315:1141–1148.
10. de Winter JC, Dodou D. A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ. 2015;3:e733.
11. Dickersin K, Min YI, Meinert CL. Factors influencing publication of research results. Follow-up of applications submitted to two institutional review boards. JAMA. 1992;267:374–378.
12. Callaham M, Wears RL, Weber E. Journal prestige, publication bias, and other characteristics associated with citation of published studies in peer-reviewed journals. JAMA. 2002;287:2847–2850.
13. Polanin JR, Tanner-Smith EE, Hennessy EA. Estimating the difference between published and unpublished effect sizes: A meta-review. Rev Educ Res. 2016;86:207–236.
14. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol. 2015;13:e1002106.
15. Chan AW, Altman DG. Identifying outcome reporting bias in randomised trials on PubMed: Review of publications and survey of authors. BMJ. 2005;330:753.
16. Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. JAMA. 2004;291:2457–2465.
17. Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomized trials. Ann Intern Med. 2010;152:726–732.
18. Gardner MJ, Altman DG. Confidence intervals rather than P values: Estimation rather than hypothesis testing. Br Med J (Clin Res Ed). 1986;292:746–750.
19. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200–206.
20. Altman D, Bland JM. Confidence intervals illuminate absence of evidence. BMJ. 2004;328:1016–1017.
21. Gaskin CJ, Happell B. Power, effects, confidence, and significance: An investigation of statistical practices in nursing research. Int J Nurs Stud. 2014;51:795–806.
22. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials. A survey of three medical journals. N Engl J Med. 1987;317:426–432.
23. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA. 1994;272:122–124.
24. Chan AW, Altman DG. Epidemiology and reporting of randomised trials published in PubMed journals. Lancet. 2005;365:1159–1162.
25. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med. 1982;306:1332–1337.
26. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. Ann Intern Med. 2007;147:573–577.
27. Wolf FM. Methodological quality, evidence, and research in medical education (RIME). Acad Med. 2004;79(10 suppl):S68–S69.
28. Cook DA, Levinson AJ, Garside S. Method and reporting quality in health professions education research: A systematic review. Med Educ. 2011;45:227–238.
29. Baernstein A, Liss HK, Carney PA, Elmore JG. Trends in study methods used in undergraduate medical education research, 1969–2007. JAMA. 2007;298:1038–1045.
30. Cook DA, Hatala R. Got power? A systematic review of sample size adequacy in health professions education research. Adv Health Sci Educ Theory Pract. 2015;20:73–83.
31. Michalczyk AE, Lewis LA. Significance alone is not enough. J Med Educ. 1980;55:834–838.
32. Woolley TW. A comprehensive power-analytic investigation of research in medical education. J Med Educ. 1983;58:710–715.
33. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Ann Intern Med. 2009;151:264–269, W64.
34. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum; 1988.
35. Odgaard EC, Fowler RL. Confidence intervals for effect sizes: Compliance and clinical significance in the Journal of Consulting and Clinical Psychology. J Consult Clin Psychol. 2010;78:287–297.
36. Vavken P, Heinrich KM, Koppelhuber C, Rois S, Dorotka R. The use of confidence intervals in reporting orthopaedic research findings. Clin Orthop Relat Res. 2009;467:3334–3339.
37. Vinkers CH, Tijdink JK, Otte WM. Use of positive and negative words in scientific PubMed abstracts between 1974 and 2014: Retrospective analysis. BMJ. 2015;351:h6467.
38. Cook DA, Hatala R, Brydges R, et al. Technology-enhanced simulation for health professions education: A systematic review and meta-analysis. JAMA. 2011;306:978–988.
39. Thompson B. Research synthesis: Effect sizes. In: Green JL, Camilli G, Elmore PB, eds. Handbook of Complementary Methods in Education Research. Mahwah, NJ: Lawrence Erlbaum; 2006:583–603.
40. Rusticus SA, Eva KW. Defining equivalence in medical education evaluation and research: Does a distribution-based approach work? Adv Health Sci Educ Theory Pract. 2016;21:359–373.
41. Thompson B. What future quantitative social science research could look like: Confidence intervals for effect sizes. Educ Res. 2002;31(3):25–32.

© 2018 by the Association of American Medical Colleges