Health-care spending accounted for 17.9% of United States Gross Domestic Product (GDP) in 2016, a share expected to rise to 19.7% by 20261,2. The Affordable Care Act has catalyzed a shift from fee-for-service purchasing to value-based purchasing in the hope of improving quality per dollar spent. Tracking this effort requires accurately measuring quality and transparently reporting those measures to policymakers, clinicians, researchers, and the public.
In an effort to support consumer decision-making, the Centers for Medicare & Medicaid Services (CMS) created its Overall Hospital Quality Star Ratings (“stars”). First released in July 2016, the ratings aim to make quality information more actionable and accessible by consolidating dozens of hospital quality measures into a familiar 5-star rating system3. The December 2017 star ratings are based on up to 57 quality measures, each categorized into 1 of 7 domains: mortality, safety of care, readmission, patient experience, effectiveness of care, timeliness of care, and efficient use of medical imaging. Some quality measures span the entire spectrum of care in a hospital, whereas others are specific to a particular procedure. All quality measures available from each hospital in each domain are entered into a statistical model called a latent variable model, which calculates weights (also known as loadings) for each quality measure. These loadings are then applied to calculate up to 7 domain scores per hospital, which are then averaged according to another, predetermined weighting scheme created by a panel of experts, resulting in a single summary score per hospital. Finally, hospital summary scores are clustered into 1 to 5 stars4,5.
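The domain-averaging step described above can be sketched in a few lines. The weights below follow CMS's published scheme (mortality, safety of care, readmission, and patient experience at 22% each; the remaining 3 domains at 4% each); the latent variable modeling and clustering steps are omitted here, so this is an illustrative simplification, not CMS's actual implementation:

```python
# Sketch of CMS's summary-score step: a weighted average of up to 7
# domain scores, re-normalizing the weights over whichever domains a
# hospital actually reports. Domain scores themselves come from latent
# variable models (not shown).

DOMAIN_WEIGHTS = {
    "mortality": 0.22, "safety": 0.22, "readmission": 0.22,
    "patient_experience": 0.22, "effectiveness": 0.04,
    "timeliness": 0.04, "imaging": 0.04,
}

def summary_score(domain_scores):
    """Weighted average over reported domains only."""
    present = {d: s for d, s in domain_scores.items() if s is not None}
    total_w = sum(DOMAIN_WEIGHTS[d] for d in present)
    return sum(DOMAIN_WEIGHTS[d] * s for d, s in present.items()) / total_w

# A hypothetical hospital reporting only 5 of the 7 domains:
scores = {"mortality": 0.5, "safety": -0.2, "readmission": 0.1,
          "patient_experience": 0.3, "effectiveness": None,
          "timeliness": -0.1, "imaging": None}
print(round(summary_score(scores), 4))  # → 0.163
```

Because missing domains drop out of the weighted average, two hospitals with identical performance on their shared measures can receive different summary scores when they report different domain sets, a point taken up later in the discussion.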
Hospital star ratings are based on whichever of the 57 quality measures are applicable to each hospital and for which CMS does not deem the number of cases too few to report6. Extensive literature on both surgeon and hospital volume has demonstrated that higher volume is associated with better outcomes, although the magnitude of this association varies across conditions, procedures, and years7,8. With regard to hospital star ratings, volume-outcome associations have been demonstrated for total knee arthroplasty9 as well as for total hip arthroplasty and coronary artery bypass grafting10. We investigated the impact on the star ratings of excluding individual hospitals’ surgical quality measures because of low volume.
We hypothesized that if some quality measures for certain hospitals are excluded from CMS’s calculations because of low volume, and if low volume is associated with worse outcomes, then including those omitted quality measures would negatively impact associated hospitals’ ratings, leaving other hospitals relatively unaffected or positively impacted.
Materials and Methods
We focused attention on quality measures related to surgical procedures with a demonstrated volume-outcome relationship. Among the 57 quality measures included in the December 2017 star ratings, we identified 6 such measures: 2 for total joint arthroplasty (namely, complication and readmission rates), 2 for coronary artery bypass grafting surgery (namely, mortality and readmission rates), 1 for colon surgery, and 1 for abdominal hysterectomy. We next evaluated whether these measures were ever excluded for low-volume hospitals in the calculation of the star ratings. CMS’s public flat files11 contain footnotes indicating why a particular quality measure for a hospital is not included; for the colon and hysterectomy measures, the number of cases or patients was never too few to report. For the other 4 measures, a footnote for some hospitals indicated that the number of cases was too few to report, meaning that there were <25 cases (but not 0 cases) over the 3 years covered by each measure6. For total joint arthroplasty complications and readmissions, 601 and 599 of 4,579 hospitals, respectively, had associated low volume, and thus these measures were excluded from the calculation of their star ratings (Table I). Similarly, for coronary artery bypass grafting mortality and readmissions, 146 and 142 of 4,579 hospitals, respectively, had low volume.
TABLE I -
Hospital Surgical Quality Measure Summary Statistics*

| Quality Measure | Description | No. of Low-Volume Hospitals |
| Total joint arthroplasty complications | Hospital-level 30-day all-cause risk-standardized complication rate following elective primary total hip arthroplasty and total knee arthroplasty | 601 |
| Total joint arthroplasty readmissions | Hospital-level 30-day all-cause risk-standardized readmission rate following elective total hip arthroplasty and total knee arthroplasty | 599 |
| Coronary artery bypass grafting mortality | Coronary artery bypass graft 30-day mortality rate | 146 |
| Coronary artery bypass grafting readmissions | Coronary artery bypass graft 30-day readmission rate | 142 |

*No. of Low-Volume Hospitals = number of hospitals at which the relevant procedure was performed but the associated quality measure was not included by CMS in the December 2017 Overall Hospital Quality Star Ratings because the number of cases was too few to report.
We then verified the volume-outcome relationship for these 4 measures in the Hospital Compare December 2017 data release12. We performed univariate linear regressions of each hospital quality measure on associated volumes. As a robustness check, we also performed multivariable regressions, controlling for state, hospital type, hospital ownership, emergency services, and meaningful use of electronic health records.
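A minimal sketch of the univariate check, using made-up data (the actual analysis used the Hospital Compare December 2017 release): ordinary least squares of a quality measure (a rate, in percent, where higher is worse) on procedure volume, with the volume-outcome relationship appearing as a negative slope.

```python
# Univariate OLS of a hospital quality measure on procedure volume.
# Data below are illustrative, not from Hospital Compare.

def ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

volumes = [25, 100, 400, 900, 1600, 3025]   # cases over 3 years
rates = [3.1, 2.9, 2.8, 2.5, 2.3, 1.9]      # complication rate (%)
a, b = ols(volumes, rates)
print(b < 0)  # negative slope: higher volume, lower complication rate
```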
For our main analysis, we imputed values of missing quality measures among low-volume facilities with use of 3 methods that varied in terms of how the missing values were filled using data from other hospitals’ non-missing quality measures: (1) using the predicted value from a best-fit multivariable regression of quality measure values on state, hospital type, hospital ownership, emergency services, and meaningful use of electronic health records (henceforth referred to as the regression-based imputation method), (2) using the 90th percentile (the bottom 10% of performance) of quality measure values (henceforth referred to as the bottom-performance imputation method) (Table I), and (3) using the mean quality measure value (henceforth referred to as the mean-based imputation method) (Table I). The mean-based imputation method was generous to low-volume hospitals given the volume-outcome association (i.e., it represented a sort of best-case scenario for these facilities). The regression-based imputation method was also generous in practice, predicting distributions of quality for low-volume hospitals centered on the overall mean.
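The 2 simpler imputation methods can be sketched as follows; higher measure values are worse (they are complication, readmission, or mortality rates), so the 90th percentile of observed values corresponds to the bottom decile of performance. The regression-based method is omitted from this sketch, and the data are hypothetical:

```python
# Mean-based and bottom-performance imputation of missing quality
# measures, as described above. Hospital names and values are made up.
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already-sorted list."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]

def impute(measures, method):
    observed = sorted(v for v in measures.values() if v is not None)
    if method == "mean":
        fill = sum(observed) / len(observed)
    elif method == "bottom":          # worst decile of performance
        fill = percentile(observed, 90)
    return {h: (v if v is not None else fill) for h, v in measures.items()}

# Hospital "E" is low-volume, so its rate is missing:
rates = {"A": 2.0, "B": 2.5, "C": 3.0, "D": 4.0, "E": None}
print(impute(rates, "mean")["E"])     # → 2.875
print(impute(rates, "bottom")["E"])   # → 4.0
```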
Independently for each quality measure and imputation method, we then repeated Hospital Compare’s methodology, otherwise unaltered, with use of CMS’s publicly released code12 (assuming a volume of 12 for imputed hospitals). Finally, we compared the results after imputation with the results from our own replication of Hospital Compare’s unaltered methodology, examining what happened to hospitals’ 1 to 5-star ratings, hospitals’ composite scores, hospitals’ domain scores, and the underlying latent variable models.
We confirmed that the volume-outcome relationship held for each of the 4 quality measures assessed (see Appendix Figure E-1). A 100-case interval increase was associated with a 0.032 percentage point decrease in total joint arthroplasty complications (p < 0.001). For total joint arthroplasty readmissions, coronary artery bypass grafting mortality, and coronary artery bypass grafting readmissions, a 100-case interval increase was associated with percentage point decreases of 0.027 (p < 0.001), 0.150 (p < 0.001), and 0.157 (p < 0.001), respectively. For example, for a hospital that performs 25 cases over 3 years, the total joint arthroplasty complication rate is expected to be 2.90%; for a hospital that performs 3,025 cases over 3 years, the rate is expected to be 1.94% (i.e., 0.029 + 3,000/100 × [−0.00032] = 0.0194). After controlling for state, hospital type, hospital ownership, emergency services, and meaningful use of electronic health records, a 100-case interval increase was associated with 0.034 (p < 0.001), 0.030 (p < 0.001), 0.158 (p < 0.001), and 0.159 (p < 0.001) percentage point decreases, respectively.
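The worked example above is a straight-line extrapolation from the fitted univariate model and can be reproduced directly:

```python
# Reproducing the worked example: a 2.90% total joint arthroplasty
# complication rate at 25 cases over 3 years, falling by 0.032
# percentage points per additional 100 cases.

def predicted_rate(cases, base_cases=25, base_rate=2.90, slope_per_100=-0.032):
    return base_rate + (cases - base_cases) / 100 * slope_per_100

print(round(predicted_rate(3025), 2))  # → 1.94
```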
For total joint arthroplasty readmissions, coronary artery bypass grafting mortality, and coronary artery bypass grafting readmissions, >99% of hospitals retained the same star rating after regression-based imputation (Fig. 1). Hospitals’ associated continuous composite scores were also nearly identical, as were their underlying domain scores and latent variable models. The other 2 imputation methods yielded nearly identical results.
However, for total joint arthroplasty complications, across all 3 imputation methods, over one-third of hospitals received a different star rating. After regression-based imputation, 12.9% of hospitals received a better rating and 25.8% received a worse rating (Fig. 1). For low-volume hospitals for which the total joint arthroplasty complication measure was imputed, 3.1% of hospitals received a better rating and 23.9% received a worse rating, whereas for hospitals for which the total joint arthroplasty complications measure was not imputed (i.e., those for which CMS-reported rates were used), 14.4% of hospitals received a better rating and 26.1% received a worse rating (see Appendix Table E-1). Results were similar for the other 2 imputation methods. The only exception was that, for the bottom-performance imputation method, among imputed hospitals, only 0.2% received a better rating and 74.4% received a worse rating (see Appendix Table E-1). Overall, although low-volume hospitals were more often hurt than helped after imputation (especially for the bottom-performance method), higher-volume hospitals were also more often hurt than helped.
Hospitals’ continuous composite scores (after taking weighted averages of the 7 domain scores) also differed after regression-based imputation of total joint arthroplasty complications (Fig. 2). The safety domain scores drove these changes: imputed hospitals’ safety scores collapsed toward the all-hospital average after imputation, and the safety scores of hospitals without total joint arthroplasty complications imputed bore little resemblance to their original values (Fig. 3). Similar trends held across imputation methods. These changes were driven by changes in latent variable model “loadings” (weights assigned to the quality measures in the safety domain). Without imputation of total joint arthroplasty complications, the Complication/Patient Safety for Selected Indicators (PSI-90) measure received the highest weight (0.944); this value was reduced to 0.189 after regression-based imputation (see Appendix Figure E-2). Conversely, the weight for total joint arthroplasty complications moved from 0.211 to 0.955. The loadings on other safety domain quality measures were small and virtually unchanged.
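The sensitivity to loadings can be illustrated with a simplified sketch in which the safety domain score is modeled as a loading-weighted average of standardized measure values; the before-and-after loadings are those reported above (PSI-90: 0.944 to 0.189; total joint arthroplasty complications: 0.211 to 0.955), the other small, essentially unchanged loadings are omitted, and the hypothetical hospital is not from the data:

```python
# How a loading "flip" can reverse a hospital's safety domain score.
# Domain scores are modeled here as normalized loading-weighted
# averages of standardized (z-scored) measure values; CMS's actual
# latent variable estimation is more involved.

def domain_score(measures, loadings):
    w = sum(loadings.values())
    return sum(loadings[m] * v for m, v in measures.items()) / w

# A hypothetical hospital doing well on PSI-90 (z = +1.0) but poorly
# on total joint arthroplasty complications (z = -1.0):
hospital = {"psi90": 1.0, "tja_comp": -1.0}
before = domain_score(hospital, {"psi90": 0.944, "tja_comp": 0.211})
after = domain_score(hospital, {"psi90": 0.189, "tja_comp": 0.955})
print(before > 0 > after)  # the loading flip reverses the score's sign
```

Under the original loadings this hospital's safety score is dominated by its strong PSI-90 performance; after the flip, the same underlying data yield a below-average safety score.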
The December 2017 CMS stars measure set includes 4 surgical quality measures with known volume-outcome relationships9,10 that are missing for some hospitals because of low volume. Imputing 3 of these measures (total joint arthroplasty readmissions, coronary artery bypass grafting mortality, and coronary artery bypass grafting readmissions) for low-volume hospitals resulted in virtually unchanged star ratings. Given the known9,10 and demonstrated volume-outcome relationships for these 3 measures, we expected that imputing quality measures from low-volume facilities would harm their relative ratings and help the relative ratings of other hospitals; however, this did not occur. This finding speaks to the stability of the underlying mortality and readmission latent variable models, although the scores that they produce omit important quality information; the hospitals excluded from these 3 measures performed fewer than 25 procedures over 3 years (about 1 procedure every 6 weeks). Yet, consumers viewing their ratings would have no indication that they should expect higher complication rates. Although these results are consistent with CMS’s assertion that changing 1 measure should make little difference for star ratings13, they point out limitations of the method in fully describing hospital quality.
In contrast, imputing missing values for the fourth measure, total joint arthroplasty complications, resulted in significantly different star ratings. The fact that low-volume hospitals were more often hurt than helped after imputation is consistent with the volume-outcome association. However, the fact that the much larger number of hospitals for which the measure was not missing were also more often hurt than helped, even when low-volume hospitals are assigned quality measures in the worst decile, is inconsistent with the volume-outcome relationship. We can trace these star rating changes to changes in the safety domain, driven by a substantially lower loading on PSI-90 in the latent variable model, and a substantially higher loading on total joint arthroplasty complications. That is, the model underlying the safety domain is currently heavily weighted toward 1 measure, PSI-90, calling into question CMS’s assertion that poor performance on 1 safety measure does not preclude high performance in safety overall14. The model generating this weighting is unstable, and changes to it have substantial impact on the stars. Our results call into question the safety latent variable model as well as the list of quality measures composing that domain, particularly PSI-90; these results are consistent with past reports of problems with PSI-9015. Although some of these issues have been addressed16, the PSI-90 component measures still may not accurately measure certain adverse events17 and can be biased by differential surveillance across hospitals18,19.
We appreciate that CMS faces a tradeoff in how it handles quality measures with few observations, particularly for quality measures such as total joint arthroplasty complications, which denote relatively rare outcomes for which risk adjustment is likely incomplete. With regard to risk adjustment, Ayers et al.20 noted that CMS’s risk adjustment uses billing data, not clinician or patient-reported health information. Using the FORCE-TJR (Function and Outcomes Research for Comparative Effectiveness in Total Joint Replacement) registry, they showed that the AUROC (area under the receiver operating characteristic curve) associated with risk-adjustment increased from 0.65 to 0.79 after adjusting for relevant clinical variables. Without proper risk adjustment, quality of care cannot be differentiated from preexisting conditions, fair comparisons among hospitals cannot occur, and “cherry-picking” and “lemon-dropping” are incentivized21. Lack of adjustment for socioeconomic status could be another possible source of incomplete risk adjustment. CMS’s reliance on claims-based measures missing these details, especially in light of our analyses, is cause for hesitation. On the one hand, with regard to the rarity of outcomes, the volume-outcome relationship predicts that omitting measures from low-volume hospitals can impact star ratings (as we have demonstrated for total joint arthroplasty complications). On the other hand, lower volume indicates a greater degree of uncertainty around the usefulness of a quality measure for predicting performance. The calculated quality measure for some low-volume hospitals therefore will put them in a more positive light than might be warranted and may put other hospitals in a more negative light. Perhaps a reasonable compromise would be to use the omitted quality data to calculate the star ratings and to include those low-volume measures in the released data files but not elsewhere on the CMS website. 
Another approach could be to add a structural measure that requires a volume threshold.
Although it might seem counterintuitive that imputing one hospital’s quality measure can impact the domain scores and ratings of other, non-imputed hospitals, this interconnectedness is by design. The loadings assigned to the quality measures are estimated jointly from all hospitals’ available data, producing domain scores based on relative performance. As health-care incentives move toward value-based programs in which higher quality is rewarded financially, defining quality in relative terms may very well prevent some providers with high absolute quality scores, but lower relative quality scores, from achieving these rewards. Relative ratings will tend to overemphasize the differences between hospitals, even if true differences are minimal. This overemphasis, in turn, could engender counterproductive competition by discouraging hospitals from sharing best practices and collaborating.
The star ratings are intended to reflect the type of care provided most often at any given hospital. A related limitation of the star ratings is that they are not comparable across hospitals because the specific quality measures used to calculate them differ from hospital to hospital22. Existing work has focused on measuring systematic differences in the star ratings and the number of measures used to calculate them across hospital types23. Hospitals that report fewer measures (smaller, non-teaching, specialty hospitals) tend to have more stars15. Not including low-volume surgical quality measures could exacerbate this bias beyond the implications of the volume-outcome relationship by mechanically reducing the number of quality measures included in the star ratings, rendering the measure set even less uniform.
In April 2018, CMS announced that it would update the star ratings only in its July and December releases14,24. The July 2018 release, however, was cancelled after CMS published its preview reports; there was concern regarding updates to the measures in the safety domain and its latent variable model: the loading on PSI-90 fell (from 0.94 to 0.17), whereas the loading on total joint arthroplasty complications commensurately rose (from 0.21 to 0.96)25,26. This “flip” is nearly identical to what we found, although our findings resulted from imputing low-volume hospitals, whereas CMS’s findings resulted from updating the underlying measures and their collection period. Importantly, the updated PSI-90 included substantial methodological changes (i.e., the components that were included in PSI-90, their weighting, and an underlying switch from International Classification of Diseases (ICD)-9 to ICD-1027). It seems clear that the safety domain, whether from imputing low-volume hospitals or changing an underlying quality measure, is unstable; therefore, applying a latent variable model to it is problematic.
The present study had several limitations. First, from the public dataset, we did not know the actual volume of the relevant procedure among the low-volume hospitals, other than that it was <25 (and not zero); we assumed a value of 12. Second, we did not know the actual values of omitted low-volume quality measures. Given that our results did not vary among our best-case and worst-case imputation scenarios and that the true values of these missing variables likely lie somewhere between these scenarios, we believe that our findings would hold even if the actual values were known. Regarding both points, in order to facilitate replication and transparency, we felt that it was important to avoid using data that were not publicly posted for an analysis of a public ratings system. Third, 2 of our imputation methods were fairly simple: assigning all low-volume hospitals either the mean or 90th percentile (worst decile) among non-missing hospitals. It is unlikely that, if we had been able to observe these missing quality measures, they all would have been the same. To address this concern, we fitted a multivariable regression based on a number of covariates to assign quality measures to low-volume hospitals. Fourth, although our focus was on volume, many other factors are important for achieving good outcomes (good teams, standardization of protocols, management of risk, etc.); most volume research investigates the impact of volume, holding constant these sorts of factors to the extent they are measured (or are at least correlated with measured factors).
In conclusion, the CMS Overall Hospital Quality Star Ratings do not represent the risks of undergoing procedures at low-volume hospitals, potentially misrepresent quality across facilities, and hence are of uncertain utility to consumers. The fact that excluding scores of low-volume hospitals for 3 of 4 quality measures with a demonstrated volume-outcome relationship did not impact the overall star ratings raises questions as to the meaningfulness of the rating system, especially for patients seeking quality information for these procedures. Moreover, the fact that the total joint arthroplasty complication measure impacts ratings for both low-volume and other hospitals raises questions regarding the safety domain.
Supporting material provided by the authors is posted with the online version of this article as a data supplement at jbjs.org (http://links.lww.com/JBJSOA/A90).
1. Hartman M, Martin AB, Espinosa N, Catlin A; The National Health Expenditure Accounts Team. National health care spending in 2016: spending and enrollment growth slow after initial coverage expansions. Health Aff (Millwood). 2018 Jan;37(1):150-60. Epub 2017 Dec 6.
4. Venkatesh AK, Bernheim SM, Qin L, Bao H, Simoes J, Norton E, Wing M, Glennon G, Shah R, Herrin J, Lin H, Lin Z, Krumholz HM. Overall hospital quality star rating on Hospital Compare December 2017 updates and specifications report. 2017 Dec. https://s3.amazonaws.com/assets.fiercemarkets.net/public/004-Healthcare/StrRtg_Dec_QtrUpdtSpecRpt.pdf. Accessed 2018 Dec 14.
5. Venkatesh AK, Bernheim SM, Qin L, Bao H, Simoes J, Norton E, Wing M, Glennon G, Shah R, Herrin J, Lin H, Lin Z, Krumholz HM. Overall hospital quality star ratings on Hospital Compare, methodology report (v2.0). 2016 May. https://cpb-us-w2.wpmucdn.com/u.osu.edu/dist/c/28860/files/2016/08/Star_Rtngs_CompMthdlgy_052016-148w094.pdf. Accessed 2018 Dec 14.
7. Halm EA, Lee C, Chassin MR. Is volume related to outcome in health care? A systematic review and methodologic critique of the literature. Ann Intern Med. 2002 Sep 17;137(6):511-20.
8. Reames BN, Ghaferi AA, Birkmeyer JD, Dimick JB. Hospital volume and operative mortality in the modern era. Ann Surg. 2014 Aug;260(2):244-51.
9. Wilson S, Marx RG, Pan TJ, Lyman S. Meaningful thresholds for the volume-outcome relationship in total knee arthroplasty. J Bone Joint Surg Am. 2016 Oct 19;98(20):1683-90.
10. Tsai TC, Joynt KE, Orav EJ, Gawande AA, Jha AK. Variation in surgical-readmission rates and quality of hospital care. N Engl J Med. 2013 Sep 19;369(12):1134-42.
12. Centers for Medicare & Medicaid Services. Hospital Compare December 2017 data release. 2017 Dec. Accessed 2018 Oct 4.
15. Bilimoria KY, Barnard C. The new CMS hospital quality star ratings: the stars are not aligned. JAMA. 2016 Nov 1;316(17):1761-2.
17. Utter GH, Zrelak PA, Baron R, Tancredi DJ, Sadeghi B, Geppert JJ, Romano PS. Positive predictive value of the AHRQ accidental puncture or laceration patient safety indicator. Ann Surg. 2009 Dec;250(6):1041-5.
18. Bilimoria KY, Chung J, Ju MH, Haut ER, Bentrem DJ, Ko CY, Baker DW. Evaluation of surveillance bias and the validity of the venous thromboembolism quality measure. JAMA. 2013 Oct 9;310(14):1482-9.
19. Meddings JA, Reichert H, Hofer T, McMahon LF Jr. Hospital report cards for hospital-acquired pressure ulcers: how good are the grades? Ann Intern Med. 2013 Oct 15;159(8):505-13.
20. Ayers DC, Fehring TK, Odum SM, Franklin PD. Using joint registry data from FORCE-TJR to improve the accuracy of risk-adjustment prediction models for thirty-day readmission after total hip replacement and total knee replacement. J Bone Joint Surg Am. 2015 Apr 15;97(8):668-71.
21. Fehring TK. AAHKS risk adjustment initiative: why is it important? J Arthroplasty. 2016 Jun;31(6):1148-50. Epub 2016 Mar 22.
23. DeLancey JO, Softcheck J, Chung JW, Barnard C, Dahlke AR, Bilimoria KY. Associations between hospital characteristics, measure reporting, and the Centers for Medicare & Medicaid Services Overall Hospital Quality Star Ratings. JAMA. 2017 May 16;317(19):2015-7.
24. Centers for Medicare & Medicaid Services. Overall hospital quality star rating FAQs for hospitals April 2018 preview reports. 2018 Apr.
27. Centers for Medicare & Medicaid Services. CMS recalibrated patient safety indicators fact sheet.
Supplemental Digital Content
Copyright © 2019 The Authors. Published by The Journal of Bone and Joint Surgery, Incorporated. All rights reserved.