“We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.”1
R. A. Fisher (1890–1962)
EPIDEMIOLOGY OF HYPOTHYROIDISM
Thyroid disorders are prevalent worldwide, and their specific clinical manifestations are mainly determined by the availability of dietary iodine.2,3 In iodine-replete or iodine-sufficient areas, the most common cause of primary hypothyroidism is chronic autoimmune disease.4 The prevalence of such spontaneous hypothyroidism is 1% to 2%; it is 10 times more common in women than men and increases with age.4
Subclinical hypothyroidism represents early, mild thyroid failure, which is characterized by an increased serum thyroid-stimulating hormone (TSH) level but normal free thyroxine (T4) and triiodothyronine (T3) levels.4,5 Partly because the upper limit of normal for serum TSH is applied inconsistently (between 5.0 and 10.0 mIU/L),6 subclinical hypothyroidism reportedly exists in 4% to 20% of the adult population.5 Although subclinical thyroid disease is being diagnosed clinically more often across the adult age range, the long-term clinical significance of such subclinical thyroid dysfunction, and hence the value of early thyroid replacement therapy, remains debated.7
PERIOPERATIVE MANAGEMENT OF HYPOTHYROIDISM
Chronic hypothyroidism is a multisystem disease that encompasses a broad clinical spectrum. Thyroid hormone has major cardiovascular effects.8–10 A deficiency in thyroid activity results in decreased inotropy and chronotropy and increased systemic vascular resistance.10 The perioperative management of hypothyroidism, with the principal goal of achieving and maintaining a clinically euthyroid state, has been well described.10–13
Although it is conventionally believed that most patients with mild to moderate hypothyroidism can undergo surgery without a disproportionate increase in perioperative risk, few studies have directly assessed the clinical outcomes of hypothyroid patients after various surgical procedures.10,11 The paucity and contradictory nature of existing outcomes data prompted Komatsu et al.14 to report in this month’s issue of Anesthesia & Analgesia on the postoperative effects of hypothyroidism in their cohort study of 134,607 adults who underwent noncardiac surgery at the Cleveland Clinic Main Campus between 2005 and 2012. Their qualifying 8217 study patients were divided into 3 groups based on a preexisting diagnosis, current thyroid hormone supplementation, and/or the most recent available TSH concentration: (1) actively hypothyroid (TSH > 5.5 mIU/L); (2) previously diagnosed but effectively treated hypothyroid (0.4 < TSH < 5.5 mIU/L); and (3) historically euthyroid (normal TSH).
Their primary outcome was a composite of major in-hospital cardiovascular morbidity, surgical wound complication or infectious morbidity, and in-hospital mortality. There were no significant differences among the enrolled hypothyroid, treated, and euthyroid patients on this primary composite outcome, nor were there significant differences among the patient groups on its mortality subcomponent.
Their 2 secondary outcomes were the intraoperative use of any vasopressor and the duration of hospitalization. After adjusting for known confounding conditions, the actively hypothyroid patients were slightly, but significantly, more likely to have received an intraoperative vasopressor than either the treated hypothyroid patients or the euthyroid patients. The 3 groups were descriptively similar in their systolic and diastolic blood pressures during surgery. The actively hypothyroid patients were slightly yet significantly less likely to be discharged at any given postoperative time than the effectively treated hypothyroid patients.
It should be noted that a conservative serum TSH diagnostic cut-point level of >5.5 mIU/L was applied by Komatsu et al.,14 which may have resulted in a significant number of subclinical hypothyroid patients being misclassified as actively hypothyroid. This may have resulted in differential systematic misclassification and, thus, in an underestimate of the strength of the association between more frank hypothyroidism and their primary composite outcome and secondary outcomes. A sensitivity analysis of the subset of actively hypothyroid patients with a serum TSH level >10.0 mIU/L would address this issue.
GROWING NUMBER OF RETROSPECTIVE COHORT STUDIES USING EXISTING “BIG DATA”
Political campaigns, government, and businesses (e.g., 2012 Obama campaign, National Security Agency, Google) use applicable big data to learn everything possible about their constituents or customers and then apply advanced computation to hone their strategies.15 Such big data also include the vast reservoirs of health care–related information that are now routinely being collected by individual health systems and governmental and nongovernmental organizations.16 The work by Komatsu et al.14 is an example of the increasing use of big data in health care delivery and human subjects research,15,16 including in perioperative medicine.17
Large perioperative data repositories or registries in the United States include those of the American College of Surgeons (ACS) National Surgical Quality Improvement Program (NSQIP), Society of Thoracic Surgeons (STS) National Database, Closed Claims Registry of settled malpractice cases, Multicenter Perioperative Outcomes Group (MPOG), and Anesthesia Quality Institute (AQI) National Anesthesia Clinical Outcomes Registry (NACOR). A single-center database (e.g., at the Cleveland Clinic or the Mayo Clinic) affords access to synchronized electronic health record and administrative claims data; however, such single-center data may lack external validity (generalizability).
As Sessler17 has observed, the quality (scope and validity of the variables) and the density (size) of such perioperative data registries vary substantially. The existing (“secondary”) data in such registries can be used for case-control and retrospective cohort studies, health services research, quality assessment and improvement, or modeling for conducting prospective studies. Of these, by far the most common use of existing big data has been in case-control and retrospective cohort studies.17 The key feature of a retrospective cohort study is that the study is designed and subjects are enrolled after the exposure or intervention and outcome(s) of interest have already occurred. One major advantage of a retrospective cohort study is its efficiency, both for conducting a given study and for how many distinct studies and discrete analyses can be generated from a single existing data registry, but there are perils for the unwary.
PERILS OF RETROSPECTIVE COHORT STUDIES USING EXISTING BIG DATA
Although there is great promise in big data, there is also the clear possibility of peril. Among the perils of any retrospective cohort study using big data are selection bias, lack of generalizability, and the ever-present threat of confounding. In their current study, Komatsu et al.14 are very careful to acknowledge each of these limitations and have conducted extremely thorough analyses using state-of-the-art statistical techniques to control for confounding. Their use of these techniques, primarily propensity scores to address confounding, helps advance research into the association between hypothyroidism and outcomes.
However, as with all articles that advance knowledge, addressing certain questions raises new questions. For Komatsu et al.,14 several methodologic questions can be raised. The greatest of these methodologic questions for such articles using propensity scores is “How close is close enough?” To be reasonably confident that potential identified confounders have been adequately controlled, how small should the absolute standardized difference among the comparison groups be? For this article, given that most results are failures to reject the null hypothesis and that the absolute standardized difference values are small (several <0.05), confounding is not likely an issue in interpreting the results.
Although such small absolute standardized differences for the covariates included in the present analyses are reassuring about confounding by these identified covariates, propensity score methods cannot provide reassurance about unmeasured or unknown confounders. Furthermore, although Austin18 provides some guidance, the threshold for absolute standardized difference remains an open area of research that warrants further investigation.
ABSOLUTE STANDARDIZED DIFFERENCE
In a prospective randomized controlled trial or an observational cohort study, an imbalance or difference (simply due to chance) between the study groups for a baseline demographic, anthropometric, or clinical variable can confound the relationship(s) of interest.18,19 This relationship of interest can be between the active intervention/treatment (e.g., preoperative β-blocker therapy) or a pre-existing condition (e.g., hypothyroidism), and an outcome (e.g., postoperative myocardial infarction, hospital length of stay, or mortality).
Absolute standardized difference measures the effect size between 2 study groups, and in contrast to the P value generated by a t test, Wilcoxon rank sum test, or χ2 test, it is independent of sample size.19 Absolute standardized difference scores have consequently been recommended for comparing baseline covariates in clinical trials, as well as with propensity score-matched, nonrandomized, observational study data.18,20
Cohen21 has suggested that such an effect size index of <0.2 can be used to represent a small and insignificant effect size. There is no universally accepted threshold for the absolute standardized difference score that indicates the presence of meaningful imbalance. However, Austin18,20 has proposed a standard methodology to compare the distribution of baseline continuous and binary covariates between treatment groups in observational studies. Of note, standardized difference cut-point scores of 0.1, 0.2, and 0.3 have been variously applied by the Department of Outcomes Research at the Cleveland Clinic.22–24 In this research group’s current retrospective cohort study of the postoperative effects of hypothyroidism, Komatsu et al.14 applied a cut-point of <0.2 for covariate balance.
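As a minimal sketch of the measure described above, the following Python functions compute the absolute standardized difference for a continuous and for a binary baseline covariate, using the commonly cited form in which the between-group difference is divided by the square root of the average of the 2 group variances. The function names and the example prevalences are hypothetical and are not taken from Komatsu et al.

```python
import math
from statistics import mean, variance


def std_diff_continuous(x1, x2):
    """Absolute standardized difference for a continuous covariate.

    The difference in group means is divided by the square root of the
    average of the two sample variances, so the result is expressed in
    standard deviation units rather than driven by sample size.
    """
    pooled_sd = math.sqrt((variance(x1) + variance(x2)) / 2.0)
    return abs(mean(x1) - mean(x2)) / pooled_sd


def std_diff_binary(p1, p2):
    """Absolute standardized difference for a binary covariate.

    p1 and p2 are the prevalences (proportions) of the covariate in
    each of the two groups.
    """
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2.0)
    return abs(p1 - p2) / pooled_sd


if __name__ == "__main__":
    # Hypothetical prevalences of a comorbidity in two study groups.
    d = std_diff_binary(0.30, 0.25)
    cut_point = 0.1  # one of the cut-points mentioned in the text
    print(f"absolute standardized difference = {d:.3f}")
    print("imbalanced" if d >= cut_point else "balanced")
```

Whether a given value counts as meaningful imbalance depends entirely on the cut-point chosen, which is the open question raised in this section.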
INVERSE-WEIGHTED PROPENSITY SCORES
Komatsu et al.14 have provided an excellent example of a sensitivity analysis in their article. Knowing that propensity score analysis is not 1 method but actually a set of multiple approaches, the authors showed by conducting the analysis 3 times, using different propensity score approaches (inverse propensity score weighting [primary], multiple propensity score method [sensitivity analysis 1], and propensity score matching [sensitivity analysis 2]), how sensitive their results are to the method chosen. The authors have shown that their results are consistent and robust to the propensity score method chosen, which provides greater confidence that the results are not because of an artifact of a specific propensity score approach. The authors are commended for taking these extra steps and illustrating a thorough sensitivity analysis.
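To illustrate the general idea behind inverse propensity score weighting, the sketch below weights each patient by the reciprocal of the estimated probability of belonging to the group in which the patient was actually observed. It assumes that these probabilities have already been estimated by some model that is not shown; the group labels, probabilities, and function names are hypothetical, and the sketch is not a reproduction of the analysis by Komatsu et al.

```python
def inverse_probability_weights(groups, propensities):
    """Return one weight per patient: 1 / P(observed group | covariates).

    groups       -- observed group label for each patient
    propensities -- for each patient, a dict mapping every group label
                    to its estimated probability (assumed to come from
                    a previously fitted model, not shown here)
    """
    weights = []
    for group, probs in zip(groups, propensities):
        p = probs[group]
        if not 0.0 < p <= 1.0:
            raise ValueError("propensity must lie in (0, 1]")
        weights.append(1.0 / p)
    return weights


def weighted_mean(values, weights):
    """Weighted mean of an outcome (or covariate) within one group."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical data for three patients.
    groups = ["hypothyroid", "euthyroid", "treated"]
    propensities = [
        {"hypothyroid": 0.2, "treated": 0.3, "euthyroid": 0.5},
        {"hypothyroid": 0.1, "treated": 0.1, "euthyroid": 0.8},
        {"hypothyroid": 0.25, "treated": 0.5, "euthyroid": 0.25},
    ]
    print(inverse_probability_weights(groups, propensities))
```

Patients who were unlikely, given their covariates, to be in their observed group receive larger weights, which is how the weighted groups come to resemble one another on the measured covariates. Covariate balance would then be rechecked in the weighted sample.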
USE OF SEVERITY WEIGHTING OF THE COMPONENTS OF A COMPOSITE OUTCOME
Regarding outcome measures, Mascha and Sessler25 have previously described the benefits and disadvantages of analyzing a collapsed composite outcome (any versus none), the count of individual components, and each individual component separately. The authors have advocated for a binary composite outcome and, within their current article, offered a severity-weighted composite of mortality, cardiovascular complications, and wound complications. Such severity weighting is needed to avoid the improper and/or illogical collapsing of a major outcome (e.g., death or myocardial infarction) and a minor outcome (e.g., pulmonary edema or postoperative nausea/vomiting) into a composite. The clinical severity weights in the study by Komatsu et al.14 were obtained by surveying 11 attending anesthesiologists not directly involved in the study.
As discussed in the study by Mascha and Sessler,25 such subjective weighting across outcomes can cause reticence among researchers. One of the drawbacks of using subjective weightings to design a composite outcome is a decrease in external validity, namely generalizability. Ideally, the majority of the academic and scientific community agrees with the weighting scheme. If not, the conclusions may be accepted by only a select few.
One avenue around the stated drawback is similar to the sensitivity analysis conducted regarding propensity score methods. A sensitivity analysis regarding the severity weights would prove beneficial. Do changes in severity weighting lead to negligible changes in P values, conclusions, and interpretations? If so, the results could be considered consistent and robust to the severity weighting schemes. As with the absolute standardized difference, the use of a binary composite outcome remains a promising and open area of research that warrants further investigation.
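The proposed sensitivity analysis on severity weights can be sketched as follows. The sketch assumes, purely for illustration, that each patient's composite score is the weighted sum of that patient's binary component outcomes; how Komatsu et al. actually combined their weights is not described here, and the component names, weight values, and patient records are all hypothetical.

```python
def composite_score(events, weights):
    """Weighted sum of binary component outcomes for one patient
    (an assumed, illustrative way of forming the composite)."""
    return sum(weights[name] * events[name] for name in weights)


def mean_score_difference(group_a, group_b, weights):
    """Difference in mean composite score between two groups."""
    mean_a = sum(composite_score(p, weights) for p in group_a) / len(group_a)
    mean_b = sum(composite_score(p, weights) for p in group_b) / len(group_b)
    return mean_a - mean_b


if __name__ == "__main__":
    # Hypothetical patient records: 1 = event occurred, 0 = no event.
    group_a = [
        {"death": 0, "cardiovascular": 1, "wound": 0},
        {"death": 0, "cardiovascular": 0, "wound": 1},
    ]
    group_b = [
        {"death": 0, "cardiovascular": 0, "wound": 0},
        {"death": 0, "cardiovascular": 0, "wound": 1},
    ]
    # Hypothetical alternative weighting schemes to compare.
    schemes = {
        "equal": {"death": 1.0, "cardiovascular": 1.0, "wound": 1.0},
        "survey-like": {"death": 1.0, "cardiovascular": 0.6, "wound": 0.2},
    }
    for label, weights in schemes.items():
        diff = mean_score_difference(group_a, group_b, weights)
        print(f"{label}: difference in mean score = {diff:.2f}")
```

If the direction and approximate size of the between-group difference, and the associated inference, remain stable across plausible weighting schemes, the conclusions can be regarded as robust to the subjective choice of weights.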
From a statistical standpoint, complicated statistical analyses lead to complicated discussions of statistical power. Komatsu et al.14 have appropriately used statistical methods to examine a hypothesis that can be tested with big data. However, a traditional statistical power calculation is very hard to apply in this case. Power calculations for retrospective studies assume a fixed, known sample size and then calculate how much power that sample size provides to detect a clinically relevant difference in outcome.
As typified by Komatsu et al.,14 several factors complicate the standard power approach. First, the sample size depends on the chosen propensity score approach. Second, the variance of the outcome depends on the distribution of the propensity scores, the weights assigned in the severity weighted composite of binary components, the prevalence of the binary components in the severity weighted composite, and the correlation among the binary components in the severity weighted composite.26 To expect any researchers to correctly assume all those factors before beginning the investigation is unrealistic.
For investigators who may attempt in future research to adopt the methods used here by Komatsu et al.,14 there may be no choice but to conduct the complete analysis and then, based on the sample size and variances observed in the analysis, calculate the minimum effect size for which the study has adequate power. The clear danger in this approach is that researchers may find, at the end of the analyses, that the study, despite having access to a large number of records, is not adequately powered to detect a clinically relevant difference among study groups. When presenting results obtained with the methods used by Komatsu et al.,14 investigators should clearly and transparently explain how the initial sample of records was obtained; how many records from that sample were excluded from the final analysis and why (missing data, propensity score matching did not produce a suitable match, etc.); the final sample size of analyzable records; and the minimum effect size that is detectable with adequate power. Such reporting is essential to build confidence in the results.
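The post hoc calculation described above can be sketched under simplifying assumptions: a comparison of 2 groups on a mean outcome, a two-sided test, and a normal approximation. The function names and input values are hypothetical. For a weighted analysis, a Kish-type effective sample size is shown as one possible stand-in for the raw group size; this is an assumption for illustration, not a description of the approach taken by Komatsu et al.

```python
import math
from statistics import NormalDist


def effective_sample_size(weights):
    """Kish-type effective sample size for a weighted group:
    (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def minimum_detectable_difference(var_a, n_a, var_b, n_b,
                                  alpha=0.05, power=0.80):
    """Smallest true difference in means detectable with the requested
    power, given observed variances and (effective) group sizes.

    Uses the normal approximation:
        (z_{1 - alpha/2} + z_{power}) * sqrt(var_a/n_a + var_b/n_b)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    return (z_alpha + z_power) * standard_error


if __name__ == "__main__":
    # Hypothetical observed variances and analyzable group sizes.
    mdd = minimum_detectable_difference(var_a=1.0, n_a=100,
                                        var_b=1.0, n_b=100)
    print(f"minimum detectable difference = {mdd:.3f}")
```

Comparing this minimum detectable difference with the smallest difference considered clinically relevant shows whether the completed analysis was adequately powered.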
Although registry-based retrospective cohort studies can certainly complement randomized controlled trials, their proliferation is fraught with potential problems. Komatsu et al.14 are to be commended for addressing many of these potential problems in their current retrospective cohort study of the postoperative effects of hypothyroidism. However, despite advances in epidemiological and statistical methods to address systematic bias and confounding, these methods are not a panacea. There remain legitimate concerns that inferences made from retrospective observational studies using big data can lead to poor health care decisions because of equating statistical significance with clinical significance and mistaking association for causation.27 Hence, “caveat emptor” should still prevail among the consumers of such studies.
Name: Thomas R. Vetter, MD, MPH.
Contribution: This author helped write the manuscript.
Attestation: Thomas R. Vetter approved the final manuscript.
Name: David T. Redden, PhD.
Contribution: This author helped write the manuscript.
Attestation: David T. Redden approved the final manuscript.
This manuscript was handled by: Sorin J. Brull, MD, FCARCSI (Hon).
1. Fisher RA. Statistical methods and scientific induction. J R Stat Soc Series B. 1955;B17:68–78
2. Vanderpump MP. The epidemiology of thyroid disease. Br Med Bull. 2011;99:39–51
3. Zimmermann MB. Iodine deficiency. Endocr Rev. 2009;30:376–408
4. Vanderpump MP, Tunbridge WM. Epidemiology and prevention of clinical and subclinical hypothyroidism. Thyroid. 2002;12:839–47
5. Cooper DS, Biondi B. Subclinical thyroid disease. Lancet. 2012;379:1142–54
6. Garber JR, Cobin RH, Gharib H, Hennessey JV, Klein I, Mechanick JI, Pessah-Pollack R, Singer PA, Woeber KA; American Association of Clinical Endocrinologists and American Thyroid Association Taskforce on Hypothyroidism in Adults. Clinical practice guidelines for hypothyroidism in adults: cosponsored by the American Association of Clinical Endocrinologists and the American Thyroid Association. Thyroid. 2012;22:1200–35
7. Biondi B, Cooper DS. The clinical significance of subclinical thyroid dysfunction. Endocr Rev. 2008;29:76–131
8. Danzi S, Klein I. Thyroid hormone and the cardiovascular system. Med Clin North Am. 2012;96:257–68
9. Klein I, Ojamaa K. Thyroid hormone and the cardiovascular system. N Engl J Med. 2001;344:501–9
10. Kohl BA, Schwartz S. How to manage perioperative endocrine insufficiency. Anesthesiol Clin. 2010;28:139–55
11. Stathatos N, Wartofsky L. Perioperative management of patients with hypothyroidism. Endocrinol Metab Clin North Am. 2003;32:503–18
12. Connery LE, Coursin DB. Assessment and therapy of selected endocrine disorders. Anesthesiol Clin North America. 2004;22:93–123
13. Njoku MJ. Patients with chronic endocrine disease. Med Clin North Am. 2013;97:1123–37
14. Komatsu R, You J, Mascha EJ, Sessler DI, Kasuya Y, Turan A. The effect of hypothyroidism on a composite of mortality, cardiovascular and wound complications after noncardiac surgery: a retrospective cohort analysis. Anesth Analg. 2015;121:716–26
15. Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA. 2014;311:2479–80
16. Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA. 2013;309:1351–2
17. Sessler DI. Big Data—and its contributions to peri-operative medicine. Anaesthesia. 2014;69:100–5
18. Austin PC. Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Comm Statist Simulation Comput. 2009;38:1228–34
19. Yang D, Dalton JE. A unified approach to measuring the effect size between two groups using SAS®. SAS Global Forum 2012. Cary, NC: SAS Institute; 2012:1–6
20. Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity-score. Pharmacoepidemiol Drug Saf. 2008;17:1218–25
21. Cohen J. The effect size. In: Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988:8–13
22. Abdelmalak BB, Bonilla A, Mascha EJ, Maheshwari A, Tang WH, You J, Ramachandran M, Kirkova Y, Clair D, Walsh RM, Kurz A, Sessler DI. Dexamethasone, light anaesthesia, and tight glucose control (DeLiT) randomized controlled trial. Br J Anaesth. 2013;111:209–21
23. Abdelmalak BB, Cata JP, Bonilla A, You J, Kopyeva T, Vogel JD, Campbell S, Sessler DI. Intraoperative tissue oxygenation and postoperative outcomes after major non-cardiac surgery: an observational study. Br J Anaesth. 2013;110:241–9
24. Komatsu R, You J, Mascha EJ, Sessler DI, Kasuya Y, Turan A. Anesthetic induction with etomidate, rather than propofol, is associated with increased 30-day mortality and cardiovascular morbidity after noncardiac surgery. Anesth Analg. 2013;117:1329–37
25. Mascha EJ, Sessler DI. Statistical grand rounds: design and analysis of studies with binary-event composite endpoints: guidelines for anesthesia research. Anesth Analg. 2011;112:1461–71
26. Mascha EJ, Imrey PB. Factors affecting power of tests for multiple binary outcomes. Stat Med. 2010;29:2890–904
27. Dahabreh IJ, Kent DM. Can the learning health care system be educated with observational data? JAMA. 2014;312:129–30