Future studies regarding infections in premature infants or critically ill newborns and children will fall into two domains: first, randomized, controlled trials assessing the efficacy or cost-effectiveness of interventions and, second, surveillance studies aiming at elucidating risk factors, prognostic signs, or trends in outcomes and incidence rates (1, 2). Whatever objective the study pursues, the success hinges on the correct identification of infected infants (cases) and noncases. Adjudication takes place either when a decision about enrolment of a patient is made, at decision branches during the course of treatment, or, in cohort studies, when the underlying disease process must ultimately be determined.
The inherent difficulty in adjudicating the presence or absence of infection arises from the heterogeneity in presentation, symptoms, and changes in laboratory and other variables, and from the imperfect accuracy of the various diagnostic procedures (3, 4). Two examples illustrate the problem. The first is a hypothetical test accuracy study in premature infants. Assume a new marker has an excellent sensitivity of 95% and a specificity of 95% for the detection of systemic bacterial infection. This test is employed in a cohort study of infants with suspected infection and compared against the gold standard of clinical signs plus a positive blood culture. Because a considerable proportion of blood cultures obtained from immature infants with systemic bacterial infection remain falsely negative (5), some truly infected infants whom the marker correctly flags will be counted as false positives. The study will therefore be biased toward showing test performance inferior to the test’s true properties, unless investigators employ a case-control design by omitting ambiguous cases from the analysis (6). The latter type of research design, however, has been associated with severely inflated test accuracy measures (7).
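The size of the bias in the first example can be sketched numerically. The following Python fragment is only an illustration under stated assumptions: the 20% prevalence of true infection and the 60% blood culture sensitivity are hypothetical values, not data from the cited studies, and the function name `apparent_accuracy` is introduced here for illustration. It further assumes that errors of the marker and of the blood culture are independent given the true infection status.

```python
def apparent_accuracy(marker_sens, marker_spec, ref_sens, ref_spec, prevalence):
    """Accuracy a marker appears to have when judged against an imperfect reference.

    Assumes marker and reference errors are independent given true infection status.
    """
    infected, uninfected = prevalence, 1.0 - prevalence
    # Probability of each (true status, reference result) combination.
    ref_pos_infected = infected * ref_sens
    ref_pos_uninfected = uninfected * (1.0 - ref_spec)
    ref_neg_infected = infected * (1.0 - ref_sens)
    ref_neg_uninfected = uninfected * ref_spec
    # Marker-positive fraction among reference-positive episodes.
    apparent_sens = (
        ref_pos_infected * marker_sens + ref_pos_uninfected * (1.0 - marker_spec)
    ) / (ref_pos_infected + ref_pos_uninfected)
    # Marker-negative fraction among reference-negative episodes.
    apparent_spec = (
        ref_neg_uninfected * marker_spec + ref_neg_infected * (1.0 - marker_sens)
    ) / (ref_neg_infected + ref_neg_uninfected)
    return apparent_sens, apparent_spec


# Hypothetical inputs: 20% true prevalence, culture sensitivity 60%, culture specificity 100%.
sens, spec = apparent_accuracy(0.95, 0.95, ref_sens=0.60, ref_spec=1.0, prevalence=0.20)
print(f"apparent sensitivity {sens:.2f}, apparent specificity {spec:.2f}")
```

Under these assumed inputs, the marker's sensitivity is preserved, but its specificity appears to fall from 0.95 to about 0.87 because culture-negative infected infants are counted as noncases.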
The second example illustrating the difficulties arising from adjudication relates to a randomized trial in septic shock. Successful interventions should be implemented rapidly after the onset of the immune cascade that ultimately culminates in septic shock (8). If broad inclusion criteria are applied, many children will be included who would not have progressed to shock. This may bias the results toward a null difference between treatment and control, even if the treatment reduces morbidity and mortality. On the other hand, if strict inclusion criteria are enforced, the experimental treatment may not be started before the window of opportunity has closed, again biasing the observed effect toward the null.
In the following, I provide a secondary review of two previously published studies (9, 10) and discuss their implications for case adjudication. In addition, a hypothetical simulation model illustrates the potential misclassification consequences of applying adult consensus definitions in the real world of pediatric intensive care, where blood cultures have imperfect sensitivity. The present review addresses these adjudication issues from three perspectives. The first perspective relates to the reliability of post hoc adjudication of infectious episodes. Such post hoc adjudication is the usual procedure in most diagnostic accuracy studies and in surveillance studies (9). The second perspective reviews data from a study on physicians’ ability to estimate disease probabilities at the bedside. Although arriving at probability estimates is an integral part of everyday decision making, such likelihood assumptions are rarely enumerated (10). The third perspective uses the example of catheter-related sepsis to illustrate the consequences of applying consensus conference definitions when the sensitivity of blood cultures is limited. The detailed methods and results of the studies underlying perspectives 1 and 2 have been published elsewhere (9, 10).
Challenge of Diagnosing Sepsis
Decision theory regards decision making in critical care as a process of quantitative reasoning based on the probability that an infection is present and on the correct interpretation of data from diagnostic tests (11, 12). The clinician also has to consider the harms and benefits of treatment. Because uncertainty accompanies this decision process regarding infections, the net result is a low antibiotic treatment threshold. This leads to a high treatment rate among uninfected patients (2). It has been estimated that up to 30 otherwise healthy newborns and up to ten critically ill infants are treated for every patient in whom infection can ultimately be confirmed. Estimates from the United States suggest that a quarter of all healthcare expenditures in neonatal care are attributable to the workup of patients with suspected infection (13). The paradigm of early diagnosis rests on the assumption that infections may be detected by laboratory variables or by combinations of subtle signs that allow the diagnosis to be advanced by 24 to 48 hrs. It is believed that this advancement of diagnosis opens a window of opportunity to institute appropriate treatment before infection progresses to the full clinical picture of septic shock (Fig. 1) (8). Various new diagnostic tests and scores have been suggested to facilitate early diagnosis (14–17). To date, however, no single biological infection marker has gained unanimous acceptance.
Post Hoc Adjudication
The adjudication procedure in surveillance studies is the hindsight review of purpose-designed charts (18). This also applies to the ward-round review of previous decisions to initiate or to withhold antibiotic treatment. In episodes of suspected infection in which classification remains ambiguous, senior clinicians base their decision making on the clinical signs, the history, the available laboratory data, the time course of events, and their own experience. To investigate the reliability and validity of such hindsight judgment, we conducted a prospective cohort study in the tertiary neonatal and pediatric intensive care unit of the Zurich University Children’s Hospital. In brief, the study population comprised all newborns, infants, and children admitted with medical or surgical conditions. The study population included few premature infants born before 32 wks of gestation because these patients are usually cared for in a separate unit in the Women’s Hospital. About a third of the study patients were infants in postoperative care after cardiac surgery. In these patients, postsurgical inflammation is frequent and may occur simultaneously with nosocomial infection, rendering the differential diagnosis of clinical signs suggestive of infection particularly challenging.
We invited three senior clinicians, who had several years of experience in working alongside each other, to adjudicate ambiguous cases of suspected infection. These experts comprised the head of the division of infectious diseases and two senior consultants on the pediatric intensive care unit. Experts were blinded to each other’s judgment. The main purpose of the study was to investigate the accuracy of novel diagnostic markers. Therefore, experts were provided with all data from the patient record deemed relevant for case adjudication. This included the patient charts, laboratory and microbiological findings, and the physicians’ case records. All three experts were provided with the same set of instructions on adjudication of cases. Experts were asked to choose from five possible diagnoses: sepsis, probable sepsis with negative blood cultures, localized infection, viral infection, or absent infection. If experts believed more than one condition to be present (e.g., pneumonia in a child with a systemic inflammatory response to cardiac surgery), they were instructed to select the cause of the present episode most relevant to treatment.
Before case review by the experts, we provided a fifth-year medical student with simple criteria for identifying patients with confirmed sepsis and those without infection. These criteria were derived from published test accuracy studies (6, 8, 19, 20). All episodes that the medical student considered to be proven sepsis or absence of infection, and on which the experts arrived at the same adjudication, were deemed classifiable by junior physicians. The remaining episodes were considered to require expert judgment. Expert agreement was assessed using the Kappa statistic. Kappa is a measure of agreement beyond chance: a Kappa of 0 is consistent with chance agreement, and a Kappa of 1 indicates perfect agreement.
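The notion of agreement beyond chance can be made concrete with a minimal Python sketch. The fragment below computes Cohen's Kappa for one pair of raters. This is an assumption made for illustration: the text above does not state which Kappa variant was applied to the three experts, and the function name and example ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters who each classified the same episodes."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must classify the same non-empty set of episodes")
    n = len(ratings_a)
    # Observed proportion of episodes on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four episodes by two experts.
expert_1 = ["sepsis", "sepsis", "absent", "absent"]
expert_2 = ["sepsis", "absent", "sepsis", "absent"]
print(cohens_kappa(expert_1, expert_2))
```

In this invented example the experts agree on half of the episodes, which is exactly what chance alone would produce given their category frequencies, so Kappa is 0.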
During a 5-month period, 183 episodes of suspected infection occurred in the 19-bed multidisciplinary tertiary neonatal and pediatric intensive care unit. Antibiotics were prescribed in 167 of these episodes. Overall agreement among experts was moderate (κ = 0.54), with almost perfect agreement for episodes of proven sepsis (κ = 0.92) and agreement only slightly better than chance for episodes of probable sepsis (κ = 0.18). However, when the 48 episodes classifiable by the a priori defined criteria were removed, experts showed only fair agreement on the remaining 119 episodes (71%, κ = 0.32). In summary, only about one third of all episodes of suspected infection, in particular those with positive blood cultures, could be classified unambiguously according to prespecified criteria. The remaining two thirds of all episodes remained ambiguous even after data from the clinical course had become available.
What are the clinical implications? In the real world, clinicians exchange information and attempt to arrive at a common conclusion. In the study, we blinded each clinician to the adjudications of the others. Although the three clinicians used the five available classifications with similar frequency, the data suggest that in the absence of clear-cut criteria such as positive blood cultures, decision making may rapidly become arbitrary. These findings replicate other studies showing limited agreement in clinical judgment on potentially ambiguous outcomes (21, 22). In the absence of a gold-standard test allowing classification of all episodes, the true cause of an episode remains elusive. Therefore, it remains unknown whether the usual decision process, which resembles a Delphi process, would have arrived at valid adjudications. These data show that in the presence of uncertainty, experts differ in how they extract and weigh the relevant information. In any case, this introduces a misclassification bias, which may attenuate the observed treatment efficacy or diagnostic accuracy. It is possible that further refinement of the adjudication process (e.g., intense pretraining of the experts) might have increased the agreement. It should also be noted that the usual procedure in clinical studies is to use a Delphi-method approach to arrive at a consensus classification. Because additional qualitative and quantitative data are available when expert review takes place at the bedside, the true agreement in clinical adjudication may be higher than the level reported here. However, it is also conceivable that decisions made after a verbal report of the patient findings result in even poorer agreement than found in this investigation. The study underscores the need to supplement definitions for research in sepsis and infection in newborns and critically ill children with categories representing the degree of certainty, such as probable, possible, or unlikely.
The next section deals with the potential accuracy of physicians’ adjudication when all information available to the treatment team at the time of decision making can be used. The study reported in the next section differs from the previous one in several respects: physicians were encouraged to discuss their adjudication with peers and nurses, they were asked to use any information available to them, and they were offered the option to express their adjudication as a disease probability rather than as exclusive categories.
Physicians’ Probability Estimates
Clinical decision making on the presence or absence of infection is an excellent example of the art of medicine: to arrive at a decision when uncertainty prevails. The decision process always starts with an estimate of the probability that serious infection is impending. Clinicians will also have at least a vague idea of the disease probability above which waiting for further tests is riskier than immediate initiation of antibiotic therapy. Naturally, this threshold disease probability for initiation of treatment may vary from patient to patient and may also be influenced by guidelines. Physicians will also have a perception of a very low level of disease probability at which infection is so unlikely that no further testing is warranted. Decision analysts label these three pieces of information the pretest probability, the treatment threshold, and the testing threshold.
According to decision theory, diagnostic tests are only warranted if the pretest probability is higher than the testing threshold and lower than the treatment threshold (12). Ordering additional tests when a decision to prescribe antibiotics has already been made does not provide any useful information, nor does screening of patients in whom infection is absent. Thus, only tests ordered from perceived uncertainty have the potential to add useful information, and only tests that move the pretest probability across either of the just mentioned thresholds provide clinically useful information (11). The mathematical formula for updating a previous probability estimate with new information is provided by Bayes’ theorem. It requires enumeration of the probability estimate before ordering the test and the likelihood ratio as an expression of the test accuracy. The resulting posttest probability is then compared with the treatment or testing threshold.
Bayes’ theorem works well in textbook theory but has rarely been used in clinical practice for diagnosing infection. The failure to introduce a presumably straightforward algorithm into clinical practice arises from four problems. First, physicians are not used to enumerating their best guess about a disease probability (23). Second, physicians do not know their treatment and testing thresholds. Third, very few test accuracy studies provide multiple-level likelihood ratios that allow results from continuous laboratory variables to be entered. Fourth and finally, more than one test result often has to be considered simultaneously (e.g., an increase in immature neutrophils and an elevated C-reactive protein), and the conditional likelihood ratios required for these calculations are usually not available (24). Thus, clinicians have shown a marked reluctance to embrace the decision analyst’s approach, despite ample publicity in the literature during the past decades.
In the study reviewed in the following section, we aimed to address the first two problems: to obtain pretest probability estimates from physicians during the ward round and to delineate treatment and testing thresholds. Specifically, we undertook a multicenter, prospective cohort study to obtain daily predictions by physicians on the possible presence of serious bacterial infection. We also assessed the diagnostic accuracy of physicians’ predictions using the standard case-control design with infected case subjects and noninfected controls, treating the probability estimate as if it were a continuous laboratory variable for disease prediction.
The study included all consecutive patients who were admitted to the participating units during a 3-month period. Units were the 28-bed level III neonatal intensive care unit of the Brigham and Women’s Hospital, Boston, and the 19-bed level III pediatric intensive care unit of the Children’s Hospital in Zurich. The patient populations of both units overlapped for premature infants and critically ill term newborns. Except for excluding patients receiving extracorporeal membrane oxygenation, the patient population comprised the entire spectrum of neonatal and pediatric critical illness.
To obtain the probability estimates, three trained research fellows asked the clinicians responsible for the care of the patient (15 fellows, 12 attending intensive care physicians) to provide an estimate of the presence of serious untreated infection at every ward round. A subsequent study in the Zurich unit revealed that compliance plummeted rapidly when the research fellows did not personally request the estimates and the process was instead left to a paper-based procedure.
Of particular interest were the predictions at initiation of antibiotics. We asked physicians to quantify the probability of a serious untreated bacterial infection. The next morning, we requested physicians to provide an updated estimate, considering all information that had become available since the initiation of antibiotics, including possible early results from blood cultures. We also asked physicians to predict whether blood cultures obtained during the sepsis workup would become positive; this served as an external validity criterion (10).
In the light of the previous section of this article, one wonders how we dealt with the case adjudication. As explicated in the article reporting the original data, one of the investigators and a senior clinician of the unit not involved in the decision making at initiation of antibiotics carried out the adjudication process. They resolved ambiguities by a Delphi-type consensus method. In some instances, we could not find a senior clinician sufficiently familiar with the case and at the same time not having been involved in the decision making at sepsis workup. The possible bias introduced by this is believed to be less severe than the bias introduced by a case-control design (7). For that reason, we used several definitions of cases in the original report: 1) only those patients with positive blood cultures (proven sepsis) and 2) proven sepsis plus patients with negative blood cultures but a high degree of clinical probability (probable sepsis).
The median predicted probability at inception of antibiotic therapy was 20%. The ability of the predictions to discriminate between patients who were later deemed to have culture-proven systemic bacterial infection (cases) and episodes classified as no infection corresponded to an area under the receiver operating characteristic curve of 0.88 (95% confidence interval [CI], 0.81–0.94), with good calibration of the model (Hosmer-Lemeshow chi-square, p = .63). With a predicted probability of 25% as the cut-off, the sensitivity was 0.87 (95% CI, 0.65–0.97) and the specificity 0.83 (95% CI, 0.73–0.90). The corresponding positive likelihood ratio was 5.1 (95% CI, 3.5–7.9) and the negative likelihood ratio was 0.16 (95% CI, 0.11–0.2). Sensitivity analysis using broadened definitions for cases yielded similar results. As expected, the further information becoming available between initiation of antibiotic therapy and the next morning resulted in a slightly increased discriminative ability of the clinical judgment (area under the receiver operating characteristic curve, 0.91; 95% CI, 0.84–0.96). By contrast, the accuracy of predictions provided 24 hrs before initiation of antibiotics was indistinguishable from chance (area under the receiver operating characteristic curve, 0.49), at a time when infection was probably already incubating. This underscores the urgent need to identify new markers for infection that facilitate an early diagnosis. Regarding the external validity criterion of prediction of positive blood cultures, physicians again showed reasonable discrimination (model adjusted for age and unit: area under the receiver operating characteristic curve, 0.77; 95% CI, 0.70–0.83) with good calibration (Hosmer-Lemeshow chi-square, p = .28).
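The reported likelihood ratios follow directly from the sensitivity and specificity at the 25% cut-off. The short Python sketch below shows the standard definitions; the function name is introduced here for illustration, and the confidence intervals are not reproduced because they depend on the underlying counts.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    positive_lr = sensitivity / (1.0 - specificity)
    negative_lr = (1.0 - sensitivity) / specificity
    return positive_lr, negative_lr


# Point estimates reported for a predicted probability of 25% as the cut-off.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.87, specificity=0.83)
print(f"LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}")
```

The point estimates reproduce the positive likelihood ratio of 5.1 and the negative likelihood ratio of 0.16 given in the text.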
What are the lessons to be learned from this study? Experienced physicians perform remarkably well when asked about disease probabilities at the bedside, where all accessible information is available, including personal perceptions not documented in the patient charts (Fig. 2). Apparently, there is no single threshold for initiation of antibiotic treatment; rather, there is a nonlinear relationship between estimated disease probability and the proportion of patients receiving antibiotics. This finding underscores the already mentioned problems related to introducing Bayes’ theorem into daily routine. However, for research purposes and post hoc adjudication of ambiguous episodes, researchers may want to supplement the recorded data with probability estimates, either at randomization of patients or when sepsis workups are obtained during test accuracy studies. Whether an assessment in Likert-scale format (e.g., proven, probable, likely, unlikely, absent) or an estimate expressed as a percentage provides superior reliability remains to be elucidated.
Possible Misclassifications Resulting from Using the Gold Standard
Many attempts have been made to arrive at practical gold-standard definitions (25). In the adult literature, agreement on how to prove the presence of catheter-related bacterial infection has been achieved (26). This consensus in the literature utilizes the technique of quantitative blood cultures. In cases of suspected catheter-related sepsis, it is recommended to obtain one culture through the existing catheter and one culture from a peripheral vein. If both blood cultures are positive and the culture obtained through the catheter becomes positive much faster, then and only then is catheter-related infection proven. This logic and the resulting diagnostic categories can be summarized in a straightforward 2 × 2 table (Fig. 3, top panel). It should be noted that this figure does not display the usual 2 × 2 table employed in test accuracy studies but serves to explore the various combinations of results from the two cultures. For example, if the peripheral culture yields a positive result and the culture obtained through the catheter is negative (lower left quadrant), the definitions suggest diagnosing the case as bloodstream infection of unknown origin. These definitions rest on the assumption of near-perfect sensitivity of blood cultures or, in other words, a negligible proportion of false-negative blood cultures. Because this condition is met, the definitions work well for adult patients, in whom obtaining sufficiently large blood volumes (10 mL per bottle) to minimize the risk of false-negative blood cultures is not an issue.
Unfortunately, the same definitions carry serious misclassification risks when used in newborns or critically ill children, in whom it is no longer disputed that a sizable proportion of blood cultures fail to grow organisms despite the presence of bloodstream infection. In these patients, considerably smaller volumes of blood are collected for blood cultures. The positive blood culture no longer satisfies the criterion of an appropriate gold standard because blood cultures yield few false-positive but a considerable proportion of false-negative results (5). The middle panel of Figure 3 illustrates the effect of assuming a sensitivity of 30% for the peripheral blood culture and of 50% for the catheter culture when 1 mL of blood is obtained. Assuming that all patients have true catheter-related infection, the classification scheme would correctly identify only 15% of the infants (both cultures positive: 0.30 × 0.50). Disturbingly, about a third of patients with true catheter-related sepsis would be misclassified as having catheter colonization or catheter contamination because the peripheral culture remains falsely negative. Most clinicians will treat a contaminated catheter with a different regimen than catheter-related sepsis.
Even if 3 mL of blood are collected per bottle (Fig. 3, bottom panel), assuming an increase in the sensitivity to 70% and 80%, respectively, the correct classification only increases to about 56%. Because these examples assumed 100 patients with true catheter-related sepsis, the problem of false-positive cultures (specificity) was not considered. Therefore, in critically ill newborn or pediatric patients, the double-culture classification scheme may correctly classify only about half of the patients. Under any premise, about a quarter of all classifications will be potentially dangerous. Thus, researchers and clinicians should be particularly cautious when combining two criteria with imperfect sensitivity for case adjudication because this may aggravate the threat of possible misclassifications.
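The arithmetic behind both scenarios can be sketched in Python. This is a simplified illustration, not the model used to draw Figure 3: it assumes that the two cultures fail independently of each other, that every patient has true catheter-related sepsis, and that a double-positive result always meets the time-to-positivity criterion. The function name and the label for the double-negative cell are introduced here for illustration.

```python
def classify_true_catheter_sepsis(peripheral_sens, catheter_sens):
    """Expected classification of patients who all have true catheter-related sepsis.

    Assumes the peripheral and catheter cultures turn positive independently.
    """
    return {
        # Both cultures positive: the only correct classification.
        "catheter-related infection": peripheral_sens * catheter_sens,
        # Catheter culture positive, peripheral culture falsely negative.
        "catheter colonization or contamination": catheter_sens * (1.0 - peripheral_sens),
        # Peripheral culture positive, catheter culture falsely negative.
        "bloodstream infection of unknown origin": peripheral_sens * (1.0 - catheter_sens),
        # Both cultures falsely negative (label is an assumption of this sketch).
        "no bloodstream infection detected": (1.0 - peripheral_sens) * (1.0 - catheter_sens),
    }


scenarios = {"1 mL per bottle": (0.30, 0.50), "3 mL per bottle": (0.70, 0.80)}
for label, (peripheral, catheter) in scenarios.items():
    print(label)
    for category, share in classify_true_catheter_sepsis(peripheral, catheter).items():
        print(f"  {category}: {share:.0%}")
```

Under these assumptions the four cells amount to 15%, 35%, 15%, and 35% for 1 mL, and to 56%, 24%, 14%, and 6% for 3 mL, in the order listed in the function.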
This article reviewed some of the many sources of possible misclassification in therapeutic and surveillance studies. In the absence of a true gold standard, the final post hoc adjudication of ambiguous episodes of infection remains a daunting task for clinicians and researchers alike. Post hoc adjudication from case records is likely to fuel ongoing debates about the true status of the patient. Probability estimates incorporating all available information perform as well as the best currently available diagnostic markers. Researchers should include probability estimates in future trials at enrolment of patients. Serious consequences may arise from imperfect gold standards with limited sensitivity. In the worst case, as presented for the workup of suspected catheter-related infection, misclassification may lead to possibly dangerous therapeutic decisions. In the best case, misclassification will bias diagnostic test studies toward reporting lower test accuracy and therapeutic studies toward less treatment benefit. Power calculations for future trials must bear the effects of such misclassification biases in mind, as must clinicians treating patients. As illustrated in the example of the diagnostic workup of catheter-related sepsis in newborns, misclassification may lead to contradictory findings from the definitions and from clinical judgment. If consensus definitions are to become helpful in clinical practice, they must incorporate this divergence by providing categories describing the degree of certainty about the diagnosis as definite, probable, possible, and absent.
1. Strait RT, Kelly KJ, Kurup VP: Tumor necrosis factor-alpha, interleukin-1 beta, and interleukin-6 levels in febrile, young children with and without occult bacteremia. Pediatrics
2. Franz AR, Steinbach G, Kron M, et al: Reduction of unnecessary antibiotic therapy in newborn infants using interleukin-8 and C-reactive protein as markers of bacterial infections. Pediatrics
3. Brun-Buisson C: The epidemiology of the systemic inflammatory response. Intensive Care Med 2000;26 (Suppl 1): S64–S74
4. Benitz WE, Gould JB, Druzin ML: Risk factors for early-onset group B streptococcal sepsis: Estimation of odds ratios by critical literature review. Pediatrics
5. Schelonka RL, Chai MK, Yoder BA, et al: Volume of blood required to detect common neonatal pathogens. J Pediatr
6. Kuster H, Weiss M, Willeitner AE, et al: Interleukin-1 receptor antagonist and interleukin-6 for early diagnosis of neonatal sepsis 2 days before clinical manifestation. Lancet
7. Lijmer JG, Mol BW, Heisterkamp S, et al: Empirical evidence of design-related bias in studies of diagnostic tests. JAMA
8. Balk RA: Severe sepsis and septic shock: Definitions, epidemiology, and clinical manifestations. Crit Care Clin
9. Fischer JE, Seifarth FG, Baenziger O, et al: Hindsight judgement on ambiguous episodes of suspected infection in critically ill children: Poor consensus amongst experts? Eur J Pediatr
10. Fischer JE, Harbarth S, Agthe AG, et al: Quantifying uncertainty: Physicians’ estimates of infection in critically ill neonates and children. Clin Infect Dis
11. Pauker SG, Kopelman RI: Interpreting hoofbeats: Can Bayes help clear the haze? N Engl J Med
12. Pauker SG, Kassirer JP: The threshold approach to clinical decision making. N Engl J Med
13. Escobar GJ: The neonatal “sepsis work-up”: Personal reflections on the development of an evidence-based approach toward newborn infections in a managed care organization. Pediatrics
14. Castellanos-Ortega A, Delgado-Rodriguez M: Comparison of the performance of two general and three specific scoring systems for meningococcal septic shock in children. Crit Care Med
15. Gendrel D, Raymond J, Coste J, et al: Comparison of procalcitonin with C-reactive protein, interleukin 6 and interferon-alpha for differentiation of bacterial vs. viral infections. Pediatr Infect Dis J
16. Giamarellos-Bourboulis EJ, Mega A, Grecka P, et al: Procalcitonin: A marker to clearly differentiate systemic inflammatory response syndrome and sepsis in the critically ill patient? Intensive Care Med
17. Isaacman DJ, Shults J, Gross TK, et al: Predictors of bacteremia in febrile children 3 to 36 months of age. Pediatrics
18. Cook DJ, Walter SD, Cook RJ, et al: Incidence of and risk factors for ventilator-associated pneumonia in critically ill patients. Ann Intern Med
19. Chiesa C, Panero A, Rossi N, et al: Reliability of procalcitonin concentrations for the diagnosis in critically ill neonates. Clin Infect Dis
20. Stoll BJ, Gordon T, Korones SB, et al: Early-onset sepsis in very low birth weight neonates: A report from the National Institute of Child Health and Human Development Neonatal Research Network. J Pediatr
21. Clinical disagreement: II. How to avoid it and how to learn from one’s mistakes. Can Med Assoc J
22. Walter SD, Cook DJ, Guyatt GH, et al: Outcome assessment for clinical trials: How many adjudicators do we need? Canadian Lung Oncology Group. Control Clin Trials
23. Phelps MA, Levitt MA: Pretest probability estimates: A pitfall to the clinical utility of evidence-based medicine? Acad Emerg Med
24. Fischer JE, Bachmann LM, Jaeschke R: A readers’ guide to the interpretation of diagnostic test properties: Clinical example of sepsis. Intensive Care Med
25. Goldstein B, Giroir B, Randolph A: International pediatric sepsis consensus conference: Definitions for sepsis and organ dysfunction in pediatrics. Pediatr Crit Care Med
26. Raad I, Hanna HA, Alakech B, et al: Differential time to positivity: A useful method for diagnosing catheter-related bloodstream infections. Ann Intern Med