From the Department of Paediatrics and Child Health, Ethio-Swedish Children's Hospital, Addis Ababa, Ethiopia; Medical Research Council Laboratories, Banjul, The Gambia; Papua New Guinea Institute of Medical Research, Goroka, Papua New Guinea; Research Institute of Tropical Medicine, Alabang, The Philippines; the Division of Biostatistics and Epidemiology, Department of Health Evaluation Sciences, University of Virginia, Charlottesville, VA; and the Department of Pediatrics, University of North Carolina, Chapel Hill, NC.
* The WHO Young Infants Study Group: Writing committee: P. Margolis, E. Kim Mulholland, F. Harrell, S. Gove; study coordination: S. Gove (scientific coordinator), P. Margolis, F. McCaul, S. Parker; data management: P. Byass; statistical analysis: F. Harrell, K. Mason, J. Carlin; Gambia: clinicians, K. Mulholland [site principal investigator (PI)], O. Ogunlesi, M. Weber, M. Manary, A. Palmer; laboratory, R. Adegbola, H. Whittle, O. Secka, B. Sam, D. Hazlett, M. Aidoo; data management, J. Bangali; director, B. Greenwood; Ethiopia: clinicians, L. Muhe (PI), M. Tilahun, S. Lulseged, S. Kebede; laboratory, A. Yohanes (deceased), B. Belete, S. Ringertz; radiology, T. Desta; data management, K. Woldeyesus; director, N. Tafari; Papua New Guinea: clinicians and nurses, D. Lehmann (PI), G. Saleu, A. Rongap, M. Kakazo, P. Namuigi, S. Lupiwa, R. Sehuko; laboratory, A. Clegg, R. Sanders, A. Michael, T. Lupiwa, M. Omena, M. Mens, B. Marjen, P. Wai'in, M. Sungu; data management, D. Lewis; director, M. Alpers; Philippines: clinicians, S. Gatchalian (PI), B. Quiambao, A. M. Moreles, L. Abraham; laboratory, L. Sombrero, F. Palladin, V. Sariano, A. M. Obach; data management, E. Sunico, T. Cedulla; study advisors, H. Eichenwald, C. Broome, M.Gratten, P. Margolis, R. Facklam, T. Nolan; reference laboratory support, J. Hendrichsen, P. H. Makela, M. Grandien, J. Schachter, L. Moore, G. Cassell, L. Duffy, R. Facklam, F. Tenover, B. Metchock; Radiology Working Group, H. Tschäppeler, M. Hendry, A. Lamont, P. Palmer.
Address for correspondence: Dr. K. Mulholland, Vaccines and Biologicals, World Health Organization, 1211 Geneva 27, Switzerland. Fax 41-22-791-4860; E-mail firstname.lastname@example.org.
Address for reprints: The Director, Department of Child and Adolescent Health and Development, World Health Organization, 1211 Geneva 27, Switzerland.
It is estimated that 5 million deaths occur in the neonatal period annually. Of these, ∼97% are in developing countries and >40% are believed to be caused by infection.1 In developing countries much of the effort to control neonatal mortality has concentrated on reducing the risk of perinatal infection by improving the care of pregnant mothers. Examples include referral of high risk pregnancies, maternal immunization against tetanus and improved care of infants after delivery, with emphasis on umbilical cord care and the early introduction of breast-feeding. The treatment of infants with established infection remains unsatisfactory. Many cases of neonatal infection never reach treatment facilities, and the case-fatality rate for those that do ranges from 13 to 69%.2
One approach that has been used in dealing with other pediatric problems such as diarrhea and pneumonia has been to train clinicians, nurses and local health workers (e.g. traditional birth attendants or community health workers) to recognize clinical signs of illness and to refer children for treatment. In developing countries treatment decisions are based largely on the history and clinical examination because of the limited availability of laboratory facilities. More accurate methods of diagnosis could help improve decisions about the need for additional diagnostic tests or therapy in developing country settings.
Most previous studies of the accuracy of the clinical examination in detecting serious bacterial illness in infants younger than 3 months of age have been conducted in developed countries. Studies in the US, United Kingdom and Australia3-5 have given conflicting results about the accuracy of observational scales in detecting serious bacterial illness. The small numbers of patients with serious bacterial illness who can be enrolled in developed countries have limited previous studies. This has made it difficult to evaluate the importance of large numbers of clinical variables and to use sophisticated statistical approaches to evaluate the relationship between clinical findings and diagnoses. In addition previous studies of serious bacterial illness have focused simply on the presence or absence of infection, an outcome that is less clinically relevant than one that also enables clinicians to assess the severity of illness. Finally developed country studies have placed greater reliance on the value of laboratory examinations and less emphasis on the clinical examination.3, 6 In developing countries laboratory facilities to perform tests such as the blood count and chest radiograph are often unavailable, and clinical decisions must be made without them.
The purpose of this study was to develop a clinical prediction instrument that would identify infants at risk of serious bacterial illness on the basis of clinical examination findings. We hoped that the study would lead to a relatively simple method that could be used by nonphysicians and physicians to detect children at increased risk of adverse outcomes of serious bacterial illness.
The design of the study has been described in detail previously.7 Briefly it was conducted at hospitals or outpatient clinics in Ethiopia, The Gambia, Papua New Guinea and The Philippines that see large numbers of sick infants. At each site infants younger than 91 days of age seen consecutively for acute care with chief complaints indicating possible infection were eligible to be included in the study. Entry criteria were intended to include infants with a wide spectrum of illness severity and to ensure that virtually all infants with serious infection would be included. Exclusion criteria are described elsewhere in this supplement.7
All infants underwent a standardized history and physical examination to assess the presence or absence and the degree of severity of signs and symptoms believed to be associated with bacterial disease (Table 1).8-10 Candidate predictors considered included demographic variables, historical variables, vital signs and physical examination findings. A pediatrician conducted the examination at three of the sites. Pediatric nurse practitioners performed the examination at the site in Papua New Guinea.
Infants with prespecified symptoms associated with possible bacterial infection underwent a laboratory evaluation that included blood culture, white blood cell count and chest radiograph. Specific criteria were also used to identify infants for lumbar puncture. All infants enrolled in the study underwent pulse oximetry. Urine cultures were not systematically obtained in all sites because it was not feasible to collect suprapubic specimens or to catheterize infants in all the settings. A random sample of patients not meeting criteria for laboratory testing underwent chest radiograph, blood culture and blood count to assess the frequency of bacterial infection among infants not meeting criteria for laboratory evaluation. Decisions regarding treatment were made on a clinical basis.
Bacteremia was defined as the growth of a known pathogen in cultures of blood or cerebrospinal fluid (CSF). Meningitis was defined as a positive CSF culture or >10 polymorphonuclear cells in a nonbloody CSF specimen. Patients with grossly blood-stained lumbar punctures were classified as not having meningitis unless the CSF culture was positive. The criteria used to classify organisms as contaminants are described elsewhere.7
Oxygen saturation was measured after the clinical examination with a Nellcor N-200 (Nellcor Inc., Hayward, CA) pulse oximeter as previously described. Oxygen saturation measurements from sites above sea level were adjusted to correspond to readings at sea level. Pneumonia was diagnosed on the basis of chest radiographs interpreted without clinical information by a panel of three radiologists.7, 11 Study infants were defined as having pneumonia if all the radiologists who read the radiograph considered it to show probable or definite pneumonia. Infants with uninterpretable films or with missing radiographic information were classified as having a normal radiograph.
Study outcome measure. The primary study outcome measure was an ordinal scale that summarized the presence or absence of disease, as well as its severity. This measure was selected because it would allow infants to be included in the clinical prediction rule who might have negative cultures (e.g. those with low oxygen saturation and positive radiographs), but who still might benefit from treatment because they were at increased risk of bacterial infection. Data about oxygen saturation, chest radiography, blood culture, CSF culture and cell count and mortality were combined to produce a measure that grouped diagnoses in order of increasing severity. Diagnoses were ranked in a hierarchic fashion based on their association with the most severe outcome, death. Diagnoses not strongly related to death were examined with respect to their association with the presence of a positive blood culture or CSF result. Outcome variables with similar associations with death or bacterial infection were included in the same category. This approach produced a ranking with four levels of disease severity: (1) no abnormality; (2) mild hypoxemia (90% ≤ SaO2 < 95%) or radiologic pneumonia; (3) severe hypoxemia (SaO2 < 90%) or bacteremia or meningitis; and (4) death. As will be discussed later the main study analysis considered only a three level ordinal scale that did not include death as an explicit outcome category. From a statistical standpoint an ordinal ranking assumes that the variables can be ordered in terms of clinical importance or disease severity without assuming interval spacing between the variables. Such a procedure also makes more efficient use of data for statistical analyses. Children who died >7 days after discharge from the hospital were not counted as deaths for the purposes of this study.
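The hierarchy described above can be illustrated with a minimal sketch. The function name, argument names and the treatment of a missing saturation reading are assumptions made for illustration and are not taken from the study protocol; only the thresholds and the ordering of the levels come from the text.

```python
from typing import Optional


def ordinal_outcome(
    sao2: Optional[float],
    pneumonia: bool,
    bacteremia: bool,
    meningitis: bool,
    died: bool,
    include_death: bool = True,
) -> int:
    """Return the severity level (1-4) for one infant.

    sao2 is the sea-level-adjusted oxygen saturation in percent, or None
    if no reading is available (assumption: a missing reading is treated
    as not hypoxemic). With include_death=False the infant is placed on
    the three level scale according to the other diagnoses only.
    """
    if include_death and died:
        return 4  # death
    if bacteremia or meningitis or (sao2 is not None and sao2 < 90):
        return 3  # severe hypoxemia, bacteremia or meningitis
    if pneumonia or (sao2 is not None and 90 <= sao2 < 95):
        return 2  # mild hypoxemia or radiologic pneumonia
    return 1  # no abnormality
```

Calling the function with `include_death=False` gives the three level scale used in the main analysis, in which an infant who died is placed according to the other diagnoses.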
Deriving the clinical prediction rule. The goal of the statistical analysis was to select a limited set of clinical predictors that could reasonably be used in clinical settings and that could predict infants at risk of serious bacterial illness or death. The analysis was designed to reduce the total number of variables in the model by grouping related subsets of clinical findings and summarizing each group of findings with a single score. We sought to avoid selecting variables to include in the model on the basis of their individual statistical associations with specific diagnoses because the multiple tests involved in this approach would exaggerate the strength of the relationship between clinical findings and the diagnoses. The statistical methods of the analysis are described in detail elsewhere.12
The derivation of the clinical prediction rule took place in three steps.
Step 1. Clustering variables. Statistical variable clustering methods were used to place the 51 individual clinical signs and symptoms into groups or "clusters" so that the correlation of clinical findings within clusters was high and the between cluster correlation was low. The clustering procedure took place before the evaluation of the association between the clusters and the diagnoses. After the clustering procedure clinicians on the study team were asked to identify and rank the relative severity of the findings within the clusters. After this exercise a check was made to identify major disagreements between physicians' scoring and the outcome patterns. The only cluster that was rescored after examining patterns of the association with the ordinal scale was the auscultation cluster. Here a decision was made to use as the only variable the presence or absence of crepitations. Clinicians' ranking and scoring decisions were carried out before any analysis relating individual clinical signs to the diagnoses. We selected a scoring method in which the cluster score was the score of the most severely abnormal finding in the cluster. Measurements of vital signs (age, temperature, respiratory rate, weight) were not included in the clusters. The weight-for-age Z score was used as the measure of infant size. We made this decision because other anthropometric variables of size in this age group are highly correlated with weight-for-age. Once each cluster of clinical signs was scored, we entered the cluster scores into a logistic regression model. The clustering of signs thus resulted in significant reduction in the number of variables in the model so that fewer regression coefficients needed estimation.
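The scoring rule for a cluster (the score of the most severely abnormal finding present) can be sketched as follows. The finding names and severity ranks in the usage example are hypothetical placeholders, not the rankings assigned by the study clinicians.

```python
from typing import Dict


def cluster_score(findings: Dict[str, bool], severity_ranks: Dict[str, int]) -> int:
    """Score a cluster as the rank of its most severe finding that is present.

    findings maps each finding in the cluster to present/absent;
    severity_ranks maps each finding to a clinician-assigned rank
    (higher = more severe). A cluster with no finding present scores 0.
    """
    present_ranks = (
        severity_ranks[name] for name, present in findings.items() if present
    )
    return max(present_ranks, default=0)


# Hypothetical two-finding cluster, for illustration only.
ranks = {"mild_sign": 1, "severe_sign": 2}
score = cluster_score({"mild_sign": True, "severe_sign": True}, ranks)
```

Taking the maximum means that adding milder findings never raises a score already set by a more severe one, so each cluster contributes a single ordered variable to the regression model.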
Step 2. Estimating the overall accuracy of the clinical findings. The association of the clusters of clinical variables, vital signs measurements and age with the diagnostic outcome was assessed by fitting a logistic regression model for an ordinal response variable (continuation ratio model) using penalized maximum likelihood estimation. Penalization is a method of discounting fitted regression coefficients to avoid overfitting. All two way interactions between predictors and the category of the ordinal response being predicted were included in the model so that predictors could be included that might be associated with one level of the scale but not with another. For example the presence of crepitations might be more highly associated with chest radiographic evidence of pneumonia but less strongly associated with a positive blood culture. Continuous variables such as temperature and respiratory rate were entered in the model through the use of restricted cubic splines (piecewise cubic polynomials) to not restrict the model to the assumption of a linear or even monotonic association of these variables with the ordinal scale.12 Bootstrap methods were used to assess the reproducibility of the model after penalizing for overfitting.
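One common way to write a forward continuation ratio model with penalization is sketched below. The notation is an assumption made for illustration: Y is coded 0, 1, 2 for the three outcome levels, θ_j are intercepts, β are the predictor effects, γ_j are the predictor-by-category interaction effects, λ is the penalty factor and P is a penalty matrix. The exact parameterization used in the study is given in the cited methods paper.12

```latex
% Conditional probability of stopping at level j, given that level j was reached
\Pr(Y = j \mid Y \ge j, X)
  = \frac{1}{1 + \exp\{-(\theta_j + X\beta + X\gamma_j)\}},
  \qquad j = 0, 1, \quad \gamma_0 = 0

% Penalized log likelihood: the ordinary log likelihood minus a quadratic penalty
\ell_{\mathrm{pen}}(\theta, \beta, \gamma)
  = \ell(\theta, \beta, \gamma)
  - \tfrac{1}{2}\,\lambda\,(\beta, \gamma)^{\top} P\,(\beta, \gamma)
```

Setting every γ_j to zero gives a model in which each predictor has the same effect at every level of the scale. The interaction terms relax that constraint, and the penalty shrinks the extra coefficients toward zero to limit overfitting.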
For the analyses reported here the main emphasis was on developing a clinical prediction instrument that assigned infants to a diagnostic category on the basis of chest radiography, blood and CSF cultures and oxygen saturation without regard to mortality. We were interested in this model because many of the deaths occurring in the context of the study took place within hours of admission. This model might identify children more likely to benefit from antibiotic therapy. Infants who died were assigned to lower level outcomes according to the presence or absence of the other diagnoses. The net effect of this was that about two-thirds of deaths were classified in another abnormal diagnostic category. The discriminating ability of the three level model was compared with that of the four level model using ordinary binary logistic regression methods.
Step 3. Simplifying the model for clinical use. Simplification of the full model was accomplished by treating it as a "gold standard" against which models with fewer variables could be compared. We used ordinary least squares regression and step-down techniques to predict the predicted log odds of the full model and thereby eliminated variables that explained little of the variation in the full model.12 The objective of the selection procedure was to maintain the squared linear correlation (R2) between the full model predicted log odds and the reduced model predicted log odds above 0.95 with a minimal decrease in the measure of discrimination (Somers' Dxy) against the outcome. Variables (clusters or individual variables) were deleted sequentially so that the first variable deleted was the one that resulted in the smallest drop in the squared correlation between the reduced model predictions and the "gold standard" model prediction. Actual outcomes were not used as the dependent variable in this model simplification procedure because conducting stepwise variable selection in this way has a low chance of selecting the "correct" model, owing to problems of multiple comparisons and arbitrary levels for stopping.14
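The step-down approximation can be sketched as a backward elimination that regresses the full model's predicted log odds on progressively fewer predictors. This is a simplified illustration using NumPy, not the study's software. The function names are hypothetical, predictors are treated as single numeric columns, and the check on Somers' Dxy is omitted.

```python
import numpy as np


def r2_against_full(X: np.ndarray, full_lp: np.ndarray, cols: list) -> float:
    """R^2 of an OLS fit of the full-model log odds on the chosen columns."""
    A = np.column_stack([np.ones(len(full_lp)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, full_lp, rcond=None)
    resid = full_lp - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((full_lp - full_lp.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def step_down(X: np.ndarray, full_lp: np.ndarray, min_r2: float = 0.95) -> list:
    """Drop predictors one at a time while R^2 stays at or above min_r2.

    At each step the predictor whose removal costs the least R^2 is
    deleted. The outcome itself is never used, only the full model's
    predicted log odds (full_lp).
    """
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        candidates = [
            (r2_against_full(X, full_lp, [c for c in kept if c != d]), d)
            for d in kept
        ]
        best_r2, drop = max(candidates)
        if best_r2 < min_r2:
            break  # any further deletion would fall below the threshold
        kept.remove(drop)
    return kept
```

Because the target is the full model's prediction rather than the observed outcome, the reduced model inherits the penalization already applied to the full model instead of being refit to noisy outcome data.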
Assessment of model accuracy. We summarized the diagnostic accuracy of the model as its ability to discriminate between infants without any diagnostic abnormality and infants in the categories of adverse events. We report the area under the receiver operating characteristic (ROC) curve for each level in the ordinal scale. We used the overall likelihood ratio chi square to summarize the amount of information in the models. We have not presented odds ratios for individual variables because of model complexity.
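For readers less familiar with these measures, the ROC area equals the probability that a randomly chosen infant with an abnormality receives a higher predicted risk than a randomly chosen infant without one, and Somers' Dxy is a rescaling of that area. The sketch below computes both directly from that definition. The function names are illustrative and the scores in the usage example are made-up values, not study data.

```python
from typing import Sequence


def roc_area(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """ROC area as the proportion of concordant pairs (ties count one half)."""
    concordant = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(scores_pos) * len(scores_neg))


def somers_dxy(auc: float) -> float:
    """Somers' Dxy rank correlation corresponding to a given ROC area."""
    return 2.0 * (auc - 0.5)


# Made-up predicted risks for three infants with and three without an abnormality.
auc = roc_area([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

An ROC area of 0.5 (no discrimination) corresponds to a Dxy of 0, and an area of 1.0 corresponds to a Dxy of 1.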
Bootstrap methods were used to validate the ability of the model to separate infants with no abnormalities from those in the moderate or severe category.14 The bootstrap procedure also adjusts the observed predictive accuracy for overfitting caused by evaluating the performance of the model in the same set of data used to develop it.
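The adjustment for overfitting can be sketched as an optimism-corrected bootstrap. This is a generic illustration under stated assumptions: `fit` and `score` are hypothetical caller-supplied functions (for example, a model-fitting routine and an ROC-area calculation), and the number of resamples is arbitrary. The study's own procedure is described in the cited reference.14

```python
import random
from typing import Any, Callable, Sequence


def optimism_corrected(
    data: Sequence[Any],
    fit: Callable[[Sequence[Any]], Any],
    score: Callable[[Any, Sequence[Any]], float],
    n_boot: int = 200,
    seed: int = 0,
) -> float:
    """Return the apparent accuracy minus the average bootstrap optimism.

    For each resample, a model is refit and scored both on the resample
    (where it looks too good) and on the original data. The mean
    difference estimates how much the apparent accuracy is inflated.
    """
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    total_optimism = 0.0
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        model = fit(sample)
        total_optimism += score(model, sample) - score(model, data)
    return apparent - total_optimism / n_boot
```

Every modeling step that used the outcome should be repeated inside `fit`, otherwise the optimism estimate will be too small.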
A total of 8418 infants less than 91 days of age were triaged in 4 sites, of whom 4552 satisfied the criteria for enrollment in the study and underwent a full history, physical examination and pulse oximetry. Table 2 lists the demographic and the clinical characteristics of the study infants at presentation. Similar proportions of infants were enrolled in Ethiopia, The Gambia and The Philippines. The largest share of infants (47.6%) was enrolled in Papua New Guinea. As described previously infants in Papua New Guinea were substantially less likely to be hospitalized or to have died.7 Table 2 also shows the relationship between the presence of clinical findings and diagnostic abnormalities. For example the presence of a history of fast breathing increased the probability of any abnormality (mild hypoxemia or pneumonia; severe hypoxemia, bacteremia or meningitis) from 6.9% to 19.2%. It increased the probability of a more serious abnormality (severe hypoxemia, bacteremia or meningitis) from 4% to 8%.
Of the 4552 infants in the study sample 2398 met the prespecified criteria for laboratory evaluation and underwent blood culture and chest radiography; 507 met the criteria for a lumbar puncture. The 67 infants with gross blood in the CSF specimen were excluded from the analysis of abnormal CSF counts. The remaining 2154 infants had no clinical findings requiring further evaluation.
The 4552 infants were classified into the following diagnostic categories: 3467 (76.2%) had no abnormality; 450 (9.9%) had pneumonia or mild hypoxemia; 386 (8.5%) had bacteremia, meningitis or severe hypoxemia; and 249 (5.5%) died. Almost all deaths (86%) took place in the hospital. Nineteen (8%) children were taken home moribund and 17 (7%) others were known to have died as outpatients. Of 1979 children in the group not meeting criteria for laboratory evaluation, 15 (0.8%) returned for a second evaluation with pneumonia, mild hypoxemia, bacteremia, meningitis or severe hypoxemia. None of these infants presented initially with such diagnoses.
To estimate the frequency of serious infection in infants not eligible for laboratory evaluation, 175 infants (8%) were selected randomly from those not eligible for complete laboratory evaluation to have blood and urine cultures obtained. The number of infants selected was less than planned because most parents in this group refused to permit blood drawing. Among the infants in whom laboratory tests were obtained, 4 (2%) had positive blood cultures. All the positive cultures occurred in infants from The Gambia who had skin infections with scabies. A fifth infant had a negative initial evaluation but returned the day after enrollment with meningitis. An additional 6 (3%) infants had mildly decreased oxygen saturation, one had a focal infiltrate on chest radiography, and the other two infants had diffuse or nonspecific chest radiographic findings. None of the infants in the sampled group died. We did not correct for the possible verification bias produced by the study design because the effect would be minimal and because the random sample appeared to have overestimated the frequency of adverse events in the group of children not meeting criteria for laboratory evaluation.
Development of the clinical prediction rule. For the analyses reported here the main emphasis was on the development of a clinical prediction instrument in which infants were assigned to a diagnostic category without considering whether they died. The model had three categories rather than four, because infants who died were assigned to lower levels of severity according to the presence or absence of other diagnoses. Infants who died were reassigned as follows: 84 (33.7%) had no evidence of any other serious abnormality or were missing data about laboratory procedures and were assigned to the no abnormality category; 40 (16%) had pneumonia or mild hypoxemia; and 125 (50.2%) had bacteremia, meningitis or severe hypoxemia. For infants reclassified into the no abnormality or pneumonia/mild hypoxemia category, virtually all were suspected of having a serious bacterial illness by the clinicians caring for the patients. In approximately one-third of these cases, it was not possible to obtain a chest radiograph or a blood culture. Thus some infants who would have met the criteria to be included in a higher level may have been misidentified. The overall effect of this reclassification is therefore likely to lead to a conservative estimate of the accuracy of the model.
The variable clustering procedure led to an aggregation of the 51 clinical history and examination findings into 14 clusters of variables.12 These clusters were included in a model along with the infants' age, respiratory rate, temperature and weight-for-age Z score to predict the diagnostic categories. The model for classifying infants into three levels ignoring death (no abnormality; pneumonia/mild hypoxemia; or bacteremia, meningitis, severe hypoxemia) had a bootstrap-validated ROC area of 0.836 for discriminating infants with no abnormality from those with any abnormality and an ROC area of 0.872 for discriminating infants with no abnormality from those with more serious ones. To illustrate the accuracy of the model we plotted the number of infants with any abnormality relative to the number of infants without an abnormality at each probability of disease (Fig. 1). Figure 1 also shows the likelihood ratios at a number of probability levels.
The three level model was compared with one in which death was added as a fourth diagnostic category (Table 3). The accuracy of the four level model was slightly better in detecting infants at different degrees of illness severity, but the difference between the models was small. A model with only vital signs, respiratory rate and weight explained ∼60% of the total variation in outcomes. This model had an ROC area of 0.773 for predicting any diagnostic abnormality. We also compared the accuracy of the models in the four study sites and found <5% difference in ROC areas.
Simplification of the model. Although the full models provided good discriminating ability, they would be difficult to use clinically because they would require clinicians to assess a large number of clinical predictors and to compute cluster scores. We simplified the three diagnostic category model to facilitate its use in clinical settings by identifying a smaller group of findings that could be used as a guideline for treatment by clinicians in small hospitals and by peripheral health workers in dispensaries and outpatient facilities.13
Ordinary least squares regression was used to eliminate progressively those groups of clinical findings that contributed little to variations in predictions given by the full (gold standard) model. This procedure was implemented for groups of findings as well as for individual clinical findings. Before the elimination procedure we excluded signs that were impractical to measure accurately in young infants (e.g. heart rate), those found to be unreliable (respiratory distress, smiling and attentiveness) and those with low prevalence (stridor and hypotonia). The procedure was conducted first for the model predicting any abnormality and then for the model predicting a more serious abnormality.
This approach led to two simplified clinical prediction models with similar abilities to predict the presence of any abnormality. A model that used clusters of clinical signs included three vital signs (temperature, respiratory rate, weight-for-age), the infant's age and five clusters of clinical findings: auscultation; respiratory effort; evidence of neurologic infection; inability to feed; and lethargy. This model had an R2 of 0.973 in representing the predictions of the full model and an ROC area of 0.833 in predicting any abnormality.
A second model with individual clinical findings was created by selecting a reduced model from a full model containing all the individual clinical findings. In this model binary (dummy) coding was used for each clinical finding (i.e. each finding was rated as present or absent). The approach resulted in a model with seven specific clinical findings (inability to suck, crepitations, cyanosis, history of convulsions, definite lower chest wall indrawing, failure to arouse with minimal stimulation, history of change in activity), as well as respiratory rate, age, temperature and weight-for-age. The model had an ROC area of 0.832 for discriminating children with any abnormality in the ordinal scale and an approximation accuracy (R2) of 0.954 against the best model. This process was repeated to create a simplified model to predict more serious abnormalities (see Appendix). In its simplest form the model predicting a serious abnormality had an ROC area of 0.866.
Use of the model in clinical settings. We developed a tabular version of the simplest models for use in clinical settings (Tables 4 and 5). To create the tabular presentation individual clinical findings were scaled so that the value (or "risk points") of each finding included was proportional to its log odds ratio in the model. For continuous predictors we solved for ranges that corresponded with whole numbered points. Because respiratory rate was associated with a different probability of an abnormality depending on the infant's age, the table for respiratory rate has multiple columns. Because all of the clinical findings are scored as yes/no items, each of the seven clinical findings receives the number of points indicated in the table if the finding is present and no points if the finding is absent. For example a 4-day-old infant with a temperature of 37.9°C, respiratory rate of 76/min, weight-for-age Z score of 2 and crepitations would receive a total of 17 points and a probability of 34% of having at least mild hypoxemia or pneumonia and a probability of 12% of having severe hypoxemia, bacteremia or meningitis.
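The conversion between regression coefficients, risk points and probabilities can be sketched as follows. All numbers in the usage example (the coefficients and the scaling factor) are hypothetical placeholders chosen for illustration. They are not the values behind Tables 4 and 5 and cannot be used to reproduce the worked example in the text.

```python
import math
from typing import Dict


def to_points(log_odds_ratios: Dict[str, float], points_per_unit: float) -> Dict[str, int]:
    """Scale each finding's log odds ratio to a whole number of risk points."""
    return {name: round(b * points_per_unit) for name, b in log_odds_ratios.items()}


def probability(total_points: float, intercept: float, points_per_unit: float) -> float:
    """Convert a point total back to a probability through the logistic function."""
    log_odds = intercept + total_points / points_per_unit
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for two yes/no findings, 4 points per unit of log odds.
points = to_points({"finding_a": 0.5, "finding_b": 1.0}, points_per_unit=4)
```

Because points are proportional to log odds, adding points for several findings is equivalent to adding their contributions on the log odds scale, which is why a single lookup table can map a point total to a probability.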
We also compared the accuracy of the 12-item simplified 3 diagnostic category model with that of the 12 clinical signs in the WHO guidelines for the management of the sick young infant. These signs include: temperature >37.5°C or <36.5°C, baby feels cold to touch, presence of convulsions, fast breathing (>60 breaths/min), severe chest indrawing, nasal flaring, grunting, presence of a bulging fontanel, pus draining from the ear, red umbilical stump, pustules, lethargy/unconscious and less than normal movement. In detecting infants younger than 60 days of age with any abnormality, the WHO sick child criteria had an ROC area of 0.656 compared with the 3 level model ROC area of 0.838. Since the study was completed the WHO algorithm for the assessment of infants under 2 months of age has been modified to include inability to attach well to the breast, inability to suckle and inability to feed as indicators of possible serious infection. These certainly improve the predictive power of the algorithm.
This study indicates that clinical examination is of value in identifying infants at risk of serious bacterial illness and in estimating the severity of disease. Some of the most valuable information comes from the child's age, vital signs and size. A limited number of clinical findings provide additional diagnostic information. The clinical findings can be combined into a predictive instrument that is likely to be of use clinically. Despite the simplicity of the instrument, it preserves the powerful information attained by using vital signs on a continuous basis.
The World Health Organization has promoted the use of simple clinical signs in the management of sick infants. However, there have been few studies of the accuracy of the management guidelines. The clinical findings included in the current sick young infant management algorithm were quite similar to those identified in this empirical study. The greater accuracy of the findings identified here may be attributable to two characteristics of the instrument: the more detailed use of an infant's vital signs; and more accurate cutoff points for determining whether a finding is abnormal. The clinical prediction instrument uses temperature, anthropometric measurements and respiratory rate as continuous variables and thus avoids the loss of information that occurs when variables are categorized.15 Like previous studies of the value of clinical findings in detecting streptococcal pharyngitis16 and dehydration,17 the results presented here suggest that empirical evaluation of the WHO clinical management algorithms can lead to more accurate information about which clinical findings to use and how to combine them.
Studies from developed countries differ in their assessment of the value of the clinical examination in detecting serious bacterial illness in infants. A study by Baker3 in a US emergency room found that an existing observation scale to detect serious bacterial illness9 was of limited diagnostic value in identifying infants younger than 2 months of age with bacterial illness. In contrast a study by Hewson and Gollan4 concluded that the presence of any of five clinical findings (drowsiness, chest indrawing, generalized pallor, a history of feeding <50% of normal, decreased activity) had a sensitivity of 91% and a specificity of 72% in predicting "serious" illness or the need for hospital admission in children younger than 6 months of age. Although the present study is not directly comparable with the Hewson study, most of the same variables, except vital signs which were not considered by Hewson, were identified as important.
Our findings contrast with previous studies in developed countries for several reasons. First, we enrolled a large number of infants, many of whom had severe diagnostic abnormalities. The large number of abnormalities permitted the use of more sophisticated statistical procedures than has previously been possible. This allowed us to explore more completely the relationship of the clinical findings to diagnoses. The approach to model development we used also permitted us to examine nonlinear relationships of clinical findings to such diagnoses. Forcing variables to fit a linear model form when the true relationship is nonlinear results in loss of information and discriminating ability. Indeed variables with complex relationships to the diagnoses, such as vital signs, provided considerable information and may be responsible for the greater discriminating ability of the clinical prediction rule.
Our study suffered from a number of limitations. Almost one-half of the data were from Papua New Guinea, the site with the lowest proportion of severely ill children. We decided to combine data from this site with other data because there was no a priori evidence that the clinical characteristics observed there would have a different relationship to serious bacterial illness than those observed elsewhere. Another limitation was the inconsistent collection of specimens for urine culture. Failure to identify urinary tract infections consistently may have reduced the number of infants in whom bacterial illness was detected. This would have reduced the discriminating ability of the clinical findings. Finally the model has not yet been validated prospectively in a new sample of patients. Although bootstrap validation methods indicate that the rule is likely to perform well in other settings, prospective validation is needed in the future.
Although we designed the study to assess the importance of "verification" bias18 by conducting blood cultures and chest radiographs on infants who did not meet the criteria for evaluation, there were fewer infants than anticipated on whom these tests were performed. The proportion of infants with positive laboratory results in this sample indicates that they may have had more significant disease than the group from which they were sampled as the mothers of sicker infants were probably more likely to accept investigation. Future projects will need to develop approaches to enroll or follow patients so that the importance of verification bias can be assessed. Taken together the approach to model development that we pursued and the decisions we made regarding variable definition are likely to have underestimated the diagnostic accuracy of the model.
This study provides strong evidence of the diagnostic value of clinical findings that can be observed on examination. The approach we developed provides an empirical basis for developing management recommendations and thus extends the methods that the WHO currently uses to develop diagnostic algorithms. In addition the study suggests that an approach that permits the classification of diagnoses according to disease severity rather than into dichotomous groups is feasible and may provide a more informative approach for future studies. Finally representing the model so that the probability of a diagnosis can be computed simply from a chart will allow clinicians to account for multiple clinical findings at once and thereby estimate the probability of each of several diagnoses for any given patient.
In a developing country setting where laboratory testing is not readily available, the clinical prediction model could improve the ability of clinicians to identify sick infants. Despite the increase in accuracy beyond the WHO management criteria, the possibility exists that children with serious bacterial illness will be missed. In developing country settings the possibility of false reassurance could be reduced by careful follow-up of infants classified as low risk. In developed countries information from the clinical prediction instrument could be combined with data from the clinical laboratory to enhance clinicians' diagnostic accuracy. Validation of the model in both high and low prevalence settings, as well as studies to determine how to maximize the utility and acceptability of the prediction instrument in different clinical settings and by different levels of health workers, should be conducted in the future.
We thank the many nurses, doctors and field workers at each of the study sites who contributed so much to this study, but who are too numerous to mention individually. The support of WHO and the contributions of many WHO staff were essential to the completion of this project. In particular J. Tulloch, J. Bryce and C. John made significant contributions to this project. Finally we acknowledge the patience and understanding of the more than 4000 families of sick infants enrolled in the 4 study sites who tolerated the intrusion of this study at a time when the families were under considerable stress.
APPENDIX: DEVELOPMENT OF A SIMPLE METHOD TO ESTIMATE THE PROBABILITY OF SEVERE ABNORMALITIES
In the analysis the three levels of the ordinal outcome scale were defined by a variable Y, which was given the values of 0, 1 and 2 for the least severe, intermediate and most severe levels, respectively. Previously we described how Prob[Y > 0] (probability of any abnormality) could be estimated. We describe here a method to more easily estimate the probability of the severe diagnostic abnormality.
In the continuation ratio logistic model for an ordinal outcome, Prob[Y = 2] is the product of two probabilities derived from two submodels, which predict Prob[Y > 0] and Prob[Y = 1 given Y ≥ 1]. The second of these is the same as 1 − Prob[Y = 2 given Y ≥ 1], or 1 minus the probability of a severe outcome given a moderate or severe outcome, so that Prob[Y = 2] = Prob[Y > 0] × (1 − Prob[Y = 1 given Y ≥ 1]). Our goal was to provide a simple way to estimate Prob[Y = 2] without needing to first obtain two probability estimates and then multiply them. First we transformed this Prob[Y = 2] using the logit transformation (log[p/(1 − p)]). Next, to compute the upper limit of how well Prob[Y = 2] could be approximated using a single model, we fitted a least squares regression model based on all the variables (using cluster scores where appropriate) to the logit of this best estimate. This resulted in an accuracy of approximation given by R² = 0.994 [R² between the predicted logit from a single regression and the optimum logit of Prob[Y = 2]; R² < 1.0 because the logit of the product of two probabilities cannot be written exactly as a weighted sum of two logits]. Next we replaced all cluster scores with individual clinical signs, which resulted in an approximation accuracy of R² = 0.949 and an ROC area against the actual outcomes of 0.868.
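The product relationship and the logit transformation described above can be illustrated with a short numerical sketch. The function names and the two input probabilities are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
import math


def logit(p: float) -> float:
    """Logit transformation log[p / (1 - p)], defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def expit(x: float) -> float:
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_severe(p_any: float, p_moderate_given_abnormal: float) -> float:
    """Prob[Y = 2] as the product of two submodel probabilities.

    p_any                      -- Prob[Y > 0]
    p_moderate_given_abnormal  -- Prob[Y = 1 given Y >= 1]
    """
    p_severe_given_abnormal = 1.0 - p_moderate_given_abnormal
    return p_any * p_severe_given_abnormal


# Hypothetical submodel outputs for one infant (illustration only).
p_any = 0.40
p_moderate = 0.75

p_severe = prob_severe(p_any, p_moderate)

# This logit is the quantity that a single regression would be asked
# to approximate in place of the two-step calculation.
target_logit = logit(p_severe)
```

Applying `expit` to `target_logit` recovers `p_severe`, which is why a single model that predicts the logit well also predicts the probability well.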
We then tested two methods of simplification that were similar to the approach we used to simplify the prediction of Prob[Y > 0]. The first method involves predicting the gold standard logit Prob[Y = 2] using a new set of regression coefficients for clinical and vital signs. Applying a stepdown approach to the model with an approximation accuracy of R² = 0.949, we could drop several clinical signs and retain an R² = 0.941 in predicting the logit of Prob[Y = 2]. The ROC area against the outcome was 0.866. The second method is based on an extreme simplification in which we summarize all clinical and vital signs in the way they were summarized in predicting logit Prob[Y > 0] and attempt to determine whether some transformation of this combination can adequately predict logit Prob[Y = 2]. Approximating Prob[Y = 2] from the simplified estimate of Prob[Y > 0] results in R² = 0.918 (vs. 0.941) and an ROC area = 0.862. The best equation for approximating Prob[Y = 2] is 1/(1 + exp(−L)), where L is a cubic spline in the logit of the estimated Prob[Y > 0]. A great advantage of this second approach is that we could add this rough estimate of Prob[Y = 2] to Tables 4 and 5. The approximation can be improved by allowing crepitations to enter the prediction of Prob[Y = 2] from Prob[Y > 0]. Recall that the crepitations sign is the predictor whose behavior differs most between the two outcomes. Adding crepitations results in R² = 0.934 and ROC area = 0.866. We recommend this approach because it allows Tables 4 and 5 to be used to simultaneously estimate the two probabilities of interest. The second probability estimate (Prob[Y = 2]) is approximated reasonably accurately using this approach (i.e. as the approximation R² = 0.934, this model in essence achieves an accuracy that is ∼93% of what can be accomplished with a more complex two stage model).
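The recommended approximation can be sketched in a few lines of code. The sketch below assumes a truncated power basis for the cubic spline and an additive term for crepitations; the knots and all coefficients are hypothetical placeholders, not the fitted values from the study, and the spline form actually used by the authors may differ.

```python
import math
from typing import Sequence


def cubic_spline(x: float, intercept: float, poly: Sequence[float],
                 knots: Sequence[float], knot_coefs: Sequence[float]) -> float:
    """Cubic spline in x using a truncated power basis (assumed form)."""
    value = intercept + poly[0] * x + poly[1] * x ** 2 + poly[2] * x ** 3
    for knot, coef in zip(knots, knot_coefs):
        value += coef * max(x - knot, 0.0) ** 3
    return value


def approx_prob_severe(p_any: float, crepitations: bool, *,
                       intercept: float, poly: Sequence[float],
                       knots: Sequence[float], knot_coefs: Sequence[float],
                       crep_coef: float) -> float:
    """Approximate Prob[Y = 2] as 1 / (1 + exp(-L)).

    L is a cubic spline in the logit of the estimated Prob[Y > 0],
    plus a term for the crepitations sign.
    """
    x = math.log(p_any / (1.0 - p_any))  # logit of Prob[Y > 0]
    L = cubic_spline(x, intercept, poly, knots, knot_coefs)
    if crepitations:
        L += crep_coef
    return 1.0 / (1.0 + math.exp(-L))


# Hypothetical coefficients, for illustration only (not study estimates).
HYPOTHETICAL = dict(
    intercept=-2.0,
    poly=(0.9, 0.05, 0.0),
    knots=(-1.0, 1.0),
    knot_coefs=(0.01, -0.01),
    crep_coef=0.5,
)

without_crep = approx_prob_severe(0.40, False, **HYPOTHETICAL)
with_crep = approx_prob_severe(0.40, True, **HYPOTHETICAL)
```

With a positive crepitations coefficient, as assumed here, the estimate for an infant with crepitations is higher than for an otherwise identical infant without them; the sign and size of the real coefficient would have to come from the fitted model.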
1. World Health Organization. Perinatal mortality: a listing of available information. Maternal Health and Safe Motherhood Programme. Geneva: World Health Organization, 1997.
2. Stoll BJ. The global impact of neonatal infection. Clin Perinatol 1997;24:1-21.
3. Baker M, Bell L, Avner J. Outpatient management without antibiotics of fever in selected infants. N Engl J Med 1993;329:1437-41.
4. Hewson P, Gollan R. A simple hospital triaging system for infants with acute illness. J Pediatr Child Health 1995;31:29-32.
5. Morley C, Thornton A, Cole T, Hewson P, Fowler M. Baby Check: a scoring system to grade the severity of acute systemic illness in babies under 6 months old. Arch Dis Child 1991;66:100-6.
6. Jaskiewicz J, McCarthy C, Richardson A, et al. Febrile infants at low risk for serious bacterial infection: an appraisal of the Rochester criteria and implications for management. Pediatrics 1994;94:390-6.
7. The WHO Young Infants Study Group. Methodology for a multicenter study of serious bacterial infection in young infants in developing countries. Pediatr Infect Dis J 1999;18(Suppl):S8-16.
8. Clinical signs and etiologic agents of pneumonia, sepsis and meningitis in young infants: report of a meeting. WHO/ARI/90.14. Geneva: World Health Organization, 1989.
9. McCarthy P, Sharpe M, Speisel S, et al. Observation scales to identify serious illness in febrile children. Pediatrics 1982;70:802-9.
10. Margolis P, Ferkol T, Marsocci S, et al. Accuracy of the clinical examination in detecting hypoxemia in infants with respiratory illness. J Pediatr 1994;124:552-650.
11. WHO. Report of a meeting of the radiology working group. Geneva: World Health Organization, 1989.
12. Harrell F, Margolis P, Gove A, et al. Development of a clinical prediction model for an ordinal outcome: the World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat Med 1998;17:909-44.
13. Harrell F, Lee K, Pollock B. Regression models in clinical studies. Determining relationships between predictors and response. J Natl Cancer Inst 1988;80:1198-202.
14. Harrell F, Lee K, Mark D. Multivariable prognostic models: evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361-87.
15. Altman DG. Categorising continuous covariates [Letter]. Br J Cancer 1991;64:975.
16. Steinhoff M, Abd el Khalek M, Khallaf N, et al. Effectiveness of clinical guidelines for the presumptive treatment of streptococcal pharyngitis in Egyptian children. Lancet 1997;350:918-21.
17. Gorelick M, Shaw K, Murphy K. Validity and reliability of clinical signs in the diagnosis of dehydration in children. Pediatrics 1997;99:E6.
18. Greenes R, Begg C. Assessment of diagnostic technologies: methodology for unbiased estimation from samples of selectively verified patients. Invest Radiol 1985;20:751-6.
Publication of this supplement was supported by a grant from the World Health Organisation Programme for Control of Acute Respiratory Infections.
© 1999 Lippincott Williams & Wilkins, Inc.