Accurate assessment of the motor performance of infants born prematurely or with other perinatal complications requires the use of tests with validity for identifying delay or neurologic impairment. Documentation of the construct validity of a new assessment requires research demonstrating its relationships with other existing tests intended for similar purposes. An exploration of external validity examines the congruence or divergence of the constructs underlying new and criterion tests. Of greater interest to practitioners who use these tools to diagnose delay clinically is an examination of the various cut scores for each tool and the differences in referral decisions based on having selected one tool or the other at a particular age.
The Test of Infant Motor Performance (TIMP) was published in 2001 for use in neonatal intensive care units and early intervention programs serving infants born prematurely and other infants at risk for developmental delay.1 The TIMP is a test of functional motor skills with age standards for performance of infants between the ages of 34 weeks postmenstrual age and 17 weeks corrected age (CA) based on a normative sample of 990 US infants with a range of medical risk for developmental delay.2 Previous research demonstrated convergent validity (r = 0.66) at 3 months CA with the Alberta Infant Motor Scale (AIMS)3,4 and validity at 3 months CA for predicting (1) AIMS scores at 12 months CA5 and (2) Peabody Developmental Motor Scales motor quotients at 4 to 5 years of age.6 In a Korean sample, the TIMP scores from tests performed at term-equivalent age were highly predictive of scores on the Bayley II at 6 months CA,7,8 but the TIMP has not been compared with the Bayley III.
The Bayley Scales of Infant Development have long been standard tests for measuring motor and cognitive performance of infants.9 The third edition of the test, the Bayley III,10 was published in 2006 with new age standards for cognitive, language, and motor development. Separate scales for gross and fine motor performance were developed. Evidence on performance of the Bayley III in clinical situations is accumulating but presents a confusing picture of how this new version of the scales performs when compared with the Bayley II and other tests. On the one hand, Green and colleagues11 reported clinical results consistent with expected levels of delay at 8 to 12 months CA in a group of 85 infants born premature. The average score on the Motor Composite was 94 (SD = 17); 22% of the infants had Motor Composite scores less than 85 (1 SD less than the normative mean of 100), and 12% had significantly delayed gross motor subscale scores (<4). On the other hand, Vohr and colleagues12 found large discrepancies in outcomes of a sample of more than 2600 infants with extremely low birth weight when results from Bayley II assessments at 18 to 22 months in 2006 to 2007 were compared with results by using the Bayley III on a cohort tested in 2008 to 2011. The neurodevelopmental impairment rate was reduced from 43% in the earlier period to 13% in 2008 to 2011. Anderson and colleagues13 also reported an overestimation of performance in infants born full-term.
The purpose of this study was to examine the convergent validity of the TIMP and the Bayley III Motor scores and the divergent validity of the TIMP and the Bayley III Cognitive and Language scores at 6 weeks CA in a sample of infants born preterm with dual risk for poor developmental outcome because of social-environmental home conditions. In addition to exploring the correlations among these assessments, we examined the differences in clinical decisions made using threshold cut scores for identification of delayed motor development. Both tests were standardized during approximately the same period of time, but given the slightly earlier time for the TIMP (2002-2004)1 and its more extensive research documenting predictive validity, we used the TIMP as the criterion measure.
This report is derived from a larger randomized clinical trial designed to evaluate outcomes of a developmental intervention for infants born preterm. After baseline data collection, 198 infants born at 29 to 34 weeks gestational age (GA), with at least 2 social-environmental risk factors, were randomly assigned to an attention-only control group or the Hospital-Home Transition: Optimizing Prematures' Environment intervention.14,15 The neurodevelopment of the infants was evaluated at 6 weeks CA via both the TIMP and the Bayley III. No group differences in infant development at 6 weeks CA were identified as a result of the intervention, so for the purpose of this report, we used the entire sample to examine the convergent/divergent validity of the TIMP Z scores and the Bayley III Cognitive, Motor, and Language Composite scores at 6 weeks CA.
The research was conducted in 2 inner-city Midwestern community hospital neonatal intensive care units (1 a level II intermediate care unit and the second a level III unit serving infants with special health care needs such as ventilatory support).
Infants met eligibility criteria if they were born between 29 and 34 weeks GA and had no other major health problems. Infants may have previously received ventilator support or other medical therapies for maintenance (eg, intravenous therapy or oxygen therapy via nasal cannula). Their mothers met eligibility criteria if they had at least 2 social-environmental risk factors, such as minority status, less than high school education, younger than 18 years, history of current mental illness (eg, depression), family income less than 185% of the federal poverty guidelines, more than 1 infant younger than 24 months, 4 or more infants younger than 4 years in the home, or residing in a disadvantaged neighborhood.
Infant exclusion criteria included the presence of congenital anomalies, necrotizing enterocolitis, brain injury, chronic lung disease, history of prenatal illicit drug exposure, or positive toxicology screen. Infants were excluded if their mothers were using illicit drugs, not the legal guardian of the infant, or HIV positive. Sample characteristics were obtained via medical record review, with the exception of maternal race, which was self-reported by the mother at baseline.
Of the 198 mother-infant dyads enrolled in the intervention study, 149 (75.3%) were retained for the 6 weeks CA visit, but 3 were not assessed for infant development because the tester was not available at the time of the visit. One additional infant completed the TIMP but not the Bayley III assessment as a result of fatigue. Therefore, 145 infants had scores for the TIMP and the 3 Bayley III scales and were included in this study. Table 1 summarizes the characteristics of the subjects. The average GA of the infants at birth was 32.4 weeks (SD = 1.6). The infants had a moderate degree of medical issues (mean = 70.9), as documented with the Problem-Oriented Perinatal Risk Assessment System, which is used to assign points to various medical conditions.16,17 Higher scores denote the presence of more medical complications and increased risk for developmental morbidity. At the 6 weeks CA follow-up visit, the mean chronologic age was 13.4 weeks (SD = 1.9).
The research was approved by the Institutional Review Boards from the university and the 2 clinical sites. After informed consent was obtained, mothers and infants were randomly assigned to the Hospital-Home Transition: Optimizing Prematures' Environment or the attention-only control group.15 When the infant reached 32 weeks postmenstrual age, the intervention began in the hospital following an initial oral feeding assessment. Infant development was assessed at 6 weeks CA via the TIMP and the Bayley III. For the 6-week assessments, mothers brought their infants to an examination room in the university's College of Nursing. In most cases, the sessions proceeded according to the following schedule: Infants were evaluated first with the TIMP followed by the Bayley. Because this was part of a larger study that also evaluated mother-infant feeding interactions,15 the mothers were instructed to let the researchers know when they believed that their infants were showing feeding readiness cues. In such instances, the therapist stopped the developmental assessment and resumed after the infants had completed their feedings. Most of the time (75%), the TIMP and Bayley III were completed in their entirety and in that order before the infants were ready to feed.
The TIMP is a 42-item assessment of functional gross motor performance.1 Item responses are related to demands for movement placed on infants in daily life interactions with caregivers,18 and Rasch psychometric analysis revealed that the items reliably separate infants into 5 to 6 uniquely different levels of development.19 All 42 items are administered to infants at 6 weeks CA, and raw scores are compared with age norms in 2-week increments. Z scores, percentile ranks, and age-equivalent scores can also be obtained.
The Bayley III is used to assess the developmental functioning of infants and toddlers aged 1 to 42 months.20 The Bayley III consists of 3 administered scales: Cognitive, Language (including receptive and expressive communication subtests), and Motor (including fine and gross motor subtests). The Cognitive scale is used to assess sensory-perceptual acuities, discriminations, and the ability to respond to these as well as the early acquisition of object constancy and memory, learning, and problem-solving ability. The Language scale is used to evaluate receptive language capabilities, expressive vocalizations, and the beginnings of verbal communication. The Motor scale provides a means to assess postural control, coordination of the large muscles, and finer manipulation skills of the hands and fingers; results can be reported as separate gross and fine motor scores. Scores are presented as raw scores, scaled scores, composite scores, percentile ranks, age equivalents, and growth scores.10,21
For this study, we assessed interrater reliability for a random 25% of the TIMP and Bayley assessments from a video recording of the original administration of the tests. A physical therapist who is an expert in administering the TIMP rerated the TIMP, and an expert in administering the Bayley test rerated the Cognitive, Language, and Motor scales of that test. Both were unaware of infant group assignment and scores obtained by the study testers. Interrater reliability was determined using the intraclass correlation coefficient (ICC).22
Means, standard deviations, and proportions were used to describe the sample characteristics, and the TIMP Z scores and Bayley III Composite scores for cognitive, language, and motor development were calculated for test comparisons. Pearson product moment correlation coefficients were calculated for the relationship between the TIMP and Bayley III Composite scores, and cross-tabulations were generated to contrast the number of infants with indication of delay in motor development according to the TIMP with the number of infants scoring below the mean of the Bayley III scores.
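The two analyses described above, a Pearson correlation and a cross-tabulation at chosen cutoffs, can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary keys, and the six pairs of scores are made up for the example and are not study data, and the rule that a score strictly below a cutoff counts as "below" is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant sequence gives sxx or syy of 0 and r is undefined.
    return sxy / sqrt(sxx * syy)


def cross_tabulate(timp_z, bayley, timp_cutoff, bayley_cutoff):
    """Count infants in each cell of the TIMP-by-Bayley 2x2 table.

    Assumption: a score strictly below its cutoff counts as 'below'.
    """
    counts = {"both_below": 0, "timp_only": 0, "bayley_only": 0, "neither": 0}
    for z, b in zip(timp_z, bayley):
        timp_low = z < timp_cutoff
        bayley_low = b < bayley_cutoff
        if timp_low and bayley_low:
            counts["both_below"] += 1
        elif timp_low:
            counts["timp_only"] += 1
        elif bayley_low:
            counts["bayley_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Made-up scores for six infants (illustration only, NOT study data).
timp_z = [-1.2, -0.7, -0.4, 0.1, 0.6, 1.0]
bayley_motor = [97, 103, 110, 112, 118, 124]

r = pearson_r(timp_z, bayley_motor)
table = cross_tabulate(timp_z, bayley_motor, timp_cutoff=-0.5, bayley_cutoff=100)
print(f"r = {r:.2f}")
print(table)
```

In practice a statistics package would normally be used for the correlation; the hand-written version is shown only to make the computation explicit.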
To formally compare the clinical decisions regarding motor developmental outcome that would result from TIMP Z score cutoffs of −0.5, −1.0, and −1.5 (ie, 0.5, 1.0, and 1.5 SD below the mean) and a cutoff at the mean (<100) for the Bayley Motor Composite score, we calculated sensitivity, specificity, positive predictive validity (PPV), and negative predictive validity using the TIMP as the criterion measure or reference standard. Sensitivity was calculated as the proportion of infants with scores below the cutoff on the TIMP Z score who also scored below the cutoff on the Bayley III Motor Composite score. Specificity was calculated as the proportion of infants with scores at or above the TIMP cutoff who also had Bayley III Motor Composite scores at or above the cutoff. The PPV was calculated as the proportion of infants with a score below the cutoff on the Bayley Motor Composite who also had a score below the cutoff on the TIMP Z score, and the negative predictive validity was the proportion of infants with a score at or above the cutoff on the Bayley Motor Composite who also had a TIMP Z score at or above the cutoff.
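These four definitions can be written compactly in terms of the cells of a 2 × 2 table. The cell labels are notation introduced here for illustration, not taken from the original tables: a = infants below the cutoff on both tests, b = infants below the Bayley cutoff but at or above the TIMP cutoff, c = infants below the TIMP cutoff but at or above the Bayley cutoff, and d = infants at or above both cutoffs. NPV abbreviates negative predictive validity.

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{a}{a + c}, &
\text{Specificity} &= \frac{d}{b + d}, \\[4pt]
\text{PPV} &= \frac{a}{a + b}, &
\text{NPV} &= \frac{d}{c + d}, \\[4pt]
\text{Overall agreement} &= \frac{a + d}{a + b + c + d}.
\end{aligned}
```

Because the TIMP is the reference standard, the denominators for sensitivity and specificity are TIMP classifications, whereas those for the PPV and NPV are Bayley classifications.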
Interrater reliability for the TIMP resulted in an ICC of 0.79 (95% CI, 0.60-0.90); the ICC was 0.73 (95% CI, 0.46-0.86) for the Bayley Cognitive scale; 0.75 (95% CI, 0.51-0.87) for the Bayley Language scale, and 0.75 (95% CI, 0.46-0.88) for the Bayley Motor scale.
Table 2 presents the means, standard deviations, and ranges of raw score performance on the TIMP and Bayley III Cognitive, Language, and Motor scales at 6 weeks CA, and Table 3 describes the outcomes in standard score terms (Z score for the TIMP and composite scores for the Bayley III). Sixty of the 145 infants (41.4%) scored below the recommended cutoff for delay of a Z score of −0.5 SD on the TIMP.1 Scores on the Bayley scales were much higher with the means of all composites greater than 100. Ten infants scored below the mean on the Cognitive scale (6.9%), 15 (10.3%) on the Language scale, and 5 (3.4%) on the Motor scale. No infant scored more than 1 SD below the mean on any Bayley III scale. As a result, no infants were identified as delayed on the Bayley III scales when using the typical cutoff for suspicious performance.
Correlations Among Tests
Table 4 shows the correlations among the TIMP Z scores at 6 weeks and the Bayley III Cognitive, Language, and Motor Scale Composite scores. As expected, the correlation between the TIMP and the Bayley Motor Composite was higher than those between the TIMP and the Bayley Cognitive Composite or Language Composite.
Clinical Decision Comparison
Table 5 presents a 4-fold table comparing the Bayley Motor Composite results split at the mean of the distribution to the TIMP Z score results, using a cutoff of −0.5 SD to designate delay. The mean of the Bayley Motor Composite was used as the cutoff because no infant scored more than 1 SD below the mean and only 5 infants scored below the mean. The sensitivity of the Bayley for agreement with the TIMP in identifying delay was negligible at 8.3%, whereas the specificity was 100% because all infants scoring at or above the cutoff on the TIMP scored at or above the mean on the Bayley Motor Composite (Table 6). The PPV was also 100% because the 5 infants scoring below the mean on the Bayley Motor Composite all had delayed scores on the TIMP. The negative predictive validity, in contrast, was poor at 61% because the vast majority of infants tested scored above average on the Bayley. The overall agreement of the Bayley in reflecting results on the TIMP was 62.1%.
Because the cutoff recommended for identification of delay on the TIMP was derived from data on concurrent and predictive validity for TIMP assessments at 3 months CA, the lower cutoff of −1.5 SD recently published by Korean researchers8 for comparing TIMP results at term age with Bayley II results at 6 months CA was next examined along with an intermediate cutoff of −1.0 SD. Table 6 shows the results. A cutoff at the mean on the Bayley Motor Composite best matches a TIMP Z score cutoff of −1.0 SD, with agreement between decisions at 93%, but the sensitivity remains poor: only 4 of the 13 infants (31%) identified as delayed on the TIMP had scores below the mean on the Bayley Motor Composite. One infant scoring below the mean on the Bayley achieved a score above −1.0 SD on the TIMP, and 9.0% of the infants (13 of 145) were identified as delayed on the TIMP.
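The agreement statistics quoted for the −0.5 SD and −1.0 SD cutoffs can be checked from counts stated in the text. The sketch below is not the authors' analysis code. The cell counts are reconstructed by subtraction from the reported totals (145 infants; 60 and 13 below the respective TIMP cutoffs; 5 below the Bayley mean, of whom 5 and 4, respectively, were also below the TIMP cutoff) rather than copied from Tables 5 and 6, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourFoldTable:
    """2x2 decision table with the TIMP as the reference standard."""
    both_below: int   # below TIMP cutoff and below Bayley mean
    timp_only: int    # below TIMP cutoff, at/above Bayley mean
    bayley_only: int  # at/above TIMP cutoff, below Bayley mean
    neither: int      # at/above both cutoffs

    @property
    def total(self):
        return self.both_below + self.timp_only + self.bayley_only + self.neither

    def sensitivity(self):
        return self.both_below / (self.both_below + self.timp_only)

    def specificity(self):
        return self.neither / (self.neither + self.bayley_only)

    def ppv(self):
        return self.both_below / (self.both_below + self.bayley_only)

    def npv(self):
        return self.neither / (self.neither + self.timp_only)

    def agreement(self):
        return (self.both_below + self.neither) / self.total


def build_table(n_timp_below, n_bayley_below, n_both_below, n_total=145):
    """Reconstruct the four cells from marginal counts (hypothetical helper)."""
    timp_only = n_timp_below - n_both_below
    bayley_only = n_bayley_below - n_both_below
    neither = n_total - n_both_below - timp_only - bayley_only
    return FourFoldTable(n_both_below, timp_only, bayley_only, neither)


# Counts taken from the text; cells derived by subtraction.
cut_05 = build_table(n_timp_below=60, n_bayley_below=5, n_both_below=5)
cut_10 = build_table(n_timp_below=13, n_bayley_below=5, n_both_below=4)

for label, t in (("-0.5 SD", cut_05), ("-1.0 SD", cut_10)):
    print(
        f"TIMP cutoff {label}: sensitivity {t.sensitivity():.1%}, "
        f"specificity {t.specificity():.1%}, PPV {t.ppv():.1%}, "
        f"NPV {t.npv():.1%}, agreement {t.agreement():.1%}"
    )
```

Under these reconstructed counts, the −0.5 SD table gives the reported sensitivity of 8.3%, specificity and PPV of 100%, negative predictive validity of about 61%, and agreement of 62.1%; the −1.0 SD table gives the reported sensitivity of about 31% and agreement of about 93%.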
In summary, Bayley Motor Composite scores were significantly correlated with TIMP Z scores, whereas Cognitive and Language Composite scores were less strongly related to TIMP performance. No infant scored more than 1 SD below the mean on any of the Bayley III scales, whereas the TIMP identified 41% of infants as having delayed motor development, using a cutoff of −0.5 SD, or 9% of infants, using a cutoff of a Z score of −1.0 SD.
A comparison of the correlations among the various Bayley scales and the TIMP demonstrated convergence between the Bayley Motor Composite and the TIMP and divergence between the TIMP and the Bayley Cognitive and Language Composites. An examination of the items on each test supports the conclusion that the Bayley Motor Scales (gross and fine) measure a number of the same skills as the TIMP at 6 weeks CA, including head control in the upright, prone, and supine positions, head turning and visual following, reaching, and crawling movements.1,20 The range of possible raw scores on the Bayley gross and fine motor scales, however, is about 6 to 15 points, whereas that of the TIMP at 6 weeks CA is as much as 70 to 80 points, providing a greater degree of precision along with a wider range of assessed skills, including postural control in supported standing, trunk rotation, and head and trunk control during lateral tipping actions evoking vestibular responses.
The Cognitive and Language scales of the Bayley III consist primarily of items assessing visual and auditory attention and habituation compared with only 4 items on the TIMP assessing attention or visual or auditory search and no habituation items.1,20 The relatively lower, although still statistically significant, correlations among these scales support a conclusion that they measure divergent constructs with only a small degree of overlap.
Despite a good correlation between the TIMP and the Bayley Motor Composite, the decision analysis shows that the use of the 2 tests at 6 weeks CA yields widely divergent results. The Bayley Motor Composite scores averaged 116, and the lowest score obtained by any infant was 94. We are not aware of any research on the predictability of later developmental outcome from Bayley III scores, but on the basis of the results of this study, clearly no infant would be identified as having delayed or even suspicious motor development when using the Bayley Motor scale as the basis for early identification of the need for close surveillance or intervention. In contrast, applying the recommended TIMP cutoff of −0.5 SD would result in 41.4% of the infants being flagged for close surveillance or referral for intervention, depending on how low the score was. Previous research on the TIMP at 30 days as compared with preschool-age outcomes showed an overall accuracy of 80% in predicting Peabody Developmental Motor Scale scores more than 2 SD from the mean.6 The PPV of early scores was 60% such that 40% of infants who scored low at 30 days were found to have normal motor development at 4 to 5 years of age. The PPV improved to 75% and overall accuracy to 87% if testing occurred at 90 days CA. Although one might conclude that 6 weeks CA is too early to identify delay, low scores on the TIMP reflect performance below that of a national sample of 990 infants who were stratified according to perinatal medical risk factors. Thus, low scores at 6 weeks CA provide the opportunity to flag infants close to the cutoff or scoring between −0.5 and −1.0 SD for intervention to help them close the gap between their performance and that of a national sample from the population of infants born premature for which the TIMP was designed.
Vohr and colleagues12 suggest that all infants born with extremely low birth weight should be offered early intervention because of their high risk for cerebral palsy and other disabilities. In the case of this sample of infants with moderate biologic risk for delay but socioeconomically disadvantaged backgrounds, however, frequent surveillance and instruction of parents in appropriate activities to do at home are recommended until such time as a more definitive diagnosis can be reached.
The findings of this study add to the accumulating evidence on overestimation of ability when testing using the Bayley III by providing information on a moderate-risk sample of infants born preterm and assessed at 6 weeks CA. Previous studies have shown overestimation of performance at ages from 12 months to 2 years CA for Australian infants born full-term and preterm,13 infants born with extremely low birth weight compared with performance of earlier cohorts on the Bayley II at 18 to 22 months,12 infants younger than 6 months CA enrolled in early intervention services,23 infants post complex cardiac surgery,24,25 and, at 6 months CA, English infants born preterm.26
Why has this occurred? The most compelling argument for the overestimation of performance being widely reported is that the 2006 norms for the Bayley III scales were developed in a different manner than those for the previous editions of the scales.24 In an attempt to reflect the broader population of infants in the United States, the sample for the 2006 norms included more Latino infants and about 10% clinical cases, including infants born preterm and infants with diagnosed disabilities.10 This sampling decision lowered the overall means of the scaled scores and now boosts the apparent performance of both infants born preterm and infants developing typically being compared with the published norms. Moreover, studies have shown that the greatest overestimation of performance is occurring in infants born prematurely.12,26 As a result, further research on the predictability of scores for later outcomes is critically needed, and studies using the Bayley III scales should always include a control group for comparison with the group of interest.
If the Bayley III is used to make diagnoses, clinicians should be aware that a high cut score should be used. Moore and colleagues26 suggest a cut score of 80, but our results suggest that any score less than the average score at early ages should trigger parent instruction and close surveillance with repeated assessment. We agree with the suggestion of Moore and colleagues26 that on the basis of accumulating research a consensus statement is needed on the classification of developmental impairment when using the Bayley III.26
Infants with biologic and socioenvironmental risk, such as those in this study, have a high incidence of poor developmental outcomes. For example, Bradley and colleagues27 found that only 11% of 3-year-old children born prematurely and living in poverty functioned in the normal range in all areas of growth and development. The provision of a home exercise program for infants with low scores on the TIMP at hospital discharge has been successful in significantly raising the TIMP scores at 4 months,28 demonstration of the TIMP for African American mothers with low income improved their understanding of infant motor development,29 and predictability of preschool motor development is high by 3 months CA.6 As a result, we recommend the use of the TIMP for early assessment of infants at risk for delayed motor development with the selection of a cut score being based on resources available and the philosophy of early intervention of the agency.
Although the Bayley III scales have a degree of commonality with the TIMP, no children in this moderate-risk group were identified as having delayed motor development by the Bayley III scales at 6 weeks CA. For assessment of motor performance and determination of the need for intervention at early ages in infants at risk for developmental delay, the TIMP is the preferred test.
The authors thank Dr Michael Nelson for conducting reliability assessments for the Bayley III. They also thank the infants and mothers who participated in this research and the nursing and medical staff at the clinical sites.
1. Campbell SK. The Test of Infant Motor Performance. Test User's Manual Version 3.0 for the TIMP Version 5. Chicago, IL: Infant Motor Performance Scales LLC; 2012.
2. Campbell SK, Levy P, Zawacki L, Liao PJ. Population-based age standards for interpreting results on the Test of Infant Motor Performance. Pediatr Phys Ther. 2006;18:119–125. doi:10.1097/01.pep.0000223108.03305.5d.
3. Piper MC, Darrah J. Motor Assessment of the Developing Infant. Philadelphia, PA: WB Saunders; 1994.
4. Campbell SK, Kolobe THA. Concurrent validity of the Test of Infant Motor Performance with the Alberta Infant Motor Scale. Pediatr Phys Ther. 2000;12:1–8.
5. Campbell SK, Kolobe THA, Wright BD, Linacre JM. Validity of the Test of Infant Motor Performance for prediction of 6-, 9- and 12-month scores on the Alberta Infant Motor Scale. Dev Med Child Neurol. 2002;44:263–272. doi:10.1111/j.1469-8749.2002.tb00802.x.
6. Kolobe THA, Bulanda M, Susman L. Predicting motor outcome at preschool age for infants tested at 7, 30, 60, and 90 days after term age using the Test of Infant Motor Performance. Phys Ther. 2004;84:1144–1156.
7. Bayley N. Bayley Scales of Infant Development. 2nd ed. San Antonio, TX: Psychological Corporation; 1993.
8. Kim SA, Lee YJ, Lee YG. Predictive value of Test of Infant Motor Performance for infants based on correlation between TIMP and Bayley Scales of Infant Development. Ann Rehabil Med. 2011;35:860–866. doi:10.5535/arm.2011.35.6.860.
9. Bayley N. Manual for the Bayley Scales of Infant Development. San Antonio, TX: Psychological Corporation; 1969.
10. Bayley N. Bayley Scales of Infant and Toddler Development: Technical Manual. 3rd ed. San Antonio, TX: Harcourt Assessment; 2006.
11. Green MM, Patra K, Nelson MN, Silvestri JM. Evaluating preterm infants with the Bayley-III: patterns and correlates of development. Res Dev Disabil. 2012;33:1948–1956. doi:10.1016/j.ridd.2012.05.024.
12. Vohr BR, Stephens BE, Higgins RD, et al. Are outcomes of extremely preterm infants improving? Impact of Bayley assessment on outcomes. J Pediatr. 2012;161:222–228. doi:10.1016/j.jpeds.2012.01.057.
13. Anderson PJ, De Luca CR, Hutchinson E, et al. Under-estimation of developmental delay by the new Bayley III Scale. Arch Pediatr Adolesc Med. 2010;164:352–356. doi:10.1001/archpediatrics.2010.20.
14. Burns K, Cunningham N, White-Traut R, Silvestri J, Nelson MN. Infant stimulation: modification of an intervention based on physiologic and behavioral cues. J Obstet Gynecol Neonatal Nurs. 1994;23:581–589. doi:10.1111/j.1552-6909.1994.tb01924.x.
15. White-Traut R, Norr K. An ecological model for premature infant feeding. J Obstet Gynecol Neonatal Nurs. 2009;38:478–490. doi:10.1111/j.1552-6909.2009.01046.x.
16. Davidson EC, Hobel CJ. POPRAS: A Guide to Using the Prenatal, Intrapartum, Postpartum Record. Torrance, CA: South Bay Regional Perinatal Project Professional Staff Association; 1978.
17. Molfese VJ, Thomason B. Optimality versus complications: assessing predictive values of perinatal scales. Child Dev. 1985;56:810–823.
18. Murney ME, Campbell SK. The ecological relevance of the Test of Infant Motor Performance elicited scale items. Phys Ther. 1998;78:479–489.
19. Campbell SK, Wright BD, Linacre JM. Development of a functional movement scale for infants. J Appl Meas. 2002;3(2):191–205.
20. Bayley N. Bayley Scales of Infant and Toddler Development: Administration Manual. 3rd ed. San Antonio, TX: Harcourt Assessment; 2006.
21. Spittle AJ, Doyle LW, Boyd RN. A systematic review of the clinimetric properties of neuromotor assessments for preterm infants during the first year of life. Dev Med Child Neurol. 2008;50:254–266. doi:10.1111/j.1469-8749.2008.02025.x.
22. Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley & Sons; 1981.
23. Connolly BH, McClune NO, Gatlin R. Concurrent validity of the Bayley-III and the Peabody Developmental Motor Scale-2. Pediatr Phys Ther. 2012;24:345–352. doi:10.1097/PEP.0b013e318267c5cf.
24. Acton BV, Biggs WSG, Creighton DE, et al. Overestimating neurodevelopment using the Bayley-III after early complex cardiac surgery. Pediatrics. 2011;128:e794–e800. doi:10.1542/peds.2011-0331.
25. Long SH, Galea MP, Eldridge BJ, Harris SR. Performance of 2-year-old children after early surgery for congenital heart disease on the Bayley Scales of Infant and Toddler Development. 3rd ed. Early Hum Dev. 2012;88:603–607. doi:10.1016/j.earlhumdev.2012.01.007.
26. Moore T, Johnson S, Haider S, Hennessy E, Marlow N. Relationship between test scores using the second and third editions of the Bayley Scales in extremely preterm children. J Pediatr. 2012;160:553–558. doi:10.1016/j.jpeds.2011.09.047.
27. Bradley RH, Whiteside L, Mundfrom DJ, et al. Early indications of resilience and their relation to experiences in the home environments of low birthweight, premature children living in poverty. Child Dev. 1994;65:346–360.
28. Lekskulchai R, Cole J. Effect of a developmental program on motor performance in infants born preterm. Aust J Physiother. 2001;47:169–176.
29. Goldstein LA, Campbell SK. Effectiveness of the Test of Infant Motor Performance as an educational tool for mothers. Pediatr Phys Ther. 2008;20:152–159. doi:10.1097/PEP.0b013e3181729de8.