Diagnosis of delayed motor development in children requires age-related standards, or norms, for expected typical performance based on data collection from a representative sample of children similar to those for which a test is intended.1 The sample selection for a normative group must represent the range of age covered by the test and should be designed to investigate potential effects on test scores of factors in the population of interest, such as race/ethnicity, risk status, and sex. The purpose of this article is to present age standards for the Test of Infant Motor Performance (TIMP) derived from a cross-sectional sample of infants recruited to reflect the distribution of race/ethnicity in the US population of infants born with low birth weight (LBW). A second purpose is to describe results of analysis of group differences based on sex, risk status, and race/ethnicity.
The TIMP is a 42-item assessment of postural and selective control needed for function in early infancy.2,3 Total raw scores range from 0 to 142. Age-related performance standards were first established for infants from 34 weeks’ postconceptional age (PCA) through 17 weeks after term (ages adjusted for prematurity, AA) based on the performance of 98 white (non-Hispanic), black (African and African-American), and Latino/a (Mexican and Puerto Rican) infants from the Chicago metropolitan area who were tested repeatedly on a weekly basis up to four months AA plus 60 infants tested only at one age.3 On the basis of Rasch psychometric analysis, the TIMP forms an interval level linear scale of hierarchically ordered items.4
Other research has documented the construct, discriminative and predictive validity of the TIMP. The TIMP has been shown to (1) reflect linear developmental change across the period from 32 weeks’ PCA to four months after term,5,6 (2) be responsive to degree of perinatal risk as represented by medical complications,6 (3) predict motor developmental outcome at one year7 and at preschool,8 and school age,9 and (4) identify developmental delay associated with a later diagnosis of cerebral palsy as early as seven days after term.10 A study by Murney and Campbell showed that performance demands of the TIMP were comparable with those placed on infants by their caregivers in naturalistic interactions, demonstrating the ecologic validity of the TIMP.11
The TIMP is also capable of detecting change resulting from intervention. In a small controlled clinical trial, scores from a pilot version of the test reflected the effects on motor performance of one to two weeks of twice-daily physical therapy, based on a neurodevelopmental treatment model, for 34- to 35-week PCA infants.12 In a large controlled clinical trial in Thailand, the TIMP was shown both to identify infants at hospital discharge who already showed motor developmental delay and to demonstrate the efficacy of a home physical therapy program in helping those infants accelerate their average rate of development into the typical range.13
The purposes of this article are to present age standards for the TIMP derived from a population-referenced cross-sectional sample of US infants and to describe results of exploration with multiple regression analysis of group differences based on sex, risk status, and race/ethnicity.
The sample for this study was intended to reflect the racial/ethnic distribution of the population of LBW infants (ie <2500 g at birth) in the United States with (1) stratification based on age at testing and degree of risk for poor developmental outcome and (2) geographic variability. A sample of 120 infants from each of 10 geographic locations (an 11th was added partway through the project when one site dropped out) was planned for inclusion in this study. Sites were chosen through a search for special care nurseries with associated developmental follow-up clinics that would represent the diverse regions of the United States and whose professionals were willing to participate. Only in the southwestern United States (eg Texas, New Mexico, Oklahoma) were we unable to identify a volunteer site. Study sites were hospitals in Birmingham, Alabama; Boston, Massachusetts; Chicago, Illinois (two hospitals); Cleveland, Ohio; Los Angeles, California (three hospitals); Omaha, Nebraska; Pensacola, Florida; Philadelphia, Pennsylvania; Portland, Oregon; Raleigh, North Carolina; and Sioux Falls, South Dakota. The sites included large urban health science centers as well as smaller hospitals serving regional, including rural, populations.
The sampling plan called for 100 infants to be tested in each of 12 two-week AA ranges for a total of 1200 subjects. Table 1 shows the age ranges for each of the 12 groups.
Subjects were recruited for the study based on a specific subject selection grid containing 120 subjects for each of the 10 original participating sites (except 60 for each Chicago location). Each grid was based on the site's reported distribution of race/ethnicity for the nursery census in the previous calendar year. The combination of recruitment from individual site grids in each participating center was designed to produce a sample reflecting the target population of LBW infants in terms of race/ethnicity as defined by epidemiologic studies. Because the TIMP is intended for use with infants in intensive care nurseries, developmental follow-up clinics for high-risk infants, and early intervention programs, the target population of interest was that of infants with medical complications. Thus, the decision was made to consider the US population of LBW infants as the target group of interest with a desire to reflect their range of ethnicity and an approximation of their geographic distribution across the United States. Statistics from the Centers for Disease Control, National Center for Health Statistics, for 1996 were used to establish the numbers of live births in the United States based on race/ethnicity.14 Five race/ethnicity groupings were used according to National Center for Health Statistics definitions; mixed race infants were excluded.
To determine the sample size, the number of live births for each race/ethnicity group was multiplied by the proportion of infants of that race/ethnicity born weighing less than 2500 g to yield the number of LBW newborns for each group. The proportion of all LBW infants in each race/ethnicity group was multiplied by 100 to determine how many infants of each race/ethnicity would be recruited. As Table 2 makes clear, because of the high rate of LBW in infants of black mothers, black infants were sampled at about twice the rate of other groups in keeping with their prevalence in high risk nurseries in the United States.
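The allocation arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the calculation, not the authors' procedure as implemented. The function name `allocate_slots` and the group labels and counts in the usage example are hypothetical placeholders, not the 1996 National Center for Health Statistics figures.

```python
def allocate_slots(live_births, lbw_rate, slots=100):
    """Distribute recruitment slots across groups in proportion to LBW births.

    live_births: dict mapping group -> number of live births
    lbw_rate:    dict mapping group -> proportion born weighing < 2500 g
    slots:       number of infants to allocate (100 in the text)
    """
    # Step 1: estimated number of LBW newborns in each group.
    lbw_counts = {g: live_births[g] * lbw_rate[g] for g in live_births}
    total_lbw = sum(lbw_counts.values())
    # Step 2: each group's share of all LBW newborns, scaled to the slot count.
    return {g: slots * n / total_lbw for g, n in lbw_counts.items()}


if __name__ == "__main__":
    # Hypothetical placeholder values, chosen only to show the effect of
    # a higher LBW rate on the allocation. They are not real statistics.
    births = {"group_a": 1000, "group_b": 1000}
    rates = {"group_a": 0.05, "group_b": 0.15}
    print(allocate_slots(births, rates))
```

With equal numbers of live births, the group with the higher LBW rate receives proportionally more slots, which is the mechanism the text gives for the heavier sampling of black infants.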
To sample infants to reflect a range of risk for developmental disability, a means for describing risk quantitatively was used. In our previous work, the Newborn Score on the Problem-oriented Perinatal Risk Assessment System (POPRAS) completed based on the historical medical record up to the day of testing was used as a marker of risk for neonatal mortality and morbidity. The POPRAS also was used because complications scales are more predictive of medical and developmental outcomes than optimality scales.15 In earlier work we used scores on the POPRAS greater than 90 to indicate high risk for poor developmental outcome, scores between 61 and 90 to indicate medium risk, and scores less than 61 to indicate low risk. Infants who meet the criteria for brain insult and for chronic lung disease (oxygen requirement after 35 weeks’ PCA) always have sufficient additional risk factors to warrant classification as high risk. In this study, a modified form of the POPRAS was used which significantly reduced the time required for gathering the data while maintaining the capability to select children with a range of risk for poor motor developmental outcome.
The modified recruitment tool was developed after reviewing the range of POPRAS scores and types of problems identified in our previous studies of almost 200 infants and collapsing the data collection tool into those types of problems seen most often. A small sample of the original infants was rescored with the modified POPRAS to verify comparability of risk assignment, but no formal reliability or validity studies were done on the modified form. Score ranges for risk were the same as used earlier, but an infant with bronchopulmonary dysplasia, periventricular leukomalacia, hypoxic-ischemic encephalopathy, or a grade III or IV intraventricular hemorrhage was considered high risk regardless of score.
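The risk assignment rule stated above (score ranges plus the diagnoses that force a high-risk classification) can be written as a short Python sketch. The function name, the diagnosis strings, and the data layout are assumptions made for illustration; the modified POPRAS form itself is not reproduced here.

```python
# Diagnoses that place an infant in the high-risk group regardless of score.
# The string labels are illustrative; the study form may word them differently.
HIGH_RISK_DIAGNOSES = frozenset({
    "bronchopulmonary dysplasia",
    "periventricular leukomalacia",
    "hypoxic-ischemic encephalopathy",
    "intraventricular hemorrhage grade III",
    "intraventricular hemorrhage grade IV",
})


def classify_risk(popras_score, diagnoses=()):
    """Return 'high', 'medium', or 'low' using the cut points given in the text."""
    # An overriding diagnosis means high risk whatever the score is.
    if HIGH_RISK_DIAGNOSES.intersection(diagnoses):
        return "high"
    if popras_score > 90:
        return "high"
    if popras_score >= 61:  # scores from 61 through 90
        return "medium"
    return "low"  # scores below 61


if __name__ == "__main__":
    print(classify_risk(95))
    print(classify_risk(75))
    print(classify_risk(40, ["periventricular leukomalacia"]))
```

Checking the diagnosis override first keeps the score thresholds simple and mirrors the "regardless of score" wording.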
The planned sample for each age group in the study included one-third each of high-, medium-, and low-risk infants. No national data could be found to support the choice of an appropriate distribution of risk in the sample selection process as most data reports available are only on exceptionally high-risk infants. We decided on equal proportions because a similar selection process in previous work intended to recruit infants with a range of risk7 resulted in an incidence of delayed development at 12 months of age of 15%, a figure that is in line with outcomes reported from studies of large samples of high risk infants.16,17
Infants were excluded from recruitment if there was disagreement of more than two weeks among age calculated from expected date of delivery, physician’s rating of gestational age (GA) at birth, and/or GA calculated from biparietal diameter on fetal ultrasound. Infants with myelodysplasia, brachial plexus injury, or spinal cord injury were excluded from recruitment because paralysis of body parts makes some items on the TIMP impossible to score.
In summary, the sampling design was planned to reflect the racial/ethnic distribution of US infants born with LBW and the sample was stratified by age and, within each age group, by medical risk status. Recruitment of infants in 11 locations provided geographic variability.
After an infant was identified as a match with a grid assignment for age/ethnicity/risk at a testing site, a recruiter scheduled the infant for testing with the TIMP approximately one hour before a scheduled feeding, in the age window for which the infant had been selected. Subjects were cleared by their physician as medically stable enough for testing, were off mechanical ventilation (but could be receiving oxygen by nasal cannula, reside in an isolette, or both), and had a signed parental consent to participate. The infant’s age and medical history were masked to testers. To further avoid development of expectations for total test scores, all tests were forwarded with only individual item scores marked to the University of Illinois at Chicago (UIC) data analysis site for final calculation of total raw scores.
Recruiter Reliability Training
The project Training Coordinator and the Principal Investigator completed the modified POPRAS form for five sample medical records with a variety of types of risk factors and obtained agreement on risk assessment. At each site, the recruiter completed the same sample records to determine their ability to accurately identify all risk factors. The Training Coordinator also reviewed local medical records with the recruiter to be sure they could identify the locations of pertinent data, such as laboratory test values and head ultrasound findings. When POPRAS forms were received at the data collection site, all values were checked by project staff for accurate addition and risk assignment. No formal statistical analyses were completed.
Tester Reliability Training
One to three testers were trained at each site. The testers followed previously established training procedures for rater reliability analysis. Published articles on the TIMP and a multi-media CD-ROM training program18 were provided for initial orientation. The testers then practiced testing infants of various ages and ability levels in order to become efficient in the testing procedures and the administration of each item. They then scored items from videotapes of four actual tests of infants from the researchers’ databank of tests scored by reliable raters and submitted the results to UIC. The scores were analyzed using the Facets computer program (V. 3.20) for Rasch analysis of rater consistency and severity/leniency.19
The scoring reliability analysis provided three types of information regarding the tester’s scoring performance.
- Inter-rater reliability: a list of specific items which the tester scored differently from reliable raters identified specific problem areas for further study. Fewer than 5% misfit ratings were required in order to be considered a reliable rater.
- Overall rater consistency: the fit statistics reflected the degree to which an individual tester’s responses differed from the expected values. An infit mean square of less than 1.3 was considered to be acceptable consistency in use of item descriptions for scoring the TIMP.
- Rater severity: the Facets program allowed the testers to be calibrated along a continuum, which measured systematic bias in consistently scoring items higher or lower than other raters.
All testers needed to pass all three reliability criteria based on scoring the four videotaped tests before they could schedule an on-site reliability check with the project Training Coordinator or Principal Investigator. During the on-site review, three infants were tested in the presence of the investigator who used an item administration checklist to record correct/incorrect testing. Procedures and scores were reviewed for agreement.
The testers underwent a second reliability check, ie scoring videotapes of another four tests and passing the same reliability criteria at a point approximately halfway through the data collection process at each site. Sixty-four percent of the testers passed the reliability recheck on the first attempt. Data from those who needed retraining were not used until they had studied their scoring errors and successfully passed another videotape scoring test.
Data Reduction and Analysis
Means and standard deviations (SDs) were calculated for performance in each of the 12 age groups for the development of age standards on Version 5 of the TIMP. Multiple regression analysis was performed to assess possible significant differences in TIMP scores based on race/ethnicity, risk group, and sex.
The total sample obtained was 990 infants with a range of 67 to 97 per two-week age group. Thus, the smallest group (14 to 15 weeks AA) consisted of a sample 67% of the size intended. Reasons for failure to attain the planned sample size included loss of trained staff, end of funding, and difficulty identifying infants for the final hard-to-fill slots, eg, a low-risk infant of a specified ethnicity in the youngest age group, or inability to locate a child for testing who had been recruited while in the hospital for testing at a much later age. Testing some of the children necessitated home visits which were difficult to arrange, especially when subjects lived hours away from the hospital. Finally, results of a preliminary analysis of the data after recruitment of 600 subjects were compared to results with 990. Because the relevant data were virtually identical, the study was terminated.
The sample was composed of 517 boys (52%) and 473 girls (48%). The total sample was 58% white, 25% black, 15% Hispanic, 2.3% Asian, and 0.5% Native American. Thus, the plan to obtain a sample resembling the US distribution of race/ethnicity in LBW infants was substantially achieved. The far right column in Table 2 shows the range of N obtained across the 12 age groups for race/ethnicity. The total sample was 35% high risk, 30% medium risk, and 35% low risk. Table 3 shows the distribution of risk for each age group. Again, the distribution of risk is close to that intended with slightly more low risk infants at the upper ages and slightly more high risk infants at the lower ages.
Raw score means and SDs for each two-week age group (12 sets of age norms) are presented in Table 4. Means ranged from 49 (SD = 15, n = 86) for the 34- to 35-week PCA group to 120 (SD = 16, n = 81) for the 16- to 17-week post-term AA group. Thus there was no floor or ceiling effect. As would be expected, the SDs are slightly larger for the groups with smaller sample sizes (eg, 10 to 15 weeks’ post-term AA groups). Multiple regression analysis demonstrated that, as expected, age at testing was the most significant predictor (p < 0.0001) of the TIMP total raw score (R = 0.82, adjusted R² = 0.67, n = 990, p < 0.0001). The simple correlation between AA at testing and TIMP score was 0.81 (n = 990; p < 0.0001). There were no significant differences between the sexes in performance on the TIMP raw scores, but Latino/a infants (beta = −0.052; p = 0.006) and those infants at high medical risk for poor developmental outcome (beta = −0.133; p < 0.0001) did less well than all other groups; Fig. 1 illustrates the relationship between average TIMP raw score, age at testing in days, and risk group. Although infants at high risk scored less well on average than those with lower risk status, it is important to note that low scoring infants appeared in all risk groups as can be seen in the presentation of individual test results by age in Fig. 2. Thus developmental testing is important in the health care of all infants.
The results of this study provide age standards for use in clinical practice to diagnose developmental delay in infants at risk for poor outcome because of perinatal medical complications. Infants with many medical complications can be expected to score less well than infants with lower risk. Despite a statistically significant finding of lower performance in Latino/a subjects, the difference is not likely to be clinically significant. Furthermore, differences based on race/ethnicity have varied in our studies from no differences5 to better performance in black infants6 to, in the current study, poorer performance in Latino/a infants, and this variation deserves further research.
When compared with the previous performance of a small (and longitudinally assessed) Chicago-area sample containing only black, white, and Latino/a infants, only the mean of the 34- to 35-week age group changed by more than three points.3 The 34- to 35-week group mean increased by six points, and we assume the reason is that the national sample included more low-risk infants than our previous sample, thereby raising the overall mean. Because of the similarity of performance scores obtained in this project in comparison with earlier data, we do not believe that the somewhat smaller sample size than that originally planned presents a problem in using the obtained standards for making judgments regarding test results in infant clients.
Individual users of the TIMP need to establish their own policies regarding cutoffs for diagnosing delay and determining the appropriate course of management of tested infants. Factors to take into account might include frequency of routine follow-up, infant risk status, and resources available to both families and the clinical practice. The infants tested in this study were not followed for long-term outcome assessment. On the basis of previous research on predictive validity of the TIMP, however, we recommend a cutoff of 0.5 SD below the mean (−0.5 SD) for identifying infants who should be followed more closely or referred for intervention.7,8 Prediction of 12-month performance from testing at three months’ AA had a positive predictive value of 0.39 in a group with a 15% incidence of delayed development or cerebral palsy at 12 months, meaning that some children recover from poor TIMP performance in early infancy to do well at one year of age;7 further study of the same group of infants at four to five years of age (when the incidence of atypical performance was 27% as measured by the Peabody Developmental Motor Scales), however, revealed a positive predictive value of 0.75, suggesting that long-term outcome may be more accurately predicted.8 Thus, 75% of children who score less than −0.5 SD at three months of age can be expected to have a poor motor developmental outcome in a group with a 27% incidence of poor outcome. A cutoff of −0.5 SD at ages one and two months post-term also provided the best prediction of later developmental outcome (79% to 80% overall accuracy), but prediction was best (87% accuracy) at three months’ AA.8
TIMP users should be aware that the predictive values described were obtained in studies of other samples, not infants in the current study. Furthermore, these expectations may apply only if there is a similar prevalence of poor outcome in the clinical population of users because positive predictive values are influenced by the prevalence of the outcome of interest in the group.20 Study of institution-specific outcomes and testing accuracy is recommended.
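The dependence of positive predictive value on prevalence follows from Bayes' rule and can be shown with a brief Python sketch. The sensitivity and specificity values in the usage example are hypothetical and were chosen only to show the direction of the effect; they are not estimates from any TIMP study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule.

    PPV = true positives / (true positives + false positives),
    with each term expressed as a proportion of the whole group.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical test characteristics, for illustration only.
    sens, spec = 0.90, 0.80
    for prevalence in (0.05, 0.15, 0.27):
        ppv = positive_predictive_value(sens, spec, prevalence)
        print(f"prevalence {prevalence:.2f}: PPV {ppv:.2f}")
```

Holding sensitivity and specificity fixed, the computed value rises as prevalence rises, which is why predictive values observed in one clinical population may not carry over to a population with a different rate of poor outcome.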
In clinical use, the age standards can be used to derive a variety of measures used by different states and agencies to qualify children for early intervention services. As an example, consider a child with a history of meconium aspiration and seizures after a full-term birth who is tested at eight weeks, three days. Her TIMP raw score is 54. At eight weeks, a raw score of 54 can be transformed into a standard, or z score, by subtracting the mean for her age group (93) and dividing by the SD for the group (18),21 yielding a z score of −2.17. On the basis of a reported TIMP test-retest reliability over three days of 0.89,22 and an overall SD of 28 for this sample, the TIMP’s standard error of measurement (SEM) is nine points. As a result, the 95% confidence interval for this child’s score is 54 ± 18. Thus, at the outside, her true score could be as high as 72, but this value is still lower than −1.0 SD. Because her obtained score, more than two SDs below the mean, is far below −0.5 SD, we would recommend starting intervention rather than continued surveillance.
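The arithmetic in this example can be reproduced with a short Python sketch. It assumes the conventional formula for the standard error of measurement, SEM = SD × √(1 − reliability), and a 1.96 multiplier for the 95% confidence interval; the function names are illustrative. The article rounds the SEM to nine points before doubling it, so unrounded results differ slightly in the decimals.

```python
import math


def z_score(raw, group_mean, group_sd):
    """Standard score: distance from the age-group mean in SD units."""
    return (raw - group_mean) / group_sd


def standard_error_of_measurement(sample_sd, reliability):
    """Conventional SEM formula: SD * sqrt(1 - reliability)."""
    return sample_sd * math.sqrt(1.0 - reliability)


def confidence_interval(raw, sem, z_crit=1.96):
    """Confidence band around an obtained score (95% by default)."""
    half_width = z_crit * sem
    return raw - half_width, raw + half_width


def below_cutoff(z, cutoff=-0.5):
    """True when a z score falls below the recommended -0.5 SD cutoff."""
    return z < cutoff


if __name__ == "__main__":
    # Values taken from the worked example in the text.
    z = z_score(raw=54, group_mean=93, group_sd=18)
    sem = standard_error_of_measurement(sample_sd=28, reliability=0.89)
    low, high = confidence_interval(54, round(sem))
    print(f"z = {z:.2f}, SEM = {sem:.1f}, 95% CI = {low:.0f} to {high:.0f}")
    print("below cutoff:", below_cutoff(z))
```

The upper limit of the interval can then be compared with the score lying 1.0 SD below the age-group mean (93 − 18 = 75) to confirm the conclusion drawn in the text.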
Should an age equivalent be desired, one can look up the age group for which the child’s obtained score of 54 is nearest to the mean.21 In this case, the score of 54 is closest to the mean of the 36- to 37-week group (54) and her age equivalent score is, therefore, considered to be 37 weeks’ PCA. Score sheets for plotting infants’ scores against percentile ranks derived from the performance of the infants in this study are also available from the publisher (ie Infant Motor Performance Scales, LLC, 1301 W. Madison St. #526, Chicago, IL 60607-1953; www.thetimp.com). Finally, some agencies require that children have a certain percent delay in development to qualify for services. Her delay can be estimated by dividing her age equivalent of −3 weeks from term by her actual age of eight weeks post term, subtracting the result (keeping the minus sign in this case) from one, and multiplying by 100. This yields a percent delay of (1 − (−0.375)) × 100, or a 138% delay, ie she is not even performing at the level expected on her day of birth. Scores in standard deviation units (z scores) with confidence intervals are most appropriate for expressing performance in comparison with age peers,21 but the information presented here allows results to be calculated in alternative ways as required by individual agencies governing access to early intervention.
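The percent delay calculation can be sketched in Python as follows. The sketch assumes term is taken as 40 weeks’ PCA when converting an age equivalent to weeks from term; the function names are illustrative.

```python
def weeks_from_term(pca_weeks, term_weeks=40):
    """Convert postconceptional age to weeks relative to term.

    Negative values are before term. Term is assumed to be 40 weeks.
    """
    return pca_weeks - term_weeks


def percent_delay(age_equivalent_weeks, actual_age_weeks):
    """Percent delay = (1 - age equivalent / actual age) * 100.

    Both ages are in weeks relative to term; the sign of a pre-term
    age equivalent is kept, so delay can exceed 100%.
    """
    return (1.0 - age_equivalent_weeks / actual_age_weeks) * 100.0


if __name__ == "__main__":
    # Worked example from the text: age equivalent of 37 weeks' PCA,
    # actual age of eight weeks post term.
    equivalent = weeks_from_term(37)
    print(equivalent, percent_delay(equivalent, 8))
```

For the child in the example this gives 137.5, which the text reports as a 138% delay.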
In conclusion, age standards are now available from a US population-based sample of 990 LBW infants against whom TIMP scores from newly tested infants can be compared for clinical decision making. Because early diagnosis should advance the goal of improving outcomes, further research is recommended to address the responsiveness, eg clinically significant difference measurement,23 of TIMP scores to management under various service delivery options of infants at high risk for developmental delay and those with a variety of diagnoses.
We gratefully acknowledge the participation of therapists, nurses, and physicians in the following data collection centers: University of Chicago Hospital, Chicago, IL; Lutheran General Hospital, Park Ridge, IL; Children’s Hospital, Birmingham, AL; Children’s Hospital of Philadelphia, PA; Los Angeles County/University of Southern California/Good Samaritan Hospital, Los Angeles, CA; New England Medical Center, Boston, MA; University of Nebraska, Omaha, NE; Rainbow Babies and Children’s Hospital, Cleveland, OH; Sioux Valley Hospital, Sioux Falls, SD; Sacred Heart Hospital, Pensacola, FL; St. Vincent’s Hospital, Portland, OR; and Wake Medical Center, Raleigh, NC. We thank Patricia Byrne-Bowens for developing the modified POPRAS form and Richard T. Campbell for advice on sample description. We especially thank the families who allowed their babies to be tested to benefit other children in the future.
1. McHorney CA. Concepts and measurement of health status and health-related quality of life. In: Albrecht GL, Fitzpatrick R, Scrimshaw SC, eds. Handbook of Social Studies in Health and Medicine. London: Sage; 2000;339–358.
2. Campbell SK. The quest for measurement of infant motor performance. In: Refshauge K, Ada L, Ellis E, eds. Science-based Rehabilitation: Theories Into Practice. Philadelphia, PA: Butterworth Heinemann; 2005;49–65.
3. Campbell SK. The Test of Infant Motor Performance. Test User’s Manual. Version 2.0. Chicago, IL: Infant Motor Performance Scales, LLC; 2005.
4. Campbell SK, Wright BD, Linacre JM. Development of a functional movement scale for infants. J Appl Meas.
5. Campbell SK, Kolobe THA, Osten ET, et al. Construct validity of the Test of Infant Motor Performance. Phys Ther.
6. Campbell SK, Hedeker D. Validity of the Test of Infant Motor Performance for discriminating among infants with varying risk for poor motor outcome. J Pediatr.
7. Campbell SK, Kolobe THA, Wright BD, et al. Validity of the Test of Infant Motor Performance for prediction of 6-, 9-, and 12-month scores on the Alberta Infant Motor Scale. Dev Med Child Neurol.
8. Kolobe THA, Bulanda M, Susman L. Predicting motor outcome at preschool age for infants tested at 7, 30, 60, and 90 days after term age using the Test of Infant Motor Performance. Phys Ther.
9. Flegel J, Kolobe THA. Predictive validity of the Test of Infant Motor Performance as measured by the Bruininks-Oseretsky Test of Motor Proficiency at school age. Phys Ther.
10. Barbosa VM, Campbell SK, Sheftel D, et al. Longitudinal performance of infants with cerebral palsy on the Test of Infant Motor Performance and on the Alberta Infant Motor Scale. Phys Occup Ther in Pediatr.
11. Murney ME, Campbell SK. The ecological relevance of the Test of Infant Motor Performance elicited scale items. Phys Ther.
12. Girolami GL, Campbell SK. The efficacy of a neuro-developmental treatment program for improving motor control in preterm infants. Pediatr Phys Ther.
13. Lekskulchai R, Cole J. Effect of a developmental program on motor performance in infants born preterm. Aus J Physiother.
14. Centers for Disease Control and Prevention. Health, United States, 1998 with Socioeconomic Status and Health Chartbook. Hyattsville, MD: National Center for Health Statistics, US DHHS Publication Number (PHS) 1998; 98–1232.
15. Molfese VJ, Thomson B. Optimality versus complications: Assessing predictive values of perinatal scales. Child Dev.
16. Msall ME, Tremont MR. Measuring functional outcomes after prematurity: Developmental impact of very low birth weight and extremely low birth weight status on childhood disability. Ment Retard Dev Disabil Res Rev.
17. Bracewell M, Marlow N. Patterns of motor disability in very preterm children. Ment Retard Dev Disabil Res Rev.
18. Liao P-j M, Campbell SK. Comparison of two methods for teaching therapists to score the Test of Infant Motor Performance. Pediatr Phys Ther.
19. Linacre JM. FACETS: Computer Program for Many-faceted Rasch Measurement. Chicago, IL: MESA Press; 1998.
20. Fletcher RW, Fletcher SW. Clinical Epidemiology. The Essentials. 4th ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2005;35–58.
21. Anastasi A, Urbina S. Psychological Testing. 7th ed. Upper Saddle River, NJ: Prentice Hall; 1997;61.
22. Campbell SK. Test-retest reliability of the Test of Infant Motor Performance. Pediatr Phys Ther.
23. Haley S. Changes in interpreting clinical changes. III Step Conference Presentation, Linking Movement Science and Intervention, Salt Lake City, UT, July 19, 2005. Accessed February 11, 2006 from the DVD Conference Proceedings.