Liao, Pai-jun M.; Campbell, Suzann K.
The Alberta Infant Motor Scale (AIMS) is an observational measure of motor development for infants from birth to 18 months of age. 1 The primary purposes of the AIMS are to identify motor delays and to evaluate motor development over time. The test consists of 58 dichotomously scored items measuring spontaneous movements that reflect the quality of weight-bearing, posture, and antigravity skills in prone, supine, sitting, and standing positions. The lowest observed item and the highest observed item in each position create a window of motor development. The raw score for each position includes the number of observed items in the motor development window plus all the items below the window. The sum of the scores for the four positions is the total raw score, which can be converted to a percentile rank (PR) for comparison with age norms derived from a large population-based sample of Canadian infants from the province of Alberta.
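The window-based scoring rule can be sketched in a few lines. This is only an illustration of the rule as described above, not an implementation of the test; the item indices in the example are hypothetical.

```python
# Illustrative sketch of the AIMS positional raw-score rule:
# every item below the lowest observed item is credited as passed,
# and each item observed within the window also earns one point.

def position_raw_score(observed_items):
    """observed_items: 1-based indices of items observed in one position."""
    if not observed_items:
        return 0
    lowest = min(observed_items)
    credited_below = lowest - 1  # items below the window, credited automatically
    return credited_below + len(observed_items)

# Hypothetical example: prone items 5, 6, and 8 observed.
# Items 1-4 are credited, plus the 3 observed items.
print(position_raw_score({5, 6, 8}))  # 7
```

The total raw score is then the sum of the four positional scores, as noted above.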
When using the normative data to derive PR scores, if children maintain a similar level of motor performance relative to age peers over repeated assessments across time, one would expect their PR to remain constant. 2 The PR converted from the raw score on the AIMS, however, can change considerably with a difference of only one point on the raw score. 3 Furthermore, a longitudinal study of infant performance on the AIMS revealed that PRs varied from age to age with no systematic pattern within individual infants. 4 Another study found that only 50% of the infants who scored below the 10th percentile at six months remained poor performers at 12 months of age, 5 and only two of the seven children who had a later diagnosis of cerebral palsy had AIMS scores below the 10th PR at three months of age. 6 Although Darrah et al 4 concluded that the rate of emergence of gross motor development is not stable in infants with normal development, other explanations of the instability in PR found in these studies are also possible. An alternative explanation may be related to problems with scaling of AIMS items. In this article, the psychometric scaling properties of the AIMS as a measurement device are examined.
Variability in standardized test scores over time must be explained in order to make appropriate decisions about assessment utility. Several factors can contribute to such variability. First, there may be uneven rates of change in the performance that is being measured. Second, unreliability of the raters can be a factor. Finally, the nature of the measurement device itself can contribute to score variability. 7 Ordinally scaled items are used in most motor development assessment tools. Linear change between scores is a usual assumption when traditional statistical methods are used to analyze test scores, but the distance or effort needed to jump from one category to another usually is not equal in ordinal scales. 8 Comparisons across scores at different levels of ability are not possible if the changes between scores are not linear. For example, an increase of one point on a 10-point scale can represent different amounts of improvement at different parts of the scoring scale: it might be more difficult for a person to improve from 9 to 10 than from 4 to 5.
Raw scores from ordinal items can be converted into interval measures by using Rasch analysis. 9 The linear transformation of raw scores into interval measures is represented by taking the logarithm of the probability of a person passing an item divided by the probability of failing that item. 10 The probabilistic result is composed of several components, including the ability level of the person tested, the difficulty level of an item, the severity of a rater, and the structure of the rating scale. Rasch analysis can then generate calibrations of item difficulty and of persons’ ability along a linear measurement continuum in which the difference between categories is the same at any level of ability, ie, an interval-level scale. Rasch analysis can also be used to ascertain the hierarchical structure of item difficulty based on the continuum as well as to identify areas of the scale where precision is poor because items of the appropriate difficulty level are lacking. The purpose of this study was to examine whether the variability in PRs across repeated assessments of infants on the AIMS could arise from the nature of its scaling properties.
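The log-odds transformation at the heart of the Rasch model can be illustrated in a few lines. This is a generic sketch of the dichotomous model, not the estimation procedure of any particular software; the `ability` and `difficulty` values are hypothetical logits.

```python
import math

def rasch_pass_probability(ability, difficulty):
    """Dichotomous Rasch model: probability of passing an item,
    given person ability and item difficulty in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    """Logarithm of P(pass) / P(fail), the transformation referred to above."""
    return math.log(p / (1.0 - p))

# The log-odds of passing recovers ability minus difficulty, which is
# why differences on the transformed scale are interval-level:
p = rasch_pass_probability(1.5, 0.5)
print(round(log_odds(p), 6))  # 1.0
```

When ability equals difficulty, the pass probability is exactly 0.5, which is why an item's difficulty is defined as the ability level at which passing and failing are equally likely.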
A convenience sample of 97 infants who were tested as part of a longitudinal study of various aspects of the reliability and validity of the Test of Infant Motor Performance 5,11–13 was recruited from three hospitals or the community in the Chicago metropolitan area. The distribution of race/ethnicity was 38% white non-Hispanic infants, 33% black African-American or African infants, 26% Hispanic infants (primarily Mexican and Puerto Rican), 1% Asian, and 2% mixed race. The Problem-Oriented Perinatal Risk Assessment System (POPRAS) 14,15 was used to classify the infants into three different groups based on risk for poor developmental outcome. The scores on the POPRAS ranged from 1 to 225 (mean = 102.4; standard deviation = 69.9). Detailed descriptions of the subjects can be found in previously published articles. 5,12
Parental consent was obtained for each subject before testing. Each infant was tested on the AIMS at three, six, nine, and 12 months (age corrected for prematurity when necessary). Seventy of the 97 (73%) infants completed all four AIMS tests. Others had less than four tests due to (1) health condition on the test date, (2) family scheduling conflicts, or (3) study attrition because the family moved away or the family could not be located on the test date. Table 1 shows the number of infants at each test age by risk and gender.
Twelve testers (physical therapists or occupational therapists) participated in data collection. All testers had been trained to be reliable in scoring the AIMS. The training procedures included a one-day workshop presented by one of the AIMS developers and scoring of one live and one videotaped AIMS performance. Scores from each therapist were then analyzed with Facets (Mesa Press, Chicago, IL), a software program for Rasch psychometric analysis. All testers had fewer than 5% misfitting ratings and acceptable fit statistics, which indicated that the raters used the scales consistently. Rater unreliability as a possible source of scoring variability, therefore, was controlled in this study.
The AIMS items were coded according to the chronologic sequence presented in the test manual for each testing position, ie, the first item in prone position was coded as PN01, supine as SU01, sitting as ST01, and standing as SD01. The score for each item of the AIMS (0 = not observed, 1 = observed) was entered into the Statistical Package for the Social Sciences (SPSS, Inc., Chicago, IL), then transformed into an ASCII file for analysis with the Rasch psychometric analysis program, BIGSTEPS (version 2.73) (Mesa Press). According to the scoring instructions for the AIMS, all items below the infant’s current window of performance are scored as 1, and this correction was followed in this study.
Analysis of item measures was used to assess the hierarchical structure and the range and distribution of difficulty of items on the AIMS (mean item difficulty set to 50, with one logit equal to 1 point). The continuity of item difficulty can be examined by performing a t test between successive pairs of items along the logit scale. 16 A gap in the item difficulty measure, defined as a significant t test for the difference between the measures of two successive items, is evidence of discontinuity in the items. The presence of a gap means that no items are available to differentiate infants at that ability level, ie, precision and discrimination capability are poor there. Where a gap exists, more ability is required to pass the next item and thus gain one point in the raw score.
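The successive-item t test can be sketched as follows. The difficulty measures and standard errors in the example are hypothetical; the formula, the difference between two measures divided by their pooled standard error, is the standard two-measure comparison described by Bond and Fox. 16

```python
import math

def gap_t(d1, se1, d2, se2):
    """Approximate t statistic for the difference between two successive
    item difficulty measures d1 and d2 with standard errors se1 and se2."""
    return (d2 - d1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical successive standing items: a t value well beyond ~2
# flags a gap, ie, no items of intermediate difficulty between them.
print(round(gap_t(62.0, 1.0, 68.0, 1.2), 2))  # 3.84
```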
Fit statistics were used to identify items that misfit the Rasch model. 17 The infit statistic is the weighted mean square residual difference between observed and expected values in the inlying range (items with difficulties similar to the ability level of the subject). The outfit statistic is the unweighted mean square residual that better reflects outlying deviation from expected values (items far from the subject’s ability level). The expected value of fit statistics is 1.0 with possible values ranging from 0 to infinity. 18 The further away from the expected value the fit statistic is, the more the item misfits the measurement model. The reasonable range of the fit statistics for rating scales is 0.6–1.4. 18
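The two fit statistics can be computed from squared residuals as sketched below. This is the generic mean-square formulation for dichotomous data; the observed scores and model probabilities in the example are made up for illustration.

```python
def fit_mean_squares(observed, expected):
    """Infit and outfit mean squares for one item across persons.
    observed: 0/1 scores; expected: Rasch model pass probabilities."""
    variances = [p * (1 - p) for p in expected]  # binomial information
    sq_resid = [(x - p) ** 2 for x, p in zip(observed, expected)]
    # Infit: information-weighted mean square, most sensitive to misfit
    # among persons whose ability is near the item difficulty.
    infit = sum(sq_resid) / sum(variances)
    # Outfit: unweighted mean of squared standardized residuals, most
    # sensitive to unexpected responses far from the item difficulty.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(observed)
    return infit, outfit

# Made-up data: four infants' 0/1 scores and model-expected probabilities.
infit, outfit = fit_mean_squares([1, 0, 1, 1], [0.9, 0.2, 0.6, 0.7])
print(round(infit, 2), round(outfit, 2))  # 0.43 0.36
```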
The range of the PRs obtained by infants on the AIMS was 1% to 99% at three, six, and nine months and 1% to 90% at 12 months. Because Rasch analysis requires variability in scores, the 11 tests with perfect scores (ie, passing all items for a total raw score of 58) were deleted from the analysis. Only one infant in this study was diagnosed with cerebral palsy (CP) at nine months of age, but nine more infants were diagnosed with CP when followed for a longer period of time. The scores for these 10 infants were discarded from the analysis because the AIMS is not appropriate for following motor development over time in infants with motor disorder diagnoses that result in abnormal movement patterns. 1 The final analysis included 299 tests from the 87 infants without CP and all 58 items of the AIMS.
The mean Rasch ability measure for infants was 51.05 with a standard deviation of 8.28 in nonextreme scoring infants, ie, excluding those with perfect scores (Table 2). Overall infit mean square was 0.92 and outfit mean square was 0.63 for these infants. The person separation index was 7.05 with a reliability of 0.98, indicating that approximately seven levels of infant ability were discriminated across the age range from three to 12 months. The mean Rasch measure for item difficulty (as set for the analysis) was 50.00 with a standard deviation of 8.47 for the 58 AIMS items. Overall infit mean square was 0.96 and outfit mean square was 1.30. The item separation index was 18.88 with a reliability of 1.00. The close fit of the infant mean (51) to the item difficulty mean of 50 indicates that the items overall were well centered on the average ability of the sample.
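The separation and reliability figures reported above are linked by a fixed relation, R = G²/(1 + G²). The sketch below is only an arithmetic check using the reported person separation of 7.05.

```python
def reliability_from_separation(g):
    """Rasch separation-reliability relation: R = G^2 / (1 + G^2)."""
    return g ** 2 / (1 + g ** 2)

# The reported person separation of 7.05 implies the reported
# person reliability of 0.98:
print(round(reliability_from_separation(7.05), 2))  # 0.98
```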
Hierarchical Structure of Items
Table 3 presents detailed measures and fit statistics for all items. This table provides information about the hierarchical structure of the assessment. Item difficulty should increase gradually within each subscale, ie, items later in the sequence should have larger measures, indicating increasing difficulty. The items of the AIMS followed this rule. Each asterisk in Figure 1 indicates the difficulty of an item on the 0–100 ruler (the horizontal bar, which represents the infant’s ability level). The distance between 0 (on the ability scale) and the asterisk indicates the amount of ability needed to pass the item, ie, the item difficulty. If an increase of one point in the total raw score required the same amount of ability regardless of whether an infant is high or low in ability, the asterisks in Figure 1 would form a line for which X = Y. The nonlinear curve in Figure 1 suggests instead that a one-point increase in total score requires different amounts of ability at different ability levels.
PN01 does not appear in the analysis because every infant passed the item. Figure 2 demonstrates how items of the AIMS and the infants in this study spread out along the same continuum. The ability level (raw scores) of infants is based on how many items were passed on the AIMS, with the lowest possible score reflecting failure on every item and the highest being a pass on every item. Every infant’s ability is converted to a measure, ie, 0–100 in this case, and infants are represented as # or · on the left side of the figure. The difficulty level of the AIMS items is displayed next to the infants along the same ruler. Ideally, items would be distributed rather equally across the range of subject ability, yielding high precision across levels of ability with no gaps and no floor or ceiling effect. A ceiling effect exists with several infants at the top of the ability scale where no more items are available to differentiate among their ability levels, in addition to the 11 infants with perfect scores who were excluded from the analysis.
Precision at Different Ability Levels
The average item difficulty levels ranged from 35 to 75 with a few gaps along the measurement continuum. The difficulty levels in the middle ranges often had a few items at the same level, whereas only one or two items at the same ability level are found toward the two ends of the measurement continuum, as demonstrated in Figure 2. After SD09 (controlled lowering through standing), only standing items are available to assess an infant’s ability levels and the difficulty levels are widely spaced.
The arrows in Figure 2 indicate gaps between item difficulties. PN01 was dropped because every infant passed it, which indicated that the item was too easy for these infants, a not unexpected finding because all subjects were at least three months of age. Gaps exist among the eight most difficult standing items as well: controlled lowering through standing (SD09), cruising with rotation (SD10), stands alone (SD11), early stepping (SD12), standing from modified squat (SD13), standing from quadruped position (SD14), walks alone (SD15), and squat (SD16). No items at all are available for discriminating among the most competent 12-month-old infants.
In addition to gaps in the item difficulty measures, some individual items showed poor fit to the psychometric model in which only more able infants are expected to pass the more difficult items, whereas less able infants should pass only easier items. The expected value of infit and outfit mean square statistics (infit and outfit MNSQ in Table 3) is 1. A criterion value of 1.4 was used for a rating scale to judge whether an item misfit the model. 18 Four items had infit mean square values greater than 1.4, indicating noise in the data: PN08 rolling prone to supine without rotation, SU04 supine lying, SU05 hands to knees, and SD03 supported standing. Nine items had outfit mean square values greater than 1.4, including PN08 rolling prone to supine without rotation, PN14 propped side-lying, SU05 hands to knees, SU09 rolling supine to prone with rotation, ST10 sitting to prone, ST12 sitting without arm support, SD03 supported standing, SD11 stands alone, and SD12 early stepping.
A systematic hierarchy of item difficulty was found that was consistent with the order of items on the test scoring form. The AIMS items in each test position are arranged by difficulty level: the measures from the Rasch analysis consistently increase as the item number increases, so that infants passing higher-numbered items within each position sequence have higher ability levels than infants passing only lower-numbered items. As a result, this study provides evidence of the validity of using the AIMS both to assess overall motor ability in infants and to evaluate skills in different positions in space.
A ceiling effect existed in this analysis of longitudinal test results from infants ranging in age from three to 12 months corrected age. Few items are available to differentiate among infants whose ability level is at the top of the ability continuum where the items are spaced widely apart in difficulty.
A second purpose of this study was to explore the possibility that the AIMS might have measurement properties that could explain the results of studies showing unstable longitudinal results. Similarly, Coryell et al 19 found instability in motor scores on the Bayley Motor Scale across the first year and indicated that limitations of the assessment itself might contribute to such performance instability. The Rasch analysis performed in this study revealed discontinuity of item difficulties on the AIMS. Gaps exist at several difficulty levels, which indicates that a large jump in ability is required to pass one more item around a gap. The gaps occur among the various standing items, beginning with items expected to be passed by infants about nine months old. This finding is consistent with the report of Bartlett 20 that infants who scored low on the AIMS at 10 months of age would not necessarily score low on the AIMS at 15 months or on the Peabody Developmental Motor Scale at 18 months. A possible explanation is that the precision of measurement beyond nine months of age is decreased by these gaps. An infant who fails to pass one more item around a gap may have motor ability close to that of the passed items, close to that of the failed item, or anywhere in between; the true ability level of such an infant cannot be revealed because of the lack of AIMS items at this level.
Coster 3 pointed out that one point of change in raw scores on the AIMS could result in a large change in PR in early infancy. A one-point difference in total raw score from 6 to 7 at one month of age leads to a change in percentile rank from the 25th to the 43rd, whereas a one-point change in total raw score from 40 to 41 at eight months of age produces a change in percentile rank from the 51st to the 56th. Furthermore, Fetters and Tronick 23 found that the AIMS at seven months of age yields better sensitivity and specificity for the prediction of scores on the Peabody Gross Motor Scale at 15 months than does the AIMS at four months of age. The present study did not find gaps at the lower difficulty levels, but the infants were first tested on the AIMS at three months of age, and all infants passed the first prone item even though the PRs ranged from 1% to 99%. Further investigation is necessary to determine whether the AIMS items are adequate to precisely measure motor ability in infants younger than three months of age.
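The unevenness of the raw-score-to-PR conversion is easy to see from the two pairs quoted above from Coster. 3 The numbers below are taken directly from the text; the mapping itself comes from the AIMS norm tables.

```python
# One-point raw-score changes and the PR jumps they produce,
# as reported in the text (from the AIMS norm tables).
jumps = [
    ("1 month", (6, 25), (7, 43)),    # raw 6 -> 7
    ("8 months", (40, 51), (41, 56)), # raw 40 -> 41
]
for age, (r1, p1), (r2, p2) in jumps:
    print(f"{age}: raw {r1}->{r2} shifts PR by {p2 - p1} points")
```

The same one-point gain in raw score moves the PR by 18 points at one month but only 5 points at eight months, which is the nonlinearity at issue.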
Some items might be added to fill the gaps beyond the difficulty level of independent standing, such as crawling upstairs, flinging a ball with extensor thrust of the arm in standing, walking upstairs/downstairs/backward, crawling backward to go downstairs, or kicking a ball in standing. Alternatively, other gross motor assessments, such as the Peabody Developmental Motor Scales, 24 the Bayley Scales of Infant Development II, 25 or the Bayley Infant Neurodevelopmental Screener, 26 which are designed to measure motor development up to 72, 36, and 24 months, respectively, and contain more standing items than the AIMS, can be used to document changes after a child passes the “controlled lowering through standing” item. A one-time assessment on the AIMS from about nine months of age on cannot be used as the sole basis for clinical impressions except for infants with delayed motor development who are not yet standing.
The AIMS items aggregate in the middle range of the difficulty levels, whereas fewer items are available toward the two ends of the measurement continuum. Only standing items are available for testing after an infant can lower him- or herself from standing in a controlled manner. This indicates that the AIMS is sufficiently precise to discriminate among infants whose ability levels are in the middle range but not at the higher end, ie, after achievement of controlled lowering through standing (SD09). Darrah et al 21 found that using the 10th percentile cutoff point at four months and the fifth percentile at eight months yielded high sensitivity and specificity for predicting the pediatrician’s assessment at 18 months. Another study found that the month-to-month correlations between PRs were strongest between 7.5 and 8.5 months and were unstable before 5.5 months. 4 These results accord with our findings, suggesting that the AIMS is most accurate in the mid-ability range. Another study found only a moderate correlation (0.51) between AIMS raw scores at six and 12 months. 22 Variation in scores is necessary when calculating correlation coefficients 8 because decreased variability will decrease the correlation. The variation of AIMS scores is limited at the higher end of the scale because only a few standing items are available for assessment at 12 months, which will compromise any attempt to explore the correlation between scores derived at this age. Small score variance might also be the reason for lower correlations between three months of age and later ages because only a few items are available for use before three months.
Ten items with high infit or outfit values (or both) did not fit the Rasch model; ie, their responses violated the expectation that infants with higher ability levels pass and infants with lower ability levels fail. Several factors can contribute to item misfit: (1) the item may measure a different construct than intended by the test’s authors, (2) the item may be hard to observe (it must be facilitated or appears only for a short period of time), (3) some infants never achieve all three criteria needed to pass the item, (4) some infants do not experience this developmental stage or develop alternative motor patterns, or (5) testers cannot rate the item reliably (they do not understand the item or do not use consistent scoring criteria). The testers in this study had been trained to be reliable and consistent raters, thus eliminating the testers as likely sources of misfit. A review of the component analysis from the Rasch output suggests that the items belong to one construct. As a result, items that are difficult to observe or the skipping of particular motor milestones by some infants might contribute to high infit/outfit values on the AIMS. Bartlett 20 speculated that low scores on the AIMS at 10 months could be explained by infants not crawling or using alternative motor patterns to move around. The crawling items, however, were not among the misfitting items in this study.
Large infit and outfit statistics have different meanings. For items with infit misfit, erratic responses occur in infants whose ability levels are near the item difficulty level. It is hard to know whether some infants skip the misfitting items or whether results are affected by the fact that items below the performance window are automatically credited as passing, according to the AIMS scoring criteria. For items with outfit misfit, erratic responses occur in infants whose ability levels are higher or lower than the item difficulty levels, ie, infants with higher ability levels failed some easier items or infants with lower ability levels passed more difficult items. For example, some infants whose ability levels were much lower than the difficulty levels of SD11 and SD12 passed these two items. This phenomenon might be related to movement experience. These infants might have been exposed to the standing position or playing in the standing position more than other positions, leading to precocious performance on these items.
The erratic responses could not be related to specific groups of infants (eg, premature infants vs infants born full-term); therefore, the misfitting items should be revised or deleted from the AIMS. An analysis without the misfitting items revealed a mean infit of 0.99 and a mean outfit of 0.46, and every remaining item had infit and outfit values within the expected range. A new gap appeared, however, when SU04, SU05, and SD03 were deleted. These three misfitting items might also contribute to the instability in scores in early infancy because unreliable items cannot discriminate infants’ abilities properly.
This study examined the structure of the AIMS by using Rasch analysis. A ceiling effect exists in this sample, and only a few items are available for testing in the early months. Although the hierarchical nature of the items in each testing position was confirmed, the precision of measurement for older infants is decreased because only widely spaced standing items are available after an infant passes the controlled lowering through standing item (the ninth standing item). Although it is possible that testing infants at three-month intervals up to 12 months affected the results, we do not believe that the AIMS is suitable for documenting motor developmental change once an infant can lower him- or herself in a controlled manner from a standing position (SD09). After the age of about nine to 10 months, we suggest use of other standardized tests unless the infant being tested is not yet standing.
Subjects were recruited at the University of Illinois at Chicago Medical Center, the University of Chicago Hospitals, and Lutheran General Hospital. The authors thank Dolores Schorr, Pat Byrne-Bowens, Dawn Kuerschner, Carrie Ryan, and Kathy Tolzien for assistance in recruiting subjects; Elizabeth Branenn, Mary Carter, Judy Flegel, LouAnn Gouker, Pamela Klaska, Thubi Kolobe, Maureen Lenke, Gail Liberg, Elizabeth Osten, Jennifer Padek, Celina Wise and Laura Zawacki for testing infants; and the participation of all the infants and their families, without whom this work would not have been possible.
1. Piper MC, Darrah J. Motor Assessment of the Developing Infant. Philadelphia: WB Saunders; 1994.
2. Piper M. Theoretical foundations for physical therapy assessment in early infancy. In: Wilhelm I, ed. Physical Therapy Assessment in Early Infancy. New York: Churchill Livingstone; 1993: 1–12.
3. Coster W. Critique of the Alberta Infant Motor Scale (AIMS). Phys Occup Ther Pediatr. 1995; 15: 53–69.
4. Darrah J, Redfern L, Maguire TO, et al. Intra-individual stability of rate of gross motor development in full-term infants. Early Hum Dev. 1998; 52: 169–179.
5. Campbell S, Kolobe T, Wright BD, et al. Validity of the Test of Infant Motor Performance for prediction of 6-, 9-, and 12-month scores on the Alberta Infant Motor Scale. Dev Med Child Neurol. 2002; 44: 263–272.
6. Barbosa VM, Campbell SK, Sheftel D, et al. Longitudinal performance of infants with cerebral palsy on the Test of Infant Motor Performance and on the Alberta Infant Motor Scale. Phys Occup Ther Pediatr. 2003; 23(3): 7–29.
7. Plewis I, Bax M. The uses and abuses of reliability measures in developmental medicine. Dev Med Child Neurol. 1982; 24: 388–390.
8. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. Upper Saddle River, NJ: Prentice Hall; 2000.
9. Velozo CA, Kielhofner G, Lai J. The use of Rasch analysis to produce scale-free measurement of functional ability. Am J Occup Ther. 1999; 53: 83–90.
10. Wright BD, Masters GN. Rating Scale Analysis: Rasch Measurement. Chicago: MESA Press; 1982.
11. Campbell SK. Test-retest reliability of the Test of Infant Motor Performance. Pediatr Phys Ther. 1999; 11: 60–66.
12. Campbell SK, Kolobe THA. Concurrent validity of the Test of Infant Motor Performance with the Alberta Infant Motor Scale. Pediatr Phys Ther. 2000; 12: 1–8.
13. Campbell SK, Hedeker D. Validity of the Test of Infant Motor Performance for discriminating among infants with varying risk for poor motor outcome. J Pediatr. 2001; 139: 546–551.
14. Davidson EC, Hobel CJ. POPRAS: A Guide to Using the Perinatal, Intrapartum, Postpartum Record. Torrance, CA: South Bay Regional Perinatal Project Professional Staff Association; 1978.
15. Ross MG, Hobel CJ, Bragonier JR, et al. A simplified risk-scoring system for prematurity. Am J Perinatol. 1986; 3: 339–344.
16. Bond TG, Fox CM. Applying the Rasch Model: Fundamental Measurement in the Human Sciences. Mahwah, NJ: Lawrence Erlbaum Associates; 2001.
17. Lunz ME, Wright BD, Linacre JM. Measuring the impact of judge severity on examination scores. Appl Measure Educ. 1990; 3: 331–345.
18. Wright BD, Linacre JM. Reasonable mean-square fit values. Rasch Measure Trans. 1994; 8: 370.
19. Coryell J, Provost B, Wilhelm IJ, et al. Stability of Bayley Motor Scale scores in the first year of life. Phys Ther. 1989; 69: 834–841.
20. Bartlett D. Comparison of 15-month motor and 18-month neurological outcomes of term infants with and without motor delays at 10-months-of-age. Phys Occup Ther Pediatr. 2000; 19: 61–71.
21. Darrah J, Piper M, Watt MJ. Assessment of gross motor skills of at-risk infants: predictive validity of the Alberta Infant Motor Scale. Dev Med Child Neurol. 1998; 40: 485–491.
22. Jeng SF, Yau KIT, Chen LC, et al. Alberta Infant Motor Scale: reliability and validity when used on preterm infants in Taiwan. Phys Ther. 2000; 80: 168–178.
23. Fetters L, Tronick EZ. Discriminate power of the Alberta Infant Motor Scale and the Movement Assessment of Infants for prediction of Peabody Gross Motor Scale scores of infants exposed in utero to cocaine. Pediatr Phys Ther. 2000; 12: 16–23.
24. Folio MR, Fewell RR. Peabody Developmental Motor Scales and Activity Cards Manual. Allen, TX: DLM Teaching Resources; 1983.
25. Bayley N. Bayley Scales of Infant Development. 2nd ed. San Antonio, TX: The Psychological Corporation; 1993.
26. Aylward GP. The Bayley Infant Neurodevelopmental Screener. San Antonio, TX: The Psychological Corporation; 1995.
© 2004 Lippincott Williams & Wilkins, Inc.