Interrater Reliability of Early Intervention Providers Scoring the Alberta Infant Motor Scale

Blanchard, Y. ScD, PT; Neilan, E. MSPT; Busanich, J. MSPT; Garavuso, L. MSPT; Klimas, D. MSPT

doi: 10.1097/01.PEP.0000113272.34023.56


Physical and occupational therapists assess infants who are at risk of motor dysfunction and who may not yet have a diagnosis of an impairment or disability. Clearly, one of the goals of therapists involved in the evaluation of infants who are at risk is distinguishing those infants who will demonstrate lasting motor difficulties from those who will not. Over the past decade, therapists have been involved in the development of motor tests that focus on the assessment of qualitative and functional dimensions of movement rather than on the acquisition of a single motor skill. These include assessment tools such as the Alberta Infant Motor Scale, 1 the Toddler and Infant Motor Evaluation, 2 the Test of Infant Motor Performance, 3 the Gross Motor Function Measure, 4 and the Pediatric Evaluation of Disability Inventory. 5 Of particular interest in this study is the Alberta Infant Motor Scale (AIMS). The AIMS is a performance-based, norm-referenced, observational tool for the motor assessment of the developing infant from birth until independent walking (between the ages of zero and 18 months). 6 The AIMS has shown high concurrent validity (r = 0.97–0.99) with the Bayley Scales of Infant Development 7 and the Peabody Developmental Motor Scales. 8 The interrater reliability between therapists throughout all ages was also found to be high, ranging from 0.96 to 0.99. 6 Two cutoff points that provide the best sensitivity and specificity in predicting abnormal motor development have been identified as the 10th percentile at four months and the fifth percentile at eight months. 9 Thus, infants with an AIMS score below the 10th percentile at the age of four months or below the fifth percentile at eight months would be considered at risk of long-term abnormal motor development. It is recommended that such infants be referred to pediatric neurology for further diagnostic testing to determine the cause of the abnormal motor development.

The authors of the AIMS state that it was designed to be used by any health professional with a background in infant motor development and an understanding of the essential components of movement as described for each of its items. 1 In her critique of the AIMS, Coster 10 suggested that the authors of the AIMS should specify the level of skill required to administer the AIMS. She suggests that the clear depiction of the items might easily deceive a less experienced clinician into thinking that the scoring should be straightforward. 10 In response to this critique, Piper and Darrah 11 argue that because the underlying concepts of the AIMS (weight-bearing, posture, and antigravity movement) are familiar to pediatric therapists, it is not necessary for them to be certified to administer the AIMS. A careful reading of the manual and familiarity with the item descriptors in each subscale should suffice. It is suggested that therapists interested in the use of the AIMS in the clinic or research could together score some infants and discuss any discrepancies until agreement is reached. The authors conclude by stating that other healthcare professionals should receive training from an occupational or physical therapist experienced in the administration of the AIMS. The duration and specifics of the training given remain unclear, but it is suggested that nontherapists would need more training.

In their original study conducted in 1992, Piper and Darrah 1 showed high interrater reliability (r > 0.96) on the AIMS, but this reliability was determined using six pediatric physical therapists who were experienced in infant motor assessment and trained by the authors on the administration of the AIMS. Similar reliabilities have been found in many different studies, including a study in Taiwan using the AIMS on infants born preterm. Six physical therapists showed high interrater reliability with intraclass correlation coefficients (ICCs) ranging from 0.97 to 0.99, showing that the AIMS is reliable between raters and across cultures. 12 Because of its proven validity and reliability as well as the uniqueness of its normative data, the AIMS has been used in research. In these studies, therapists with varying amounts of clinical experience were either not trained or given a 32-hour training session. 12–14

The most extensive training reported in the literature was in the previously mentioned study of reliability and validity of the AIMS when used on infants born preterm in Taiwan. 12 Six therapists, who had no experience with the AIMS, took a 32-hour training session in understanding the theories of motor development and administration and scoring of the AIMS. The instruction included demonstrations and rating criteria. The interrater and intrarater reliability of these therapists was high with ICCs of 0.97–0.99. However, as Jeng et al 12 state, these results may not reflect those obtained by therapists in general practice due to the high level of training provided in this study. In a study that examined the intraindividual stability of the rate of gross motor development in infants born full-term using the AIMS as an assessment tool, three pairs of pediatric therapists were trained in the administration of the AIMS prior to the study and had to attend monthly training sessions during which all six therapists together assessed an infant and discussed the scoring protocol. 13 The interrater reliability for these assessments was high, with an ICC of >0.99 for each pair of trained therapists.

Based on these findings, it can be concluded that with some form of training, therapists are able to achieve high levels of reliability on the AIMS. Everyone would agree that it is critical that these levels of reliability be reached in the context of research involving the use of the AIMS. However, another area of practice in which the AIMS can play a significant role is in the field of early intervention where clinicians from diverse professional backgrounds are asked to evaluate infants who are at risk of developmental delay and make recommendations for eligibility for services and offer guidance to families. Little is known of how effective the instructions provided in the manual are in ensuring an acceptable level of reliability between clinicians who would simply buy the manual at their local bookstore and be without the full support of academics or expert clinicians formally trained on the AIMS. The purpose of this study was to examine the interrater reliability on the AIMS of early intervention providers (EIPs) from diverse professional backgrounds and to examine whether training in the administration of the AIMS would improve their interrater reliability scores.



Eight EIPs were recruited from the greater Hartford, CT, area. To be eligible to participate, each professional had to have a minimum of three years of experience in early intervention and have no prior experience with the AIMS. The raters were recruited by one of the authors (Y.B.) through direct contact with agencies providing service to children aged birth to three years.



The AIMS is made up of 58 motor items that examine motor development in four different positions: prone, supine, sitting, and standing. 1 To pass an item, the infant must demonstrate key motor descriptors associated with the item. These include aspects of weight-bearing, posture alignment, and antigravity movement. Between 20 and 30 minutes is required to administer and score the entire assessment. The test is designed to document motor behavior elicited by the parent, examiner, and age-appropriate toys with only minimal handling of the infant. For each of the four positions, the least and most mature items observed during the assessment are identified and scored “observed.” The items located between these two items on the AIMS represent the infant’s possible motor repertoire in that position or what is called his or her “window” of current motor skills. 1 All items within this window must then be scored as either “observed” or “not observed.” No item within that window may be credited on the basis of developmental assumptions or parental reporting. Items below the least mature item within the window are all credited as observed. One point is given for each item observed, allowing for a total raw score to be determined by counting the numbers of observed items. Percentile rankings are then determined using the total raw score and the infant’s age.
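The window rule described above can be illustrated with a minimal sketch. It assumes that the items of one position are listed from least to most mature and that each is marked as seen or not seen during the session; the function name `subscale_raw_score` and the example item pattern are hypothetical and are not taken from the AIMS manual.

```python
from typing import Sequence


def subscale_raw_score(observed: Sequence[bool]) -> int:
    """Raw score for one position, items ordered from least to most mature.

    observed[i] is True when item i was seen during the session.
    """
    seen = [i for i, flag in enumerate(observed) if flag]
    if not seen:
        # Assumption for this sketch: nothing observed means no credit.
        return 0
    window_start = seen[0]   # least mature item observed
    window_end = seen[-1]    # most mature item observed
    # Every item below the window is credited as observed.
    credited_below = window_start
    # Inside the window, only items actually observed earn a point.
    observed_in_window = sum(observed[window_start:window_end + 1])
    return credited_below + observed_in_window


# Hypothetical nine-item position: the window spans items 3 to 6, and
# item 4 lies inside the window but was not seen, so it earns no credit.
example = [False, False, False, True, False, True, True, False, False]
score = subscale_raw_score(example)
```

Under these assumptions, the total raw score is the sum of the four position scores. The sketch also shows why the placement of the least mature item matters: an unobserved item earns a point when it falls below the window but not when it falls inside it.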

Videotaped sessions.

Prior to the study, one of the authors (Y.B.) administered the AIMS monthly to six infants from birth until the age of independent walking. All the infants were healthy and of appropriate birth weight for their gestational age. Each session was conducted and taped at home with a parent present. Each infant was videotaped in minimal clothing, usually wearing just a diaper or undershirt and no shoes.

The authors reviewed the AIMS videotaped sessions of all six infants and chose 14 sessions to be included in the study. For a session to be included, complete administration of the AIMS and a full, unobstructed view of the infant had to be available. To be representative of a wide range of scores on the AIMS, videotaped sessions were selected for infants between the ages of zero and three months, four and six months, seven and nine months, and 10 and 12 months. The 14 sessions were randomly ordered and copied to tape 1 without audio. The audio was left out to eliminate any information about the infant’s age and performance that might come from discussions between the examiner and the infant’s parent. The 14 sessions of tape 1 were then copied on two separate tapes. Tape 2 contained the first seven sessions of tape 1, whereas tape 3 contained the last seven of the sessions of tape 1. The five authors scored the master tape independently, and interrater reliability was found to be high (ICC = 0.97). Once interrater reliability between the authors was determined, the authors reviewed the master tape as a group, and gold standard scores were determined for each videotaped session (AIMS total score and subscale score for prone, supine, sitting, and standing).



Eight EIPs (two physical therapists, three occupational therapists, two social workers, and one early intervention associate) volunteered to participate in the study and were randomly assigned to either Group 1 or 2. One participant was switched from Group 1 to Group 2 because she was unable to view tape 2 and complete the scoring of the babies before the training session because of a personal time conflict. Thus, Group 1 had three raters (1–3), and Group 2 had five (4–8).

Group 1.

The raters in Group 1 consisted of one physical therapist, one occupational therapist, and an early intervention associate. The authors met with the raters in Group 1 and gave them a copy of the AIMS manual, scoring sheets, and tape 2. The raters were told that the AIMS is an assessment tool commonly used to assess the motor development of young infants and that all the information needed to score the test could be found in the manual. It was recommended that they read the manual before viewing tape 2. They were asked to calculate only the raw scores (total and for each subscale) and were given 7–10 days to score tape 2. The raters were instructed to score the tape independently and not discuss scoring with anyone. Once all the raters completed their scoring of tape 2, they attended the training session (see below) at the same time as Group 2. At this time, raters in Group 1 returned their scoring sheets and tape 2. After the training session, raters in Group 1 were given tape 3 to score. Again, they had 7–10 days to score tape 3.

Group 2.

The raters in Group 2 consisted of two occupational therapists, two social workers, and one physical therapist. The raters in Group 2 attended the training session (see below) and were given a copy of the AIMS manual, scoring sheets, and tape 1 to score. The raters were instructed to score the tape independently and not to discuss scoring with anyone. The raters were asked to calculate only the raw scores (total and for each subscale) and were given 7–10 days to complete the scoring of tape 1.

Training session.

All raters attended the training session after the raters in Group 1 had completed the scoring of tape 2. The session lasted approximately 1.5 hours. It began with a discussion of the importance of reliable and valid motor assessment tools. The AIMS was introduced, and its 58 items and key descriptors were described in detail. The scoring of the AIMS was explained in great detail. For the purpose of this study, the raters were instructed to calculate only the raw scores (total and for each subscale). The raters then viewed three videotaped AIMS sessions and scored each one independently. Scoring of each session was discussed until agreement was reached on each item on whether it was observed or not observed. A fourth videotaped session was watched and scored independently by the raters and researchers. After all the raters completed the scoring, a discussion was held to address any items on which full agreement between raters was not reached.

Statistical Analysis

The interrater reliability before and after training was determined using the ICC. 15 The ICC(2,1) was chosen because the authors wished to generalize the findings of this reliability study to those of other raters (model 2) and used a single AIMS score as the unit of analysis to calculate the ICC. ICCs for Group 1 were determined for tape 2 (infants 1–7) and tape 3 (infants 8–14). ICCs for Group 2 were determined for tape 1 (infants 1–14) and also separately for infants 1–7 and infants 8–14 in order to compare the ICCs found between Groups 1 and 2.
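As an illustration of the statistic named above, the following is a minimal sketch of ICC(2,1), assuming the usual two-way random-effects, single-rating definition computed from the mean squares of a sessions-by-raters table. The function name `icc_2_1` and the example scores are hypothetical; this is not the authors' analysis code, and the example values are not study data.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, single rating.

    scores[i][j] is the score given to session i by rater j.
    Undefined (division by zero) if all scores are identical.
    """
    n = len(scores)      # number of sessions (subjects)
    k = len(scores[0])   # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                # between-sessions mean square
    jms = ss_cols / (k - 1)                # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Hypothetical total raw scores: four sessions scored by three raters.
hypothetical_scores = [
    [12, 13, 12],
    [25, 24, 26],
    [38, 38, 37],
    [51, 50, 52],
]
icc = icc_2_1(hypothetical_scores)
```

Because the between-raters mean square appears in the denominator, systematic differences between raters lower this coefficient, which fits the stated aim of generalizing to other raters.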

The authors’ ratings of the videotapes were used to determine a gold standard total raw score for each videotaped session. These total raw scores and the infant’s age were used to determine a gold standard percentile ranking for each infant. The authors also determined the rater’s percentile rankings based on each rater’s scores because the raters were blind to the age of the infant on the videotape that they were scoring. The gold standard percentile rankings were compared with the raters’ percentile rankings to determine differences in the classification of infants as normal or abnormal in their motor development. The authors of the AIMS consider infants classified below the 10th percentile at four months or below the fifth percentile at eight months as abnormal in their motor development. 9
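The classification step can be sketched as follows, assuming the percentile ranking has already been read from the AIMS norms (which are not reproduced here). The function name `below_cutoff` and the dictionary `CUTOFFS` are hypothetical. Only the two published cutoff ages are encoded, so any other age returns `None` rather than an extrapolated answer.

```python
from typing import Optional

# Published cutoff percentiles by age in months (10th at four months,
# fifth at eight months); no cutoffs exist for other ages.
CUTOFFS = {4: 10.0, 8: 5.0}


def below_cutoff(age_months: int, percentile: float) -> Optional[bool]:
    """Return True if the percentile falls below the cutoff for this age.

    Returns None when no cutoff is defined for the given age.
    """
    cutoff = CUTOFFS.get(age_months)
    if cutoff is None:
        return None
    return percentile < cutoff


# The same infant can be classified differently by two raters whose
# total raw scores map to different percentile rankings.
rater_a = below_cutoff(8, 4.0)    # below the fifth percentile
rater_b = below_cutoff(8, 10.0)   # at the 10th percentile
```

Comparing the result obtained from each rater's percentile with the result obtained from the gold standard percentile gives the agreement in classification described above.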


Interrater Reliability

The results of the interrater reliability for the AIMS videotaped sessions on tapes 1, 2, and 3 before and after training can be found in Table 1. For the AIMS total scores, the ICCs ranged from 0.98 to 0.99. The ICCs for the prone subscale ranged from 0.97 to 0.98; for the supine subscale, they ranged from 0.82 to 0.98; for the sitting subscale, they ranged from 0.90 to 0.98; and for the standing subscale, they ranged from 0.96 to 0.99. In the supine subscale, the ICC before training for Group 1 on tape 2 was 0.82 and was 0.90 for Group 2 after training.

Table 1. ICC Values Pre- and Post-training for Group 1 and Post-training for Group 2 on the AIMS Subscale (Prone, Supine, Sitting, Standing) and Total Scores

AIMS Total Scores and Percentile Rankings

The AIMS total scores and percentile rankings can be found in Table 2. Before training, the percentile rankings determined for infants 3 and 4 by raters in Group 1 were below the fifth percentile but were at the 10th percentile for raters in Group 2 and the gold standard.

Table 2. Infant’s Age, Gold Standard AIMS Total Score, Percentile, Mean AIMS Total Score with Standard Deviation for Groups 1 and 2 for Infants 1–7 and Groups 1 and 2 Combined for Infants 8–14


In this study, raters were divided into two groups. Raters in Group 1 were asked to score the AIMS on seven videotapes of infants before receiving training and another seven infants after training. Participants in Group 2 scored all 14 infants after attending the AIMS training session. As seen in Table 1, ICC values on AIMS total scores indicate high interrater reliability (0.98–0.99) before and after training. These values suggest that for professionals with at least three years of experience as an EIP, the AIMS manual provides sufficient detailed information for reliable scoring between raters. Further examination of the ICC values, however, suggests that the ability to score the AIMS reliably may differ between the different positions in which the infant is placed during the administration of the AIMS (prone, supine, sitting, and standing) and reliability may be affected by training. As can be seen in Table 1, the ICC value for the supine subscale for infants 1–7 before training by raters in Group 1 was 0.82 and was 0.90 after training by raters in Group 2. Although still indicating good reliability, an ICC of 0.82 is substantially less than the values presented earlier for the AIMS total scores. The ICC value after training (0.90) demonstrated by the raters in Group 2 also suggests that training improved these raters’ ability to score the AIMS items for the supine position, leading to high reliability.

We noticed, however, that the ICC value after training for raters in Group 2 of infants 1–7 (0.90) was also lower than the other ICC values. We decided to examine the videotaped AIMS sessions of infants 1–7 to determine whether there were any particularities that could explain these differences. As shown in Table 2, infants 1–4 were between the ages of five and seven months, whereas infants 5 and 6 were approximately seven and one-half months and infant 7 was 11 months of age. When designing this study, the infants’ videotaped AIMS sessions were randomly ordered and copied on the tapes, but perhaps results would have been different if the age range among infants 1–7 had been more diverse. Examination of the AIMS items for those age ranges may, however, suggest another explanation. During the first six months of life, infants spend most of their time lying supine or sitting supported in an infant seat because they are not yet standing or rolling onto their stomachs independently. Once an infant has mastered rolling supine to prone with rotation between the ages of six and one-half and nine months, all items on the AIMS supine subscale are credited as observed. That item is dependent on the examiner’s ability to differentiate between rolling supine to prone with or without rotation. Therefore, for infants around the age of six to seven months, there is the possibility that most, if not all, of the items on the supine subscale will be rated as observed.

Detailed examination of the supine items on the scoring sheets revealed that the EIPs had difficulty determining the window of current motor skills for these infants, especially for infants in the four- to seven-month age range. The window is made of the items between the least and most mature items on the scale and represents in this instance the infant’s possible motor repertoire in the supine position. Once the window is determined, each item within the window is then scored as either “observed” or “not observed.” Items are documented as observed only if observed at the time of the examination and not on the basis of developmental assumptions or parental reporting. An example of this difficulty was found when “hands to feet” was observed, but “hands to knees,” which is considered part of the window, was not observed. Clinicians know that an infant who is capable of touching his or her feet is also capable of touching his or her knees. However, the latter item cannot be credited if not observed. This particular item was confusing for the raters because some scored it as observed and others did not. We contacted Dr. Darrah, coauthor of the AIMS, to ask her opinion of this situation (personal communication, December 2002). She recommended not starting the window too far back and believes that experienced clinicians may have an advantage over less experienced clinicians in making this decision. When suspecting that an infant is capable of higher items, she recommends giving him or her time to “warm up” a little first before scoring. This information, however, is not clear in the AIMS manual. Our results, therefore, suggest that for infants younger than seven months of age, the supine subscale is more challenging to score reliably between raters than the other subscales. The lack of training prior to scoring affected our raters’ ability to score this subscale with high reliability.

Besides determining interrater reliability, it was also important to examine whether a discrepancy in scores would result in falsely predicting abnormal motor development (false positive) or in neglecting to properly identify an infant with abnormal motor skills (false negative). The AIMS has the best sensitivity and specificity at the 10th percentile at four months and at the fifth percentile at eight months. 9 A total score placing an infant below the 10th percentile at four months or below the fifth percentile at eight months is a significant finding and may create stress on families. Before training, the total AIMS scores from raters 1–3 determined a percentile ranking below the fifth percentile for infants 3 and 4, thus classifying them as having abnormal motor skills (Table 2). The same classification did not result from the percentile rankings determined from the gold standard or by raters 4–8 after training. It should be noted that these infants were also from the subgroup of infants between the ages of five and seven months. Our results suggest that the lack of training in the AIMS affected our raters’ ability to make appropriate recommendations for further testing based on the percentile rankings of the child and may cause a false-positive judgment of a child’s motor development and possibly undue stress to families.

Such cutoff points have been used to classify infants using their percentile rankings and have provided therapists with confirmation of the wide range of normal motor development that is observed in the typically developing infant. They are, however, only available for four- and eight-month-old infants. This situation leaves therapists with limited data to support recommendations for infants of different ages. At best, it can be extrapolated that infants older than four months with a percentile ranking below the fifth percentile are considered to be at risk of having long-term abnormal motor development and should be referred for further evaluation. It is unfortunate that research efforts to identify cutoff points at other monthly intervals have not been pursued since the original paper on this research was published in 1998. 9

Although we believe that our results show, with some exceptions, that high reliability can be achieved by EIPs with or without training when scoring videotapes of infants being administered the AIMS, our methodology does not allow us to conclude that those same individuals would be able to administer the AIMS in such a way that would lead to high interrater reliability in scoring. Reliability achieved from the scoring of videotaped AIMS sessions does not necessarily mean that the raters know how to appropriately administer the AIMS. The infants on our videotapes were also typically developing infants. The results might have been different if the infants had demonstrated motor delays. Future research studies would need to examine the reliability between raters in real clinical settings with raters administering the AIMS on a range of infants with and without motor delays. Our findings also suggest that a more detailed explanation of the window of current motor skills would help to improve reliability. Further explanation of the placement of the window would benefit the user of the AIMS. Currently, the decision to credit an item as observed or not observed is at times arbitrary because moving the window toward more mature behaviors would allow such items as “hands to knees” to be credited even if not observed during a testing session when the “hands to feet” item is observed. An experienced clinician may have more flexibility in making those decisions than a novice therapist.

The AIMS is a well-designed assessment that offers the opportunity for clinicians to assess infants’ current motor skills without unnecessary stress caused by excessive handling. The provision of normative data at the end of the manual affords the possibility of calculating Z scores to help determine eligibility for early intervention services. Its scoring system is simple, and the determination of the percentile ranking based on the total raw score offers clinicians the ability to provide guidance to families of infants with potential motor dysfunction. That the cutoff points for determining who shows abnormal motor development are as low as the 10th and fifth percentiles confirms that the rate of motor development varies considerably. We believe that the AIMS can also be used by a diverse group of professionals working with infants who are at risk of motor delays. However, before such applications are widely recommended, we suggest further clarification of the scoring system as well as additional research to identify cutoff points for abnormal motor development at monthly intervals.


Special thanks to the families and infants who participated in this study and to the EIPs at Capitol Region Education Council (CREC) in Hartford, CT, who so eagerly gave their time for this study.


1. Piper MC, Darrah J. Motor Assessment of the Developing Infant. Philadelphia: WB Saunders; 1994.
2. Miller LJ, Roid GH. The TIME. Toddler and Infant Motor Evaluation: A Standardized Assessment. Tucson, AZ: Therapy Skill Builders; 1994.
3. Campbell SK, Kolobe THA. Concurrent validity of the Test of Infant Motor Performance with the Alberta Infant Motor Scale. Pediatr Phys Ther. 2000; 12: 2–9.
4. Russell DJ, Rosenbaum PM, Cadman DT, et al. The Gross Motor Function Measure: a means to evaluate the effects of physical therapy. Dev Med Child Neurol. 1989; 31: 341–352.
5. Haley SM, Faas RM, Coster WJ, et al. Pediatric Evaluation of Disability Inventory. Boston: New England Medical Center; 1989.
6. Piper MC, Pinnell LE, Darrah J, et al. Construction and validation of the Alberta Infant Motor Scale (AIMS). Can J Public Health. 1992; 83 (suppl 2): S46–S50.
7. Bayley N. Bayley Scales of Infant Development. 2nd ed. San Antonio, TX: Psychological Corporation; 1993.
8. Folio MR, Fewell RR. Peabody Developmental Motor Scales. 2nd ed. Austin, TX: Pro-Ed; 2000.
9. Darrah J, Piper M, Watt M. Assessment of gross motor skills of at risk infants: predictive validity of the Alberta Infant Motor Scale. Dev Med Child Neurol. 1998; 40: 485–491.
10. Coster W. Critique of the Alberta Infant Motor Scale (AIMS). Phys Occup Ther Pediatr. 1995; 15: 53–63.
11. Piper MC, Darrah J. Response to Dr. Coster’s critique of the Alberta Infant Motor Scale (AIMS). Phys Occup Ther Pediatr. 1995; 15: 65–69.
12. Jeng SF, Tsou Yau KI, et al. Alberta Infant Motor Scale: reliability and validity when used on preterm infants in Taiwan. Phys Ther. 2000; 80: 168–178.
13. Darrah J, Redfern L, Maguire T, et al. Intra-individual stability of rate of gross motor development in full term infants. Early Hum Dev. 1998; 52: 169–179.
14. Fetters L, Tronick EZ. Discriminate power of the Alberta Infant Motor Scale and the Movement Assessment of Infants for prediction of Peabody Gross Motor Scale scores of infants exposed in utero to cocaine. Pediatr Phys Ther. 2000; 12: 16–23.
15. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. Stamford, CT: Appleton & Lange; 2000.

developmental disabilities/diagnosis; development; infant; motor skills/classification; reproducibility of results; health personnel; training

© 2004 Lippincott Williams & Wilkins, Inc.