Psychometric Properties and Standardization Samples of Four Screening Tests for Infants and Young Children: A Review

Lee, Leanna L. S. BSc (Psychology); Harris, Susan R. PhD, PT, FAPTA

doi: 10.1097/01.PEP.0000163078.03177.AB

Purpose: This article compares traditional psychometric properties (interrater and test-retest reliability, concurrent and predictive validity), clinical epidemiological characteristics (sensitivity, specificity, and positive predictive values), and standardization samples of four tests useful to pediatric therapists in screening infants and young children for developmental delays.

Summary of Key Points: Pediatric therapists are often involved in screening infants and young children for developmental delay. Ideally, they will use standardized tests that have strong psychometric properties (eg, reliability, validity, sensitivity, specificity). The four tests described in this article vary in meeting these criteria. They vary as well in the domains assessed, age ranges for which they are intended, and desired qualifications of the examiners.

Conclusions: Each of the four tests reviewed has identified strengths and weaknesses. Practicing clinicians should select screening tests based on the test’s stated purpose, qualifications of the examiner, age range covered, administration and scoring time, developmental domains encompassed, comparability of the standardization sample, and strength of the test’s psychometric properties.

This review of the psychometric properties of four screening tests, the Ages and Stages Questionnaires (ASQ), the Alberta Infant Motor Scale (AIMS), the Denver-II, and the Harris Infant Neuromotor Test (HINT), serves as a reminder that therapists need to understand psychometric properties, clinical epidemiological characteristics, and characteristics of the standardization samples when selecting a screening test for clinical use.

School of Rehabilitation Sciences, University of British Columbia, Vancouver, BC

Address correspondence to: Dr. Susan R. Harris, School of Rehabilitation Sciences, University of British Columbia, T325-2211 Wesbrook Mall, Vancouver, BC V6T 2B5, Canada. Email:

This review was funded, in part, by a grant from The Hospital for Sick Children Foundation, Toronto, Ontario, Canada.

In a 2001 position statement by the Committee on Children with Disabilities of the American Academy of Pediatrics on developmental screening and surveillance of infants and young children, the following recommendation was made: “All infants and young children should be screened for developmental delays.”1 The Committee emphasized that screening tools must have strong psychometric properties, such as acceptable levels of reliability, validity, sensitivity and specificity, as well as standardization of the test on diverse populations.1

The purpose of this review article is to compare and contrast traditional psychometric properties (interrater and test-retest reliability, concurrent and predictive validity), clinical epidemiological characteristics (sensitivity, specificity, and positive predictive value), and standardization samples of four tests that can be used to screen infants and young children for developmental delays (see Table 1 for definitions). The four tests are the Ages and Stages Questionnaires (ASQ),2 the Alberta Infant Motor Scale (AIMS),3 the Denver-II,4 and the Harris Infant Neuromotor Test (HINT).5 These tests were chosen because they cover age ranges up to at least 12.5 months (for the HINT), with the AIMS extending to 18 months, the ASQ ranging from four to 60 months,2 and the Denver-II from one month to six years of age.4 The goal of this review is to assist practicing clinicians and clinical researchers in choosing appropriate screening tests based on traditional psychometric properties, clinical epidemiological characteristics, and standardization samples. Because practitioners are also interested in the developmental domains covered by a test, the administration time, and qualifications for the assessors, this information is provided in Table 2.





Ages and Stages Questionnaires (ASQ)

The first edition of the Ages and Stages Questionnaires: A Parent-Completed, Child-Monitoring System6 was published in 1995 by Bricker et al as a revision of the former Infant/Child Monitoring Questionnaires from the 1980s. The ASQ was designed to provide an economical and convenient alternative to early infant/child assessments. Composed of a sequence of questionnaires to be completed by the parent or primary caregiver, the ASQ monitors infant/child development at various age intervals beginning at four months of age. Parents typically require 10 to 20 minutes to complete each questionnaire.

The first edition of the ASQ included 11 questionnaires to be completed at the following ages: four, six, eight, 12, 16, 18, 20, 24, 30, 36, and 48 months.6 Each questionnaire assessed 30 items distributed equally across the domains of communication, fine motor, gross motor, problem solving, and personal-social.6–8 Item responses were based on parental or caregiver report and observation and could vary from “yes” (10 points) to “sometimes” (5 points) to “not yet” (0 points). The total scores obtained on each domain were compared with cutoff scores, which had been empirically determined at each designated age interval. Scoring below the established cutoff score in one or more of the five domains meant that the infant was considered suspect for developmental delay, resulting in referral for further comprehensive testing.7
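The scoring rule described above can be sketched in a few lines of Python. This is an illustration only: the point values (10, 5, 0) and the five domains come from the text, whereas the function name `score_asq`, the example responses, and the cutoff scores are hypothetical placeholders and are not taken from the ASQ manual.

```python
# Illustrative sketch of first-edition ASQ scoring as described in the text.
# Point values come from the article; the cutoffs and responses are HYPOTHETICAL.

POINTS = {"yes": 10, "sometimes": 5, "not yet": 0}

DOMAINS = (
    "communication",
    "fine motor",
    "gross motor",
    "problem solving",
    "personal-social",
)


def score_asq(responses, cutoffs):
    """Return (domain totals, domains below cutoff, refer flag)."""
    totals = {d: sum(POINTS[r] for r in responses[d]) for d in DOMAINS}
    # A domain is flagged when its total falls below the cutoff for that age.
    flagged = [d for d in DOMAINS if totals[d] < cutoffs[d]]
    # One or more flagged domains means referral for further testing.
    return totals, flagged, bool(flagged)


# Hypothetical child: 30 items spread equally over 5 domains (6 per domain).
responses = {
    "communication": ["yes", "yes", "sometimes", "yes", "not yet", "yes"],
    "fine motor": ["yes", "yes", "yes", "yes", "yes", "yes"],
    "gross motor": ["sometimes", "not yet", "not yet", "sometimes", "not yet", "yes"],
    "problem solving": ["yes", "yes", "yes", "sometimes", "sometimes", "yes"],
    "personal-social": ["yes", "sometimes", "yes", "yes", "not yet", "yes"],
}
cutoffs = {d: 30 for d in DOMAINS}  # placeholder cutoffs, NOT published values

totals, flagged, refer = score_asq(responses, cutoffs)
print(totals)
print("Below cutoff:", flagged, "| refer:", refer)
```

With these made-up inputs, only the gross motor total (20 points) falls below the placeholder cutoff, so the child would be flagged for referral.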

Standardization Sample for the ASQ

In 1997, the ASQ authors published a study evaluating the revision, reliability, and validity of the questionnaires.7 Questionnaires completed by parents of 2008 children from Oregon, Hawaii, and Ohio were examined. The sample included both children at risk of developmental delay due to medical or environmental risk factors (81%) and children who were typically developing with no known risk factors (19%). Diverse ethnic and socioeconomic backgrounds were represented, although the authors noted that the Hispanic population was underrepresented and the Native-American population was overrepresented.7

Questionnaires were not always completed consecutively by each and every family. The number of questionnaires completed decreased over time, ranging from 1500 on the four-month ASQ to 535 at the 36-month interval. For many of the participating infants and young children (n = 950), only one questionnaire was completed.6 The entire sample (n = 2008) was not used in the reliability and validity analyses. A subsample of children and their parents was typically used.7

Reliability and Validity of the ASQ

Squires et al7 examined test-retest (n = 175) and interobserver reliability (n = 112) based on parent-completed questionnaires. Data collected from both the original (Infant/Child Monitoring Questionnaires) and the revised questionnaires (ASQ) were pooled to evaluate the reliability of the ASQ. The rationale for including data from the original questionnaire was that the ASQ underwent minimal revisions and that a more representative sample of child performance at each developmental age would result.7

To evaluate test-retest reliability, scores on two identical questionnaires completed by parents at two-week intervals were compared. Based on the percentage of agreement between the classification outcomes (ie, typical or suspect for developmental delay) of the two completed questionnaires, test-retest reliability was reported to be 94% (standard error of the mean [SEM] = 0.10).7 Interobserver reliability was determined by evaluating the percentage of agreement between the classification of children as assessed by parent-completed questionnaires and by professional examiners. Interobserver reliability was found to be 94% (SEM = 0.12).7 Unfortunately, the percent of agreement will often overestimate true reliability because it fails to account for chance agreement. The kappa statistic would have been preferable for calculating the test-retest and interobserver reliability of the ASQ.9
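A small sketch shows why percentage agreement can overstate reliability. The counts below are hypothetical and are not the ASQ data; the two functions implement the standard definitions of percentage agreement and Cohen's kappa for two ratings that each classify a child as typical or suspect.

```python
def percent_agreement(table):
    """table[i][j] = children placed in category i by rating 1 and j by rating 2."""
    n = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return agree / n


def cohens_kappa(table):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal totals alone.
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


# HYPOTHETICAL 2x2 table; category order is [typical, suspect].
table = [
    [90, 3],
    [3, 4],
]

agreement = percent_agreement(table)
kappa = cohens_kappa(table)
print(f"percentage agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

In this invented table the two ratings agree on 94% of children, yet kappa is only about 0.54, because most children are classified as typical and a large share of the agreement would be expected by chance.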

Concurrent validity was investigated by comparing parental classifications on the ASQ with child performance on professionally administered standardized tests. For infants aged four to 30 months, the tests administered were the Revised Gesell and Amatruda Developmental and Neurologic Examination10 and the Bayley Scales of Infant Development.11 The Stanford-Binet Intelligence Test-4th edition12 and the McCarthy Scales of Children’s Abilities13 were used for children older than three years of age. Standardized assessment was administered within 29 days of ASQ completion by parent or caregiver. Concurrent validity was evaluated using a total of 1511 completed questionnaires. A cutoff score of two standard deviations (SDs) below the mean was implemented as receiver operating characteristic (ROC) analyses had demonstrated optimal sensitivity and specificity at this cutoff point.7 In examining the concurrent relationship between classifications based on ASQ performance and classifications based on scores from professionally administered standardized tests, the sensitivity and specificity of the ASQ across all age intervals were determined to be 75% and 85%, respectively, with a positive predictive value of 46%.

To further evaluate the ASQ based on classification categories, questionnaires were completed by parents of 52 children aged four to 36 months with known disabilities. All but two parents (96%) identified their children as “refer for further assessment” on the ASQ (≥2 SDs below the mean).7 However, because the children selected for this study were all enrolled in state-funded early intervention programs, parental knowledge likely affected their response to items on the ASQ.

More recently, a group of Australian researchers14 examined the concurrence of classifications based on the first edition of the ASQ6 with classification categories derived from scores on the Griffiths Mental Developmental Scales,15 the Bayley Scales of Infant Development,11 and the McCarthy Scales13 in a sample of 167 children born prematurely. Aggregate results for all age groups (12 to 48 months corrected age) yielded sensitivity of 90%, specificity of 77%, and a positive predictive value of 46%.

According to Glascoe et al,16 the preferred level for sensitivity is approximately 80%, whereas preferred levels for specificity and positive predictive value are 90% and 70%, respectively. Although the sensitivity for the ASQ was acceptable in this study,14 neither the specificity nor the positive predictive value reached acceptable levels.

According to Fletcher et al,17 however, interpretation of acceptable levels of sensitivity and specificity should be based on the philosophy of individual clinics or treatment settings. For example, are clinicians in that setting more concerned with falsely identifying a child as delayed or is there greater concern that a child will be “missed” for early identification and thus not receive early intervention? Because positive predictive value is affected by the prevalence of the disability in that particular clinical setting, it can be argued that preset levels of “acceptability” for positive predictive value are not particularly helpful to clinicians who work in settings in which the prevalence of that disability is low (which will result in lower positive predictive values).
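The dependence of positive predictive value on prevalence can be made concrete with Bayes' rule. The sketch below holds sensitivity and specificity fixed at the 75% and 85% reported for the ASQ and varies prevalence; the three prevalence values and the function name are illustrative assumptions, not figures from any of the studies reviewed.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / all positives, expressed as population proportions."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


SENSITIVITY = 0.75  # reported for the ASQ in the text
SPECIFICITY = 0.85  # reported for the ASQ in the text

# HYPOTHETICAL prevalence values for three kinds of clinical setting.
ppv_by_prevalence = {
    prevalence: positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    for prevalence in (0.05, 0.15, 0.30)
}

for prevalence, ppv in ppv_by_prevalence.items():
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}")
```

With the test's accuracy unchanged, the positive predictive value rises from roughly 21% at 5% prevalence to roughly 47% at 15% and 68% at 30%, which is why a single preset threshold for positive predictive value transfers poorly across settings.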

In the second edition of the ASQ, published in 1999,2 additional questionnaires were added for the age intervals 10, 14, 16, 22, 33, 42, 54, and 60 months. Neither reliability nor concurrent validity of the second edition has been examined, nor has predictive validity been examined for either the first or second editions.

Alberta Infant Motor Scale (AIMS)

In response to advances in the neonatal intensive care unit and the consequent increase in survival rate of premature and low birth weight infants, the AIMS was developed to provide a valid and reliable measure of motor development for infants at high risk of motor delay.18 Unlike previous standardized developmental assessments, the AIMS3 focuses not simply on achievement of motor milestones in the developing infant but also on the motor aspects and mechanisms required to attain such milestones (eg, weight-bearing, posture, antigravity movement) between birth and the age of independent walking. Designed as an observational assessment requiring minimal handling of the infant, the AIMS can be administered in 10 to 20 minutes by any health professional with a background in infant motor development. The scale consists of 58 items examining an infant’s gross motor proficiency in four postural positions: prone (21 items), supine (nine items), sitting (12 items), and standing (16 items).18,19 A sum of the scores from all four positions is then converted to a percentile rank and compared to empirically established normative ranks.

Standardization Sample for the AIMS

The AIMS was standardized on a normative sample of 2202 infants recruited exclusively from the province of Alberta, Canada, via a stratified random selection procedure. Although the sample was stratified by age and gender, ethnic and socioeconomic characteristics were not reported, thus making it difficult to compare this sample to infants in the United States20 or those in other provinces in Canada. Because some studies have shown that race21 and maternal socioeconomic status22,23 can have an effect on infant motor development, the failure to include this information is a shortcoming of the AIMS normative sample description.

Reliability and Validity of the AIMS

Study of the test’s reliability and validity was conducted on a sample of 506 typically developing infants from birth to 18 months of age, all born in the city of Edmonton, Alberta.18 Interrater reliability was assessed through simultaneous administration of the AIMS by two different examiners where one examiner actively scored the test while the other observed, scoring the test independently. The scores obtained for 221 infants were included in this correlation analysis and resulted in an interrater reliability coefficient of 0.996 and above.18 Test-retest reliability involved administering the AIMS to 253 infants on two occasions, the second assessment occurring within seven days of the first. The test-retest reliability across all ages on the AIMS ranged from 0.86 to 0.99.3 Unfortunately, the authors do not report the statistic used to calculate the reliabilities.20

In evaluating the AIMS’s concurrent validity, the authors assessed 120 infants from 0 to 13 months of age on the Motor Scale of the Bayley Scales of Infant Development and the Peabody Developmental Motor Scales.3,18 Concurrent validity between the AIMS and two established, standardized pediatric screening tools was measured using the Pearson product-moment correlation.17 Correlation coefficients were r = 0.98 for the Motor Scale of the Bayley Scales of Infant Development11 and r = 0.97 for the Peabody Developmental Motor Scales (PDMS).24 An additional study of concurrent validity was conducted on 68 infants at high risk with atypical motor development using the same tests and yielded similar correlation coefficients (r = 0.85–0.98).3

Using clinical epidemiological analyses, the predictive validity of the AIMS was assessed in a sample of 164 infants at high risk recruited from two neonatal intensive care units in Edmonton, Alberta.19 The predictive validity of the AIMS was compared to the predictive validity of the Movement Assessment of Infants (MAI)25 and the Peabody Developmental Gross Motor Scale (PDGMS).24 A physical therapist administered the AIMS, MAI, and PDGMS to each infant at four and eight months of age and was blind to the infants’ medical history. At 18 months of age, a follow-up assessment was conducted by a developmental pediatrician who then classified the infants as normal, suspect, or abnormal in terms of motor development. The pediatrician’s evaluation was based on criteria such as postural control, muscle tone, reflexes, and achievement of motor milestones.18 At 18 months, 78% of the infants were classified by the pediatrician as normal, 8.5% as suspect, and 13.4% as abnormal in their motor development.

For predictive validity analyses, infants at 18 months receiving a suspect outcome were either grouped with those classified as normal or with those receiving an abnormal classification. Predictive values were determined for all three measures initially administered at four and eight months. When grouping infants suspect for delay with normal infants at the 18-month outcome assessment, the AIMS and MAI showed similar sensitivity at four months but the MAI provided greater specificity.19 When infants were assessed at eight months, however, the AIMS demonstrated greater specificity than the MAI for the 18-month outcomes. The PDGMS did not show an acceptable combination of sensitivity and specificity until its cutoff score was set at the 16th percentile rank at four months. When infants suspect for delay were grouped with abnormal classifications, a grouping proposed by Glascoe et al,16 the sensitivity of the AIMS and MAI decreased for assessments at both four and eight months while the sensitivity of the PDGMS was poor at four months and its specificity was poor at eight months.19

Grouping suspect with normal outcomes maximized both the sensitivity and specificity of the AIMS: 77.3% and 81.7% at four months and 86.4% and 93.0% at eight months, respectively. The positive predictive value for the AIMS was 39.5% at four months and 65.5% at eight months.19 The values at eight months for this grouping of suspect with normal compare quite favorably to the preferred values recommended by Glascoe et al16 (ie, preferred sensitivity is 80%, preferred specificity is 90%, and preferred positive predictive value is 70%).
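The comparison with the preferred values can be laid out explicitly. The sketch below uses only the AIMS figures and the Glascoe et al thresholds quoted in the text; treating "preferred" as "greater than or equal to" is an assumption made here for illustration, and the names are hypothetical.

```python
# Preferred levels (percent) attributed to Glascoe et al in the text.
PREFERRED = {"sensitivity": 80.0, "specificity": 90.0, "ppv": 70.0}

# AIMS values (percent) reported in the text for assessments at each age.
AIMS = {
    "4 months": {"sensitivity": 77.3, "specificity": 81.7, "ppv": 39.5},
    "8 months": {"sensitivity": 86.4, "specificity": 93.0, "ppv": 65.5},
}


def meets_preferred(values, preferred=PREFERRED):
    """Map each index to True when it reaches the preferred level (assumed >=)."""
    return {name: values[name] >= preferred[name] for name in preferred}


results = {age: meets_preferred(values) for age, values in AIMS.items()}

for age, checks in results.items():
    print(age, checks)
```

Under this reading, the eight-month assessment reaches the preferred sensitivity and specificity and falls a few points short on positive predictive value, whereas the four-month assessment is below all three levels.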

Denver-II

The Denver Developmental Screening Test (DDST) was first standardized and published in 1967. Designed to assess multiple aspects of child development, the DDST has become one of the most prevalent screening tools used for children aged one month to six years, with more than 50 million children screened worldwide for potential developmental delays.2,16,26

Despite its common use, concerns have been raised about the test’s sensitivity and specificity.16,27 In the early 1990s, the DDST underwent a significant revision and restandardization and has since been known as the Denver-II.4 Similar to the original DDST, the Denver-II involves an aggregate of items assessed by observation, parental report, or direct elicitation.28 With an administration time of 10 to 20 minutes, the Denver-II continues to evaluate four domains of function: personal-social, fine motor adaptive, language, and gross motor. Performance on age-appropriate tasks within these domains is scored to determine how a child is classified, ie, developmentally delayed, suspect, or within normal range. Other differences in the Denver-II include modifications of previously difficult to administer items and a substantial increase in the number of language items.

Standardization Sample for the Denver-II

To standardize the Denver-II, a quota sample of 2096 children from the state of Colorado, aged from birth to six years was used. To approximate the distribution of the population of Colorado, this sample was stratified by various demographic characteristics such as gender, maternal education, ethnicity, and socioeconomic status.4,26,29

Reliability and Validity of the Denver-II

To demonstrate the proficiency of the revised tool, Frankenburg et al4 evaluated the item reliability of the Denver-II. Subjects were 38 children, 34 of whom were tested on separate occasions, seven to 10 days apart. The Denver-II demonstrated excellent examiner-observer or interrater reliability (κ ≥ 0.75).4 Test-retest reliability at seven to 10 days was found to show excellent agreement for 59% of the Denver-II items (κ ≥ 0.75), whereas 23% of items demonstrated fair to good agreement (κ ≥ 0.40). Compared to its predecessor, the Denver-II demonstrates a considerably higher degree of reliability.

Although the revision and restandardization of the DDST attempted to address previous inadequacies, the Denver-II was published and distributed prior to studying its concurrent validity or predictive accuracy. Later in 1992, a study evaluating the accuracy of the Denver-II in identifying children with atypical development was published.16 To examine the test’s concurrent accuracy, Glascoe et al assessed the correspondence of categorical classifications on the Denver-II with categorical classifications on other reference screening tests in order to determine the sensitivity, specificity, and positive predictive value of the Denver-II.

One hundred four children, ranging in age from three to 72 months, were recruited from five different daycare centers. All children were assessed initially using the Denver-II. Within seven days of that assessment, each child was assessed on a battery of standardized assessments, including the Vineland Adaptive Behavior Scale,30 the Kaufman Assessment Battery for Children Achievement Subtests,31 the Fluharty Preschool Speech and Language Screening Test,32 and one of the following cognitive tests: the Bayley Scales of Infant Development,11 the Stanford-Binet Intelligence Scale, 4th edition,12 or the Kaufman Assessment Battery for Children.31 By applying the results from this battery of tests to federal and state criteria for special education eligibility, the presence or absence of developmental delay was determined and these classifications were compared to those derived from performance on the Denver-II.

Thirty-eight percent of the children in the study scored within the normal range on the Denver-II, 26% received abnormal scores, 33% received questionable or suspect scores, and 3% were not testable. According to the scores demonstrated by the battery of criterion assessments, only 17% of children showed evidence of developmental problems. When questionable scores were grouped with normal scores, the Denver-II yielded sensitivity of 56%, specificity of 80%, and a positive predictive value of 37%. When grouping abnormal scores with suspect scores, the sensitivity of the Denver-II was 83%, specificity was 43%, and positive predictive value was 23%. Based on these findings, the American Academy of Pediatrics noted that “the Denver-II screening test is used widely but has modest sensitivity and specificity depending on the interpretation of questionable results.”1

Glascoe et al16 suggested that interpreting questionable scores as abnormal was more plausible than the alternative because questionable scores tend to lead to referrals for more comprehensive testing. Despite attaining the desired level of sensitivity by using this recommended grouping, more than 50% of typically developing children would then be classified as suspect for developmental delay. As a result, developmental screening with the Denver-II could lead to further diagnostic examination of about 60% of the children tested. Such low specificity of a screening test produces concern for an unnecessarily high referral rate, leading to increased expense and excessive parental distress.16
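The referral figures above follow from the reported accuracy values. The sketch below uses the sensitivity (83%) and specificity (43%) reported when questionable scores are grouped with abnormal scores, and treats the 17% of children with evidence of developmental problems as the prevalence; that last step, and the function name, are assumptions made for illustration.

```python
def referral_rate(sensitivity, specificity, prevalence):
    """Proportion of all screened children who screen positive."""
    true_pos = sensitivity * prevalence                    # delayed and flagged
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # typical but flagged
    return true_pos + false_pos


sens, spec, prev = 0.83, 0.43, 0.17  # values quoted in the text

rate = referral_rate(sens, spec, prev)
false_positive_rate = 1.0 - spec  # share of typically developing children flagged

print(f"typically developing children flagged: {false_positive_rate:.0%}")
print(f"all children referred: {rate:.0%}")
```

Under these assumptions about 57% of typically developing children are flagged and about 61% of all children screened would be referred, consistent with the "more than 50%" and "about 60%" figures in the text.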

A study published the following year by Glascoe and Byrne29 examined the relative accuracy of the Denver-II, the Developmental Profile II,33 and the Battelle Developmental Inventory Screening Test34 when compared to scores on a criterion battery of standardized assessments. As in the earlier study,16 specificity for the Denver-II was unacceptably low (46%) when grouping questionable with abnormal scores, although sensitivity was 83%. Using this grouping, positive predictive value was only 28%, however.

Again, as Fletcher et al17 have noted, interpretation of acceptable levels of sensitivity and specificity should be based on the philosophy of individual clinics or treatment settings, and positive predictive value will be affected by the prevalence of the disability in the actual clinical setting in which the test will be administered.

When grouping questionable scores with normal scores, specificity increased to 80% but sensitivity fell to 56%. The positive predictive value using this grouping was 42%. Glascoe and Byrne29 concluded that the Denver-II “produced more incorrect than correct classifications.”

Although revision and restandardization represented an improvement over the original DDST, the failure to address the revised test’s accuracy based on standard criteria for the development of new tests is a major shortcoming. Because early identification is essential for initiating family support and increasing the opportunity for positive outcomes in children with developmental delays, it is crucial that screening instruments demonstrate high degrees of both sensitivity and specificity. More than a decade ago, it was proposed that the authors of the Denver-II extensively study the test’s validity so that changes could be implemented to improve test accuracy.16 Unfortunately, those studies have not yet been undertaken.

Harris Infant Neuromotor Test (HINT)

A screening test developed to identify neuromotor, cognitive, or behavioral concerns in infants between the ages of 2.5 and 12.5 months, the HINT5 is composed of three parts: a section for recording background information on the child and caregiver, a section including five questions to assess the caregiver’s level of concern about the infant’s movement and play, and a final section including 21 items, mostly observational, to assess the infant’s movement against gravity, muscle tone, behavior and cooperation, stereotypical behaviors, and head circumference.5 Designed for administration and scoring in less than 30 minutes, the HINT can be administered by occupational therapists, physical therapists, community health nurses, general practitioners, and early childhood special education professionals.

Standardization Sample for the HINT

Recent normative data for the HINT were collected for 412 Canadian infants from the provinces of British Columbia, Manitoba, Nova Scotia, Ontario, and Quebec.35 The normative sample was stratified by gender, maternal education, and ethnicity. Data were collected on at least 40 infants (20 girls and 20 boys) in each of the 10 monthly age groupings. Seventy-three percent of the infants were white, 16% were Asian, 5% were Aboriginal (Native Canadian), 4% were African/Caribbean, and 3% were Arabic. Maternal education levels included 36.2% with high school or less, 24% with a two-year certificate or diploma, and 39.8% with a bachelor’s degree or higher.

Reliability and Validity of the HINT

Interrater reliability of the HINT was examined by five pediatric physical or occupational therapists for 28 infants at high risk.36 One therapist served as primary examiner and administered the test while a second therapist observed and scored the test independently. For the total HINT score, the interrater reliability intraclass correlation coefficient (ICC) was 0.99. To evaluate test-retest reliability, the primary examiner readministered the HINT within nine days of the initial screening for 20 of the infants. The test-retest reliability ICC was 0.98. Intrarater reliability was examined by videotaping the assessments of 20 infants, scoring the videotapes, and then rescoring them at least one month later. For the five therapists, the intrarater reliability ICCs ranged from 0.98 to 0.99. According to Anastasi,37 an ICC of ≥0.90 is considered “desirable,” thus suggesting that all the HINT reliability coefficients were in the desirable range.

Concurrent validity was assessed by comparing HINT total scores to raw scores on the Mental and Motor Scales of the Bayley-II38 for 54 infants at high risk. Both tests were administered at the same assessment session. The Pearson r was used to calculate the relationship between the HINT and the two Bayley-II scales. For the HINT and the Bayley Mental Scale, the concurrent validity was r = −0.73, whereas for the HINT and the Bayley Motor Scale, the relationship was r = −0.89.36

The predictive validity of the HINT, administered between 2.5 and 12.5 months, was assessed by comparing those scores to scores on the Bayley Mental and Motor Scales at 17 to 22 months. The predictive correlation (Pearson r) between the HINT and the later Bayley Motor Scale was −0.49 (p < 0.01), which is considered a modest relationship. However, the predictive validity of the HINT for the Bayley Mental Scale was poor (r = −0.11).36 Interestingly, the predictive correlation for the HINT to the later Bayley Motor Scale was stronger than the relationship between the early Bayley Motor Scale and the later Bayley Motor Scale (−0.49 vs −0.34).

Sensitivity and specificity analyses of the HINT’s predictive accuracy with the Bayley-II at 17–22 months are currently in progress and involve a sample of 119 infants, approximately half of whom were at high risk and half of whom were at low risk for developmental delays.

As can be seen from the foregoing review of four screening tests designed for infants or young children, no one test satisfies the criteria of having a representative standardization (or normative) sample as well as acceptable levels of reliability, validity, sensitivity, specificity, and positive predictive values (see Table 2). Of the four tests reviewed, the AIMS is certainly the strongest with regard to reliability, concurrent validity, and predictive validity. However, it is limited in comparison to the other tests in that it assesses gross motor development only. Although the AIMS’ standardization sample is impressively large (n = 2202) and was randomly selected and stratified by age and gender, concerns exist about the failure to report ethnic and socioeconomic variables. Thus, the “representativeness” of the sample in comparison to other Canadian provinces as well as the United States has been questioned.20

Although widely used around the world, the Denver-II has questionable accuracy in concurrently identifying typical children when compared to standardized developmental assessments. In spite of having been published more than a decade ago, the predictive accuracy of the Denver-II has yet to be evaluated. Furthermore, the normative sample for the Denver-II was limited to children from the state of Colorado. One strength of the Denver-II, especially in comparison to the AIMS and the HINT, is that it encompasses four different developmental domains.

As with the Denver-II, no predictive validity studies have been conducted for the ASQ, although it is certainly the most family centered of the four tests reviewed here. Because the intended goal of the ASQ is to accurately identify infants for the purpose of early intervention, its ability to predict later performance is perhaps not an issue.

The HINT has impressively strong reliability as well as acceptable concurrent validity with the Bayley-II. The predictive relationship of the HINT to the Bayley Motor Scale was modest, albeit stronger than the early Bayley Motor Scale’s relationship to the later Bayley Motor Scale. Compared to the other tests, the HINT normative sample is the smallest but its age range is also the narrowest. The normative sample is, however, diverse and representative of a variety of ethnic groups and maternal education levels.

See Table 3 for a summary of the psychometric properties and/or clinical epidemiological characteristics and standardization samples for the four tests.





Because each of the screening tests reviewed has identified strengths and weaknesses, practicing clinicians should use the following criteria in determining which test to use within their own setting:

  1. The purpose of the screening test, as stated in the test manual
  2. The desired qualifications of the examiner or assessor
  3. The age range that the test covers
  4. The time needed to administer and score the test
  5. The developmental domains encompassed by the screening test
  6. The comparability of the standardization sample used to determine normal values of the test with the infants or children being screened (eg, in ethnicity, gender, and other demographic characteristics)
  7. The traditional psychometric properties of the test (eg, reliability, validity) and/or the clinical epidemiological characteristics (eg, sensitivity, specificity)
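
As a reminder of how the clinical epidemiological characteristics named in criterion 7 are derived, the following is a minimal sketch that computes sensitivity, specificity, and positive predictive value from a 2 × 2 table comparing screening results with a reference assessment. The class name, function names, and counts are hypothetical and chosen only for illustration; they are not drawn from any of the four tests reviewed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """2 x 2 table: screening result versus reference assessment (hypothetical)."""
    true_pos: int   # screened positive, delay confirmed by reference assessment
    false_pos: int  # screened positive, no delay on reference assessment
    false_neg: int  # screened negative, delay confirmed by reference assessment
    true_neg: int   # screened negative, no delay on reference assessment


def sensitivity(c: ScreeningCounts) -> float:
    """Proportion of children with a delay whom the screen identifies."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ScreeningCounts) -> float:
    """Proportion of children without a delay whom the screen passes."""
    return c.true_neg / (c.true_neg + c.false_pos)


def positive_predictive_value(c: ScreeningCounts) -> float:
    """Proportion of positive screens that are confirmed as delays."""
    return c.true_pos / (c.true_pos + c.false_pos)


# Hypothetical counts for 100 screened children (illustration only).
example = ScreeningCounts(true_pos=16, false_pos=12, false_neg=4, true_neg=68)

if __name__ == "__main__":
    print(f"Sensitivity: {sensitivity(example):.2f}")
    print(f"Specificity: {specificity(example):.2f}")
    print(f"Positive predictive value: {positive_predictive_value(example):.2f}")
```

With these illustrative counts, a screen can show reasonable sensitivity and specificity and still have a much lower positive predictive value, because positive predictive value also depends on how common the condition is in the group screened. The sketch assumes each denominator is nonzero; a table with no positive screens, for example, would need separate handling.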


Screening tests are designed to ascertain whether individuals who are healthy are at increased risk of a disease or disorder.39 Typically, pediatric physical therapists work with infants and young children for whom a disability has already been identified. As we move into the realm of health promotion, however, the importance of screening healthy infants for possible motor disorders is increasing. In fact, the Committee on Children with Disabilities of the American Academy of Pediatrics has underscored the importance of screening all healthy infants and children for possible developmental delay.1 It is hoped that this overview of the standardization samples and psychometric properties of four pediatric screening tests will assist physical therapists in choosing the screening tests most appropriate for use with the children whom they serve.



1. American Academy of Pediatrics. Committee on Children with Disabilities. Developmental surveillance and screening of infants and young children. Pediatrics. 2001;108:192–196.
2. Bricker D, Squires J. Ages and Stages Questionnaires: A Parent-Completed, Child-Monitoring System, 2nd ed. Baltimore: Paul Brookes; 1999.
3. Piper M, Darrah J. Motor Assessment of the Developing Infant. Philadelphia: WB Saunders; 1994.
4. Frankenburg WK, Dodds J, Archer P, et al. The Denver-II: a major revision and restandardization of the Denver Developmental Screening Test. Pediatrics. 1992;89:91–97.
5. Harris SR. Development of an infant neuromotor assessment tool. Mary E. Switzer Research Fellowship Final Report. Washington, DC: National Institute on Disability and Rehabilitation Research; 1991.
6. Bricker D, Squires J, Mounts L. Ages and Stages Questionnaires: A Parent-Completed, Child-Monitoring System. Baltimore: Paul Brookes; 1995.
7. Squires J, Bricker D, Potter L. Revision of a parent-completed developmental screening tool: Ages and Stages Questionnaires. J Pediatr Psychol. 1997;22:313–328.
8. Squires J, Potter L, Bricker D, et al. Parent-completed developmental questionnaires: effectiveness with low and middle income parents. Early Child Res Q. 1998;13:345–354.
9. Portney LG, Watkins MP. Statistical measures of reliability. In: Portney LG, Watkins MP, eds. Foundations of Clinical Research: Applications to Practice, 2nd ed. Upper Saddle River, NJ: Prentice Hall Health; 2000:557–586.
10. Knobloch H, Stevens F, Malone A. Manual of Developmental Diagnosis: The Administration and Interpretation of the Revised Gesell and Amatruda Developmental and Neurological Examination. Hagerstown, MD: Harper & Row; 1980.
11. Bayley N. The Bayley Scales of Infant Development. New York: Psychological Corporation; 1969.
12. Thorndike R, Hagen E, Sattler J. Stanford-Binet Intelligence Scale, 4th ed. Chicago: Riverside; 1985.
13. McCarthy D. McCarthy Scales of Children’s Abilities. New York: Psychological Corporation; 1972.
14. Skellern CY, Rogers Y, O’Callaghan MJ. A parent-completed developmental questionnaire: follow up of ex-premature infants. J Paediatr Child Health. 2001;37:125–129.
15. Griffiths R. Abilities of Young Children: A Comprehensive System of Mental Measurement for the First Eight Years of Life. London: Child Development Research Centre; 1970.
16. Glascoe FP, Byrne KE, Ashford LG, et al. Accuracy of the Denver-II in developmental screening. Pediatrics. 1992;89:1221–1225.
17. Fletcher RH, Fletcher SW, Wagner EH. Clinical Epidemiology. The Essentials, 3rd ed. Baltimore: Williams & Wilkins; 1996:57–64.
18. Piper M, Pinnell L, Darrah J, et al. Construction and validation of the Alberta Infant Motor Scale. Can J Public Health. 1992;83(Suppl 2):S46–S50.
19. Darrah J, Piper M, Watt M. Assessment of gross motor skills of at-risk infants: predictive validity of the Alberta Infant Motor Scale. Dev Med Child Neurol. 1998;40:485–491.
20. Coster W. Critique of the Alberta Infant Motor Scale. Phys Occup Ther Pediatr. 1995;15:53–64.
21. Capute AJ, Shapiro BK, Palmer FB, et al. Normal gross motor development: the influences of race, sex, and socio-economic status. Dev Med Child Neurol. 1985;27:635–643.
22. Poresky RH, Henderson ML. Infants’ mental and motor development: effects of home environment, maternal attitudes, marital adjustment, and socioeconomic status. Percept Mot Skills. 1982;54:695–702.
23. Badr Zahr LK. Quantitative and qualitative predictors of development for low-birth weight infants of Latino background. Appl Nurs Res. 2001;14:125–135.
24. Folio MR, Fewell RR. Peabody Developmental Motor Scales and Activity Cards: Manual. Allen, TX: DLM Teaching Resources; 1983.
25. Chandler LS, Andrews MS, Swanson MW. Movement Assessment of Infants: A Manual. Rolling Bay, WA: Authors; 1980.
26. Wade GH. Update on the Denver-II. Pediatr Nurs. 1992;18:140–141.
27. Borowitz KC, Glascoe FP. Sensitivity of the Denver Developmental Screening Test in speech and language screening. Pediatrics. 1986;78:1075–1078.
28. Glascoe FP. Are overreferrals on developmental screening tests really a problem? Arch Pediatr Adolesc Med. 2001;155:54–59.
29. Glascoe FP, Byrne KE. The accuracy of three developmental screening tests. J Early Intervent. 1993;17:368–379.
30. Sparrow SS, Balla DA, Cicchetti DV. Vineland Adaptive Behavior Scales. Circle Pines, MN: American Guidance Service; 1984.
31. Kaufman AS, Kaufman NL. The Kaufman Assessment Battery for Children: Interpretive Manual. Circle Pines, MN: American Guidance Service; 1983.
32. Fluharty NB. Fluharty Preschool Speech and Language Screening Test. Allen, TX: DLM Teaching Resources; 1984.
33. Alpern G, Boll T, Shearer M. The Developmental Profile-II. Los Angeles: Western Psychological Services; 1986.
34. Newborg J, Stock JR, Wnek L, Guidubaldi J, Svinicki J. Battelle Developmental Inventory Screening Test. Allen, TX: DLM Teaching Resources; 1984.
35. Harris SR, Megens AM, Backman CL, et al. Development and standardization of the Harris Infant Neuromotor Test. Infants Young Child. 2003;16:143–151.
36. Harris SR, Daniels LE. Reliability and validity of the Harris Infant Neuromotor Test. J Pediatr. 2001;139:249–253.
37. Anastasi A. Psychological Testing. New York: Macmillan; 1988.
38. Bayley N. Bayley Scales of Infant Development, 2nd ed. San Antonio, TX: Psychological Corporation; 1993.
39. Grimes DA, Schulz KF. Uses and abuses of screening tests. Lancet. 2002;359:881–884.

review; review/tutorial; developmental disabilities/diagnosis; diagnostic tests; infant; child; child/preschool/therapy; psychological tests/methods; predictive value of tests; sensitivity and specificity

© 2005 Lippincott Williams & Wilkins, Inc.