The Hooper Visual Organization Test (HVOT) (Hooper, 1958; Western Psychological Services, 1983) is a screening measure designed to tap an individual’s ability to integrate fragmented visual stimuli. It includes 30 drawings of common objects and animals that have been cut into two or more pieces. The task is to identify and name each item after reorganizing these pieces mentally into a coherent whole. Successful performance relies primarily on visual analytic and synthetic abilities, and on the capacity to label objects either verbally or in writing. The HVOT was originally developed to differentiate patients with brain dysfunction from other groups and has since been used in a variety of mental health, medical, and geriatric settings.
The HVOT has shown acceptable reliability and validity in both clinical and nonclinical populations (Western Psychological Services, 1983; Merten and Beal, 1999; Greve et al., 2000; Paul et al., 2001; Lopez et al., 2003). Normative data for the USA and Greece were obtained, and the influence of age, sex, and educational level on the HVOT performance was evaluated. The effects of age were found, with younger individuals performing better on the test. Similarly, educational level contributed significantly toward test performance; those with higher education showed better performance. No significant effects of sex have been reported in the literature (Western Psychological Services, 1983; Giannakou and Kosmidis, 2006).
In response to probable cultural bias (e.g. variations in linguistic label and unfamiliarity with certain objects) in connection with the HVOT performance, different language versions have been published in recent years. For example, Merten (2002) has developed a German version of the 15-item short form using classical item analysis techniques in a sample of neurological patients aged 15–87 years. Similarly, Giannakou and Kosmidis (2006) have provided normative data for the original English version of the HVOT from a sample of 206 healthy Greek adults aged 18–79 years. However, the suitability of the use of the HVOT in Chinese-speaking adults has not yet been examined.
Given that the HVOT appears to be a reliable and valid screening measure applicable to a wide range of patients, and no Chinese-language version has been adapted and validated culturally for a growing Chinese-speaking population, it is important to develop a psychometrically sound Chinese version for clinical and research purposes. Rasch analysis was used in the development of the Chinese version as it allows us to obtain estimates of item difficulty and an individual’s ability along the same logit scale, which is interval-based, and enables us to calculate parametric statistics (Weiss and Yoes, 1991). If items satisfy Rasch model expectations and the unidimensionality assumptions, it is justifiable to add up each of the item scores for a total score as an overall measure of the ability to integrate visual stimuli.
Thus, this study consisted of two phases. The first phase was designed with the following aims: (a) evaluate the psychometric properties of the HVOT in healthy Chinese-speaking adults, (b) produce estimates on a linear interval scale for the set of items that fit model expectations and (c) validate the Rasch-derived Chinese version in patients with stroke and schizophrenia. The second phase was designed to enhance the scale’s clinical utility through the establishment of normative data using item difficulty and individual ability parameters and determination of the cut-off value that differentiates individuals with and without visuosynthetic deficit.
The expert committee included two occupational therapists with expertise in the assessment of cognitive functioning and one psychometrician. Translation and cross-cultural adaptation were performed according to the guidelines proposed by Beaton et al. (2000), and involved translating the administration instructions and test items (i.e. object names) into Chinese (two independent bilingual translators), synthesizing the two translations (performed by the principal investigator), back-translating it into English (performed by one English mother tongue bilingual translator) and a review of the final version by the expert committee.
Several adjustments were made to the scoring rules of the final version to screen out the confounding effects of cross-linguistic diversity in naming. In particular, full credit was given when the participant was able to correctly describe the difference in appearance between ‘teakettle’ (item 16) and ‘tea pot’ (item 19) because Chinese speakers use the same label for these two objects. For item 17, ‘sofa’ was given full credit because Chinese speakers label a large, stuffed seat for one individual as ‘sofa’ (what English speakers would call a chair). In addition, for item 14 ‘cane’, ‘candy cane’ was given full credit as these two objects are perceptually similar. For item 25 ‘block’, ‘cube’ was given full credit because of the general unfamiliarity with the English alphabet block among older Chinese individuals.
The test procedures were also revised such that participants were encouraged to be more specific in their answers if they only gave the correct category name (e.g. ‘animal’ instead of ‘dog’). This deviation allows for maximum performance as most participants were not aware of the level of specificity required of their answer.
The stratified sample included 1008 healthy adults (502 men, 506 women) between 15 and 79 years of age living in southern Taiwan who were selected to match 2010 Taiwan Census data (Department of Statistics, Ministry of the Interior) with respect to the proportions of individuals in specific demographic categories in this age range. Specifically, the stratification procedure involved proportional representation on the basis of sex (49.9% men, 50.1% women) and education (17.5% with 0–6 years, 14.4% with 7–9 years, 33.6% with 10–12, and 34.5% with 13–16 years). The sampling plan also ensured equal numbers of participants (n=144) in each of the seven age groups (15–19, 20–29, 30–39, 40–49, 50–59, 60–69, and 70–79 years). Participants were recruited through community organizations, local senior centers, places of employment, schools, and by word of mouth. To rule out the possibility of dementia, those aged 60 years and older with a score of 23 or less on the Mini-Mental State Examination (MMSE) (Folstein et al., 1975) were excluded. Participants were also excluded for a history of stroke, traumatic brain injury, Parkinson’s disease, dementia, chronic drug and/or alcohol abuse, or psychiatric illness.
Overall, the mean age and education of the entire sample were 44.83 (SD=19.55) and 11.42 (SD=3.63) years, respectively. Adopting an α level of 0.05, χ2 goodness-of-fit tests showed that the normative sample did not differ significantly from the 2010 Taiwan population’s distribution on sex, χ2(1)=2.0, P=0.16, and education, χ2(9)=12.0, P=0.21. For further analysis, the age groups were collapsed into four categories, namely 15–29, 30–49, 50–69, and 70–79 years, on the basis of the examination of the distribution of HVOT total scores by age group. In the same way, education was divided into two broad levels of schooling: 0–9 years (n=314) and 10–16 years (n=694).
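The goodness-of-fit comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: the function names and the example counts are hypothetical, and the closed-form P value is valid only for 1 degree of freedom (as in the sex comparison).

```python
import math

def chi_square_gof(observed_counts, expected_proportions):
    """Pearson chi-square statistic for observed counts vs. reference proportions."""
    total = sum(observed_counts)
    return sum(
        (obs - total * prop) ** 2 / (total * prop)
        for obs, prop in zip(observed_counts, expected_proportions)
    )

def p_value_df1(statistic):
    """Upper-tail P value for a chi-square statistic with 1 degree of freedom.

    With 1 df the statistic is a squared standard normal deviate,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2))

# Hypothetical counts (NOT the study data): 60 vs. 40 against a 50/50 reference.
example_statistic = chi_square_gof([60, 40], [0.5, 0.5])
example_p = p_value_df1(example_statistic)

# P value implied by the reported sex statistic, chi-square(1) = 2.0.
reported_sex_p = p_value_df1(2.0)
```

Under these assumptions, a statistic of 2.0 with 1 df corresponds to a P value of about 0.16, in line with the value reported for sex.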
Two random subsamples (n=60 each) were drawn separately from the total sample for the known-groups validity analysis, by matching the age and education with those in stroke and schizophrenia samples, respectively. Subsample 1 had a mean age of 52.35 years (SD=11.47) and a mean level of education of 9.93 years (SD=4.04). The mean age of subsample 2 was 35.93 years (SD=9.42); the mean educational level was 12.30 years (SD=1.76).
The stroke sample included 60 patients (42 men, 18 women) recruited consecutively from the Department of Rehabilitation Medicine at a regional teaching hospital. All patients fulfilled the following criteria: (a) a diagnosis of stroke, as determined clinically by a neurologist or a neurosurgeon; (b) a single cerebral infarction or hemorrhage restricted only to one hemisphere; (c) first-ever stroke; (d) no known history of other neurological or psychiatric disorders; and (e) at least 3 months after onset at the time of inclusion in the study. Individuals with severe aphasia and unilateral neglect were excluded. In all cases, the location of the lesion was established on the basis of computed tomography or MRI scans. The mean age of the patients was 52.32 years (SD=11.43), with a mean duration of illness of 760.52 days (SD=973.86). The same group had a mean of 9.78 years of education (SD=4.13). Thirty-five patients had right-sided lesions and 25 had left-sided lesions. Cerebral infarction was present in 15 patients whereas cerebral hemorrhage was present in 45 patients. Of these patients, 30 were randomly selected for the test–retest reliability study, and were retested within 1 week.
The schizophrenia group included 60 participants (29 men, 31 women) who fulfilled the Diagnostic and Statistical Manual of Mental Disorders, 4th edition (DSM-IV) (American Psychiatric Association, 1994) criteria for schizophrenia or schizoaffective disorder, as confirmed by the treating psychiatrists using the Structured Clinical Interview for DSM-IV (First et al., 1996). Patients were recruited from an outpatient clinic of the Department of Psychiatry at the Kaohsiung Medical University Hospital. Exclusion criteria included evidence of current substance abuse, mental retardation, a history of neurological illness, or significant changes in medication treatment during the month preceding the baseline evaluation.
The patient sample was characterized by a mean educational level of 12.30 years (SD=1.76), a mean age of 35.93 years (SD=9.42), and a mean duration of illness of 14.0 years (SD=8.02). All patients were taking antipsychotic medications. Of these, a randomly selected subsample of 30 patients agreed to participate in a test–retest study 3 weeks after their initial assessment.
This study was approved by the Institutional Review Board of the Kaohsiung Medical University Hospital. All participants provided their informed consent before the commencement of the study. Normative data were collected by six trained research assistants (one occupational therapist and five psychology students) who had completed coursework and supervised experience in testing. Before data collection, all examiners participated in a 2-h session, during which they practiced administering and scoring the MMSE and the HVOT under the supervision of the principal investigator (C.-Y.S.). Examiners were provided a one-page handout with written instructions specifying how the respondent’s verbal response to each HVOT item was to be recorded and scored. Inter-rater reliability was determined by having the experienced occupational therapist individually administer the HVOT to two healthy volunteers while the other examiners observed and independently scored the individual’s performance at the same time. Paired t-tests did not yield a systematic difference between examiners’ scores on the HVOT. The intraclass correlation coefficient (ICC) model (2,1) (two-way random effect model with absolute agreement) was calculated to assess inter-rater reliability (Tesio, 2012). The ICC value for the total scale was 0.99. Test administration was conducted in participants’ homes, schools, a senior citizen center, and local community centers.
Patient data were collected by two other occupational therapists, specializing in neurological and psychiatric rehabilitation, respectively. The therapists were trained and assessed by the principal investigator (C.-Y.S.) to ensure standardization of administration. Inter-rater reliability was determined by having the principal investigator individually administer the HVOT to two patients with stroke or schizophrenia while the two therapists observed and independently scored the patient’s performance at the same time. Almost perfect inter-rater reliability was achieved [ICC(2,1)=0.99]. All patients were tested individually at their respective clinics in a quiet room that was free from distractions. To determine the test–retest reliability of the Chinese version, a smaller subset of patients within each group was retested on the HVOT. Two different retest intervals (i.e. 1 and 3 weeks) were used in the two patient groups. The 3-week interval used for the outpatients with schizophrenia was hard to implement with the stroke patients, 13% of whom were less than 6 months post-onset, as previous research has found that the potential for perceptual recovery remains 6 months after stroke (Oxbury et al., 1974). A 3-week interval was chosen for the psychiatrically stable schizophrenia patients on the basis that significant changes were not expected to occur in the elapsed time. To decrease possible experimenter bias, all retest results were scored by the principal investigator, who was blinded to the patient’s scores from the first assessment.
Rasch analysis was carried out using the WINSTEPS version 3.68 (Linacre, 2009) software package. The partial credit model was used because (a) there were no reasons to assume an invariant pattern of category difficulty across items and (b) the numbers at hand allowed stable estimates of the many item-specific difficulty parameters of categories. Before the analysis, all items were rescored with 0=0; 0.5=1; and 1=2 as required by the Rasch model.
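The rescoring step and the partial credit model can be illustrated as follows. This is a minimal sketch under stated assumptions rather than a reproduction of the WINSTEPS estimation: the function names and the example step difficulties are hypothetical, and only the category probabilities for given parameters are computed (no parameter estimation).

```python
import math

# Original HVOT item credits mapped to the ordered integer categories
# used for the Rasch analysis (0=0; 0.5=1; 1=2).
RESCORE = {0.0: 0, 0.5: 1, 1.0: 2}

def rescore(raw_credit):
    """Map an item credit of 0, 0.5, or 1 to category 0, 1, or 2."""
    return RESCORE[float(raw_credit)]

def pcm_probabilities(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: person ability in logits.
    thresholds: item-specific step difficulties (one per step) in logits.
    Returns a list of probabilities for categories 0..len(thresholds).
    """
    # Cumulative sums of (theta - step difficulty); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical item with step difficulties of -0.5 and 0.5 logits,
# evaluated for a person located at 0 logits.
probs = pcm_probabilities(theta=0.0, thresholds=[-0.5, 0.5])
```

Because each item carries its own step difficulties, the spacing between categories is free to differ from item to item, which is the property that motivated the choice of the partial credit model over a common rating scale structure.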
Three methods were used to assess the unidimensionality of the Chinese version of the HVOT. First, the match between observed data and a theoretical model was evaluated using fit statistics. The infit and outfit are two commonly used mean square (MNSQ) fit statistics. The outfit MNSQ is not weighted and is sensitive to outliers for either person or item parameters, whereas the infit MNSQ is sensitive to residuals close to the estimated person abilities (Bond and Fox, 2001). If an item fits the Rasch model, the ratio of observed variance to expected variance will be 1.0. Mean-square values considerably less than 1.0 indicate overfit (i.e. less variation than predicted), whereas values in excess of 1.0 indicate underfit (i.e. more variation than expected between the model and the observed scores). The infit coefficient is generally preferred for assessing the item fit with the model. This index emphasizes a series of unexpected responses when difficulty and ability are close to each other and ‘unexpectedness’ may be concealed by the high variance tolerated by the model itself. This index is less susceptible to single influential outliers, which may yield sample-specific, nonreplicable, misfitting responses. Consequently, we used the infit statistic to assess the fit of items to the model. Infit MNSQ values within the range of 0.7–1.3 are indicative of acceptable model fit for the sample size at hand (Wright and Linacre, 1994). Items with fit statistics greater than 1.3 are considered to underfit the model, implying that item responses may not reflect the underlying construct being measured. Items with fit statistics below 0.7 are considered to overfit the model, implying that these items are redundant and contribute little extra information beyond that already provided by other items. Items that either underfit or overfit were potential candidates for item removal.
Successive Rasch analyses were carried out until a final set of items satisfied the model fit requirements.
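The two mean-square statistics and the 0.7–1.3 screening rule can be sketched as below. This is an illustration only: it assumes that the model-expected scores and model variances for each response are already available from a fitted Rasch model, and the function names and toy numbers are hypothetical.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit, outfit) mean squares for one item.

    observed:  observed scores, one per person.
    expected:  model-expected scores for the same persons.
    variances: model variances of those scores.
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Infit: information-weighted, so well-targeted responses dominate.
    infit = sum(squared_residuals) / sum(variances)
    # Outfit: unweighted mean of squared standardized residuals,
    # so off-target outliers can dominate.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(observed)
    return infit, outfit

def classify_infit(infit, low=0.7, high=1.3):
    """Apply the 0.7-1.3 acceptance range used in this study."""
    if infit > high:
        return "underfit"
    if infit < low:
        return "overfit"
    return "acceptable"

# Toy dichotomous data (hypothetical): variance = p * (1 - p).
toy_observed = [1, 0, 1, 1, 0]
toy_expected = [0.8, 0.3, 0.6, 0.9, 0.2]
toy_variances = [p * (1 - p) for p in toy_expected]
toy_infit, toy_outfit = fit_mean_squares(toy_observed, toy_expected, toy_variances)
toy_label = classify_infit(toy_infit)
```

In the toy data every response agrees closely with its expectation, so both mean squares fall well below 1.0 and the item would be flagged as overfitting.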
Second, differential item functioning (DIF) analysis was carried out. DIF is an additional aspect of fit to the Rasch model that has been conceptualized as a form of multidimensionality and can bias scores (Andrich, 1988). The analysis determined whether item responses were independent of interactions between item and person characteristics. That is, given the same level of ability, the expected score on any item should be stable within confidence limits, irrespective of demography (e.g. sex). If an item failed to work in the same way irrespective of which demographic subgroup was being assessed, it was considered to show DIF. In the current study, DIF analyses were carried out for age and education, given the relationship found between these parameters and HVOT performance (Western Psychological Services, 1983; Giannakou and Kosmidis, 2006). The person–item interactions that resulted in a substantive difference of more than 1 logit and a statistically significant difference (P<0.05) were identified as having notable evidence of DIF, taking into consideration the power and type 1 error rate issues raised by Smith (2004). Significant DIF items were considered for deletion as they violate the unidimensionality assumption.
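The dual criterion for DIF (a contrast of more than 1 logit that is also statistically significant) can be expressed as a simple check. This sketch is an assumption-laden illustration: the function name and the example values are hypothetical, and a normal approximation is used for the significance test, which may differ from the test computed by the software used in the study.

```python
import math

def dif_flag(measure_a, se_a, measure_b, se_b, min_contrast=1.0, alpha=0.05):
    """Flag DIF between two subgroups for one item.

    measure_a, measure_b: item difficulty (logits) estimated within each subgroup.
    se_a, se_b: standard errors of those estimates.
    Returns (contrast, p_value, flagged).
    """
    contrast = measure_a - measure_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)
    z = contrast / joint_se
    # Two-sided P value under a normal approximation (an assumption of this sketch).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    flagged = abs(contrast) > min_contrast and p_value < alpha
    return contrast, p_value, flagged

# Hypothetical examples, not study data.
large_and_significant = dif_flag(1.5, 0.2, 0.2, 0.2)
small_but_significant = dif_flag(0.6, 0.1, 0.0, 0.1)
large_but_imprecise = dif_flag(1.5, 1.0, 0.0, 1.0)
```

Requiring both conditions guards against flagging trivially small contrasts that reach significance only because of the large sample, and against flagging large contrasts that are too imprecisely estimated to be trusted.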
Our approach to item reduction was not based solely on statistical grounds, but item content was also examined for the misfitting items before removal from the scale. In this way, we attempted to create a short scale that is appropriate for the Chinese-speaking populations.
After item removal, the dimensionality of the final version of the HVOT was further assessed by principal components analysis of the residuals. The criteria set for scale unidimensionality were that at least 50% of the variance should be explained by the first latent variable and that any additional factor should not explain more than 5% of the remaining variation of the residuals associated with an eigenvalue equal to or less than 1.4 after the removal of the first latent variable (Linacre, 2009).
The reliability of the final version was evaluated by examination of the person separation index (PSI), which is analogous to the Cronbach α (Fisher, 1997). A PSI value of 2.0 with an associated person reliability of 0.80 indicates that the scale can discriminate three strata of person ability, which is the minimum level for a satisfactory scale (Prieto et al., 2003).
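The link between the separation index, person reliability, and the number of distinguishable strata follows standard Rasch relations, sketched here with hypothetical function names: reliability = G²/(1+G²) and strata = (4G+1)/3, where G is the separation index.

```python
def separation_to_reliability(separation):
    """Person reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)

def strata(separation):
    """Number of statistically distinct ability strata: (4G + 1) / 3."""
    return (4 * separation + 1) / 3

# The threshold case described in the text.
threshold_reliability = separation_to_reliability(2.0)
threshold_strata = strata(2.0)
```

A separation of 2.0 gives a reliability of 0.80 and three strata, matching the minimum standard stated above.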
The targeting of the items and persons was assessed by comparing the mean person location with the mean item location (constrained to be zero). An average person measure of zero would represent optimal targeting of the items in terms of their difficulty for the sample. On the basis of the results of simulation studies, scale precision is compromised when the person mean is more than 1.0 logits from the item mean (Curtis and Boman, 2007).
Validation of the Chinese version in schizophrenia and stroke
Internal consistency was measured using the Cronbach α. A scale has been defined as having excellent internal consistency if α>0.9, good if α>0.8, and acceptable if α>0.7 (George and Mallery, 2003). The test–retest reliability was calculated using the ICC with a two-way random effects model that allows for the results to be generalized to testing conditions beyond the one in this study. In accordance with Cicchetti (1994), ICCs are considered excellent if greater than 0.74, good from 0.60 to 0.74, fair from 0.40 to 0.59, and poor if below 0.40. To evaluate the known-groups validity, independent samples t-tests were carried out to compare each patient group with age-matched and education-matched controls on the Rasch-scaled Chinese version.
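For readers who wish to reproduce the reliability coefficient, the ICC(2,1) (two-way random effects, absolute agreement, single measurement) can be computed from the usual ANOVA mean squares. The following is a minimal sketch with hypothetical function names and toy data; it is not the authors' code.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete subjects x occasions (or raters) table.

    ratings: list of rows, one per subject, each with one score per occasion.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

def cicchetti_label(icc):
    """Interpretive bands from Cicchetti (1994) as quoted in the text."""
    if icc > 0.74:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"

# Toy test-retest data (hypothetical): the retest is uniformly 2 points higher.
toy_icc = icc_2_1([[10, 12], [20, 22], [30, 32]])
toy_label = cicchetti_label(toy_icc)
```

Because this form of the ICC measures absolute agreement, the constant 2-point shift in the toy data lowers the coefficient slightly below 1.0 even though the rank order of subjects is perfectly preserved.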
Determination of norm values for the Chinese version of the Hooper Visual Organization Test
An exploratory one-way analysis of variance (ANOVA) found no significant sex differences in scores on the Chinese version of the HVOT [F(1,1007)=0.008, P=0.929]. Therefore, a two-way analysis of variance was carried out to determine whether scores on the Chinese version varied as a function of age and level of education. After the establishment of age-dependent education-related differences, the normative procedure for the Chinese HVOT scores involved the fitting of multiple regression models adjusted for these two variables. To evaluate the stability of the regression model, cross-validation was carried out by dividing the total sample into two subsamples (60 and 40%) by random selection procedures. The prediction equation was created in the first sample. That equation was then used to create predicted scores for the individuals of the second sample. The predicted scores were then correlated with the observed scores on the dependent variable (ryy′), the so-called cross-validity coefficient. The difference between the original R2 and ryy′2 is the shrinkage. A shrinking of variance explanation lower than 10% was adopted as a criterion for sufficient stability (Stineman et al., 1994; Osborne, 2000).
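The cross-validity coefficient and the shrinkage criterion can be sketched as follows. The function names and the toy vectors are hypothetical; the numerical check at the end uses the calibration R2 and the cross-validity coefficient reported in the results of this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between predicted and observed scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def shrinkage(r_squared_calibration, cross_validity_r):
    """Difference between calibration R^2 and the squared cross-validity coefficient."""
    return r_squared_calibration - cross_validity_r ** 2

def is_stable(shrinkage_value, criterion=0.10):
    """Stability criterion adopted in this study: shrinkage below 10%."""
    return shrinkage_value < criterion

# Toy check of the correlation helper (hypothetical data).
toy_r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# Values reported in the results: R^2 = 0.978 and r_yy' = 0.977.
reported_shrinkage = shrinkage(0.978, 0.977)
reported_stable = is_stable(reported_shrinkage)
```

With the reported values the shrinkage is about 0.023 to 0.024 depending on rounding, comfortably below the 10% criterion.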
To determine the optimal cut-off score of the Chinese version, two receiver operating characteristic (ROC) analyses were carried out using HVOT total raw scores and Rasch-derived person ability scores. The two patient groups and subsamples 1 and 2 were combined to form patient and healthy control groups, respectively. In constructing a ROC curve, sensitivity and specificity are determined for each possible cut-off point. In addition, the area under the curve (AUC) was calculated for the ROC curves. AUC represents the overall discriminatory ability of a test, where a value of 1.0 denotes perfect ability and a value of less than 0.5 indicates that a test performs worse than chance. The AUC should be at least 0.7 to be acceptable; AUCs between 0.7 and 0.9 and greater than 0.9 are considered moderate and high, respectively (Fischer et al., 2003). The optimal cut points were obtained from the Youden index [maximum (sensitivity+specificity−1)] and the point on the ROC curve closest to (0, 1), which was calculated as the minimum value of √[(1−sensitivity)²+(1−specificity)²]. Greater accuracy is reflected by a larger Youden index and a smaller distance to (0, 1) (Youden, 1950; Perkins and Schisterman, 2006).
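Both cut-point criteria can be illustrated with a short sketch. The function names and the toy scores are hypothetical, and the sketch assumes that lower scores indicate deficit, so a person is classified as impaired when the score falls below the cut-off.

```python
import math

def roc_points(patient_scores, control_scores):
    """Return (cutoff, sensitivity, specificity) at midpoints between observed scores."""
    values = sorted(set(patient_scores) | set(control_scores))
    cutoffs = [(a + b) / 2 for a, b in zip(values, values[1:])]
    points = []
    for cutoff in cutoffs:
        sensitivity = sum(s < cutoff for s in patient_scores) / len(patient_scores)
        specificity = sum(s > cutoff for s in control_scores) / len(control_scores)
        points.append((cutoff, sensitivity, specificity))
    return points

def youden_index(sensitivity, specificity):
    """Youden index: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1

def distance_to_corner(sensitivity, specificity):
    """Euclidean distance from the ROC point to the ideal corner (0, 1)."""
    return math.sqrt((1 - sensitivity) ** 2 + (1 - specificity) ** 2)

def best_cutoffs(points):
    """Cut-offs chosen by the largest Youden index and the smallest corner distance."""
    by_youden = max(points, key=lambda p: youden_index(p[1], p[2]))
    by_distance = min(points, key=lambda p: distance_to_corner(p[1], p[2]))
    return by_youden[0], by_distance[0]

# Toy scores (hypothetical, not study data).
toy_patients = [10, 12, 15, 18, 19]
toy_controls = [17, 22, 24, 27, 30]
toy_best = best_cutoffs(roc_points(toy_patients, toy_controls))
```

The two criteria often select the same point, as in the toy data, but they can disagree when the ROC curve is asymmetric, which is why both are reported.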
As shown in Table 1, the initial analysis of all 30 items yielded a PSI of 2.25, with a person reliability of 0.84. Item measures ranged from 2.01 ‘shoe’ to −2.22 ‘dog’ logits. Seven items did not fit with model expectations. Four of them (i.e. ‘fish’, ‘table’, ‘hammer’, and ‘basket’) had infit statistics that exceeded 1.3, and are considered to underfit the model. An analysis of these items indicated that visuosynthetic ability might not be involved in the processing of these items because they contained one informative piece that is relevant to the correct solution (Merten et al., 2007). Three items (‘rabbit’, ‘truck’, and ‘cat’) had infit statistics less than 0.7. This implies that the information obtained by these items is redundant and was already provided by other items. Inspection of the items showed that item 22, ‘mouse’, measures ability similar to item 20, ‘cat’, and item 24, ‘rabbit’, that is, the ability to discern animals with whiskers. Similarly, there was another vehicle-related item, ‘sailboat’, other than item 8, ‘truck’. Consequently, these seven items were deleted and the Rasch analysis was rerun. Two additional items ‘dog’ and ‘teakettle’ failed to fit the model (infit statistics <0.7). These items were excluded because of redundancy. Once these items were excluded from the scale, the remaining 21 items all showed good fit to the model, with person reliability being 0.82 (person separation=2.15) (Table 1). The scale reliability was not greatly challenged despite the item deletion, thus confirming that these items were not adding substantial information.
The 21 items were also examined for DIF across subgroups within the healthy sample (age, sex, and education). None of the items showed significant DIF by sex and education. Three items showed significant age DIF. The 30–49-year-old age group was more likely to respond incorrectly to the item ‘airplane’ than those aged 20–29 years. The item ‘scissors’ was significantly easier for the 20–29-year-old age group than for those aged 50–59 years. One further item ‘candle’ was significantly easier than predicted by the model for the youngest age group (15–19 years). These items were deleted because they functioned differently by age. Items were also examined for DIF across subgroups within the combined patient sample. None of the items showed significant DIF by age, sex, and education.
Principal components analysis of residuals after fitting the Rasch model to the final 18-item Chinese version indicated that the variance explained by measures (Rasch dimension) was 88% and the unexplained variance by the first contrast in the residuals was only 0.9% (1.3 eigenvalue units). Taken together, these results suggested that the 18-item version was a valid unidimensional scale, with a PSI of 2.04 and reliability of 0.81.
The person–item map shown in Appendix 1 presents the participants’ scores on the Rasch-calibrated scale (on the left-hand side) and shows the relative difficulty levels of each of the 18 items on the right-hand side. The mean person location parameter was higher than zero (M=0.53 logits, SD=0.90), indicating that the respondents’ mean level of ability was higher than the mean required level of difficulty of the items. In other words, items were mildly targeted on less able participants. There were only 11.1% (112/1008) and 0.3% (3/1008) of participants at the more able and less able end of the spectrum, respectively, whose ability levels fell beyond the range of item locations. The answer sheet and the scoring key of the Chinese version of the HVOT are presented in Appendix 2.
The 18-item Chinese version showed acceptable internal consistency (α=0.78) and excellent 1-week test–retest reliability [ICC(2,1)=0.95] in the stroke group. For patients with schizophrenia, the internal consistency of the Chinese version was acceptable (α=0.71) and test–retest reliability was excellent [ICC(2,1)=0.90] between the baseline and the three-week follow-up.
With respect to known-groups validity, a comparison of 60 stroke patients with 60 healthy controls, matched on age [t(118)=0.02, P=0.99] and level of education [t(118)=0.20, P=0.84], showed that these stroke patients performed significantly worse than controls [t(118)=8.22, P<0.0001]; subsample 1 (M=24.9, SD=7.98), stroke group (M=13.82, SD=6.75). Comparison of schizophrenia patients with healthy controls matched on age [t(118)=−0.04, P=0.97] and education [t(118)=−0.15, P=0.88] also showed that patients with schizophrenia performed worse than their healthy counterparts [t(118)=−8.97, P<0.0001]; subsample 2 (M=28.5, SD=2.79), schizophrenia group (M=21.03, SD=5.82).
Determination of normative data
A two-way ANOVA was carried out to determine the impact of age and education on the scores of the Chinese version of the HVOT. The Scheffe test was used for post-hoc effects. The results showed significant main effects of education [F(1,1007)=60.33, P<0.0001, η2=0.057], and age [F(3,1007)=132.36, P<0.0001, η2=0.284]. Individuals with more years of education (M=23.60, SE=0.25) outperformed those with fewer years of schooling (M=19.98, SE=0.39). The post-hoc analysis showed that the 15–29-year-old age group scored highest on the Chinese version (M=28.66, SE=0.60), followed by the 30–49 (M=24.13, SE=0.44), 50–69 (M=18.92, SE=0.31), and 70–79-year-old age groups (M=15.45, SE=0.47).
Because of the significant interaction between age and education [F(3,1007)=8.81, P<0.0001, η2=0.026], two one-way ANOVAs were carried out to determine the effect of age in each educational group and vice versa. There was a significant difference between the four age groups for those with 0–9 years of education [F(3,1003)=72.19, P<0.0001]. There was also a significant difference between the four age groups for those with 10–16 years of education [F(3,1003)=77.72, P<0.0001]. In addition, significant differences were found between the two educational levels in the 30–49-year-old age group [F(1,1001)=19.18, P<0.0001], the 50–69-year-old age group [F(1,1001)=95.81, P<0.0001], and the 70–79-year-old age group [F(1,1001)=30.66, P<0.0001]. However, there was no significant difference between the two educational levels in the 15–29-year-old age group [F(1,1001)=0.33, P=0.565].
The above analyses indicated that normative data for each of the four age groups should be divided into two educational levels, except for the 15–29-year-old age group. Subsequently, Rasch-calibrated person and item measures for each age×education group were used to compute normalized T-scores. To avoid problems with unstable normative estimates resulting from an unequal sample size in the two education subcategories of the 30–49, 50–69, and 70–79-year-old age groups, multiple regression analysis was carried out using normalized T-scores from the 15–79-year-old age range as the dependent variable. Independent variables included three dummy variables representing the age groups, one dummy variable representing the educational level, and the corresponding raw scores as well as their powers (squares, cubes, and fifth powers). With the exclusion of raw scores squared, this model yielded a multiple correlation (R) of 0.992 (R2=0.983, adjusted R2=0.983), which was statistically significant [F(8,999)=7408.93, P<0.0001] (Table 2). The results of cross-validation showed that the regression model based on the first random sample of 607 participants yielded a multiple correlation of 0.989 (R2=0.978, adjusted R2=0.977), which was statistically significant [F(8,606)=3271.75, P<0.0001] (Table 3). This equation was used in the second group (n=401) to create predicted scores, and those predicted scores correlated ryy′=0.977 with observed scores. With an ryy′2 of 0.954, shrinkage was 0.024 (2.4%), which fulfilled the predetermined criterion (less than 10%).
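As an illustration of what a normalized T-score is, the sketch below converts a set of measures to T-scores through their percentile ranks. This is one common construction and is offered only as an assumption: the exact normalization procedure used in this study is not spelled out in the text, and the function name and toy data are hypothetical.

```python
from statistics import NormalDist

def normalized_t_scores(measures):
    """Rank-based normalized T-scores (mean 50, SD 10 on the normal scale).

    Each measure is converted to a mid-rank percentile within its reference
    group, mapped to a standard normal deviate, and rescaled to T = 50 + 10z.
    """
    n = len(measures)
    standard_normal = NormalDist()
    t_scores = []
    for measure in measures:
        below = sum(other < measure for other in measures)
        equal = sum(other == measure for other in measures)
        # Mid-rank percentile keeps values strictly between 0 and 1.
        percentile = (below + 0.5 * equal) / n
        z = standard_normal.inv_cdf(percentile)
        t_scores.append(50 + 10 * z)
    return t_scores

# Toy person measures in logits (hypothetical, not study data).
toy_t = normalized_t_scores([-1.2, -0.4, 0.5, 1.1, 2.3])
```

In practice the transformation would be applied separately within each age×education reference group, so that a T-score of 50 always denotes the median performance of a person's own normative group.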
The use of regression-based normative data that took age and education into account ensured that the same raw score would correspond to a lower T-score in the groups that had a higher mean (i.e. higher education group or the younger age group) and a higher T-score in the groups that had a lower mean (i.e. lower education group or the older group) as shown in Figs 1 and 2, respectively. Because the steps that are required in this regression-based approach require some active calculation from the user of the normative data, we also calculated simplified normative tables to increase the user-friendliness of the regression-based norms (Table 4).
ROC curve analysis of cut-off points for identifying patients with visuosynthetic deficit is summarized in Table 5. The ROC curve in Fig. 3 was constructed on the basis of the data in Table 4. As expected, the cut-off defined on raw scores and the one defined on Rasch person measures were consistent. The AUC of the Chinese version to distinguish between participants with and without visuosynthetic deficit was 0.84 (95% confidence interval, 0.79–0.89). The optimal cut-off point determined by the shortest distance to point (0, 1) and the Youden index was 21.5 (sensitivity=0.86, specificity=0.68).
The aim of the present study was to cross-culturally adapt the HVOT for use with Chinese-speaking adults and to examine its psychometric qualities. The cross-cultural adaptation process presented no notable problems. Using Rasch analysis, we investigated whether the items of the original version of the HVOT possess the required psychometric characteristics to measure visuosynthetic ability in Chinese adults aged 15–79 years. Our results indicate that although the original HVOT showed good reliability, there were nine misfit items and three items with significant age DIF that would imply deviation from unidimensionality. After the exclusion of these items, the remaining 18 items constituted a unidimensional scale with a satisfactory reliability for person measures (0.81). This Rasch-calibrated version has greater precision in screening patients for mild to moderate deficits in visual organization as 55.6% of the items were clustered between one and two SDs below the person mean in the general population.
The hierarchical order of items appears to be largely similar at the two extreme ends of the continuum between the Chinese HVOT and other versions such as the Greek and German versions (Merten and Beal, 1999; Giannakou and Kosmidis, 2006), with the items ‘shoe’, ‘ring’, ‘key’, and ‘broom’ being the harder items and ‘saw’ and ‘tea pot’ being the easier items. Our results show that the item hierarchy of the Chinese version differed from that of the original HVOT, a finding consistent with the results of other studies (Merten and Beal, 1999; Merten, 2004; Giannakou and Kosmidis, 2006). The Rasch person–item map presented in Appendix 1 may serve as a useful guide for the order of item presentation during test administration.
Although it was slightly lower than those reported previously for neurological patients (Merten, 2002), the internal consistency reliability value was adequate for the Chinese version of the HVOT in stroke patients. Its 1-week test–retest reliability was high. For schizophrenia patients, the Chinese version had acceptable internal consistency and high test–retest reliability over a 3-week interval. To our knowledge, this reliability has not been established previously for schizophrenia. Known-groups comparison indicated that the Chinese version was able to distinguish between healthy controls and two patient groups.
Consistent with previous reports (Merten, 2002; Giannakou and Kosmidis, 2006), increased age and lower education level were related to poorer performance. As a consequence, our normative data were stratified according to these two factors and would be of significant benefit to clinicians and researchers working with Chinese-speaking individuals. The inclusion of normative data providing Rasch-transformed interval scores allows the use of parametric statistics to analyze data for complex effects, such as assessment of change. In addition, the validation analysis supports the generalizability of the findings of the normative regression equation to the Chinese-speaking population represented by the sample. A cut point of 21.5 with a sensitivity of 86% and a specificity of 68% on the basis of both graphical approaches (the Youden Index and the point closest to the upper left corner) was also generated to enhance the interpretation of scores on the Chinese version for patients with stroke or schizophrenia.
The Rasch-scaled Chinese version represents a unidimensional, hierarchical set of items appropriate for application to the Chinese-speaking adult population. The adequate psychometric properties of the Chinese version lend support to its usefulness in screening for visuosynthetic deficit in patients with stroke and schizophrenia. Norms were provided for a broad range of age groups for the composite score, along with the cut-off score. Future research should further investigate its psychometric properties in groups of different diagnostic categories such as traumatic brain injury, Parkinson’s disease, and dementia.
Conflicts of interest
There are no conflicts of interest.
© 2013 Lippincott Williams & Wilkins, Inc.