Increasing Meaning in Measurement: A Rasch Analysis of the Child–Adolescent Teasing Scale
Vessey, Judith A.; DiFazio, Rachel L.; Strout, Tania D.
Judith A. Vessey, PhD, MBA, RN, FAAN, is Lelia Holden Carroll Professor in Nursing, William F. Connell School of Nursing, Boston College, Chestnut Hill, Massachusetts.
Rachel L. DiFazio, RN, PhDc, is Pediatric Nurse Practitioner, Department of Orthopaedics, Children’s Hospital, Boston, Massachusetts.
Tania D. Strout, PhD, RN, is Associate Director of Research, Department of Emergency Medicine, Maine Medical Center, Portland.
Accepted for publication January 12, 2012.
Thank you to the initial CATS research team.
The original study, “Development of the CATS: Child–Adolescent Teasing Scale,” was funded by the National Institutes of Nursing Research (R01 NR 04838). External funding was not received for the secondary data analysis that is the focus of this study.
The authors have no conflicts of interest to disclose.
Corresponding author: Judith A. Vessey, PhD, MBA, RN, FAAN, William F. Connell School of Nursing, Boston College, 140 Commonwealth Ave., Chestnut Hill, MA 02467 (e-mail: firstname.lastname@example.org).
Background: In today’s increasingly violent society, many childhood incidents that begin as simple teasing deteriorate into persistent bullying. The Child–Adolescent Teasing Scale (CATS) was developed to measure self-perceived teasing in youths aged 11–15 years. It was validated initially using the principles of classical test theory and deemed to be a reliable and valid measure of teasing; it has been responsive to change in intervention studies.
Objective: The aim of this study was to evaluate further the psychometric properties of the CATS by evaluating the degree to which the CATS items are congruent with the primary assumptions of the Rasch measurement model.
Methods: A methodological study design using a Rasch Rating Scale Model was utilized to examine the psychometric properties of the 32-item CATS. The sample of the CATS consisted of 666 youths aged 11–15 years from diverse racial and socioeconomic backgrounds and geographic regions. Unidimensionality, hierarchical ordering, and stretching of the variable’s responses along a continuum were examined.
Results: The current CATS subscales do not fit the criteria for the Rasch model. The subscales are not unidimensional or hierarchical and do not exist along a continuum upon which items can be ordered and children can be placed.
Discussion: The divergent results between the classical test theory and Rasch analyses, although not completely surprising, underscore the need for continued refinement of an instrument’s psychometric properties to ensure it is measuring the concept of interest in the way it was intended.
Teasing, bullying, and their sequelae are a major public health problem among the world’s children. The Youth Risk Behavior Survey (Centers for Disease Control, 2011) estimates that approximately 20% of U.S. adolescents report being bullied at school in a given year. Globally, between 20% and 54% of schoolchildren have been reported as being involved with bullying as either a victim or a perpetrator (Jansen, Veenstra, Ormel, Verhulst, & Reijneveld, 2011; Kumpulainen & Rasanen, 2000; Wolke, Woods, Stanford, & Schultz, 2001). Critical deleterious outcomes associated with bullying include low self-esteem, depression, and anxiety (Dao et al., 2006; Kim, Boyce, Koh, & Leventhal, 2009; Kyriakides, Kaloyirou, & Lindsay, 2006; Olweus, 1993; Roland, 2002); unhappiness, social isolation, and an external locus of control (Slee, 1995); community and school violence (Greif & Furlong, 2006; Nansel, Overpeck, Haynie, Ruan, & Scheidt, 2003); and escalation to dating violence, harassment, assault, child abuse, domestic violence, or hate crimes (Colvin, Tobin, Beard, Hagan, & Sprague, 1998; Stein, 2003). Bullying is associated with a wide range of negative psychological, emotional, social, and academic outcomes (Greif & Furlong, 2006; Hawker & Boulton, 2000; Nickerson, Brock, Chang, & O’Malley, 2006).
Although many definitions for bullying exist in the current literature, most contemporary researchers conceptualize bullying as a unique form of peer aggression that includes three core features: repetition over time, an intention to harm the victim, and a perceived imbalance of power between the perpetrator and the victim (Espelage & Swearer, 2003; Greif & Furlong, 2006; Liu & Graves, 2011; Olweus, 1978). An accurate assessment of bullying is critical to identifying its presence, planning interventions, and evaluating outcomes. Several critical concerns regarding the measurement of bullying have been reported (Cornell, Sheras, & Cole, 2006; Felix, Sharkey, Green, Furlong, & Tanigawa, 2011; Furlong, Sharkey, Felix, Tanigawa, & Green, 2010; Greif & Furlong, 2006); two of these concerns are the multidimensional nature of the currently accepted bullying definition and whether published bullying instruments truly measure bullying.
Most current bullying instruments were developed as a means of assessing the prevalence of bullying in schoolchildren, with data being used to garner support for antibullying interventions and to influence policy at a larger level (Felix et al., 2011; Greif & Furlong, 2006). Lacking psychometrically sound instruments developed specifically for use with individuals, researchers used early bullying instruments for assessments more complex than those originally intended (Felix et al., 2011). Only recently have advanced psychometric analyses of such tools been undertaken (Kyriakides et al., 2006; Lee & Cornell, 2010). Currently, further psychometric analysis of existing instruments is an important priority for the advancement of bullying research, particularly intervention research (Felix et al., 2011).
Although numerous bullying instruments are available in the literature (Cheng, Chen, Liu, & Chen, 2011; Felix et al., 2011; Reynolds, 2003; Solberg & Olweus, 2003; Swearer, 2001; Vessey, Horowitz, Carlson, & Duffy, 2008; Vessey, Walker, DiFazio, & Strout, 2012), a complete review of their development and psychometric properties will not be undertaken here. Rather, the focus of this study was on the use of Rasch analysis in evaluating the Child–Adolescent Teasing Scale (CATS).
Kyriakides et al. (2006) conducted a Rasch analysis of a revised version of the widely used Olweus Bully/Victim Questionnaire (OBVQ) with the goals of evaluating reliability, model fit, meaning, and construct validity. The OBVQ was administered to a sample of 335 students aged 11 to 12 years in Cyprus. The authors determined that the instrument was psychometrically sound and that the data collected in their study demonstrated an adequate fit to the specifications of the Rasch measurement model. A second study, conducted by Cheng et al. (2011), utilized the Rasch model in the development of a new series of three school bullying scales: the Bully Scale, the Victim Scale, and the Witness Scale. Using data from two samples (n = 860 and n = 3,941) of Taiwanese secondary school students, the authors reported a good model-to-data fit and evidence for the unidimensionality of the measures.
The CATS was designed initially to detect the amount and bother of teasing in youths aged 11–14 years. Since the inception of the CATS, there have been significant advances in explicating the concepts of teasing and bullying and their relationships. The emphasis has moved from focusing solely on the instigator’s intent and behavior to the meaning of the interaction to the recipient. When the recipient perceives teasing as chronic and hurtful, subjective bullying has occurred, even if the instigator’s intent was not to harm. This contrasts with objective bullying, where the instigator intends to be hurtful and a power differential between instigator and recipient exists. Recipients do not distinguish these conceptual differences, however, and react to both types of bullying in the same way; persistent, bothersome teasing, then, is a proxy measure for bullying. The CATS is one of the few instruments available to measure teasing and has been shown to be psychometrically valid and useful in intervention studies where bullying is the primary construct of interest (Vessey & O’Neill, 2011).
As with most health-related measures, the CATS instrument was developed and initially evaluated using the principles of classical test theory (CTT). Despite wide application in the health sciences, there are several important limitations of the CTT approach, including that the ability estimate of a person is dependent on the difficulty of the items, that the standard error applies equally to all ability levels, and that item discrimination can be too high (Hattie, Jaeger, & Bond, 1999; Masters, 1988; Traub, 1997). An alternative approach that addresses CTT limitations is the one-parameter item response theory (IRT), or Rasch, measurement model.
Item response theory models are a group of statistical models designed to define an underlying construct, such as teasing, by calculating the probability of a specific response from an individual to a specific item (Hambleton, Swaminathan, & Rogers, 1991; Ludlow, Enterline, & Cochran-Smith, 2008). These models are differentiated by the number of parameters, or item characteristics (known as item difficulty, item discrimination, and the item pseudo-guess parameter), that the model takes into account. The one-parameter logistic model, or Rasch model, takes item difficulty into account. Two-parameter models also include item discrimination and three-parameter logistic models add pseudo-guessing (Harris, 1989).
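The relationship among the one-, two-, and three-parameter models can be sketched in a few lines. The following is a minimal illustration only, using the standard logistic forms of these models; the function name and the example values are hypothetical and are not drawn from the CATS analysis.

```python
import math

def irt_probability(theta, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of endorsing (or correctly answering) an item.

    With discrimination=1 and guessing=0 this is the one-parameter
    (Rasch) form; freeing discrimination gives the two-parameter model,
    and freeing guessing as well gives the three-parameter model.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# Hypothetical person (theta = 0.5 logits) and item (difficulty = 0.0 logits).
p_1pl = irt_probability(0.5, 0.0)
p_2pl = irt_probability(0.5, 0.0, discrimination=1.7)
p_3pl = irt_probability(0.5, 0.0, discrimination=1.7, guessing=0.2)
```

In this sketch, a person located exactly at the item difficulty has a probability of .50 under the one-parameter form, which is the sense in which item difficulty alone locates the item on the logit scale.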
The fundamental difference in these types of models lies within their purposes. Two- and three-parameter models are concerned primarily with maximizing the extent to which variations in item responses are accounted for, and therefore, they always offer a better mathematical fit than Rasch models do. Rasch models are used to offer understanding of the variable of interest, specifically understanding what it means or looks like to be a high or low scorer on the instrument or to move along the continuum.
Rasch models describe the probabilistic relationship between item difficulty and person ability and are measured in logits or log-odds (Rasch, 1960). Knowledge of an item’s difficulty and a person’s ability allows the researcher to formulate the probability of the person correctly answering the item or completing the task. There are several important attributes of Rasch models in comparison with the CTT model of measurement: (a) Item statistics are considered to be sample-free, meaning that the instrument can be administered to different individuals and the same difficulty estimates should result for each of the items; (b) examinee estimates are considered to be test-free, meaning that individuals who take different combinations of items should get the same ability estimates; (c) there is a standard error for each item and each person estimate; (d) the models provide an expected value for performance on an item; and (e) they provide a confirmatory test of the instrument’s construct validity. These characteristics address some of the limitations of the CTT model of measurement.
Conceptually, Rasch models assume the construct under study to be unidimensional and stretching across a continuum upon which a hierarchical series of items can be constructed. Unidimensionality means that a single latent trait explains all of the variance in the data, thereby determining if one characteristic of a person is being measured at a time (Bond & Fox, 2007). Hierarchical ordering means that all items are ordered according to their level of difficulty, from easiest to most difficult (Ludlow, 2010). The process of instrument development involves developing a clear idea of the variable, constructing specific items to represent the variable, testing the congruence of the scale solution to an a priori theory, and testing the congruence between the observed and expected responses (Ludlow, 2010).
Measurement in nursing science must be concerned with developing a full understanding of the phenomenon of interest. The focus extends beyond the knowledge of a person’s simple numeric score on an instrument to a fuller understanding of what it means to have a particular score, to go from one score to another, and to move along the continuum of scores. Simply having the best numeric fit does not offer the in-depth understanding of the characteristics of a particular score that Rasch models offer.
The purpose of this study was to examine further the psychometric properties of the CATS by evaluating the degree to which the CATS items are congruent with the primary assumptions of the Rasch measurement model. Of specific interest was testing the assumptions of unidimensionality, hierarchical ordering, and the stretching of the variable’s responses out along a continuum, meaning that the variables stretch from simple to complex along a logit scale. It was hypothesized that teasing, as measured by the 32 items on the CATS instrument, (a) is unidimensional in nature, (b) follows a hierarchical order in the way that the items define the variable, and (c) exists along a continuum upon which the CATS items can be ordered and children experiencing various levels of teasing can be placed.
Design and Data
A methodologic design was used. The study was a secondary analysis of data collected for the primary psychometric evaluation of the CATS using CTT (Vessey et al., 2008). The original CATS data consisted of demographic and item-level CATS score information from 666 children aged 11–15 years (M = 12.6 years). The majority of participants were White (n = 335, 50.3%), female (n = 344, 51.7%), and from diverse socioeconomic backgrounds. Participants were recruited from a mix of metropolitan (n = 4) and rural (n = 3) schools located in New Mexico (n = 276, 41.4%), Massachusetts (n = 237, 35.6%), Montana (n = 56, 8.4%), Mississippi (n = 49, 7.4%), and North Carolina (n = 48, 7.2%). Sample characteristics are described more fully elsewhere (Vessey et al., 2008). For Rasch analysis, a sample size of 108 will give accurate estimates of person and item locations (99% confidence of locations being within 0.5 logits). Therefore, the sample size of 666 used in this study will provide accurate estimations of items and person locations (Linacre, 1994).
The CATS was designed to measure self-perceived teasing and was intended to be used both as an outcome measure in research and as a screening tool in academic or clinical settings. The items were developed from an analysis of focus group responses of middle school students as to what constitutes all aspects of benign and hurtful teasing (Horowitz et al., 2004). The final instrument consists of 32 Likert-scored items representing four domains of teasing: (a) Personality and Behavior Teasing (14 items), (b) Family and Environment Teasing (seven items), (c) School-Related Teasing (nine items), and (d) Teasing About My Body (two items). Each of the 32 items has two separate responses, one for frequency and one for bother, with four response options ranging from never (1) to a whole lot (4). The total score for each item ranges from 1 to 16 and is calculated by multiplying the frequency score by the corresponding bother score. Scoring is done in this manner because the amount and bother components of teasing are synergistic and not additive relative to the amount of distress that can result. Children with higher total CATS scores perceive more teasing than do children with lower total CATS scores.
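The item scoring rule described above can be expressed as a short sketch. The function names are hypothetical, and summing the item scores to obtain a total score is an assumption made here for illustration; the text specifies only the multiplicative rule for each item.

```python
def cats_item_score(frequency, bother):
    """Item score = frequency x bother, each rated 1 (never) to 4 (a whole lot)."""
    for name, value in (("frequency", frequency), ("bother", bother)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be an integer from 1 to 4, got {value!r}")
    return frequency * bother

def cats_total_score(responses):
    """Sum of item scores over (frequency, bother) pairs (assumed aggregation)."""
    return sum(cats_item_score(f, b) for f, b in responses)

# Three hypothetical items: never/never, sometimes with notable bother, maximal.
example_total = cats_total_score([(1, 1), (2, 3), (4, 4)])
```

Because the two ratings are multiplied rather than added, an item rated 4 for frequency and 4 for bother contributes 16 points rather than 8, reflecting the synergistic scoring rationale.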
In the original analysis, the standardized Cronbach’s alpha coefficient of the CATS was .94, with coefficients ranging from .83 to .90 for each of the subscales. Evidence for the validity of the CATS was provided through significant t tests comparing differences for low- and high-scoring CATS groups on the Piers-Harris Children’s Self Concept Scale (t = −11.39, p < .001) and the Pediatric Symptom Checklist (t = −3.41, p < .03). The CATS was found to be responsive to change in an intervention study (Vessey & O’Neill, 2011).
For this analysis, SPSS for Windows v. 18.0 (Chicago, Illinois) and MedCalc v. 18.104.22.168 (Mariakerke, Belgium) statistical software were used. The WINSTEPS v. 3.69.0 software package was used to perform the Rasch analysis (Winsteps, Beaverton, Oregon). Rasch analysis is an iterative process, with analytic decisions being based upon a combination of statistical and clinical considerations (Avery, Russell, Raina, Walter, & Rosenbaum, 2003). The Rasch Rating Scale Model was used to evaluate each of the CATS subscales.
The Rasch model conceptualizes the probability that a person will endorse an item as a logistic function of the difference between the person’s level of the construct of interest (teasing, or θ, also referred to as the latent trait score or person measure) and the level of teasing expressed by the item, δi, or item difficulty (Rasch, 1960). Difficult items, expressed in logits, can be thought of as items that represent a high degree of teasing with a low probability of endorsement, relative to the individual person measure.
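For readers less familiar with the Rating Scale Model, the sketch below computes response category probabilities from a person measure, an item difficulty, and a set of shared step thresholds, following the standard Andrich formulation. It is an illustration under stated assumptions rather than a reproduction of the WINSTEPS estimation: the threshold values are hypothetical, and the text does not state how the item scores were categorized for the analysis.

```python
import math

def rsm_category_probabilities(theta, delta, thresholds):
    """Andrich rating scale model: probability of each response category.

    theta      -- person measure in logits
    delta      -- item difficulty in logits
    thresholds -- step thresholds tau_1..tau_m shared by all items
    Returns a list of m + 1 probabilities for categories 0..m.
    """
    cumulative = [0.0]  # log-numerator for the lowest category
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta - tau))
    peak = max(cumulative)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, delta, thresholds):
    """Model-expected category (0..m) for this person-item pair."""
    probs = rsm_category_probabilities(theta, delta, thresholds)
    return sum(k * p for k, p in enumerate(probs))

# Hypothetical thresholds for a four-category scale (three steps).
taus = [-1.0, 0.0, 1.0]
probs = rsm_category_probabilities(theta=0.0, delta=0.0, thresholds=taus)
```

With a single threshold fixed at zero, the function reduces to the dichotomous logistic form described above, in which the log-odds of endorsement equal the difference between the person measure and the item difficulty.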
Person and item separation are indices of how well the instrument can separate or distinguish persons and items in terms of their latent variable locations (DeAyala, 2008). Item separation indicates how well the sample separates items into different levels of difficulty, whereas person separation indicates how well the items separate the sample into distinct levels of ability (Bond & Fox, 2007). Items must be sufficiently well separated in terms of item difficulty to identify the direction and meaning of the latent scale (Forkmann et al., 2010; Wright & Stone, 2003). A clinically useful set of items should define at least three strata of youths and items (e.g., high, moderate, and low levels of teasing), which are reflected in a separation index of 2.0 and an associated separation reliability of 0.80 (Forkmann et al., 2010; Gauggel et al., 2004; Wright & Stone, 2003). An index of 1.5 allows for the discernment of two strata (high and low); an index of 2.0, three strata (high, moderate, low); and an index of 3.0, four levels of strata (high, above average, below average, and low; Ross & Doyle, 2008). Person and item separation statistics were computed for each CATS subscale.
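The correspondence between a separation index of 2.0, a separation reliability of 0.80, and three strata follows from relations commonly used in the Rasch literature, sketched below. These formulas are not stated in the text and are offered as an assumption about how the thresholds relate; the function names are hypothetical.

```python
def separation_to_reliability(separation):
    """Reliability implied by a separation index G: R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)

def reliability_to_separation(reliability):
    """Inverse relation: G = sqrt(R / (1 - R)), for 0 <= R < 1."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return (reliability / (1.0 - reliability)) ** 0.5

def strata(separation):
    """Number of statistically distinct strata, H = (4G + 1) / 3."""
    return (4.0 * separation + 1.0) / 3.0

# The threshold pairing used in this study: G = 2.0.
threshold_reliability = separation_to_reliability(2.0)
threshold_strata = strata(2.0)
```

Under these relations, a separation index of 2.0 yields a reliability of 0.80 and exactly three strata, whereas indices of 1.5 and 3.0 yield roughly 2.3 and 4.3 strata, consistent with the two and four levels cited above.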
Smith (1993, p. 262) wrote that, “A unique strength of the Rasch model is its requirement that the outcome of any interaction between person and item be solely determined by just two parameters, the ability of the person and the difficulty of the item.” With this in mind, it is possible to examine the fit of the data and the model. Because the Rasch model allows computation of the expected response under the model, it is possible to calculate residuals—the difference between the observed and expected responses. Positive residuals are the result of higher than expected responses and negative residuals result from lower than expected responses. Examining the pattern of residuals for each person-and-item combination allows for identification of misfitting item–person combinations (Bond & Fox, 2007; Ludlow, 1985, 1986).
Fit statistics are used to identify the items or persons whose responses deviate from what was expected. The fit statistics are normalized mean square residuals and are reported in two different ways, infit and outfit. Mean square infit and outfit statistics reflect the amount of variation in the observed data, as compared with that predicted by the Rasch model. Infit statistics are weighted and outfit statistics are based upon the conventional sum of squared standardized residuals. They represent slightly different approaches to assessing the fit of an item. The infit statistic gives more weight to the responses of persons who are closer to the item measure, whereas the outfit statistic is not weighted and is more sensitive to the influence of outlying (more extreme) responses as a result (Bond & Fox, 2007). Infit and outfit mean square values >0.75 and ≤1.3 indicate good fit (Bond & Fox, 2007). Infit and outfit statistics were examined for each CATS subscale.
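The two mean square statistics can be sketched from the residuals as follows. This is a minimal illustration using the standard definitions (outfit as the unweighted mean of squared standardized residuals, infit as the sum of squared residuals divided by the sum of model variances); the function names and example data are hypothetical, and the classification simply applies the >0.75 and ≤1.3 criterion given above.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit, outfit) mean squares for one item (or one person).

    observed  -- observed responses x_n
    expected  -- model-expected responses E_n
    variances -- model variances W_n of each response (all > 0)
    """
    if not (len(observed) == len(expected) == len(variances)) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(observed)
    # Infit: information-weighted, i.e. squared residuals over summed variances.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

def classify_fit(mean_square, lower=0.75, upper=1.3):
    """Good fit if lower < mean square <= upper; otherwise over- or underfit."""
    if mean_square <= lower:
        return "overfit"
    if mean_square > upper:
        return "underfit"
    return "good fit"

# Hypothetical dichotomous responses with variances p * (1 - p).
infit, outfit = fit_mean_squares(
    observed=[1, 0, 1, 1],
    expected=[0.5, 0.5, 0.8, 0.2],
    variances=[0.25, 0.25, 0.16, 0.16],
)
```

In this example the one highly unexpected response (an observed 1 where 0.2 was expected) inflates the outfit value more than the infit value, which illustrates why outfit is described as more sensitive to outlying responses. Under this reading, a mean square of 0.60 would correspond to 40% less variation than the model predicts.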
To assess the dimensionality and related local independence assumptions of the Rasch Rating Scale Model, a principal components factor analysis of the Rasch residuals (PCFA-R) was performed. Unidimensionality requires that a single latent dimension, here teasing, accounts for the common variance observed in the data. Evidence of unidimensionality provides support for local independence because if all CATS items measure the same underlying construct, the construct accounts for all of the relationships among the items and other relationships are highly unlikely. As a result, when the latent dimension of interest is accounted for, no substantial correlations between the scale items remain. The PCFA-R results were interpreted using the recommendations of Linacre (1994): greater than 60% of the variance explained by the primary Rasch dimension, together with less than or equal to 5% of the variance explained by the largest secondary dimension, was considered a good finding. In addition, Linacre (1994) and Chou and Wang (2010) noted that an eigenvalue of 2.0 has the strength of approximately two items, the smallest amount that could be considered an additional dimension; therefore, an eigenvalue of less than or equal to 3 was considered to have only marginal explanatory power (Chou & Wang, 2010; Linacre, 1994).
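The variance criteria can be applied mechanically, as in the sketch below. The function name is hypothetical, and the percentages are those reported for each subscale in the Results; the sketch checks only the two variance thresholds and does not reproduce the principal components analysis itself.

```python
def meets_unidimensionality_criteria(primary_pct, secondary_pct,
                                     primary_min=60.0, secondary_max=5.0):
    """True if the primary Rasch dimension explains more than primary_min
    percent of the variance and the largest secondary dimension explains
    at most secondary_max percent."""
    return primary_pct > primary_min and secondary_pct <= secondary_max

# Percentages of variance reported in the Results for each subscale:
# (primary Rasch dimension, largest secondary dimension).
reported = {
    "Personality and Behavior": (43.5, 5.8),
    "Family and Environment": (43.0, 11.6),
    "School-Related": (44.1, 9.7),
    "Teasing About My Body": (69.8, 0.0),
}
verdicts = {name: meets_unidimensionality_criteria(*pcts)
            for name, pcts in reported.items()}
```

Applied to the reported values, only the Teasing About My Body subscale satisfies both thresholds.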
Each subscale was examined independently for fit to the Rasch Rating Scale Model. Initially, the item separation statistic, person reliability, and Cronbach’s alpha were calculated. It was then determined which items in each subscale were considered to be underfitting or overfitting. Underfitting items have response patterns that are too haphazard and possess too much variation, and overfitting items have responses that are too determined with too little variation (Bond & Fox, 2007). Item-level fit data, denoting specific underfitting and overfitting items, for each of the CATS subscales are provided in Table 1. The residual correlation matrix for each subscale was examined to determine dimensionality and local independence using PCFA-R. Variable maps were used to evaluate the degree to which the CATS items can be arranged in a hierarchical manner and cover the theoretical continuum describing the latent variable, teasing. A summary of the findings is found in Table 2.
Personality and Behavior Teasing Subscale
The Personality and Behavior Teasing subscale of the CATS consists of 14 items. The person separation statistic for this subscale was 1.16 standard deviation (SD) units and person reliability was 0.58 SD units, with each index failing to meet the critical values of 2.0 and 0.80, respectively. The item separation statistic was 2.90 SD units and the item reliability was 0.89 SD units; both values met the minimum thresholds. Cronbach’s alpha, .90, was excellent (Burns & Grove, 2009; Nunnally & Bernstein, 1994). The log-likelihood chi-square statistic, an approximate value of the global fit statistic, was 16,752.49 (df = 7,186, p < .0001) for this subscale, indicating significant departure from the ideal Rasch model.
Examination of this subscale’s items for fit to the Rasch model revealed four underfitting items and four overfitting items. The PCFA-R indicated that the primary Rasch dimension for teasing explained 43.5% of the variance (eigenvalue = 10.8). The largest potential secondary dimension explained 5.8% of the variance (eigenvalue = 1.4). These findings do not meet Linacre’s (2009) standards, meaning that the Personality and Behavior Teasing CATS subscale likely demonstrates multidimensionality and local dependence.
A variable map, or graphical representation of subjects and scale items located along a common continuum, for the Personality and Behavior Teasing subscale is displayed in Figure 1. An examination of the subjects’ side of the variable map indicates that the majority of participants experienced levels of personality and behavior-related teasing that were too low to be detected by the subscale items; in general, the subjects appear below the items. Inspection of the item side of the map reveals that the items are clustered together, extending across the horizontal, rather than vertical, plane. This pattern indicates that these current subscale items generally provide information on a limited range of teasing experience severity and that some of the items may be redundant. In particular, there is a large gap on the low end of the teasing experience continuum that is not currently covered by these items.
Family and Environment Teasing Subscale
The Family and Environment Teasing subscale consists of seven items. The person separation statistic was 0 SD units and person reliability was 0 SD units, with each failing to meet the critical values. The item separation statistic was 1.72 SD units and the item reliability was 0.75 SD units; both values also failed to meet the minimum thresholds. These values indicate that this subscale currently cannot separate subjects in terms of their locations on the latent variable. Cronbach’s alpha, .83, was good. The log-likelihood chi-square for this subscale was 4,005.27 (df = 1,784, p < .0001), again indicating significant misfit when compared with the ideal of the Rasch model.
Examination of this subscale’s items for fit to the Rasch model revealed two overfitting items. All other subscale items met Rasch fit criteria. The PCFA-R revealed that the primary Rasch dimension for teasing explained 43.0% of the variance (eigenvalue = 5.3). The largest potential secondary dimension explained 11.6% of the variance (eigenvalue = 1.4). These findings do not meet Linacre’s (1994) standards, providing evidence that this subscale demonstrates multidimensionality and local dependence.
Examination of the subjects’ side of the variable map indicates that the majority of participants experienced levels of family and environment-related teasing that were too low to be detected by the subscale items; in general, the subjects appear below the items (Figure 2). Inspection of the item side of the map reveals that the items are clustered together. This pattern indicates that the current Family and Environment Teasing subscale items generally provide information on a limited range of teasing experience severity. In particular, there is a large gap on the low end of the continuum that is not currently covered by the subscale items.
School-Related Teasing Subscale
The School-Related Teasing subscale of the CATS consists of nine items. The person separation statistic was 0.62 SD units and person reliability was 0.28 SD units, with each failing to meet the minimum threshold values. The item separation statistic was 1.02 SD units and the item reliability was 0.51 SD units; both values also failed to meet the critical standards. Cronbach’s alpha, .85, was good. For the School-Related Teasing subscale, significant misfit was noted when the subscale was compared with the Rasch ideal (log-likelihood χ2 = 9,534.40, df = 4,014, p < .0001).
All subscale items adhered to the infit mean square criteria of >0.75 and ≤1.3. Examination of the outfit mean square values identified two overfitting items, indicating that there was 40% and 56% less variation, respectively, in the observed responses than was predicted by the model. There was little variation in the responses to the items, perhaps because teasing regarding school performance and schoolwork is very common, yielding a Guttman-like response pattern. Examination of the standardized infit and outfit values, which take sample size into account, confirmed the overfit findings. A PCFA-R demonstrated that the primary Rasch dimension for teasing explained 44.1% of the variance (eigenvalue = 7.1). The largest potential secondary dimension explained 9.7% of the variance (eigenvalue = 1.6). These findings do not meet Linacre’s standards, providing support that the School-Related Teasing CATS subscale likely demonstrates multidimensionality and local dependence, although the potential secondary dimension in the data would have only marginal explanatory power, as evidenced by an eigenvalue of 1.6.
Examination of the subject side of the variable map depicted in Figure 3 indicates that the majority of participants experienced levels of school-related teasing that were too low to be detected by the subscale items; as previously, the subjects appear below the items. The item side of the map reveals that the items are clustered together, extending horizontally rather than vertically. This pattern indicates that these current subscale items generally provide information on the same severity, or degree, of teasing of this nature.
Teasing About My Body Subscale
The Teasing About My Body subscale of the CATS consists of two items. The person separation statistic was 0.95 SD units and person reliability was 0.48 SD units; each of these indices failed to meet the critical values of 2.0 and 0.80, respectively. The item separation statistic was 2.54 SD units, and the item reliability was 0.87 SD units. These values met Linacre’s (1994) minimum threshold values. Cronbach’s alpha, .84, was high. Significant deviation from the Rasch model ideal was noted for this subscale (log-likelihood χ2 = 1,607.03, df = 262, p < .0001). All subscale items adhered to the infit and outfit mean square criteria. Examination of the standardized outfit mean square values revealed one overfitting item, my weight, indicating that there was less variation in the observed responses than was predicted by the model.
A PCFA-R of the Teasing About My Body subscale residuals revealed that the primary Rasch dimension for teasing explained 69.8% of the variance (eigenvalue = 4.6). The largest potential secondary dimension explained none of the variance. These findings meet the standards proposed by Linacre (1994), providing support for the Teasing About My Body CATS subscale as demonstrating unidimensionality and local independence, with no identifiable secondary dimension present in the data.
Examination of the subject side of the variable map indicates that the participants experienced all levels of body-related teasing, ranging from little to no teasing at the very bottom of the map to a great deal of bothersome teasing at the top of the map (Figure 4). Inspection of the item side of the map shows that the items are clustered together, covering only a small portion of the range of body-related teasing experienced by the subjects. Large gaps, indicated by the brackets, reveal areas where additional items are required to span the continuum of teasing reported by the participants.
The current CATS subscales do not fit the specifications of the Rasch model. In view of the study hypotheses, it was demonstrated that the subscales (a) are not unidimensional in nature, (b) do not follow a strictly hierarchical order in the way that the items define the variable, and (c) do not exist along a continuum upon which the CATS items can be ordered and children experiencing various levels of teasing can be placed. Although disappointing, these findings are not completely surprising given that the CATS was developed using the principles of CTT, rather than IRT. When testing for unidimensionality, of the 32 CATS items, 22 met the thresholds (>0.75 and ≤1.3), indicating a good fit to the Rasch model. Of the 10 misfitting items, seven were overfits, indicating a response pattern that is too determined. Because the Rasch model is a probabilistic (or stochastic) model, it fundamentally assumes that a perfect, Guttman-like response pattern is unrealistic. The Rasch model assumes a degree of unpredictability around an individual’s level of ability (Bond & Fox, 2007). However, particular aspects of human development often do occur in predictable patterns. It is quite possible that the overfitting items detected here are identifying a common pattern in the experience of bullying, suggesting, for example, that it is very commonplace for children to be bullied regarding their weight, their school performance, and their schoolwork.
Only three underfitting items were identified, indicating haphazard responses with too much variation. Underfitting items can degrade the quality of a Rasch measure, and careful consideration of the benefits versus consequences of such items is warranted (Bond & Fox, 2007). The three underfits all were within the Personality and Behavior Teasing subscale, which contains an additional 11 items. Principal components analyses of subscale residuals supported unidimensionality and local independence for only the Teasing About My Body subscale.
A second hypothesis of the Rasch model is that items measuring a single attribute can be ordered in a hierarchical manner as they define the variable of interest (Bond & Fox, 2007; Ludlow, 2010). The hierarchy of subscale items tended to be clustered together along the horizontal plane of the variable maps, indicating that the items generally measure the same degree of bullying. In essence, this pattern demonstrates that the current CATS items are detecting a high degree of bullying rather than differentiating among children with low to high levels of bullying experience. This is not an unanticipated finding; the original intent of the CATS developers was to identify children who were experiencing problematic teasing or bullying.
The third hypothesis evaluated was the degree to which the current CATS items stretch along the theoretical continuum of bullying. Person separation statistics for each of the subscales provide evidence of inadequate spread along the theoretical continuum of bullying, meaning that this sample did not report the full range of teasing experiences. In addition, only the Personality and Behavior Teasing and Teasing About My Body subscales produced item separation statistics indicating adequate spread. This can be interpreted as meaning that children with varied levels of bullying may not be differentiated adequately using the other two CATS subscales.
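For readers unfamiliar with separation statistics, the following sketch shows one common way a separation index and its associated reliability are computed from estimated measures and their standard errors. It assumes the usual definition (error-adjusted "true" spread divided by root-mean-square error) and uses the sample variance; the software used in the study may differ in such details, and the function name and input values are hypothetical.

```python
import math


def separation(measures, standard_errors):
    """Return (separation index, reliability) for a set of Rasch measures.

    Assumed definitions:
      true variance = observed variance - mean squared standard error
      separation    = sqrt(true variance) / root-mean-square error
      reliability   = true variance / observed variance
    """
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    mse = sum(se ** 2 for se in standard_errors) / n
    true_var = max(observed_var - mse, 0.0)  # guard against negative estimates
    index = math.sqrt(true_var / mse)
    reliability = true_var / observed_var if observed_var > 0 else 0.0
    return index, reliability


# Invented logit measures and standard errors, for illustration only.
g, r = separation([-2.0, -1.0, 0.0, 1.0, 2.0], [0.5] * 5)
print(round(g, 2), round(r, 2))
```

The same function applies to persons or to items; a low value indicates that the estimates are not spread widely enough, relative to their measurement error, to distinguish several levels along the continuum.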
Two articles were identified as reporting the results of Rasch analyses of bullying instruments. Kyriakides et al. (2006) reported a good fit to the Rasch model for each scale of the revised OBVQ. They also noted that generation of additional, more difficult to endorse items would allow for better coverage of the entire continuum of the constructs under study. Although these results are similar to those observed in the current investigation, it is interesting to note that additional revised OBVQ items are desirable on the more difficult to endorse (more bullying experience) end of the continuum, whereas additional items on the easier to endorse (less bullying experience) end of the continuum are desirable for the current version of the CATS instrument. Ultimately, in each case, the development and psychometric evaluation of additional items are warranted.
Cheng et al. (2011) reported on their use of the Rasch partial credit model in the development of three school bullying scales: the Bully Scale, the Victim Scale, and the Witness Scale. Their focus was on the development of new scales, in accordance with Rasch principles; the authors reported good model-to-data fit and strong evidence for unidimensionality. As with the current CATS analysis, many of the subjects reported having little experience with bullying, being bullied, or witnessing bullying behavior, which resulted in variable maps similar to those obtained in analyzing the CATS subscales. As with the CATS, additional items representing lower levels of experience with bullying would enhance the ability of the scales to differentiate among children.
Several important limitations should be considered when interpreting these findings. First, the original CATS sample consisted of children with a broad range of experiences with teasing, but many of the participants did not experience high levels of teasing. As a result, many of these children can be seen at the bottom of the variable maps (Figures 1–4). Although this is consistent with the original intent of the CATS developers, it illustrates the need for additional items to differentiate between children.
Second, despite a rigorous informed consent and assent process, children in the original CATS sample may have been reluctant to report their teasing experiences fully. Research has demonstrated that most middle-school-age children historically have underreported negative teasing and bullying experiences (Dao et al., 2006; Fekkes, Pijpers, & Verloove-Vanhorick, 2005; Hunter & Borg, 2006). As a result, teasing experiences may have been underrepresented in the original CATS sample.
These findings illustrate the differences in the classical test and IRT measurement models. Although CTT is used widely in nursing science, it can be complemented by IRT models to develop instruments that can assist clinicians to understand more fully the nuances of the constructs being measured. The CATS was designed to identify youths who viewed their peer interactions as highly problematic; it is of limited utility, however, for clinicians interested in differentiating between children experiencing low levels of bullying.
The results provide a roadmap for the development of a revised instrument that better covers the continuum of teasing and bullying and captures the importance of the meaning of the interactions between recipients and instigators. The plan to create the CATS was conceived prior to the groundswell of academic attention paid to concept and theory development regarding teasing, bullying, and the need to focus on the meaning of the dyadic relationship rather than just on the victim’s response. In light of societal changes, the need is great for psychometrically valid instruments that more completely explicate how youths interpret the full continuum of teasing and bullying experiences. Most states now mandate that primary prevention antibullying programs be included in school curricula, increasing the need to accurately measure changes in perceived bullying (National Association of State Boards of Education, n.d.).
Future work with the current CATS will focus on the generation and evaluation of additional items aimed at more fully covering the continuum of bullying experiences. In addition, work will be undertaken to eliminate items that measure the same degree of bullying, creating a more succinct instrument amenable to use in varied clinical settings. Future intervention studies would allow for comparison of preintervention and postintervention variable maps and the creation of pre–post item estimate plots that would facilitate an evaluation of the instrument’s invariance. Finally, the development of an updated scoring system for the CATS may allow for additional insight into the meaning of an individual child’s CATS score or into the significance of a change up or down along the range of possible scores.
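A pre–post invariance check of the kind mentioned above could be sketched as follows. This is one common approach, assumed here for illustration rather than taken from the study: each item’s preintervention and postintervention estimates are compared using the difference divided by the combined standard error, and items beyond a conventional cutoff of about 2 are flagged. The function names, the cutoff, and the data are hypothetical.

```python
import math


def standardized_shift(pre, pre_se, post, post_se):
    """Difference between two item estimates in combined standard-error units."""
    return (pre - post) / math.sqrt(pre_se ** 2 + post_se ** 2)


def flag_drifting_items(pre_items, post_items, critical=2.0):
    """Return names of items whose estimates shift by more than `critical` SEs.

    Both arguments map item name -> (estimate in logits, standard error).
    """
    flagged = []
    for name, (pre, pre_se) in pre_items.items():
        post, post_se = post_items[name]
        if abs(standardized_shift(pre, pre_se, post, post_se)) > critical:
            flagged.append(name)
    return flagged


# Invented estimates for illustration only.
pre = {"item_a": (0.0, 0.1), "item_b": (1.0, 0.2)}
post = {"item_a": (0.1, 0.1), "item_b": (0.0, 0.2)}
print(flag_drifting_items(pre, post))
```

Items that are not flagged would fall inside the control lines of a pre–post item estimate plot; flagged items are candidates for closer review rather than automatic removal.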
Rasch analysis can be used in nursing science as a complement to traditional CTT methods. Rasch models can help guide the initial theoretical conceptualization of the constructs measured, providing a deeper understanding of the meaning of the construct of interest. As the definitions and dimensions of many of the psychosocial constructs of interest to nursing practice and research, such as teasing and bullying, continue to change with the development of new knowledge, Rasch analysis may also be used to evaluate and refine existing instruments, as demonstrated here. In sum, the adjunctive use of Rasch analysis will allow nurse scientists to develop psychometrically sound and clinically useful instruments with which to improve outcomes for patients.
Avery L. M., Russell D. J., Raina P. S., Walter S. D., Rosenbaum P. L. (2003). Rasch analysis of the Gross Motor Function Measure: Validating the assumptions of the Rasch model to create an interval-level measure. Archives of Physical Medicine and Rehabilitation, 84, 697–705. doi: 10.1016/S0003-9993(03)04896-7.
Bond T. G., Fox C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Burns N., Grove S. K. (2009). The concepts of measurement. In Burns N., Grove S. K. (Eds.), The practice of nursing research: Appraisal, synthesis, and generation of evidence (pp. 371–394, 6th ed.). St. Louis, MO: Elsevier Saunders.
Cheng Y. Y., Chen L. M., Liu K. S., Chen Y. L. (2011). Development and psychometric evaluation of the School Bullying Scales: A Rasch measurement approach. Educational and Psychological Measurement, 71, 200–216. doi: 10.1177/0013164410387387.
Chou Y. T., Wang W. C. (2010). Checking dimensionality in item response models with principal component analysis on standardized residuals. Educational and Psychological Measurement, 70, 717–731. doi: 10.1177/0013164410379322.
Colvin G., Tobin T., Beard K., Hagan S., Sprague J. (1998). The school bully: Assessing the problem, developing interventions, and future research directions. Journal of Behavioral Education, 8, 293–319.
Cornell D. G., Sheras P. L., Cole J. C. (2006). Assessment of bullying. In Jimerson S. R., Furlong M. J. (Eds.), Handbook of school violence and school safety (pp. 587–602). Mahwah, NJ: Erlbaum.
Dao T. K., Kerbs J. J., Rollin S. A., Potts I., Gutierriz R., Choi K., Prevatt F. (2006). The association between bullying dynamics and psychological distress. Journal of Adolescent Health, 39, 277–282. doi: 10.1016/j.jadohealth.2005.11.001.
DeAyala R. J. (2008). The theory and practice of item response theory. New York, NY: Guilford Press.
Espelage D. L., Swearer S. M. (2003). Research on school bullying and victimization: What have we learned and where do we go from here? School Psychology Review, 32, 365–383.
Fekkes M., Pijpers F. I. M., Verloove-Vanhorick S. P. (2005). Bullying: Who does what, when and where? Involvement of children, teachers and parents in bullying behavior. Health Education Research, 20, 81–91. doi: 10.1093/her/cyg100.
Felix E. D., Sharkey J. D., Green J. G., Furlong M. J., Tanigawa D. (2011). Getting precise and pragmatic about the assessment of bullying: The development of the California Bullying Victimization Scale. Aggressive Behavior, 37, 234–247. doi: 10.1002/ab.20389.
Forkmann T., Boecker M., Wirtz M., Glaesmer H., Brahler E., Norra C., Gauggel S. (2010). Validation of the Rasch-based Depression Screening in a large scale German general population sample. Health and Quality of Life Outcomes, 8, 105. doi: 10.1186/1477-7525-8-105.
Furlong M. J., Sharkey J. D., Felix E. D., Tanigawa D., Green J. G. (2010). Bullying assessment: A call for increased precision of self-reporting procedures. In Jimerson S. R., Swearer S. M., Espelage D. L. (Eds.), Handbook of bullying in schools: An international perspective (pp. 329–345). New York, NY: Routledge.
Gauggel S., Boecker M., Heinemann A. W., Lammler G., Borchelt M., Steinhagen-Thiessen E. (2004). Patient–staff agreement on Barthel Index Scores at admission and discharge in a sample of elderly stroke patients. Rehabilitation Psychology, 49, 21–27. doi: 10.1037/0090-5550.49.21.
Greif J. L., Furlong M. J. (2006). The assessment of school bullying: Using theory to inform practice. Journal of School Violence, 5, 33–50. doi: 10.1300/J202v05n03_04.
Hambleton R. K., Swaminathan H., Rogers H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Harris D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8, 35–41.
Hattie J., Jaeger R. M., Bond L. (1999). Persistent methodological questions in educational testing. Review of Research in Education, 24, 393–446.
Hawker D. S. J., Boulton M. J. (2000). Twenty years’ research on peer victimization and psychosocial maladjustment: A meta-analytic review of cross-sectional studies. Journal of Child Psychology and Psychiatry, 41, 441–455.
Horowitz J. A., Vessey J. A., Carlson K. L., Bradley J. F., Montoya C., McCullough B., David J. (2004). Teasing and bullying experiences of middle school students. Journal of the American Psychiatric Nurses Association, 10, 165–172. doi: 10.1177/1078390304267862.
Hunter S. C., Borg M. (2006). The influence of emotional reaction on help seeking by victims of school bullying. Educational Psychology, 26, 813–826. doi: 10.1080/01443410600941946.
Jansen D. E. M. C., Veenstra R., Ormel J., Verhulst F. C., Reijneveld S. A. (2011). Early risk factors for being a bully, victim, or bully/victim in late elementary and early secondary education. The longitudinal TRAILS study. BMC Public Health, 11, 440. doi: 10.1186/1471-2458-11-440.
Kim Y. S., Boyce W. T., Koh Y., Leventhal B. L. (2009). Time trends, trajectories, and demographic predictors of bullying: A prospective study in Korean adolescents. Journal of Adolescent Health, 45, 360–367. doi: 10.1016/j.jadohealth.2009.02.005.
Kumpulainen K., Rasanen E. (2000). Children involved in bullying at elementary school age: Their psychiatric symptoms and deviance in adolescence. An epidemiological sample. Child Abuse & Neglect, 24, 1567–1577. doi: 10.1016/S0145-2134(00)00210-6.
Kyriakides L., Kaloyirou C., Lindsay G. (2006). An analysis of the Revised Olweus Bully/Victim Questionnaire using the Rasch measurement model. British Journal of Educational Psychology, 76, 781–801. doi: 10.1348/000709905X53499.
Lee T., Cornell D. (2010). Concurrent validity of the Olweus Bully/Victim Questionnaire. Journal of School Violence, 9, 1–18. doi: 10.1080/15388220903185613.
Linacre J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 28.
Linacre J. M. (2009). Winsteps (Version 3.69.0) [Computer Software]. Beaverton, OR: Winsteps.com. Retrieved from http://www.winsteps.com
Liu J., Graves N. (2011). Childhood bullying: A review of constructs, concepts, and nursing implications. Public Health Nursing, 28, 556–568. doi: 10.1111/j.1525-1446.2011.00972.x.
Ludlow L. H. (1985). A strategy for the graphical representation of Rasch model residuals. Educational and Psychological Measurement, 45, 851–859. doi: 10.1177/0013164485454015.
Ludlow L. H. (1986). Graphical analysis of item response theory residuals. Applied Psychological Measurement, 10, 217–229. doi: 10.1177/014662168601000301.
Ludlow L. H. (2010). Classical test theory, item response theory, and Rasch measurement principles: Transforming psychometric theory into practice. Chestnut Hill, MA: Larry H. Ludlow.
Ludlow L. H., Enterline S. E., Cochran-Smith M. (2008). Learning to Teach for Social Justice-Beliefs Scale: An application of Rasch measurement principles. Measurement and Evaluation in Counseling and Development, 40, 194–214.
Masters G. (1988). Item discrimination: When more is worse. Journal of Educational Measurement, 25, 15–29. doi: 10.1111/j.1745-3984.1988.tb00288.x.
Nansel T. R., Overpeck M. D., Haynie D. L., Ruan W. J., Scheidt P. C. (2003). Relationships between bullying and violence among US youth. Archives of Pediatrics & Adolescent Medicine, 157, 348–353.
Nickerson A. B., Brock S. E., Chang Y., O’Malley M. (2006). Responding to children victimized by their peers. Journal of School Violence, 5, 19–32. doi: 10.1300/J202v05n03_03.
Nunnally J., Bernstein I. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw Hill.
Olweus D. (1978). Aggression in the schools: Bullies and whipping boys. Washington, DC: Hemisphere.
Olweus D. (1993). Bullying at school. What we know and what we can do? Oxford, England: Blackwell.
Rasch G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press.
Reynolds W. (2003). Reynolds Bully Victimization Scales. San Antonio, TX: The Psychological Corporation, Harcourt Assessment.
Roland E. (2002). Bullying, depressive symptoms and suicidal thoughts. Educational Research, 44, 55–67. doi: 10.1080/00131880110107351.
Slee P. (1995). Peer victimization and its relationship to depression among Australian primary school students. Personality and Individual Differences, 18, 57–62. doi: 10.1016/0191-8869(94)00114-8.
Smith R. M. (1993). Guessing and the Rasch model. Rasch Measurement Transactions, 6, 262–263.
Solberg M. E., Olweus D. (2003). Prevalence estimation of school bullying with the Olweus Bully/Victim Questionnaire. Aggressive Behavior, 29, 239–268. doi: 10.1002/ab.10047.
Stein N. (2003). Bullying or sexual harassment? The missing discourse of rights in an era of zero tolerance. Arizona Law Review, 45, 783–799.
Swearer S. M. (2001). The bully survey. Unpublished manuscript, The University of Nebraska-Lincoln.
Traub R. E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16, 8–14. doi: 10.1111/j.1745-3992.1997.tb00603.x.
Vessey J. A., Horowitz J. A., Carlson K. L., Duffy M. (2008). Psychometric evaluation of the Child–Adolescent Teasing Scale. Journal of School Health, 78, 344–350. doi: 10.1111/j.1746-1561.2008.00312.x.
Vessey J. A., O’Neill K. (2011). Helping students with disabilities better address teasing and bullying situations: A MASNRN study. Journal of School Nursing, 27, 139–148. doi: 10.1177/1059840510386490.
Vessey J. A., Walker A., DiFazio R. L., Strout T. D. (2012). Measuring schoolyard bullying: A systematic review. Manuscript submitted for publication.
Wolke D., Woods S., Stanford K., Schultz H. (2001). Bullying and victimization of primary school children in England and Germany: Prevalence and school factors. British Journal of Psychology, 92, 673–696.
Wright B. D., Stone M. H. (2003). Five steps to science: Observing, scoring, measuring, analyzing, and applying. Rasch Measurement Transactions, 17, 912–913.
bullying; children; psychometrics
© 2012 Lippincott Williams & Wilkins, Inc.