# Psychometric Evaluation and Calibration of Health-Related Quality of Life Item Banks: Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS)

Background: The construction and evaluation of item banks to measure unidimensional constructs of health-related quality of life (HRQOL) is a fundamental objective of the Patient-Reported Outcomes Measurement Information System (PROMIS) project.

Objectives: Item banks will be used as the foundation for developing short-form instruments and enabling computerized adaptive testing. The PROMIS Steering Committee selected 5 HRQOL domains for initial focus: physical functioning, fatigue, pain, emotional distress, and social role participation. This report provides an overview of the methods used in the PROMIS item analyses and proposed calibration of item banks.

Analyses: Analyses include evaluation of data quality (eg, logic and range checking, spread of response distribution within an item), descriptive statistics (eg, frequencies, means), item response theory model assumptions (unidimensionality, local independence, monotonicity), model fit, differential item functioning, and item calibration for banking.

Recommendations: Summarized are key analytic issues; recommendations are provided for future evaluations of item banks in HRQOL assessment.

From the *National Cancer Institute, NIH, Bethesda, Maryland; †UCLA Division of General Internal Medicine & Health Services Research, Los Angeles, California; ‡QualityMetric Inc., Lincoln, Rhode Island, and the Health Assessment Lab, Waltham, Massachusetts; §University of Washington, Seattle; ¶Columbia University Stroud Center and Faculty of Medicine; New York State Psychiatric Institute, and Research Division, Hebrew Home for the Aged in Riverdale, New York, New York; ∥Psychology Department, University of North Carolina at Chapel Hill; **Center for Health Outcomes Research, United BioSource Corporation, Bethesda, Maryland; ††Psychology Department, University of Minnesota, Minneapolis; ‡‡Center for Educational Assessment, University of Massachusetts at Amherst; and §§Northwestern University Feinberg School of Medicine and Evanston Northwestern Healthcare, Evanston, Illinois.

Preparation of this work by non-NIH employees was supported by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AG015815), PROMIS Project.

Reprints: Bryce B. Reeve, PhD, Outcomes Research Branch, National Cancer Institute, NIH, EPN 4005, 6130 Executive Blvd. MSC 7344, Bethesda, MD. 20892-7344. E-mail: reeveb@mail.nih.gov.

The Patient-Reported Outcomes Measurement Information System (PROMIS) project provides a unique opportunity to use advanced psychometric methods to construct, analyze and refine item banks, from which improved patient-reported outcome (PRO) instruments can be developed.^{1,2} PRO measures include instruments that measure domains like health-related quality of life (HRQOL) and satisfaction with medical care. Presented in this report are the methodological considerations for analyzing both existing data from a number of sources and new data to be collected by the PROMIS. These methods and approaches were adopted by the PROMIS network. The PROMIS project will produce item banks that will be used for both computerized-adaptive testing (CAT)^{3} and nonadaptive (ie, fixed length) assessment of HRQOL domains including pain, fatigue, emotional distress, physical functioning, and social-role participation as the initial focus.

In the beginning, PROMIS investigators identified available datasets containing more than 50,000 respondents (n >1000 per dataset) and multi-item PRO responses in cancer, heart disease, HIV disease, diabetes, gastrointestinal disorders, hepatitis C, mental health, and other chronic health conditions. Results from analyses of these datasets were used to refine the proposed methods and offer candidate item banks before the development of the PROMIS item banks. In particular, secondary data analyses allowed PROMIS investigators to examine the dimensionality of domains; identify candidate items that represent the domains of interest; and evaluate the optimal number of response categories to field in the PROMIS data collection phase. The secondary analyses also allowed PROMIS researchers to anticipate psychometric challenges in developing the PROMIS item banks. For example, analyses suggested substantial floor and/or ceiling effects for many domains, which underscored the importance of identifying items that discriminated well at very low and at very high levels of the traits being measured. The same psychometric considerations apply to analysis of newly collected PROMIS data: confirm assumptions about dimensionality of the data; examine item properties; test for differential item functioning (DIF)^{4} across sociodemographic or clinical groups; and calibrate the items for CAT and short forms.

Because researchers recognize the many challenges in analyzing HRQOL data, the plan provides flexibility with respect to the methods used to explore psychometric properties. Some methods were identified as primary and others as exploratory. The results obtained using exploratory methods will be evaluated based on whether they add substantively to the results obtained using the primary methods. Examples of applying the methods discussed in this analysis plan can be found in articles included in this supplement.^{5,6} Further, the field of psychometrics for measuring PROs is evolving, both in terms of methods development (eg, recent advances in the bi-factor model and full information factor analysis for polytomous response items, discussed later in this article) and application (eg, advances in electronic-PRO assessment). As the state of the measurement field changes, the PROMIS network will adapt its analytic plans.

## PROMIS DATA COLLECTION AND SAMPLING PLAN

From July 2006 to March 2007, the PROMIS research sites collected data from the US general population (∼n = 7523) and multiple disease populations including those with cancer (∼n = 1000), heart disease (∼n = 500), rheumatoid arthritis (∼n = 500), osteoarthritis (∼n = 500), psychiatric conditions (∼n = 500), spinal cord injury (∼n = 500), and chronic obstructive pulmonary disease (∼n = 500). The general population sample will be constructed to ensure adequate representation with respect to key demographic characteristics such as gender (50% each), age (20% of each age group in years: 18–29, 30–44, 45–59, 60–74, 75+), ethnicity (12% black, 12% Hispanic), and education (25% with high school education or less). A health condition checklist will also be included in the assessment. Beyond demographic and clinical characteristics, PRO data in the areas of pain, fatigue, emotional distress, physical functioning, and social-role participation will be collected for inclusion in item banks developed within the PROMIS network. All candidate items for the PROMIS item banks have been thoroughly examined using qualitative methods such as cognitive testing and expert item review panels.^{7} The first wave of data are being collected via a computer or laptop linked to a web-based questionnaire.

A detailed data sampling plan was developed for collecting initial item responses to the candidate items from the targeted PROMIS domains. This sampling plan was designed to best accommodate a number of purposes: (1) create item calibrations for all of the items in each of the subdomains; (2) estimate profile scores for various disease populations; (3) create linking metrics to legacy questionnaires (eg, SF-36); (4) confirm the factor structure of the primary domains and subdomains; and (5) conduct item and bank analyses. However, because of the large total number of items (>1000), it is not possible for participants to respond to the entire set of items in each pool. Based on an estimate of 4 questions per minute, the length of the PROMIS questionnaires in the first wave of testing is limited to approximately 150 items, which are expected to take about 40 minutes to answer. Two data collection designs (“full bank” and “block administration”) will be implemented during wave 1 to address the 5 purposes listed previously.

“Full bank” testing will be conducted using the general population sample (n = 3507). Each respondent will answer all of the items in 2 of the primary item banks, for example, depression and anxiety or fatigue impact and fatigue experience. Data collected from full bank testing will be analyzed to confirm the factor structure of the PROMIS domains, test for DIF, and to perform CAT simulations.

For “block administration” in both the general population sample and the samples of individuals with chronic diseases, a balanced incomplete block design will be used in which a subset of items from each item pool is administered to every person.^{8,9} Item blocks will be designed to allow simultaneous item response theory (IRT)-based estimation of item parameters, and of population mean differences and standard deviation ratios.

## ANALYTIC METHODS

Advanced psychometric methods will be used throughout the instrument development process to inform our understanding of the latent constructs, particularly with respect to the populations studied, and to develop adaptive and nonadaptive instruments with appropriate psychometric properties for implementation in a range of research applications.

This process, outlined in Table 1, will include the analysis of item and scale properties using both traditional (ie, classical) and modern (ie, IRT) psychometric methods. Factor analysis will be used to examine the underlying structures of the measured constructs and to evaluate the assumptions of the IRT model. DIF testing will evaluate whether items perform differently across key demographic or disease groups when controlling for the underlying level of the trait assessed by the scale. Finally, items will be calibrated to an IRT model and used in CAT. The plan builds on previous PRO item bank development work by different research groups^{10–12}; however, the PROMIS project involves a more extensive testing strategy than has been performed previously. The steps noted in Table 1 are presented sequentially, but often many steps can be carried out in parallel, and results from later steps may suggest returning to earlier steps to re-evaluate findings based on different interpretations or methods. Herein, we describe each of these methods, review available analytic options and, when evidence supports it, suggest preferred methods and criteria. Decisions about model selection, fit, and/or satisfaction of assumptions will not be based solely on statistical criteria, but will incorporate expert judgment from both psychometric and content experts who will review the evidence to make interpretations and to determine the next steps.

## DESCRIPTIVE STATISTICS

A variety of descriptive statistics will be used, including measures of central tendency (mean, median), spread (standard deviation, range), skewness and kurtosis, and response category frequencies. Patterns and frequency of missing data will be examined to identify the likelihood of systematic or random patterns. For example, if missing data were more prevalent later in the sequence of administered items, this would suggest that the cause may be response burden or lack of time for completing the questionnaire. The content of items that draw substantial missing responses will be examined by content experts to evaluate whether missing responses may be due to sensitive item content.

Several basic classical test theory statistics will be estimated to provide descriptive information about the performance of the item set. These include inter-item correlations, item-scale correlations, and internal consistency reliability. Cronbach’s coefficient alpha^{13} will be used to examine internal consistency with 0.70 to 0.80 as an accepted minimum for group level measurement and 0.90 to 0.95 as an accepted minimum for individual level measurement. Internal consistency estimates are based on the assumption that the item set is homogeneous; because high internal consistency can be achieved with multidimensional data, this statistic does not provide sufficient evidence of unidimensionality.
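As an illustration of the internal consistency statistic described above, the following minimal sketch computes coefficient alpha from a respondents-by-items score matrix. The function name `cronbach_alpha` and the response data are hypothetical and serve only to show the calculation; they are not PROMIS data or PROMIS software.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # sample variance of the summed score
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical responses: 6 respondents x 4 items, each scored 1 to 5.
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 5, 4],
    [5, 4, 5, 5],
    [2, 3, 2, 3],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

A value computed this way would be compared against the group-level and individual-level minimums given above, keeping in mind that a high alpha alone does not establish unidimensionality.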

## EVALUATE ASSUMPTIONS OF THE IRT MODEL

Before applying IRT models, it is important to evaluate the core assumptions of the model: unidimensionality, local independence, and monotonicity. The methods for testing these assumptions are described below. The order in which assumptions are tested can vary.

### Unidimensionality

One critical assumption of IRT models is that a person’s response to an item that measures a construct is accounted for by his/her level (amount) on that trait, and not by other factors. For example, a highly depressed person is more likely to endorse “true” for the statement “I don’t care what happens to me” than a person with low depression. The assumption is that a person’s depression level is the main factor that gives rise to his/her response to the item. No item set will ever perfectly meet strictly defined unidimensionality assumptions.^{14} Thus, one wants to assess whether scales are “essentially” or “sufficiently” unidimensional^{15} to permit unbiased scaling of individuals on a common latent trait. One important criterion is the robustness of item parameter estimates, which can be examined by removing items that may represent a significant dimension. If the item parameters (in particular the item discrimination parameters or factor loadings) significantly change, then this may indicate insufficient unidimensionality.^{16,17} A number of researchers have recommended methods and considerations for evaluating essential unidimensionality as reviewed below.^{14,15,18–20}

### Factor Analytic Methods to Assess Unidimensionality

Confirmatory factor analysis (CFA) will be performed to evaluate the extent that the item pool measures a dominant trait that is consistent with the content experts’ definition of the domain. CFA was selected over an exploratory analysis as the first step because each potential pool of items was carefully selected by experts to represent a dominant PRO construct through an exhaustive literature review and feedback from patients through focus groups and cognitive testing.^{7} Because of the ordinal nature of the PRO data, appropriate software (eg, MPLUS^{21} or LISREL^{22}) is required to evaluate polychoric correlations using an appropriate estimator (eg, the weighted least squares with adjustments for the mean and variance (WLSMV^{23} in MPLUS^{21}) estimator or the diagonally weighted least squares (DWLS in LISREL^{22}) estimator) for factor analysis.

CFA model fit will be assessed by examining multiple indices. Noting that statistical criteria like the χ^{2} statistic are sensitive to sample size, a range of practical fit indices will be examined such as the comparative fit index (CFI >0.95 for good fit), root mean square error of approximation (RMSEA <0.06 for good fit), Tucker-Lewis Index (TLI >0.95 for good fit), standardized root mean square residual (SRMR <0.08 for good fit), and average absolute residual correlations (<0.10 for good fit).^{15,24–28} If the CFA shows poor fit, then we will conduct an exploratory factor analysis and examine the magnitude of eigenvalues for the larger factors (at least 20% of the variability on the first factor is especially desirable), differences in the magnitude of eigenvalues between factors (a ratio of the first to the second eigenvalue in excess of 4 is supportive of the unidimensionality assumption), scree test, parallel analysis, correlations among factors, and factor loadings to determine the underlying structural patterns.
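The eigenvalue criteria mentioned above can be sketched as follows. This is a minimal illustration that assumes an inter-item correlation matrix is already available (in practice a polychoric matrix for ordinal items); the function name `eigenvalue_screen` and the 4-item matrix are hypothetical.

```python
import numpy as np

def eigenvalue_screen(corr):
    """Return (share of variance on the first factor, ratio of first to second eigenvalue)."""
    corr = np.asarray(corr, dtype=float)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]   # largest first
    share_first = eigenvalues[0] / eigenvalues.sum()
    ratio_first_to_second = eigenvalues[0] / eigenvalues[1]
    return share_first, ratio_first_to_second

# Hypothetical correlation matrix: 4 items with a common correlation of 0.6.
corr = np.full((4, 4), 0.6)
np.fill_diagonal(corr, 1.0)

share, ratio = eigenvalue_screen(corr)
print(f"first-factor share = {share:.2f}, eigenvalue ratio = {ratio:.1f}")
print("share at least 20%:", share >= 0.20, "| ratio above 4:", ratio > 4)
```

These two numbers are screening aids only; the scree test, parallel analysis, and factor loadings listed above would still be reviewed alongside them.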

An alternate method to determine whether the items are “sufficiently” unidimensional is McDonald’s bifactor model^{15} (see also Gibbons^{29,30}). McDonald’s approach to assessing unidimensionality (which he terms “homogeneity”) is to assign each item to a specific subdomain based on theoretical considerations. A model is then fit with each item loading on a common factor and on a specific subdomain (group factor). The common factor is defined by all the items, whereas each subdomain is defined by a subset of items in the pool. The factors are constrained to be mutually uncorrelated so that all covariance is partitioned either into loadings on the common factor or onto the subdomain factors. If the standardized loadings on the common factor are all salient (defined as >0.30) and substantially larger than loadings on the group factors, the item pool is thought to be “sufficiently homogeneous.”^{15} Furthermore, one can compare individual scores under a bifactor and unidimensional model. If scores are highly correlated (eg, *r* >0.90), this is further evidence that the effects of multidimensionality are ignorable.^{31}

To illustrate the active evolution of psychometric procedures applicable to the analysis of PROs, during the writing of this article an implementation of full information (exploratory) factor analysis for polytomous item responses became available in version 8.8 of the computer software LISREL.^{22,32} In addition, Edwards^{33} has illustrated the use of a Markov chain Monte Carlo (MCMC) algorithm for CFA of polytomous item responses such as those obtained in measurement of PROs. It is likely that those procedures, and others that may become available soon, will also be useful in the examination of dimensionality of the PROMIS scales.

### Local Independence

Local independence assumes that once the dominant factor influencing a person’s response to an item is controlled, there should be no significant association among item responses.^{34–36} The existence of local dependencies that influence IRT parameter estimates poses a problem for scale construction or CAT implementation. Further, scoring respondents based on misspecified models will result in inaccurate estimates of their level on the underlying trait. In other words, uncontrolled local dependence (LD) among items in a CAT assessment could result in a score different from the HRQOL construct being measured.

Identification of LD among polytomous response items includes examining the residual correlation matrix produced by the single factor CFA. High residual correlations (greater than 0.2) will be flagged and considered as possible LD. In addition, IRT-based tests of LD will be used; among them are Yen’s Q3 statistic^{37} and Chen and Thissen’s LD indices.^{38} These statistics are based on a process that involves fitting a unidimensional IRT model to the data, and then examining the residual covariation between pairs of items, which should be zero if the unidimensional model fits. For example, Steinberg and Thissen^{34} described the use of Chen and Thissen’s G^{2} LD index to identify locally dependent items among 16 dichotomous items on a scale measuring history of violent activity.
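The residual correlation screen described above can be sketched as follows. This is a minimal illustration under stated assumptions: the inter-item correlations and the standardized one-factor loadings are taken as given (they would come from the CFA software), the function name `flag_residual_correlations` is hypothetical, and flagging on the absolute residual is a choice made here for the example.

```python
import numpy as np

def flag_residual_correlations(corr, loadings, cutoff=0.2):
    """List item pairs whose one-factor residual correlation exceeds the cutoff."""
    corr = np.asarray(corr, dtype=float)
    loadings = np.asarray(loadings, dtype=float)
    # Under a single factor, the implied correlation of two items is the
    # product of their standardized loadings.
    residuals = corr - np.outer(loadings, loadings)
    flagged = []
    for i in range(len(loadings)):
        for j in range(i + 1, len(loadings)):
            # Assumption: use the absolute residual so large negative values are also caught.
            if abs(residuals[i, j]) > cutoff:
                flagged.append((i, j, round(float(residuals[i, j]), 3)))
    return flagged

# Hypothetical standardized loadings from a one-factor CFA of 4 items.
loadings = np.array([0.8, 0.7, 0.6, 0.7])
observed = np.outer(loadings, loadings)
np.fill_diagonal(observed, 1.0)
# The third and fourth items correlate more strongly than the single factor implies.
observed[2, 3] = observed[3, 2] = 0.70

result = flag_residual_correlations(observed, loadings)
print(result)
```

Pairs returned by such a screen are candidates for possible LD only; they would then be examined with the IRT-based indices and content review described in this section.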

The modification indices (MIs) of structural equation modeling (SEM) software may also serve as statistics to detect LD. When inter-item polychoric correlations are fitted with a one-factor model, the result is a limited information parameter estimation scheme for the graded normal ogive model. The MIs for such a model are 1 degree of freedom χ^{2} scaled statistics that suggest un-modeled excess covariation between items, which in the context of item factor analysis is indicative of LD. Hill, Edwards, Thissen, et al describe the use of MIs to detect LD in the PedsQL™ Social Functioning Scale and other examples.^{6}

Items that are flagged as LD will be examined to evaluate their effect on IRT parameter estimates. One test is to remove one of the items with LD, and to examine changes in IRT model parameter estimates and in factor loadings for all other items.

One solution to control the influence of LD on item and person parameter estimates is omitting one of the items with LD. If this is not feasible because both items provide a substantial amount of information, then LD items can be marked as “enemies,” preventing them from both being administered in a single assessment to any individual. Further, the LD must be controlled in the calibration step to remove the influence of the highly correlated items. In all cases, the LD items should be evaluated to understand the source of the dependency. LD may exist for nonsubstantive reasons such as structural similarity in wording or content, as when the wording of 2 or more item stems is so similar that the respondent cannot differentiate what the questions are asking and therefore marks the same response for both items.

### Monotonicity

The assumption of monotonicity means that the probability of endorsing or selecting an item response indicative of better health status should increase as the underlying level of health increases. This is a basic requirement for IRT models for items with ordered response categories. Approaches for studying monotonicity include examining graphs of item mean scores conditional on “rest-scores” (ie, total raw scale score minus the item score) or fitting a nonparametric IRT model^{39} to the data that yields initial IRT probability curve estimates, using programs such as Mokken scale analysis for polytomous items (MSP^{40}) software. A nonparametric IRT model fits trace lines for each response to an item without any *a priori* specification of the order of the responses. The data analyst may then examine those fitted trace lines to determine which response alternatives are (empirically) associated with lower levels of the domain and which are associated with higher levels. The shapes of the trace lines may also indicate other departures from monotonicity, such as bimodality, if they exist. Although nonparametric IRT may not be the most (statistically) efficient way to produce the final item analysis and scores for a scale, it can be very informative about the tenability of the assumptions of parametric IRT.
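The rest-score approach described above can be sketched as follows. This is a minimal illustration with hypothetical data and hypothetical function names; in practice adjacent rest-score levels would usually be grouped so that each point is based on an adequate number of respondents.

```python
import numpy as np

def item_means_by_rest_score(scores, item):
    """Mean score on one item at each observed level of the rest-score."""
    scores = np.asarray(scores, dtype=float)
    rest = scores.sum(axis=1) - scores[:, item]   # total raw score minus the item score
    levels = np.unique(rest)
    means = np.array([scores[rest == level, item].mean() for level in levels])
    return levels, means

def is_nondecreasing(values):
    """True when the sequence never drops from one level to the next."""
    return bool(np.all(np.diff(values) >= 0))

# Hypothetical responses: 8 respondents x 3 items, each scored 0 to 2.
data = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 2, 1],
    [2, 1, 2],
    [2, 2, 2],
    [2, 2, 1],
]
levels, means = item_means_by_rest_score(data, item=0)
for level, mean in zip(levels, means):
    print(f"rest-score {level:.0f}: mean item score {mean:.2f}")
print("monotone:", is_nondecreasing(means))
```

A plot of these conditional means against the rest-score is the graph referred to above; a dip in the curve would prompt closer inspection of the item.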

## FIT ITEM RESPONSE THEORY MODEL TO DATA

Once the assumptions have been confirmed, IRT models are fit to the data both for item and scale analysis and for item calibration to set the stage for CAT. IRT refers to a family of models that describe, in probabilistic terms, the relationship between a person’s response to a survey question and his or her standing (level) on the PRO latent construct (eg, pain) that the scale measures.^{41,42} For every item in a scale, a set of properties (item parameters) are estimated. The item slope or discrimination parameter describes how well the item performs in the scale in terms of the strength of the relationship between the item and the scale. The item difficulty or threshold parameter(s) identifies the location along the construct’s latent continuum where the item best discriminates among individuals. This information can be used to evaluate properties of the items in the scale or used by the CAT algorithm to select items that are appropriately matched to the respondent’s estimated level on the measured trait, based on their responses to previously administered items.

Although there are well more than 100 varieties of IRT models^{41–43} to handle various data characteristics such as dichotomous and polytomous response data, ordinal and nominal data, and unidimensional and multidimensional data, only a handful have been used in item analysis and scoring. In initial analyses of existing data sets, the PROMIS network evaluated both a general IRT model, Samejima’s Graded Response Model^{44,45}(GRM), and 2 models based on the Rasch model framework, the Partial Credit Model^{46} and the Rating Scale Model.^{47,48} On the basis of these analyses, the PROMIS network decided to focus on the GRM in future item bank development work.

The GRM is a very flexible model from the parametric, unidimensional, polytomous-response IRT family of models. Because it allows discrimination to vary item by item, it typically fits response data better than a one-parameter (ie, Rasch) model.^{43,49} Further, compared with alternative 2-parameter models such as the generalized partial credit model, the model is relatively easy to understand and illustrate to “consumers” and retains its functional form when response categories are merged. Thus, the GRM offers a flexible framework for modeling the participant responses to examine item and scale properties, to calibrate the items of the item bank, and to score individual response patterns in the PRO assessment. However, the PROMIS network will examine further the fit and added value of alternate IRT models using PROMIS data.

The unidimensional GRM is a generalization of the IRT 2-parameter logistic model for dichotomous response data. The GRM is based on the logistic function that describes, given the level of the trait being measured, the probability that an item response will be observed in category *k* or higher. For ordered responses *X* = *k*, *k* = 1, 2, 3, …, *m* _{i}, where response *m* _{i} reflects the highest θ value, the probability of responding in category *k* is defined^{44,45,50} as:

$$
P(X_i = k \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,k+1})]}
$$

where the first term is taken to be 1 for the lowest category and the second term is taken to be 0 for the highest category.

This function models the probability of observing each category as a function of the underlying construct. The subscript on *m* above indicates that the number of response categories does not need to be equal across items. The discrimination (slope) parameter *a* _{i} varies by item *i* in a scale. The threshold parameters *b* _{ik} vary within an item with the constraint *b* _{k − 1} < *b* _{k} < *b* _{k+1}, and each represents the point on the θ axis at which the probability passes 50% that the response is in category *k* or higher.

Figure 1 presents the category response curves (CRCs) for a 4-response category item with IRT GRM parameters: *a* = 2.26, *b* _{1} = −1.00, *b* _{2} = 0.00, and *b* _{3} = 1.50. Each curve (one for each response category) represents the probability of a respondent selecting category *k*, given his/her level (θ) on the underlying construct. If a person’s estimated θ is less than −1.00, then he/she is more likely to endorse the first response category. Likewise, if a person’s estimated θ is between −1.00 and 0.00, then he/she is more likely to endorse the second category. A person with estimated θ above 1.50 will have the greatest likelihood of endorsing the fourth response category. In a calibration of the GRM to the item responses, category response curves such as those shown in Figure 1 are estimated for every item.
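The category response curves for the Figure 1 item can be reproduced numerically with a short sketch. It assumes the cumulative logistic form of the GRM described above and treats the 3 thresholds as the boundaries between the 4 adjacent categories; the function name `grm_category_probabilities` and the θ values chosen are hypothetical.

```python
import math

def grm_category_probabilities(theta, a, thresholds):
    """Category probabilities under the graded response model.

    `thresholds` holds the ordered boundaries between adjacent categories,
    so an item with m categories has m - 1 thresholds.
    """
    # Cumulative probabilities of responding above each boundary, padded with
    # 1 (lowest category or higher) and 0 (above the highest category).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Item parameters reported for Figure 1.
A = 2.26
B = [-1.00, 0.00, 1.50]

most_likely = []
for theta in (-2.0, -0.5, 0.75, 2.5):
    probs = grm_category_probabilities(theta, A, B)
    most_likely.append(probs.index(max(probs)) + 1)   # categories numbered from 1
    print(theta, [round(p, 3) for p in probs])
```

Evaluating the function over a fine grid of θ values and plotting one line per category yields curves of the kind shown in Figure 1.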

Once these response curves are estimated on a group of respondents from the first wave of PROMIS data collection, the curves are then used to estimate the θ levels of new respondents to the PROMIS questionnaires. For example, if a person selects response 3 for the item in Figure 1, it is likely their θ level is between 0.0 and 1.5. Using this kind of information for additional items, a person’s θ level is estimated by identifying which response they chose for each administered item. Thus, a person’s level on the trait (θ) and an associated standard error are estimated, using maximum likelihood or Bayesian estimation methods, based on the complete pattern of responses given by each person in conjunction with the probability functions associated with each item response.
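One Bayesian way to carry out the scoring step described above is an expected a posteriori (EAP) estimate computed on a grid of θ values. The sketch below is illustrative only: the standard normal prior, the grid settings, the function names, and the second and third items are assumptions made here; only the first item uses the Figure 1 parameters.

```python
import math

def grm_probs(theta, a, thresholds):
    """Graded response model category probabilities (categories 1..m)."""
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

def eap_score(responses, items, grid_min=-4.0, grid_max=4.0, n_points=81):
    """Posterior mean and posterior SD of theta under a standard normal prior.

    responses: chosen category (numbered from 1) for each administered item
    items:     matching list of (a, thresholds) tuples
    """
    step = (grid_max - grid_min) / (n_points - 1)
    grid = [grid_min + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        weight = math.exp(-0.5 * theta ** 2)   # unnormalized N(0, 1) prior (assumption)
        for response, (a, thresholds) in zip(responses, items):
            weight *= grm_probs(theta, a, thresholds)[response - 1]
        weights.append(weight)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    variance = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(variance)

figure_1_item = (2.26, [-1.00, 0.00, 1.50])
hypothetical_items = [(1.80, [-0.50, 0.50, 1.20]), (2.00, [-1.20, -0.20, 0.80])]

theta_one, se_one = eap_score([3], [figure_1_item])
theta_three, se_three = eap_score([3, 3, 3], [figure_1_item] + hypothetical_items)
print(f"1 item:  theta = {theta_one:.2f}, SE = {se_one:.2f}")
print(f"3 items: theta = {theta_three:.2f}, SE = {se_three:.2f}")
```

With a single response the prior pulls the estimate toward 0; as more items are administered, the response pattern dominates and the standard error shrinks, which is the behavior a CAT stopping rule relies on.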

IRT model fit will be assessed using a number of indices, recognizing that universally accepted fit statistics do not exist. Also note that if model assumptions are supported by the data, then strict adherence to model fit statistics is not vital, given the limits of acceptable fit indices. Residuals between observed and expected response frequencies by item response category will be compared based on analyses of the size of the differences (residuals). Common fit statistics such as Q_{1}, Bock’s χ^{2}, and others^{43,51} will be examined; also considered will be generalizations of Orlando and Thissen’s *S* − *X* ^{2} to polytomous data.^{52,53} The ultimate issue is to what degree misfit affects model performance in terms of the valid scaling of individual differences.^{54}

Once analysts are satisfied with the fit of the IRT model to the response data, attention is shifted to analyzing the item and scale properties of the PROMIS domains. The psychometric properties of the items will be examined by review of their item parameter estimates, CRCs, and item information curves.^{55,56} Information curves indicate the range of θ where an item is best at discriminating among individuals. Higher information denotes more precision for measuring a person’s trait level. The height of the curves (denoting more information) is a function of the discrimination power (*a* parameter) of the item. The location of the information curves is determined by the threshold (*b*) parameter(s) of the item. Information curves indicate which items are most useful for measuring different levels of the measured construct. This is critical for the item selection process in CAT and in the development of short-forms.
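The dependence of item information on the discrimination and threshold parameters can be sketched numerically. The example below uses the standard information function for polytomous items, the sum over categories of the squared derivative of the category probability divided by that probability, applied to the GRM; the function name `grm_item_information` and the low-discrimination comparison item are hypothetical.

```python
import math

def grm_item_information(theta, a, thresholds):
    """Item information at theta under the graded response model."""
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    information = 0.0
    for k in range(len(thresholds) + 1):
        upper, lower = cumulative[k], cumulative[k + 1]
        prob = upper - lower
        # Each cumulative curve P* has slope a * P* * (1 - P*); the padded
        # values 1 and 0 contribute a slope of zero.
        slope = a * (upper * (1.0 - upper) - lower * (1.0 - lower))
        information += slope ** 2 / prob
    return information

thresholds = [-1.00, 0.00, 1.50]   # thresholds of the Figure 1 item
grid = [x / 2.0 for x in range(-6, 7)]   # theta from -3 to 3 in steps of 0.5

info_figure_1 = [grm_item_information(t, 2.26, thresholds) for t in grid]
info_low_a = [grm_item_information(t, 1.13, thresholds) for t in grid]   # hypothetical item

for t, high, low in zip(grid, info_figure_1, info_low_a):
    print(f"theta {t:+.1f}: a=2.26 -> {high:.2f}, a=1.13 -> {low:.2f}")
```

In the region spanned by the thresholds the more discriminating item yields the taller curve, while shifting the thresholds would move the curve along the θ axis without changing its general height.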

Poorly performing items will be reviewed by content experts before the item bank is established. Misfitting items may be retained or revised when they are identified as clinically relevant and no better-fitting alternative is available. Low discriminating items in the tails of the theta distribution (at low or at high levels of the trait being measured) also may be retained or revised to add information for extreme scores where they would not have been retained in better-populated regions of the continuum. It is at the extremes of the trait continuum that CAT is most effective, but only if items exist that provide good measurement along these portions of the continuum.

Future research by PROMIS will examine the added value of more complex models including multidimensional IRT models.^{57–63} The attraction of these methods is reduced respondent burden and a more realistic model for the underlying measurement model. Multidimensional models take advantage of the correlations among subdomains to inform the measurement of the target constructs, so precise theta estimates can be obtained with fewer items. However, it should be noted that multidimensional IRT has all the rotation problems and complexity that factor analysis does; it greatly complicates DIF analyses, and the meaning of scores is often unclear when subscales are highly correlated. In addition, essentially unidimensional constructs are more often desirable from a theoretical and clinical perspective.

## EVALUATION OF DIFFERENTIAL ITEM FUNCTIONING

According to the IRT model, an item displays differential item functioning (DIF) if the probabilities of responding in different categories vary across studied groups, given equivalent levels of the underlying attribute.^{4,41,43,64} In other words, DIF exists when, for example, women at moderate levels of emotional distress are more likely to report crying than are men at the same moderate level of distress. One reason that instruments containing items with DIF may have reduced validity for between-group comparisons is that their scores indicate attributes other than the one the scale is intended to measure.^{64} The impact of DIF on CAT may be greater than in fixed-length assessments because only a small item set is administered.

In the context of PROMIS, DIF may occur across groups defined by race, gender, age, or disease condition. The question of whether or not DIF should be tested with respect to a specific disease category is one that should be considered by content experts. Roussos and Stout^{18} recommended a first step in DIF analyses that includes substantive (qualitative) reviews in which DIF hypotheses are generated, and it is decided whether or not unintended “adverse” DIF is present as a secondary factor. Because this process is largely based on judgment, there may be some error at this step. Substantive reviewers may use 4 sources to inform the DIF hypotheses: (1) previously published DIF analyses; (2) substantive content considerations and judgment regarding current items; (3) review of archival data, including contexts present in other similar data; and (4) use of archival or pretest data to test bundles of items according to some organizing principle. The stage 2 statistical analyses comprise confirmatory tests of DIF hypotheses. This type of procedure can be extended to health-related quality of life measures through use of qualitative methods proposed in the PROMIS effort, including the use of expert review, focus groups, cognitive interviews and the generation of possible hypotheses regarding subgroups for which DIF might be observed.

IRT provides a useful framework for identifying items with DIF. The category response curves of an item calibrated based on the responses of 2 different groups can be displayed simultaneously. If the model fits, IRT item parameters (ie, threshold and discrimination parameters) are assumed to be linearly invariant with respect to group membership. Therefore, differences between the CRCs, after linking the θ metric between each group, indicate that respondents at the same level of the underlying trait, but from different groups, have different probabilities of endorsing the item. DIF can occur in the threshold or discrimination parameter. Uniform DIF refers to DIF in the threshold parameter of the model, which indicates that the focal and reference groups have uniformly different response probabilities for the tested item. Nonuniform DIF appears in the discrimination parameter and suggests interaction between the underlying measured variable and group membership; that is, the degree to which an item relates to the underlying construct depends on the group being measured.^{41,64,65}
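To illustrate the two forms of DIF, the following Python sketch computes category response probabilities under the graded response model for a hypothetical 4-category item. The parameter values and the function name `grm_category_probs` are assumptions chosen for illustration and are not PROMIS estimates. Uniform DIF is mimicked by shifting all thresholds, and nonuniform DIF by changing the discrimination parameter.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category response probabilities under the graded response model.

    thresholds must be in increasing order; categories are scored 0..len(thresholds).
    """
    # Cumulative curves P(X >= k) for k = 1..m, padded with P(X >= 0) = 1 and P(X >= m+1) = 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

theta = 0.0  # the same trait level is used for every group

# Hypothetical parameters: reference group, then two kinds of DIF in a focal group.
reference = grm_category_probs(theta, a=1.8, thresholds=[-1.0, 0.0, 1.0])
uniform_dif = grm_category_probs(theta, a=1.8, thresholds=[-0.5, 0.5, 1.5])     # thresholds shifted
nonuniform_dif = grm_category_probs(theta, a=0.9, thresholds=[-1.0, 0.0, 1.0])  # discrimination changed

for label, probs in [("reference", reference),
                     ("uniform DIF", uniform_dif),
                     ("nonuniform DIF", nonuniform_dif)]:
    print(label, [round(p, 3) for p in probs])
```

Evaluating the same function over a grid of theta values for each group would give the CRCs that are compared in a graphical DIF check.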

Determination of DIF is optimized when the samples are as representative as possible of the populations from which they are drawn. Most DIF procedures rely on the identification of a core set of *anchor* items that are thought to be free of DIF and are used to link the 2 groups on a common scale. DIF detection methods use scores based on these items to control for underlying differences between the comparison groups while testing for DIF in the item under scrutiny. There are numerous approaches to assessing DIF.^{66} Herein, we describe the DIF methods being considered by the PROMIS analytic team. It is prudent to evaluate DIF using multiple methods and flag those items identified consistently.

Two IRT-based methods will be used to identify DIF: the log-likelihood IRT approach, accompanied by byproducts of differential functioning of items and tests (DFIT) to examine DIF magnitude, and the IRT/ordinal logistic regression (OLR) approach, with built-in tests of magnitude. The recommended approach is that used by PROMIS investigators, in which significant DIF is first identified using either likelihood ratio (LR)-based significance tests (IRT-LR) or significance tests and changes in beta coefficients (IRT/OLR). The IRT-LR approach also incorporates a correction for multiple comparisons. Finally, both approaches examine the magnitude of DIF in determining the final items that are flagged. If either method flags an item as having DIF according to these rules, the item will be considered as having DIF. Details regarding the steps in the analyses can be found elsewhere.^{67–69}

The IRT-LR test^{64} will be used to identify both uniform and nonuniform DIF. The procedure compares hierarchically nested IRT models: 1 model that fully constrains the IRT parameters to be equal between the 2 comparison groups, and other models that allow the item parameters to be freely estimated between groups. One key difference between the IRT-LR method and many other DIF methods is how differences between comparison groups are estimated from the anchor items. Other DIF methods use the simple summed score of the anchor set, but the IRT-LR procedure estimates a person’s theta score based on his/her responses to the anchor set. This approach is similar to that used in CAT. Thus, IRT-LR procedures make an easy transition to the detection of DIF for data collected in a CAT environment.^{70}

Used in conjunction with IRT-LR, and based on IRT, are Raju’s signed and unsigned area tests, combined with the Differential Functioning of Items and Tests (DFIT) framework.^{71} This framework includes a noncompensatory DIF (NCDIF) index which reflects the average squared difference between the item-level scores for the focal and reference groups. Several magnitude measures are available in the context of area statistics and the DFIT methodology developed by Raju and colleagues.^{71–73} For binary items, the exact area methods compare the areas between the item response functions estimated in 2 different groups; Cohen et al^{74} extended these area statistics for the graded response model.
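A minimal sketch of the NCDIF idea follows. It assumes graded response model parameters that have already been linked to a common metric, and it averages the squared difference in expected item scores over the focal group's theta estimates. That averaging choice is one reading of the DFIT framework and is labeled here as an assumption; all names and parameter values are hypothetical.

```python
import math

def expected_item_score(theta, a, thresholds):
    """Expected score on a graded item scored 0..m: the sum of P(X >= k) for k = 1..m."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def ncdif(focal_thetas, focal_params, reference_params):
    """Average squared difference between focal- and reference-based expected item scores.

    Assumption: the average is taken over the theta estimates of focal group members.
    """
    squared_diffs = []
    for theta in focal_thetas:
        focal_score = expected_item_score(theta, *focal_params)
        reference_score = expected_item_score(theta, *reference_params)
        squared_diffs.append((focal_score - reference_score) ** 2)
    return sum(squared_diffs) / len(squared_diffs)

# Hypothetical inputs: (discrimination, thresholds) per group and a few focal-group thetas.
focal_thetas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
reference_item = (1.8, [-1.0, 0.0, 1.0])
focal_item = (1.8, [-0.5, 0.5, 1.5])

print(round(ncdif(focal_thetas, focal_item, reference_item), 4))
```

An item with identical parameters in both groups yields an index of 0, and larger values indicate larger DIF magnitude on the item-score scale.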

The second DIF method is ordinal logistic regression (OLR),^{75} in which a series of 3 logistic models predicting the probability of item response are compared. The independent variables in Model 1 are the trait estimate (eg, raw scale score or theta estimate), group, and the interaction between group and trait. Model 2 includes only the main effects of trait and group, and Model 3 includes only the trait estimate. Nonuniform DIF is detected if there is a statistically significant difference between the likelihood values for Models 1 and 2. Uniform DIF is evident if there is a significant difference between the likelihood values for Models 2 and 3. Crane et al^{76} suggested that, in addition to statistical significance, the relative change in beta coefficients between Models 2 and 3 should be considered. On the basis of simulations by Maldonado and Greenland,^{77} a 10% change in beta has been recommended as a criterion for uniform DIF.
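The model comparisons described above can be sketched in a few lines, assuming the 3 models have already been fitted elsewhere and that only their log-likelihoods and trait coefficients are available. The sketch assumes a 2-group comparison (so each test has 1 degree of freedom), a significance level of 0.05, and that the relative change in beta is computed against the Model 3 coefficient; these choices and all names are illustrative assumptions, not the PROMIS specification.

```python
import math

def chi2_sf_df1(statistic):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))

def olr_dif_summary(loglik_m1, loglik_m2, loglik_m3,
                    beta_trait_m2, beta_trait_m3,
                    alpha=0.05, beta_cutoff=0.10):
    """Summarize nonuniform and uniform DIF evidence from 3 nested OLR models."""
    # Model 1 vs Model 2 tests the group-by-trait interaction (nonuniform DIF).
    g2_nonuniform = 2.0 * (loglik_m1 - loglik_m2)
    # Model 2 vs Model 3 tests the group main effect (uniform DIF).
    g2_uniform = 2.0 * (loglik_m2 - loglik_m3)
    # Relative change in the trait coefficient when group is added to the model.
    relative_beta_change = abs(beta_trait_m2 - beta_trait_m3) / abs(beta_trait_m3)
    p_nonuniform = chi2_sf_df1(g2_nonuniform)
    p_uniform = chi2_sf_df1(g2_uniform)
    return {
        "nonuniform_p": p_nonuniform,
        "uniform_p": p_uniform,
        "nonuniform_significant": p_nonuniform < alpha,
        "uniform_significant": p_uniform < alpha,
        "relative_beta_change": relative_beta_change,
        "beta_change_exceeds_cutoff": relative_beta_change > beta_cutoff,
    }

# Hypothetical fitted values for one item.
summary = olr_dif_summary(loglik_m1=-1000.0, loglik_m2=-1000.5, loglik_m3=-1010.0,
                          beta_trait_m2=1.0, beta_trait_m3=1.2)
for key, value in summary.items():
    print(key, value)
```

The significance tests and the beta-change criterion are reported separately so that an analyst can apply whichever combination rule the analysis plan specifies.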

PROMIS also will evaluate items for DIF using a hybrid approach that combines the strengths of OLR and IRT.^{68,78} This iterative approach uses IRT theta estimates in OLR models to determine whether items have uniform or nonuniform DIF. To account for spurious DIF (false-positive or false-negative DIF found due to DIF in other items), demographic-specific item parameters are estimated for items found on the initial run to have DIF; items free of DIF serve as anchor items. DIF detection is repeated using these updated IRT estimates, and these procedures (DIF detection and IRT estimation) are repeated until the same items are identified on successive runs.
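The control flow of such an iterative procedure can be sketched as follows. The callable `detect_dif` is hypothetical: it stands in for a full run that re-estimates theta with group-specific parameters for the currently flagged items, uses the remaining items as anchors, and returns the set of items showing DIF. The cap on the number of runs is also an assumption added for safety.

```python
def iterate_dif_detection(items, detect_dif, max_runs=20):
    """Repeat DIF detection until the same items are flagged on successive runs.

    detect_dif(anchors, flagged) is a hypothetical callable returning the items
    found to have DIF when `flagged` items receive group-specific parameters.
    """
    flagged = frozenset()
    for run in range(1, max_runs + 1):
        # Items not currently flagged serve as anchors for linking the groups.
        anchors = [item for item in items if item not in flagged]
        new_flagged = frozenset(detect_dif(anchors, flagged))
        if new_flagged == flagged:
            return flagged, run  # stable solution reached
        flagged = new_flagged
    raise RuntimeError("DIF detection did not stabilize within max_runs")

# Stub detector for illustration only: DIF in item3 initially masks DIF in item5.
def stub_detect_dif(anchors, flagged):
    return {"item3"} if not flagged else {"item3", "item5"}

items = ["item1", "item2", "item3", "item4", "item5"]
final_flagged, runs_needed = iterate_dif_detection(items, stub_detect_dif)
print(sorted(final_flagged), runs_needed)
```

In the stub, the second item is uncovered only after the first has been given group-specific parameters, which mirrors the spurious (here, false-negative) DIF that the iteration is meant to address.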

Advantages of the techniques reviewed above include the rapid empirical identification of anchor items and the determination of the presence and magnitude of DIF. Another advantage is the possibility of using demographic-specific item parameters in a CAT context if that is considered a viable option.

Multiple-indicator, multiple-cause (MIMIC) models offer an attractive framework for examining DIF in the context of evaluating impact.^{79} Based on a modification of structural equation modeling, the single-group MIMIC model permits examination of the direct effect of background variables on items, while controlling for the level of the attribute studied.^{80} The MIMIC model also allows background variables such as demographic characteristics to be used as covariates to account for differences among the comparison populations when examining DIF. Although the MIMIC model does not permit tests of nonuniform DIF, an advantage is that impact can be examined by comparing the estimated group effects in models with, and without, adjustment for DIF.^{81}

There are several options for treating items with DIF. One extreme option is to eliminate the item from the bank. If the analyses suggest that there are large numbers of items without consequential DIF, this option will be considered. On the other hand, if many items have DIF, especially in key areas of the trait continuum that are sparsely populated by items, or if content experts determine that the items with DIF are central to the meaning of the construct, other options are to ignore DIF if it is small, to revise items to be free of DIF, to tag items that should not be administered to specific groups, or to control for DIF by using demographic-specific item parameters.

## ITEM CALIBRATION FOR BANKING

After a comprehensive review of the item properties, including evaluation of DIF across key demographic and clinically different groups, the final selected item set will be calibrated using the GRM, and CAT algorithms will be developed. One set of IRT item parameters will be established for all items unless DIF evidence suggests that some items should have different calibrations based on key groups to be measured by the PROMIS system. The item pools for each unidimensional PROMIS domain will include a large set of items, with most pools containing more than 50 items.

To identify the metric for the PROMIS item parameter estimates, the scale for person parameters must be fixed in some manner—typically by specifying that the mean in the reference population is 0 and the standard deviation is 1. The PROMIS network has selected the reference population to be the US general population. This will allow interpretation of difficulty (threshold) parameter(s) relative to the general US population mean and the discrimination parameters relative to the population standard deviation. Calibrated in this manner, in the dichotomous response case, an item with a difficulty parameter estimate of *b* = 1.5 suggests that a person who is 1.5 standard deviations above the mean will have a 50% probability of endorsing the item. Population mean differences and standard deviation ratios will be computed for each disease population tested within PROMIS to allow benchmarking. Thus, a person can compare his/her symptom severity or functioning to people with similar disease or to the general US population.
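The interpretation of the difficulty parameter can be checked with a short calculation. The sketch below assumes a 2-parameter logistic model for a dichotomous item; the discrimination value and the function name are illustrative assumptions.

```python
import math

def prob_endorse(theta, a, b):
    """Probability of endorsing a dichotomous item under a 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = 2.0, 1.5  # hypothetical discrimination; difficulty from the example in the text

# A person 1.5 standard deviations above the reference population mean.
print(prob_endorse(1.5, a, b))  # 0.5, because theta equals b
# A person at the reference population mean (theta = 0) endorses far less often.
print(prob_endorse(0.0, a, b))
```

Whatever the discrimination value, the probability is 50% when theta equals *b*; the discrimination parameter governs only how quickly the probability changes around that point.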

This standardized metric will facilitate the conversion of the IRT z-score metric to the T-score distribution adopted by the PROMIS steering committee. The IRT scale score estimates will be treated as raw scores for the purposes of computing the proportion of the norming/calibration sample that scores below each theta level and identifying the z-score corresponding to that percentage from a normal distribution. These pseudo-normalized z-scores will be converted to T-scores with a mean of 50 and a standard deviation of 10. For PRO domains where the normal distribution is not appropriate, theta estimates will be converted to T-scores by a linear conversion.
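Both conversions can be sketched with the Python standard library. The mid-rank handling of the proportion (counting half of the tied scores as falling below) is an assumption introduced here so that the lowest and highest scores do not map to infinite z-scores; the function names and sample values are hypothetical.

```python
from statistics import NormalDist

def linear_t_score(theta):
    """Linear conversion of a theta (z-score metric) estimate to the T-score metric."""
    return 50.0 + 10.0 * theta

def normalized_t_scores(thetas):
    """Pseudo-normalized T-scores based on each theta's percentile in the sample."""
    n = len(thetas)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    t_scores = []
    for theta in thetas:
        below = sum(1 for other in thetas if other < theta)
        tied = sum(1 for other in thetas if other == theta)
        # Assumption: mid-rank proportion keeps the value strictly between 0 and 1.
        proportion = (below + 0.5 * tied) / n
        z = standard_normal.inv_cdf(proportion)
        t_scores.append(50.0 + 10.0 * z)
    return t_scores

sample_thetas = [-1.2, 0.1, 2.5]  # hypothetical calibration-sample estimates
print(linear_t_score(1.5))
print([round(t, 1) for t in normalized_t_scores(sample_thetas)])
```

The two functions agree closely when the theta distribution is approximately normal and diverge when it is skewed, which is the situation in which the choice between them matters.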

Each of the PROMIS item banks calibrated from the wave one data will be examined for its ability to provide precise measurement across the construct continuum, as assessed by scale information and standard error of measurement curves. Further, CAT simulations will examine the discriminative ability of the item bank at any level of the construct continuum.^{82,83} The ideal is to have high precision and discrimination ability across the continuum of symptom severity or functional ability. Likely, there will be less precision in the extremes of the distributions (eg, high physical functioning or absence of depression); however, the PROMIS content experts are taking great care to write items that may help reduce floor and ceiling effects. The PROMIS network will review the findings from these analyses, and will follow up with additional work to: (1) write new items to fill gaps in the construct continuum; (2) examine alternate psychometric methods that may improve precision or efficiency; (3) evaluate the items and scales for clinical application; and (4) review the bank items to ensure their relevance in different disease and demographic populations not covered or poorly covered in the calibration data.
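The link between scale information and the standard error of measurement can be sketched as follows, assuming graded response model items and the usual approximation that the standard error is the reciprocal square root of the summed item information. The item parameters below are invented for illustration and do not describe any PROMIS bank.

```python
import math

def grm_item_information(theta, a, thresholds):
    """Item information at theta for a graded response model item."""
    # Cumulative curves P(X >= k), padded with 1 (lowest) and 0 (beyond the highest category).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    total = 0.0
    for k in range(len(thresholds) + 1):
        category_prob = cumulative[k] - cumulative[k + 1]
        # Difference of the P*(1 - P*) terms of the adjacent cumulative curves.
        slope_diff = (cumulative[k] * (1.0 - cumulative[k])
                      - cumulative[k + 1] * (1.0 - cumulative[k + 1]))
        if category_prob > 0.0:
            total += slope_diff ** 2 / category_prob
    return a * a * total

def standard_error(theta, bank):
    """Approximate standard error of measurement from the bank's total information."""
    information = sum(grm_item_information(theta, a, bs) for a, bs in bank)
    return 1.0 / math.sqrt(information)

# Hypothetical 3-item bank: (discrimination, thresholds) per item.
bank = [(1.8, [-1.0, 0.0, 1.0]), (2.2, [-0.5, 0.5, 1.5]), (1.5, [-1.5, -0.5, 0.5])]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, bank), 3))
```

With thresholds concentrated near the middle of the continuum, the standard error is smallest there and grows toward the extremes, which is the pattern of reduced precision at the floor and ceiling described above.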

## CONCLUSIONS

This report has presented an overview of the psychometric methods that will be used in the PROMIS project, both to examine the properties of the items and domains and to calibrate items with properties that will allow the CAT procedure to select the most informative set of items to estimate a person’s level of health. The PROMIS project is faced with an enormous challenge to create psychometrically sound and valid banks in a short amount of time. Multiple item banks will be developed, and the initial calibration sample collected in wave one will represent at least 7 disease populations and a general US population that vary across a range of key demographic characteristics. The enormity of the project requires the PROMIS psychometric team to be flexible in terms of the methods used. The design presented herein was developed to be robust to violations of the assumptions required to reach project goals. It is also expected that a large-scale evaluation phase will follow the initial wave of testing to examine alternative methods that may yield more interpretable and efficient results.

## REFERENCES

*Med Care*. 2007;45(Suppl 1):S1–S2.

*Med Care*. 2007;45(Suppl 1):S3–S11.

*Health Services Res*. 2005;40(Part II):1694–1711.

*J Mental Health Aging*. 2001;7:31–40.

*Med Care*. 2007;45(Suppl 1):S32–S38.

*Med Care*. 2007;45(Suppl 1):S39–S47.

*Med Care*. 2007;45(Suppl 1):S12–S21.

*Applied Linear Statistical Models*. 5th ed. New York, NY: McGraw-Hill/Irwin; 2005:664–665, 1173–1183.

*Statistical Methods in Medical Research*. 4th ed. Malden, MA: Blackwell Science; 2002:261–264.

*Med Care*. 2000;38:II73–II82.

*Qual Life Res*. 2003;12:913–933.

*Qual Life Res*. 2003;12:485–501.

*Psychometrika*. 1951;16:297–334.

*Br J Mathematical Stat Psychol*. 1981;34:100–117.

*Test Theory: A Unified Treatment*. Mahwah, NJ: Lawrence Erlbaum; 1999.

*Appl Psychol Measure*. 1983;7:189–199.

*J Educational Stat*. 1986;11:91–115.

*Appl Psychol Measure*. 1996;20:355–371.

*Psychometrika*. 1987;52:589–617.

*Qual Life Res*. 2006;15:1179–1190.

*Mplus User’s Guide*. Los Angeles, CA: Muthen & Muthen; 1998.

*LISREL 8: New Statistical Features*. Third printing with revisions. Lincolnwood: Scientific Software International; 2003.

*Psychometrika*. 1997.

*Principles and Practice of Structural Equation Modeling*. New York, NY: Guilford Press; 1998.

*Psychol Bull*. 1990;107:238–246.

*Structural Equation Modeling: Concepts, Issues, and Applications*. Thousand Oaks, CA: Sage Publications; 1995:56–75.

*Structural Equation Modeling*. 1999;6:1–55.

*Testing Structural Equation Models*. Newbury Park, CA: Sage Publications; 1993.

*Psychometrika*. 1992;57:423–436.

*Appl Psychol Measure*. In press.

*J Personality Assess*. 2005;84:228–238.

*A Markov Chain Monte Carlo Approach to Confirmatory Item Factor Analysis*. [dissertation]. Chapel Hill, NC: University of North Carolina; 2005.

*Psychol Methods*. 1996;1:81–97.

*Educational Measure*. 1996;15:22–29.

*J Educational Measure*. 1993;30:187–213.

*Appl Psychol Measure*. 1984;8:125–145.

*Educational Behav Stat*. 1997;22:265–289.

*Handbook of Modern Item Response Theory*. New York, NY: Springer; 1997:381–394.

*Users Manual MSP5 for Windows: A Program for Mokken Scale Analysis for Polytomous Items* [software manual]. Groningen, the Netherlands: iec ProGAMMA; 2000.

*Fundamentals of Item Response Theory*. Newbury Park, CA: Sage; 1991.

*Item Response Theory for Psychologists*. Mahwah, NJ: Lawrence Erlbaum; 2000.

*Handbook of Modern Item Response Theory*. New York, NY: Springer-Verlag; 1997.

*Psychometrika Monogr*. 1969;No. 17.

*Handbook of Modern Item Response Theory*. New York, NY: Springer; 1997:85–100.

*Psychometrika*. 1982;47:149–174.

*Psychometrika*. 1978;43:561–573.

*Rating Scale Analysis*. Chicago, IL: MESA Press; 1982.

*Test Scoring*. Mahwah, NJ: Lawrence Erlbaum; 2001:73–140.

*Test Scoring*. Mahwah, NJ: Lawrence Erlbaum; 2001:141–186.

*Appl Psychol Measure*. 1981;5:245–262.

*Appl Psychol Measure*. 2000;24:50–64.

S-X^{2}, an item fit index for dichotomous item response theory models. *Appl Psychol Measure*. 2003;27:289–298.

*Advances in Health Outcomes Research Methods, Measurement, Statistical Analysis, and Clinical Applications*. Washington, DC: International Society for Quality of Life Research; 2005:57–78.

*Exp Rev Pharmacoeconomics Outcomes Res*. 2003;3:131–145.

*Assessing Quality of Life in Clinical Trials: Methods and Practice*. 2nd ed. Oxford, NY: Oxford University Press; 2005:55–73.

*J Appl Measure*. 2003;4:87–100.

*Educational Psychol Measure*. 2006;66:5–34.

*Psychometrika*. 1996;61:331–354.

*J Educ Behav Stat*. 1999;24:398–412.

*Med Care*. 2002;40:812–823.

*Educational Measure*. 2003;22:37–51.

*Qual Life Res*. 2006;15:315–329.

*Differential Item Functioning*. Hillsdale, NJ: Lawrence Erlbaum Associates; 1993:67–113.

*Stat Med*. 2000;19:1651–1683.

*Appl Psychol Measure*. 1993;17:297–334.

*Med Care*. 2006;44(Suppl 3):S134–S142.

*Med Care*. 2006;44(Suppl 3):S115–S123.

*Med Care*. 2006;44(Suppl 3):S152–S170.

*Appl Psychol Measure*. 1995;19:353–368.

*Appl Psychol Measure*. 1999;23:309–326.

*Appl Psychol Measure*. 1993;17:335–350.

*A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-type (Ordinal) Item Scores*. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense; 1999.

*Stat Med*. 2004;23:241–256.

*Am J Epidemiol*. 1993;138:923–936.

*J Clin Epidemiol*. 2006;59:478–484.

*Psychometrika*. 1984;49:115–132.

*Med Care*. 2003;41:III-75–III-86.

*J Gerontol*. 2000;55B:273–282.

*Qual Life Res*. 2005;14:2277–2291.

*J Epidemiol*. 2006;59:290–298.

**Keywords:**

item response theory; unidimensionality; model fit; differential item functioning; computerized adaptive testing