Psychosomatic medicine draws knowledge primarily from the disciplines of medicine and psychology. Given its interdisciplinary nature, attention to reliability shows up in various forms because the disciplines of psychology and medicine have different traditions. But the basic issue is the same: reliable measurement is necessary for accurate inference. Reliability is generally defined as the consistency of measurement procedures and is estimated as the proportion of variability in the measurements that is attributable to true individual differences. True individual differences result from differences in the construct of interest and other systematic characteristics. The proportion of variability in measurement attributable to the construct of interest exclusive of other systematic characteristics is its validity. Therefore, reliability is a necessary but not sufficient condition for validity and sets the ceiling.
Psychologists conduct interviews to make clinical judgments about psychological illnesses or administer self-report measures to assess personality, moods, coping styles, distress, and many other properties in the form of states or traits. Similarly, a physician takes a patient’s blood pressure to diagnose hypertension or a nurse collects blood samples to assay for a variety of hormones or immune markers. These are examples of how researchers and practitioners in the medical field make inferences about processes or characteristics inside the body from observations made in an office or laboratory. Because we do not directly observe these underlying characteristics, they are referred to as latent traits or latent variables.
When observed variables are reliable and valid measurements, the inferences we make are accurate. But if the observed variables are not reliable indicators of the latent variable, we may be fooling ourselves into thinking that we have captured the property of interest. In research contexts, the bias introduced by fallible measurement may be reduced by applying corrections (1). It is also possible to build the measurement model into the analysis using latent variable modeling.
Latent variables have been conceptualized and defined in different ways, and we adopt the broad formal definition proposed by Bollen (2) because it allows for assessment of both reliability and validity and subsumes all the models presented here:
“A latent random (or nonrandom) variable is a random (or nonrandom) variable for which there is no sample realization for at least some observations in a given sample.” (p.612)
This definition does not differentiate between the informal platonic view of latent variables as real phenomena that are not observed with current tools and practices and the view that latent variables are representations of properties of individuals. (For a discussion of the different definitions, see Bollen (2).) It also does not differentiate between formative or reflective measurement with cause or effect indicators, respectively. In this article, however, we consider only reflective measurement models. Whether the latent variable is a real phenomenon of interest, hidden from the naked eye, or a representation of properties of persons (hypothetical construct) depends on the specific sample realization. This distinction may have implications for whether the latent variables are used in causal models as argued by Sobel (3) and may need to be considered in a given application. However, we do not address the reality of latent variables in this article but illustrate different manifestations of latent variable models in psychosomatic medicine.
Latent variable models are becoming popular in behavioral research because of their flexibility and generality. A unique feature of latent variable models, different from generalized linear models, is their ability to account for measurement error, both random (reliability) and systematic (validity). The value of this feature is easy to appreciate in psychological research where the constructs of interest tend to be individual characteristics that are unobserved, such as depression. The variables of interest in medical studies, such as the presence of disease, may seem to be easier to detect but are subject to similar problems as psychological constructs (4). Furthermore, many physiological measures (e.g., blood pressure) vary as a function of factors such as posture, time of day, and stress, to name a few. In fact, any measurement procedure is intrinsically influenced by multiple random sources of variability.
However, the medical field has been more skeptical in accepting latent variable models, and although articles using these models appear in statistical (5,6) and applied journals in medicine (7), their use is not as common as it is in the social sciences, particularly as a means to address reliability. One possible reason for this skepticism is that procedures for assessing reliability in medical studies are typically introduced separately from the analytic methods used to answer research questions. In other words, the assessment of reliability is treated as a separate analysis from the test of a hypothesis. Another reason is that latent variables do not have an easy or immediate translation into medical decisions for a single patient because they are based on group information, and a physician is making a decision for an individual relative to normative data. Thus, the value of latent variable models may be lost to many in the medical community and by association in psychosomatic medicine research. It is also the case that initial introductions of latent variable models emphasized “causality” and inappropriate applications have been criticized (8). In this article, we argue that latent variable models have a lot to offer psychosomatic medicine research with respect to both psychological and medical variables.
The primary purpose of this article was to integrate the reliability traditions of medicine and psychology and describe how a latent variable framework may be used more broadly to examine a variety of measurement models in psychosomatic research. Our aim was to demonstrate how latent variables would improve research in terms of reliability in measurement, sensitivity, and specificity and in instances of missing data. In doing so, we address how using latent variables in research may inform clinical practice. We begin by presenting classical reliability and generalizability theory models specified within a latent variable framework. Then we consider sensitivity and specificity models that are common in medicine and present the applicability of a latent class model. The last model we cover is item response theory (IRT), which is most applicable when developing questionnaires to measure a health outcome. We also illustrate how working with latent variables, in addition to addressing measurement error, may help deal with instances of missing data. Throughout the article, analyses and results from published articles that have applied latent variable modeling will be used to illustrate the points we emphasize.
THE CLASSICAL PSYCHOMETRIC EQUATION
Measurement in psychology has a long tradition beginning with classical test theory. Classical test theory (9–11) considers that the observations we make (labeled X) are not completely precise indicators of the underlying true latent trait (labeled T) but, in fact, have some type and amount of measurement error (labeled E). A simple equation stating this idea is
$$X = T + E \tag{1}$$
As an example, let us consider systolic blood pressure. We assume there is a continuum of systolic blood pressure levels (as shown in Fig. 1) and that each individual has a unique position along that continuum. At some point along that continuum, there may be a threshold, where the label for the blood pressure changes from normal to hypertensive. But the existence of this threshold does not take away from the key notion that systolic blood pressure occurs on a continuum. The unique position of each person on that continuum is his/her true blood pressure level, T. When blood pressure is assessed by a physician, nurse, technician, or automated device, the reading we see is not T but the observed value, X. X could be exactly equal to T, higher than T, or lower than T. The difference between the true level and the observed level is the amount of measurement error that exists in that specific observation. Under classical test theory, this error is random and normally distributed with mean equal to 0 and a constant variance ($\sigma_E^2$). Thus, we expect the blood pressure readings we take on a given person, if we were to take many of them repeatedly, to form a normal distribution around the person’s true level, with the mean exactly at T. These ideas are illustrated in Figure 1, where each T represents the true score of one of three individuals, and for each individual, there could be different observed scores, the X’s, obtained from repeated measurements on that individual.
CLASSICAL APPROACH TO EVALUATING MEASUREMENT ERROR
When we observe a single individual at one time, as is done in clinical practice, it is not possible to partition the true score from the measurement error. Both parts are combined into the single observation. But if we were to measure a group of individuals multiple times, it would then be possible to partition the variability in the observations into variability due to true individual differences and error variability. This variance partitioning is expressed in the equation:
$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2 \tag{2}$$
One way of evaluating the amount of measurement error is to calculate the proportion of variance in the observations that is associated with true score variance and the proportion that is associated with error variance. This is the conventional approach to estimating reliability because, in the classical theory framework, reliability is defined as the proportion of total variance that is due to true score variance,
$$r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2} \tag{3}$$
where $r_{xx}$ represents the reliability coefficient. When reliability is close to 1, the error variance is very small relative to the true score variance, and we can feel confident about the precision of our observed measures. Although there is no established cutoff value, over time, a value of 0.80 or higher has come to be associated with reliable measurement for making decisions about groups, as is done in research settings, whereas for individual decisions in clinical practice, values greater than 0.90 are desirable (12).
The notation used to indicate a reliability coefficient ($r_{xx}$) with two x’s in the subscript gives us a clue as to how the coefficient is often estimated in practice. Reliability is typically estimated from the correlation between two repeated measurements. If those two repeated measurements represent the same true score with the same metric for each individual and have equal error variances, they are considered to be parallel measurements. It can be shown algebraically that the correlation between two parallel measurements is equivalent to the proportion of true variance, because the covariance between parallel measurements equals the true score variance and both measurements have the same total variance. Depending on how a reliability study is designed, the two parallel measurements could represent the exact same measurement procedure implemented twice over an interval (test-retest or stability reliability), the same measurement procedure implemented by two different individuals (interrater reliability), or the measurements made from two different forms (alternative form reliability). For self-report questionnaires consisting of multiple items, another type of reliability (internal consistency) is estimated by a formula, Cronbach α, which essentially estimates the average correlation between two subsets of the items, taken across all possible splits of the items into two subsets. This formula does not assume parallel measurement, only that the metrics of the items are the same (referred to as tau equivalent). Several well-known psychometric textbooks describe these ideas more comprehensively (12–14).
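The link between the correlation of parallel measurements and the variance ratio can be checked numerically. The following is a minimal simulation sketch, not part of the original analyses: the variances, sample size, and the helper name `cronbach_alpha` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 20000
var_true, var_error = 100.0, 25.0   # hypothetical variances; ratio 100/125 = 0.80

# X = T + E for two parallel measurements of the same people
T = rng.normal(120.0, np.sqrt(var_true), n_people)
x1 = T + rng.normal(0.0, np.sqrt(var_error), n_people)
x2 = T + rng.normal(0.0, np.sqrt(var_error), n_people)

theoretical_rxx = var_true / (var_true + var_error)
estimated_rxx = np.corrcoef(x1, x2)[0, 1]   # correlation of parallel measures


def cronbach_alpha(items):
    """Cronbach alpha for an (n_people, k_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_variances / total_variance)


# Alpha refers to the reliability of the summed score of the two readings
alpha_two_readings = cronbach_alpha(np.column_stack([x1, x2]))
print(theoretical_rxx, estimated_rxx, alpha_two_readings)
```

Note that α here describes the composite of both readings, so it is expected to exceed the reliability of a single reading.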
THE LATENT VARIABLE MODEL APPROACH TO CLASSICAL RELIABILITY
The parallel measurement model under the classical theory may be specified using latent variables in the context of structural equation modeling. This can be done using confirmatory factor analysis (CFA), where a latent variable is specified for the construct of interest (in this case, the true score), and the parallel measures represent the observed indicators of the construct (i.e., the X values). For an introduction to CFA, see the Statistics Corner article by Babyak and Green (15). The parallel model assumptions (equal metrics and equal variances) imply that the loadings for the indicators all equal 1 and that the error variances are equal for all indicators. As illustrated in Figure 2, the latent construct (T) is responsible for the covariation between the observed indicators (X). Because the errors are independent from one another and from T, they do not contribute to such covariation. Reliability can be calculated by estimating the variance of the latent variable (true score variance) and the variance of the errors (constrained equal between indicators) and substituting those estimates into Equation 3.
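Under the parallel model, every covariance between indicators equals the true score variance and every indicator variance equals the true score variance plus the common error variance. The sketch below uses simple moment estimates from the sample covariance matrix rather than the maximum likelihood estimation a CFA program would use; the function name and simulated values are assumptions for illustration.

```python
import numpy as np


def parallel_model_reliability(X):
    """Moment estimates for the parallel model: Sigma = var_T * 11' + var_E * I."""
    S = np.cov(np.asarray(X, dtype=float), rowvar=False)
    k = S.shape[0]
    off_diagonal = S[~np.eye(k, dtype=bool)]
    var_true = off_diagonal.mean()              # covariances reflect true score variance
    var_error = np.diag(S).mean() - var_true    # remainder of each indicator variance
    return var_true, var_error, var_true / (var_true + var_error)


# Simulated indicators: loadings fixed at 1, equal error variances (hypothetical values)
rng = np.random.default_rng(1)
n, k = 10000, 3
T = rng.normal(0.0, 2.0, size=(n, 1))        # true score variance = 4
X = T + rng.normal(0.0, 1.0, size=(n, k))    # error variance = 1

var_true, var_error, rxx = parallel_model_reliability(X)
print(var_true, var_error, rxx)
```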
Thus, reliability under the classical theory is a special case that can be specified as a restricted CFA with parallel measurement. Under this restricted model, the latent variable is the true score. But a reflective measurement model, using latent variables, is broader and can accommodate congeneric measures, where the loadings are estimated freely and the error variances not constrained. These broader models, when properly specified, also allow for estimation of validity by accounting for both random and systematic errors, as illustrated below.
A study of the relationship between adherence to medication and viral load in adults positive for human immunodeficiency virus (HIV+) infection (16) illustrates the improvement in estimation of an anticipated effect (adherence to medication leading to reduction in viral load) that can be gained when measurement error is modeled. The reader is reminded that unstandardized regression coefficients are attenuated by unreliable predictors, and standardized coefficients, such as correlations, are attenuated by unreliability in either variable. In the example, adherence was measured using three different methods: two self-report procedures and electronic bottle caps. Measures of adherence and viral load were taken at multiple times. Latent variables were specified using the repeated indicators for each method of adherence (self-report or caps) and for viral load to account for stability reliability using a parallel measurement model. Stability reliability for measures of adherence was 0.36 to 0.51, depending on the method; stability reliability of viral load was 0.73. Correlations between individual observed measures of adherence (i.e., X) and viral load ranged between 0 and −0.38. The −0.38 was also the highest value obtained from a linear composite of observed adherence variables correlated with the mean viral load. But when the latent variables with multiple indicators were used, the correlations between adherence and viral load improved to −0.45 to −0.49, depending on the method. A second-order factor model that combined the three latent variables, one for each method, was then specified to further account for systematic method variance and increase validity, which further improved the correlation to −0.67. Thus, in a research context, working with a latent variable model yielded an estimate closer to the anticipated effect of medication adherence on HIV viral load. 
In contrast, conventional approaches would have suggested only a relatively modest association between medication adherence (a standard of care in HIV management) and viral load (a clinically meaningful indicator of disease activity/status). This example illustrates how using a single indicator can miss or underestimate an important association.
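The attenuation described above can be illustrated with the classical correction for attenuation, which divides an observed correlation by the square root of the product of the two reliabilities. This is a sketch of that textbook correction, not the latent variable analysis used in the adherence study; the input values are hypothetical.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: estimated correlation between true scores."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values chosen for illustration only
observed_r = -0.30
corrected_r = disattenuate(observed_r, rel_x=0.50, rel_y=0.72)
print(round(corrected_r, 2))
```

The lower the reliabilities, the larger the gap between the observed and corrected correlations, which is the pattern the latent variable model recovers within a single analysis.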
GENERALIZABILITY THEORY WITHIN A LATENT VARIABLE MODEL
Classical theory does not specify what the source of error might be. In fact, there could be multiple sources of error, some more problematic than others. When measuring blood pressure, the error could come from the individual taking the reading or the occasion of measurement. Other sources are possible such as the device used to take the blood pressure, the posture of the person while the reading is taken, or the psychological state of the person at the time the reading is taken. Therefore, the simplicity of Equation 1 is deceiving because the error component, E, could come from a single source or from a combination of sources, but because there is only one E term in the equation, those sources cannot be decomposed. Generalizability (G) theory (17) provides a flexible framework allowing the specification of multiple sources of error in a measurement procedure. Full explanation of the theory is beyond the scope of this article, and readers are referred to the works of Cronbach et al. (17), Brennan (18), and Llabre et al. (19) for more details. At a very basic level, instead of a single error term, multiple sources of error may be specified.
With the appropriate data collection design, the variability in the multiple sources may be estimated, and this estimation can be done within the latent variable framework (20). As an example, imagine a study where we have readings of blood pressure taken by different nurses (a random facet) on a sample of participants (the objects of measurement) on multiple occasions (a second random facet). If there are two nurses taking readings on four separate occasions, the result will be eight readings per participant. Those eight readings are the indicators in a model that specifies a “true score” latent variable, two latent variables for the nurse effect, and four latent variables for the occasion effect, plus other random errors as represented in Figure 3. By fixing all loadings to 1 and constraining the variances of the nurse latent variables equal to each other, the variances of the occasion latent variables equal to each other, and the variances of the errors all equal to each other, the necessary independent variance components may be estimated. To ensure independence, the covariances among the latent variables are fixed at 0. Llabre et al. (19) reported on a similar design to determine the number of blood pressure readings needed to obtain reliable measurement using analysis of variance. But the same estimation can be accomplished in a latent variable framework. In this manner, applications of G-theory can use the latent variables to examine their associations with other variables controlling for measurement error.
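Once variance components are available, they can be combined to project the reliability of an average over any number of nurses and occasions. The sketch below assumes the components correspond to the latent variable variances described above (person, nurse effect, occasion effect, residual) and computes a relative generalizability coefficient; the function name and all numeric values are hypothetical.

```python
def g_coefficient(var_person, var_nurse, var_occasion, var_residual,
                  n_nurses, n_occasions):
    """Relative G coefficient for the mean of n_nurses x n_occasions readings."""
    error = (var_nurse / n_nurses
             + var_occasion / n_occasions
             + var_residual / (n_nurses * n_occasions))
    return var_person / (var_person + error)


# Hypothetical variance components (mm Hg squared), for illustration only
components = dict(var_person=90.0, var_nurse=5.0,
                  var_occasion=20.0, var_residual=30.0)

# Compare a single reading with designs that average over more readings
designs = [(1, 1), (2, 4), (2, 8)]
results = {d: g_coefficient(**components, n_nurses=d[0], n_occasions=d[1])
           for d in designs}
for design, g in results.items():
    print(design, round(g, 3))
```

Averaging over more occasions shrinks the error term, which is how a design of this kind can indicate the number of readings needed for reliable measurement.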
RELIABILITY OF MEASUREMENT IN MEDICINE
The classical psychometric tradition is recognized in the medical field (21), but medical research emphasizes categorical outcomes (22,23) because decisions made in medical settings often involve the classification of a patient as having a disease or not. It is this emphasis on classification, mostly dichotomies, that is partly responsible for approaching reliability in medicine with statistical indices of interrater agreement. One of those statistical indices is the κ coefficient. A previous study (23) suggested that the intraclass κ is also equal to the proportion of variance in the observed score accounted for by true score variance, as expressed in Equation 3, and meets the classical reliability definition. But these links to measurement theory are seldom made explicit. Rather the focus, as exemplified by how the topic of reliability is covered in current epidemiology methods textbooks (24), is on the computation and interpretation of coefficients. It is important to note that interrater agreement on a decision is a separate issue from the reliability of a measurement procedure. Two physicians could have perfect agreement on a diagnosis but be incorrect if their decisions are based on unreliable measures. Thus, consideration of the reliability of the measurement per se is an essential step.
SENSITIVITY AND SPECIFICITY
Prominent in the medical field are indices of sensitivity and specificity of measures—related statistics in the context of dichotomies that address validity. The calculation of sensitivity and specificity usually involves a comparison between a newly developed measure and a reference or “gold” standard. Sensitivity is the proportion of correct positive classifications, and specificity is the proportion of correct negative classifications. The reference standards are typically assumed to be free of measurement error and true measures of the construct being assessed. In reality, however, reference standards are subject to the same measurement error as newly developed measures, and these errors need accounting in the statistical modeling to avoid bias. Models that include measurement error when assessing sensitivity and specificity have been proposed within a latent variable framework in the form of latent class analysis (25,26). These models address the inherent limitations by not requiring the use of a “gold” standard to compare to newly developed measures but instead allow the use of new and established measures in the same model to assess sensitivity and specificity for each measure simultaneously.
USING LATENT CLASS ANALYSIS TO EXAMINE SENSITIVITY AND SPECIFICITY
Latent class analysis (LCA) is an extension of latent variable modeling where the latent variable is categorical and the indicators are also categorical (27). In LCA, individuals are placed into unobserved or latent classes based on a set of observed indicators whose associations are explained by the classes. A key assumption to LCA is the idea of conditional or local independence. Conditional independence means that the observed indicators within a given latent class are independent. In other words, within a condition or class, the results of one test do not inform the results of the other tests. However, this assumption does not suggest that the variables are independent within the data set. In fact, the relationships among the observed variables are explained by the classes. As is the case with most dichotomous variables used in medicine, LCA models assume that the classes are mutually exclusive and exhaustive (i.e., each individual belongs to one class). LCA recognizes the uncertainty of classification and the possible violation of the conditional independence assumption, but this uncertainty can be reduced by the use of more reliable measures that are used to create the latent classes.
When assessing sensitivity and specificity using an LCA approach, a two-class solution is first tested to determine whether it is consistent with the data. Once the number of classes is established (two in the case of a dichotomous diagnosis, but more are possible), two types of parameters are estimated using maximum likelihood: the latent class prevalence, which is the proportion of individuals within a given class, and the item-response probabilities, which are the probabilities of endorsing a response given the class membership. The sensitivity of a measure is expressed as a response probability (i.e., the probability of having a positive diagnosis according to that measure given membership in the positive condition class). The probability of having the condition based on a specific measure, yet being a member of the negative condition class, is the false-positive probability. Subtracting this probability from 1 gives the specificity of the test. Positive and negative predictive values of a test can also be examined using an LCA approach because the information needed to calculate these values (i.e., sensitivity, specificity, and prevalence) can be obtained.
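The predictive values mentioned above follow from sensitivity, specificity, and prevalence through Bayes' rule. A minimal sketch follows; the function name and input values are illustrative assumptions, not estimates from any study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) computed with Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(condition | positive result)
    npv = true_neg / (true_neg + false_neg)   # P(no condition | negative result)
    return ppv, npv


# Hypothetical inputs of the kind an LCA would supply
ppv, npv = predictive_values(sensitivity=0.90, specificity=0.80, prevalence=0.50)
print(round(ppv, 3), round(npv, 3))
```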
Sensitivity and specificity for several cutoffs of a given measure can be examined using the LCA approach in separate LCA models. As an example, LCA could be used to test the sensitivity and specificity for the different cutoff values for triglycerides specified in the literature to detect metabolic syndrome in children and adolescents (28). For instance, the first LCA model would include four dichotomous indicators indicating yes or no diagnosis (i.e., elevated blood pressure, overweight as indicated by body mass index or waist circumference, low level of high-density lipoprotein, and insulin resistance) and whether the adolescent has triglyceride levels of 110 mg/dL or greater as specified in the criteria of Cook et al. (29). After obtaining the sensitivity and specificity of triglycerides using the criteria of Cook et al. (29), we would then test a second LCA model that is exactly the same as the first model, except that the criterion for high triglycerides would be 100 mg/dL or greater, following De Ferranti et al. (30). One could then determine which cutoff value for triglycerides maximizes some function of sensitivity and specificity such as the Youden index (31) when detecting metabolic syndrome in children and adolescents. When using multiple indicators of the same construct, LCA can also be used for reliability analysis (32,33). Once the latent classes have been established to distinguish healthy from diseased, they can be incorporated into more complex models that compare them on the basis of risk factors, or the classes may be used to predict subsequent disease states. Working with the latent classes rather than the observed measures provides a way to account for measurement error in analytic models.
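The comparison of cutoffs can be sketched with the Youden index, defined as sensitivity plus specificity minus 1. The sensitivity and specificity values below are hypothetical placeholders for what the two separate LCA models would return; they are not results from the cited studies and imply nothing about which cutoff is preferable in practice.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


# Hypothetical LCA estimates for two triglyceride cutoffs (illustration only)
candidates = {
    "110 mg/dL": {"sensitivity": 0.70, "specificity": 0.90},
    "100 mg/dL": {"sensitivity": 0.80, "specificity": 0.82},
}

scores = {cutoff: youden_index(**est) for cutoff, est in candidates.items()}
best_cutoff = max(scores, key=scores.get)   # cutoff with the largest J
print(scores, best_cutoff)
```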
To present an example from the literature, Rindskopf and Rindskopf (25) used LCA to test the sensitivity and specificity of four measures commonly used to diagnose myocardial infarction (MI) in hospitalized patients: presence or absence of a new positive Q-wave, presence or absence of major risk factors for MI, presence or absence of a high level of creatine phosphokinase–myocardial band test, and whether the patient flipped a high level of lactate dehydrogenase. A diagnosis of an MI can be determined from each of the four individual measures. However, there may be discrepancy or lack of sensitivity or specificity for a given measure so a response pattern based on results from each individual measure is used to classify the individual. In other words, each response pattern is a class. An individual can have a response pattern of “yes” across all four measures indicating that an MI has indeed occurred or “no” across all of the measures suggesting that the individual did not experience an MI. There is also the possibility that only three of the four measures suggest a positive diagnosis, and this response pattern would be considered a third class. Using an LCA approach, each different response pattern represents a potential class; however, model fit usually suggests fewer classes than the expected number of different response patterns. In their article, a two-class model was accepted, meaning that one class had a “no” for each individual measure and the other class had a “yes” for each individual measure. In the sample, about 46% (latent class prevalence) were considered to have had an MI based on the positive diagnosis from each of the four measures. Item-response probabilities were examined to determine the sensitivity (the probability of a positive result on a measure in the MI class) and specificity (1 minus the probability of a positive result on that measure in the non-MI class) of each test. More recently, this approach was applied in the area of substance abuse (34).
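To make the estimation concrete, the following is a minimal sketch of a two-class latent class model fitted by an expectation-maximization algorithm to simulated binary test results. It is not the analysis of Rindskopf and Rindskopf (25): the data are simulated, the true values and starting values are arbitrary assumptions, and no model fit statistics or standard errors are produced.

```python
import numpy as np


def fit_two_class_lca(Y, n_iter=500, tol=1e-8):
    """EM for a two-class latent class model with binary indicators.

    Returns (prevalence of class 1, P(positive | class 0), P(positive | class 1)).
    """
    Y = np.asarray(Y, dtype=float)
    prev = 0.5                                   # starting prevalence of class 1
    p = np.vstack([np.full(Y.shape[1], 0.2),     # class 0 starts mostly negative
                   np.full(Y.shape[1], 0.8)])    # class 1 starts mostly positive
    last_ll = -np.inf
    for _ in range(n_iter):
        # E-step: pattern likelihood within each class under local independence
        lik0 = (1 - prev) * np.prod(p[0] ** Y * (1 - p[0]) ** (1 - Y), axis=1)
        lik1 = prev * np.prod(p[1] ** Y * (1 - p[1]) ** (1 - Y), axis=1)
        post1 = lik1 / (lik0 + lik1)             # P(class 1 | response pattern)
        # M-step: update prevalence and item-response probabilities
        prev = post1.mean()
        p[1] = (post1[:, None] * Y).sum(axis=0) / post1.sum()
        p[0] = ((1 - post1)[:, None] * Y).sum(axis=0) / (1 - post1).sum()
        ll = np.log(lik0 + lik1).sum()           # log-likelihood before this update
        if ll - last_ll < tol:
            break
        last_ll = ll
    return prev, p[0], p[1]


# Simulate four dichotomous tests with hypothetical sensitivities and specificities
rng = np.random.default_rng(2)
n = 5000
true_prev = 0.45
true_sens = np.array([0.80, 0.90, 0.85, 0.75])
true_spec = np.array([0.95, 0.80, 0.90, 0.85])
condition = rng.random(n) < true_prev
prob_positive = np.where(condition[:, None], true_sens, 1 - true_spec)
Y = (rng.random((n, 4)) < prob_positive).astype(int)

prev_hat, p_negative_class, p_positive_class = fit_two_class_lca(Y)
sens_hat = p_positive_class          # P(test positive | condition class)
spec_hat = 1 - p_negative_class      # 1 - P(test positive | no-condition class)
print(round(prev_hat, 2), sens_hat.round(2), spec_hat.round(2))
```

No reference standard enters the estimation; the classes are recovered from the associations among the four tests alone.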
ITEM RESPONSE THEORY
In anticipation of obtaining reliable measurement, researchers using questionnaires may use IRT to calibrate a set of items that appropriately target a population’s level on a given construct (i.e., latent trait). IRT may also involve categorical indicators; however, these indicators reflect an underlying continuous latent variable. Model parameters include item difficulty and item discrimination. Item difficulty refers to how difficult it is for a person to affirm an item or the difficulty in endorsing one item response over another when using measures with ordinal categorical responses. Item discrimination is how well an item can discriminate between persons with relatively higher or lower trait levels than the difficulty of the item. Factor loadings may be interpreted as indices of discrimination and thresholds (intercepts) as indices of difficulty. Among the many models available in IRT, the Rasch model, one-parameter logistic model (1-PL), and two-parameter logistic model (2-PL) are common (35).
In the medical field, IRT has been used to develop a quality-of-life item bank for the Patient-Reported Outcomes Measurement Information System (36). A recent study (37) used a sample of more than 3000 patients with long-term illnesses to examine item difficulty (i.e., how limited a person is in physical functioning), item discrimination, and category threshold discrimination of 15 physical functioning items by applying a graded response model and comparing a 2-PL to a 1-PL to test nested models. A graded response model is used when item responses can be described as an ordered categorical response (e.g., 0 = not at all, 1 = sometimes, 2 = often). Comparison of a 2-PL (freely estimated slope or discrimination parameter) to a 1-PL (fixed slope or discrimination parameter) model can help researchers assess whether the relationship with the latent variable is similar or different across items.
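The difference between the two logistic models can be seen in the item response function for a dichotomous item. The sketch below is a simplified illustration with hypothetical parameter values; it does not implement the graded response model used for ordered categories.

```python
import numpy as np


def item_response_probability(theta, difficulty, discrimination=1.0):
    """2-PL probability of endorsing an item at latent trait level theta.

    Holding discrimination at one common value for all items gives a 1-PL.
    """
    theta = np.asarray(theta, dtype=float)
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))


theta_grid = np.array([-2.0, 0.0, 2.0])   # latent trait levels in logits

# Hypothetical items: an easy item and a harder, more discriminating item
easy_item = item_response_probability(theta_grid, difficulty=-1.5, discrimination=1.0)
hard_item = item_response_probability(theta_grid, difficulty=1.5, discrimination=2.0)
print(easy_item.round(3), hard_item.round(3))
```

A person whose trait level equals the item difficulty has a probability of 0.5 of endorsing the item, and a larger discrimination makes the probability change more steeply around that point.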
When running an IRT model, the difficulty of the item is placed on the same scale as a participant’s latent trait level. Because IRT places item properties on the same metric as person scores, researchers can assess if the measure covers the appropriate range of difficulty and levels of the construct of interest, which has implications for reliability and confidence in the scores obtained (38). An item-person map places item difficulties and/or response category thresholds and participants’ latent trait (e.g., physical functioning) on the same metric using a logit scale, which is centered at a mean of zero. Items located at zero would be considered moderately difficult to endorse (e.g., can do household chores) and participants would be limited a little in physical functioning. Very difficult physical activities to endorse (e.g., can walk more than 30 minutes) and participants with no limitations at all have positive logits. Easy items to endorse (e.g., can get out of bed) and participants who are limited a lot in physical functioning have negative logits. In Figure 4, the item difficulties map well onto the distribution of the participant’s latent trait levels, with the exception of Item 4, which is extremely easy. In this case, researchers may consider removing Item 4 from the measure because such an item will result in no variability among participants and, therefore, will not contribute to their differentiation. IRT may also be used to assess item bias as differential item functioning and/or generate individual factor scores.
ADVANTAGE WHEN DESIGN HAS MISSING DATA
In addition to handling measurement error, latent variable models have other applications not covered here. One such application is particularly useful in research situations in medicine that anticipate missing data resulting from procedures that are costly and/or intrusive. Such situations are common in medical research where a large sample may be needed to obtain a certain level of power, but where the best measures of the desired outcome are impractical to implement on a large scale. Take for example the use of the euglycemic clamp to measure insulin resistance. Although this procedure is considered the gold standard, it requires the presence of a physician and is extremely burdensome on the patient. Alternative measures such as homeostasis model assessment (HOMA), insulin sensitivity index (ISI) (39), or measures of fasting insulin and glucose derived from an oral glucose tolerance test (OGTT) are more practical for large-scale research projects. Rather than selecting a large sample and doing only OGTT or selecting a small sample and doing the clamp, one could design a study (40) where a large sample of participants is assessed with derived measures from the OGTT and a random subset is additionally given the clamp. Working with the latent variable of insulin resistance, the researcher can then include it as a predictor or outcome in a model and estimate the effect parameters using full information maximum likelihood (41,42). In this manner, the researcher capitalizes on the availability of clamp data on some participants, thus increasing the validity of the construct.
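The idea behind full information maximum likelihood is that each participant contributes to the likelihood through whichever variables were observed for that participant. The sketch below computes this casewise multivariate normal log-likelihood for a given mean vector and covariance matrix; in an actual analysis these would be implied by the latent variable model and optimized by the software. The function name and the small data array are hypothetical.

```python
import numpy as np


def fiml_loglik(data, mu, sigma):
    """Sum of casewise normal log-likelihoods using each row's observed variables only."""
    data = np.asarray(data, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for row in data:
        obs = ~np.isnan(row)
        if not obs.any():
            continue                             # fully missing rows contribute nothing
        resid = row[obs] - mu[obs]
        sub = sigma[np.ix_(obs, obs)]            # covariance of the observed variables
        _, logdet = np.linalg.slogdet(sub)
        quad = resid @ np.linalg.solve(sub, resid)
        total += -0.5 * (obs.sum() * np.log(2 * np.pi) + logdet + quad)
    return total


# Hypothetical rows: columns are clamp, HOMA, ISI; NaN marks a participant without clamp data
rows = np.array([[1.0, 0.5, 2.0],
                 [np.nan, 1.0, 2.0]])
value = fiml_loglik(rows, mu=np.zeros(3), sigma=np.eye(3))
print(value)
```

Because the second participant still contributes through the two observed measures, no case is discarded, which is what allows a planned-missingness design to use the full sample.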
To illustrate, we analyzed simulated data from a sample of 500 participants with measures of insulin resistance from the euglycemic clamp and OGTT, waist circumference, and low-density lipoprotein (LDL) cholesterol. A simple mediation model was tested, in which insulin sensitivity (a latent variable) mediated the relation between waist circumference and LDL, as shown in Figure 5. Insulin sensitivity was a latent variable with three indicators: the euglycemic clamp and two indicators derived from the OGTT (HOMA and ISI). In the context of a mediation model, where causality is implied, the latent variable (mediator) is viewed as a real phenomenon in the platonic sense, able to produce an effect. When we used the complete data on the total sample, the mediation model fit the data (χ² = 9.22, df = 5, p = .10) and the standardized indirect effect was 0.30, explaining 19% of the variance in LDL. We then repeated the analysis with the same sample and tested the same model but this time used clamp data only from a random subset of 50% of the sample. The rest of the variables were based on the complete data set. The mediation model based on partial data for the clamp again fit the data. The indirect effect was estimated at 0.32, explaining 21% of the variance. Similar results were obtained when the percent missing was increased up to 80%, but when clamp data were available for only 10% of the sample, we observed some positive bias in the estimate of the effect. When we excluded the clamp data and worked with the total sample but only the observed variable from HOMA as the mediator, the mediation model did not fit the data (χ² = 29.97, df = 1, p < .01), and the standardized indirect effect dropped to 0.093, explaining only 9% of the variance.
Thus, working with a latent variable allowed us to incorporate a gold standard on which we had missing data, yet take advantage of information from the total sample while producing a model with clinically relevant information on the influence of waist circumference on LDL via insulin resistance. More work is needed to establish the amount of missingness tolerated under various conditions and guide users in selecting optimal designs.
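The way maximum likelihood borrows information from the full sample can be sketched with a much simpler case than the mediation model analyzed here. The Python example below is not that analysis: it assumes only two variables that are bivariate normal, an inexpensive measure observed for everyone and a gold standard observed for a random subset (data missing completely at random). Under those assumptions, the maximum likelihood estimate of the gold-standard mean has a closed form (41). The data, variable names, and parameter values are hypothetical.

```python
import random
import statistics as st

def ml_mean_monotone(x, y, observed):
    """ML estimate of the mean of y when x is complete and y is used only where observed[i] is True.

    Assumes bivariate normality and a monotone missing-data pattern; the estimate is the
    complete-case mean of y adjusted by the regression of y on x in the observed cases.
    """
    x_obs = [xi for xi, o in zip(x, observed) if o]
    y_obs = [yi for yi, o in zip(y, observed) if o]
    mx_obs, my_obs = st.fmean(x_obs), st.fmean(y_obs)
    sxy = sum((a - mx_obs) * (b - my_obs) for a, b in zip(x_obs, y_obs))
    sxx = sum((a - mx_obs) ** 2 for a in x_obs)
    slope = sxy / sxx
    return my_obs + slope * (st.fmean(x) - mx_obs)

# Hypothetical simulated data: x stands in for an OGTT-derived index (everyone),
# y stands in for the clamp (random 50% subset).
rng = random.Random(2012)
n = 500
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0.0, 0.6) for xi in x]
has_clamp = [rng.random() < 0.5 for _ in range(n)]

complete_case_mean = st.fmean([yi for yi, o in zip(y, has_clamp) if o])
ml_mean = ml_mean_monotone(x, y, has_clamp)
print(f"complete-case mean of y: {complete_case_mean:.3f}")
print(f"ML mean of y using all x: {ml_mean:.3f}")
print(f"mean of y had everyone been measured: {st.fmean(y):.3f}")
```

The adjustment term uses the inexpensive measure from participants without clamp data, which is the same principle that allows a latent variable model estimated with full information maximum likelihood to use the total sample.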
LATENT VARIABLES AND CLINICAL PRACTICE
By definition, a latent variable is measured in the context of group information because it is derived from the covariance among a set of indicators. In addition, the metric or scaling of a latent variable is arbitrary and may be set to that of any specific indicator or to a standard scale. Thus, latent variables do not have a unique translation to a score for an individual. This is not a limitation in a research setting because research findings are based on analyses of samples selected from populations. Latent variables thus allow the estimation of parameters to inform research questions while accounting for measurement error.
The translation of a latent variable model to a clinical diagnosis is not straightforward and could take multiple forms. An example is the measurement of metabolic syndrome. Several definitions are currently used, all based on cutoff scores from indicators of the components of the syndrome (43–45). The lack of a standard definition and the arbitrary nature of the cutoff scores have been criticized (46). Latent variable models of metabolic syndrome have been specified and shown to generalize across populations (47) but remain removed from use in clinical practice. Are those latent variable models not useful? We would argue that before establishing a set of criteria that may be used in clinical practice, a progression of evidence is needed, starting with a latent variable model. A test of such a model answers the question of whether such a syndrome exists (i.e., CFA to address construct validity). Once that question has been answered, we must address the extent of the utility of the syndrome (predictive validity). Does it predict disease? This question can and should be answered with latent variables so as to obtain unbiased effect parameters. Translation to specific cutoff scores can then be addressed by determining whether there are subpopulations of individuals with and without the syndrome. This can be done within a latent class analysis framework as discussed above. Lastly, specificity and sensitivity for various cutoff scores can be assessed with a latent class analysis. We contend that the fact that a latent variable cannot be observed in a clinical setting does not detract from its value in research or its use in the progression from research to clinical practice.
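The last step in this progression, evaluating candidate cutoff scores, can be sketched as follows. This Python example is a simplification: it treats class membership as known for each person (e.g., the most likely class from a latent class analysis), whereas a full analysis would carry the classification uncertainty forward. The scores, class assignments, and cutoffs are hypothetical, and Youden's index (31) is used here only as one convenient summary.

```python
def sens_spec(scores, has_syndrome, cutoff):
    """Sensitivity and specificity of the rule 'score >= cutoff' against class membership."""
    tp = sum(1 for s, d in zip(scores, has_syndrome) if d and s >= cutoff)
    fn = sum(1 for s, d in zip(scores, has_syndrome) if d and s < cutoff)
    tn = sum(1 for s, d in zip(scores, has_syndrome) if not d and s < cutoff)
    fp = sum(1 for s, d in zip(scores, has_syndrome) if not d and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def youden_index(sensitivity, specificity):
    """Youden's J: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0

# Hypothetical indicator values (e.g., waist circumference in cm) and
# hypothetical class assignments (True = class with the syndrome).
scores = [88, 92, 95, 99, 101, 104, 108, 112]
classes = [False, False, False, True, False, True, True, True]

for cutoff in (95, 100, 105):
    se, sp = sens_spec(scores, classes, cutoff)
    print(f"cutoff {cutoff}: sensitivity = {se:.2f}, specificity = {sp:.2f}, "
          f"J = {youden_index(se, sp):.2f}")
```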
Composite scores can also play a role in clinical practice. These are factor scores (composites based on the indicators) derived from the latent variables to obtain information about each individual. Results from the latent variable model can be used to create a scoring system using observed variables that can in turn be used for clinical prediction. Factor scores obtained from the Patient-Reported Outcomes Measurement Information System (IRT-based items) are converted into the metric of T scores for ease of interpretation. The limitation is that a single factor score will necessarily contain measurement error. But if a distribution of factor scores is generated, then the mean of such a distribution will approximate the true score. In educational measurement, it is common practice to report, in addition to a person’s score, a confidence interval around that score based on the measure’s standard error of measurement. Recent developments in latent variable modeling software (48) have introduced the concept of random draws or plausible values (i.e., multiple imputations of factor scores to create a distribution) using a Bayesian approach, in addition to standard errors of those factor scores. With this approach, clinical researchers can better identify where an individual lies on a given continuum of disease. But clinical interpretation of the meaning of a factor score depends on an understanding of its distribution in populations of healthy and unhealthy individuals. Thus, a process for generating normative data would be needed.
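The quantities mentioned in this paragraph can be sketched in Python. The T-score conversion (mean of 50, standard deviation of 10) and the classical standard error of measurement are standard formulas. The plausible-value step is a deliberate simplification: it draws from a normal distribution centered on a factor score with a given standard error, and it is not the Bayesian procedure implemented in the software cited (48). All numeric inputs are hypothetical.

```python
import math
import random
import statistics as st

def t_score(z):
    """Convert a standardized score (mean 0, SD 1) to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * z

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(score, sem, z=1.96):
    """Approximate 95% interval around an observed score."""
    return score - z * sem, score + z * sem

def plausible_values(estimate, se, n_draws, rng):
    """Simplified plausible values: normal draws around a factor score estimate."""
    return [rng.gauss(estimate, se) for _ in range(n_draws)]

# Hypothetical individual: factor score of 0.8 with a standard error of 0.3.
rng = random.Random(48)
draws = plausible_values(0.8, 0.3, 1000, rng)
sem_t = standard_error_of_measurement(sd=10.0, reliability=0.91)
low, high = confidence_interval(t_score(0.8), sem_t)

print(f"T score: {t_score(0.8):.1f}, SEM: {sem_t:.2f}, interval: ({low:.1f}, {high:.1f})")
print(f"mean of plausible values: {st.fmean(draws):.3f}")
```

Reporting the interval or the distribution of draws, rather than the single score, makes the measurement error in an individual's factor score explicit.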
In this article, we described a latent variable modeling approach to the specification of measurement error models in psychosomatic medicine research, building on more traditional methods of examining measurement models in psychology and medicine. This approach, widely used in the psychological literature, has seen more limited application in the medical field. In a research context, latent variable models can be very useful for estimating effect parameters by separating sources of measurement error from the constructs of interest. In other words, researchers can be more confident about their research findings when they include the measurement model in their analysis. Several examples have been provided to illustrate the point that incorporating latent variable models in psychosomatic medicine research can help advance the field. More specifically, basic latent variable modeling can be used to examine reliability of parallel measures, account for multiple sources of error, and handle missing data. Advanced latent variable models such as IRT and LCA can be applied when developing new measures or examining the sensitivity and specificity of existing ones. But implementation of latent variable models can be complicated and require a level of statistical background that may be beyond the typical researcher.
We recommend that psychosomatic medicine researchers consider the use of latent variable modeling early in the development of the research study. For example, consider the inclusion of multiple indicators to cover a particularly important source or sources of measurement error. One can do this by taking multiple measurements (e.g., blood pressure readings) in a single day or taking multiple readings over a number of days. Furthermore, when including multiple measures of a construct, consider specifying them as indicators of a latent variable (i.e., an indirect way to measure the construct of interest) and incorporating the latent variable in the analysis, rather than analyzing the measures individually. If the study design involves the development of a questionnaire (multiple choice, rating scale, etc.), consider the use of IRT to examine the psychometric properties. Strengthening the measurement aspect of a study during the design phase will permit accounting for those aspects in the analysis phase. Implementing the analyses may require learning more about these models and/or collaborating with a quantitative methodologist trained in these methods. The field of psychosomatic medicine will benefit from increased familiarity with latent variable models and their proper implementation to address a variety of measurement issues.
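As a planning aid for the recommendation to take multiple measurements, the Spearman-Brown formula gives the reliability of the mean of k readings from the reliability of a single reading. The Python sketch below assumes the readings are parallel (equal true-score variance and uncorrelated errors), which may not hold for readings taken within the same session; the single-reading reliability of 0.50 is a hypothetical value.

```python
def spearman_brown(single_reliability, k):
    """Reliability of the mean of k parallel readings."""
    return k * single_reliability / (1.0 + (k - 1) * single_reliability)

def readings_needed(single_reliability, target, max_k=100):
    """Smallest number of parallel readings whose mean reaches the target reliability."""
    for k in range(1, max_k + 1):
        if spearman_brown(single_reliability, k) >= target:
            return k
    return None  # target not reachable within max_k readings

# Hypothetical single-reading reliability.
r_single = 0.50
for k in (1, 2, 3, 6):
    print(f"{k} reading(s): reliability of the mean = {spearman_brown(r_single, k):.2f}")
print("readings needed for 0.75:", readings_needed(r_single, 0.75))
```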
REFERENCES
1. Guolo A. Robust techniques for measurement error correction: a review. Stat Methods Med Res 2007; 17: 555–80.
2. Bollen K. Latent variables in psychology and the social sciences. Annu Rev Psychol 2002; 53: 605–34.
3. Sobel ME. Causal inference in latent variable models. In: von Eye A, Clogg CC, editors. Latent Variables Analysis: Applications for Developmental Research. Thousand Oaks, CA: Sage Publications; 1994: 3–35.
4. Feinstein AR. Hard science, soft data, and the challenges of choosing clinical variables in research. Clin Pharmacol Ther 1977; 22: 486–98.
5. Donaldson GW. General linear contrasts on latent variable means: structural equation hypothesis tests for multivariate clinical trials. Stat Med 2003; 22: 2893–917.
6. Dendukuri NN, Hadgu AA, Wang LL. Modeling conditional dependence between diagnostic tests: a multiple latent variable model. Stat Med 2009; 28: 441–61.
7. Jones RN, Fonda SJ. Use of an IRT-based latent variable model to link different forms of the CES-D from the Health and Retirement Study. Soc Psychiatry Psychiatr Epidemiol 2004; 39: 828–35.
8. Ling RF. Correlation and causation. J Am Stat Assoc 1982; 77: 489–91.
9. Spearman C. The proof and measurement of association between two things. Am J Psychol 1904; 15: 72–101.
10. Guilford JP. Psychometric Methods. New York, NY: McGraw-Hill; 1936.
11. Gulliksen H. Theory of Mental Tests. New York, NY: Wiley; 1950.
12. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York, NY: McGraw-Hill; 1994.
13. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968.
14. Crocker L, Algina J. Introduction to Classical and Modern Test Theory. New York, NY: Holt, Rinehart, & Winston; 1986.
15. Babyak MA, Green SB. Confirmatory factor analysis: an introduction for psychosomatic medicine researchers. Psychosom Med 2010; 72: 587–97.
16. Llabre MM, Weaver K, Duran R, Antoni M, McPhearson-Baker S, Schneiderman N. A measurement model of medication adherence to Highly Active Antiretroviral Therapy and its relation to viral load in HIV+ adults. AIDS Patient Care 2006; 20: 701–11.
17. Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The Dependability of Behavioral Measurements: Theory of Generalizability Scores and Profiles. New York, NY: Wiley; 1972.
18. Brennan RL. Elements of Generalizability Theory. Iowa City, IA: American College Testing; 1983.
19. Llabre MM, Ironson G, Spitzer S, Gellman MD, Schneiderman N. How many blood pressure measurements are enough? An application of generalizability theory to the assessment of blood pressure reliability. Psychophysiology 1988; 25: 97–106.
20. Marcoulides GA. Estimating variance components in generalizability theory: the covariance structure analysis approach. Struct Equat Modeling 1996; 3: 290–9.
21. Lachin JM. The role of measurement reliability in clinical trials. Clin Trials 2004; 1: 553–66.
22. Kraemer H. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika 1979; 44: 461–72.
23. Kraemer H, Periyakoil VS, Noda A. kappa coefficients in medical research. Stat Med 2002; 21: 2109–29.
24. Szklo M, Nieto FJ. Epidemiology: Beyond the Basics. Sudbury, MA: Jones & Bartlett; 2006.
25. Rindskopf D, Rindskopf W. The value of latent class analysis in medical diagnosis. Stat Med 1986; 5: 21–7.
26. Uebersax JS, Grove WM. Latent class analysis of diagnostic agreement. Stat Med 1990; 9: 559–72.
27. Collins LM, Lanza ST. Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences. New York, NY: Wiley; 2010.
28. Reinehr T, de Sousa G, Toschke AM, Andler W. Comparison of metabolic syndrome prevalence using eight different definitions: a critical approach. Arch Dis Child 2007; 92: 1067–72.
29. Cook S, Weitzman M, Auinger P, Nguyen M, Dietz WH. Prevalence of a metabolic syndrome phenotype in adolescents: findings from the third National Health and Nutrition Examination Survey, 1988–1994. Arch Pediatr Adolesc Med 2003; 157: 821–7.
30. De Ferranti SD, Gauvreau K, Ludwig DS, Neufeld EJ, Newburger JW, Rifai N. Prevalence of the metabolic syndrome in American adolescents: findings from the third National Health and Nutrition Examination Survey. Circulation 2004; 110: 2494–7.
31. Youden WJ. Index for rating diagnostic tests. Cancer 1950; 3: 32–5.
32. Clogg CC, Manning WD. Assessing reliability of categorical measurements using latent class models. In: von Eye A, Clogg CC, editors. Categorical Variables in Developmental Research: Methods of Analysis. San Diego, CA: Academic Press, Inc; 1996: 169–82.
33. Flaherty BP. Assessing reliability of categorical substance use measures with latent class analysis. Drug Alcohol Depend 2002; 68: 7–20.
34. Pence BW, Miller WC, Gaynes BN. Prevalence estimation and validation of new instruments in psychiatric research: an application of latent class analysis and sensitivity analysis. Psychol Assess 2009; 21: 235–39.
35. Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates, Inc; 2000.
36. Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader D, Fries JF, Bruce B, Rose M, PROMIS Cooperative Group. The Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care 2007; 45: S3–11.
37. Hays RD, Liu H, Spritzer K, Cella D. Item response theory analyses of physical functioning items in the medical outcomes study. Med Care 2007; 45: S32–8.
38. Dunn AL, Resnicow K, Klesges LM. Improving measurement methods for behavior change interventions: opportunities for innovation. Health Educ Res 2006; 21: i121–4.
39. Gutt M, Davis CL, Spitzer SB, Llabre MM, Kumar M, Czarnecki EM, Schneiderman N, Skyler JS, Marks JB. Validation of the insulin sensitivity index (ISI): comparison with other measures. Diabetes Res Clin Pract 2000; 47: 177–84.
40. Graham JW, Taylor BJ, Olchowski AE, Cumsille PE. Planned missing data designs in psychological research. Psychol Methods 2006; 11: 323–43.
41. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. New York, NY: Wiley; 2002.
42. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Methods 2002; 7: 147–77.
43. World Health Organization. Definition, Diagnosis, and Classification of Diabetes Mellitus and Its Complications: Report of a WHO Consultation. Geneva, Switzerland: World Health Organization; 1999.
44. Expert Panel on the Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults. Executive summary of the Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III). JAMA 2001; 285: 2486–497.
45. Grundy SM, Brewer HB Jr, Cleeman JI, Smith SC Jr, Lenfant C, National Heart, Lung, and Blood Institute, American Heart Association. Definition of metabolic syndrome: report of the National Heart, Lung, and Blood Institute/American Heart Association conference on scientific issues related to definition. Circulation 2004; 109: 433–38.
46. Kahn R, Buse J, Ferrannini E, Stern M. The metabolic syndrome: time for a critical appraisal: joint statement from the American Diabetes Association and the European Association for the Study of Diabetes. Diabetes Care 2005; 28: 2289–304.
47. Shen BJ, Goldberg RB, Llabre MM, Schneiderman N. Is the factor structure of the metabolic syndrome comparable between men and women and across three ethnic groups: the Miami Community Health Study. Ann Epidemiol 2006; 16: 131–37.
48. Muthén LK, Muthén BO. Mplus User’s Guide. 6th ed. Los Angeles, CA: Muthén & Muthén; 1998–2010.
In biomedical research, the coefficient of variation is often obtained as a measure of quality control or reproducibility within a sample pool or specimen. It should be noted that this statistic is not a measure of reliability in the classical sense and has serious limitations (21).
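The distinction can be sketched numerically in Python. Here reliability is taken as the proportion of total variance attributable to true differences among individuals, and the coefficient of variation as the error standard deviation divided by the mean; all values are hypothetical. The same coefficient of variation is compatible with very different reliabilities, depending on how much individuals truly differ.

```python
def coefficient_of_variation(error_sd, mean):
    """Within-specimen CV: error SD relative to the mean."""
    return error_sd / mean

def reliability(between_sd, error_sd):
    """Proportion of total variance due to true individual differences."""
    return between_sd ** 2 / (between_sd ** 2 + error_sd ** 2)

# Hypothetical assay: mean of 100 units and error SD of 5 units (CV of 5%).
cv = coefficient_of_variation(error_sd=5.0, mean=100.0)
for between_sd in (5.0, 20.0):
    print(f"CV = {cv:.0%}, between-person SD = {between_sd:.0f}: "
          f"reliability = {reliability(between_sd, 5.0):.2f}")
```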