Longitudinal studies are used in nursing research to examine changes over time in health indicators. The most common longitudinal designs in nursing research are observational and experimental studies. Examples of longitudinal nursing studies include an observational study to examine how often nursing diagnoses are made during nursing home care visits over 34 months of follow-up (Morales-Asencio et al., 2009) and a longitudinal, quasi-experimental study in two care settings (small-scale facilities and regular psychogeriatric wards) in traditional nursing homes for older people with dementia over a year (Verbeek et al., 2009).
Traditional approaches to longitudinal analysis of means, such as analysis of variance with repeated measures (R-ANOVA), are limited to cases with complete data or depend on imputation of missing data. For R-ANOVA, all subjects must have complete data to be included in the analysis; even cases with missing data at only one time point must be excluded. In a longitudinal study, participants may not complete the full study. During the study, they may drop out, move, or die. Withdrawal bias affects the study results if data are missing in a systematic manner (Hartman, Forsen, Wallace, & Neely, 2002; Puma, Olsen, Bell, & Price, 2009). For example, depressed people or those with low social support may be less likely to complete the study. They may drop out in the middle of the study because they are distressed psychosocially. Excluding these persons could bias the results of the study, because the data obtained are only from the less-depressed or less-distressed participants.
At a minimum, this bias reduces the generalizability of the findings to people who were willing to complete the study; unfortunately, this population often would not be the group of greatest interest. Even more dangerous is the situation in which the reader is not aware of the withdrawal bias and therefore interprets the data as if the entire population were included, including the depressed or distressed patients who might represent a particularly vulnerable group.
Withdrawal bias is a particular problem in randomized clinical trials. Participants who are not responding to treatment or who are experiencing side effects may drop out. The need for a complete evaluation of the real world efficacy of treatment led to the standard of intent to treat analysis in treatment evaluation (Friedman, Furberg, & DeMets, 2010). The participants who are excluded from the study due to missing data need to be a random sample of the participants to obtain unbiased estimation of the impact of an intervention or characteristic on the outcome.
Data imputation may be used as an alternative way to deal with missing data. The informativeness of missing data has implications for the appropriateness of estimates derived from multiple imputation. When missingness is informative, multiple imputation results in data that are skewed based on the pattern of missingness. Even when data are missing at random, but not missing completely at random, multiple imputation often leads to bias toward the null (White & Carlin, 2010).
Over the last several years, the advantages of modern methods for longitudinal data analysis have led to their use in increasing numbers of studies. One of the major advantages of linear mixed models (LMMs) is their ability to analyze longitudinal data with unequal numbers of measurements for different participants (Little, 1995). In LMMs, data from different cases that are assessed at different numbers of times can be included in the analysis. This is a great advantage when using clinical data, because some participants may have more frequent appointments and, therefore, more frequent assessments than others. Data from all appointments, rather than data obtained with the lowest common frequency, can be included in LMM analysis. The ability to include different numbers of assessments for different cases in analyses also means that all data obtained from cases with data missing at one or more time points can be included in the analysis. Because more participants are available for analysis, LMMs provide increased power.
Longitudinal studies include more opportunities to have incomplete data than cross-sectional studies, because they require continued participation in the study with multiple assessments over extended periods of time. Missing data occur in various ways in longitudinal studies (Hedeker & Gibbons, 1997): Individuals move or die and are lost to follow-up after baseline assessment, individuals choose not to complete follow-up assessments, or individuals do not answer questions about some sensitive issues. Missing data can be classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR; Rubin, 1976).
For data that are MCAR or MAR, missingness is noninformative and ignorable (Little & Rubin, 1987) and will not bias the results. For example, if data were lost due to a mistake by a data collector (MCAR), the incomplete data do not give any information. When data are MAR, the measurement after dropout can be predicted based on past measurements and provides unbiased results if the model is specified correctly (Post, Buijs, Stolk, de Vries, & le Cessie, 2010). Data that are MNAR have informative missingness that can lead to inappropriate conclusions in LMMs. Data that are missing based on initial values of the predictors or on outcomes can lead to vulnerability and omitted data biases.
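The difference between ignorable and informative missingness can be illustrated with a small simulation. The sketch below is not part of the original analysis: the score distribution, the dropout probabilities, and all variable names are assumptions chosen only to show that a complete-case mean stays close to the full-data mean under MCAR but shifts when the chance of being missing depends on the unobserved value itself (MNAR).

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical outcome scores for 10,000 simulated participants.
scores = [random.gauss(50, 10) for _ in range(10_000)]

# MCAR: every score has the same 30% chance of being lost.
observed_mcar = [s for s in scores if random.random() > 0.30]

# MNAR: the chance of being lost depends on the score itself
# (assumed 60% for scores above 50, 10% otherwise).
observed_mnar = [s for s in scores
                 if random.random() > (0.60 if s > 50 else 0.10)]

true_mean = statistics.mean(scores)
mcar_mean = statistics.mean(observed_mcar)
mnar_mean = statistics.mean(observed_mnar)

print(f"all data:            {true_mean:.2f}")
print(f"complete cases MCAR: {mcar_mean:.2f}")
print(f"complete cases MNAR: {mnar_mean:.2f}")
```

In this setup, the MNAR complete-case mean falls well below the full-data mean because high scorers are lost more often, which mirrors the depression example given earlier.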
In longitudinal studies, missingness that is informative can be related to variables not in the model, which makes the participants unrepresentative of the population due to omitted variable bias. If this unrepresentativeness is equal in subgroups, such as treatment and usual care groups, the results of the study will be skewed toward the type of participant that is represented more commonly in the study than in the population. If the impact of an intervention (or characteristic) is greater on the less represented group, it will appear that the intervention (or characteristic) has less impact than it really does. Analysis of the data will lead to a reduced or downward biased estimate of the effect from the intervention or characteristic, a vulnerability bias. In this situation, the impact is not skewed systematically with respect to the unmeasured characteristic, and the impact of the intervention or characteristic being evaluated is assessed in an unintended population. Continuing the example with depression, depressed individuals would have more missing data and would be distributed equally in the treatment groups. Thus, the proportion of depressed individuals in the analysis would be underrepresented. If the intervention was more effective among depressed people, then the results of the study would show less effectiveness than if data from more depressed people were in the analysis. Of course, the reverse could also occur if the group that was less susceptible to the treatment was overrepresented, also resulting in vulnerability bias.
Missingness that is informative may be due to variables in the model, such as initial values of the predictors or outcomes, and can lead to omitted variable bias. For example, if more female participants did not respond to questionnaires about depression, results indicating low average depression among respondents may be unrepresentative of the population. This situation results in skewed estimates in the desired population.
Missingness that is informative may also be related to variables in the model and to variables not in the model. In this situation, unrepresentativeness is unequal in treatment and usual care subgroups. The treatment and usual care groups will be differentially skewed; there will be a higher proportion of depressed patients in one group than in the other. This situation leads to skewed estimates in the wrong population. The omitted data bias in this situation can lead to over- or underestimates of the effects of interventions or characteristics being studied. For example, participants who are more depressed may be more common in the intervention group. In that case, if the intervention is less effective for those who are depressed, the estimates of the effectiveness of the intervention will be skewed negatively due to omitted variable bias. Different patterns of missingness and informativeness can lead to positive or negative skew in outcomes. Therefore, it is important to examine sensitivity of inferences based on data wherein informative missingness is suspected (Paddock, Edelen, Wenzel, Ebener, & Mandell, 2006).
Little (1995) introduced pattern mixture models for longitudinal data that are used to identify whether missing data patterns are informative and to include the missing data patterns as covariates in the model to control for the effect of missing data patterns on the outcome (Raghunathan, 2004). Pattern mixture models have been used to examine the possibility of longitudinal nonignorable missing data as a concern (Hedeker & Gibbons, 1997; Hogan & Laird, 1997; Pauler, McCoy, & Moinpour, 2003). Using pattern mixture models stratifies incomplete data based on the pattern of missing values and formulates distinct models within each stratum (Little & Wang, 1996). Subjects are grouped into missing data patterns for an outcome of interest, and then models are developed based on the differences in the distribution of the outcome variable over these patterns (Little, 2008).
Pattern mixture models are useful for the analysis of the informativeness of missing data in longitudinal studies (Little, 1995). Pattern mixture models explicitly model the missing data distribution by identifying different missing data patterns. Nonignorable missing data patterns can be included as predictors in models to evaluate the contribution of missing data patterns to the outcome (Paddock et al., 2006). Pattern mixture models can be used to evaluate the contribution of missingness to other predictors in the model by examining interactions between the missing value pattern indicator and the other predictor. When researchers are using longitudinal data to examine changes in an outcome over time, the interaction between the missing value pattern and time will allow evaluation of whether trends over time differ according to the presence of a missing value pattern. Pattern mixture models are conceptually and computationally straightforward and do not require specific software (Atkins, 2005). Overall estimates of the outcome can be calculated from fixed effects in LMMs. They provide separate information about estimated outcomes according to missingness. Pattern mixture models are used in longitudinal studies from various fields including nursing, medicine, psychology, and education. Few researchers have described the procedure for using pattern mixture models in a straightforward way and demonstrated how to accomplish it using SPSS.
The purpose of this presentation is to illustrate how to apply a pattern mixture model for assessing the informativeness of missing values in a longitudinal study using SPSS. Also presented is a strategy for incorporating missing data patterns into LMMs when missing data are MNAR and, therefore, are informative and not ignorable.
A longitudinal data set was used to demonstrate the application of the pattern mixture model to assess missingness and the inclusion of missing data patterns into an analysis of data wherein data are MNAR. The data set came from the Patients’ and Families’ Psychological Response to Home Automated External Defibrillator Trial (PRHAT; Thomas et al., 2011). The PRHAT study was conducted as an ancillary study of the Home Automated Defibrillator Trial. The Home Automated Defibrillator Trial was designed to determine whether having an automatic external defibrillator in the home improves survival of patients at intermediate risk of sudden cardiac arrest. The primary purpose of the PRHAT study was to examine the long-term (more than 2 years) psychological responses, including anxiety, depression, family coping skills, and social support, and to compare the effects of CPR training and CPR/automatic external defibrillator training groups on these changes in post-myocardial infarction patients and their spouses or companions. Details of the PRHAT study regarding recruitment, data collection, and instruments are described in another article (Thomas et al., 2011). Psychosocial variables were assessed at baseline and at 1-year and 2-year follow-ups. Pattern mixture models were used to examine missing data patterns in the longitudinal variable of patient coping.
At each study time point, whether each variable of interest was missing was coded into a separate dummy variable. For example, for the Family Crisis-Oriented Personal Evaluation Scale (FCOPES) completed by the patient, the presence or absence of missing data at baseline was coded into the variable miss_COPE0, the presence or absence of missing FCOPES data at the 1-year follow-up was coded into the variable miss_COPE1, and the presence or absence of missing FCOPES data at the 2-year follow-up was coded into the variable miss_COPE2. In each missing data variable, data were coded as follows: 0 = not missing and 1 = missing. The syntax in Figure 1 and screen captures for point and click application of SPSS in Supplemental Digital Content 1, http://links.lww.com/NRES/A70, are used to illustrate the creation of the dummy variable miss_COPE1 as an example. Similar syntax and statistical procedures were used for creating the dummy variables for the other two study time points.
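The same dummy coding can be sketched outside SPSS. The Python fragment below is an illustrative equivalent, not the syntax from Figure 1: the records and scores are invented, and only the variable names COPE0, COPE1, COPE2 and the miss_ prefix follow the text.

```python
# Hypothetical wide-format records: one dict per case,
# with None marking a missing FCOPES score.
cases = [
    {"id1": 1, "COPE0": 98, "COPE1": 101, "COPE2": 97},
    {"id1": 2, "COPE0": 87, "COPE1": 90, "COPE2": None},
    {"id1": 3, "COPE0": 105, "COPE1": None, "COPE2": None},
    {"id1": 4, "COPE0": 92, "COPE1": None, "COPE2": 95},
]


def add_missing_indicators(case, variables=("COPE0", "COPE1", "COPE2")):
    """Return a copy of the case with miss_<variable> dummies
    coded 0 = not missing and 1 = missing."""
    coded = dict(case)
    for name in variables:
        coded[f"miss_{name}"] = 1 if case[name] is None else 0
    return coded


coded_cases = [add_missing_indicators(case) for case in cases]

for case in coded_cases:
    print(case["id1"], case["miss_COPE0"], case["miss_COPE1"], case["miss_COPE2"])
```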
For patient coping (outcome variable) that was measured with the FCOPES at baseline, 1 year, and 2 years, missing data patterns were identified. In SPSS, this process can be performed with Missing Value Analysis, an option within the analysis menu (Figure 2; Supplemental Digital Content 2, http://links.lww.com/NRES/A71). In the PRHAT data set, among the 417 patients, 237 cases had data missing at either one or two study time points (Table 1). Three missing data patterns were identified: data missing only at 2 years (n = 118), data missing at both 1 and 2 years (n = 82), and data missing at 1 year but not at 2 years (n = 37). No cases with data missing at baseline were in the data set. Measurement of outcome variables at baseline is generally required for inclusion of participants in longitudinal LMM analyses.
Dummy variables were created for each missing data pattern. The pattern of patient coping data missing only at the 2-year follow-up (Pattern 1) was recoded into a new variable named MDP1 (0 = not missing with Pattern 1, 1 = missing with Pattern 1). The pattern of patient coping data missing at both the 1- and 2-year follow-ups (Pattern 2) was recoded into a new variable named MDP2 (0 = not missing with Pattern 2, 1 = missing with Pattern 2). The pattern of patient coping data missing only at the 1-year follow-up (Pattern 3) was recoded into a new variable named MDP3 (0 = not missing with Pattern 3, 1 = missing with Pattern 3; Figure 3; Supplemental Digital Content 3, http://links.lww.com/NRES/A72).
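As a rough illustration of how the pattern dummies follow from the time-point indicators, the sketch below derives MDP1, MDP2, and MDP3 from miss_COPE1 and miss_COPE2 and tallies the patterns. It is a Python analogue under assumed data, not the SPSS procedure shown in Figure 3; the five example cases are invented.

```python
from collections import Counter

# Hypothetical missingness indicators (0 = not missing, 1 = missing).
cases = [
    {"id1": 1, "miss_COPE1": 0, "miss_COPE2": 0},
    {"id1": 2, "miss_COPE1": 0, "miss_COPE2": 1},
    {"id1": 3, "miss_COPE1": 1, "miss_COPE2": 1},
    {"id1": 4, "miss_COPE1": 1, "miss_COPE2": 0},
    {"id1": 5, "miss_COPE1": 0, "miss_COPE2": 1},
]


def add_pattern_dummies(case):
    """Add one 0/1 dummy per missing data pattern described in the text."""
    m1, m2 = case["miss_COPE1"], case["miss_COPE2"]
    coded = dict(case)
    coded["MDP1"] = int(m1 == 0 and m2 == 1)  # missing only at 2 years
    coded["MDP2"] = int(m1 == 1 and m2 == 1)  # missing at 1 and 2 years
    coded["MDP3"] = int(m1 == 1 and m2 == 0)  # missing only at 1 year
    return coded


coded = [add_pattern_dummies(case) for case in cases]

# Tally how many cases show each (miss_COPE1, miss_COPE2) combination.
pattern_counts = Counter((c["miss_COPE1"], c["miss_COPE2"]) for c in cases)
print(pattern_counts)
```

Because the three patterns are mutually exclusive, each case has at most one dummy equal to 1, and a case with complete data has all three equal to 0.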
The final step in the evaluation was to examine whether scores on the outcome variable were related to missing data patterns. This involves use of LMMs. Prior to conducting LMMs, the data set must be restructured into a vertical format so that each time of assessment of the predictors and the outcome variable along with any other variables included in the analysis is on a separate line in the data set. This new structure for a data set is called by various names, including person-period data set (Singer & Willett, 2003), time-as-a-case format (Norusis, 2006), long versus wide, and vertical versus horizontal data structure (Heck, Thomas, & Tabata, 2010). The data set in the wide format before restructuring is shown in Figure 4-1. There is one line per case with separate variables for coping measured at three times (COPE0, COPE1, and COPE2) in wide format. The VARSTOCASES procedure in SPSS was used to restructure the data set from the wide to the long format (Figure 4-2). This process can be performed with Restructure, an option within the data menu (see Supplemental Digital Content 4, http://links.lww.com/NRES/A73). The restructured data file with the case identification variable, seven variables (patient coping, patient anxiety, patient depression, patient social support amount, spouse anxiety, spouse depression, and spouse social support amount) that were measured repeatedly, and several other variables that remained constant during the study are presented in Supplemental Digital Contents 5 (http://links.lww.com/NRES/A74) and 6 (http://links.lww.com/NRES/A75).
The example shows the restructuring procedure. In this example, the case number was used to create a case identifier (a new variable) named id1. In the syntax, ID=id1 is used to give this information to SPSS. In this study, some variables took on different values each time they were measured. The command MAKE uses repeated measurements of the same variable to create one variable that is measured on different occasions. For example, the variable coping was restructured from one line for each case with separate variable names for patient coping scores at baseline, 1 year, and 2 years into three separate lines in the data file for each case: one line for baseline, one line for 1-year follow-up, and one line for 2-year follow-up. The variable time is an index variable that takes three levels. Time indicates baseline (time = 1) and 1-year (time = 2) and 2-year (time = 3) follow-ups. The variables assumed to be constant and measured once (id1, age, gender, missingness at baseline, missingness at 1 year, missingness at 2 years, and each missing data pattern) needed to be present in every line in the new data file. The command KEEP was used to place these variables in every line. As a result, the data set was restructured into the long format with three lines per case (Figure 4-3). After restructuring the data, the time index variable was recentered so that the baseline was time = 0, 1-year follow-up was time = 1, and 2-year follow-up was time = 2. This recentering places the baseline estimate at the intercept and keeps all time estimates within the range of the real data. Without centering, the intercept for coping would be placed at time = 0, which would represent a time before the study started (Singer & Willett, 2003).
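The wide-to-long restructuring and the recentering of time can be sketched in a few lines. The following Python fragment is an illustrative analogue of the VARSTOCASES step, not the SPSS syntax itself; the two records and the function name to_long are invented, and the argument names repeated and keep only loosely echo the MAKE and KEEP commands.

```python
# Hypothetical wide-format data: one record per case.
wide = [
    {"id1": 1, "age": 61, "MDP3": 0, "COPE0": 98, "COPE1": 101, "COPE2": 97},
    {"id1": 2, "age": 58, "MDP3": 1, "COPE0": 92, "COPE1": None, "COPE2": 95},
]


def to_long(rows, repeated=("COPE0", "COPE1", "COPE2"),
            keep=("id1", "age", "MDP3")):
    """Create one record per case per assessment.

    Variables in `keep` are copied onto every line; the variables in
    `repeated` are stacked into a single `coping` variable.
    """
    long_rows = []
    for row in rows:
        for index, name in enumerate(repeated, start=1):
            record = {key: row[key] for key in keep}
            record["time"] = index - 1  # recenter the index 1, 2, 3 to 0, 1, 2
            record["coping"] = row[name]
            long_rows.append(record)
    return long_rows


long_data = to_long(wide)
for record in long_data:
    print(record)
```

Each case contributes three lines, and a missing assessment simply appears as a line with no coping score, which an LMM can skip without discarding the rest of the case.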
Linear mixed models were conducted to evaluate if each missing data pattern predicts the outcome variable (coping) or interacts with time to predict changes in the outcome variable over time (Figure 5). The variable time in the model examines the effect of time on the outcome; it must be a predictor because it is included in the interaction between the missing data pattern and time (Table 2). The dependent variable is patient coping (coping), predictors (MDP1, MDP2, or MDP3) are the missing data patterns, and time is the time of assessment (0, 1, and 2 for baseline, 1 year, and 2 years, respectively). Pattern 1 (data missing at 2-year follow-up) did not predict patient coping (p = .245). There was no significant interaction between Pattern 1 and time (p = .833).
In the analysis of Pattern 2, the pattern was not a significant predictor of patient coping (p = .578) and time was also not significant (p = .395) when this pattern was included in the model. The interaction of Pattern 2 with time was not calculated but was included in the model as an example. An interaction with time is redundant in this case because Pattern 2 indicates data missing at the 1- and 2-year follow-ups. Therefore, there can be no trajectory of change over time for patient coping for cases with Pattern 2. In contrast, time significantly predicted the outcome (p = .021) and there was significant interaction between Pattern 3 and time (p = .029), indicating that having data missing at 1 year on patient coping predicted change in patient coping over time. Pattern 3 did not predict patient coping (p = .244).
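For readers working outside SPSS, a model of this general form can be sketched with the statsmodels library in Python. This is only an approximation under stated assumptions: the data are synthetic (not the PRHAT data), a random intercept per case is assumed because the random-effects specification in Figure 5 is not reproduced here, and the effect sizes are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic long-format data: 80 cases with 3 planned assessments each.
n_cases = 80
ids = np.repeat(np.arange(n_cases), 3)
time = np.tile([0, 1, 2], n_cases)
mdp3 = np.repeat(rng.integers(0, 2, n_cases), 3)       # constant within case
case_effect = np.repeat(rng.normal(0, 5, n_cases), 3)  # random intercept
coping = 95 + case_effect - 2.0 * time * mdp3 + rng.normal(0, 3, n_cases * 3)

long_data = pd.DataFrame(
    {"id1": ids, "time": time, "MDP3": mdp3, "coping": coping}
)

# Cases with Pattern 3 have no 1-year score, so those lines are removed.
long_data = long_data[~((long_data["MDP3"] == 1) & (long_data["time"] == 1))]

# Fixed effects: time, MDP3, and their interaction; random intercept by case.
model = smf.mixedlm("coping ~ time * MDP3", data=long_data,
                    groups=long_data["id1"])
result = model.fit()
print(result.summary())
```

Note that with a numeric 0/1 predictor, statsmodels treats MDP3 = 0 as the baseline, so the signs of the MDP3 and interaction estimates are reversed relative to an SPSS run in which MDP3 = 1 is the reference group.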
Because there was a significant interaction between Pattern 3 and time, estimates of the outcome variable with Pattern 3 present or absent were calculated. Outcomes were calculated based on the betas obtained from the estimates of the LMM. Estimates of the outcome variable, patient coping, were calculated for a case with Pattern 3 being present and absent at each time. Note that in SPSS, MDP3 = 1.00 was used as the reference group. Estimated coping scores were calculated by substituting the numbers 0 (time = 0), 1 (time = 1), or 2 (time = 2) into the equation for time and 0 (MDP3 = 1) or 1 (MDP3 = 0) into the equation for the presence of Pattern 3.
The SPSS output from the LMM evaluating the contributions of Pattern 3, time, and the interaction of these two variables to patient coping that was used to calculate the estimated scores is shown in Table 3. In the SPSS output for LMMs, estimate is equivalent to beta in general linear regression. The significant interaction between Pattern 3 and time is graphed in Figure 6-1.
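The equation to estimate patient coping is

$$\hat{y} = \text{constant} + \beta_{\text{MDP3}}\,d + \beta_{\text{time}}\,t + \beta_{\text{time}\times\text{MDP3}}\,(t \times d),$$

where t is time (0, 1, or 2) and d is the value substituted for the presence of Pattern 3 (d = 0 for MDP3 = 1, the reference group; d = 1 for MDP3 = 0), with the estimates taken from Table 3. The equation for cases without Pattern 3 (MDP3 = 0, d = 1) is as follows:

$$\hat{y} = (\text{constant} + \beta_{\text{MDP3}}) + (\beta_{\text{time}} + \beta_{\text{time}\times\text{MDP3}})\,t$$

The equation for cases with Pattern 3 (MDP3 = 1, d = 0) is as follows:

$$\hat{y} = \text{constant} + \beta_{\text{time}}\,t$$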
The slope of change in estimated patient coping scores differed according to whether or not Pattern 3 was present. The slope for cases without Pattern 3 was more positive than the slope for cases with Pattern 3. Coping of patients without missing data did not change over 2 years, whereas coping of patients with missing data started higher but decreased over the 2 years (Figure 6-1). The presence of Pattern 3 is informative and MNAR.
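The substitution of time and pattern values into the estimating equation can be sketched as a short function. The coefficients below are made-up numbers for illustration only and are not the Table 3 estimates; the function name estimated_coping and its parameter names are likewise hypothetical. The coding follows the text, with MDP3 = 1 as the reference group.

```python
def estimated_coping(time, mdp3, constant, b_mdp3, b_time, b_interaction):
    """Estimated score when MDP3 = 1 is the reference group.

    The value substituted for the pattern is 0 for cases with Pattern 3
    (MDP3 = 1) and 1 for cases without it (MDP3 = 0).
    """
    d = 0 if mdp3 == 1 else 1
    return constant + b_mdp3 * d + b_time * time + b_interaction * time * d


# Made-up coefficients for illustration; NOT the estimates from Table 3.
coefs = dict(constant=100.0, b_mdp3=-4.0, b_time=-3.0, b_interaction=3.0)

trajectories = {
    mdp3: [estimated_coping(t, mdp3, **coefs) for t in (0, 1, 2)]
    for mdp3 in (0, 1)
}

for mdp3, scores in trajectories.items():
    print(f"MDP3 = {mdp3}: {scores}")
```

With these invented values, cases without Pattern 3 stay flat while cases with Pattern 3 start higher and decline, which is the qualitative shape described above.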
Because there was a difference in estimated changes over time in patient coping scores between those with and without Pattern 3, analyses of hypotheses should include examination of the contribution of Pattern 3 for patient coping when conducting analyses of the contribution of coping to other outcome variables. In a longitudinal study, time is an important predictor because it captures the change in an outcome variable. As an example, Pattern 3 is included as a predictor in a study using LMMs to examine the contribution of coping to changes in spouse social support amount over time. To achieve normality of the outcome variable, spouse social support amount was square-root transformed.
The syntax and the outcome of the analyses to evaluate the effect of patient coping on changes in spouse social support amount while controlling for and examining the contribution of Pattern 3 are shown in Figure 7 and Table 4. There was a significant interaction between Pattern 3 and time (p = .008). For individuals with Pattern 3 for patient coping, spouse social support amount increased over time. For those without the pattern, spouse social support amount decreased over time (Figure 8). The interaction of patient coping and Pattern 3 did not predict spouse social support amount (p = .696). Patient coping did not predict spouse social support (p = .938) or change in social support over time (p = .162).
The pattern mixture model as described is applicable only to longitudinal studies in which there are defined waves of data. Each case is expected to be measured in specific time intervals. In applications wherein there are multiple unspecified numbers of observations, the presence or absence of data over periods can be coded as present or absent and the analysis can be conducted in a similar manner to that described above. For example, whether a case had data present at all for Months 1–6 could be coded as data present or absent in Wave 1, and then whether the same case had data present at all for Months 7–12 could be coded as data present or absent in Wave 2. In studies with little missing data, coding complete versus not complete data could be appropriate (Hedeker & Gibbons, 1997).
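The recoding of irregular observations into waves can be sketched as follows. The visit records, the wave boundaries, and the function name wave_presence are assumptions for illustration; only the idea of coding Months 1–6 as Wave 1 and Months 7–12 as Wave 2 comes from the text.

```python
# Hypothetical visit records: (case id, month of assessment).
visits = [(1, 2), (1, 5), (1, 9), (2, 3), (3, 8), (3, 11)]

# Assumed windows: Months 1-6 form Wave 1 and Months 7-12 form Wave 2.
waves = {"wave1": range(1, 7), "wave2": range(7, 13)}


def wave_presence(visits, waves):
    """Code each case as 1 (data present) or 0 (data absent) in each wave."""
    case_ids = sorted({case_id for case_id, _ in visits})
    presence = {case_id: {name: 0 for name in waves} for case_id in case_ids}
    for case_id, month in visits:
        for name, months in waves.items():
            if month in months:
                presence[case_id][name] = 1
    return presence


presence = wave_presence(visits, waves)
for case_id, coded in presence.items():
    print(case_id, coded)
```

The resulting 0/1 codes play the same role as the time-point missingness indicators in the wave-based example, so missing data patterns can be derived from them in the same way. A case with no visits at all would need to be added from the enrollment list, since it never appears in the visit records.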
The usefulness of the pattern mixture approach to missing data depends on the nature of the data from the individual study. If there are few missing data, using this method is unlikely to discover data that are MNAR, and even a few cases with data MNAR are unlikely to have a significant impact on the results of the study. The more missing data that are in the study, the more likely the patterns of missing data will be detected to be MNAR. Use of pattern mixture models in LMMs can lead to concerns about power because they involve inclusion of additional predictors in the model. Additional predictors generally require increases in numbers of subjects for equivalent power because of testing for the significance of more variables than were previously in the equation. For example, instead of testing for the contribution of one predictor variable, the contributions of the predictor variable, missing data pattern, and the interaction between the predictor variable and missing data pattern were being tested. For this reason, it is important to check power when including missing data patterns in the analyses before concluding that missing data patterns are not predictors of outcomes.
A longitudinal design is the most appropriate research method for examining changes in health indicators over time. Longitudinal data sets are more likely to be unbalanced and incomplete. The R-ANOVA does not allow inclusion of cases with missing data, but excluding these cases can bias the study findings. The best method available to assess longitudinal changes in data measured on the interval or ratio level is LMM, which allows inclusion of more participants for analysis and produces increased power. It also takes into account intercorrelations among successive measurements. This analysis reduces the biases introduced through inclusion of data only from complete cases, using multiple imputation, or ignoring correlations between measures within cases. Linear mixed models analysis outcomes can be biased if missing data patterns are informative. Pattern mixture models provide a method to assess the informativeness of missing data in longitudinal data analysis (Hedeker & Gibbons, 1997). Missing data patterns can be incorporated as fixed effects into LMMs to control for and evaluate the effects of informative missingness on outcomes.
Atkins D. C. (2005). Using multilevel models to analyze couple and family treatment data: Basic and advanced issues. Journal of Family Psychology, 19, 98–110. doi:10.1037/0893-3200.19.1.98.
Friedman L. M., Furberg C. D., DeMets D. L. (2010). Fundamentals of clinical trials (4th ed.). New York, NY: Springer.
Hartman J. M., Forsen J. W. Jr, Wallace M. S., Neely J. G. (2002). Tutorials in clinical research. Part IV: Recognizing and controlling bias. The Laryngoscope, 112, 23–31. doi:10.1097/00005537-200201000-00005.
Heck R. H., Thomas S. L., Tabata L. N. (2010). Multilevel and longitudinal modeling with IBM SPSS. New York, NY: Routledge.
Hedeker D., Gibbons R. (1997). Application of random-effects pattern mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64–78.
Hogan J. W., Laird N. M. (1997). Model-based approaches to analysing incomplete longitudinal and failure time data. Statistics in Medicine, 16, 259–272.
Little R. (2008). Selection and pattern mixture models. In Verbeke G., Davidian M., Fitzmaurice G., Molenberghs G. (Eds.), Longitudinal data analysis (pp. 409–431). Boca Raton, FL: CRC Press.
Little R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121.
Little R. J. A., Rubin D. B. (1987). Statistical analysis with missing data. New York, NY: Wiley.
Little R. J. A., Wang Y. (1996). Pattern-mixture models for multivariate incomplete data with covariates. Biometrics, 52, 98–111.
Morales-Asencio J. M., Morilla-Herrera J. C., Martin-Santos F. J., Gonzalo-Jimenez E., Cuevas-Fernandez-Gallego M., Bonill de Las Nieves C., Rivas-Campos A. (2009). The association between nursing diagnoses, resource utilisation and patient and caregiver outcomes in a nurse-led home care service: Longitudinal study. International Journal of Nursing Studies, 46, 189–196. doi:10.1016/j.ijnurstu.2008.09.011.
Norusis M. J. (2006). SPSS 15.0 statistical procedures companion. Upper Saddle River, NJ: Prentice Hall.
Paddock S., Edelen M., Wenzel S., Ebener P., Mandell W. (2006). Pattern-mixture models for addressing nonignorable nonresponse in longitudinal substance abuse treatment studies. Santa Monica, CA: RAND Corporation.
Pauler D. K., McCoy S., Moinpour C. (2003). Pattern mixture models for longitudinal quality of life studies in advanced stage disease. Statistics in Medicine, 22, 795–809. doi:10.1002/sim.1397.
Post W. J., Buijs C., Stolk R. P., de Vries E. G., le Cessie S. (2010). The analysis of longitudinal quality of life measures with informative drop-out: A pattern mixture approach. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, 19, 137–148. doi:10.1007/s11136-009-9564-1.
Puma M. J., Olsen R. B., Bell S. H., Price C. (2009). What to do when data are missing in group randomized controlled trials. NCEE 2009-0049. Jessup, MD: National Center for Education Evaluation and Regional Assistance.
Raghunathan T. E. (2004). What do we do with missing data? Some options for analysis of incomplete data. Annual Review of Public Health, 25, 99–117. doi:10.1146/annurev.publhealth.25.102802.124410.
Rubin D. B. (1976). Inference and missing data. Biometrika, 63, 581.
Singer J. D., Willett J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York, NY: Oxford University Press.
Thomas S. A., Friedmann E., Lee H. J., Son H., Morton P. G., Investigators. (2011). Changes in anxiety and depression over 2 years in medically stable patients after myocardial infarction and their spouses in the Home Automatic External Defibrillator Trial (HAT): A longitudinal observational study. Heart, 97, 371–381. doi:10.1136/hrt.2009.184119.
Verbeek H., van Rossum E., Zwakhalen S. M., Ambergen T., Kempen G. I., Hamers J. P. (2009). The effects of small-scale, homelike facilities for older people with dementia on residents, family caregivers and staff: Design of a longitudinal, quasi-experimental study. BMC Geriatrics, 9, 3. doi:10.1186/1471-2318-9-3.
White I. R., Carlin J. B. (2010). Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine, 29, 2920–2931. doi:10.1002/sim.3944.
Keywords: missing data; pattern mixture models; informative missingness; SPSS
© 2012 Lippincott Williams & Wilkins, Inc.