# Guided Multiple Imputation of Missing Data: Using a Subsample to Strengthen the Missing-at-Random Assumption

Multiple imputation can be a good solution to handling missing data if data are missing at random. However, this assumption is often difficult to verify. We describe an application of multiple imputation that makes this assumption plausible. This procedure requires contacting a random sample of subjects with incomplete data to fill in the missing information, and then adjusting the imputation model to incorporate the new data. Simulations with missing data that were decidedly not missing at random showed, as expected, that the method restored the original beta coefficients, whereas other methods of dealing with missing data failed. Using a dataset with real missing data, we found that different approaches to imputation produced moderately different results. Simulations suggest that filling in 10% of data that was initially missing is sufficient for imputation in many epidemiologic applications, and should produce approximately unbiased results, provided there is a high response on follow-up from the subsample of those with some originally missing data. This response can probably be achieved if this data collection is planned as an initial approach to dealing with the missing data, rather than at later stages, after further attempts that leave only data that is very difficult to complete.

From Loma Linda University, Loma Linda, California.

Submitted 19 December 2005; accepted 2 October 2006.

Supported by NIH/NCI Grant 5 R01 CA094594 07/01/2001-06/30/2007.

Correspondence: Gary Fraser, 24785 Stewart Street, Room 203, Loma Linda, CA 92350. E-mail: gfraser@llu.edu

Missing data can be a serious problem in multivariate epidemiologic analysis. There are several methods for handling this problem, some of which have the potential to provide biased results. The most common approach is to exclude cases with missing data from the analysis. This may mean that a sizeable proportion of the population is lost, with the attendant reduction in statistical power. Moreover if the missing data are nonrandom the results will be nonrepresentative.

Another method is, for each main effect variable, to model a binary dummy variable equal to one if the value for that variable is missing, and zero if it is present.^{1} Unfortunately, in multivariate work this approach typically produces biased results.^{2,3} Others have assumed that subjects who, for instance, skip a dietary question do so because they do not eat that food; a zero is then imputed. However, such an approach may not be warranted.^{4}

Multiple imputation is a strategy that uses observed data to impute missing data, ideally when data are “missing at random.” This term designates a missingness pattern such that the probability of a data point being missing depends only on the data that are observed.^{5} The target analysis can then proceed incorporating both originally observed and imputed data to produce unbiased results. The standard error of such an estimate is underestimated, however, unless further action is taken. Combining results of target analyses from several such partially imputed data sets,^{6,7} will, when data are missing at random, produce unbiased coefficient estimates with correct standard errors. The standard errors are of course greater than those that would result from an analysis of an initially complete data set.

A central problem is whether the missing data are in fact missing at random. For instance, if the means of nonmissing values (conditional on covariates) differ from those of the true values of the data of the same variables, these missing data are not missing at random. Yet, it is often difficult to be certain just what has caused the missing data, and on which variables the probability of “missing” may depend.

In this paper, we slightly extend a survey strategy of obtaining more information about missing survey subjects^{8,9} and note the natural application to epidemiologic studies in which some study members have provided incomplete data. This approach strengthens the claim that the missing data are missing at random. The goal is to fill in a random subsample of the initially missing data and adjust the imputation model using this extra information as a guide. Such a subsample survey can usually be completed more successfully among epidemiologic study participants than among nonresponders to a survey.

## METHODS

Consider a data set *Y* of J variables on each of N subjects *i* = 1,…, *N*. Y includes both outcome and predictor variables. For each of the J variables, *V* (*j*), there is a number *d* (*j*), *j* = 1,…, J, that are missing. Divide the subjects into those who have missing data on at least one variable, set *S*(1), and those with no missing data, set *S*(2). Take a random sample (*R*) of subjects in set *S*(1) and expend the necessary resources to contact a high proportion of subjects in this sample to fill in their missing data. All subjects in R with missing data for any combination of variables that includes missingness for a specific variable, say *V* (*j*), will also be a random sample of those in Y with missing data for that variable.

For each variable *V* (*j*) that has a nonzero *d* (*j*), create dummy variables *M* (*i,j*), which takes values 1 to represent original missingness, otherwise zero. Thus, subjects in set R with originally missing data for *V* (*j*) will have a value of 1 for *M*(*i,j*), but will also have a filled-in value for *V* (*j*). Other subjects in Y but not R with missing data for *V* (*j*) will also have a value of one for *M* (*i,j*) but their *V* (*j*) remains missing.

Then *pr* (*M*|*Y*,θ) = *pr* (*M*|*Y*(*obs*), *Y*(*miss*),θ) = *pr* (*M*|*Y*(*obs*), *Y*(*R*), ψ,θ) where M is a missingness indicator set to 1 to indicate initial missing status, otherwise 0; *Y* (*obs*) is that part of Y that is observed; *Y* (*miss*) is that part of Y that was originally missing; *Y* (*R*) is the part of *Y* (*miss*) that, although originally missing, is now filled in; M is the matrix of *M* (*i,j*); θ is the vector of parameters; and ψ is the sampling proportion, R/S(1), for the subsample. This probability no longer depends systematically on the part of Y(*miss*) that continues to be missing, as it is properly represented by *Y*(*R*), a random sample of *Y* (*miss*). By definition, the remaining missing data are missing at random, assuming correct specification of the likelihood.

After assuming that parameters of the distributions of M and Y have distinct spaces, the imputation can proceed considering only the likelihood conditioned on observed data to estimate parameters.^{10} In the applications that follow, covariates used to predict missing data are modeled as a linear sum, h, that includes the categorical M variable. Missing data for continuous variable V(j), for instance, is imputed by a draw from the probability or density function *f*(h), where

is set to 1; and *f* depends on the form of the likelihood function for the data.^{11} The *Vik* may be continuous or categorical in form, and should include the dependent variable of the final regression of interest.

Where the missing *Vj* is a categorical variable, the imputation is by a draw from the probability distribution that includes a log-linear component, and a component incorporating a design matrix that models a link between the categorical and continuous variables. We chose to include product terms between all categorical variables (including the M variables) in the log-linear component. Thus the imputation is guided both by data initially present and data filled in later in the random subsample.

Further details are found in chapter 9 in the book by Schafer.^{12} We implemented his mixed model with code available free on the world wide web.^{12} Continuous variables are assumed to be normally distributed.

The linear sum, h, implies that imputed values of *Vj*, conditional on covariates, differ systematically from those observed by only a different constant, γ*j*, for each variable. More complicated scenarios, such as different scaling coefficients among nonresponders for some variables, could be modeled. However, in an epidemiologic study, often it will be reasonable to assume that those who are missing some data (usually a high percentage of subjects) do not differ systematically from others in the relationships between variables.

The imputation is performed Q times to provide Q plausible complete data sets, and the desired analysis is run on each. Then the Q sets of coefficients and variance estimators are combined as described by Little and Rubin^{13} to provide unbiased estimates of the variances of coefficients. In the examples below we set Q = 5, which should be sufficient to provide stable variance estimates.^{13}

The Adventist Health Study is a large cohort study that aims to enroll more than 100,000 subjects. As the major hypotheses are dietary, missing data are a significant problem, even when each food has only a small proportion of missing values. This is problematic because not only are analytic models often highly multivariate, but individual variables (eg, nutrient indices) will usually be the weighted sum of information from many foods. The food frequency questionnaire contains approximately 130 hard-coded food items and allows up to 50 write-ins. This questionnaire was based on extensive pilot work in this population. Each item has a given standard portion size, but allows subjects to nominate a larger portion (≥1.5 standard) or a smaller portion (≤0.5 standard). There are 7–9 frequency categories for each food.

As a test situation, we identified a sequential sample of the first 24,000 questionnaires and a set of key variables that include the 4 most influential foods in each of 18 indices for vitamins, nutrients, and minerals. These indices had been constructed in a pilot database (320 subjects), using cross-validation techniques^{14} to select the influential foods.

Within 6 months of receipt of the questionnaires, we attempted to contact a random 20% of all subjects who had one or more of the key variables missing. This contact was by telephone. Such contacts were possible for 90–95% of the study members, depending on the variable. The resulting filled-in data are used in the third illustrative model described below.

The first linear regression analysis has log(BMI) as the dependent variable and the following independent variables: meat (4 categories), nuts (5 categories), exercise (5 categories), age and sex. “Meat” and “nuts” are composite variables, each the sum of several questions about meats and nuts. The frequency data from each of these component questions were summed using midpoint values of categories, and then the composite variable was categorized.

The second model has the same dependent variable, but with the following independent variables: energy-adjusted folate (*ea-folate*), energy-adjusted polyunsaturated fatty acids (*ea-PUFA*), exercise (5 categories), age, and sex. Energy-adjustment used the residual method,^{15} and missing residuals were imputed.

To demonstrate that the method works, 2 quite artificial situations were set up (Tables 1 and 2), which did, however, start with real data. There were 20,126 subjects who had no missing data for the variables in these 2 models. This is the reference population for a complete data analysis and the parent population in subsequent analyses.

The imputation software^{12} required that continuous variables be normally distributed. However, log(BMI) was not normal, with skewness of 0.50 and standardized kurtosis of 0.47 (both equal 0 in a normal distribution). Thus, a new set of normally distributed body mass index (BMI) values was substituted. These were calculated from the regression E(log[BMI]| independent variables in the model) + e. The expected values were based on regression coefficients, where e′ is the error and e∼ N(0, var(e′)). Next, an artificial missing data mechanism that is decidedly not missing at random is imposed. For the first analysis (Table 1), all 2823 subjects in the higher meat categories 3 and 4 who had BMI values greater than or equal to 27 had their meat data converted to missing (15% of the total). Thus the association between BMI and meat in remaining subjects with observed values of 3 and 4 for meat will poorly predict meat consumption in those with missing data.

In Table 2, we use similar methods for the second model where the 2 energy-adjusted nutrients *ea-folate* and *ea-PUFA*, rather than a food, have missing data. Again, the data were transformed so that log(BMI), *ea-folate* and *ea-PUFA* are multivariate normal conditional on the covariates age, sex and exercise. Values of error variances and covariances observed in the non-normal data were used as parameters in the multivariate normal distribution.

Artificial missing data mechanisms are imposed on *ea-folate* and *ea-PUFA*, resulting in 21% missing for *ea-folate* and 24% missing for *ea-PUFA*. These were

- for
*ea-folate*, pr(missing|eafolate > mean (*ea-folate*), BMI ≥ 27) = 1, otherwise 0; - for
*ea-PUFA*, pr(missing|*ea-PUFA*> mean (*ea-PUFA*))= K, otherwise 0; where K = −0.21 + 0.0071 * age, the coefficient chosen so K = 0 at age 30 and 0.50 at age 70 years.

The mechanisms are not missing at random because for both variables pr(missing) depends on the value of data that may be missing, specifically the requirement that missing values be greater than the mean for that variable.

Finally, in a third model we use the actually observed and naturally missing data from all 2752 black subjects within this database. Indicators of missing data are constructed for consumption of nuts and meat. The frequency of originally missing data for nut variables is 12%, and for meat variables is 9%. Log(BMI) is the dependent variable of the regression. The very few subjects with missing sex, age, or BMI were excluded. About 95% of the 20% sample of initially missing data has since been filled in, and was used to guide the imputation.

We performed the imputation using software available free on the internet,^{12} which uses the Monte Carlo Markov Chain method^{16} to estimate parameters of the likelihood function. Specifically, Schafer’s MIXED routine was used for this imputation, as BMI and age are continuous but some other variables are categorical.

## RESULTS

Table 1 shows the beta coefficients and their standard errors for the reference model, where BMI is transformed to normality. A second analysis simply excludes the missing data, a third imputes all missing meat data assuming missing at random (patently incorrect here), and finally the imputations are guided by subsample data. In Table 2, similar analyses are shown for the second model, where missing data for *ea-folate* and *ea-PUFA* are also decidedly not missing at random.

These results show that, as expected, simply excluding cases with missing data, or assuming that data were missing at random when they were not, produced beta coefficients for meat, and to a lesser extent, other variables, that were very biased. However, when the random new data were used to guide the imputation, even with these rather extreme missing ness mechanisms, the point estimates were almost identical to the reference results. This was true not only for the meat, *ea-folate* and *ea-PUFA* coefficients, but also for other variables with no missing data whose coefficients were sometimes distorted if subjects with missing data were simply excluded. Such a case is the coefficient for exercise 3 in model 1, which, when missing data are excluded, changes by nearly 2 standard deviations. As expected, the standard errors for *meat3, meat4, ea-folate* and *ea-PUFA* are a little wider with imputation than the reference.

Sensitivity of these results to moderate non-normality in a variable assumed to be normal (such as BMI) is shown when using the non-normal log(BMI) as dependent variable. For variables not needing imputation, there is still good agreement between the reference results and those after imputation guided by the random sample. However, the new reference coefficients (with SE) for *meat 2, meat 3*, and *meat 4* are 0.043 (0.0035), 0.076 (0.0037), and 0.102 (0.0049), whereas after imputation the results are 0.043 (0.0035), 0.082 (0.0041), 0.090 (0.0061). Thus, the meat3 and meat4 coefficients are nearly 2 standard errors removed from the reference results. Repeating the analysis several times with a different random 20% sample consistently produces similar differences. Correcting the non-normality removed these discrepancies (Table 1).

Similarly, in the second model when non-normalized data are used for log(BMI), *ea-folate, ea-PUFA*, the new reference coefficients for the last 2 variables are –0.032 (0.0044) and 0.251 (0.095), but after imputation, using the 20% filled-in missing data, the coefficients are –0.039 (0.0053) and 0.209 (0.100). These also differ moderately from the reference data, presumably due to the non-normality, as normalized data do not show these discrepancies (Table 2).

The results of Table 3 are those where the naturally missing data were imputed using the random 20% of filled-in missing data in the real subsample. There are modest changes in the beta coefficients when a missing at random assumption is made, as against the reference result where the new filled-in data are used. These changes are often 5–10% of the reference values. Imputing zeroes produced moderate changes for meat, but only small changes in the nut variables. Excluding cases with missing data produced greater changes in nut variables but only small changes in meat variables. Standard errors were usually higher in this last case as there was less nonmissing data retained in the data set.

In these analyses, we somewhat arbitrarily chose to fill in 20% of missing data. We now investigate the effects of filling in only 10% or 5% instead (Table 4). This change should not affect validity, but decreases precision. However, as the filled-in data affects only imputed values for the remaining missing subset, effects on precision will also depend on the proportions of originally-missing data. If there is very little missing data, the standard error of the coefficients will change little regardless of the proportion of missing data filled in.

We show the effects of varying the proportion of data filled in for the artificial-missing situations (the models of Tables 1 and 2) where 15–24% of data was missing, as well as for the natural missing situation of Table 3, where 8–12% of data was originally missing. Where the proportion of missing data is greater (Models 1 and 2), filling in only 5% of missing data adversely affected statistical power, increasing standard errors by about 50% for some variables. However, where the proportion missing is smaller (Model 3), the smaller changes in standard errors suggest that filling a lower proportion of the missing data (a less costly choice) would be appropriate.

## DISCUSSION

When a random sample of originally missing data is filled in and the imputation model properly adjusted, the results of a reference analysis with no missing data can be accurately restored by guided multiple imputation. This result is of course expected from statistical theory. Even so, multiple imputation has rarely been used in epidemiology, despite the pervasive problem of missing data. Instead, most studies exclude subjects with much missing data, assume that missing data in food frequency questionnaires represent zeroes, or impute mean values from those not having missing data.^{17–19} While these procedures may perform satisfactorily^{20} in some situations, there can be no assurance of that without gathering further data. In other situations, such strategies are likely to be suboptimal. It is probable that more accurate imputation especially for frequently-eaten foods may efficiently improve the validity of many nutrient indices.

Information from several nutritional data sets^{4,18,20} suggest that, on average, about half the true values of initially missing nutritional data are not zero. This is particularly so for more common foods,^{4,18} and their missing data may not always be represented well by imputing the mean or median from the data that were initially observed. Further, the data necessary to evaluate the adequacy of a particular strategy are usually exactly that which we have used to give the more accurate imputation.

When different analytic approaches are applied to a real dataset with naturally-missing data (Table 3), results show less distortion than with the contrived situations of Tables 1 and 2. This is due both to smaller proportions of missing data, and to data that are probably closer to missing at random. Even so, the biases observed are of moderate magnitude in some instances, and probably worth some effort to avoid.

Glynn et al^{8,9} demonstrate the success of multiple imputation when incorporating information from a secondary nonrespondent survey. They used data only from this survey to impute the missing data, thus mandating different scaling coefficients for the nonrespondents; only the outcome variable was assumed to have missing values. While no doubt appropriate in their application, we argue that the coefficients used for imputation do not need to differ between the “nonrespondent” and “respondent” groups in the context of an epidemiologic study. All are actually respondents who happen to omit a usually small proportion of data. If the analyst prefers to make different imputations for the 2 groups, the imputation model can be adjusted to allow product terms between the predictor variables and the missing-ness indicator. The need for these can be evaluated using the subsample data. In the data of the analysis in Table 3, there was no convincing evidence that beta coefficients differed between data initially present and that filled in later. The new data can also be used to evaluate the adequacy of an assumption of missing at random, for data that were initially missing.

In the context we describe, all subjects in the subsample have filled out a long questionnaire at least moderately well and intend to continue to participate in the study. Indeed, in our cohort study it has proven realistic to fill in about 95% of the originally missing data from a random sample of 20% of those with at least one missing key variable. Thus, the ideal of complete responses from such a random sample can be approached quite closely. This approach is in contrast to the use of a secondary sample of nonrespondents to add more subjects to a survey, perhaps using shorter surveys or incentives to prompt a response.^{21}

Ideally, the proposed application should be planned as an initial strategy to deal with missing data. If great efforts have already been made to fill missing data, completing a random sample of those still remaining will fall short by far, as these by definition are all very difficult-to-contact or poor responders. Usually, the resources to contact all subjects with missing data in a large cohort are not available. Hence, it is preferable to contact a random sample of these as an initial strategy. (Note that the method is not appropriate for missing data in 24-hour recalls or food diaries in which the missing data are specific to a time period and cannot be supplied retrospectively by a subsample.)

This more satisfying method of dealing with missing data is at some cost, as even filling a random sample of missing data will take some effort. Indeed it will usually be impossible to completely fill such data, because during the interval between original data collection and recontact some people will die, a few may develop serious illness, and others will decide not to cooperate with the subsample contact. Thus, it is acknowledged that the method will not allow extrapolation with confidence to those cohort members who correspond to (hopefully few) nonresponding subsample subjects.

How complete does the response in the random sample of missing data need to be? The proportional bias (Appendix) in the mean of subsample R when representing this by an incomplete subset, is P/(1−P)[1−E(R′)/E(R)],whereE(R′)is the mean of those in the subsample those data cannot be filled in, and P is the corresponding proportion. Thus, as expected, this proportional bias depends on P, and also on how unusual the missing portion is, reflected here by the ratio E(R′)/E(R). Most would probably be satisfied with filling in 90% or more. For instance, if P = 0.1, and even if E(R′)/E(R) is as improbably small as 0.5, the proportionate bias in the subsample is only 0.055. In fact, the imputation may also depend heavily on data not initially missing.

The size of the random sample R, for which missing data should be filled in, will depend in part on the required precision. A variable with only a small proportion of data initially missing will cause little loss in precision when using the multiple imputation procedure. For variables with more missing data it will be wise to ensure that a sufficient proportion is filled in (Table 4). Filling a tiny random percentage will produce unbiased results, but the standard errors will be wide. Calculation of the statistics r and lambda proposed by Schafer^{7} revealed similar trends. A more quantitative but somewhat heuristic approach to this question is found in the Appendix.

One could cautiously conclude that, when less than 10% are missing, it is important to impute to avoid losing data, but within reason the form of the imputation may not make a lot of difference to the outcome. Other results^{18} are consistent with this approach. Our experience in Adventist Health Study-1,^{22} and more recently in Adventist Health Study-2, as well as the experience of others,^{23} is that 3–10% of values typically are missing for particular food frequency items in an older population. The frequent use of compound indices in nutritional epidemiology exacerbates this situation. It is when the proportion of missing values is moderate or large that the effects of different imputation models on bias and precision are more influential.

Multiple imputation can handle large proportions of missing data where necessary, but assuming that data are missing at random when they are not would then be particularly hazardous. The application described here will be helpful when the missing at random assumption is not clearly warranted, and where there is at least a moderate amount of missing data.

In summary, where the proportion of missing data is modest (typical of many epidemiologic applications), and where it is not clearly missing at random, filling in a random sample of 10% of the missing may be sufficient to guide the imputation. In our experience this process does not create a large budgetary burden, and allows satisfactory bias correction without appreciable loss of power. If normality is assumed, there is some sensitivity to deviation from this assumption. Appropriate transformations, such as the Box-Cox,^{24} can be used. Alternatively, such variables could be converted to categorical forms.

## REFERENCES

*Int J Epidemiol*. 1991;20:379–383.

*Am J Epidemiol*. 1991;134:895–907.

*Am J Epidemiol*. 1995;142:1255–1264.

*Am J Epidemiol*. 2004;159(Suppl):S52.

*Biometrika*. 1976;63:581–592.

*Statistical Analysis With Missing Data*. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002.

*Analysis of Incomplete Multivariate Data*. Bury St. Edmunds, Suffolk UK, Chapman and Hall; 1997.

*J Am Stat Assoc*. 1993;88:984–993.

*Drawing Inferences From Self-Selected Samples*. New York: Springer-Verlag; 1986.

*Statistical Analysis With Missing Data*. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002;117–119.

*Statistical Analysis With Missing Data*. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002;200–201.

*Statistical Analysis With Missing Data*. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002;210–211.

*An Introduction to the Bootstrap*. New York: Chapman and Hall; 1993.

*Nutritional Epidemiology*. 2nd ed. New York: Oxford University Press; 1998.

*Markov Chain Monte Carlo in Practice*. London: Chapman and Hall; 1996.

*Int J Epidemiol*. 1997;26(Suppl 1):S59–S70.

*Nutr Cancer*. 2000;36:1–6.

*Nutritional Epidemiology*. 2nd ed. New York: Oxford University Press;1998:322.

*Epidemiology*. 1991;2:430–436.

*Epidemiol Rev*. 1995;17:192–204.

*Cancer*. 1989;64:570–581.

*Am J Epidemiol*. 1998;147:74–82.

*J Royal Stat Soc (B)*. 1964;26:211–252.

## APPENDIX

##### Proportional Bias in Subsample Data Created by Incomplete Response

For variable X, let subscripts c and ic indicate successfully completed and finally incomplete subsample data; n indicates number of subsample subjects, and P is the proportion of subsample subjects whose data is not filled-in due to nonresponse.

Then [*E*(*Xc*) − *E*(*X*)]/*E*(*X*) is the proportionate bias considering only the completed subsample data.

Dividing numerator and denominator by *n* this becomes

##### Why Both the Proportion of Initially Missing Data, and the Proportion of This That is Filled in Affects the Precision of the Final Analysis

The number of nonmissing data points is one measure of the information available to the analysis. For large numbers, *Var*() is approximately proportional to 1/[*N*(1 − *D*(1 − *L*)) − *m*], where D is the proportion initially missing, L is the proportion of these filled in later, and there are m variables in θ. Hence, *L* = *R/DN*. Obviously *Var*(θ) is smaller as L increases or as D becomes smaller. A smaller *Var*() will result in Q imputation data sets that are increasingly similar. Consequently *Var*[*E*(β)], over the Q runs, is smaller for a given D, where β is the coefficient vector of the analysis performed after the imputation. As *Var*[*E*(β)] is the quantity by which the missing data increases the variance of β when using multiple imputation,^{13} *Var*() is smaller when L is larger or D is smaller. A related observation is that with either of these trends there will also be fewer data points to impute. As these carry the influence of *Var*(), the corresponding imprecision can have less impact. Another implication is that when D is large, choosing a larger L will reduce the resulting imprecision.