The imputation software12 required that continuous variables be normally distributed. However, log(BMI) was not normal, with skewness of 0.50 and standardized kurtosis of 0.47 (both equal 0 in a normal distribution). Thus, a new set of normally distributed body mass index (BMI) values was substituted. These were calculated from the regression as E(log[BMI] | independent variables in the model) + e. The expected values were based on the regression coefficients, where e′ is the observed error and e ∼ N(0, var(e′)). Next, an artificial missing-data mechanism that is decidedly not missing at random was imposed. For the first analysis (Table 1), all 2823 subjects in the higher meat categories 3 and 4 who had BMI values of 27 or more had their meat data converted to missing (15% of the total). Thus, the association between BMI and meat among remaining subjects with observed meat values of 3 and 4 will poorly predict meat consumption in those with missing data.
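The substitution can be illustrated with a short sketch (the sample size, covariates, coefficients, and error distribution below are all hypothetical stand-ins, not the study's data): fit the regression of log(BMI) on the covariates, then replace each value with its fitted mean plus a freshly drawn normal error whose standard deviation matches the observed residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical sample size

# Stand-ins for the model covariates (purely illustrative).
age = rng.uniform(30, 80, n)
exercise = rng.integers(0, 4, n).astype(float)
X = np.column_stack([np.ones(n), age, exercise])

# A right-skewed log(BMI): linear predictor plus gamma-distributed errors.
log_bmi = X @ np.array([3.0, 0.002, -0.01]) + rng.gamma(2.0, 0.05, n)

# Fit E(log BMI | covariates), then substitute fitted mean + e,
# where e ~ N(0, var(e')) and e' are the observed residuals.
beta_hat, *_ = np.linalg.lstsq(X, log_bmi, rcond=None)
resid = log_bmi - X @ beta_hat
log_bmi_norm = X @ beta_hat + rng.normal(0.0, resid.std(), n)
```

The rebuilt values keep the same conditional mean structure, but their errors are normal by construction.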
In Table 2, we use similar methods for the second model where the 2 energy-adjusted nutrients ea-folate and ea-PUFA, rather than a food, have missing data. Again, the data were transformed so that log(BMI), ea-folate and ea-PUFA are multivariate normal conditional on the covariates age, sex and exercise. Values of error variances and covariances observed in the non-normal data were used as parameters in the multivariate normal distribution.
Artificial missing-data mechanisms were imposed on ea-folate and ea-PUFA, resulting in 21% missing for ea-folate and 24% missing for ea-PUFA. These were:
- for ea-folate, pr(missing | ea-folate > mean(ea-folate), BMI ≥ 27) = 1, otherwise 0;
- for ea-PUFA, pr(missing | ea-PUFA > mean(ea-PUFA)) = K, otherwise 0, where K = −0.21 + 0.0071 × age, the coefficients chosen so that K = 0 at age 30 and 0.50 at age 70 years.
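The two mechanisms above can be sketched as follows (the cohort size, units, and distributions are invented for illustration only; K is clipped to [0, 1] so that it is a valid probability):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000  # hypothetical cohort size

age = rng.uniform(30, 80, n)
ea_folate = rng.normal(300.0, 60.0, n)  # hypothetical units
ea_pufa = rng.normal(15.0, 4.0, n)
bmi = rng.normal(26.0, 4.0, n)

# ea-folate: deterministically missing when above its mean AND BMI >= 27.
folate_missing = (ea_folate > ea_folate.mean()) & (bmi >= 27)

# ea-PUFA: missing with probability K = -0.21 + 0.0071*age when above its
# mean, otherwise always observed.
K = np.clip(-0.21 + 0.0071 * age, 0.0, 1.0)
pufa_missing = (ea_pufa > ea_pufa.mean()) & (rng.uniform(size=n) < K)

ea_folate_obs = np.where(folate_missing, np.nan, ea_folate)
ea_pufa_obs = np.where(pufa_missing, np.nan, ea_pufa)
```

Because missingness depends on the (possibly unobserved) value of the variable itself, neither mechanism is missing at random.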
The mechanisms are not missing at random because for both variables pr(missing) depends on the value of data that may be missing, specifically the requirement that missing values be greater than the mean for that variable.
Finally, in a third model we use the actually observed and naturally missing data from all 2752 black subjects within this database. Indicators of missing data are constructed for consumption of nuts and meat. The frequency of originally missing data for nut variables is 12%, and for meat variables is 9%. Log(BMI) is the dependent variable of the regression. The very few subjects with missing sex, age, or BMI were excluded. About 95% of the 20% sample of initially missing data has since been filled in, and was used to guide the imputation.
We performed the imputation using software available free on the internet,12 which uses the Markov chain Monte Carlo method16 to estimate the parameters of the likelihood function. Specifically, Schafer’s MIXED routine was used for this imputation, as BMI and age are continuous but some other variables are categorical.
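Schafer’s routines are not reproduced here, but the data-augmentation idea underlying MCMC imputation can be caricatured in a few lines: alternately re-estimate the model parameters from the completed data (P-step) and redraw the missing values from the implied conditional distribution (I-step). This toy version plugs in point estimates rather than posterior draws for the parameters, so it is stochastic regression imputation rather than proper multiple imputation; all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical bivariate data with ~20% of y missing completely at random.
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)
miss = rng.uniform(size=n) < 0.2
obs = ~miss

y_fill = y.copy()
y_fill[miss] = y[obs].mean()  # crude starting fill
for _ in range(50):
    # P-step: re-estimate the regression of y on x from the completed data.
    b = np.cov(x, y_fill)[0, 1] / x.var()
    a = y_fill.mean() - b * x.mean()
    s = (y_fill - a - b * x).std()
    # I-step: redraw the missing y from the conditional distribution y | x.
    y_fill[miss] = a + b * x[miss] + rng.normal(0, s, miss.sum())
```

Observed values are never altered; only the missing entries are repeatedly redrawn as the parameter estimates stabilize.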
Table 1 shows the beta coefficients and their standard errors for the reference model, where BMI is transformed to normality. A second analysis simply excludes subjects with missing data; a third imputes all missing meat data assuming they are missing at random (patently incorrect here); finally, the imputations are guided by the subsample data. In Table 2, similar analyses are shown for the second model, where the missing data for ea-folate and ea-PUFA are also decidedly not missing at random.
These results show that, as expected, simply excluding cases with missing data, or assuming that data were missing at random when they were not, produced beta coefficients for meat (and, to a lesser extent, other variables) that were severely biased. However, when the random new data were used to guide the imputation, even with these rather extreme missingness mechanisms, the point estimates were almost identical to the reference results. This was true not only for the meat, ea-folate, and ea-PUFA coefficients, but also for variables with no missing data, whose coefficients were sometimes distorted when subjects with missing data were simply excluded. One such case is the coefficient for exercise 3 in model 1, which changes by nearly 2 standard errors when missing data are excluded. As expected, the standard errors for meat3, meat4, ea-folate, and ea-PUFA are somewhat larger with imputation than in the reference analysis.
Sensitivity of these results to moderate non-normality in a variable assumed to be normal (such as BMI) is shown by using the non-normal log(BMI) as the dependent variable. For variables not needing imputation, there is still good agreement between the reference results and those after imputation guided by the random sample. However, the new reference coefficients (with SE) for meat2, meat3, and meat4 are 0.043 (0.0035), 0.076 (0.0037), and 0.102 (0.0049), whereas after imputation the results are 0.043 (0.0035), 0.082 (0.0041), and 0.090 (0.0061). Thus, the meat3 and meat4 coefficients are nearly 2 standard errors removed from the reference results. Repeating the analysis several times with a different random 20% sample consistently produced similar differences. Correcting the non-normality removed these discrepancies (Table 1).
Similarly, in the second model when non-normalized data are used for log(BMI), ea-folate, ea-PUFA, the new reference coefficients for the last 2 variables are –0.032 (0.0044) and 0.251 (0.095), but after imputation, using the 20% filled-in missing data, the coefficients are –0.039 (0.0053) and 0.209 (0.100). These also differ moderately from the reference data, presumably due to the non-normality, as normalized data do not show these discrepancies (Table 2).
Table 3 shows the results where the naturally missing data were imputed using the random 20% of filled-in missing data in the real subsample. There are modest changes in the beta coefficients when a missing at random assumption is made, as against the reference result where the new filled-in data are used. These changes are often 5–10% of the reference values. Imputing zeroes produced moderate changes for meat, but only small changes in the nut variables. Excluding cases with missing data produced greater changes in nut variables but only small changes in meat variables. Standard errors were usually higher in this last case, as fewer nonmissing data were retained in the data set.
In these analyses, we somewhat arbitrarily chose to fill in 20% of missing data. We now investigate the effects of filling in only 10% or 5% instead (Table 4). This change should not affect validity, but it decreases precision. However, as the filled-in data affect only the imputed values for the remaining missing subset, effects on precision will also depend on the proportion of originally missing data. If there is very little missing data, the standard errors of the coefficients will change little regardless of the proportion of missing data filled in.
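The Appendix gives the heuristic Var(θ) ∝ 1/[N(1 − D(1 − L)) − m], which can be evaluated directly to see the direction of these effects. The values of N, D, and m below are arbitrary illustrations, and the heuristic only approximates the information retained; the realized standard-error changes in Table 4 also reflect between-imputation variance, so they can be larger than this calculation suggests.

```python
# Heuristic from the Appendix: Var(theta) is roughly proportional to
# 1/[N(1 - D(1 - L)) - m], where D is the proportion initially missing
# and L is the proportion of those values later filled in.
def rel_var(N, D, L, m):
    return 1.0 / (N * (1.0 - D * (1.0 - L)) - m)

N, D, m = 10_000, 0.20, 10  # hypothetical cohort with 20% missing
for L in (0.05, 0.10, 0.20):
    # Variance inflation relative to a complete-data analysis (D = 0).
    print(L, rel_var(N, D, L, m) / rel_var(N, 0.0, 1.0, m))
```

As expected, the variance inflation shrinks monotonically as the filled-in proportion L grows.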
We show the effects of varying the proportion of data filled in for the artificial-missing situations (the models of Tables 1 and 2) where 15–24% of data was missing, as well as for the natural missing situation of Table 3, where 8–12% of data was originally missing. Where the proportion of missing data is greater (Models 1 and 2), filling in only 5% of missing data adversely affected statistical power, increasing standard errors by about 50% for some variables. However, where the proportion missing is smaller (Model 3), the smaller changes in standard errors suggest that filling a lower proportion of the missing data (a less costly choice) would be appropriate.
When a random sample of originally missing data is filled in and the imputation model properly adjusted, the results of a reference analysis with no missing data can be accurately restored by guided multiple imputation. This result is of course expected from statistical theory. Even so, multiple imputation has rarely been used in epidemiology, despite the pervasive problem of missing data. Instead, most studies exclude subjects with much missing data, assume that missing data in food frequency questionnaires represent zeroes, or impute mean values from those without missing data.17–19 While these procedures may perform satisfactorily20 in some situations, there can be no assurance of that without gathering further data. In other situations, such strategies are likely to be suboptimal. More accurate imputation, especially for frequently eaten foods, may efficiently improve the validity of many nutrient indices.
Information from several nutritional data sets4,18,20 suggests that, on average, about half the true values of initially missing nutritional data are not zero. This is particularly so for more common foods,4,18 and their missing data may not always be represented well by imputing the mean or median from the data initially observed. Further, the data necessary to evaluate the adequacy of a particular strategy are usually exactly the data we have used to produce the more accurate imputation.
When different analytic approaches are applied to a real dataset with naturally-missing data (Table 3), results show less distortion than with the contrived situations of Tables 1 and 2. This is due both to smaller proportions of missing data, and to data that are probably closer to missing at random. Even so, the biases observed are of moderate magnitude in some instances, and probably worth some effort to avoid.
Glynn et al8,9 demonstrate the success of multiple imputation when incorporating information from a secondary nonrespondent survey. They used data only from this survey to impute the missing data, thus mandating different scaling coefficients for the nonrespondents; only the outcome variable was assumed to have missing values. While no doubt appropriate in their application, we argue that the coefficients used for imputation need not differ between the “nonrespondent” and “respondent” groups in the context of an epidemiologic study: all are actually respondents who happen to omit a usually small proportion of data. If the analyst prefers to make different imputations for the 2 groups, the imputation model can be adjusted to allow product terms between the predictor variables and the missingness indicator. The need for these can be evaluated using the subsample data. In the data of the analysis in Table 3, there was no convincing evidence that beta coefficients differed between data initially present and data filled in later. The new data can also be used to evaluate the adequacy of an assumption of missing at random for data that were initially missing.
In the context we describe, all subjects in the subsample have filled out a long questionnaire at least moderately well and intend to continue to participate in the study. Indeed, in our cohort study it has proven realistic to fill in about 95% of the originally missing data from a random sample of 20% of those with at least one missing key variable. Thus, the ideal of complete responses from such a random sample can be approached quite closely. This approach is in contrast to the use of a secondary sample of nonrespondents to add more subjects to a survey, perhaps using shorter surveys or incentives to prompt a response.21
Ideally, the proposed application should be planned as an initial strategy to deal with missing data. If great efforts have already been made to fill missing data, completing a random sample of those still remaining will fall short by far, as these by definition are all very difficult-to-contact or poor responders. Usually, the resources to contact all subjects with missing data in a large cohort are not available. Hence, it is preferable to contact a random sample of these as an initial strategy. (Note that the method is not appropriate for missing data in 24-hour recalls or food diaries in which the missing data are specific to a time period and cannot be supplied retrospectively by a subsample.)
This more satisfying method of dealing with missing data comes at some cost, as even filling in a random sample of missing data will take some effort. Indeed, it will usually be impossible to fill in such data completely, because during the interval between original data collection and recontact some people will die, a few may develop serious illness, and others will decide not to cooperate with the subsample contact. Thus, it is acknowledged that the method does not allow confident extrapolation to those cohort members who correspond to the (hopefully few) nonresponding subsample subjects.
How complete does the response in the random sample of missing data need to be? The proportional bias (Appendix) in the mean of subsample R, when representing it by an incomplete subset, is [P/(1 − P)][1 − E(R′)/E(R)], where E(R′) is the mean for those in the subsample whose data cannot be filled in, and P is the corresponding proportion. Thus, as expected, this proportional bias depends on P, and on how unusual the missing portion is, reflected here by the ratio E(R′)/E(R). Most would probably be satisfied with filling in 90% or more. For instance, if P = 0.1, then even if E(R′)/E(R) is as improbably small as 0.5, the proportionate bias in the subsample mean is only 0.055. In fact, the imputation may also depend heavily on data not initially missing.
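The worked example is easy to check numerically; the function below is a direct transcription of the Appendix formula and reproduces the quoted figure of about 0.055.

```python
def proportional_bias(P, ratio):
    """Appendix formula: bias = [P/(1-P)] * [1 - E(R')/E(R)],
    where P is the nonresponding proportion of the subsample and
    ratio is E(R')/E(R)."""
    return P / (1.0 - P) * (1.0 - ratio)

# Worked example from the text: P = 0.1 and E(R')/E(R) = 0.5.
bias = proportional_bias(0.1, 0.5)  # = 1/18, i.e. about 0.055
```

With full response (P = 0) the bias vanishes, as the formula requires.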
The size of the random sample R, for which missing data should be filled in, will depend in part on the required precision. A variable with only a small proportion of data initially missing will cause little loss in precision when using the multiple imputation procedure. For variables with more missing data it will be wise to ensure that a sufficient proportion is filled in (Table 4). Filling a tiny random percentage will produce unbiased results, but the standard errors will be wide. Calculation of the statistics r and lambda proposed by Schafer7 revealed similar trends. A more quantitative but somewhat heuristic approach to this question is found in the Appendix.
One could cautiously conclude that, when less than 10% are missing, it is important to impute to avoid losing data, but within reason the form of the imputation may not make a lot of difference to the outcome. Other results18 are consistent with this approach. Our experience in Adventist Health Study-1,22 and more recently in Adventist Health Study-2, as well as the experience of others,23 is that 3–10% of values typically are missing for particular food frequency items in an older population. The frequent use of compound indices in nutritional epidemiology exacerbates this situation. It is when the proportion of missing values is moderate or large that the effects of different imputation models on bias and precision are more influential.
Multiple imputation can handle large proportions of missing data where necessary, but assuming that data are missing at random when they are not would then be particularly hazardous. The application described here will be helpful when the missing at random assumption is not clearly warranted, and where there is at least a moderate amount of missing data.
In summary, where the proportion of missing data is modest (typical of many epidemiologic applications), and where it is not clearly missing at random, filling in a random sample of 10% of the missing may be sufficient to guide the imputation. In our experience this process does not create a large budgetary burden, and allows satisfactory bias correction without appreciable loss of power. If normality is assumed, there is some sensitivity to deviation from this assumption. Appropriate transformations, such as the Box-Cox,24 can be used. Alternatively, such variables could be converted to categorical forms.
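As one concrete option for the transformation step, SciPy provides a maximum-likelihood Box-Cox routine. The example below uses simulated right-skewed “BMI” values (log-normal, so the estimated power should be near 0, i.e. close to the log transform), not the study’s own data.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(3)

# Hypothetical right-skewed BMI values, standing in for real data.
bmi = np.exp(rng.normal(3.26, 0.17, 2000))

# Box-Cox chooses the power lambda by maximum likelihood; for log-normal
# data the estimate should be near 0, and the result nearly symmetric.
bmi_t, lam = boxcox(bmi)
```

After transformation, the residual skewness is close to zero, which is what the imputation software assumes.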
1. Woodward M, Smith WCS, Tunstall-Pedoe H. Bias from missing values: sex differences in implication of failed venepuncture for the Scottish Heart Health Study. Int J Epidemiol.
2. Vach W, Blettner M. Biased estimation of the odds ratio in case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol.
3. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol.
4. Fraser G, Henry D, Knutsen S. Missing data in dietary epidemiology: are imputed zeroes correct? (abstract). Am J Epidemiol.
5. Rubin DB. Inference and missing data. Biometrika.
6. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002.
7. Schafer JL. Analysis of Incomplete Multivariate Data. Bury St. Edmunds, Suffolk, UK: Chapman and Hall; 1997.
8. Glynn RJ, Laird NM, Rubin DB. Multiple imputation in mixture models for non-ignorable response with follow-up. J Am Stat Assoc.
9. Glynn RJ, Laird NM, Rubin DB. Selection modeling versus mixture modeling with nonignorable response. In: Wainer H, ed. Drawing Inferences From Self-Selected Samples. New York: Springer-Verlag; 1986.
10. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002:117–119.
11. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002:200–201.
13. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. Hoboken, NJ: John Wiley and Sons; 2002:210–211.
14. Efron B, Tibshirani RJ. Cross-validation and other estimates of prediction error. In: Efron B, Tibshirani RJ, eds. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.
15. Willett W, Stampfer M. Implications of total energy intake for epidemiologic analysis. In: Willett WC, ed. Nutritional Epidemiology. 2nd ed. New York: Oxford University Press; 1998.
16. Gilks WR, Richardson S, Spiegelhalter DJ. Markov Chain Monte Carlo in Practice. London: Chapman and Hall; 1996.
17. Bohlscheid-Thomas S, Hoting I, Boeing H, et al. Reproducibility and relative validity of food group intake in a food frequency questionnaire developed for the German part of the EPIC project. Int J Epidemiol. 1997;26(Suppl 1):S59–S70.
18. Hansson LM, Galanti MR. Diet-associated risks of disease and self-reported food consumption: how shall we treat partial nonresponse in a food frequency questionnaire? Nutr Cancer.
19. Willett WC. Nutritional Epidemiology. 2nd ed. New York: Oxford University Press; 1998:322.
20. Caan B, Hiatt RA, Owen AM. Mailed dietary surveys: response rates, error rates, and the effect of omitted food items on nutrient values. Epidemiology.
21. Kessler RC, Little RJA, Groves RM. Advances in strategies for minimizing and adjusting for survey non-response. Epidemiol Rev.
22. Beeson WL, Mills PK, Phillips RL, et al. Chronic disease among Seventh-day Adventists, a low risk group. Cancer.
23. Eaker S, Bergstrom R, Bergstrom A, et al. Response rate to mailed epidemiologic questionnaires: a population-based randomized trial of variations in design and mailing routines. Am J Epidemiol.
24. Box GEP, Cox DR. An analysis of transformations (with discussion). J Royal Stat Soc (B).
Proportional Bias in Subsample Data Created by Incomplete Response
For variable X, let subscripts c and ic indicate successfully completed and finally incomplete subsample data; n indicates the number of subsample subjects, and P is the proportion of subsample subjects whose data are not filled in because of nonresponse.
Then [E(Xc) − E(X)]/E(X) is the proportionate bias when only the completed subsample data are considered.
Because E(X) = (1 − P)E(Xc) + P·E(Xic), substituting E(Xc) = [E(X) − P·E(Xic)]/(1 − P) and simplifying gives

[E(Xc) − E(X)]/E(X) = [P/(1 − P)][1 − E(Xic)/E(X)].
Why Both the Proportion of Initially Missing Data, and the Proportion of This That Is Filled in, Affect the Precision of the Final Analysis
The number of nonmissing data points is one measure of the information available to the analysis. For large numbers, Var(θ) is approximately proportional to 1/[N(1 − D(1 − L)) − m], where D is the proportion of data initially missing, L is the proportion of these filled in later, and there are m variables in θ. If R is the number of subsample subjects whose data are filled in, then L = R/(DN). Obviously, Var(θ) is smaller as L increases or as D becomes smaller. A smaller Var(θ) will result in Q imputation data sets that are increasingly similar. Consequently, Var[E(β)] over the Q runs is smaller for a given D, where β is the coefficient vector of the analysis performed after the imputation. As Var[E(β)] is the quantity by which the missing data increase the variance of β when using multiple imputation,13 this variance is smaller when L is larger or D is smaller. A related observation is that with either of these trends there will also be fewer data points to impute; as these carry the influence of Var(θ), the corresponding imprecision has less impact. Another implication is that when D is large, choosing a larger L will reduce the resulting imprecision.

© 2007 Lippincott Williams & Wilkins, Inc.