To the Editors:
In questionnaires used in epidemiologic surveys, respondents often answer questions about continuous variables in terms of a few predefined categories. Examples of such continuous variables commonly treated as categorized (grouped) data are yearly household income, frequency of food intake during 1 week, and hours of intensive physical activity per week. The reproducibility of grouped data is usually estimated by the intraclass correlation coefficient (ICC),1 calculated on the midpoints of the predefined categories, or by weighted kappa.2 Both methods, however, depend on the choice of categories (cut-off points and their number) and tend to underestimate the ICC of the underlying continuous data.3–5 This behavior complicates their interpretation and hinders comparison of questionnaires with differently defined categories. To our knowledge, the maximum likelihood ICC has not been suggested for reproducibility studies with grouped data. We compare it with the midpoint ICC on simulated datasets and on a real-life example, food frequency questionnaire (FFQ) data.
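To make the midpoint approach concrete, the sketch below (in Python, purely for illustration) computes ICC(1,1) from one-way ANOVA mean squares, first on simulated continuous test–retest data and then on the same data after each answer is replaced by the midpoint of a hypothetical 5-category scheme. The cut-offs, midpoints, sample size, and true ICC are all assumptions for the example, not values taken from the letter.

```python
import numpy as np

def icc_oneway(x1, x2):
    """ICC(1,1) from two ratings per subject, via one-way ANOVA mean squares."""
    ratings = np.column_stack([x1, x2])
    n, k = ratings.shape
    subj_means = ratings.mean(axis=1)
    msb = k * np.sum((subj_means - ratings.mean()) ** 2) / (n - 1)      # between-subject MS
    msw = np.sum((ratings - subj_means[:, None]) ** 2) / (n * (k - 1))  # within-subject MS
    return (msb - msw) / (msb + (k - 1) * msw)

# Simulate test-retest data with a true ICC of 0.5 (hypothetical setup)
rng = np.random.default_rng(1)
n, icc = 1000, 0.5
b = rng.normal(0, np.sqrt(icc), n)            # subject effect
x1 = b + rng.normal(0, np.sqrt(1 - icc), n)   # occasion 1
x2 = b + rng.normal(0, np.sqrt(1 - icc), n)   # occasion 2

# Hypothetical 5-category scheme: interior cut-offs and category midpoints
# (open-ended outer categories approximated by extending the category width)
cuts = np.array([-1.5, -0.5, 0.5, 1.5])
mids = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

m1, m2 = mids[np.digitize(x1, cuts)], mids[np.digitize(x2, cuts)]
icc_cont = icc_oneway(x1, x2)   # ICC of the underlying continuous data
icc_mid = icc_oneway(m1, m2)    # midpoint ICC of the grouped data
print(round(icc_cont, 3), round(icc_mid, 3))
```

With an equal-width scheme like this one, the midpoint estimate typically falls several hundredths below the continuous-data ICC, consistent with the underestimation the letter reports.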
Our simulations mimicked situations in which 1000 respondents answer a question with 5 predefined categories on 2 occasions. We performed 1000 simulations for each of the 99 ICC values (0.01–0.99). To investigate the influence of the number of categories, we further experimented with 3, 5, 10, 25, and 50 categories. These experiments included 1 low (0.2), 1 medium (0.5), and 1 high (0.8) ICC value (1000 simulations for each experiment). We separately analyzed cases of equal and unequal category widths. Data were simulated according to a 1-way random-effects model and grouped into the predefined categories afterwards. The detailed simulation methodology and the derivation of the log-likelihood for grouped data are in the eAppendix (http://links.lww.com/EDE/A807).
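Under the 1-way random-effects model, a respondent's two answers are bivariate normal with correlation equal to the ICC, so each observed category pair contributes the log of a bivariate-normal rectangle probability to the grouped-data likelihood. The letter's own derivation is in the eAppendix; the Python sketch below is an independent, simplified illustration of that idea, with assumed cut-offs, sample size, and optimizer choice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def ml_icc(c1, c2, cuts):
    """Maximum likelihood ICC from grouped test-retest data.

    c1, c2 : integer category indices (0..K-1) for the two occasions
    cuts   : interior cut-off points t_1 < ... < t_{K-1}, assumed known
    """
    t = np.concatenate([[-np.inf], np.asarray(cuts, float), [np.inf]])
    K = t.size - 1
    counts = np.zeros((K, K))                       # K x K table of category pairs
    np.add.at(counts, (np.asarray(c1), np.asarray(c2)), 1)

    def negloglik(par):
        mu, log_sigma, z = par
        sigma, rho = np.exp(log_sigma), np.tanh(z)  # keep sigma > 0, |rho| < 1
        bvn = multivariate_normal([mu, mu], sigma**2 * np.array([[1.0, rho], [rho, 1.0]]))
        tc = np.clip(t, mu - 10 * sigma, mu + 10 * sigma)  # clip infinite end points
        F = np.array([[bvn.cdf([a, b]) for b in tc] for a in tc])
        # rectangle probability of each category pair, by inclusion-exclusion
        P = F[1:, 1:] - F[:-1, 1:] - F[1:, :-1] + F[:-1, :-1]
        return -np.sum(counts * np.log(np.clip(P, 1e-12, None)))

    res = minimize(negloglik, x0=[0.0, 0.0, 0.5], method="Nelder-Mead")
    return np.tanh(res.x[2])                        # ICC = estimated correlation rho

# Simulate grouped data under the 1-way random-effects model (hypothetical setup)
rng = np.random.default_rng(2)
n, icc = 1000, 0.5
b = rng.normal(0, np.sqrt(icc), n)
x1 = b + rng.normal(0, np.sqrt(1 - icc), n)
x2 = b + rng.normal(0, np.sqrt(1 - icc), n)
cuts = [-1.5, -0.5, 0.5, 1.5]                       # 5 categories, equal interior widths
est = ml_icc(np.digitize(x1, cuts), np.digitize(x2, cuts), cuts)
print(round(est, 3))
```

Because the likelihood uses the full rectangle probabilities rather than substituted midpoints, the estimate depends on the cut-offs only through the information they carry, not through any arbitrary midpoint placement.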
The maximum likelihood estimator showed low bias, with a median value of 0.001 (range 0.000–0.011) when category widths were equal and 0.002 (0.000–0.017) when they were unequal (simulations with 1000 respondents and 5 categories; eTable1, eFigure1, http://links.lww.com/EDE/A807). In contrast, the midpoint ICC underestimated the true ICC by a median of 0.067 (0.002–0.111) when category widths were equal and by 0.133 (0.002–0.182) otherwise.
The Figure shows the results of simulations with different numbers of categories (as described in eTables 2–4, http://links.lww.com/EDE/A807). The maximum likelihood ICC was unaffected by the number of categories and the choice of cut-off points: its mean estimates for data with the same underlying ICC value differed by at most 0.005. In comparison, midpoint estimates for data with the same ICC value differed by up to 0.24. The midpoint method underestimated the ICC more when the number of categories was lower. Furthermore, its bias was higher when the ICC was higher and category widths were unequal. Its bias was low (mainly below 0.01) only when the number of categories was large (25 or more).
The FFQ example showed a similar pattern, with maximum likelihood estimates on average 0.09 higher than midpoint estimates (eTable5, http://links.lww.com/EDE/A807).
Although the bias of the midpoint ICC and its dependence on the number of categories have been noted previously, maximum likelihood estimation has not been used in reproducibility studies with grouped data, probably because of a lack of user-friendly software. We therefore provide the R package iRepro (available from http://www.imi.hr/~jkovacic/irepro.html, including installation and usage guidelines).
To conclude, researchers should be aware of bias related to the commonly used midpoint approach when estimating ICC from continuous grouped data. The maximum likelihood estimator is a better choice, as it showed almost no bias under all tested conditions. Furthermore, it was unaffected by the choice of categories. This enables comparison of questionnaires with different grouping schemes, including questionnaires with data that are not grouped (ie, continuous data). Unless the number of categories is large, such comparisons based on the midpoint method should be avoided.
We thank Jelena Macan for her valuable comments.
Veda Marija Varnai
Institute for Medical Research
and Occupational Health
1. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428
2. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968;70:213–220
3. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol. 1987;126:161–169
4. Brenner H, Kliebsch U. Dependence of weighted kappa coefficients on the number of categories. Epidemiology. 1996;7:199–202
5. de Mast J, van Wieringen W. Measurement system analysis for bounded ordinal data. Qual Reliab Eng Int. 2004;20:383–395