Various questionnaire-based definitions of chronic obstructive pulmonary disease (COPD) have been applied to data from the nationally representative US National Health and Nutrition Examination Survey (NHANES), but few have been validated against objective lung function data. We validated two prior definitions that incorporated self-reported physician diagnosis, respiratory symptoms, and/or smoking. We also validated a new definition that we developed empirically using gradient boosting, an ensemble machine learning method.
Data came from 7,996 individuals aged 40–79 years who participated in NHANES 2007–2012 and underwent spirometry. We considered participants “true” COPD cases if their ratio of postbronchodilator forced expiratory volume in 1 second to forced vital capacity was below 0.7 or the lower limit of normal. We stratified all analyses by smoking history. We developed a gradient boosting model for smokers only; predictors assessed (25 total) included sociodemographics, inhalant exposures, clinical variables, and respiratory symptoms.
The spirometry-based COPD prevalence was 26% for smokers and 8% for never smokers. Among smokers, the questionnaire-based definitions yielded a COPD prevalence ranging from 11% to 16%, sensitivity ranging from 18% to 35%, and specificity ranging from 88% to 92%. The new definition classified participants based on age, bronchodilator use, body mass index (BMI), smoking pack-years, and occupational organic dust exposure, and resulted in the highest sensitivity (35%) and specificity (92%) among smokers. Among never smokers, the COPD prevalence ranged from 4% to 5%, and we attained good specificity (96%) at the expense of sensitivity (9–10%).
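As an illustration of how these validation measures are defined, the following minimal sketch computes sensitivity and specificity of a questionnaire-based classification against a spirometry-based reference. The function name and the ten-person toy data are hypothetical and are not drawn from NHANES. The sketch is also unweighted, whereas an analysis of NHANES would normally account for the survey design.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of `predicted` against `truth`.

    truth[i]     -- True if participant i is a case by the reference standard
    predicted[i] -- True if the questionnaire definition flags participant i
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1          # true case, flagged
        elif t and not p:
            fn += 1          # true case, missed
        elif not t and p:
            fp += 1          # non-case, flagged
        else:
            tn += 1          # non-case, not flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one non-case")

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 4 reference cases and 6 non-cases.
truth = [True] * 4 + [False] * 6
predicted = [True, False, False, False] + [False] * 5 + [True]

se, sp = sensitivity_specificity(truth, predicted)
print(f"sensitivity={se:.2f}, specificity={sp:.2f}")
```

In this toy example the definition detects one of four reference cases and correctly excludes five of six non-cases, the same pattern of low sensitivity with high specificity reported above.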
Our results can be used to parametrize misclassification assumptions for quantitative bias analysis when pulmonary function data are unavailable.
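One simple form of such a bias analysis is to back-correct an observed questionnaire-based prevalence using assumed sensitivity and specificity. The sketch below uses the standard Rogan-Gladen estimator, which is our choice for illustration and is not a method described in this study. The function name is hypothetical, and the observed prevalence of 0.15 is an assumed input rather than a reported result. The sensitivity and specificity are the values reported above for the new definition among smokers.

```python
def corrected_prevalence(
    observed: float, sensitivity: float, specificity: float
) -> float:
    """Rogan-Gladen correction of an observed prevalence for misclassification.

    observed = true * Se + (1 - true) * (1 - Sp), solved for `true`:
        true = (observed + Sp - 1) / (Se + Sp - 1)
    """
    youden = sensitivity + specificity - 1.0
    if youden <= 0:
        # The classifier is no better than chance; the correction is undefined.
        raise ValueError("sensitivity + specificity must exceed 1")

    estimate = (observed + specificity - 1.0) / youden
    # Sampling error can push the estimate outside [0, 1]; truncate it.
    return min(1.0, max(0.0, estimate))


# Assumed inputs: hypothetical observed prevalence of 15% among smokers,
# with Se = 35% and Sp = 92% as reported for the new definition.
estimate = corrected_prevalence(observed=0.15, sensitivity=0.35, specificity=0.92)
print(f"corrected prevalence: {estimate:.3f}")
```

A probabilistic bias analysis would typically draw sensitivity and specificity from distributions instead of fixing them at single values. The fixed-value version shown here is only the simplest case.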