Background: Epidemiologic data sets continue to grow. Probabilistic-bias analyses, which simulate hundreds of thousands of replications of the original data set, may strain desktop computing resources.
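To make the idea of repeated replications concrete, the following is a minimal sketch of a probabilistic-bias analysis for a misclassified binary exposure, applied to summary 2x2 counts. It is a simplification and not the method used in this study: the function names, the example counts, and the beta distributions for sensitivity and specificity are hypothetical, misclassification is assumed nondifferential, and random error is ignored.

```python
import random


def corrected_exposed(observed_exposed, total, se, sp):
    """Back-calculate the expected true number exposed from the observed
    number, given sensitivity (se) and specificity (sp) of classification."""
    return (observed_exposed - (1 - sp) * total) / (se + sp - 1)


def simulate_corrected_or(a, b, c, d, n_iter=1000, seed=0):
    """Draw bias parameters n_iter times and return bias-adjusted odds ratios.

    a, b: exposed and unexposed cases; c, d: exposed and unexposed noncases.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):
        # Hypothetical priors; in practice these would come from validation data.
        se = rng.betavariate(80, 20)
        sp = rng.betavariate(95, 5)
        if se + sp <= 1:
            continue  # classification no better than chance; skip this draw
        true_a = corrected_exposed(a, a + b, se, sp)
        true_c = corrected_exposed(c, c + d, se, sp)
        true_b = (a + b) - true_a
        true_d = (c + d) - true_c
        if min(true_a, true_b, true_c, true_d) <= 0:
            continue  # draw is incompatible with the observed counts
        results.append((true_a * true_d) / (true_b * true_c))
    return results


# Hypothetical counts, for illustration only.
adjusted = sorted(simulate_corrected_or(30, 70, 200, 700))
median_or = adjusted[len(adjusted) // 2]
```

The sorted list of adjusted odds ratios gives a simulation interval through its percentiles. A record-level analysis repeats a comparable reclassification for every record in every iteration, which is why the computational burden grows with cohort size.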
Methods: We implemented a probabilistic-bias analysis to evaluate the direction, magnitude, and uncertainty of the bias arising from misclassification of prepregnancy body mass index when studying its association with early preterm birth in a cohort of 773,625 singleton births. We compared 3 bias analysis strategies: (1) using the full cohort, (2) using a case-cohort design, and (3) weighting records by their frequency in the full cohort.
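The third strategy, weighting records by their frequency, can be illustrated with a short sketch. It assumes each record reduces to a tuple of categorical values, here a hypothetical `(exposed, outcome)` pair; the function names and counts are illustrative and do not represent the study's code or data.

```python
from collections import Counter


def collapse_to_frequencies(records):
    """Collapse identical records into (pattern, frequency) pairs."""
    return sorted(Counter(records).items())


def weighted_odds_ratio(weighted_rows):
    """Compute an odds ratio from ((exposed, outcome), weight) rows."""
    cells = Counter()
    for (exposed, outcome), weight in weighted_rows:
        cells[(exposed, outcome)] += weight
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(0, 1)] * cells[(1, 0)])


# Hypothetical cohort of 1,000 records with only 4 distinct patterns.
records = [(1, 1)] * 30 + [(0, 1)] * 70 + [(1, 0)] * 200 + [(0, 0)] * 700

collapsed = collapse_to_frequencies(records)  # 4 weighted rows
full_or = weighted_odds_ratio((record, 1) for record in records)
collapsed_or = weighted_odds_ratio(collapsed)
```

Both calls return the same odds ratio, but the collapsed data set holds 4 rows instead of 1,000. The saving depends on how many distinct covariate patterns the data contain, so it shrinks as more variables or finer categories are added.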
Results: Underweight and overweight mothers were more likely to deliver early preterm. A validation substudy demonstrated misclassification of prepregnancy body mass index derived from birth certificates. Probabilistic-bias analyses suggested that the association between underweight and early preterm birth was overestimated by the conventional approach, whereas the associations between overweight categories and early preterm birth were underestimated. The 3 bias analyses yielded equivalent results and challenged our typical desktop computing environment. Analyses applied to the full cohort, case cohort, and weighted full cohort required 7.75 days and 4 terabytes, 15.8 hours and 287 gigabytes, and 8.5 hours and 202 gigabytes, respectively.
Conclusions: Large epidemiologic data sets often include imperfectly measured variables, frequently because the data were collected for other purposes. Probabilistic-bias analysis allows quantification of the resulting errors but may be difficult in a desktop computing environment. Solutions that make these analyses feasible in this environment can be achieved without new hardware and within reasonable computational time frames.