Comparison of Bias Analysis Strategies Applied to a Large Data Set

Lash, Timothy L. (a); Abrams, Barbara (b); Bodnar, Lisa M. (c,d,e)

doi: 10.1097/EDE.0000000000000102

Background: Epidemiologic data sets continue to grow larger. Probabilistic-bias analyses, which simulate hundreds of thousands of replications of the original data set, may challenge desktop computational resources.

Methods: We implemented a probabilistic-bias analysis to evaluate the direction, magnitude, and uncertainty of the bias arising from misclassification of prepregnancy body mass index when studying its association with early preterm birth in a cohort of 773,625 singleton births. We compared 3 bias analysis strategies: (1) using the full cohort, (2) using a case-cohort design, and (3) weighting records by their frequency in the full cohort.
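The third strategy, weighting records by their frequency, can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything below is an assumption for illustration: the toy records, the category labels, and the helper names `collapse` and `risk_ratio` are hypothetical. The idea shown is only that identical records can be collapsed into one row per pattern with a count, and that an estimate computed from the weighted rows matches the record-level estimate.

```python
from collections import Counter


def collapse(records):
    """Collapse record-level rows into {pattern: frequency} (hypothetical helper)."""
    return Counter(records)


def risk_ratio(weighted, exposed_level, reference_level):
    """Risk ratio for the outcome, exposed vs reference, from weighted patterns."""
    cases = Counter()
    totals = Counter()
    for (exposure, outcome), n in weighted.items():
        totals[exposure] += n          # everyone in this exposure category
        if outcome:
            cases[exposure] += n       # those who had the outcome
    risk_exposed = cases[exposed_level] / totals[exposed_level]
    risk_reference = cases[reference_level] / totals[reference_level]
    return risk_exposed / risk_reference


# Toy data, invented for illustration only: (BMI category, early preterm 0/1).
records = (
    [("underweight", 1)] * 3 + [("underweight", 0)] * 97
    + [("normal", 1)] * 10 + [("normal", 0)] * 490
)

weighted = collapse(records)           # 600 records become 4 weighted rows
rr = risk_ratio(weighted, "underweight", "normal")
print(len(records), "records ->", len(weighted), "weighted rows; RR =", round(rr, 2))
```

In a probabilistic-bias analysis each simulated replication would then operate on the handful of weighted rows instead of every record, which is one plausible reason the weighted approach needs less time and storage.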

Results: Underweight and overweight mothers were more likely to deliver early preterm. A validation substudy demonstrated misclassification of prepregnancy body mass index derived from birth certificates. Probabilistic-bias analyses suggested that the association between underweight and early preterm birth was overestimated by the conventional approach, whereas the associations between overweight categories and early preterm birth were underestimated. The 3 bias analyses yielded equivalent results and challenged our typical desktop computing environment. Analyses applied to the full cohort, case cohort, and weighted full cohort required 7.75 days and 4 terabytes, 15.8 hours and 287 gigabytes, and 8.5 hours and 202 gigabytes, respectively.
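To put the reported resource figures on a common scale, the short calculation below converts the full-cohort run time to hours and expresses each strategy relative to the full cohort. It uses only the numbers reported above; the conversion of 1 terabyte to 1000 gigabytes is an assumption, since the abstract does not say which definition was used.

```python
HOURS_PER_DAY = 24
GB_PER_TB = 1000  # assumption: decimal terabytes; the abstract does not specify

# (run time in hours, storage in gigabytes) as reported in the Results
strategies = {
    "full cohort": (7.75 * HOURS_PER_DAY, 4 * GB_PER_TB),
    "case cohort": (15.8, 287),
    "weighted full cohort": (8.5, 202),
}

full_hours, full_gb = strategies["full cohort"]
for name, (hours, gb) in strategies.items():
    time_ratio = full_hours / hours    # how many times shorter than the full cohort
    storage_ratio = full_gb / gb       # how many times smaller than the full cohort
    print(f"{name}: {hours:.1f} h, {gb} GB, "
          f"time ratio {time_ratio:.1f}, storage ratio {storage_ratio:.1f}")
```

Under that assumption, the case-cohort and weighted analyses ran roughly 12 and 22 times faster than the full-cohort analysis, respectively.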

Conclusions: Large epidemiologic data sets often include variables that are imperfectly measured, often because data were collected for other purposes. Probabilistic-bias analysis allows quantification of errors but may be difficult in a desktop computing environment. Solutions that allow these analyses in this environment can be achieved without new hardware and within reasonable computational time frames.

Author Information

From the (a) Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, GA; (b) Division of Epidemiology, School of Public Health, University of California at Berkeley, Berkeley, CA; (c) Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA; (d) Department of Obstetrics, Gynecology, and Reproductive Sciences, School of Medicine, University of Pittsburgh, Pittsburgh, PA; and (e) Magee-Womens Research Institute, Pittsburgh, PA.

Submitted 4 September 2013; accepted 13 December 2013; posted 8 May 2014.

Supported by NIH grant R21 HD065807 and the Thrasher Research Fund (9181).

Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article. This content is not peer-reviewed or copy-edited; it is the sole responsibility of the author.

Correspondence: Timothy L. Lash, Department of Epidemiology, Rollins School of Public Health, Emory University, 1518 Clifton Rd. NE, Atlanta, GA 30322.

© 2014 by Lippincott Williams & Wilkins, Inc