With the advent of inexpensive data storage and broadband networking, some epidemiologists have developed research projects that use enormous data sets.^{1} These large data sources are often queried to answer questions for which they are not ideally suited.^{2} Probabilistic-bias analysis has been suggested as a tool to assess the direction, magnitude, and uncertainty about a bias acting on a study’s result.^{3–6} Probabilistic-bias analysis requires simulations, which can be computationally intensive, often entailing 100,000 or more iterations of a simulation to characterize the bias.^{5}

These iterated simulations can be implemented on summarized data (eg, 2 × 2 tables or several strata of 2 × 2 tables),^{7} by simulating bias terms directly,^{7–9} or by applying the bias model to each record of the data set to simulate the data that a bias-corrected record might contain.^{9},^{10} Selection bias and bias from confounding can be readily modeled and simulated by either of the first 2 strategies because the observed association can be factorized into the expected association and an error term representing the bias.^{9} In a selection-bias problem, for example, the observed relative estimate of effect (*RR*_{obs}) can be written as a product of the true relative estimate of effect (*RR*_{true}) and a bias factor (Bias):

The bias factor is a function of the selection proportions (*P*_{D·,E·}) in the 4 combinations of exposure (*E*+ and *E*−) and disease (*D*+ and *D*−):

In a probabilistic-bias analysis, distributions can be assigned to each of the selection proportions and sampled by Monte Carlo simulation to obtain a distribution of the bias factor. There is no need for data record manipulation or manipulation of summarized data, and so the size of the data set does not affect the computational intensity of the bias analysis.

Most information-bias problems, and especially most misclassification problems, cannot be factorized. For a dichotomous-exposure misclassification problem:

where *a* is the observed number of exposed cases, *c* is the observed number of exposed noncases, *b* is the observed number of unexposed cases, and *d* is the observed number of unexposed noncases. The true relative effect is a function of these frequencies and the positive and negative predictive values for exposure classification (*PPV*_{D} and *NPV*_{D}):

*RR*_{obs} cannot be factored from the equation for *RR*_{true} to obtain an estimate of the bias as a function of the predictive values. This is true for most misclassification problems, with few exceptions.^{7} Monte Carlo simulations must, therefore, operate directly on data. The data may be summarized as a crude 2 × 2 table (as in the equations) or as strata (including strata as finely divided as single records). With stratification, the computational intensity will depend on the size of the data set, which is a function of the number of records and degree of stratification. When simulations are applied to summarized data such as a 2 × 2 table, an analyst may lose the ability to adjust for multiple covariates. A record-level simulation of misclassification bias^{10} is then an option. However, given data sets of hundreds of thousands of records, and the need for at least 100,000 iterations, the computational intensity required may become a barrier, especially for those working with desktop personal computers.

These problems came to the fore when we sought to implement a probabilistic-bias analysis to evaluate the direction, magnitude, and uncertainty of bias arising from a study of the association between prepregnancy body mass index (BMI) and early preterm birth, adjusted for multiple covariates by logistic regression. Using a desktop personal computer to apply the results from a validation substudy to nearly 800,000 eligible birth records by generating 100,000 simulated data sets of equal size immediately raised the specter of a computational problem so intense as to preclude a probabilistic-bias analysis. We therefore designed and implemented a comparison of several possible analytic solutions and compared them with respect to the achieved results and computational resources required for each.

## METHODS

### Penn MOMS Cohort

Penn MOMS is a cohort study designed to investigate the risk of poor pregnancy outcomes associated with maternal BMI, gestational weight gain, and race/ethnicity. Participants were non-Hispanic white and non-Hispanic black mothers of singleton infants, with birth certificates in Pennsylvania from 2003 to 2010 (n = 994,020). We excluded infants with birth certificates that were missing data on birth weight (n = 4,348) or gestational age (n = 15,670), mother’s prepregnancy height or weight (n = 57,721), gestational weight gain (n = 66,384), or covariates in the final model (n = 55,731). Applying these inclusion and exclusion criteria yielded a cohort of 773,625 singleton births. The birth certificate gathers data on self-reported prepregnancy weight and height on the “Mother’s Worksheet,” which is administered by an interviewer before the mother leaves the hospital. Mothers were aware of whether they gave birth at term or preterm, which may have affected their reporting of prepregnancy weight. The prepregnancy BMI in the birth certificate was defined as self-reported prepregnancy weight (kg) divided by self-reported prepregnancy height (m) squared and was categorized according to WHO guidelines.^{11} Gestational age in the birth certificate is based on the best obstetric estimate. Early preterm birth was defined as delivery at <32 weeks’ gestation.

### Validation Substudy

From this cohort, we selected 46,059 singleton births recorded at Magee-Womens Hospital of the University of Pittsburgh Medical Center, a teaching hospital and tertiary care center with the highest delivery volume of any Pennsylvania facility. Using a balanced design,^{12} we selected up to 30 birth certificates at random from those available in 48 strata: 2 race/ethnicity strata (non-Hispanic white, non-Hispanic black), 2 gestational age strata (term = ≥37 weeks’ gestation, preterm = <37 weeks’ gestation), 4 prepregnancy BMI strata (underweight [<18.5 kg/m^{2}], normal weight/overweight [18.5–29.9 kg/m^{2}], obese [30–34.9 kg/m^{2}], severely obese [≥35 kg/m^{2}]), and 3 gestational weight gain strata as previously defined^{13} (<20th percentile, 20–80th percentile, >80th percentile of statewide BMI-strata–specific gestational weight gain distributions). Balanced sampling allowed validation within rare subgroups but precluded direct measurement of sensitivity and specificity.^{14} Medical record abstractors completed standardized, computer-aided reviews of the selected medical records, including data on prepregnancy weight and height, as reported by mothers at their first prenatal visit. Using these weight and height reports, the medical record prepregnancy BMI was calculated and categorized.

### Probabilistic-Bias Analyses

We computed 60 predictive values (*PV*_{j,k,l.m}), one for each combination of *j* = 2 race/ethnicity strata, *k* = 2 gestational age strata (term or early preterm), *l* = 5 BMI categories (underweight, BMI <18.5; normal weight, BMI 18.5–24.9; overweight, BMI 25–29.9; obese class 1, BMI 30–34.9; and obese classes 2 and 3, BMI ≥35), and *m* = 3 gestational weight gain strata, all as measured on the birth certificate. We used 5 categories of BMI for the bias analysis, rather than the 4 categories used to design the validation substudy, to correspond to WHO BMI classifications. The second BMI category in the validation study was divided into 2 categories for the bias analysis, and classification boundaries were otherwise the same. Therefore,

where *r*_{j,k,l,m} is the number of medical records that yielded the same BMI category as the birth certificate within race/ethnicity category *j*, gestational age category *k*, BMI category *l*, and gestational weight gain category *m*, and *s*_{j,k,l,m} is the number of birth certificates for which medical records were reviewed within race/ethnicity category *j*, gestational age category *k*, BMI category *l*, and gestational weight gain category *m*.

For each of the 100,000 simulations used for a particular analytic strategy, we selected a value for each *PV*_{j,k,l,m} by sampling from a binomial distribution with mean *s*_{j,k,l,m}*PV*_{j,k,l,m} and variance *s*_{j,k,l,m}*PV*_{j,k,l,m} (1 − *PV*_{j,k,l,m}) and dividing by *s*_{j,k,l,m}. For each record within a single simulation, we chose the *PV*_{j,k,l,m} corresponding to the race/ethnicity, gestational age, birth certificate BMI, and gestational weight gain category, and we sampled the number of records assumed to be correctly classified with regard to BMI category by the birth certificate data (*x*, the “successes” in the usual binomial lexicon) from a binomial distribution with the probability of success *p* = *PV*_{j,k,l,m} and the number of trials *n* equal to the number of birth certificates represented by the record. The number of birth certificates assumed to be incorrectly classified with regard to BMI category by the birth certificate data (*n* − *x*, the “failures” in the usual binomial lexicon) were redistributed into the remaining BMI categories following a multinomial distribution inferred from the results of the validation substudy pertaining to the particular race/ethnicity, gestational age, birth certificate BMI category, and gestational weight gain stratum.

For example, 23 of 30 white non-Hispanic mothers who gave birth at ≥32 weeks’ gestation, whose birth certificate weight and height data placed them into the obese class 1, and whose gestational weight gain was >80th percentile remained in the obese 1 BMI category following medical record review (*PV*_{j = white non-Hispanic, k = ≥32 weeks, l = obese 1, m = >80th percentile} = 23/30 = 0.77). Of the remaining 7 mothers in this stratum whose medical record data placed them in a different prepregnancy BMI category, 2 were overweight according to the medical record review and 5 were obese classes 2 and 3 according to the medical record review. For the purpose of bias analysis, in a particular iteration, *PV*_{j = white non-Hispanic, k = ≥32 weeks, l = obese 1, m = >80th percentile} would be selected from a binomial with mean 30 × 0.77 and variance 30 × 0.77 × (1 – 0.77), and divided by 30. If a particular record in the analytic data set represented 100 white non-Hispanic mothers with term birth (≥32 weeks), obese 1 prepregnancy BMI category according to the birth certificate, and >80th percentile gestational weight gain, and *PV*_{j = white non-Hispanic, k = ≥32 weeks, l = obese 1, m = >80th percentile} equaled its expectation, then one would expect 77 mothers to remain classified as obese 1 in the simulation. Of the remaining 23, one would expect 6.6 (23 × 2/7) to be reclassified as overweight and the remaining 16.4 to be reclassified as obese 2 and 3, following the multinomial distribution inferred from the validation substudy. Over 100,000 iterations, deviations from this expectation will be multinomially distributed.

### Analytic Strategy

We used logistic regression to estimate the association between prepregnancy BMI category and delivering early preterm (<32 weeks gestation), adjusted for race/ethnicity, age category, education category, marital status, hospital type, urban/rural residence, nulliparous or parous, smoking status, and type of health insurance.

We implemented our probabilistic-bias analysis strategy with 3 analytic data sets. The first analytic data set contained records for each mother in the Penn MOMS cohort, restricted to 773,625 in the white non-Hispanic and black non-Hispanic race categories with complete data. The second analytic data set included all mothers who gave early preterm birth, and a 10% sample of the cohort of 773,625, representing a case-cohort design.^{15} For these first 2 analytic data sets, each record has *n* fixed equal to 1. The third analytic data set simultaneously stratified the records by the validation-study-design variables: race/ethnicity, gestational age, prepregnancy BMI category, and gestational weight gain category, and by the categories of all the remaining adjustment covariates. Each record represented a unique combination of these variables and was weighted by the frequency with which that combination appeared among the mothers of restricted Penn MOMS cohort. For this third data set, therefore, *n* was highly variable. We call this the frequency-weighted full cohort analytic data set.

We measured the computational intensity of 6 analyses: 3 conventional logistic regression analyses applied to each of the 3 analytic data sets, and probabilistic-bias analysis applied to each of the analytic data sets. As described elsewhere,^{5},^{9},^{10} all the probabilistic-bias analyses required simulation, output, and logistic regression analysis of 100,000 iterations that applied the bias model described above to the analytic data set. Results of each analysis were output to a second data set. Estimates of association stored in this second data set were ranked, and the median, 2.5% and 97.5%, were reported as estimates of the central tendency and simulation interval after iterative application of the bias model. Measures of computational intensity were the size of the simulation data set (not applicable to the 3 conventional analyses) and the central processing unit (CPU) time required to complete all steps of the analysis. All analyses were completed using PC SAS version 9.3 (Cary, NC) on the same desktop personal computer (Dell OptiPlex 790, Intel Core i7 2600 Processor [3.4 GHz, 8 MB] with 1 TB internal hard drive; Windows 7 Professional operating system). SAS code for the frequency-weighted full cohort analysis is available in the eAppendix (http://links.lww.com/EDE/A786).

## RESULTS

The Table shows the characteristics of the Penn MOMS cohort (n = 773,625) restricted to singleton births by non-Hispanic white and black mothers with complete birth certificate data and of the validation subcohort (n = 1,212). The balanced sampling design for the validation subcohort allowed more precise estimation of predictive values in substrata that would have been sparsely populated by validating a simple random sample. For example, preterm births comprised 1.4% of births and 39% of the validation subcohort. Figure 1 shows the cross-classification of “birth certificate prepregnancy BMI” group against “medical record prepregnancy BMI” group within categories of the validation study’s design variables. Prepregnancy BMI category derived from medical record review agreed with BMI category derived from the birth certificate for 969 of 1212 reviews (overall agreement of 80%). The 60 predictive values ranged from 25% to 100%, with two-thirds exceeding 75%. In the central 3 BMI groups, more BMI categories were underestimated by self-reported prepregnancy weight on the birth certificate (80) than were overestimated (63).

Figure 2 summarizes the 6 analyses of the association between prepregnancy BMI category and early preterm birth, using normal weight as the reference category. In each of the conventional analyses (the 3-point estimates and intervals on the left side of each grouping), women whose prepregnancy BMI category according to the birth certificate was underweight (OR = 1.39 [95% confidence interval = 1.28–1.52]), obese class 1 (1.21 [1.14–1.29]), or obese classes 2 and 3 (1.43 [1.35–1.52]) had higher risk of early preterm delivery than women whose prepregnancy BMI category according to the birth certificate was normal weight (according to the weighted cohort analysis). Point estimates and intervals of the full cohort and weighted full cohort were exactly the same. For the case-cohort analytic data set, the point estimates were slightly different and the confidence intervals slightly wider than the full cohort analyses, a result of sampling of the cohort.

All 3 bias analyses yielded similar results (the 3-point estimates and intervals on the right side of each grouping). Women in the underweight category, after bias analysis, had estimates of association nearer the null than in the conventional analysis (OR = 1.15 [95% simulation interval = 1.02–1.30]). Women in the obese class 1 category (1.27 [1.18–1.36]) and obese class 2 and 3 category (1.52 [1.42–1.61]) after bias analysis had estimates of association further from the null than in the conventional analysis. All bias analysis results had wider confidence intervals than their corresponding conventional analysis, reflecting the additional error incorporated into the estimates of association by modeling the error arising from misclassification of self-reported BMI.

The computational intensity required to complete the 3 bias analyses varied substantially, and all were much greater than that required to complete any of the conventional analyses. The CPU times required to complete the conventional analyses for the full cohort, case cohort, and weighted full cohort were 4.1, 0.39, and 0.1 seconds, respectively. The CPU times required to complete the corresponding probabilistic-bias analyses were 7.75 days, 15.8 hours, and 8.5 hours, respectively. The simulation data created by the probabilistic-bias analyses were 4 terabytes for the full cohort, 287 gigabytes for the case cohort, and 202 gigabytes for the weighted full cohort.

Given the substantially lower CPU time and simulation data size for the weighted full cohort, compared with the other 2 probabilistic-bias analyses, we sought to better characterize the underlying explanation. The analysis included 12 variables with 39,120 possible combinations of values assigned to the variables, of which only 22,417 were observed in the cohort of 773,625 mothers. The frequency with which each observed combination appeared ranged from 0 to 20,598. About 39% of the records had frequency of 1 or 2; for these records, there was little advantage to the weighted cohort over the full cohort. Only 0.3% of records had frequency of 1,000 or more; it is these records for which the weighted analysis resulted in substantial gains in computational efficiency compared with the full cohort analysis.

## DISCUSSION

The objective of the Penn MOMS cohort study was to investigate the risk of poor pregnancy outcomes associated with maternal BMI, gestational weight gain, and race/ethnicity in a large and complete population. We recognized that self-reported weight collected as part of the birth certificate interview may be subject to error and that these errors may vary depending on the birth outcome. We therefore designed a balanced design validation study,^{12} ultimately comparing prepregnancy weight and derived BMI group from the first prenatal visit to the values obtained from the birth certificate. The validation substudy of the Penn MOMS project consumed a substantial proportion of the project’s resources, but still has limitations. First, we limited the cohort to births with complete birth certificate data and the validation subcohort to medical records with complete validation data. Second, medical records were reviewed at only a single hospital, with the largest number of births annually from among all Pennsylvania hospitals. If errors in reporting prepregnancy weight are different outside this select population, these differences would not be captured by the validation study. Allowing binomial error in the predictive values somewhat accounts for this uncertainty. Third, because of the balanced sampling of births into the 48 strata of the validation sample, we could not directly compute the sensitivity or specificity of the classification of prepregnancy BMI categories derived from the self-reported weight on birth certificates.^{14} Without measurements of sensitivity and specificity, we could not assess whether errors were differential by birth outcome.^{9}

It is imperative that the results of the validation substudy be put to good use, both to achieve the scientific aims and to justify the resources devoted to it. When the analysis plan began to be implemented, it became apparent that the computational burden of applying earlier probabilistic-bias analysis methods^{10} to such a large data set would overwhelm typical desktop computing resources. The first option, which is feasible (and we expect would have been successful), would have been to migrate the analysis to an analytic machine up to the task, such as a networked installation of SAS with distributed computing capability. The second option, which we implemented in part, was to upgrade the desktop computing environment. We ultimately installed a 1-terabyte internal hard drive to accommodate the computing requirements of the probabilistic-bias analysis applied to the full cohort. An external hard drive of this size did not have sufficient data transfer speeds to allow the analysis to be completed in a reasonable amount of time.

We recognized that the information density of the 4-terabyte simulation data set was not great. Approximately the same information could be captured by 2 well-understood strategies: case-cohort analysis and frequency weighting of regression records. We expected the first of these to be less computationally intense, but it was not. In retrospect, the reasons are easy to understand. The case-cohort analysis includes a record for each of the 10,965 cases of preterm delivery and records for 10% of the cohort, totaling 88,328 records, each representing only 1 person. The frequency-weighted full cohort data set included only 22,418 records but represented the entire cohort. Its information density, therefore, was far superior to either the full cohort data set or the case-cohort data set. Had there been more covariates or more categories of the covariates, or had we chosen a lower sampling proportion for the case-cohort analysis, the relative computational efficiencies might have changed.

To our knowledge, application of probabilistic-bias analysis methods to a frequency-weighted regression analysis problem has not been previously reported. It is a hybrid solution. Methods to correct for misclassification problems depicted in 2 × 2 tables date to the 1950s,^{16} and probabilistic-bias analysis tools that extend these methods are available.^{5},^{10} Although this solution can be applied across strata of covariates, and estimates summarized across the strata, this approach is not an option when one wishes to adjust for multiple covariates. The data become too thin, and they cannot easily yield estimates of confidence intervals for the summarized estimates.^{17} Probabilistic-bias analysis solves both problems by applying the same conceptual correction for misclassification to each data record, allowing the covariate information required for adjustment to travel with the corrected value for the potentially misclassified variable. The iterative simulations allow one to construct simulation intervals that incorporate both the additional uncertainty due to information bias and the sampling error encapsulated by the conventional frequentist confidence intervals.^{3},^{5},^{9},^{10} In many applications, these simulation intervals correspond well with the intervals obtained from a full Bayesian analysis.^{18},^{19} Application of probabilistic-bias analysis to frequency-weighted data sets combines the computational advantages of applying bias analysis to frequencies rather than to individual records, with the computational advantage of adjusting for multiple covariates by regression rather than stratification. Relatively simple generation of simulation intervals provides an additional benefit over the 2 × 2 table solution. In this application, the results following bias analysis were similar to those obtained from the conventional result, with small differences in the point estimates and wider intervals. This pattern addresses concerns that the conventional results may have arisen from errors, and especially differential errors, in self-reported prepregnancy body weight.

We considered and rejected 2 additional options for applying probabilistic-bias analysis to this large data set. First, we considered calculating a propensity score^{20} to capture the confounding information encoded by the covariates, and then carrying only this propensity score into the simulation. This strategy would have reduced the dimensions of the data set somewhat, because the number of covariates would have been reduced from 10 to 1. We believed, though, that the computational intensity was being driven by the number of records, not the number of covariates, so we anticipated that the reduction in computational intensity would be negligible. In other large data set problems, where the number of covariates is much larger than we had available, this solution may yield an advantage in computational efficiency.

Second, we considered calculating a distribution-bias factor to capture the confounding information encoded by the covariates, in accordance with earlier methods.^{7},^{8} This factor could then be subtracted from the simulation (allowing application of the bias analysis solution to a 5 × 2 table) and reapplied to the summary estimates obtained from the bias analysis. Probabilistic redistribution of the records (or combinations of records) simulated to have been misclassified required either a record-level approach or a frequency-weighted approach, so there was again little to be gained from this solution. For information bias that involves a dichotomous variable (ie, all misclassified information is reclassified into the other category), we expect this solution would be quite effective, and perhaps most effective, at reducing computational intensity.

As data sets grow larger, the accuracy and granularity of the collected data may sometimes suffer. In pooled data projects, data harmonization presents a new avenue for information bias to affect study results.^{21} New approaches to improve the validity of inferences drawn from the data analyses are needed. Probabilistic-bias analysis provides one strategy to estimate the magnitude, direction, and uncertainty arising from these study imperfections.^{22} Collection of validation data in substudies is a hallmark of good epidemiologic practice and should not be avoided or diminished out of concern about the computational challenges one might face when using the validation data. Computers of sufficient computational capacity to overcome the apparent barrier presented by applying simulations to large data sets are available in most research settings. Even when that computing power is not available, or is too expensive to use, alternative strategies can be derived without much difficulty. In our application, the frequency-weighted full cohort analysis yielded exactly the same results as the full cohort analysis, with substantially less computing time and substantially lower data simulation requirements. The solutions we implemented and others we suggest above may compare differently depending on the nature of the computing environment, data set, and planned analyses.

## ACKNOWLEDGMENT

We thank Sara Parisi and Sarah Pugh for data management support.