Accounting for 1% of all births in the United States,1 in vitro fertilization (IVF) is an increasingly common method of assisted reproductive technology (ART). The growth in popularity of IVF has been accompanied by more research studies aimed at identifying and assessing factors that affect IVF success. However, IVF-related studies often present only two-sided P values calculated from χ² tests, Student t tests, or life table analyses.2 These methods convey limited information and do not account for confounding. Moreover, because censoring by treatment cessation may be predictive of the likelihood of a successful outcome (live birth), the validity of these methods is questionable.3,4
IVF data present several statistical challenges.2,4,5 First, the IVF process involves a sequence of steps, and the outcome at one step may influence the next. Oocyte retrieval must be successful to proceed to fertilization, fertilization must be successful to proceed to embryo transfer, and so on; recent work of Penman et al6,7 describes a method for modeling these stages simultaneously. Second, successful pregnancy depends upon both the health of the mother and the viability of the transferred embryos; the “EU” (embryo/uterus) model8 addresses this dual nature of successful gestation. Third, a majority of women pursuing IVF undergo more than one treatment cycle. Because women tend to have similar outcomes in successive pregnancies (eg, low birth weight) in the natural setting,9 it is likely that the results across IVF cycles will be correlated. This lack of independence of cycles must be taken into account in the data analysis.10,11
We focus on methods for addressing the issue of multiple cycles, applying several statistical methods to a dataset from 3 IVF clinics to illustrate variation in validity and interpretation. This paper does not address the statistical issues arising from the multistep nature of IVF; for simplicity we focus on the last stage only—beginning at the transfer of at least one embryo, with success defined as the delivery of at least one live birth.
Data Collection and Demographics
We defined eligibility as all married couples newly enrolled for IVF treatment from 1994 through 2003 at 3 clinics in the Boston area, excluding those using donor gametes or gestational carriers. Self-administered baseline questionnaires were given to the women and men to assess the following: demographic history; menstrual, contraceptive, fertility, and medical history; physical activity; and lifestyle factors. This study was approved by the Brigham and Women's Hospital and Harvard School of Public Health institutional review boards, and all participants gave informed consent.
ART cycle-specific data were abstracted from medical records, including the planned procedure (IVF or gamete intrafallopian transfer [GIFT]), use of gonadotropin-releasing hormone agonists, gonadotropins used, intracytoplasmic sperm injection (ICSI), and monitoring details (follicle counts and estradiol levels, number of oocytes retrieved, semen characteristics, number and quality of the embryos created, details of the embryo transfer, and posttransfer outcomes). The cycle reference date was defined as the day gonadotropin treatment was begun. Ten percent of records were reabstracted with agreement of >99%.
The original study population included 2687 couples undergoing a variety of ART procedures. For the purposes of this methodologic paper, only IVF cycles with the transfer of at least one embryo were included. Any cycles subsequent to a non-IVF cycle were omitted, as were 69 cycles for which data on covariates were missing, and 3 implausible records.
Typical IVF Cycle Methods
IVF cycles began with gonadotropin-releasing hormone agonists in a long (downregulation) or short (flare) regimen. Dosing of lupron for flare generally commenced on day one of the menstrual cycle and continued during gonadotropin administration, whereas downregulation commenced on day 21 of the prior menstrual cycle and continued for at least 2 weeks before gonadotropin administration. Various formulations of gonadotropins were used at doses usually ranging from 150 to 600 ampules per day by intramuscular or subcutaneous injection.
Ovarian response was monitored by measuring serum estradiol levels and using ultrasound to monitor the number and size of follicles, typically on day 6 of the cycle and then every 1–3 days depending on the patient's response. Generally, when there were 2 or more follicles with a diameter of at least 18 mm, human chorionic gonadotropin (hCG) was administered to replicate the preovulatory surge of luteinizing hormone. Transvaginal oocyte recovery generally occurred 36 hours after hCG administration, and oocytes were inseminated by mixing with ∼50,000–300,000 sperm or by ICSI.
If insemination was successful, the embryos were generally cultured for 2–5 days. Some or all of the embryos were transferred to the uterine cavity through a transfer catheter. About 18 days after embryo transfer, a serum pregnancy test (β-hCG) was performed.
If the β-hCG pregnancy test was positive, a pelvic ultrasound was performed to determine whether there was clinical evidence of a pregnancy (at least one fetus with heartbeat). If a clinical pregnancy was detected on ultrasound, follow-up of couples documented whether the pregnancy ended in a live birth or in miscarriage, ectopic pregnancy, molar pregnancy, or stillbirth. The IVF cycle was considered to be successful only if the pregnancy ended in the delivery of at least 1 live newborn.
We surveyed several methods: logistic regression using a single IVF cycle from each subject, generalized estimating equations (GEEs), nonlinear mixed effects models, discrete-time survival analysis, and EU (embryo-uterus) models. All analyses were performed using Statistical Analysis Software version 9.1 (SAS Institute, Inc., Cary, NC).
It is unreasonable to expect IVF outcomes of different cycles from the same woman to be independent. To ignore correlations among cycle outcomes can lead to invalid inferences or loss of statistical power.
A common statistical technique used in the IVF literature is to discard all but one cycle—typically the first or the last. The first-cycle data are viewed as the cleanest, whereas the last-cycle data include every ultimate success; neither considers all IVF treatments. In our dataset, selecting only one cycle would discard approximately half the data, decreasing statistical power to detect associations.
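The single-cycle selection described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the records, the identifiers, and the function name `select_one_cycle` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cycle-level records: (woman_id, cycle_number, live_birth).
cycles = [
    ("A", 1, 0), ("A", 2, 1),
    ("B", 1, 0), ("B", 2, 0), ("B", 3, 0),
    ("C", 1, 1),
]


def select_one_cycle(records, which="first"):
    """Keep a single cycle per woman: her first or her last recorded cycle."""
    if which not in ("first", "last"):
        raise ValueError("which must be 'first' or 'last'")
    by_woman = defaultdict(list)
    for woman_id, cycle_number, live_birth in records:
        by_woman[woman_id].append((cycle_number, live_birth))
    pick = min if which == "first" else max
    # Tuples compare on cycle_number first, so min/max pick by cycle order.
    return {woman_id: pick(rows) for woman_id, rows in by_woman.items()}


first = select_one_cycle(cycles, "first")
last = select_one_cycle(cycles, "last")

# Success proportions under each selection rule, and the cycles thrown away.
first_rate = sum(lb for _, lb in first.values()) / len(first)
last_rate = sum(lb for _, lb in last.values()) / len(last)
discarded = len(cycles) - len(first)
```

In this toy example the last-cycle rule captures every eventual success while the first-cycle rule does not, and either rule discards half of the available cycles.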
To account for correlations of outcomes, generalized estimating equations (GEEs) can be used to make population-averaged inferences and do not require distributional assumptions about the observations. GEEs can be fit using Proc Genmod in SAS, estimating both model-based and empirical standard errors, and the user may select from a variety of correlation structures. We present the GEE model with an exchangeable correlation structure using empirical standard errors to construct confidence intervals.
In contrast to GEE, nonlinear mixed effects models are used to make subject-specific inferences. These models are well suited for analysis of unbalanced data (in which subjects have different numbers of observations). Mixed effects models can also accommodate multiple layers of data clustering, and Ecochard and Clayton10 suggest the use of those models for ART data with sperm donation. A number of estimation methods can be used to fit these models12; in this study, Gaussian quadrature was used with the NLMIXED procedure.
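The distinction between subject-specific and population-averaged effects can be illustrated numerically. The sketch below assumes a logistic model with a Gaussian random intercept and averages the subject-specific probability over that intercept. It uses a simple midpoint rule rather than the Gaussian quadrature mentioned above, and the intercept, coefficient, and random-effect standard deviation are hypothetical values, not estimates from this study.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_probability(linear_predictor, sigma, n_points=2001, width=8.0):
    """Average expit(linear_predictor + b) over b ~ N(0, sigma^2).

    Uses a midpoint rule on [-width*sigma, width*sigma]; this is an
    illustrative stand-in for Gaussian quadrature.
    """
    if sigma == 0:
        return expit(linear_predictor)
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / n_points
    total = 0.0
    for i in range(n_points):
        b = lo + (i + 0.5) * step
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        total += expit(linear_predictor + b) * density * step
    return total


def odds(p):
    return p / (1.0 - p)


# Hypothetical values chosen only for illustration.
intercept = -1.0
beta = math.log(1.5)  # subject-specific log odds ratio for a binary covariate
sigma = 1.0           # standard deviation of the random intercept

p_unexposed = marginal_probability(intercept, sigma)
p_exposed = marginal_probability(intercept + beta, sigma)

subject_specific_or = math.exp(beta)
population_averaged_or = odds(p_exposed) / odds(p_unexposed)
```

With a nonzero random-effect variance, the population-averaged odds ratio computed this way lies closer to 1 than the subject-specific odds ratio, which is why coefficients from the two model types are not directly comparable.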
Although GEE and mixed effects models can cope with correlated data, both have limitations. GEE methods can perform poorly in the situation of nonignorable cluster size (eg, when the number of attempts is related to the outcome of interest).13 GEE estimates may be biased when missing data are not missing completely at random (a necessary model assumption),14 although adjustments such as weighted GEE are possible remedies.15 Similarly, the “conditional independence” assumption in mixed models (ie, that a given woman's cycle outcomes are independent of each other given her random effect) may not be tenable with an IVF dataset, as discussed earlier in the text. Furthermore, because treatment choices such as gonadotropin dose are likely to be guided by clinical responses on previous cycles, the outcome of a particular cycle may affect covariate values at subsequent cycles, adding further to cycle dependence.
Another alternative is the survival analysis approach, considering event time to be the number of cycles until success.16,17 This framework obviates the need to account for within-subject correlation because the woman, rather than the cycle, is the unit of observation. However, instead of the traditional Cox proportional hazards model, which assumes the underlying time scale is continuous, the following discrete survival analysis is tailored to the situation of a discrete time unit (here, IVF attempt number).
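Letting $p_k$ be the probability that a woman succeeds on the kth IVF attempt, given that she has not succeeded on previous attempts, the model stipulates

$$\operatorname{logit}(p_k) = \log\frac{p_k}{1 - p_k} = \alpha_k + \beta_1 x_1 + \cdots + \beta_m x_m, \tag{1}$$

where $\alpha_k$ is an intercept specific to attempt k and $x_1, \ldots, x_m$ are the covariates.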
The model is equivalent to performing unconditional logistic regression, including cycle number as a categorical (ie, nominal) covariate. Comparatively, the Cox proportional hazards model is related to conditional logistic regression.17 The primary disadvantage to any survival analysis is the independent censoring assumption—eg, women who have 2 failed IVF procedures and then discontinue treatment must be representative of all women in the study who had 2 failed IVF procedures.18 Whether this assumption is tenable for a particular IVF dataset must be considered critically.
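To make the data layout concrete, the sketch below expands woman-level records into one row per attempted cycle (the form in which an ordinary logistic regression with cycle number as a nominal covariate would be fit) and computes the cycle-specific conditional success probabilities. With no covariates other than cycle number, the fitted probability for cycle k is simply the proportion of successes among women who attempt cycle k. The data and function names are hypothetical, and the cumulative figure is interpretable only under the independent censoring assumption discussed above.

```python
# Hypothetical woman-level data: (cycles attempted, live birth on final cycle?).
women = [(1, True), (2, True), (3, False), (1, False), (2, False), (1, True)]


def person_period(rows):
    """Expand each woman into one row per attempted cycle: (cycle, success)."""
    expanded = []
    for n_cycles, success in rows:
        for k in range(1, n_cycles + 1):
            # Only the final attempted cycle can be a success.
            expanded.append((k, int(success and k == n_cycles)))
    return expanded


def cycle_hazards(rows):
    """Conditional probability of success at each cycle among those at risk."""
    at_risk, events = {}, {}
    for k, y in person_period(rows):
        at_risk[k] = at_risk.get(k, 0) + 1
        events[k] = events.get(k, 0) + y
    return {k: events[k] / at_risk[k] for k in sorted(at_risk)}


def cumulative_success(hazards):
    """Probability of at least one live birth by cycle k: 1 - prod(1 - p_j)."""
    still_unsuccessful = 1.0
    cumulative = {}
    for k in sorted(hazards):
        still_unsuccessful *= 1.0 - hazards[k]
        cumulative[k] = 1.0 - still_unsuccessful
    return cumulative


rows = person_period(women)
hazards = cycle_hazards(women)
cumulative = cumulative_success(hazards)
```

Each woman contributes as many rows as cycles attempted, so the woman remains the unit of observation even though the fitting dataset is cycle-level.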
We used these 3 methods to fit models for successful IVF that included treatment center (coded 1, 2, 3), study enrollment period (1994–1998, 1999–2003), history of previous live birth (yes, no), woman's age (<35, 35–37, 38–40, >40 years), primary infertility diagnosis (female, male, idiopathic), ampules of gonadotropins divided by 10 (continuous), gonadotropin-releasing hormone agonists regimen (downregulation, flare), ICSI (yes, no), and number of embryos transferred (count).
Generally, the probability of success p is modeled as a function of a linear combination of predictors:

$$g(p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m. \tag{2}$$
Most commonly, logistic regression is used, choosing the logit, $\operatorname{logit}(p) = \log[p/(1 - p)]$, as the link function g.
Proposed by Speirs et al8 and developed further by others,19–21 the EU model posits that successful conception depends on 2 independent factors: the viability of the embryo and the receptivity of the uterus. Let E indicate if an embryo is viable (E = 1 if viable, E = 0 if not) and let e be the probability of viability (e = P [E = 1]). Similarly, let U indicate if the uterus is receptive (U = 1 if receptive, U = 0 if not), and u be the probability of receptivity (u = P [U = 1]). Each of e and u may depend on a combination of covariates:

$$g_U(u) = \gamma_0 + \gamma_1 x_1 + \cdots + \gamma_m x_m, \tag{3}$$

$$g_E(e) = \delta_0 + \delta_1 z_1 + \cdots + \delta_q z_q, \tag{4}$$

where $g_U$ and $g_E$ are link functions and the covariate sets $x$ and $z$ may differ.
A successful IVF cycle requires that the uterus be receptive (U = 1) and that at least one of the embryos be viable ($E_j = 1$ for at least one j). (The EU method was initially proposed for the outcome of “pregnancy attainment” but can be adapted to model live birth.) Writing $e_j = P[E_j = 1]$ for the jth embryo, the probability of success is expressed as:

$$p = u\left[1 - \prod_{j}(1 - e_j)\right], \tag{5}$$
with the product taken over all embryos transferred in that cycle. If embryo-level predictors such as daily cell counts or embryo grade are not available (as in our dataset), then the model simplifies to:

$$p = u\left[1 - (1 - e)^n\right], \tag{6}$$
where n is the number of embryos transferred.
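The EU success probability and the corresponding Bernoulli log-likelihood can be written down directly. The following is a minimal sketch under simplifying assumptions: logit links on both factors, intercept-only linear predictors, and at least one embryo transferred per cycle. Function names and numeric values are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def eu_success_probability(u, embryo_viabilities):
    """Receptive uterus AND at least one viable embryo: u * [1 - prod(1 - e_j)]."""
    none_viable = 1.0
    for e in embryo_viabilities:
        none_viable *= 1.0 - e
    return u * (1.0 - none_viable)


def eu_success_probability_common(u, e, n):
    """Simplified form when all n embryos share viability e: u * [1 - (1 - e)^n]."""
    return u * (1.0 - (1.0 - e) ** n)


def eu_log_likelihood(cycles, u_linear_predictor, e_linear_predictor):
    """Log-likelihood over cycles given as (n_embryos, live_birth) pairs.

    Assumes n_embryos >= 1 and intercept-only predictors on the logit scale.
    """
    u = expit(u_linear_predictor)
    e = expit(e_linear_predictor)
    total = 0.0
    for n_embryos, live_birth in cycles:
        p = eu_success_probability_common(u, e, n_embryos)
        total += math.log(p) if live_birth else math.log(1.0 - p)
    return total


# Hypothetical values: receptivity 0.6, per-embryo viability 0.3.
p_two = eu_success_probability_common(0.6, 0.3, 2)
p_three = eu_success_probability_common(0.6, 0.3, 3)
p_general = eu_success_probability(0.6, [0.3, 0.3])
loglik = eu_log_likelihood([(2, 1), (3, 0), (1, 0)], 0.4, -0.8)
```

The success probability rises with the number of embryos transferred but can never exceed u, which reflects the model's premise that extra embryos cannot compensate for a nonreceptive uterus.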
With the probability of successful IVF decomposed into the E and U factors, this model facilitates investigation of predictors associated with the 2 primary components of success. Most importantly, the EU model is well suited to IVF data that include cycles with donor eggs or gestational carriers, in which the covariate values for the woman providing oocytes differ from those for the woman to whom the resulting embryos are transferred. On the other hand, interpretation of overall association with success of IVF is difficult in the case when one predictor is believed to be related to both viability and receptivity. Also, embryo implantation is observed only in the aggregate, and furthermore, the model assumes that embryo viabilities are independent.
We explored how the EU model compares with the simpler model for the probability of success using the first cycle. Either model is difficult to apply if the study includes more than one IVF attempt per subject, and we additionally present results using data from all cycles undertaken.
In contrast with the repeated-measures methods illustrated above, the EU models were more stable when pared down. We therefore included fewer covariates, omitting those that showed no association and retaining others for comparative purposes. In the EU model, logit links were used on both the E and U factors, and each linear predictor (equations 3 and 4) included history of previous live birth and woman's age. Embryo viability, but not uterine receptivity, was allowed to depend additionally on ampules of gonadotropins and gonadotropin-releasing hormone agonists regimen (ie, these variables were included in equation 4 but not 3). Because the likelihood for the EU model (equation 6) automatically incorporates the number of embryos transferred, it is unnecessary to include this as a covariate; if all women had the same number of embryos transferred, the model could not be fit. The nonlinear mixed models procedure using the likelihood function from equation 6 was used to fit the EU model. In total, the EU model was applied to (1) data from the first cycle only, (2) data from all cycles ignoring between-cycle correlations, (3) data from all cycles with cycle number included as indicator variables, (4) data from all cycles with a Gaussian random effect on the E (embryo viability) factor only, (5) data from all cycles with a Gaussian random effect on the U (uterine receptivity) factor only, and (6) data from all cycles with a Gaussian random effect on both the E and U factors.
After exclusions, data were available on 2318 couples contributing 3913 cycles (Tables 1, 2). Women's mean age was 35.2 years (SD = 4.3, range = 20–49 years); men's mean age was 36.9 years (SD = 5.6, range = 20–69). Male, female, and idiopathic infertility diagnoses were equally represented in this population. The median number of cycles per couple was 2 and the maximum, 6.
The percentage of failed implantations increased (and live birth percentage decreased) with cycle number (Table 2). Other outcomes such as spontaneous abortion did not change noticeably across number of cycles (results not shown). The proportion of cycles failing prior to transfer decreased with cycle number, whereas the proportion of couples discontinuing IVF treatment among those who did not have a live birth increased with cycle number.
Standard errors from any of the models using all cycles tended to be smaller than those from the model using only the first cycle, resulting in tighter confidence intervals (Table 3). For example, in the first cycle model, the estimated odds ratio (OR) of success for women over 40 compared with the referent age group (<35) was 0.21 (95% confidence interval [CI] = 0.13 to 0.33), whereas in the discrete survival model the estimate was 0.21 (0.15 to 0.29). The GEE model with exchangeable correlation structure yielded parameter estimates and standard errors that were extremely close to those for discrete survival analysis. On average, the estimates in the 2 models differed by 7%, and none varied by more than 15%. A mixed effects model was fit with subject-level random intercepts assumed to come from a Gaussian distribution. Comparing these results with those from other models is problematic as the coefficients in a nonlinear mixed effects model have subject-specific rather than population-averaged interpretations.22 Although the interpretations of the odds ratios are not the same, qualitatively the results are similar. The differences in effect magnitude and direction can be striking and nonintuitive; in particular, the subject-specific effect of cycle number seen in the analysis by Hogan and Blazar11 was found to be similar in our dataset (results not shown). Typically, however, within our data the directionality of effects remained the same. For example, the estimated odds ratio of success for downregulation compared with flare was 1.22 (1.00 to 1.48) in the discrete survival model and 1.32 (1.02 to 1.71) in the mixed effects model.
We attempted a total of 6 EU models. Overall, the first-cycle-only model yielded inflated estimates and wider confidence intervals compared with the multicycle models (Table 4). The models that included a Gaussian random effect on either the E factor or the U factor, or those that included a Gaussian random effect on both the E and U factors, failed to parameterize. For the model with a random effect on E, optimization could not be completed; with a random effect on U, there were negative eigenvalues; and with a random effect on both E and U, the model did not converge—highlighting the complexity of the multicycle EU-model statistical methodology.
We conducted additional analyses not tabulated here. To illustrate that logistic regression analyses using only one IVF cycle can vary by choice of cycle, we fit models using the first or last cycles. Using first-cycle data, the estimated odds ratio of successful IVF for women with a previous live birth compared with those without was 1.39 (1.10 to 1.75), whereas using last-cycle data produced an estimate of 1.17 (0.94 to 1.46). To gauge the extent to which within-woman observations were correlated, we computed intraclass correlations (ICC). In the mixed effects model, the estimated ICC was 0.18 (−0.50 to 0.85). In addition, the discrete survival model was refit including number of embryos transferred on the log scale. The likelihood ratio test statistics for improved fit after including number of embryos transferred were 43.14 (standard scale) and 65.43 (log scale), indicating an advantage if the number of embryos transferred variable is log transformed. This transformation makes intuitive sense as it incorporates a decreasing marginal benefit to transferring additional embryos. As an aside, a second motivation for the transformation is that if the log link is applied to the embryo viability factor in equation 6, and a Taylor series expansion is applied, the number-of-embryos-transferred variable appears as an offset term on the log scale.
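The offset argument mentioned above can be written out as a short derivation. This is a sketch under two stated assumptions: $\eta_E$ is notation introduced here for the linear predictor of embryo viability, and e is small enough that terms of order $e^2$ and higher can be dropped.

```latex
% Assumptions: \eta_E is the embryo-viability linear predictor (notation
% introduced for this sketch); n e is small, so higher-order terms are dropped.
\begin{align*}
\log e &= \eta_E
  && \text{(log link on the E factor)} \\
(1 - e)^n &= 1 - n e + \tfrac{n(n-1)}{2} e^2 - \cdots \approx 1 - n e
  && \text{(first-order Taylor expansion)} \\
p = u\left[1 - (1 - e)^n\right] &\approx u \, n \, e \\
\log p &\approx \log u + \log n + \eta_E
\end{align*}
```

On the log scale, the number of embryos transferred therefore enters as $\log n$ with a fixed coefficient of 1, which is the form of an offset term.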
Additionally, for the mixed effects model, we computed empirical Bayes estimates of the random effects. These were found to be somewhat bimodal with a longer tail on the negative side (results not tabulated), suggesting that the normality assumption for random effects may be suspect for IVF data. Tests for departure from normality such as the Cramér–von Mises test also indicated that the Gaussian assumption may be suspect. To address the issue of dependent censoring in the discrete survival model, dropout among women who discontinued after cycles 1–3 was regressed on the type of IVF failure (eg, implantation failure), controlling for the predictors used (eg, woman's age). Rare fail points (such as stillbirth) were excluded from these analyses. None of the fail points was associated at every cycle with a statistically significant increase or decrease in dropout. However, on cycle 1 only, women who experienced spontaneous abortion were more likely to discontinue treatment, and were also more likely to have a later successful cycle. It may therefore be appropriate to include attainment of clinical pregnancy on a previous cycle as a covariate; however, Buck Louis et al5 have advised caution when including previous reproductive outcomes as model predictors as doing so may mask effects.
Researchers do not always make full use of IVF data because of their complexity. ART data present challenges in statistical analysis, including the fact that many couples require more than one treatment cycle to achieve a successful live birth. Although there are several potential approaches to this challenge, some techniques may introduce bias or reduce statistical power. To date, the literature is composed largely of analyses that present nonparametric χ², Fisher exact test, Student t test, and life table analyses. Many of these analyses quantify statistical significance without an evaluation of magnitude of effect or variability of the association, do not make use of the full dataset, or do not control for confounders. Use of multivariate regression methods can address many of these concerns, but must account for correlation in the data and consider the tenability of regression model assumptions. To address these issues, we applied several statistical methods to identify predictors of live birth within a prospective cohort of 2687 couples treated with IVF.
By decomposing the probability of IVF success into embryo viability and uterine receptivity factors, the EU (embryo-uterus) model allows for mechanism-focused inferences, can improve model fit, and has become substantially easier to program with the advent of packages such as the NLMIXED procedure in SAS. However, the richness of the model can lead to complexity in interpretation of results. To fit EU models to all-cycle data, the hierarchical Bayesian approach of Dukic and Hogan21 can be used, but this requires expertise to implement. We observed instability in the EU models with large changes in parameter estimates comparing the single cycle to multiple cycle analyses. Each covariate is entered into the model twice (once for embryo viability and once for uterine receptivity), doubling the parameterization relative to that of the simple logistic regression or discrete survival models. Stability was improved when model covariates were pared down, but this may limit interpretability and data utilization. As an alternative, we fit EU models with random effects on the E factor only, U factor only, and both E and U factors. Computational time was lengthy, estimation for the single random-effect models was sensitive to initial values, and the models ultimately did not converge. In general, we observed greater woman-to-woman and temporal variation in uterine receptivity, whereas embryo viability appeared to be more stable.
Power to detect effects can be diminished if data are discarded; standard errors were noticeably larger in the first-cycle-only logistic regression compared with models that used all cycles. Furthermore, it should be noted that our sample size was large; discarding cycles is likely to have more severe consequences in a typical single-site IVF study.
We fit 3 types of models using all IVF cycles, with various mechanisms to account for within-couple dependence of outcomes: the logistic-normal mixed effects model, GEE with exchangeable correlation structure, and the discrete survival model. Because the survival model is equivalent to ordinary logistic regression with cycle number included as a nominal covariate, this approach is straightforward to interpret and the easiest to implement with standard software. Furthermore, the model is well suited to features of IVF data, such as censoring and cessation of treatment upon success. Use of either discrete survival analysis or GEE requires assumptions about dropout: independent censoring in the case of survival analysis,17 or noninformative cluster size in the case of GEE.13 In this study, the parameter estimates and standard errors in the 2 models were very similar. The mixed effects model is well-suited to women undergoing different numbers of IVF cycles, and is perhaps the least likely to rely on untenable assumptions. However, its coefficients have subject-specific interpretations, and drawing population-level inferences can be complex. The choice of a marginal model compared with a subject-specific model (such as nonlinear mixed effects) should be based primarily on the target of study.23
We confirmed previous observations for the following predictors of IVF success given that an embryo transfer occurred: history of previous live birth, woman's age, gonadotropin-releasing hormone agonists regimen, number of gonadotropin ampules used, and number of embryos transferred. Other covariates (such as body mass index, man's age, primary infertility diagnoses, day-3 estradiol level, number of oocytes retrieved, and ICSI) were considered as well, but they were not found to be important predictors of IVF success. (For purposes of method comparison, diagnosis and ICSI were retained in all models; see eTable, http://links.lww.com/EDE/A474.)
No model is without disadvantages.23 Techniques that use only one IVF cycle per woman can over-simplify the outcome of the overall IVF experience, either by counting only first-cycle successes, or by disregarding the number of attempts required to achieve success. However, if all cycles are used, the likely interdependence of outcomes from the same couple must be taken into account. Each method examined in this paper for dealing with multiple-cycle data relies on various assumptions that may or may not be expected to hold in a given IVF study. Probability models such as the EU and similar methods allow for greater flexibility, but this comes at a cost of increased model complexity. The target of inference and tenability of model assumptions must guide the choice of analytic method.
An encouraging conclusion of this methodologic investigation is that the magnitude of the effects observed and the strengths of the associations did not vary dramatically across models. In fact, the similarity of coefficient estimates across models was surprisingly striking in this exercise with multicenter IVF data. Though it is beyond the scope of this paper, a simulation study could shed more light on these issues. Those conducting ART research and practicing evidence-based medicine can be assured that previously published research that has critically considered and transparently detailed study design, data collection, and confounding control is likely to have produced results within the range of valid effect estimation regardless of the model that was applied. However, the robustness and reported magnitude of effect for individual predictors of IVF success may be inflated or attenuated due to violation of statistical assumptions, and should be critically interpreted.
We thank Paige Williams and Sohee Park for their statistical contribution, Mark Hornstein and Stephanie Estes for their manuscript critiques, and Allison Vitonis for her assistance with data management.
1. Wright VC, Chang J, Jeng G, Macaluso M. Assisted reproductive technology surveillance—United States, 2003. MMWR Surveill Summ
2. McDonough PG. Life-table analysis falls short of the mark! Fertil Steril
3. Land JA, Courtar DA, Evers JL. Patient dropout in an assisted reproductive technology program: implications for pregnancy rates. Fertil Steril
4. Olive DL. Analysis of clinical fertility trials: a methodologic review. Fertil Steril
5. Buck Louis GM, Schisterman EF, Dukic VM, Schieve LA. Research hurdles complicating the analysis of infertility treatment and child health. Hum Reprod
6. Penman R, Heller GZ, Tyler J. Modelling IVF data using an extended continuation ratio random effects model. In: Proceedings of the 22nd International Workshop on Statistical Modelling; July 2–6, 2007; Barcelona, Spain. ISBN 978-84-690-5943-2.
7. Penman R, Heller GZ, Tyler J. Modelling assisted reproductive technology data using an extended continuation ratio model. In: Statistical Solutions to Modern Problems. Proceedings of the 20th International Workshop on Statistical Modelling; July 10–15, 2005; Sydney, Australia.
8. Speirs AL, Lopata A, Gronow MJ, Kellow GN, Johnston WI. Analysis of the benefits and risks of multiple embryo transfer. Fertil Steril
9. Louis GB, Dukic V, Heagerty PJ, et al. Analysis of repeated pregnancy outcomes. Stat Methods Med Res
10. Ecochard R, Clayton DG. Multi-level modelling of conception in artificial insemination by donor. Stat Med
11. Hogan JW, Blazar AS. Hierarchical logistic regression models for clustered binary outcomes in studies of IVF-ET. Fertil Steril
12. Pinheiro JC, Bates DM. Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat
13. Hoffman EB, Sen PK, Weinberg CR. Within-cluster resampling. Biometrika
14. Diggle P, Heagerty P, Liang K-Y, Zeger S. Analysis of Longitudinal Data. 2nd ed. Oxford: Oxford University Press; 2002.
15. Preisser JS, Lohman KK, Rathouz PJ. Performances of weighted estimating equations for longitudinal binary data with drop-outs missing at random. Stat Med
16. Cox DG, Hankinson SE, Kraft P, Hunter DJ. No association between GPX1 Pro198Leu and breast cancer risk. Cancer Epidemiol Biomarkers Prev. 2004;13(11 pt 1):1821–1822.
17. Cox DR, Oakes D. Analysis of Survival Data. London: Chapman & Hall; 1984.
18. Kalbfleisch JD, Prentice R. The Statistical Analysis of Failure Time Data. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2002.
19. Baeten S, Bouckaert A, Loumaye E, Thomas K. A regression model for the rate of success of in vitro fertilization. Stat Med
20. Zhou H, Weinberg CR. Evaluating effects of exposures on embryo viability and uterine receptivity in in vitro fertilization. Stat Med
21. Dukic V, Hogan JW. A hierarchical Bayesian approach to modeling embryo implantation following in vitro fertilization. Biostatistics
22. Heagerty PJ, Zeger SL. Marginalized multilevel models and likelihood inference. Stat Sci.
23. Fitzmaurice GM, Laird N, Ware JH. Applied Longitudinal Analysis. Hoboken, NJ: John Wiley & Sons; 2004.