In the 1939 Metro-Goldwyn-Mayer film, *The Wizard of Oz*, the benevolent Wizard is asked to solve the existential dilemmas of the Scarecrow, Tin Man, and Cowardly Lion. Seemingly unable to provide them the intangible things that they seek, he instead gives them 3 mundane items that he happens to have in his bag. The munificence of the Wizard is not cast into doubt by this sleight of hand, however, because the characters discover that the traits they sought were things they had possessed all along. The article by VanderWeele and colleagues^{1} responds to a crisis in reproductive epidemiology over how to handle measures of maturity and development. Variables such as gestational age and birth weight are ubiquitous in perinatal research, but problematic for all the reasons described in their paper. Analysts want to condition on some measure of maturation to put neonates on an equal developmental footing, but exposures of interest may affect the progress of the pregnancy, making conditional estimates biased as measures of total effect. At the same time, direct effects are generally not identifiable because of unmeasured common causes of the maturity indicator and the outcome, leading to the birth-weight paradox described by the authors. Reproductive epidemiologists therefore look eagerly to method wizards for a magical solution to their problem: for the brain, heart, and courage necessary for their daily travails. What have the methodologists got in their bag?

The first method proposed by VanderWeele et al^{1} is to estimate the total effect of an exposure such as smoking on neonatal mortality within strata defined by the estimated risk of a developmental intermediate such as low birth weight. The authors predict the risk of low birth weight through a regression model and create categories of high and low risk. This is a straightforward extension of the typical practice of estimating total effects within strata of baseline covariates. In this case, the strata are defined as some more complicated function of the covariates, but the approaches are similar; they estimate the total effect of smoking among subcategories of the population. The ability of this method to resolve the birth-weight paradox depends, however, on the discriminatory power of the risk score and the prevalence of the intermediate. The authors estimate the effect of smoking on neonatal mortality as OR_{high risk}=1.6 among those at high risk and OR_{low risk}=1.3 among those at low risk. Rather than implying a resolution of the paradox, however, these estimates are entirely compatible with a state of nature in which smoking actually has a protective effect among low-birth-weight infants.

Imagine we know *p*_{am}=Pr(*Death*|*SET*[*Smoke*=*a*], *SET*[*LBW*=*m*]), where *Smoke*=1 indicates maternal smoking and *LBW*=1 indicates infant low birth weight. If we posit values of *p*_{am}, we can ascertain how well the authors' stratification appears to resolve the paradox. Suppose, for example, that the birth-weight paradox is not actually a paradox and that smoking is truly beneficial in one stratum and harmful in the other. If we take as parameters *p*_{11}=0.04, *p*_{01}=0.06, *p*_{10}=0.05, and *p*_{00}=0.03, then in the absence of confounding, the true controlled direct effect of smoking among normal-weight infants would be OR_{(m=0)}=1.7, whereas among low-birth-weight infants it would be OR_{(m=1)}=0.7. Suppose we apply the stratification proposed by the authors^{1} and our classification scheme has some associated set of positive and negative predictive values (ppv_{(a=1)}, ppv_{(a=0)}, npv_{(a=1)}, and npv_{(a=0)}). The stratified total effects obtained by the proposed method are given by the formulae:
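One form of these formulae that reproduces the worked numbers in the next paragraph is sketched here as a reconstruction, not as the original display. It assumes that ppv_{(a)} is the proportion of the high-risk stratum with smoking status *a* that is truly low birth weight, that npv_{(a)} is the proportion of the low-risk stratum that is truly normal weight, and that there is no confounding. The stratum-specific risks r^{H}_{a} and r^{L}_{a} are notation introduced only for this sketch.

```latex
\begin{align*}
r^{H}_{a} &= \mathrm{ppv}_{(a)}\,p_{a1} + \bigl(1-\mathrm{ppv}_{(a)}\bigr)\,p_{a0},
&
\mathrm{OR}_{\text{high risk}} &= \frac{r^{H}_{1}/\bigl(1-r^{H}_{1}\bigr)}{r^{H}_{0}/\bigl(1-r^{H}_{0}\bigr)},\\[4pt]
r^{L}_{a} &= \bigl(1-\mathrm{npv}_{(a)}\bigr)\,p_{a1} + \mathrm{npv}_{(a)}\,p_{a0},
&
\mathrm{OR}_{\text{low risk}} &= \frac{r^{L}_{1}/\bigl(1-r^{L}_{1}\bigr)}{r^{L}_{0}/\bigl(1-r^{L}_{0}\bigr)}.
\end{align*}
```

Under these assumptions each risk stratum is a mixture of truly low-birth-weight and truly normal-weight infants, so the stratified total effect moves toward whichever controlled direct effect dominates that mixture.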

For example, if the risk classification model has ppv_{(a=1)}=ppv_{(a=0)}=0.9 and npv_{(a=1)}=npv_{(a=0)}=0.9, then equation (1) yields OR_{low risk}=1.5 and OR_{high risk}=0.7, which are similar to the controlled direct effects. However, an intermediate with a low prevalence such as low birth weight will have poor ppv unless the specificity is nearly 1.0. If the risk classification model has ppv_{(a=1)}=ppv_{(a=0)}=0.3 and npv_{(a=1)}=npv_{(a=0)}=0.9, then equation (1) yields OR_{low risk}=1.5 and OR_{high risk}=1.2. A practitioner could be tempted to believe these results imply that the qualitative effect modification referred to as a “paradox” does not really exist when, in fact, it does. In the absence of extremely good predictive models for low birth weight, it is therefore hard to have much faith in the ability of this method to detect a “paradox” were it actually to exist. The authors' finding of harmful effects of smoking in both risk strata cannot be taken as evidence of having “resolved” the paradox; in the absence of an excellent prediction tool for the intermediate, it is the answer one would expect regardless of the truth.
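The arithmetic in this paragraph can be checked with a short script. This is a minimal sketch, assuming that each risk stratum is a mixture of truly low-birth-weight and normal-weight infants with mixing weights given by the predictive values, and that there is no confounding. The function and variable names are illustrative and do not come from the article under discussion.

```python
def odds(risk):
    """Convert a risk (probability) to odds."""
    return risk / (1.0 - risk)


def odds_ratio(risk_exposed, risk_unexposed):
    """Odds ratio comparing exposed with unexposed risk."""
    return odds(risk_exposed) / odds(risk_unexposed)


def stratum_risk(p_lbw, p_normal, share_lbw):
    """Mortality risk in a stratum that mixes LBW and normal-weight infants.

    share_lbw is the assumed proportion of the stratum that is truly LBW.
    """
    return share_lbw * p_lbw + (1.0 - share_lbw) * p_normal


def stratified_total_effects(p, ppv, npv):
    """Return (OR_low_risk, OR_high_risk) for smoking on mortality.

    p[(a, m)] is the risk of death with smoking a and LBW m;
    ppv[a] and npv[a] are predictive values among smoking status a.
    """
    high = {a: stratum_risk(p[(a, 1)], p[(a, 0)], ppv[a]) for a in (0, 1)}
    low = {a: stratum_risk(p[(a, 1)], p[(a, 0)], 1.0 - npv[a]) for a in (0, 1)}
    return odds_ratio(low[1], low[0]), odds_ratio(high[1], high[0])


# Posited risks from the text: p11, p01, p10, p00.
P = {(1, 1): 0.04, (0, 1): 0.06, (1, 0): 0.05, (0, 0): 0.03}

print("CDE normal weight:", round(odds_ratio(P[(1, 0)], P[(0, 0)]), 2))
print("CDE low birth weight:", round(odds_ratio(P[(1, 1)], P[(0, 1)]), 2))

for ppv_value in (0.9, 0.3):
    ppv = {0: ppv_value, 1: ppv_value}
    npv = {0: 0.9, 1: 0.9}
    or_low, or_high = stratified_total_effects(P, ppv, npv)
    print(f"ppv={ppv_value}: OR_low={or_low:.2f}, OR_high={or_high:.2f}")
```

With ppv=0.9 the high-risk odds ratio stays below 1.0, but with ppv=0.3 it rises above 1.0 even though the posited controlled direct effect among low-birth-weight infants is protective. That reversal is the masking described above.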

Jumping to the third method, the authors^{1} propose estimation of the principal-stratum direct effect. Resolution of the problem by means of restriction to a stratum in which exposure has no effect on the intermediate predates the cited article of Frangakis and Rubin^{2}—for example, in the work of Joffe et al.^{3},^{4} But Joffe and colleagues identified this stratification on substantive grounds, whereas the authors' approach here treats it as a latent factor that must be estimated with a model. The proposed sensitivity analysis parameter is an honest way of dealing with the inherent uncertainty of this estimation, but one that will often be difficult in practice. There is little substantive guidance available about the likely magnitude of this parameter, only its sign. Even if one happens to get this value right in the analysis, however, the effect estimate must be applied in public-health terms to a subpopulation whose membership and prevalence are unknown.^{5} The authors are quite frank about all of these dilemmas, which leaves them not entirely sanguine about the prospects for this approach.

This leaves approach 2, which has been applied elegantly in a previous paper^{6} and which we agree will be of greatest value. This approach follows a long line of methodological development for effect estimation in the presence of unmeasured covariates.^{7} By specifying the prevalence of an unmeasured confounder within strata of the intermediate, as well as the magnitude of its effect on the outcome, the authors^{1} can correct the observed direct-effect estimates. As they note, one need not specify the parameters perfectly; it can be illuminating merely to estimate the extent of bias under alternative guesses at the bias parameters. But let's go one step further by rephrasing the question: what values of the parameters given in the authors' expression (3) could resolve the birth-weight paradox? The resolution of the paradox can be viewed in different ways; however, because the greatest concern seems to be that smoking could appear to be protective in some stratum, one might wonder what parameter values lead to bias equal to the observed OR, and therefore to an adjusted OR=1.0. To determine this, one can rearrange expression (3) given by the authors as:

where π_{am} is the prevalence of the unmeasured confounder among those with smoking status *a* in low birth weight stratum *m*, *B* is the amount of bias necessary to render the observed association null, and γ is the strength of the relationship between the unmeasured confounder and disease. In this case, a value of *B*=0.76 would lead to a perfectly null bias-adjusted OR and a resolution of the paradox. Rather than examine a few token examples, we plot the values of γ necessary at all combinations of π_{0m} and π_{1m} in the Figure. Notably, for some prevalence combinations, there is no value of γ that can explain the paradox according to this formula. Moreover, prevalence combinations that approach π_{1m}=*B* × π_{0m} require increasingly large confounder–outcome associations to explain the paradox, whereas prevalence combinations further from that line require less substantial effects. In this setting, with limited substantive knowledge, it may be more useful to examine a graph such as this than 2 or 3 discrete points.
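The search over prevalence combinations can be sketched in a few lines. The sketch rests on an assumption: that the bias expression takes the common single-binary-confounder form *B* = [1 + (γ − 1)π_{1m}] / [1 + (γ − 1)π_{0m}], which rearranges to γ = 1 + (*B* − 1)/(π_{1m} − *B*π_{0m}). This form is assumed, not quoted from the authors' expression (3), and the function names are illustrative.

```python
def bias_factor(gamma, pi1, pi0):
    """Assumed bias form: ratio of (1 + (gamma - 1) * prevalence) terms."""
    return (1.0 + (gamma - 1.0) * pi1) / (1.0 + (gamma - 1.0) * pi0)


def required_gamma(pi1, pi0, B):
    """Solve the assumed bias form for gamma given prevalences and target B.

    Returns None when no positive gamma satisfies the equation, including
    points on the line pi1 = B * pi0 where the solution diverges.
    """
    denom = pi1 - B * pi0
    if abs(denom) < 1e-12:
        return None
    gamma = 1.0 + (B - 1.0) / denom
    return gamma if gamma > 0.0 else None


B = 0.76  # bias needed to render the observed association null (from the text)

# Tabulate gamma over a coarse grid of confounder prevalences.
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
for pi0 in grid:
    for pi1 in grid:
        gamma = required_gamma(pi1, pi0, B)
        label = "none" if gamma is None else f"{gamma:.2f}"
        print(f"pi0={pi0:.1f} pi1={pi1:.1f} gamma={label}")
```

Under this assumed form, grid points near the line π_{1m}=*B* × π_{0m} return very large values of γ, and points where the computed γ would be zero or negative return no solution, which agrees with the pattern described for the Figure.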

Approach 2 involves estimation of controlled direct effects, which have an established place in the epidemiologic toolkit.^{8} The authors^{1} use a sensitivity parameter for unmeasured confounding of the intermediate, which is a pragmatic and effective elaboration. This seems to be the superior strategy of the 3 considered in this paper, as the authors themselves note. However, it is important to add that a crucial consistency assumption is not met for an effect premised on intervention when no such intervention is realistic.^{9} We presumably have the technology, if not the malice, to fix all births as low birth weight by induction of premature labor. We have no mechanism for controlling pregnancies to normal birth weight, and, more importantly, these effect estimates have no pretense of corresponding to any real-world interventions. The authors therefore caution that the estimates should be granted no causal interpretation. Unfortunately, epidemiologists face this dilemma exactly because we need causal estimates. A more satisfying solution to this problem awaits improved knowledge of the unmeasured common causes of low birth weight and neonatal mortality. And so, like Dorothy at the end of the film, epidemiologists may find that the magic needed to resolve their problem actually lies in the substantive and methodological knowledge in their own backyard.