
Which Propensity Score Method Best Reduces Confounder Imbalance? An Example From a Retrospective Evaluation of a Childhood Obesity Intervention

Schroeder, Krista; Jia, Haomiao; Smaldone, Arlene

doi: 10.1097/NNR.0000000000000187


Childhood obesity is one of the nation’s greatest child health threats, with 17% of children meeting criteria for obesity (body mass index [BMI] ≥ 95th percentile for age and gender; Ogden, Carroll, Kit, & Flegal, 2014). Childhood obesity is associated with many health comorbidities during childhood, negative psychosocial and academic outcomes, and adult obesity (Daniels, 2006; Freedman et al., 2005; Rappaport, Daskalakis, & Andrel, 2011). Schools are an ideal setting in which to implement childhood obesity interventions (Institute of Medicine, 2012; Waters et al., 2011), and school-based obesity interventions can be effective in reducing children’s BMI percentile (Katz, O’Connell, Njike, Yeh, & Nawaz, 2008; Schroeder, Travers, & Smaldone, 2016; Sobol-Goldberg, Rabinowitz, & Gross, 2013; Waters et al., 2011). However, many school-based obesity interventions are implemented in a voluntary manner, with only students who are interested in the intervention actually receiving the intervention. Students who do not want to participate (or whose parents do not want their child to participate) do not receive the intervention. As a result of lack of randomization, key differences may exist between intervention participants and nonparticipants (Shadish, Cook, & Campbell, 2002). Differences may include factors that are known to influence childhood obesity (e.g., gender, race/ethnicity, socioeconomic status; Davison & Birch, 2001); these differences can confound the relationship between the intervention and outcome of interest, in this case, BMI percentile change. Such confounder imbalance introduces bias (Rubin & Zell, 2016). Therefore, statistical methods that reduce bias in observational studies are required.

Various statistical methods can be used to control for bias when assignment to treatment or control conditions is not random (i.e., in quasi-experimental or observational studies; Eckardt & Rindskopf, 2016). One important approach is the application of propensity scores (PSs). A PS, developed by Rosenbaum and Rubin (1983), is the probability of an individual being in the intervention group, given his or her baseline characteristics. The PS is typically calculated using logistic regression, in which the individual’s characteristics (potential confounders) are the predictors and membership in the intervention group is the outcome. PSs can be applied in observational studies to reduce bias in estimating the relationship between an intervention and outcome. On the basis of the potential outcomes framework, the unconfounded effect of an intervention can be ascertained when the PS balances all confounders between the nonrandomized intervention and control groups (Rubin & Zell, 2016). Various methods of PS application exist (Rosenbaum & Rubin, 1983). PS methods may be superior to other methods of controlling for confounding in nursing research, such as sample stratification and matching (which can only account for a limited number of known confounders) or regression analysis (which may result in residual bias if the intervention and control group are heavily imbalanced on baseline characteristics; D’Agostino & Kwan, 1995; Qin, Titler, Shever, & Kim, 2008; Rosenbaum & Rubin, 1983).

The scholarly discussion about and use of PSs in both the general biomedical and nursing literature has markedly increased during the past 10 years (Figure 1), necessitating that nurse scientists who work with observational data have a working knowledge of this analytic technique (Eckardt, 2012; Qin et al., 2008; Shadish & Steiner, 2010). Need for PS use will likely increase with the explosion of observational data available to nurse scientists via electronic medical records (Clarke & Cossette, 2000; Lin, Jiao, Biskupiak, & McAdam-Marx, 2013). Currently, many papers in the nursing literature that discuss PS methods do so by applying PSs prior to their study analyses (e.g., Moser et al., 2014; Stimpfel, Rosen, & McHugh, 2014). This article is one of the few within the nursing literature to focus on PS methodology (Eckardt, 2012; Qin et al., 2008; Shadish & Steiner, 2010) and, to our knowledge, the first in the nursing literature to examine effects of different PS methods on improvement in confounder balance.

Figure 1. Publications retrieved from PubMed and Cumulative Index to Nursing and Allied Health Literature (CINAHL) from 2004 to 2013 using the key term propensity score*.


The purpose of our study was to apply three PS methods to an observational data set in order to determine which method best improved confounder balance. We measured confounder balance by examining the number of confounders with significant differences between the intervention and control groups. We chose this approach to measuring confounder balance because it is easily observable, quantifiable, and applicable across various PS methods.


Definition and Creation

A PS is the conditional probability of assignment to an intervention given a vector of observed covariates (Rosenbaum & Rubin, 1983). Applied to nonrandomized intervention studies, a participant’s PS is the probability of receiving the treatment or intervention in question given characteristics such as gender, race, or neighborhood poverty level. Because a PS is a probability, values range from 0 to 1. Two participants with identical PSs can be considered to have the same probability of receiving the intervention. In this manner, the PS can help create a proxy intervention and control group within observational cohort data (D’Agostino, 1998; D’Agostino & Kwan, 1995). For example, a student with a PS of 0.72 who received the intervention can be considered comparable on key confounders to a student with a PS of 0.72 who did not receive the intervention, because the PS has adjusted for their measured differences. It is important to note that PSs are created using measured covariates only. Therefore, PS methods cannot fully mimic a randomized controlled trial, which also balances unmeasured confounders (D’Agostino, 1998; D’Agostino & Kwan, 1995; Wagner, 2012). However, in large data sets, many variables (i.e., potential confounders) are often available to the researcher for creation of a PS, reducing the risk of excluding important confounders (Brookhart et al., 2006; D’Agostino, 1998).

Logistic regression is used to compute a PS for each person. The outcome is the log odds of a binary variable (receiving the treatment versus being in the control group). The predictors are the k potential confounders of the treatment effect. The equation is

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k,$$

where p is the probability of being in the treatment group, (1 − p) is the probability of being in the control group, X_1, ..., X_k are the k potential confounders, β_0 is the intercept, and the coefficients β_1, ..., β_k give the difference in log odds of being in the treatment group for each unit difference in the confounder, adjusting for all other confounders. Once the coefficients have been estimated, the PS is obtained as the estimated probability of membership in the treatment group.
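
As an illustration of the estimation step just described, the following is a minimal Python sketch that fits the logistic model by gradient ascent and returns each person's estimated probability of treatment. It is not the SAS code used in the study; the function names, the toy data, and the tuning values (`lr`, `n_iter`) are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, t, lr=0.5, n_iter=2000):
    """Estimate b0..bk in logit(p) = b0 + b1*x1 + ... + bk*xk
    by gradient ascent on the mean log-likelihood (hypothetical settings)."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            err = ti - p                      # residual on the probability scale
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * xi[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """PS = estimated probability of membership in the treatment group."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi))) for xi in X]

# Toy data (hypothetical): two binary confounders, treatment indicator t (1 = treated).
X = [[0, 0], [0, 0], [0, 0], [1, 0], [1, 0], [1, 0],
     [0, 1], [0, 1], [0, 1], [1, 1], [1, 1], [1, 1]]
t = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

beta = fit_logistic(X, t)
ps = propensity_scores(X, beta)
```

In this toy data set, only the first confounder is related to treatment, so participants who share its value receive the same PS regardless of the second confounder. In practice, a statistical package's logistic regression routine would replace the hand-written fitting loop.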

The PS can then be used in a number of ways to balance confounders in the estimate of treatment effect on the experimental outcome. When determining which predictors to include in the PS model, Rosenbaum (2002) cautions against using only predictors that significantly differ between groups because (a) this does not consider the relationship between predictor and outcome; (b) just because the difference between groups on a predictor is not statistically significant, it doesn’t mean that it can be ignored; and (c) this process considers predictors only one at a time, whereas the logistic model considers the predictors as a group. As such, it is suggested that all available predictors potentially related to the outcome are included in a PS model unless there is a theoretical reason not to do so (Brookhart et al., 2006).

After creation of the PS, data should be further examined before proceeding with analysis. The distribution of PSs by group should be examined via graphical display using a histogram or boxplot. Presence of a large overlapping area of PSs, or “common support,” indicates that the use of PSs will help to balance the intervention and control groups on key confounders (Wagner, 2012). Although there is no theoretical guidance on what exactly merits common support, it can be considered the range of PSs with at least five observations per PS in both the intervention and control groups (L. Li, Kleinman, & Gillman, 2014). Limiting analyses to only participants whose PSs fall under the common support is recommended, because characteristics of individuals outside the common support may be too different to compare without introducing significant bias in the estimate of treatment effect (F. Li, Morgan, & Zaslavsky, 2014; Wagner, 2012). After the common support is examined, the PS can be applied to the data using multiple methods. We compared three methods—PS matching, PS stratification, and PS weighting—to determine which method best improved confounder balance in one data set.
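
The common support check can be sketched in a few lines of Python. This sketch uses a simplified definition (the overlap between the minimum and maximum PS in each group) rather than the five-observations-per-PS rule cited above, and the function names are hypothetical. The example values are the group PS ranges and means reported in the results of this study.

```python
def common_support(ps_treated, ps_control):
    """Simplified common support: the interval where the two PS ranges overlap.
    Assumption: min/max overlap, not the five-observations-per-PS rule."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    if low > high:
        raise ValueError("PS ranges do not overlap; no common support")
    return low, high

def within_support(ps, low, high):
    """Flag each participant whose PS falls inside the common support."""
    return [low <= p <= high for p in ps]

# Minimum, mean, and maximum PS reported for each group in this study.
intervention_ps = [0.005, 0.085, 0.491]
control_ps = [0.002, 0.050, 0.621]

low, high = common_support(intervention_ps, control_ps)
keep_control = within_support(control_ps, low, high)
```

Under this simplified definition, control participants with a PS below 0.005 or above 0.491 would fall outside the common support.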

PS Methods


In PS matching, participants who received the intervention are matched by PS to participants who did not receive the intervention (Rosenbaum & Rubin, 1985a), creating two groups that are comparable on potential confounders. After creation of the groups, the treatment outcomes for each group can be directly compared to estimate the intervention effect (Austin, 2011; D’Agostino, 1998; Hill & Reiter, 2006; Rosenbaum & Rubin, 1985a).

Several methodological decisions are required when implementing PS matching: (a) ratio of control to intervention group, (b) replacement versus nonreplacement, (c) greedy versus optimal, and (d) nearest neighbor versus caliper matching. The ratio refers to how many control subjects are matched to an intervention subject; usually, a 1:1 ratio is selected, but many-to-one matching may be employed (Gu & Rosenbaum, 1993). Matching can be performed either with or without replacement (Austin, 2011). When matching with replacement, control participants who are matched to intervention participants are returned to the pool for potential matching with another intervention participant. Matching with replacement requires a special variance estimator to account for the fact that one control participant may be matched to multiple intervention participants (or vice versa; Hill & Reiter, 2006). Greedy and optimal matching differ in how proximity is used to match intervention and control participants. Greedy methods match the intervention participant with the closest control participant, regardless of whether the control participant would be a closer match for another intervention participant. Optimal matching is based on minimizing the total within-pair difference on PSs (Austin, 2011). Optimal matching has been found to be no better than greedy matching in reducing estimator bias (Gu & Rosenbaum, 1993). Lastly, PS matching may be performed using either nearest neighbor or caliper methods (or a combination of both methods). In nearest neighbor methods, an untreated participant is matched to a treated participant with the closest PS, but no defined PS distance is considered too great for matching. Alternatively, in caliper matching, a predetermined distance between PSs, defined by the investigator, is considered the maximum allowable distance. Using the caliper matching approach, a participant lacking a match within that caliper is excluded from analysis. Recommended optimal caliper distances range between 0.05 (Kurth et al., 2006) and 0.2 SD of the logit of the PS (Austin, 2011). The nearest neighbor method is the easiest from a computational standpoint and may be superior to caliper matching methods in minimizing bias in the treatment effect estimate (Austin, 2010), though nearest neighbor matching within calipers defined by the PS may be superior in balancing covariates between groups (Rosenbaum & Rubin, 1985b). Of note, PS matching may lead to some participants being excluded from the final analyzed sample because they do not have a match within the specified caliper (e.g., closest match is within 0.07, but specified caliper is 0.05) or exceed the matching ratio (e.g., data set includes more control than intervention participants, but matching ratio is 1:1). When a PS matching procedure is used, it is recommended that outcome analyses be conducted using a paired t-test or a Wilcoxon signed rank test because the matching process introduces a correlated structure to the data (Lanehart et al., 2012).
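
To make these choices concrete, the following Python sketch implements one combination: 1:1 greedy nearest neighbor matching without replacement within a caliper. The function name, the processing order (intervention participants are taken in the order given), and the toy PS values are assumptions for illustration; dedicated matching procedures in statistical packages offer more options.

```python
def greedy_caliper_match(ps_treated, ps_control, caliper=0.05):
    """1:1 greedy nearest neighbor matching without replacement.
    Returns (treated_index, control_index) pairs; intervention participants
    with no control within the caliper are left unmatched."""
    available = list(range(len(ps_control)))   # controls not yet matched
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        # Greedy step: take the closest remaining control, even if that control
        # would have been a closer match for a later intervention participant.
        j = min(available, key=lambda c: abs(ps_control[c] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)                # without replacement
    return pairs

# Toy propensity scores (hypothetical).
ps_treated = [0.10, 0.12, 0.40]
ps_control = [0.11, 0.30, 0.13, 0.02]

pairs = greedy_caliper_match(ps_treated, ps_control, caliper=0.05)
```

Here the third intervention participant (PS = 0.40) has no control within 0.05 and is therefore excluded, which illustrates how caliper matching can shrink the analyzed sample.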


In PS stratification, participants are first ranked by PS and then divided into strata (Austin, 2011). Although there is no predetermined number of strata that should be used in all studies, five are typically recommended because five strata can reduce up to 90% of bias (Cochran, 1968; Rosenbaum & Rubin, 1984); every participant is included in one of the five strata. Within each stratum, PSs of participants are similar—more similar than they are across the sample as a whole. After participants are divided into strata, the effect of treatment on outcome can be estimated for the intervention versus control group within each stratum. Outcome analyses can be conducted by pooling the weighted outcome for each stratum, with each stratum weighted by the percentage of participants within that stratum (Austin, 2011). For example, if participants are divided into five strata, then the outcome in each stratum will be weighted by 1/5 or 0.20, when estimating the effect of treatment on the outcome. This can be conceptualized as a “meta-analysis of a set of quasi-RCTs” (Austin, 2011).
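
The ranking, splitting, and pooling steps can be sketched in Python as follows. This is a simplified illustration that pools within-stratum mean differences weighted by each stratum's share of participants; it is not the fixed effect meta-analysis model used for the outcome analysis in this study, and the function names and toy data are hypothetical.

```python
def assign_strata(ps, n_strata=5):
    """Rank participants by PS and split them into n_strata groups of
    (nearly) equal size. Returns the stratum index (0-based) per participant."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(ps)
    return strata

def stratified_effect(outcome, treated, strata, n_strata=5):
    """Pool within-stratum mean differences (intervention minus control),
    weighting each stratum by its proportion of all participants."""
    n = len(outcome)
    pooled = 0.0
    for s in range(n_strata):
        idx = [i for i in range(n) if strata[i] == s]
        y1 = [outcome[i] for i in idx if treated[i] == 1]
        y0 = [outcome[i] for i in idx if treated[i] == 0]
        if not y1 or not y0:
            raise ValueError("stratum %d lacks one of the two groups" % s)
        diff = sum(y1) / len(y1) - sum(y0) / len(y0)
        pooled += diff * len(idx) / n
    return pooled

# Toy data (hypothetical): 10 participants, so each of 5 strata holds 2.
ps      = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
treated = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
outcome = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0]

strata = assign_strata(ps)
effect = stratified_effect(outcome, treated, strata)
```

With five equal strata, each stratum contributes a weight of 0.20 to the pooled estimate, matching the example in the text.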


In PS weighting, participants are weighted by the inverse of their PS. This weighted sample can be used to determine an unconfounded estimate of the outcome (F. Li et al., 2014). Outcome scores of intervention participants are weighted by a factor of 1/PS; scores of participants who did not receive the intervention are weighted by 1/(1 − PS). PS weighting uses exact PS values, which avoids the possibility of residual confounding that exists when inexact matching and stratification are used (F. Li et al., 2014). One concern with PS weighting is that standard errors of the treatment effect may be larger than with other PS methods because of extreme weighting of participants with PSs near 0 or near 1. One potential solution to this issue is to trim weights (Potter, 1993) by excluding individuals with extreme weights (e.g., <1st or >99th percentile) from the analysis (Cole & Hernan, 2008; Lee, Lessler, & Stuart, 2011). Doing so may increase accuracy of the treatment effect (Kurth et al., 2006). Another method to minimize extreme weighting is to use bounded overlap weights. A discussion of bounded overlap weights method is beyond the scope of this article, but interested readers are referred to L. Li et al. (2014). Once weighting has been applied to the data, outcome analyses can be conducted using weighted least squares regression.
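
The weight calculation can be sketched in Python as follows. The normalization step (dividing each weight by the mean weight) mirrors the approach described later for this study; the function names and toy values are assumptions for illustration.

```python
def ipw_weights(ps, treated):
    """Inverse probability weights: 1/PS for intervention participants,
    1/(1 - PS) for control participants."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p) for p, t in zip(ps, treated)]

def normalize(weights):
    """Divide each weight by the mean weight so the weights average 1."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]

# Toy data (hypothetical): two participants with a small PS, two with PS = 0.5.
ps = [0.05, 0.05, 0.50, 0.50]
treated = [1, 0, 1, 0]

weights = ipw_weights(ps, treated)
normalized = normalize(weights)
```

The first participant (treated, PS = 0.05) receives a raw weight of 20, whereas the control participant with the same PS receives a weight of about 1.05. This is the kind of extreme weighting near 0 or 1 that motivates trimming.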

Sample Size

Final analytic sample sizes differ for each PS method. For PS 1:1 matching, the sample size is twice the number of intervention participants. For PS stratification, the sample size includes all intervention and control participants. For PS weighting, the sample size includes all intervention and control participants. Sample SAS code used for implementing each PS method is available (see Table, Supplemental Digital Content 1,


We designed a retrospective observational study using electronic medical record data to examine the effect of a school nurse-led obesity program on BMI percentile change at 1 year among kindergarten through fifth-grade students who met the criteria for severe obesity (BMI for age and gender at the 99th percentile or 120% of the 95th percentile; Flegal et al., 2009). The program, the Healthy Options and Physical Activity Program (HOP), was implemented in New York City (NYC) public schools in 2010. The goal of HOP is to help children improve health behaviors and maintain or decrease BMI percentile. During HOP sessions, school nurses provide education and counseling and assist students with goal setting around five health behaviors: fruit/vegetable intake, sedentary media use, physical activity, sugar-sweetened beverage intake, and portion size. School nurses document participation in the program in the students’ electronic medical records. During the 2012/2013 school year, 1,054 children participated in HOP (intervention group) and 19,464 were eligible for but did not participate in HOP (control group). Because HOP participation is voluntary and not randomly assigned, there are likely inherent differences between children with severe obesity who received HOP and children with severe obesity who were eligible for but did not receive HOP.

Data used in the creation of the PS came from three sources: the electronic medical records in the public schools (primary data source), the NYC Department of Health and Mental Hygiene Office of School Health Reports, and the NYC Center for Economic Opportunity. Variables were organized using the socioecological model, which acknowledges that individual, family, institution, and community-level factors influence a child’s body weight (Davison & Birch, 2001). Although the purpose of the HOP study was to examine multiple aspects of HOP implementation and efficacy, in this article we limit discussion to improvement in confounder balance resulting from use of PS methods.

PS Creation

Eleven potential confounders (community poverty level, school poverty level, school nurse workload, household food insecurity, baseline BMI, baseline BMI percentile, age, race/ethnicity, grade, gender, and diagnosis of at least one chronic illness) were available for incorporation into a PS. Logistic regression based on participation in HOP (1 = yes, 0 = no) was used to compute the PS. Following creation of the PS, a histogram was examined to assess common support and each PS method was applied.

Using PS matching, each intervention child was matched to a control child (1:1 matching) with the most similar PS (caliper of 0.05). Greedy matching without replacement was employed. Once a pair was matched, the control child was not returned to the pool for further matching. Using PS stratification, intervention children and control children were divided into five equal strata based on PSs. Using PS weighting, each intervention child was assigned a weight of 1/PS, and each control child was assigned a weight of 1/(1 − PS). Weights were then normalized by dividing each weight by the mean weight.

Improvement in Confounder Balance

Distribution of key confounders between the intervention and control groups was compared before application of PSs and after use of each PS method. For each confounder of interest, the difference between the intervention and control groups was tested using either Wilcoxon rank sum tests for continuous variables or χ2 tests for dichotomous or categorical variables. For PS matching only, differences were tested using methods that account for the correlated structure of the matched data (Wilcoxon signed rank tests for continuous variables, McNemar tests for binary variables, and Friedman tests for nonbinary variables). Following application of each PS method, the total number of variables demonstrating significant differences between the intervention and control groups was tallied. For PS stratification, differences between the intervention and control group were tested within each of the five strata; if a statistically significant difference existed in at least one stratum, the confounder was counted as a significant difference even if it did not differ within the other four strata. All analyses were performed using SAS 9.4.
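
The tallying rule for stratification can be expressed as a short Python sketch. The p values below are invented for illustration, and the significance threshold of .05 is an assumption, as the threshold is not stated in the text.

```python
def imbalanced_in_any_stratum(p_values, alpha=0.05):
    """A confounder counts as imbalanced if the group difference is
    significant in at least one stratum (alpha = .05 is an assumption)."""
    return any(p < alpha for p in p_values)

def count_imbalanced(p_values_by_confounder, alpha=0.05):
    """Tally confounders that are imbalanced under the any-stratum rule."""
    return sum(
        1 for p_values in p_values_by_confounder.values()
        if imbalanced_in_any_stratum(p_values, alpha)
    )

# Hypothetical within-stratum p values for three confounders (five strata each).
p_values_by_confounder = {
    "age": [0.40, 0.03, 0.60, 0.70, 0.55],           # significant in 1 of 5 strata
    "gender": [0.20, 0.35, 0.90, 0.41, 0.66],        # significant in none
    "baseline BMI": [0.01, 0.04, 0.30, 0.02, 0.51],  # significant in 3 of 5 strata
}

n_imbalanced = count_imbalanced(p_values_by_confounder)
```

Under this rule, a single significant stratum is enough for a confounder to be counted, so "age" and "baseline BMI" are both tallied even though they differ in how many strata show a difference.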

Impact on Outcome Variable

An exploratory analysis was conducted to examine the impact of each PS method on estimation of the treatment effect on the outcome variable—BMI percentile change at 1 year. Prior to application of PSs, BMI percentile change at 1 year was calculated for the intervention and control groups. Group differences were then compared using a Wilcoxon rank sum test. Following PS application, differences between groups were analyzed using the Wilcoxon signed rank test (matching), a fixed effect meta-analysis model (stratification), and weighted ordinary least squares regression (weighting) using SAS 9.4 (Cary, NC) and Comprehensive Meta-analysis statistical software (Comprehensive Meta-analysis, 2015).

Institutional review board approval was obtained from Columbia University Medical Center, the NYC Department of Health and Mental Hygiene, and the NYC Department of Education.


Of 20,518 children with severe obesity, 1,054 (5.1%) received the HOP intervention. Baseline characteristics of the sample prior to application of PSs are listed in Table 1. Groups differed on seven of 11 potential key confounders. On average, intervention children went to a school where fewer students received free/reduced lunch and the nurse workload was lower. Intervention children had higher baseline body measures, were older, and were more likely to have at least one chronic condition.

Table 1. Sample Characteristics: Treated and Untreated Groups

PS Method and Confounding

Data for one or more variables required for PS creation were missing for 75 participants (0.4%), leaving a total sample of 1,049 children in the intervention group and 19,394 children in the control group for PS analysis. PSs ranged between 0.002 and 0.621. The mean PS in the intervention group was 0.085 (SD = 0.063; range: 0.005–0.491); the control group mean was 0.050 (SD = 0.040; range: 0.002–0.621). The area of common support is shown in the histograms in Figure 2; visual inspection suggested good common support. Table 2 summarizes confounder distribution before and after application of each PS method.

Figure 2. Propensity score distributions in the control group (A) and intervention group (B). The area of common support is demarcated by the boxed area.

Table 2. Confounder Imbalance in Treated and Untreated Groups Following Application of Three Propensity Score Methods


With PS matching, two equal groups of 1,049 participants were created. It was possible to match each intervention child with a control child within a 0.05 caliper. Using this method, all significant differences between the intervention and control groups were removed (see Table, Supplemental Digital Content 2, Sensitivity analysis was conducted by repeating the process with a 1:4 matching ratio; this did not change the results.


With PS stratification, five strata were constructed. Each stratum included the same number of children; however, the number of intervention children differed among strata because intervention children had higher PSs than control children. Stratum 1 had 59 intervention children (1.4% of stratum), Stratum 2 had 90 intervention children (2.2% of stratum), Stratum 3 had 145 intervention children (3.6% of stratum), Stratum 4 had 277 intervention children (6.8% of stratum), and Stratum 5 had 478 intervention children (11.7% of stratum). As shown in Table 2, one significant difference (grade) was removed from all five strata, but new significant differences were created for two confounders (household food insecurity and race/ethnicity); these differences occurred in only one and two of the five strata, respectively. Preexisting differences between the intervention and control groups persisted for six variables: school poverty level (2 of 5 strata), school nurse workload (3 of 5 strata), baseline BMI percentile (1 of 5 strata), baseline BMI (3 of 5 strata), age (1 of 5 strata), and diagnosis of at least one chronic illness (1 of 5 strata). Results using this method are presented in greater detail in Table, Supplemental Digital Content 3, In total, following use of PS stratification, the intervention and control groups differed on eight of the 11 confounders (six persisting differences and two newly created differences).


PS weighting applied a normalized weight to each child in the sample. Weights differed between groups. For the control group, normalized weights ranged between 0.50 and 1.32 (M = 0.53, Mdn = 0.52). For the intervention group, normalized weights ranged from 1.02 to 99.39 (M = 9.75, Mdn = 7.24). Because of these differences, we explored limiting the sample to children with normalized weights between the 5th and 95th percentile (Kurth et al., 2006; Potter, 1993). This action resulted in exclusion of 98% of children from the intervention group and, therefore, was not employed. Following application of normalized weights to the full sample, one difference between groups was removed (grade) and two new differences (household food insecurity, race/ethnicity) were created (Table 2). Additional details may be found in Table, Supplemental Digital Content 4, To further explore this result, two analyses were performed using the normalized log of the weight and the normalized square root of the weight. Results using the square root of the weight were similar (one difference removed, two differences created), as were the results using the log of the weight (no difference removed or created). Therefore, following PS weighting, differences between groups remained for eight of 11 confounders.

PS Method and Estimation of Treatment Effect

Table 3 presents the impact of the PS method on estimating the effect of the treatment on the outcome variable (BMI percentile change). Without PS adjustment, the intervention group had a 0.12 smaller decrease in BMI percentile compared to the control group (p < .001). With PS matching, the intervention group had 0.05 smaller BMI percentile decrease compared to the control group (p = .01). In the PS stratification analysis, the intervention group had a 0.14 smaller BMI percentile decrease (p < .001). With PS weighting, the intervention group showed a 0.02 smaller BMI percentile decrease (p = .001).

Table 3. BMI Percentile Change Before and After Application of Propensity Score Methods


We compared confounder balance following use of three PS methods by applying them to one data set with information about children with severe obesity, some of whom took part in the HOP intervention. The effectiveness of PSs in reducing significant differences between groups varied by method. Prior to PS application, groups differed on 7 of 11 potential confounders. After PS matching, no confounders differed between groups. After PS stratification, eight potential confounders differed between groups, though only two confounders differed in more than half of the five strata. After PS weighting, eight potential confounders differed between groups. For this data set, PS matching was most effective in improving confounder balance. Assessment of the impact of PS method on estimation of the treatment effect on the outcome showed that change in BMI percentile at 1 year varied by PS method. Estimated treatment effect on BMI percentile change was most similar to the uncontrolled analysis with PS stratification (0.02 greater than uncontrolled analysis) and most different with PS weighting (0.10 less than uncontrolled analysis) and PS matching (0.07 less than uncontrolled analysis).

We applied PSs using three methods and illustrated that the choice of PS method may influence the estimate of the treatment effect on the outcome. In practice, outcome analyses should not be conducted until after a PS method is selected.

Use of PS weighting and stratification methods increased confounder imbalance between groups. Stratification likely increased differences between groups because of our strict criterion for improvement in confounder balance. We considered balance to be improved only if differences were removed in all five strata. Researchers who impose less stringent criteria may find that stratification is more effective. The weighting procedure likely increased differences between groups because, in this data set, receipt of the intervention was a rare event (approximately 5%; 1,054 of 20,518 students). Rare events result in very small PSs. As a result, weights for the intervention group (1/PS) were much larger than the weights for the control group (1/(1 − PS)). Further testing using the log of the normalized weights and the square root of the normalized weights did not improve the results. Thus, despite the reported strengths of the weighting method (F. Li et al., 2014; Kurth et al., 2006), we concluded that PS weighting may not be well suited for data sets where receipt of the intervention of interest is rare.
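
A short worked calculation shows the size of this asymmetry. As an illustration only, the weights below are evaluated at the mean PS reported for each group (0.085 for the intervention group, 0.050 for the control group); the weight at the mean PS is not the same as the mean weight, so these are indicative values rather than results from the study.

```latex
\[
w_{\text{intervention}} = \frac{1}{\mathrm{PS}} = \frac{1}{0.085} \approx 11.8,
\qquad
w_{\text{control}} = \frac{1}{1 - \mathrm{PS}} = \frac{1}{1 - 0.050} \approx 1.05,
\qquad
\frac{w_{\text{intervention}}}{w_{\text{control}}} \approx 11.2
\]
```

At these values, a typical intervention child carries roughly 11 times the weight of a typical control child, and the ratio grows quickly as the PS approaches 0.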

PS methods have inherent limitations. First, the PS can only adjust for confounders used in the PS logistic model. It cannot adjust for unknown confounders (Austin, 2011; Rosenbaum & Rubin, 1983). Large data sets with many available confounders are ideal for use with PS methods, yet no data set will include all possible confounders. Because randomization theoretically balances groups on both known and unknown confounders, it remains the superior approach for minimizing bias. Second, use of PSs reduces bias in observational data, allowing approximation to a randomized design. However, PS application only reduces and does not remove the limitations inherent to observational studies—findings are limited to association and causality may not be inferred (Austin, 2011; Rosenbaum & Rubin, 1983). Therefore, a randomized controlled trial remains the gold standard for evaluating an intervention and inferring causality (Shadish et al., 2002, p.13).

To our knowledge, our study is the first within the nursing literature to compare multiple PS methods. Examples from other fields have shown conflicting results. In one study, researchers showed that both stratification and matching removed confounder differences between groups, though matching did so more effectively (Austin & Mamdani, 2006). The researchers noted one weakness of matching to be the loss of participants because of inability to find a match; this was not a concern in our study because of the large pool of untreated participants. Another study (Kurth et al., 2006) showed that different PS methods yielded widely differing results when extreme PS values (<5th percentile) were included. When the data were clipped to remove such values, the methods became comparable. However, the authors noted that no method was necessarily superior in reducing bias in estimating treatment effect; the best method depends on the sample of interest, the clinical question, and the data (Kurth et al., 2006).

Many researchers implement a PS method without first examining the improvement in confounder balance following its implementation or evaluating more than one PS method (Austin & Mamdani, 2006). The findings of this study highlight the importance of assessing multiple PS methods, as they may produce different results. It is particularly important to note that PS methods must always be selected a priori and evaluated in absence of outcome data in order to avoid the potential selection of the PS method that produces the most desirable outcome analysis. For this reason, it is important for investigators to describe the PS method selection and evaluation process they used when reporting intervention studies that utilize PSs.


Our study has several limitations. First, we created the PSs for this data set using one logistic model that was determined a priori. We also did not limit our analyses to only PSs that fell within the common support. Additional PS model development and restriction of PSs to the range of common support may have led to different findings. Second, our results reflect PS application to only one data set. Our data set also may not be typical in that only a small number of subjects (5%) received the intervention, which led to differences in sample size between PS matching (n = 2,098) and PS stratification and PS weighting (n = 20,443). Thus, our comparison of the three methods was biased toward finding fewer significant differences between groups in the matching analysis because the sample size was smaller. In addition, we applied only the three most common PS methods; additional methods exist (Austin, 2011; Rosenbaum & Rubin, 1983). Also, our measure of confounder balance considered only whether or not a confounder differed significantly between the intervention and control groups; it did not consider magnitude of differences between the intervention and control groups, nor did it consider significance of differences in confounder balance improvement between PS methods (Gelman & Stern, 2006). Lastly, additional potential confounders that would ideally be included in our PS, such as parental attitudes toward HOP, were not available in our data set.


Our analysis of applying PSs to a data set of a nurse-led obesity intervention showed that PSs can be effectively applied to improve confounder balance in observational studies (studies in which treatment is not assigned at random). Of the three methods compared, PS matching removed all significant differences in confounders between groups, whereas PS stratification and weighting removed some preexisting differences but created new differences. These results are likely influenced by the fact that only a small proportion of participants received the intervention. Because PS methods can differ in how well they balance confounders, we recommend that nurse scientists test multiple PS methods before selecting a method for use in their analyses. By adding PS methods to their toolbox, nurse scientists can harness the increasingly available large data sets to conduct studies with reduced bias and thereby create a stronger body of knowledge to improve health.


Austin P. C. (2010). Statistical criteria for selecting the optimal number of untreated subjects matched to each treated subject when using many-to-one matching on the propensity score. American Journal of Epidemiology, 172, 1092–1097. doi:10.1093/aje/kwq224
Austin P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399–424. doi:10.1080/00273171.2011.568786
Austin P. C., Mamdani M. M. (2006). A comparison of propensity score methods: A case-study estimating the effectiveness of post-AMI statin use. Statistics in Medicine, 25, 2084–2106. doi:10.1002/sim.2328
Brookhart M. A., Schneeweiss S., Rothman K. J., Glynn R. J., Avorn J., Stürmer T. (2006). Variable selection for propensity score models. American Journal of Epidemiology, 163, 1149–1156. doi:10.1093/aje/kwj149
Clarke S. P., Cossette S. (2000). Secondary analysis: Theoretical, methodological, and practical considerations. Canadian Journal of Nursing Research, 32, 109–129.
Cochran W. G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295–313. doi:10.2307/2528036
Cole S. R., Hernán M. A. (2008). Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168, 656–664. doi:10.1093/aje/kwn164
Comprehensive Meta-analysis. (2015). Comprehensive meta-analysis. Retrieved from
D’Agostino R. B. (1998). Tutorial in biostatistics: Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17, 2265–2281.
D’Agostino R. B., Kwan H. (1995). Measuring effectiveness. What to expect without a randomized control group. Medical Care, 33, AS95–AS105.
Daniels S. R. (2006). The consequences of childhood overweight and obesity. Future of Children, 16, 47–67.
Davison K. K., Birch L. L. (2001). Childhood overweight: A contextual model and recommendations for future research. Obesity Reviews, 2, 159–171. doi:10.1046/j.1467-789x.2001.00036.x
Eckardt P. (2012). Propensity score estimates in multilevel models for causal inference. Nursing Research, 61, 213–223. doi:10.1097/NNR.0b013e318253a1c4
Eckardt P., Rindskopf D. (2016). Quasi-experimental design in nursing research. In S. J. Henly (Ed.), The Routledge international handbook of advanced quantitative methods in nursing research (pp. 224–237). Abingdon, Oxon, UK: Routledge/Taylor & Francis.
Flegal K. M., Wei R., Ogden C. L., Freedman D. S., Johnson C. L., Curtin L. R. (2009). Characterizing extreme values of body mass index-for-age by using the 2000 Centers for Disease Control and Prevention growth charts. American Journal of Clinical Nutrition, 90, 1314–1320.
Freedman D. S., Khan L. K., Serdula M. K., Dietz W. H., Srinivasan S. R., Berenson G. S. (2005). The relation of childhood BMI to adult adiposity: The Bogalusa Heart Study. Pediatrics, 115, 22–27. doi:10.1542/peds.2004-0220
Gelman A., Stern H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. American Statistician, 60, 328–331. doi:10.1198/000313006X152649
Gu X. S., Rosenbaum P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2, 405–420. doi:10.1080/10618600.1993.10474623
Hill J., Reiter J. P. (2006). Interval estimation for treatment effects using propensity score matching. Statistics in Medicine, 25, 2230–2256. doi:10.1002/sim.2277
Institute of Medicine. (2012). Accelerating progress in obesity prevention: Solving the weight of the nation. Washington, DC: The National Academies Press.
Katz D. L., O’Connell M., Njike V. Y., Yeh M.-C., Nawaz H. (2008). Strategies for the prevention and control of obesity in the school setting: Systematic review and meta-analysis. International Journal of Obesity, 32, 1780–1789. doi:10.1038/ijo.2008.158
Kurth T., Walker A. M., Glynn R. J., Chan K. A., Gaziano J. M., Berger K., Robins J. M. (2006). Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of nonuniform effect. American Journal of Epidemiology, 163, 262–270. doi:10.1093/aje/kwj047
Lanehart R. E., de Gil P. R., Kim E. S., Bellara A. P., Kromrey J. D., Lee R. S. (2012, April). Propensity score analysis and assessment of propensity score approaches using SAS® procedures. Paper presented at the SAS Global Forum: Statistics and Data Analysis, Orlando, FL. Retrieved from
Lee B. K., Lessler J., Stuart E. A. (2011). Weight trimming and propensity score weighting. PLOS ONE, 6(3), e18174. doi:10.1371/journal.pone.0018174
Li F., Morgan K. L., Zaslavsky A. M. (2014). Balancing covariates via propensity score weighting. Retrieved from arXiv preprint arXiv:1404.1785
Li L., Kleinman K., Gillman M. W. (2014). A comparison of confounding adjustment methods with an application to early life determinants of childhood obesity. Journal of Developmental Origins of Health and Disease, 5, 435–447. doi:10.1017/S2040174414000415
Lin J., Jiao T., Biskupiak J. E., McAdam-Marx C. (2013). Application of electronic medical record data for health outcomes research: A review of recent literature. Expert Review of Pharmacoeconomics & Outcomes Research, 13, 191–200. doi:10.1586/erp.13.7
Moser D. K., Lee K. S., Wu J. R., Mudd-Martin G., Jaarsma T., Huang T. Y., … Riegel B. (2014). Identification of symptom clusters among patients with heart failure: An international observational study. International Journal of Nursing Studies, 51, 1366–1372. doi:10.1016/j.ijnurstu.2014.02.004
Ogden C. L., Carroll M. D., Kit B. K., Flegal K. M. (2014). Prevalence of childhood and adult obesity in the United States, 2011–2012. JAMA, 311, 806–814. doi:10.1001/jama.2014.732
Potter F. J. (1993). The effect of weight trimming on nonlinear survey estimates. Proceedings of the American Statistical Association, Section on Survey Research Methods (pp. 758–763). Washington, DC: American Statistical Association. Retrieved from
Qin R., Titler M. G., Shever L. L., Kim T. (2008). Estimating effects of nursing intervention via propensity score analysis. Nursing Research, 57, 444–452. doi:10.1097/NNR.0b013e31818c66f6
Rappaport E. B., Daskalakis C., Andrel J. (2011). Obesity and other predictors of absenteeism in Philadelphia school children. Journal of School Health, 81, 341–344. doi:10.1111/j.1746-1561.2011.00599.x
Rosenbaum P. R. (2002). Overt bias in observational studies. In Observational studies (pp. 71–104). New York, NY: Springer.
Rosenbaum P. R., Rubin D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi:10.1093/biomet/70.1.41
Rosenbaum P. R., Rubin D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524. doi:10.2307/2288398
Rosenbaum P. R., Rubin D. B. (1985a). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. American Statistician, 39, 33–38. doi:10.1080/00031305.1985.10479383
Rosenbaum P. R., Rubin D. B. (1985b). The bias due to incomplete matching. Biometrics, 41, 103–116. doi:10.2307/2530647
Rubin D. B., Zell E. R. (2016). Causality in experiments and observational studies. In S. J. Henly (Ed.), The Routledge international handbook of advanced quantitative methods in nursing research (pp. 213–223). Abingdon, Oxon, UK: Routledge/Taylor & Francis.
Schroeder K., Travers J., Smaldone A. (2016). Are school nurses an overlooked resource in reducing childhood obesity? A systematic review and meta-analysis. Journal of School Health, 86, 309–321. doi:10.1111/josh.12386
Shadish W. R., Cook T. D., Campbell D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Cengage Learning.
Shadish W. R., Steiner P. M. (2010). A primer on propensity score analysis. Newborn and Infant Nursing Reviews, 10, 19–26. doi:10.1053/j.nainr.2009.12.010
Sobol-Goldberg S., Rabinowitz J., Gross R. (2013). School-based obesity prevention programs: A meta-analysis of randomized controlled trials. Obesity, 21, 2422–2428. doi:10.1002/oby.20515
Stimpfel A. W., Rosen J. E., McHugh M. D. (2014). Understanding the role of the professional practice environment on quality of care in Magnet® and non-Magnet hospitals. Journal of Nursing Administration, 44, 10–16. doi:10.1097/NNA.0000000000000015
Wagner T. (2012, April). Propensity scores. U.S. Department of Veterans Affairs Health Services Research and Development Cyberseminar Series. Retrieved from
Waters E., de Silva-Sanigorski A., Hall B. J., Brown T., Campbell K. J., Gao Y., … Summerbell C. D. (2011). Interventions for preventing obesity in children. Cochrane Database of Systematic Reviews, 12, CD001871. doi:10.1002/14651858.CD001871.pub3

children; nonrandomized controlled trial; obesity; propensity score; quasiexperimental study; school nursing

Copyright © 2016 Wolters Kluwer Health, Inc. All rights reserved