Large health care utilization databases are frequently used to estimate the causal effect of prescription drugs on health outcomes.^{1} Health care utilization data reflect routine practice, are large enough to study rare drug effects, and avoid the delays common in the collection of primary data.^{2,3} Despite their importance, studies of pharmacoepidemiologic claims data have been criticized for the incompleteness of information on potential confounders such as markers of clinical disease severity, laboratory results, functional status, body mass index, smoking status, and over-the-counter medication use. Such factors may lead to selective prescribing, which may in turn result in biased estimates of the association between drugs and health outcomes.^{4} Longitudinal claims data contain information about patient health status and confounding beyond what is normally used in pharmacoepidemiologic research. We have explored the utility of this additional information with the help of a computer algorithm examining all health care claims data.

#### PROXY ADJUSTMENT

Longitudinal health care claims data can be understood and analyzed as a set of proxies that indirectly describe the health status of patients. This status is presented through the lenses of health care providers recording their findings and interventions via coders and operating under the constraints of a specific health care system.^{5} Quite often, several levels of proxies, which we call chains of proxies, are involved. For example, the health state of a patient can be assessed through (*a*) the dispensing of a drug that was (*b*) prescribed by a physician who made a diagnosis in a (*c*) patient who came forward for medical care, and (*d*) presented certain symptoms (Fig. 1). Such a chain of proxies is influenced by access to care,^{6} severity of the condition, diagnostic ability of the physician, preference for one drug over another,^{7} the patient's ability to pay the medication copayment,^{8} and the accurate recording of the dispensed medication. Here, the chain of proxies leads to a reasonable interpretation that the patient indeed had a condition that troubled the patient enough to see the physician, and that was severe enough for the physician to treat and for the patient to pay a copayment for the medication. Medical evidence and treatment options were weighed in several steps along the way. These are not observable in claims data, but collectively they resulted in a measurable action.

The measured action in this case had a clear interpretation (the prescribed medication addressed a specific condition), but such interpretations are not always possible. In fact, we cannot determine the exact interpretation in most cases; at the same time, an exact interpretation is not required for effective confounder adjustment. For example, old age serves as a proxy for comorbidity, frailty, cognitive decline, and many other factors. Adjusting for a perfect surrogate of an unmeasured factor is equivalent to adjusting for the factor itself.^{9} The degree to which a surrogate is related to an unobserved or imperfectly observed confounder is proportional to the degree to which adjustment can be achieved.^{10,11}

If we could measure a battery of proxies, we would increase the likelihood that, in combination, they form a good overall proxy for relevant unobserved confounding factors. Using a large number of proxy covariates for propensity score estimation and then estimating the average causal treatment effect conditional on deciles of the propensity score may improve control for confounding in epidemiologic studies of treatment effects using claims data, compared with models that have fewer covariates. Some authors have explored the use of very large propensity score models in nonrandomized assessments of treatment effects.^{12,13} In some studies, large propensity score models resulted in better control of confounding than propensity scores estimated with less covariate information.^{14,16} A major challenge remains, however: to identify a very large pool of potential covariates that can be implemented in claims data, and then to identify which are influential enough in the treatment-disease relationship to include in an analysis.

We propose an algorithm that identifies a large number of covariates in claims databases, eliminates covariates with very low prevalence and minimal potential for causing bias, and then uses propensity score techniques to adjust for a large number of target covariates. The approach will be illustrated using 2 pharmacoepidemiologic studies of intended treatment effects in elderly patients, (a) selective Cox-2 inhibitor use and reduced risk of GI complications and (b) statin use and reduced risk of death, plus a third example, (c) influenza vaccination and hip fracture, with an expected null association.

#### METHODS

We describe a generic algorithm that identifies a large number of target covariates in claims databases and selects covariates for propensity score adjustment to minimize residual confounding. We present 7 steps to achieve high-dimensional propensity score adjustment using health care claims databases. These steps assume that the cohort, exposure, and outcome have already been defined.

##### 1. Specify Data Sources.

A wide range of databases of health care utilization data (claims) is available for use in pharmacoepidemiology.^{3} Each database is arranged in specific ways using a variety of classifications to code diagnoses (eg, International Classification of Diseases [ICD]-8 through ICD-10), procedures (eg, Current Procedural Terminology, Canadian Classification of Procedures, ICD-9-Clinical Modification), or medications (eg, National Drug Codes, American Hospital Formulary Services, Anatomic Therapeutic Chemical Classification). Beyond these basic data dimensions and coding systems, many more data dimensions can be found in such databases. Some databases provide additional dimensions such as laboratory results, other electronic medical record information, and accident registries.

We propose an algorithm that is independent of the specific data source as long as the source's data dimensions can be identified. In Figure 2, we provide a flow diagram using a typical example of data dimensions available in US Medicare claims data linked to medication use data. First, a temporal window must be defined in which baseline covariates will be identified. A frequent choice is 6 or 12 months preceding the initiation of the study or comparison drug.^{2} The recording of diagnoses and procedures is correlated with the frequency of health care encounters; longer baseline periods therefore capture more encounters and yield more covariate information.^{2}
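As a sketch, the baseline window can be implemented as a simple date filter on each patient's claims. This is an illustrative helper, not part of the published SAS macro; the 30-day approximation of a month is an assumption made for brevity.

```python
from datetime import date, timedelta

def baseline_claims(claims, index_date, months=6):
    """Keep claims recorded in the baseline window before drug initiation.

    claims: list of (service_date, code) tuples for one patient.
    index_date: date the study or comparison drug was first dispensed.
    """
    # Approximate the baseline window as 30 days per month (assumption).
    start = index_date - timedelta(days=30 * months)
    # Claims on the index date itself are excluded: covariates must
    # precede exposure.
    return [(d, c) for d, c in claims if start <= d < index_date]
```

In practice the window would be applied separately within each data dimension (diagnoses, procedures, dispensings) before covariate generation.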

The most basic patient information, always available in typical databases, is age, sex, and calendar time. We assume that, given their ubiquity, these demographic covariates will always be adjusted for.

Additional covariates can then be identified from the various data dimensions, but it is first necessary to identify variables that should not be part of covariate adjustment. Although it is generally recommended to include many covariates in a propensity score regression model, in specific cases researchers may exclude variables from covariate adjustment.^{17} Surrogates for the exposure that are strong correlates of the study exposure but not associated with the outcome will not only increase standard errors but may also increase bias, and should therefore not be included in propensity score analyses.^{18,19} Bias can also occur through the inclusion of so-called "collider" variables, although this bias is generally thought to be weak.^{20,21} In our example study comparing statin initiation with glaucoma drug initiation, diagnostic codes for glaucoma should not be included in a propensity score because of their close correlation with treatment choice but not with the outcome other than through treatment.^{22} At this stage of the procedure, such codes can be identified and removed from the dimension data input to the algorithm. We have developed a screening tool as part of the algorithm that helps investigators identify and remove such covariates (eAppendix 1, http://links.lww.com/A1043).

##### 2. Identify Candidate Empirical Covariates.

Within each of *p* data dimensions (eg, outpatient diagnostic ICD codes, inpatient procedure codes, and drugs dispensed), codes were sorted by their prevalence. Prevalence was measured as the proportion of patients having a specific code at least once during the 6-month baseline period. Because the informational content of a binary factor is symmetric around a prevalence of 0.5, we subtracted all prevalence estimates larger than 0.5 from 1.0 before ranking. The top *n* most prevalent codes were identified as candidate empirical covariates. If fewer than 100 patients were identified with a covariate, the covariate was dropped.
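This step can be sketched in Python as follows (an illustrative re-implementation; the published algorithm is distributed as a SAS macro). `patient_codes` is an assumed data structure mapping each patient to the codes recorded during that patient's baseline period:

```python
from collections import Counter

def candidate_covariates(patient_codes, n=200, min_patients=100):
    """Rank codes of one data dimension by prevalence and keep the top n.

    patient_codes: dict mapping patient id -> set of codes recorded
    for that patient during the baseline period.
    """
    total = len(patient_codes)
    counts = Counter()
    for codes in patient_codes.values():
        counts.update(set(codes))  # count each patient at most once per code

    ranked = []
    for code, n_patients in counts.items():
        if n_patients < min_patients:  # drop very rare codes
            continue
        prevalence = n_patients / total
        # A binary factor is equally informative at prevalence p and 1 - p,
        # so fold prevalences above 0.5 back below 0.5 before ranking.
        folded = 1.0 - prevalence if prevalence > 0.5 else prevalence
        ranked.append((folded, code))

    ranked.sort(reverse=True)  # highest folded prevalence first
    return [code for _, code in ranked[:n]]
```

The same function would be called once per data dimension, yielding up to *p* × *n* candidate codes overall.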

The prevalence of each code (and therefore its empirical ranking) depends on the granularity of the coding; ICD-9 codes are hierarchical, such that each additional digit provides more diagnostic detail. Considering the fourth or fifth digit of the ICD-9 code will reduce the prevalence of the code in the data but may yield a better proxy for the underlying confounder. We initially set the granularity to 3 digits of ICD-9 for this illustration, since every system using ICD-9 records at least 3 digits. Granularity decisions need to be considered for all data dimensions, including medication coding: depending on the application, therapeutic class may be sufficiently detailed, whereas in other settings individual drugs or even drug dose or preparation may be required. We chose the individual drug level for the base-case algorithm.

##### 3. Assess Recurrence.

For the top *n* most prevalent codes in each data dimension, we assessed how frequently each code was recorded for each patient during the baseline period. We expanded each code into 3 binary variables: code occurred ≥1 time, ≥ the median number of times, and ≥ the 75th percentile number of times. A code that appeared at least the 75th percentile number of times would have a value of "true" for all 3 recurrence variables. If any of the cutpoints coincided, the variable representing the higher cutpoint was dropped. For a data structure with *p* data dimensions, this results in up to *p* × *n* × 3 covariates.

##### 4. Prioritize Covariates.

If we now wish to combine information from all *p* data dimensions while reducing the total number of covariates, we need to consider that the average prevalence of codes can differ considerably among dimensions. In our experience, procedure codes, including codes for simple office visits, have a higher prevalence than drug dispensings. Simply combining these tables and picking the top *k* prevalent candidate covariates would down-weight the importance of medication dispensings in controlling for confounding. Further, Brookhart et al^{20} showed that including patient characteristics in the propensity score that are associated with the exposure but not the outcome will increase the variance of the estimator with no improvement in confounding control, and in some situations can actually introduce confounding. We therefore decided to prioritize covariates across data dimensions by their potential for confounding, assessed unconditionally rather than conditional on exposure and other covariates. Because we are dealing exclusively with binary covariates, the confounded or apparent relative risk (ARR) is a function of the imbalance in prevalence of a binary confounding factor among exposed (*PC*_{1}) and unexposed (*PC*_{0}) subjects as well as the independent association between the confounder and the study outcome (*RR*_{CD})^{23}:

$$\mathit{ARR} = \mathit{RR} \times \frac{\mathit{PC}_{1}\,(\mathit{RR}_{CD} - 1) + 1}{\mathit{PC}_{0}\,(\mathit{RR}_{CD} - 1) + 1}$$

The fraction on the right side of the equation is the multiplicative bias term, *Bias*_{M}. We then sorted all *p* × *n* × 3 covariates by the magnitude of log(*Bias*_{M}) in descending order. We chose multiplicative bias assessment because a bias term on the absolute risk scale (*Bias*_{A} = *RD*_{CE} × *RD*_{CD}) would implicitly down-weight the association between a confounder and outcome if the outcome event rate is small but the prevalence of the exposure high, a typical occurrence in pharmacoepidemiologic cohort studies. The covariate prioritization is illustrated for binary variables since our algorithm to generate target covariates exclusively creates binary variables.
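The prioritization can be sketched as follows (again an illustrative Python re-implementation of the SAS macro's logic). `covariate_stats` is an assumed structure holding, for each candidate covariate, its prevalence among exposed and unexposed patients and its unconditional relative risk for the outcome:

```python
import math

def bias_priority(pc1, pc0, rr_cd):
    """Magnitude of log multiplicative bias (Bross formula) for a binary
    covariate with prevalence pc1 among exposed, pc0 among unexposed,
    and unconditional outcome relative risk rr_cd."""
    bias_m = (pc1 * (rr_cd - 1) + 1) / (pc0 * (rr_cd - 1) + 1)
    return abs(math.log(bias_m))

def prioritize(covariate_stats, k=500):
    """covariate_stats: dict name -> (pc1, pc0, rr_cd).
    Returns the k covariates with the largest potential bias."""
    ranked = sorted(covariate_stats,
                    key=lambda name: bias_priority(*covariate_stats[name]),
                    reverse=True)
    return ranked[:k]
```

A covariate equally prevalent in both exposure groups yields a bias term of exactly 1 (log 0) regardless of its outcome association, so only covariates that are both imbalanced and outcome-associated rise to the top of the list.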

##### 5. Select Covariates.

Once this prioritization of covariates was accomplished, we included the top *k* covariates from step 4, which could be as large as *p* × *n* × 3 when including all candidate covariates. Our base case settings were *p* = 8 and *n* = 200, resulting in 4800 candidate covariates. We selected the top *k* = 500 binary empirical covariates (about 10%) for inclusion in the propensity score modeling.

In addition to these *k* binary empirical covariates, we included covariates that should always be adjusted for if available, including *d* binary demographic covariates age, sex, race (available in Medicare data), and calendar year. In addition to these, we allowed the investigator to force *l* binary, categorical, or numeric predefined covariates into the propensity score model, based on context knowledge regarding the specific study question.

In a subanalysis, we explored the impact of adjusting for 2-way interaction terms. Of the *k* empirical covariates, we selected the 20 highest priority empirical covariates and computed multiplicative 2-way interactions among those covariates and with the demographic and predefined covariates, resulting in another (20 + *d* + *l*) × (20 + *d* + *l*)/2 covariates.
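For binary covariates, the interaction expansion reduces to pairwise products. The sketch below is a hypothetical helper, not taken from the published macro:

```python
from itertools import combinations

def interaction_terms(covariates):
    """covariates: dict name -> dict of patient id -> 0/1 indicator.
    Returns the product of every unordered pair of covariates as a
    new binary covariate."""
    terms = {}
    for a, b in combinations(sorted(covariates), 2):
        terms[f"{a}*{b}"] = {
            pid: covariates[a][pid] * covariates[b][pid]
            for pid in covariates[a]
        }
    return terms
```

Because each input is 0/1, the product is simply the logical AND of the two indicators (patient has both characteristics).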

##### 6. Estimate Exposure Propensity Score.

Using multivariate logistic regression, a propensity score was estimated for each subject as the predicted probability of exposure conditional on all *d* + *l* + *k* covariates.

##### 7. Estimate Propensity Score-Adjusted Outcome Models.

We grouped subjects into propensity score deciles and used multivariate logistic regression analyses to model the study outcome as a function of exposure and indicator terms for decile of propensity score. In addition to an adjusted estimate, we computed a standardized mortality ratio (SMR)-weighted estimate using weights of 1 for subjects in the study drug group and the odds of the propensity score (PS) for members of the comparison group (*PS*/[1 − *PS*]). SMR-weighted estimates provide treatment effect estimates among the treated. As the output of Step 6 includes each subject's propensity score, other ways of using propensity scores in the outcome estimation may be applied, including matching, inverse probability of treatment weighting, or modeling the propensity score as a continuous variable.^{24} The high-dimensional propensity score algorithm is implemented as a SAS macro available at http://www.drugepi.org.
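Decile assignment and SMR weighting can be sketched as follows (an illustrative version; tie handling and decile boundary conventions may differ from the SAS macro):

```python
def ps_deciles(ps):
    """Assign each subject to a propensity score decile (0-9).

    ps: dict of subject id -> estimated propensity score."""
    ordered = sorted(ps, key=ps.get)  # subjects from lowest to highest PS
    n = len(ordered)
    return {sid: min(10 * rank // n, 9) for rank, sid in enumerate(ordered)}

def smr_weights(ps, exposed):
    """SMR weights: 1 for treated subjects, the odds of the propensity
    score PS/(1 - PS) for comparison subjects. The weighted contrast
    estimates the treatment effect among the treated."""
    return {sid: 1.0 if exposed[sid] else p / (1.0 - p)
            for sid, p in ps.items()}
```

The decile labels would then enter the outcome model as 9 indicator terms alongside the exposure, while the weights would be supplied to a weighted outcome regression.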

##### Example Data Sources and Study Cohorts

All 3 study cohorts were drawn from a population of patients aged 65 years and older enrolled in both Medicare and the Pennsylvania Pharmaceutical Assistance Contract for the Elderly (PACE) programs between 1995 and 2002. PACE is a state pharmaceutical benefits program for residents with incomes below $14,000 for individuals and below $17,200 for couples; its data have been frequently used for pharmacoepidemiologic studies.^{7,25} All prescription drugs commercially available in the United States during the study period were fully covered by PACE, requiring only a nominal copayment of $6. Prescription drug information was assessed based on pharmacy claims from PACE, with detailed and highly accurate information^{26,27} on drug name, dosage, quantity, and date of dispensing.

##### Study Exposures and Outcomes

##### Example Cohort 1.

Initiation of nonselective NSAID use versus selective Cox-2 inhibitor use was defined if an eligible beneficiary filled at least one prescription for an NSAID between 1 January 1999 and 31 December 2002 but did not use any NSAID during the 18 months prior to the index date. The index date was the first date an NSAID prescription was filled.^{28} The follow-up period included the 180 days after the initiation of therapy.

The study outcome of severe gastrointestinal (GI) complication was defined as either a hospitalization for GI hemorrhage or peptic ulcer disease complications including perforation (coded as ICD-9 discharge diagnoses 531.x, 532.x, 533.x, 534.x, 535.x, or 578.x in the first or second position, or a physician service code for GI hemorrhage). These definitions were validated in 1762 patients in a hospital discharge database, with a composite positive predictive value (PPV) of 90%.^{29} We expected to find a moderate protective effect of Cox-2 inhibitors on GI complications,^{30–32} which may be concealed by confounding.^{33}

##### Example Cohort 2.

The initial exposure status of statin use, nonuse, or comparator drug use, as determined from pharmacy claims, was carried forward until censoring after 1 year or death, whichever came first. We assessed the extent to which patients classified as nonusers started statins during follow-up and how many statin users discontinued use, defining discontinuation as a gap of 90 or more days beyond the dispensed days' supply without a statin dispensing.

We then used Medicare claims data to ascertain time to death. Death information from Medicare records is routinely cross-checked with Social Security data. Subjects were censored at the end of 365 days after drug initiation or disenrollment from the pharmacy assistance program. We expected to find a moderate protective effect of statins on mortality in older adults (RR about 0.85)^{34} that may be exaggerated by confounding.^{35}

##### Example Cohort 3.

For the previous examples, we hypothesized protective effects of the drug therapy with confounding going either toward the null (example 1) or away from the null (example 2). We added a third example with a strong prior hypothesis of a null association, based on context knowledge: the relationship between influenza vaccination in elderly people and the risk of hip fracture. A typical pre-flu season (1 October–31 December 1996) was used to assess exposure to influenza vaccine, and the next 4 months (January–April 1997) were the follow-up period. Patients with prior hip fractures or bisphosphonate use for the treatment of osteoporosis were excluded. Patients were censored after the occurrence of the study end point, death, or disenrollment.

##### Overall Analytic Strategy

To make the comparisons among models that contained a varying number of covariates as fair as possible, we used the available covariates for propensity score estimation and then adjusted the respective outcome models (logistic regression for examples 1 and 2 and Cox proportional hazard regression for example 3) for deciles of the estimated propensity score. We report the numbers of covariates entered into each propensity score model as well as its c-statistic of model discrimination.

#### RESULTS

Population characteristics of the first 2 example cohorts are presented in Tables 1 and 2. There were 32,042 subjects who initiated selective Cox-2 inhibitors. The 17,611 nonselective NSAID initiators were older and had more comorbidities and more risk factors for GI complications. The NSAID initiators had 185 GI complications within 180 days (1.1%), and the Cox-2 initiators had 367 events (1.2%). Compared with 14,889 glaucoma drug initiators, the 21,233 initiators of statin therapy were younger, more likely to have cardiovascular risk factors, had more health care utilization, and had about equal numbers of comorbidities. The statin initiators had 784 deaths within 1 year (3.7%); the glaucoma drug initiators had 955 deaths (6.4%).

(Tables 1 and 2 appear here.)

In a traditional multivariable analysis comparing Cox-2 inhibitors with NSAIDs (Table 3), we observed no association with GI complications (RR = 0.94; 95% confidence interval [CI] = 0.78–1.12), slightly reduced from an unadjusted RR of 1.09, suggesting that additional adjustment for residual confounding would move the relative risk further toward a protective effect.

Comparing statin initiators with non-statin-using initiators of glaucoma drugs (Table 4), we observed a strongly reduced risk of 1-year mortality (RR = 0.80; 0.70–0.90), which is closer to the expected results from RCTs in elderly people than an unadjusted analysis (RR = 0.56), suggesting that additional adjustment for residual confounding would move the relative risk further toward the null. In the first 2 example studies, we observed several trends regarding the performance of the high-dimensional propensity score algorithm:

1. Adding the high-dimensional propensity scores to the predefined covariates moved the point estimate in the expected direction (Cox-2 inhibitors toward a more protective effect, statins toward a less protective effect), consistent with RCT findings.

2. Results of the high-dimensional propensity score alone were identical to the second decimal place with those of the high-dimensional propensity score combined with predefined covariates (Models 5 and 5a in Tables 3 and 4). SMR-weighted propensity score outcome models resulted in effect estimates of 0.77 (0.67–0.88) for Cox-2 inhibitors and GI complications, and 0.64 (0.59–0.70) for statins and death.

3. First stage c-statistics quantifying the degree of exposure prediction did not consistently correlate with the changes in effect estimates.

4. Including 500 rather than 200 covariates in the high-dimensional PS appeared to move the effect estimate little.

5. More granular diagnostic codes (4-digit vs. 3-digit ICD-9) appeared to move the effect estimate little.

6. Dropping the recurrence assessment appeared to move effect estimates slightly away from the expected direction.

7. Including 2-way interactions appeared to leave the effect estimate unchanged or move it slightly away from the expected direction.

8. Selecting covariates based only on their prevalence without any further covariate prioritization moved effect estimates slightly away from the expected direction in the Cox-2 inhibitor example but not in the statin example.

Bootstrapped 95% confidence intervals based on 1000 samples were very similar to those of the base case algorithm (Model 5) in both example studies (0.73–1.06 in Table 3 and 0.76–0.98 in Table 4).

For the third example study, on the relationship between influenza vaccination in seniors and hip fractures, we expected a null association with strong contextual support for that null hypothesis. In a cohort of 147,583 patients, 42% received influenza vaccination, and we observed 710 hip fractures (0.5%). Due to confounding, we observed a slightly protective effect in an unadjusted analysis (RR = 0.93; 0.80–1.08). After adjustment for demographic factors and the high-dimensional propensity score, the apparent effect was entirely explained away (RR = 1.02; 0.85–1.21).

#### DISCUSSION

We hypothesized that high-dimensional proxy adjustment based on propensity score techniques could reduce residual confounding in claims-based studies of treatment effects. To explore this hypothesis, we developed a generic algorithm that identifies a large number of target covariates and selects among them for high-dimensional propensity score adjustment. In the example studies of drug-outcome relationships, we found that applying the high-dimensional propensity score algorithm produced results closer to the expected findings from randomized trials, compared with propensity score adjustment using a more limited number of investigator-predefined covariates. We further found that some components of the algorithm were more important than others: in our examples, the covariate prioritization as well as the assessment of, and adjustment for, recurrent coding of health services seemed to be important contributors.

The algorithm's main strength rests on the exploitation of information that is usually untapped in epidemiologic analyses of health care utilization databases. It also includes a variable selection component to limit the number of adjusted covariates to an arbitrary number (500 in our base case) because in theory a number of covariates larger than the number of study subjects could be generated. Hirano and Imbens^{16} have developed a propensity score variable selection algorithm that is based on comparing the *t*-statistics of the entire propensity score regression model (*tprop*) with those of individual covariates (*tk*). Such test-based approaches to variable selection have been criticized for their dependency on study size and potential for bias.^{36} This approach also seems impractical if the number of candidate covariates is very large (eg, the 4800 in our example), which can provide a challenge to fitting the entire PS regression model even in large datasets.

Brookhart et al^{20} found that the inclusion of variables that predict only exposure but not outcome can result in larger standard errors in small studies; if residual confounding exists, this can increase bias.^{37,38} These effects were not evident here, probably because of the large sample size and the modest associations between individual covariates and exposure. However, the possibility that an empirically generated variable could increase bias and variance represents the primary concern about the algorithm. Users of this methodology should remove covariates that are a priori expected to be strong predictors of exposure but unlikely to be related to the outcome. An example of such a covariate is history of glaucoma, a strong predictor of glaucoma treatment (but not of death), which appeared in our screening tool (eAppendix 1, http://links.lww.com/A1043) and was removed in our second example cohort. Further research is needed to consider ways in which empirically generated claims-based covariates could introduce collider bias, and how such covariates could be identified and removed.

Variable selection techniques may result in falsely narrow standard errors.^{39} We therefore applied bootstrapping to estimate 95% confidence intervals^{40} and found very similar confidence intervals compared with the simple logistic regression of the base case algorithm. This is not surprising, as we did not apply any confounder selection algorithm involving multiple tests of the exposure-outcome association, such as change-in-estimate or forward or backward selection.^{36} Instead, we did a preliminary screen of candidate confounders by estimating unconditional associations of individual potential confounders with the exposure and then separately with the outcome. We then ranked covariates with regard to their potential for being a confounder and included these candidate covariates up to a predefined maximum number. We observed fairly weak unconditional associations of individual factors.

The present study is an empirical comparison of methods without a true gold standard. We used randomized trial findings to set specific expectations regarding the treatment effect estimates, but it ultimately remains unanswerable whether our high-dimensional propensity score algorithm fully reached that goal. Simulation studies are unlikely to clarify the performance of this algorithm because it is inherently empirical and relies on data-generating mechanisms that will vary from study to study and are thus difficult to prespecify. Rather, the strength of the high-dimensional propensity score is that it does not make assumptions about data quality, quantity, and interpretation. Further validation of the algorithm is possible by replicating findings that are expected based on randomized trials, including our example of a null association in which the high-dimensional propensity score algorithm eliminated all confounding. A specific point of concern is the performance of a covariate prioritization strategy that considers the association of each factor with the study outcome when outcomes are rare: the prioritization rule may then miss potentially important confounders by chance. While this is a theoretically important point, it needs to be seen in light of the fact that the proposed high-dimensional proxy adjustment will be used in addition to adjustment for factors specified by the investigator.

Further work may provide improved ways of selecting covariates or using the covariates in the analysis. For example, optimization of the variable selection algorithm may be possible by considering the association between the candidate covariate and the exposure, conditional on either a set of predefined covariates or the entire list of selected covariates. It is also possible that algorithms based on cross-validation could prove useful for covariate selection.^{41,42} Finally, we have considered the use of selected covariates in an analysis that depends on correct specification of the propensity score model. Doubly robust approaches are based on an assumed model of both the exposure and the outcome, but are consistent if only one of the models is correctly specified.^{18} The use of the selected covariates in the setting of a doubly robust estimator may improve the performance of the algorithm. However, since the outcome model must be more parsimonious, a separate list of covariates would need to be generated for the outcome model, perhaps those with particularly strong outcome associations.

It is too early to conclude that the proposed algorithm or variations thereof will be able to substitute for existing confounder adjustment strategies in claims data analyses, although in our limited examples the algorithm performed better than standard techniques. Practical advantages are that the algorithm can be run efficiently on a large scale, it reduces investigator and programming time substantially, and it reduces programming errors and potential mischaracterization of covariate definitions or adjustments without a loss of validity. This last point might be of particular practical advantage in studies pooling multiple claims databases.

In conclusion, in some typical pharmacoepidemiologic studies of treatment effects, the proposed proxy adjustment via high-dimensional propensity scores generated effect estimates closer to randomized trial findings, compared with standard covariate adjustment of predefined covariates. Further replication will be necessary in a variety of settings to assess the value of this approach.