A large study has provided information on the impact of prescribed drugs on the risk of road traffic crashes leading to injury.1 The modeling strategy was to implement a separate logistic regression model for each drug and to correct for multiple testing. Multivariate analysis in which a single model adjusts for individual co-consumption of drugs would be more correct, but involves high-dimensional variable selection. In the absence of prior knowledge to guide model selection, conventional automatic selection methods such as stepwise selection are usually applied, despite drawbacks.2–6 Shrinkage methods have emerged as an alternative, but are underused in epidemiologic analyses.5,7,8
We report the application of lasso shrinkage logistic regression to the analysis of the database described by Orriols and colleagues.1 We compare these results with those obtained with a conventional modeling strategy and discuss the advantages and limitations of the 2 strategies.
A nationwide registry-based study1 was conducted in France to assess the impact of prescribed drug intake on road crashes between July 2005 and May 2008 (Fig. 1). In this case-control study, the outcome variable of interest was the driver's responsibility for the crash. Responsibility was determined by a standardized method that assigns a score to each driver based on factors likely to reduce driver responsibility.9,10 Drivers were then categorized as responsible for the crash (cases) or not responsible (controls). The cutoff score for “responsibility” was set by consensus with expert decision makers.10
We studied dispensed and reimbursed prescription drugs, grouped according to the fourth level of the World Health Organization anatomic therapeutic chemical classification. Daily medication exposure was calculated for each drug as starting 1 day after dispensing. Exposure duration was estimated from median values reported within a survey on drug prescriptions in France.11 We investigated the association between responsibility and exposure to 242 drugs on the day of the crash, adjusted for potential confounders. Of these 242 drugs, 94 potentially impair driving, according to the current French classification system of drugs affecting driving ability.
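The exposure definition above can be illustrated with a minimal sketch. It assumes that the exposure window opens 1 day after dispensing and runs for the estimated duration, with both end points included; the inclusive end point, the function names, and the example dates are illustrative assumptions rather than details taken from the study.

```python
from datetime import date, timedelta

def exposure_window(dispensing_date, duration_days):
    """Return (first_day, last_day) of exposure, both inclusive.

    Exposure starts 1 day after dispensing, as in the text. Treating the
    last day as inclusive is an assumption made for this sketch.
    """
    start = dispensing_date + timedelta(days=1)
    end = start + timedelta(days=duration_days - 1)
    return start, end

def exposed_on(day, dispensing_date, duration_days):
    """True if `day` (for example, the crash day) falls in the exposure window."""
    start, end = exposure_window(dispensing_date, duration_days)
    return start <= day <= end

# Hypothetical dispensing on 1 March 2006 with a 30-day estimated duration.
dispensed = date(2006, 3, 1)
for crash_day in (date(2006, 3, 1), date(2006, 3, 2), date(2006, 4, 1)):
    print(crash_day, exposed_on(crash_day, dispensed, 30))
```

In an analysis of this kind, one such indicator per drug class and per driver would form the binary exposure columns of the design matrix.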
Adjustment variables included traffic crash factors, sociodemographic factors, and the presence of chronic diseases (Table 1).1 We coded these with dummy variables, producing 55 binary variables. No missing values were recorded, except for alcohol level. Drivers with a missing alcohol level were excluded from the analyses, leaving a final sample size of 62,766 persons.
A recurring difficulty in such studies is disentangling the impact of the medical condition from the impact of the medication. In France, patients are fully reimbursed for health care expenses related to 30 recognized chronic diseases. Only these diseases are registered in the health care database, and thus, the available data on medical condition were restricted to them. Substantial residual confounding by underlying health status remains likely because, for example, there are no data on transient disorders or on the severity of chronic conditions.
In addition to the potential problem of residual confounding, estimation of the association of a particular drug with car crashes raises the following 2 crucial issues: which of the other drugs should we select as confounders, and how should we construct the 95% confidence intervals (CIs)?
We considered 2 modeling strategies (Fig. 2):
- Separate logistic regression models: We fitted 242 separate logistic regression models, exposure by exposure, using conventional maximum likelihood adjusting for all identified potential confounders and accounting for coconsumption via a unique covariate—the binary exposure to any of the 94 drugs that can potentially impair driving (except the current exposure, if it is one of these 94 drugs). Because many CIs were to be constructed, we applied the Bonferroni correction to ensure that the overall confidence level was at least 95%.
- Lasso shrinkage logistic regression: We fitted a single model adjusted for all potential confounders, including all the exposures as covariates, using the lasso penalized maximum likelihood method (see eAppendix [http://links.lww.com/EDE/A597] for a full explanation of this approach). Adjustment variables and the intercept were not penalized and were thus forced into the model. To estimate the proper amount of shrinkage, we considered a grid of penalty values ranging from 0 (the maximum likelihood solution for the full logistic model) to the smallest value for which no exposures are retained in the model. We then selected the penalty with the 10-fold cross-validated area under the curve criterion and corrected the resulting estimates for bias. We used the R package glmnet.12,13 We built 95% bootstrap percentile CIs using 5000 replicates.
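The single-model strategy can be sketched in a few lines. The example below is not the glmnet implementation (glmnet uses coordinate descent); it swaps in a plain proximal-gradient loop, written in Python so that it is self-contained. The toy data, the function names, the step size, and the penalty values are all hypothetical. It illustrates two points from the description above: unpenalized adjustment variables stay in the model whatever the penalty, and a large enough penalty sets the exposure coefficient exactly to zero.

```python
import math

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_lasso_logistic(X, y, penalized, lam, step=1.0, n_iter=5000):
    """Minimize mean log-loss + lam * sum(|beta_j| for penalized j).

    The intercept and the columns flagged False in `penalized` are never
    shrunk. step=1.0 is an assumption that suits a few binary columns;
    it should be reduced for larger or differently scaled designs.
    """
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Fitted probability minus outcome, computed once per iteration.
        resid = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        intercept -= step * sum(resid) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            updated = beta[j] - step * grad_j
            beta[j] = soft_threshold(updated, step * lam) if penalized[j] else updated
    return intercept, beta

# Hypothetical toy data: columns are [exposure, adjustment]; each stratum
# has 10 drivers, and the dictionary gives the number of cases in it.
cases_per_stratum = {(0, 0): 2, (0, 1): 5, (1, 0): 5, (1, 1): 8}
X, y = [], []
for (exposure, adjustment), cases in cases_per_stratum.items():
    for i in range(10):
        X.append([exposure, adjustment])
        y.append(1 if i < cases else 0)

penalized = [True, False]  # shrink the exposure, force the adjustment in
for lam in (0.0, 0.05, 1.0):
    _, beta = fit_lasso_logistic(X, y, penalized, lam)
    print(lam, beta)
```

In glmnet, per-variable control of this kind is available through the `penalty.factor` argument; the source does not state how the package was configured, so this correspondence is offered only as a pointer.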
We aimed to highlight the most stable results, and thus, chose exposures having the narrowest CIs (defined as a ratio of the upper to lower confidence limits <3) for further investigation (Fig. 2).14
Of the 62,766 drivers in the analytic database, 30,211 (48%) were considered responsible for their crash (Table 1). Seventy-three percent of subjects were unexposed to any prescription drugs on the crash day, 9% were exposed to 1 drug, and 18% were exposed to >1 (from 2 to 21).
Table 2 shows estimated odds ratios (ORs) and CIs, as well as the ratio of the upper to lower confidence limits from the 2 modeling strategies. Non-null point estimates obtained with the single-model strategy had the same sign as the point estimates obtained with the separate-models strategy. After correcting bias in the lasso estimates, the sizes of the associations were comparable; however, narrower CIs were obtained with the shrinkage regression. For drugs with a large number of exposed participants, lasso CIs were only slightly narrower. Differences were most noticeable for drugs with moderate levels of exposure. Among those with the strongest associations, CIs were especially reduced for carbamate anxiolytics (N05BC), OR = 1.97 (95% CI = 1.26–3.38) with the single-model strategy versus 2.21 (0.91–5.40) with the separate-models strategy; carboxamide-derivative antiepileptics (N03AF), 1.61 (1.07–2.46) versus 1.73 (0.85–3.55); fatty acid-derivative antiepileptics (N03AG), 2.06 (1.46–3.03) versus 2.19 (1.11–4.32); and drugs used in opioid dependence (N07BC), 1.69 (1.30–2.23) versus 1.76 (1.07–2.89).
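The stability criterion (ratio of the upper to lower confidence limits <3) can be checked directly against the four intervals quoted above. The short sketch below does only that arithmetic; the function and variable names are illustrative.

```python
def confidence_limit_ratio(lower, upper):
    """Ratio of the upper to the lower confidence limit of an odds ratio."""
    return upper / lower

def is_stable(lower, upper, max_ratio=3.0):
    """True if the interval meets the narrowness criterion (ratio < 3)."""
    return confidence_limit_ratio(lower, upper) < max_ratio

# Confidence limits as quoted in the text for the 2 strategies.
intervals = {
    "N05BC": {"lasso": (1.26, 3.38), "separate": (0.91, 5.40)},
    "N03AF": {"lasso": (1.07, 2.46), "separate": (0.85, 3.55)},
    "N03AG": {"lasso": (1.46, 3.03), "separate": (1.11, 4.32)},
    "N07BC": {"lasso": (1.30, 2.23), "separate": (1.07, 2.89)},
}

for code, by_strategy in intervals.items():
    for strategy, (lower, upper) in by_strategy.items():
        ratio = confidence_limit_ratio(lower, upper)
        print(code, strategy, round(ratio, 2), is_stable(lower, upper))
```

On these figures, all four lasso intervals meet the criterion, whereas among the separate-models intervals only the one for N07BC does.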
We also searched for interactions, but statistical power was insufficient to assess any interaction between exposures or confounders. Age and alcohol concentration were strongly associated with driver responsibility (eTable [http://links.lww.com/EDE/A597]).
We performed sensitivity analyses using various cutoff values with the lasso method (eFigure [http://links.lww.com/EDE/A597]). A moderate reduction or increase of the cutoff score had little impact on the results.
Shrinkage methods have gained popularity among the statistical community as an alternative to conventional selection methods in large data sets (large sample size, large number of exposures, or both, as in the present study). There are examples of these methods used in fields such as neuroimaging,15 human functioning and disability,16 and health-related quality of life.17 There are, however, few examples in epidemiology, and these are limited primarily to genome-wide or genetic studies.18–22
One limitation of the lasso method is that with a proper penalty parameter, relevant covariates are retained, but at the cost of adding a few irrelevant variables (false positives). Such added variables typically have small point estimates and are thus unlikely to lead to serious errors of inference.23 Few of the studies that use lasso-related methods provide measures of uncertainty, probably because optimum methods have not been established.24–27 Like these studies, we used bootstrapping to generate measures of statistical accuracy.15,16,20 A narrow CI for the OR of a risk factor may, in particular, indicate that the exposure was frequently selected by the lasso in the bootstrap samples, which leads to more statistically stable results.
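The bootstrap percentile approach and the notion of selection frequency can be sketched as follows. This is a generic illustration under stated assumptions: the estimator passed in is a placeholder (in the setting described here it would refit the lasso model on each resample and return one coefficient), the index rule used for the percentiles is one simple convention among several, and all names and the toy data are hypothetical.

```python
import math
import random

def bootstrap_replicates(rows, estimator, n_rep=1000, seed=0):
    """Resample rows with replacement and apply `estimator` to each resample."""
    rng = random.Random(seed)
    n = len(rows)
    return [estimator([rows[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_rep)]

def percentile_ci(replicates, level=0.95):
    """Percentile interval: order statistics at the alpha/2 and 1 - alpha/2 ranks."""
    ordered = sorted(replicates)
    alpha = 1.0 - level
    last = len(ordered) - 1
    lower = ordered[math.floor(alpha / 2 * last)]
    upper = ordered[math.ceil((1 - alpha / 2) * last)]
    return lower, upper

def selection_frequency(replicates):
    """Share of replicates in which the coefficient was not set to zero."""
    return sum(1 for r in replicates if r != 0.0) / len(replicates)

def proportion_of_cases(sample):
    """Placeholder estimator for the demo: share of cases in the resample."""
    return sum(sample) / len(sample)

# Hypothetical toy outcome vector: 12 cases among 30 drivers.
outcomes = [1] * 12 + [0] * 18
reps = bootstrap_replicates(outcomes, proportion_of_cases, n_rep=500, seed=1)
print(percentile_ci(reps))
```

With lasso coefficients as replicates, an exposure that is dropped in many resamples contributes many zeros, which pulls the percentile interval toward the null; `selection_frequency` makes that share explicit.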
In general, the results from the conventional and the shrinkage approaches were in agreement (in terms of point estimates), after bias correction of lasso estimates,28 suggesting consistency across the 2 strategies. Narrower CIs were obtained with shrinkage regression, although CIs based on penalized estimators may themselves be large.27 The differences were more noticeable for drugs with a moderate level of exposure. Fitting separate regressions, exposure by exposure, and thus accounting for coconsumption via a unique covariate is a simple ad hoc method. However, one advantage of lasso-type methods over separate-models strategies is their ability to take into consideration the span of relevant exposures in the analysis. This feature enables filtering out of confounders that would be incidentally correlated with response noise. On the other hand, when multiple comparisons and model selection are handled using wider CIs such as those obtained with the Bonferroni correction, more conservative solutions are obtained. By shrinking some regression coefficients and setting others to zero, lasso methods have the ability to eliminate exposures that are not associated with the outcome (provided these exposures are not too heavily correlated with the relevant exposures).29
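The widening induced by the Bonferroni correction can be made concrete with a small calculation. The sketch assumes Wald-type intervals on the log OR scale, which the text does not specify, and uses a hypothetical log OR and standard error; only the number of intervals (242) comes from the study.

```python
import math
from statistics import NormalDist

def wald_or_ci(log_or, se, n_intervals=1, overall_level=0.95):
    """Wald CI for an odds ratio, Bonferroni-adjusted across n_intervals.

    Each interval is built at level 1 - (1 - overall_level) / n_intervals,
    so that the joint coverage is at least overall_level.
    """
    alpha = (1.0 - overall_level) / n_intervals
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical estimate: log OR = 0.6 with standard error 0.2.
unadjusted = wald_or_ci(0.6, 0.2)
bonferroni = wald_or_ci(0.6, 0.2, n_intervals=242)
print(unadjusted)
print(bonferroni)
```

Because the critical value grows with the number of intervals, the adjusted interval is noticeably wider on the log scale than the unadjusted one, which is the conservatism referred to above.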
Results for antiepileptics and drugs used in opioid dependence have been previously described.1 Carbamates are rarely prescribed drugs with an important effect. They were the first anxiolytics and were later supplanted by benzodiazepines because side effects of carbamates (particularly drowsiness and withdrawal syndrome) are more frequent. The OR was higher for this class of anxiolytics than for benzodiazepines, which are themselves consistently associated with risk of traffic accidents. In addition to the drugs strongly associated with traffic crashes (carbamates, carboxamide-derivative antiepileptics, fatty acid-derivative antiepileptics, and drugs used in opioid dependence), it is interesting to note the association with potassium supplements. One potential explanation is that these are markers of cardiovascular diseases, which themselves may increase the risk of a crash.30
We aimed to illustrate the shrinkage approach in a high-dimensional epidemiologic data analysis compared with a conventional approach. Causal inference is not possible because substantial residual confounding by underlying health status likely remains. Another limitation is that the study sample slightly over-represented drivers injured in more severe crashes, and so, injury severity is a possible source of selection bias.1 More complete discussion of the epidemiologic results is, however, out of the scope of this study. Further investigations are required to isolate treatment effects, elucidate the underlying biologic mechanisms of these drugs, and evaluate their potential effect on driving.
To conclude, the lasso method provides an alternative analytic strategy to conventional approaches such as separate modeling (which provides pessimistic precision estimates and may neglect important confounders) or stepwise selection (which provides optimistic precision estimates and may lead to less stable models) in the analysis of large data sets. Furthermore, efficient and user-friendly algorithms are available. Other methods, especially Bayesian techniques, have also been proposed to address high-dimensional problems.24,31 Such methods are optimized for settings with many covariates, but not for situations with many subjects.
We thank the CESIR research group for its collaborative support: Fabienne Bazin (INSERM U657), Sylvie Blazejewski (CIC 0005, Bordeaux), Bernard Delorme (AFSSAPS), Geneviève Durrieu (Service de pharmacologie médicale et clinique, CHU Toulouse), Blandine Gadegbeku (Université de Lyon, IFSTTAR), Pierre-Olivier Girodet (CIC 0005, Bordeaux), Marcel Goldberg (INSERM U687-UVSQ), Bernard Laumon (Université de Lyon, IFSTTAR), Dominique Lauque (CHU Toulouse), Nathalie Lecoules (CHU Toulouse), Laurence Memes (CIC 0005, Bordeaux), Louis Merle (CHU Limoges), Jean-Louis Montastruc (Service de pharmacologie médicale et clinique, CRPV, INSERM U 1027, Université de Toulouse, CHU Toulouse), Nicholas Moore (INSERM U657, CIC 0005, Bordeaux), Pernelle Noize (Inserm U657), Nathalie Orsoni (CHU Limoges), Antoine Pariente (INSERM U657, CIC 0005, Bordeaux), Pierre Philip (Clinique du sommeil, CHU Bordeaux), Régis Ribéreau-Gayon (CHU Bordeaux), Louis-Rachid Salmi (INSERM U897, Université de Bordeaux Segalen), Aurore Tricotel (AFSSAPS). We also acknowledge the French National Health Insurance (CNAM-TS), the National Interministerial Road Safety Observatory (ONISR), and Agira-TransPV for providing data related to health insurance reimbursements and road traffic accidents, as well as the Public Health Research Federative Institute (IFR 99).
1. Orriols L, Delorme B, Gadegbeku B, et al. Prescription medicines and the risk of road traffic crashes: a French registry-based study. PLoS Med. 2010; 7: e1000366.
2. Hurvich CM, Tsai CL. The impact of model selection on inference in linear regression. Am Stat. 1990; 44: 214–217.
3. Breiman L. Heuristics of instability and stabilization in model selection. Ann Stat. 1996; 24: 2350–2383.
4. Austin PC. Using the bootstrap to improve estimation and confidence intervals for regression coefficients selected using backwards variable elimination. Stat Med. 2008; 27: 3286–3300.
5. Greenland S. Invited commentary: variable selection versus shrinkage in the control of multiple confounders. Am J Epidemiol. 2008; 167: 523–529; discussion 530–531.
6. Wiegand RE. Performance of using multiple stepwise algorithms for variable selection. Stat Med. 2010; 29: 1647–1659.
7. Liao H, Lynn HS. A survey of variable selection methods in two Chinese epidemiology journals. BMC Med Res Methodol. 2010; 10: 87.
8. Walter S, Tiemeier H. Variable selection: current practice in epidemiological studies. Eur J Epidemiol. 2009; 24: 733–736.
9. Robertson MD, Drummer OH. Responsibility analysis: a methodology to study the effects of drugs in driving. Accid Anal Prev. 1994; 26: 243–247.
10. Laumon B, Gadegbeku B, Martin JL, Biecheler MB. Cannabis intoxication and fatal road crashes in France: population based case-control study. BMJ. 2005; 331: 1371.
11. IMS Health. Enquête Permanente sur la Prescription Médicale (EPPM). Danbury, CT: IMS Health; 2005–2008.
12. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2011. ISBN 3-900051-07-0. Available at: http://www.R-project.org/
13. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33: 1–22.
14. Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology. 2001; 12: 291–294.
15. Bunea F, She Y, Ombao H, Gongvatana A, Devlin K, Cohen R. Penalized least squares regression methods and applications to neuroimaging. Neuroimage. 2011; 55: 1519–1527.
16. Strobl R, Stucki G, Grill E, Müller M, Mansmann U. Graphical models illustrated complex associations between variables describing human functioning. J Clin Epidemiol. 2009; 62: 922–933.
17. Mukherjee B, Ou HT, Wang F, Erickson SR. A new comorbidity index: the health-related quality of life comorbidity index. J Clin Epidemiol. 2011; 64: 309–319.
18. Tanck MW, Jukema JW, Zwinderman AH. Simultaneous estimation of gene-gene and gene-environment interactions for numerous loci using double penalized loglikelihood. Genet Epidemiol. 2006; 30: 645–651.
19. Archer KJ, Mas VR, David K, et al. Identifying genes for establishing a multigenic test for hepatocellular carcinoma surveillance in hepatitis C virus-positive cirrhotic patients. Cancer Epidemiol Biomarkers Prev. 2009; 18: 2929–2932.
20. D'Angelo GM, Rao D, Gu CC. Combining least absolute shrinkage and selection operator (LASSO) and principal-components analysis for detection of gene-gene interactions in genome-wide association studies. BMC Proc. 2009; 3(suppl 7): S62.
21. Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009; 25: 714–721.
22. Ayers KL, Cordell HJ. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol. 2010; 34: 879–891.
23. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high dimensional data. Ann Stat. 2009; 37: 246–270.
24. Herring AH. Nonparametric Bayes shrinkage for assessing exposures to mixtures subject to limits of detection. Epidemiology. 2010; 21(suppl 4): S71–S76.
25. Bach F. Bolasso: model consistent lasso estimation through the bootstrap. In: McCallum A, Roweis S eds. Proceedings of the 25th International Conference on Machine Learning (ICML). Helsinki, Finland: ACM; 2008.
26. Chatterjee A, Lahiri SN. Bootstrapping Lasso estimators. J Am Stat Assoc. 2011; 106: 608–625.
27. Pötscher BM, Schneider U. Confidence sets based on penalized maximum likelihood estimators in Gaussian regression. Electron J Stat. 2010; 4: 334–360.
28. Belloni A, Chernozhukov V. Least squares after model selection in high-dimensional sparse models. Bernoulli. In press.
29. Zou H. The adaptive Lasso and its oracle properties. J Am Stat Assoc. 2006; 101: 1418–1429.
30. Orriols L, Salmi LR, Philip P, et al. The impact of medicinal drugs on traffic safety: a systematic review of epidemiological studies. Pharmacoepidemiol Drug Saf. 2009; 18: 647–658.
31. Momoli F, Abrahamowicz M, Parent ME, Krewski D, Siemiatycki J. Analysis of multiple exposures: an empirical comparison of results from conventional and semi-Bayes modeling strategies. Epidemiology. 2010; 21: 144–151.