The dissection of the manifold small-effect determinants of complex diseases remains a major challenge for human genetics and genetic epidemiologists. As genotyping affordability has grown exponentially in recent years, analytic approaches in genetic epidemiology are rapidly evolving. Statistical methods are not standardized and appear insufficient to tackle the inherent complexity of multifactorial disorders.^{1,2} Although this state of affairs has led some observers to express pessimism,^{3} it also motivates more rigorous exploration of new analytic approaches.

Apparently promising results have recently been reported and widely publicized for a “genomic pathway approach” applied to the study of Parkinson disease.^{4} The aim of the initial study was to consider whether genetic variation in a specific biologic pathway determines Parkinson disease risk. The researchers began with bioinformatics-supported identification of candidate genes belonging to this so-called “axon-guidance pathway” and the corresponding single nucleotide polymorphisms (SNP) in a genome-wide association dataset that included 443 pairs of cases and unaffected siblings. Beginning model selection with 1460 candidate SNPs in 117 genes, they constructed regression models of very high statistical significance and extremely accurate discrimination of case/control status and prediction of age-at-onset and Parkinson-free survival. Model construction was done by stepwise variable selection methods for logistic, linear, and Cox models. Similarly, impressive results in terms of *P* value and model performance were obtained through analysis of a second dataset. The encouraging performance of the approach, which also found numerous significant gene:gene interactions in its final models, has been heralded as a major development of potential applicability to a wide variety of complex phenotypes.^{4,5} The same group of researchers subsequently applied this approach to another neurologic disease, yielding highly interesting results.^{6}

The modeling strategy^{4} has immediate appeal because it relies solely on standard regression methods which are familiar to most researchers. It furthermore allows for uncomplicated inclusion of interaction terms. Unfortunately, stepwise model selection has many problems, particularly in small datasets. For example, the distribution of the test statistics and apparent measures of validity become misleading and overoptimistic through the model selection process.^{7} There are methods to quantify these and other problems related to model instability.^{8,9} We aimed to elucidate the performance of the proposed modeling on the “genomic pathway scale,” where the number of predictors evaluated for inclusion in multivariable models may greatly exceed even large sample sizes, while the predictors typically are strongly correlated with each other. We restricted our permutation^{8} and data-splitting approaches^{9} to logistic regression models for the sake of simplicity.

#### METHODS

As a starting point, we downloaded the genome-wide association dataset^{10} used for replication purposes in the publication motivating our work, and we attempted to reproduce the modeling strategy previously described. The primary dataset^{4} had not yet been made publicly available at the time of our efforts. Based on the information given in the previous reports,^{4,10} we constructed an analysis dataset with 268 cases and 268 controls (after exclusion of one African American subject, 3 records lacking phenotype information, one subject with reported age-at-onset of 13 years). We considered 1195 SNPs in 22 genes as candidates for model inclusion.

Model building was designed to resemble as closely as possible the algorithm described in the motivating paper,^{4} as follows: (1) evaluation of the significance of each individual SNP (coded as recessive, dominant, and additive) in age- and sex-adjusted logistic regression models, keeping those SNPs with a one degree of freedom likelihood ratio test (LRT) *P* < 0.05 in their most significant coding for further evaluation; (2) standard stepwise forward model selection, starting with a model including only age and sex as predictors and selecting from the individually significant SNPs based on *P*_{LRT} < 0.05; (3) standard backward elimination of SNPs from the model created in the previous step based on the same significance level; (4) standard stepwise forward model selection, considering as candidate predictors all 2-way interactions between genetic predictors in the model produced in the third step based on *P*_{LRT} < 0.01; and (5) standard backward elimination of any interaction term in the model produced in the fourth step based on the same significance level. Because the original report mentioned evaluating SNPs in batches defined by decreasing completeness, the second and third steps were first carried out with SNPs having complete data (n = 761), subsequently allowing increasing numbers of missing values (200 SNPs had 1 missing genotype, 194 had 2–10, 36 had 11–50, and the remaining 4 had 51–96). The overall significance of the genetic effects was assessed by a likelihood ratio test comparing the final model with one including only age and sex predictors, with degrees of freedom equal to the number of genetic main effects and interaction terms in the final model. We used the more stringent interaction inclusion criterion of *P*_{LRT} < 0.01 (as compared with the cutoff of 0.05 reported previously^{4}) because the more lenient value led to convergence failure in our analysis.

We used a permutation-based approach to determine the empirical significance of the model produced by the above selection procedure. Such “*y*-scrambling” approaches have been suggested, in particular, for genetic studies dealing with a large number of highly correlated markers.^{8,11–13} Thus, we randomly reordered the case/control status of the subjects included in our analysis dataset, and ran the entire model building procedure on the permuted datasets. The case/control shuffling breaks any nonrandom associations between predictor variables and the outcome (in our case, Parkinson disease status), thus creating a dataset with only chance associations, while preserving the pattern of correlation among the predictors. We then compared the *P* value obtained for the genetic effects based on the original data with the distribution of *P* values produced with 1000 models based on permuted datasets. The above approach also breaks the nonrandom association between case/control status and age and sex, while preserving any chance associations possibly present between these 2 *a priori* included variables and the genetic predictors. We, therefore, repeated the above with a modified permutation procedure in which individuals' case/control status was shuffled together with their respective age and sex values against the genetic predictors.

We obtained an impression of the stability of the model selection (as well as the optimism of model performance measures) by partitioning the dataset into 10 equally sized subsets (maintaining case/control ratios of 1). We then constructed models as described above after setting aside each of the 10% subsets for validation purposes, ie, using 90% of the data as learning dataset. We compared the resulting models for the set of selected predictors and their effect directions.^{7} The area under the receiver operating characteristic curve (AUC) of the 90% models (ie, apparent performance) was compared with the AUC produced when using these models to predict the case/control status in the respective 10% validation sets (test performance).

As these 10% validation sets were rather small, the estimates of expected model performance were anticipated to show large variability. We thus supplemented these analyses by 500 rounds of data splitting.^{9} In line with the approach above, we excluded a random 10% of the observations (always equal numbers of cases and controls) for validation and repeated this 500 times, each time constructing models on the remaining 90% of the data. The larger number of runs allowed more reliable judgment of the optimism in model performance associated with the model selection algorithm under investigation, and the extent of overoptimism was estimated as the median of the 500 resulting differences calculated as the AUC for the 90% learning set minus the AUC for the 10% test set.

#### RESULTS

The final model is compared with the originally reported model in Table 1. Our model contained 39 SNPs and 24 interactions, while the model previously put forward featured 33 SNPs and 6 interactions. Only about half of the SNPs included in our model also appeared in the original one, although in several cases our predictors included SNPs in close proximity to those reported previously. The overall significance of the genetic effects based on the standard likelihood ratio test was *P* = 3.5 × 10^{−69} with an AUC of 0.979. Several of the predictors maintained in the model were associated with excessive standard errors indicative of model overfitting, while essentially all estimated coefficients were much larger than could reasonably be expected for genetic determinants of a multifactorial disease. In 1000 permutations, *P* values smaller than the original 3.5 × 10^{−69} were obtained 3 times, giving an estimate of the empirical *P* value of 0.003 (95% CI = 1.02 × 10^{−3} to 8.78 × 10^{−3}). This probably is a conservatively low estimate, because an additional 4 permutation rounds ended with convergence problems, indicative of perfect prediction of case/control status in at least some covariable strata. The Figure shows histograms of the empirical *P* values, as well as the odds ratios for being a case in subjects with predicted probability π > 0.75 in comparison to those with π < 0.25, along with the values based on the original data. The latter measure was used to support the magnitude of joint genetic effects in the original publication.^{4} Both values lie in the tails of the distributions, but—consistent with the estimated permutation *P* value—are far less extreme than suggested by the standard likelihood ratio test *P* value. For the modified permutation procedure preserving the relationship between age, sex, and the case/control status, the empirical *P* value obtained was 0.002 (95% CI = 5.5 × 10^{−4} to 7.3 × 10^{−3}) with 8 non-converged permutation rounds.

Table 1 Image Tools |
Table 1 Image Tools |
FIGURE. Histograms o... Image Tools |

We also fitted the model reported previously^{4} using the same selection of SNPs and gene:gene interactions with the same genotype coding. The effect estimates closely resembled those reported previously (eTable, http://links.lww.com/A1048).

When we applied the model selection procedure to 10 learning datasets created by excluding 10% of the original data, the models constructed based on the learning datasets (each containing 90% of original data) were rather variable. While the models were roughly similar in size, including 25–42 predictors and 0–13 gene:gene interactions, only a single SNP (rs865569) was included in all 10 models with the same genotype model and effect direction. Fourteen other SNPs occurred in more than half of the models (10 times: rs884787; 8–9 times: rs10498114, rs9865620, rs17070430, rs10507033, rs4334396; 6–7 times: rs1568359, rs40701, rs6868410, rs10832242, rs1466373, rs1160400, rs7506794, rs221015); sometimes there were inconsistencies regarding genotype model or effect direction, however. Altogether, 121 SNPs were selected at least once. The only interaction term occurring in more than one model was rs1719895:rs9865620, which appeared 3 times. The apparent validity of the 10 models as measured by the AUC in the learning datasets ranged from 0.879 to 0.972 with a median of 0.922; the test performance in the respective 10% excluded observations was much lower, with median AUC of 0.585 (range = 0.384–0.663).

The 500 rounds of data-splitting confirmed that the apparent validity of the pathway models tended to be close to the AUC of the model based on the entire dataset (Table 2). The median overoptimism in the discriminatory performance of the genomic pathway models was 0.373 (95% CI = 0.363–0.381). In contrast, only very limited optimism was present in models that contained only age and sex (median AUC decrease = 0.010; Table 2), as might be expected for models including few predictors that did not undergo a selection procedure. Intriguingly, the expected discriminatory performance as measured by the AUC in the 10% test sets was significantly higher in the model containing only age and sex than in the full genomic pathway models, with a difference of −0.040 (95% CI = −0.049 to −0.031).

#### DISCUSSION

Based on our analysis of the performance of the “genomic pathway approach” to multifactorial diseases, we urge caution when judging the significance and validity of models constructed using such algorithms. In particular, *P* values and apparent measures of model performance obtained by standard approaches are strongly misleading. We furthermore observed a lack of stability regarding the predictors selected into the model.

It is well known that the interpretation of model performance measures is difficult if the predictors included in the model underwent a model selection process.^{7} Even so, we were surprised by the magnitude of the inflation of the significance of genetic effects in “genomic pathway models” constructed according to the proposed selection algorithm.^{4} The group who first advocated this approach attempted to substantiate the significance of their findings by comparing the test statistic of their final model with those from similar models based on randomly selected genomic SNPs.^{4} We instead applied the entire model selection procedure to a permuted dataset, which is arguably more appropriate^{12,13}; apart from random genomic SNPs being different from the pathway SNP set in terms of linkage disequilibrium patterns, proportions of synonymous and nonsynonymous SNPs, etc., the absence of selection based on marginal significance essentially avoids the substantial model overfitting. Overfitting is the core problem leading to the strong overoptimism regarding discriminatory performance and overall significance of the pathway models described, which renders such comparisons unfair. The permutation-based *P* value of 0.003 was still significant at the commonly used level α = 0.05. However, this finding may still be considered misleading because the expected discriminatory performance of models excluding any genetic predictors was higher than for the genomic pathway models. The permutation procedure primarily applied did not break possible chance associations between the SNPs and the established Parkinson predictors age and sex, and one could hypothesize that the stepwise genetic predictor selection might have capitalized on correlations of SNPs with these predictors that were included *a priori*. When we modified the permutation algorithm accordingly, the results remained similar, although the larger number of nonconverging models with this approach might suggest that the pathway model indeed was empirically less significant when the correlations among age, sex, and genetic predictors were removed.

Furthermore, *a priori* evidence for an involvement of genetic variations in the pathway genes examined in Parkinson risk determination was scarce, leading to a low prior probability of a genuine association and a high probability of obtaining a false-positive finding. In the case of the first dataset used in the publication motivating the present work,^{4} the apparent significance of the pathway model would furthermore have been compromised by what is known as the “Winner's curse,”^{14} since this pathway appears to have been selected for analyses primarily because a SNP located in one of its genes had been the most significant finding in a previous analysis involving the same dataset.^{4,15} Such bias also could not have been overcome by the permutation procedures as presented here, just as for any systematic error that might be present in a given dataset (such as population stratification or genotyping problems).

The genomic pathway model selection procedure appeared rather unstable when we attempted to reconstruct the original model.^{4} One explanation may be that we were not able to reconstruct the analysis dataset exactly; comparison with allele frequencies reported in supplementary materials^{4} suggested that our analysis set differed from the original one by one subject. Another explanation could be that we were not able to obtain detailed information regarding how SNPs with increasing numbers of missing values were handled during model selection. The model selection procedure was sensitive enough to these apparently trivial differences to yield a rather different model than originally reported. Fitting the previously reported final model produced regression coefficient estimates that were almost identical to the original estimates. Interestingly, we had to apply a more stringent inclusion criterion for the gene:gene interaction terms to achieve model convergence. When excluding 10% subsets of the dataset for validation purposes, the models based on the 90% learning sets further supported the notion of a lack of stability in the selection of genetic determinants. Very few SNPs were selected in all 10 models or even in the majority of models, thus possibly qualifying as truly associated “dominant” predictors.^{7} In contrast, a large number of SNPs occurred in very few models and most likely presented noise variables that were correlated with the outcome purely by chance.

Relaxed significance criteria have been considered paradoxically beneficial in genetic epidemiology, as they might ameliorate publication bias due to nonreporting of studies not producing statistically significant results.^{3} However, the repeated multiple testing that occurs in the individual evaluation of hundreds or thousands of SNPs for possible inclusion in the construction of a genomic pathway model, as attempted here, is an extreme setting. Despite the overall plausibility of the hypothesis that some biologic pathway is highly relevant for the determination of a complex trait, the probability that any individual SNP is indeed causally associated with the outcome (or significantly correlated with a hidden disease locus) remains very limited. Thus, it must be expected that most of individually significant predictors from which the pathway models are constructed present chance associations.^{16} Model selection procedures generally have to balance such overfitting with predictive performance^{7}; estimating the latter ultimately should not be based solely on apparent measures of validity.^{17} Our observation that the expected external performance was worse for models constructed according to the genomic pathway approach than for much more parsimonious models (predicting case/control status simply from age and sex) demonstrates the detrimental effect of an imbalance between overfitting and predictive performance on models constructed through extensive stepwise variable selection, particularly in datasets that are small in comparison with the number of candidate predictors. Our results suggest that the genomic pathway modeling approach in the way currently put forward is not useful when trying to elucidate the complex genetic structure of multifactorial phenotypes.

Some possible steps forward can be imagined. Generally, significance assessments should account for the model selection procedures applied, eg, by permutation approaches such as those applied in this work. Likewise, it appears crucial to estimate the overoptimism in discriminatory performance. To this end, the data-splitting approach we used provides a reasonable estimate of the magnitude of the problem, though much larger numbers of observations should be available as independent validation datasets for accurate estimation of expected external validity.^{18} Furthermore, our approach did not allow us to determine the actual overoptimism present in the model constructed based on the full dataset. More sophisticated procedures for this purpose are an active field of research. At this point, the immediate accessibility of pure data-splitting in terms of evaluating predictive performance led us to refrain from more complicated alternatives, which also have hardly been investigated in the context of datasets of comparable structure and by themselves are unlikely to be an optimal solution for all scenarios.^{19} Research focusing on microarray studies of gene expression is facing similar problems. Due to higher costs, the ratio of sample size to predictor variables tends to be even less favorable.^{20} Comparing a number of methods to estimate the optimism in predictive performance in several simulated microarray scenarios, Jiang and Simon^{21} also found that the optimal approach depends on sample size and extent of true associations present. Shrinkage-based methods might be a promising option in regard to ameliorating the overfitting bias in individual SNP effect estimates, but additional research is needed to determine their performance and applicability in scenarios such as those examined here.^{22}

The varying selection of predictors in resampled or split datasets also should be examined more systematically. It has been suggested previously that relevant predictors could possibly be identified by the inclusion frequency in resampled datasets.^{23} One could envision comparatively simple algorithms taking into account both inclusion frequencies and linkage disequilibrium patterns of genetic markers that yield very interesting additional information on relevant genetic determinants of complex phenotypes. The microarray literature is rife with ongoing developments in classification and predictor selection methods, such as those based on extensions of linear discriminant analysis^{24} or machine-learning algorithms combined with various model selection procedures.^{25} An emerging theme is that the predictors selected may vary strongly among methods, and comparative performance of the approaches can be profoundly influenced by the data structure. The so-called “lasso” procedure, which aims to reduce overfitting and bias in estimated coefficients in the presence of noisy predictors, has been evaluated for various regression models and may outperform stepwise model selection.^{26} Its combination with hierarchical clustering and predictor averaging for the purpose of reducing dimensionality in microarray studies has been suggested only very recently.^{27} A modified (penalized) logistic regression approach with stepwise model selection has recently been put forward explicitly for modeling gene interactions.^{28} The results are promising, but pertain to situations with fewer candidate predictors than observations.

Notably, another group of researchers very recently attempted to replicate the findings of the original report^{4} in an independent population, and the results were not consistent.^{29} In one of the 2 datasets (n_{1} = 622; n_{2} = 421) a score based on the coefficients presented originally was marginally significantly associated with case/control status (with *P* = 0.049, not adjusted for 2 tests). When the researches re-estimated coefficients for the 23 SNPs included in the original final pathway model, only 2 single SNPs were significantly associated with disease status (one SNP with uncorrected *P* = 0.049 in dataset 1; another SNP with uncorrected *P* = 0.027 in dataset 2). Subsequently, these researchers likewise applied permutation procedures using the dataset examined in the present report. To obtain null distributions against which to compare the originally reported *P* values, they first examined *P* values of “genomic pathway models” constructed from similar numbers of randomly selected candidate SNPs, and subsequently constructed models from these random SNPs after additionally permuting the case/control status. They neither attempted to recreate the originally reported model, nor obtained a permutation *P* value with their implementation of the model selection algorithm. However they did find that the originally reported *P* value fell somewhere in the middle of their 2 strongly overlapping empirical *P* value distributions, and they suggested that *P* values as originally reported should be compared with such null distributions rather than being taken at face value. An interesting difference between their model selection and the original selection was that they imputed missing genotypes. Because many subjects might be dropped from model selection steps due to relatively few missing SNP values, imputation seems a very sensible undertaking. Yet, the implications of the manifold possible imputation procedures for extensive stepwise model selection approaches in datasets as described here certainly deserve further study.

An approach to controlling for the number of SNPs in a model constructed from all SNPs in a pathway, ie, without model selection, was published during the revision stages of our work.^{30} In a new genome-wide dataset, the axon-guidance pathway showed the third most significant association with Parkinson case/control status (*P* = 0.04; corrected for number of SNPs in the pathway, although apparently not for the number of pathways tested), although it was clearly nonsignificant in the dataset used in the present work. These discrepant results were judged as a partial replication of the original pathway association, but are generally in line with our findings that a pathway model constructed as described in our study needs to be assessed based on significance and performance measures corrected for overoptimism introduced through model selection.

Two final caveats should be mentioned. First, the report motivating our work presented bootstrap confidence intervals around the overall *P* values. These do not convey information equivalent to our permutation approach. They essentially describe the sampling variability in *P* values but do not correct for the overfitting introduced through the model selection procedure. Second, we analyzed only one of the datasets and phenotypes examined in the original report, because the other dataset was not publicly accessible at the time of our analyses. Even so, we have no reason to believe that the overoptimism demonstrated for the case/control genomic pathway model developed and described here would not extend to other phenotypes and comparable datasets. Nonetheless, we encourage other researchers to consider our report as an invitation to refute our concerns through thorough analyses of similar pertinent data and careful presentation of contradictory findings.

There is no consensus in the area of dissecting the genetic structure of multifactorial, complex phenotypes. Furthermore, evaluations of model selection procedures based on datasets resembling the genomic pathway scale are virtually absent. This report demonstrates the extent of possible error in assessments of significance and model performance when models are obtained through a predictor selection procedure based on rather standard methods applied to large SNP datasets, as has recently been suggested for the purpose of multifactorial disease modeling. Our results highlight the necessity of rigorous examination of even traditional modeling strategies transferred to multifactorial disease genetics, and the urgent need for additional methodologic research in this exciting field.

#### ACKNOWLEDGMENTS

This study used data from the SNP Resource at the NINDS Human Genetics Resource Center DNA and Cell Line Repository (http://ccr.coriell.org/ninds), as well as clinical data. The original genotyping was performed in the laboratory of Drs. Singleton and Hardy (NIA, LNG), Bethesda, MD, USA.