It has been argued that the future of genetic studies of complex human diseases will rely more and more on the epidemiologic association paradigm. ^{1,2} In particular, an association approach, the application of the transmission/disequilibrium test (TDT) in a case-parents study, has received much attention. ^{3,4} TDT is a simple and robust method to detect linkage disequilibrium (LD) between a marker locus and a disease-susceptibility locus. ^{3,4} (In this paper, LD denotes the presence of both linkage and association.) By typing markers within a candidate gene/region and performing multiple TDTs, one can test whether a disease-susceptibility gene is in LD with at least one of the typed markers, and thereby infer whether the sought-after gene is within (or is physically close to) the candidate gene/region. Note that the focus here is on hypothesis testing rather than on estimation, because, as pointed out by Weinberg, ^{5} “…Measures of association between risk and a marker allele should thus vary across populations, and are not particularly meaningful in themselves.”

With the rapid progress in the Human Genome Project, it is expected that many single nucleotide polymorphisms (SNPs) will be identified and genotyped within a very short physical distance (∼1 per kb, on average, in the human genome). ^{6,7} Genotyping of SNPs has the potential for automation, which makes them the markers of choice for a large-scale epidemiologic study. Analyzing such data of multiple tightly linked markers within a candidate gene/region poses a challenge, however. With this dense marker map, tests conducted at nearby markers may not be independent, and a simple use of the Bonferroni correction (BC) for multiple comparisons will lead to a conservative test and hence a power loss. ^{8}

In this paper, I propose a new method to test multiple markers simultaneously. The method is based on principal components of the transmission patterns of the marker genotypes. It is designed specifically to deal with a dense array of SNPs (biallelic markers with genotypic, but not haplotypic, information available). Because the number of SNPs could be very large, it is assumed that resolving haplotype phase ambiguity by other means is impossible or impractical.

##### Data Structure and Basic Notations

Assume that there are a total of *n* case-parents triads available for study, indexed by *i* (*i* = 1,…, *n*). For each member of a triad, a total of *p* tightly linked biallelic markers within a certain candidate gene/region are genotyped. The markers are indexed by *j* (*j* = 1,…, *p*). The index *j* may or may not correspond to the marker order on the chromosome. At the *j* th marker locus, one allele is arbitrarily specified to be the *M*_{j} allele, and the other, the *m*_{j} allele. The first two columns in Table 1 present typical data of case-parents triads (for brevity, genotypes for only three linked markers are shown).

Following the same principle as the TDT, ^{3,4} one can write down for each triad the nontransmitted genotype at the *p* markers, given the genotypes of the case and its parents. Note that this can be done unambiguously without knowing the haplotypes (see the “Nontransmitted Genotype” in Table 1, for example). Next, the difference in the ‘*M*’-allele count between the transmitted genotype (the case) and the nontransmitted genotype is calculated for each triad and each locus (see “Difference in ‘*M*’-Allele Counting” in Table 1, for example). These differences are denoted *x*_{ij} (for the *i* th triad and the *j* th locus). Note that *x*_{ij} so defined can take on five possible values: −2, −1, 0, 1, 2. The *x*_{ij}s are the basic data to be analyzed.
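
The transmission-difference calculation can be sketched in a few lines; this is a minimal illustration (the function name and the 0/1/2 genotype coding as ‘*M*’-allele counts are my own, not from the paper):

```python
import numpy as np

def transmission_differences(case, father, mother):
    """Return x_j = (transmitted 'M' count) - (nontransmitted 'M' count).

    Genotypes are coded as counts of the 'M' allele (0, 1, or 2) at each
    locus. The nontransmitted genotype carries the parental alleles not
    passed to the case, so its 'M' count is (father + mother) - case, and
    the difference reduces to 2*case - father - mother. Values range over
    -2, -1, 0, 1, 2.
    """
    case, father, mother = (np.asarray(a) for a in (case, father, mother))
    return 2 * case - father - mother

# One triad, three markers: case MM/Mm/mm, father Mm/Mm/Mm, mother Mm/mm/Mm
x = transmission_differences([2, 1, 0], [1, 1, 1], [1, 0, 1])  # -> [2, 1, -2]
```

As the comment notes, no haplotype information is needed: the count difference is determined by the three genotypes alone.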

Let *X*′ = (*X*_{1},…, *X*_{p}) denote the underlying *p*-dimensional random vector that gives rise to the observed data *x*_{ij} (the *x*_{ij}s are realizations of *X*). The random variables *X*_{1},…, *X*_{p} are naturally correlated to some degree, as they represent transmission patterns for a dense array of linked markers. To acknowledge this, let Σ be the covariance matrix of *X*.

##### The Principal Component Test

To integrate the information from the *p* markers, principal component analysis ^{9} is used. The principal components are particular linear combinations of *X*_{1},…, *X*_{p}. Assume that Σ has eigenvalue-eigenvector pairs (λ_{1}, *e*_{1}),…, (λ_{p}, *e*_{p}), where λ_{1} ≥ … ≥ λ_{p} ≥ 0. The *k* th (*k* = 1,…, *p*) principal component is *Y*_{k} = *e*_{k}′*X* = *e*_{1k}*X*_{1} + … + *e*_{pk}*X*_{p}, with variance Var(*Y*_{k}) = λ_{k}. Note that the development of principal component theory does not require a multivariate normal assumption. ^{9}

If none of the *p* markers is in LD with the disease-susceptibility gene, the expected allele counts at any marker locus in the transmitted and in the nontransmitted genotypes will tend to be equal, because of random segregation/union of gametes. That is, *E* (*X*_{j}) = 0 for *j* = 1,…, *p* under the null hypothesis of no LD. As a consequence, we also have *E* (*Y*_{k}) = 0 for *k* = 1,…, *p*.

To develop the principal component test, the sample covariance matrix *S* is calculated from the *x*_{ij} without centering, that is, under the null hypothesis of no LD. The entry in row *j*, column *j*′ of *S* is Σ_{i = 1}^{n} *x*_{ij} · *x*_{ij′}/ *n*. Standard methods can then be applied to obtain the eigenvalue-eigenvector pairs of *S*. ^{10} The *k* th sample eigenvalue is denoted λ̂_{k}, and the *k* th sample eigenvector is denoted *ê*_{k}, with elements *ê*_{1k},…, *ê*_{pk}. The *k* th sample principal component for the *i* th case-parents triad is then *y*_{ik} = *ê*_{1k}*x*_{i1} + … + *ê*_{pk}*x*_{ip}. Note that λ̂_{k} ≥ 0 for *k* = 1,…, *p*, because *S* is a non-negative definite matrix (see Corollary 12.2.2 of reference ^{11}). In the following, attention will be focused on the first *q* (*q* ≤ *p*) principal components that have positive eigenvalues.

Under the null hypothesis of no LD, each of the following statistics, *T*_{k} (*k* = 1,…, *q*), is asymptotically distributed as a χ^{2} distribution with one degree of freedom (df) ^{9}:

*T*_{k} = (Σ_{i = 1}^{n} *y*_{ik})^{2}/(*n* λ̂_{k})
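
The whole pipeline, from the *x*_{ij} to the per-component statistics, can be sketched as follows; this is an illustrative implementation (function and variable names are mine), assuming the differences are held in an *n* × *p* array:

```python
import numpy as np

def pc_statistics(X):
    """Given an n-by-p array of transmission differences x_ij, return the
    sample eigenvalues and the 1-df chi-square statistics T_k for the q
    components with positive eigenvalues."""
    n = X.shape[0]
    S = X.T @ X / n                       # uncentered covariance: entry (j, j') = sum_i x_ij x_ij' / n
    lam, E = np.linalg.eigh(S)            # eigh returns eigenvalues in ascending order
    lam, E = lam[::-1], E[:, ::-1]        # reorder so that lam_1 >= ... >= lam_p
    pos = lam > 1e-12 * max(lam[0], 1.0)  # keep the q components with positive eigenvalues
    lam, E = lam[pos], E[:, pos]
    Y = X @ E                             # y_ik = e_1k * x_i1 + ... + e_pk * x_ip
    T = Y.sum(axis=0) ** 2 / (n * lam)    # T_k = (sum_i y_ik)^2 / (n * lam_k)
    return lam, T
```

Because each *T*_{k} squares the component score sum, flipping the sign of an eigenvector (or relabeling the ‘*M*’/‘*m*’ alleles) leaves the statistics unchanged, as noted in the text.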

It is noteworthy that different specifications at the outset of the ‘*M*’ and ‘*m*’ alleles at the marker loci will lead to sign changes in the *x*_{ij} but always the same *T*_{k}.

The principal components with larger variances may be more informative and may have greater power to detect LD than those with smaller variances. (Simulation studies described in the next section confirmed this speculation.) The principal component test based on the first *L* (*L* ≤ *q*) components (denoted PCT_{L}) is

PCT_{L} = max(*T*_{1},…, *T*_{L})

The test is to be referred to the 1-df χ^{2} distribution but with the significance level (α level) adjusted to α/ *L* (Bonferroni correction). The determination of a suitable *L* may pose a problem in real practice. Here, an adaptive approach, the Adaptive PRIncipal COmponent Test (APRICOT), is suggested. APRICOT considers the first and, successively, those higher-order principal components with eigenvalues “standing out.” The *l* th principal component has a “stand-out” eigenvalue if ψ_{l} = (λ̂_{l} − λ̂_{l + 1})/λ̂_{l} > *c*, where *c* is a predetermined threshold. For example, if the eigenvalues of the first five principal components are λ̂_{1} = 10.0, λ̂_{2} = 8.5, λ̂_{3} = 8.0, λ̂_{4} = 7.8, and λ̂_{5} = 7.7 (ψ_{1} = 15.0%, ψ_{2} = 5.9%, ψ_{3} = 2.5%, ψ_{4} = 1.3%), we will take *L* = 3 for *c* = 2%, *L* = 2 for *c* = 5%, and *L* = 1 for *c* = 20% (the first principal component is always taken).
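
The adaptive rule can be made concrete with a short helper; this is my own sketch, with the stopping rule inferred from the worked example above (keep leading components while successive eigenvalue gaps exceed the threshold, always keeping at least the first):

```python
import numpy as np

def choose_L(lam, c):
    """Select L for APRICOT: scan psi_l = (lam_l - lam_{l+1}) / lam_l for
    l = 1, 2, ... and keep successive leading components while psi_l > c.
    The first principal component is always taken."""
    lam = np.asarray(lam, dtype=float)
    psi = (lam[:-1] - lam[1:]) / lam[:-1]
    L = 0
    while L < len(psi) and psi[L] > c:
        L += 1
    return max(L, 1)

eigenvalues = [10.0, 8.5, 8.0, 7.8, 7.7]  # the worked example in the text
```

Applied to these eigenvalues, the helper reproduces the choices stated in the text: *L* = 3 for *c* = 2%, *L* = 2 for *c* = 5%, and *L* = 1 for *c* = 20%.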

It is of interest to know how the PCT and the TDT compare when there is only one marker (*p* = 1). In this situation, PCT = (Σ_{i = 1}^{n} *x*_{i})^{2}/(Σ_{i = 1}^{n} *x*_{i}^{2}) (the index *j* has been dropped). This is not exactly the TDT, but it is algebraically identical to the (one-marker) pedigree disequilibrium test (PDT), ^{12} a recently proposed statistic. Nevertheless, the PDT (and hence the PCT) and the TDT are asymptotically equivalent under the null hypothesis of no LD. ^{12}


#### Methods

##### Simulation Studies

A candidate gene (or candidate region) of 100 kb was considered in the simulation. A dense array of SNPs was genotyped within this candidate gene/region (the numbers of markers examined were 20 and 100, respectively). For comparison, the effect of using only one marker was also examined. In each round of the simulation, the locations of these markers were randomly generated within the candidate gene/region according to a uniform distribution. The number of case-parents triads was 200. With a candidate region of 100 kb, one should expect to observe only 0.2 recombination events in the data of 200 case-parents triads, which can reasonably be ignored. Therefore, the recombination fractions (θs) between markers were assumed to be zero in generating the triads.

For the LD alternatives (the H_{1} hypotheses), the disease-susceptibility gene (biallelic, with D and d alleles) was assumed to be within the candidate gene/region with its location generated according to a uniform distribution in each round of the simulation. The penetrances of dd, Dd, and DD were assumed to be 0.1, 0.3, and 0.9, respectively. (These represent complicated situations of phenocopy and reduced penetrance.)

Three settings of population history were considered for an unstructured (homogeneous) population. The first setting envisions that the D allele was introduced into the population by a mutational process 2000 generations ago. The mutation occurred in an individual with a specific haplotype—the “ancestral haplotype.” Over time, this ancestral haplotype broke up because of meiotic recombination. (The distances from the disease locus to the right and to the left breakpoints were distributed as independent exponential random variates with a mean of 1/ *t* Morgan, where *t* is the number of elapsed generations.) ^{13} Allelic heterogeneity was considered in the second setting. It was assumed that only 80% of the D alleles could trace their origins to the same ancestral haplotype 2000 generations ago; the other 20% arose throughout history on nonspecific haplotypes. In the third setting, the D allele was assumed to originate from a series of mutations, each occurring on a specific and distinct haplotype. The oldest mutation occurred 3000 generations ago and accounts for 20% of the extant D alleles. The second mutation occurred 2000 generations ago, accounting for 50%. The most recent mutation occurred 1000 generations ago, accounting for the rest. The present-day frequency of the D allele was assumed to be 0.05 in all three settings.
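
The decay of the ancestral haplotype can be sketched directly from the stated assumption (independent exponential breakpoint distances with mean 1/*t* Morgan); the code below is illustrative only, with names and the seed chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def breakpoint_distances(t, size):
    """Distances (in Morgans) from the disease locus to the left and right
    recombination breakpoints after t generations, drawn as independent
    exponential variates with mean 1/t, as assumed in the simulation."""
    left = rng.exponential(scale=1.0 / t, size=size)
    right = rng.exponential(scale=1.0 / t, size=size)
    return left, right

left, right = breakpoint_distances(2000, 100_000)
```

With *t* = 2000, the mean preserved distance on each side is 1/2000 Morgan = 0.05 cM; under the common rule of thumb of roughly 1 cM per Mb, that is on the order of 50 kb, comparable to the 100-kb candidate region.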

In each round of simulation, the marker frequencies were uniformly generated between 0.1 and 0.9, and the Lewontin disequilibrium coefficients ^{14} between two adjacent markers were uniformly generated between −0.9 and 0.9. With these parameters, the markers on the ancestral haplotype(s) were generated according to a first-order Markov model. ^{13} (Each round of simulation corresponds to a different population history.) To generate the haplotypes in the present-day population, all the D-haplotypes of the same specific origin were assumed to have the same markers as the corresponding ancestral haplotype up to the breakup points, whereas the above-mentioned Markov model was used repeatedly to generate marker data for (1) a d-haplotype, (2) a nonspecific D-haplotype, and (3) an ancestral D-haplotype beyond the breakup points. Note that for this unstructured population, the LD between markers was in effect simulated as arising from two sources: the new mutation and the background disequilibrium.
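
A first-order Markov sketch of generating one haplotype under this parameterization (marginal marker frequencies and Lewontin D′ between adjacent markers) is shown below. The D′-to-D conversion and all names are my own; this is only an illustration of the kind of model cited, not the paper's code:

```python
import numpy as np

def markov_haplotype(freqs, dprimes, rng):
    """Generate one haplotype (True = 'M' allele) marker by marker, with
    marginal P(M) = freqs[j] and Lewontin D' = dprimes[j] between adjacent
    markers j and j+1 (first-order Markov chain)."""
    h = [rng.random() < freqs[0]]
    for j, dp in enumerate(dprimes):
        pa, pb = freqs[j], freqs[j + 1]
        # Convert D' to the raw coefficient D = P(M_j, M_{j+1}) - pa*pb,
        # scaling by the appropriate bound so probabilities stay in [0, 1].
        if dp >= 0:
            d = dp * min(pa * (1 - pb), (1 - pa) * pb)
        else:
            d = dp * min(pa * pb, (1 - pa) * (1 - pb))
        # Conditional P(M at j+1) given the allele just drawn at marker j:
        # P(B|A) = pb + D/pa;  P(B|a) = pb - D/(1 - pa)
        p_next = pb + d / pa if h[-1] else pb - d / (1 - pa)
        h.append(rng.random() < p_next)
    return h

rng = np.random.default_rng(seed=2)
h = markov_haplotype([0.5, 0.5, 0.3], [1.0, 0.0], rng)
```

Setting D′ = 1 between the first two markers forces their alleles to agree on every generated haplotype, while D′ = 0 makes the third marker independent of the second, which is a quick sanity check on the conversion.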

The case of a structured population (population stratification and population admixture) was also considered. For the former, the study population was assumed to be composed of two subpopulations (the first subpopulation constitutes 40%, and the second, 60%). Random mating occurs within the subpopulations, but the two subpopulations do not intermix. The present-day frequency of the D allele is 0.25 in the first subpopulation, where it was introduced 10,000 generations ago through a certain ancestral haplotype. In the second subpopulation, the present-day frequency of the D allele is 0.0001; it was introduced 5000 generations ago through a different ancestral haplotype. As before, the background disequilibrium was generated using the Markov model. Note that the two subpopulations can assume different sets of marker frequencies and different sets of Lewontin disequilibrium coefficients. This creates yet another source of LD between markers. It could also induce an artificial association between the markers and the disease-susceptibility gene, even if they are not on the same chromosome. As for population admixture, everything remains the same as in the case of population stratification, except that the barrier between the two subpopulations broke down in the last generation, so that random mating can take place in the admixed population at large.

Two types of null hypothesis were considered: H_{01} (no linkage) and H_{02} (no association). Under H_{01}, the disease-susceptibility gene was assumed to be located on a different chromosome. Thus, the recombination fraction between the disease locus and each of the marker loci in the candidate gene/region is 0.5. The LD between markers arises solely from the background disequilibrium or, in the case of population structure, from the stratification/admixture additionally. As for H_{02}, a disease-susceptibility locus is indeed located within the candidate gene/region, except that the dd, Dd, and DD genotypes were assumed to have equal penetrances for the disease under study. In this way, the LD between markers can originate from the new mutation (D mutation) as in H_{1}. Yet the D mutation is not related to the disease under study, though it may confer susceptibility to other diseases.

Ten thousand simulations were performed for each scenario. Note that although haplotypes were simulated in this study, the analysis was restricted to genotype data. In each round of the simulation, TDT (the Bonferroni correction and the Monte-Carlo [MC] test of McIntyre *et al.*^{8} were used for multiple-testing adjustment) and APRICOT (*c* = 5%, 10%, and 20%, respectively) were applied to test for LD with the disease-susceptibility gene. The α level was set at 0.05. The SAS/IML package was used to calculate the eigenvalues and eigenvectors required by APRICOT.

#### Results

Table 2 presents the results for the unstructured populations. We see that the estimated sizes of the tests are close to the nominal α level for the various tests studied, under both the H_{01} hypothesis and the three H_{02} hypotheses. As expected, BC tends to be conservative, especially in the case of 100 markers, whereas MC has roughly the correct size.

When there is only one marker, we see, as expected, that TDT and PCT have almost equal power. Furthermore, the power increases as the number of markers increases. This is true even for a simple procedure such as BC. MC offers a power improvement over BC, but only a slight one. By contrast, the superiority of APRICOT over BC and MC is very striking, especially in the case of 100 markers. These findings hold regardless of whether there is a single ancestral haplotype or multiple ones, and whether or not there is allelic heterogeneity.

The results for the structured populations are presented in Table 3. As before, APRICOT and MC have roughly the correct size, and BC has a tendency toward conservatism. The overwhelming superiority of APRICOT over BC and MC in terms of power is again demonstrable. It is known for the (single-marker) TDT that when two populations admix, there is better power to detect LD than when they remain separate. ^{15,16} Here, we see that the same applies to APRICOT.

In the simulation study, APRICOT typically picked up only the first few principal components (see Tables 2 and 3). These account for roughly 20% of the total variance when *p* = 20 and 10% when *p* = 100. To have a higher percentage of explained variance, one can always include more principal components by setting a lower threshold (*c* <5%), but this does not help to raise the power. Tables 2 and 3 suggest that a higher threshold of *c* = 20% may be satisfactory for unstructured populations, and a moderate threshold of *c* = 10% may be satisfactory for admixed populations. Overall, however, the choice of threshold seems to have little effect.

The simulation was also conducted assuming a smaller genetic effect (penetrances: 0.2 [dd], 0.4 [Dd], and 0.8 [DD]). The superiority of the APRICOT was still demonstrable. For example, in the case of one ancestral haplotype in an unstructured population, the powers were 44.46% (APRICOT, *c* = 20%)≫12.85% (MC)>11.78% (BC) when typing 100 markers in 200 case-parents triads.

#### Discussion

Whether or not the epidemiologic association approach will become mainstream (a paradigm shift) for the genetic dissection of complex human diseases may ultimately depend on two issues. One is the creation of a marker map that is dense enough to have at least one marker in strong LD with the disease-susceptibility gene. The other is the integration of information from the vast number of markers in the map without too much penalty for multiple testing. This paper addresses the second issue. I found that by testing for the principal components of the markers rather than testing one by one for the individual markers themselves, we can, in effect, circumvent the multiple-testing problem. Principal components are particular linear combinations of the markers. Thus, testing for the principal components amounts to testing for all the markers simultaneously. APRICOT as proposed in this paper focuses attention only on the first few principal components. This is because they have larger variances and thus may contain more information to detect LD. The Monte-Carlo simulation conducted in this paper confirms this speculation, and demonstrates a substantial power advantage over the BC and MC. In the not-too-distant future (when the cost of typing SNPs goes down and a dense or even saturated array of SNPs becomes a practical reality), it will be of interest to test whether APRICOT lives up to expectations, when applied to real data.

In the recent literature, several multilocus extensions for TDT have been proposed. ^{17–25} Most of the methods assume that the haplotypes are known in advance or otherwise the phase ambiguity has to be resolved from data of other family members or from a probabilistic modeling. ^{17–24} This could limit their usage if many linked markers (such as a dense array of SNPs) are to be considered as a whole in a study, because under this condition, the number of potential haplotypes becomes astronomically large. ^{26} One recent proposal makes use of the genotype data. ^{25} It calls for a parametric model that assumes a single ancestral mutation in a random-mating population. Thus, the method is not robust to population structure, which is not in keeping with the basic principle of a TDT. ^{3,4} The method is also very computationally demanding.

In this paper, the principal component methodology is applied only to infer LD between the set of markers and the disease-susceptibility gene. I do not intend to interpret the principal components themselves, nor do I seek to refine further the disease gene location, given the eigenvalues and the eigenvectors. The reason is simple. The principal components are derived from the covariance matrix of the marker transmission patterns. The eigenvectors (the coefficients to combine the markers) will be a complex function of the marker allele frequencies, the LD between markers, the population structure, etc. Even if a large weight is found to be attached to a particular marker in the linear combination for the most substantial principal component, it may not necessarily imply that the marker is physically close to the disease-susceptibility gene. It may simply mean that the marker is more polymorphic (allele frequency closer to 0.5). Alternatively, it could also mean that there exists a large difference in the allele frequencies between the two subpopulations for the marker. Thus, the principal component approach as described in this paper lacks the ability to pinpoint the precise location of the disease-susceptibility gene. However, it can still be considered a fine (although not exact) mapping technique, because we considered an array of SNPs tightly packed within an extremely short candidate gene/region at the outset. (By contrast, if a loose array of markers with spacing, say >10 cM, is used, the markers will generally be in equilibrium, *ie*, independent, and the principal components test will simply pick out the more informative markers one by one. Because there is no pooling of information, APRICOT is not expected to outperform the conventional methods in such a situation.)

The proposed APRICOT could have further improvements and extensions. First, further studies are warranted to examine the optimal choice of the threshold value (*c*), or possibly even to refine the adaptive algorithm of APRICOT altogether. Second, APRICOT in its present form can deal only with the simplest family data of case-parents triads. In real practice, however, we could encounter more complex situations, such as families with multiple affected siblings, with affected as well as unaffected siblings, with parental information missing, or with multiple generations. Extension of the present methodology to the conventional case-control design is also interesting to consider. A case-control design does not have the missing-parent problem, but, on the other hand, it could produce an excess of false positive results if the study population has a hidden structure. ^{4} (Recent methodologic papers have suggested ways to recognize and adjust for such a confounding population structure.) ^{27–33} Finally, in this paper, APRICOT is used as an LD test for a preselected candidate gene/region. However, it can also be used in a genome-wide–scan scenario. In that case, one can divide the whole genome into a multitude of gene segments and then apply APRICOT to scan over them. Because the gene segments are many, a Bonferroni correction is mandatory. However, this would be done on a segment-by-segment basis rather than on the marker-by-marker basis that was originally suggested. ^{1} Further studies should examine the statistical properties of this approach.