A candidate gene (or candidate region) of 100 kb was considered in the simulation. A dense array of SNPs was genotyped within this candidate gene/region (the numbers of markers examined were 20 and 100, respectively). For comparison, the effect of using only one marker was also examined. In each round of the simulation, the locations of these markers were randomly generated within the candidate gene/region according to a uniform distribution. The number of case-parents triads was 200. With a candidate region of 100 kb, one should expect to observe only 0.2 recombination events in the data of 200 case-parents triads, which could reasonably be ignored. Therefore, the recombination fractions (θs) between markers were assumed to be zero in generating the triads.
For the LD alternatives (the H1 hypotheses), the disease-susceptibility gene (biallelic, with D and d alleles) was assumed to be within the candidate gene/region with its location generated according to a uniform distribution in each round of the simulation. The penetrances of dd, Dd, and DD were assumed to be 0.1, 0.3, and 0.9, respectively. (These represent complicated situations of phenocopy and reduced penetrance.)
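To make the phenocopy and reduced-penetrance remark concrete, the following minimal sketch computes the genotype distribution among affected individuals by Bayes' rule. It assumes Hardy-Weinberg proportions in the general population (an assumption not stated in the text) and uses a D-allele frequency of 0.05, the value the simulation adopts for the unstructured population. The function name is hypothetical.

```python
def case_genotype_distribution(q, penetrance):
    """Return (prevalence, P(genotype | affected)).

    Assumes Hardy-Weinberg proportions for the disease locus, which is
    an illustrative assumption rather than a detail given in the paper.
    """
    prior = {"dd": (1 - q) ** 2, "Dd": 2 * q * (1 - q), "DD": q ** 2}
    joint = {g: prior[g] * penetrance[g] for g in prior}
    prevalence = sum(joint.values())
    posterior = {g: joint[g] / prevalence for g in joint}
    return prevalence, posterior


prevalence, posterior = case_genotype_distribution(
    0.05, {"dd": 0.1, "Dd": 0.3, "DD": 0.9}
)
print(round(prevalence, 4), {g: round(p, 4) for g, p in posterior.items()})
```

Under these assumptions most cases carry no D allele at all, which is the sense in which phenocopies complicate the detection of LD.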
Three settings of population history were considered for an unstructured (homogeneous) population. The first setting envisions that the D allele was introduced into the population by a mutational process 2000 generations ago. The mutation occurred in an individual with a specific haplotype, the “ancestral haplotype.” Over time, this haplotype was broken up by meiotic recombination. (The distances from the disease locus to the right and to the left breakpoints were distributed as independent exponential random variates with a mean of 1/t Morgan, where t is the number of elapsed generations.) 13 Allelic heterogeneity was considered in the second setting. It was assumed that only 80% of the D alleles could trace their origins to the same ancestral haplotype 2000 generations ago. The other 20% arose throughout history on nonspecific haplotypes. In the third setting, the D allele was assumed to originate from a series of mutations, each occurring on a specific and distinct haplotype. The oldest mutation occurred 3000 generations ago and accounts for 20% of the extant D alleles. The second mutation occurred 2000 generations ago, accounting for 50%. The most recent mutation occurred 1000 generations ago, accounting for the remaining 30%. The frequency of the D allele at the end was assumed to be 0.05 for all these settings.
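The breakup of the ancestral haplotype can be sketched as follows. The code draws independent exponential distances to the left and right breakpoints with mean 1/t Morgan and clips the surviving segment to the 100-kb region. The conversion of 1 Morgan to 100,000 kb (that is, 1 cM per Mb) is an assumption made only for illustration, and the function name is hypothetical.

```python
import random


def surviving_segment(disease_pos_kb, t_generations, rng,
                      region_kb=100.0, kb_per_morgan=1.0e5):
    """Sample the part of the ancestral haplotype still attached to D.

    kb_per_morgan = 1e5 encodes the rule of thumb 1 cM ~ 1 Mb; this
    conversion is an assumption, not a value taken from the paper.
    """
    mean_kb = kb_per_morgan / t_generations  # mean of 1/t Morgan, in kb
    left = rng.expovariate(1.0 / mean_kb)    # distance to left breakpoint
    right = rng.expovariate(1.0 / mean_kb)   # distance to right breakpoint
    start = max(0.0, disease_pos_kb - left)
    end = min(region_kb, disease_pos_kb + right)
    return start, end


rng = random.Random(2002)
disease_pos = rng.uniform(0.0, 100.0)  # disease locus placed uniformly
start, end = surviving_segment(disease_pos, 2000, rng)
print(f"ancestral segment spans {start:.1f} to {end:.1f} kb")
```

With t = 2000 and the assumed conversion, the mean distance to each breakpoint is 50 kb, so a sizable part of the 100-kb region typically remains on the ancestral background.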
In each round of simulation, the marker frequencies were uniformly generated between 0.1 and 0.9, and the Lewontin disequilibrium coefficients 14 between two adjacent markers were uniformly generated between −0.9 and 0.9. With these parameters, the markers on the ancestral haplotype(s) were generated according to a first-order Markov model. 13 (Each round of simulation corresponds to a different population history.) To generate the haplotypes in the present-day population, all the D-haplotypes of the same specific origin were assumed to have the same markers as the corresponding ancestral haplotype up to the breakup points, whereas the above-mentioned Markov model was used repeatedly to generate marker data for (1) a d-haplotype, (2) a nonspecific D-haplotype, and (3) an ancestral D-haplotype beyond the breakup points. Note that for this unstructured population, the LD between markers was in effect simulated as arising from two sources: the new mutation and the background disequilibrium.
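A first-order Markov model of this kind can be sketched as below. The sketch converts each Lewontin coefficient D′ into the raw disequilibrium D using the standard bounds, then draws each marker conditional on its left neighbor so that the marginal allele frequencies are preserved. It illustrates the general idea under these assumptions and is not necessarily the exact procedure of reference 13; all function names are hypothetical.

```python
import random


def d_from_dprime(d_prime, p_a, p_b):
    """Convert Lewontin's D' to raw D for allele-1 frequencies p_a, p_b."""
    if d_prime >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d_prime * d_max


def markov_haplotype(freqs, d_primes, rng):
    """Draw one haplotype (list of 0/1 alleles) marker by marker.

    d_primes[j - 1] is the D' between adjacent markers j - 1 and j.
    """
    hap = [1 if rng.random() < freqs[0] else 0]
    for j in range(1, len(freqs)):
        p_a, p_b = freqs[j - 1], freqs[j]
        d = d_from_dprime(d_primes[j - 1], p_a, p_b)
        # Conditional probability of allele 1 given the previous allele.
        if hap[-1] == 1:
            p_one = p_b + d / p_a
        else:
            p_one = p_b - d / (1 - p_a)
        hap.append(1 if rng.random() < p_one else 0)
    return hap


rng = random.Random(14)
n_markers = 20
freqs = [rng.uniform(0.1, 0.9) for _ in range(n_markers)]
d_primes = [rng.uniform(-0.9, 0.9) for _ in range(n_markers - 1)]
print(markov_haplotype(freqs, d_primes, rng))
```

Because |D| never exceeds its allowed maximum, every conditional probability stays within [0, 1], and averaging over the previous allele recovers the marginal frequency of the current marker.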
Structured populations (population stratification and population admixture) were also considered. For the former, the study population was assumed to be composed of two subpopulations (the first constitutes 40%, and the second, 60%). Random mating occurs within each subpopulation, but the two subpopulations do not intermix. In the first subpopulation, the present-day frequency of the D allele is 0.25; the allele was introduced 10,000 generations ago through a certain ancestral haplotype. In the second subpopulation, the present-day frequency of the D allele is 0.0001; the allele was introduced 5000 generations ago through a different ancestral haplotype. As before, the background disequilibrium was generated using the Markov model. Note that the two subpopulations can assume different sets of marker frequencies and different sets of Lewontin disequilibrium coefficients. This creates yet another source of LD between markers. It could also induce an artificial association between the markers and the disease-susceptibility gene, even if they are not on the same chromosome. As for population admixture, everything remains the same as in the case of population stratification, except that the barrier between the two subpopulations was removed in the last generation, so that random mating can take place in the admixed population at large.
Two types of null hypothesis were considered: H01 (no linkage) and H02 (no association). Under H01, the disease-susceptibility gene was assumed to be located on a different chromosome. Thus, the recombination fraction between the disease locus and each of the marker loci in the candidate gene/region is 0.5. The LD between markers arises solely from the background disequilibrium or, in the case of population structure, additionally from the stratification/admixture. Under H02, a disease-susceptibility locus is indeed located within the candidate gene/region, but the dd, Dd, and DD genotypes were assumed to have equal penetrance for the disease under study. In this way, the LD between markers can originate from the new mutation (D mutation) as in H1. Yet the D mutation is not related to the disease under study, though it may confer susceptibility to other diseases.
Ten thousand simulations were performed for each scenario. Note that although haplotypes were simulated in this study, the analysis was restricted to genotype data. In each round of the simulation, TDT (the Bonferroni correction and the Monte-Carlo [MC] test of McIntyre et al. 8 were used for multiple-testing adjustment) and APRICOT (c = 5%, 10%, and 20%, respectively) were applied to test for LD with the disease-susceptibility gene. The α level was set at 0.05. The SAS/IML package was used to calculate the eigenvalues and eigenvectors required by APRICOT.
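For reference, the Bonferroni-corrected single-marker TDT used as a comparator can be sketched as follows. The statistic (b − c)²/(b + c) is the standard TDT of Spielman et al., referred to a chi-square distribution with 1 degree of freedom. The transmission counts are hypothetical, and the function names are illustrative only.

```python
import math


def tdt_p_value(b, c):
    """TDT p-value from heterozygous-parent transmission counts.

    b = transmissions of allele 1, c = transmissions of allele 2.
    The chi-square (1 df) upper tail equals erfc(sqrt(x / 2)).
    """
    if b + c == 0:
        return 1.0  # no informative transmissions
    stat = (b - c) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2.0))


def bonferroni_tdt(counts, alpha=0.05):
    """Bonferroni-adjust the smallest per-marker TDT p-value."""
    p_values = [tdt_p_value(b, c) for b, c in counts]
    p_adj = min(1.0, min(p_values) * len(p_values))
    return p_adj, p_adj <= alpha


# Hypothetical transmission counts for three markers.
counts = [(60, 40), (52, 48), (45, 55)]
p_adj, reject = bonferroni_tdt(counts)
print(f"adjusted p = {p_adj:.4f}, reject = {reject}")
```

The multiplier equals the number of markers tested, which is why the correction becomes increasingly conservative as the marker array grows denser and the markers more strongly correlated.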
Table 2 presents the results for the unstructured populations. We see that the estimated sizes of the tests are close to the nominal α level for the various tests studied and for either the H01 or the three H02 hypotheses. As expected, the BC tends to be conservative, especially in the case of 100 markers, and the MC has roughly the correct nominal α level.
When there is only one marker, we see as expected that TDT and PCT have almost equal power. Furthermore, the power increases as the number of markers increases. This is true even for a simple procedure such as BC. MC improves on BC in power, but only slightly. By contrast, the superiority of APRICOT over BC and MC is very striking, especially in the case of 100 markers. These findings hold regardless of whether there are single or multiple ancestral haplotypes and whether or not there is allelic heterogeneity.
The results for the structured populations are presented in Table 3. As before, APRICOT and MC have roughly the correct size and BC has a tendency toward conservatism. The overwhelming superiority of APRICOT over BC and MC in terms of power is again evident. It is known for the (single-marker) TDT that the power to detect LD is better when two populations have become admixed than when they remain isolated from each other. 15,16 Here, we see that the same applies to APRICOT.
In the simulation study, APRICOT typically picked up only the first few principal components (see Tables 2 and 3). These account for roughly 20% of the total variance when p = 20 and roughly 10% when p = 100. To explain a higher percentage of the variance, one can always include more principal components by setting a lower threshold (c < 5%), but this does not raise the power. Tables 2 and 3 suggest that a higher threshold of c = 20% may be satisfactory for an unstructured population, and a moderate threshold of c = 10% may be satisfactory for an admixed population. Overall, however, the choice of threshold seems to have little effect.
The simulation was also conducted assuming a smaller genetic effect (penetrances: 0.2 [dd], 0.4 [Dd], and 0.8 [DD]). The superiority of APRICOT was still evident. For example, in the case of one ancestral haplotype in an unstructured population, the powers were 44.46% (APRICOT, c = 20%) ≫ 12.85% (MC) > 11.78% (BC) when typing 100 markers in 200 case-parents triads.
Whether or not the epidemiologic association approach will become mainstream (a paradigm shift) for the genetic dissection of complex human diseases may ultimately depend on two issues. One is the creation of a marker map that is dense enough to have at least one marker in strong LD with the disease-susceptibility gene. The other is the integration of information from the vast number of markers in the map without too much penalty for multiple testing. This paper addresses the second issue. I found that by testing for the principal components of the markers rather than testing one by one for the individual markers themselves, we can, in effect, circumvent the multiple-testing problem. Principal components are particular linear combinations of the markers. Thus, testing for the principal components amounts to testing for all the markers simultaneously. APRICOT as proposed in this paper focuses attention only on the first few principal components. This is because they have larger variances and thus may contain more information to detect LD. The Monte-Carlo simulation conducted in this paper confirms this speculation, and demonstrates a substantial power advantage over the BC and MC. In the not-too-distant future (when the cost of typing SNPs goes down and a dense or even saturated array of SNPs becomes a practical reality), it will be of interest to test whether APRICOT lives up to expectations, when applied to real data.
In the recent literature, several multilocus extensions of the TDT have been proposed. 17–25 Most of these methods assume that the haplotypes are known in advance; otherwise, the phase ambiguity has to be resolved from data on other family members or by probabilistic modeling. 17–24 This could limit their usage if many linked markers (such as a dense array of SNPs) are to be considered as a whole in a study, because under this condition the number of potential haplotypes becomes astronomically large. 26 One recent proposal makes use of the genotype data. 25 It calls for a parametric model that assumes a single ancestral mutation in a random-mating population. Thus, the method is not robust to population structure, which is not in keeping with the basic principle of a TDT. 3,4 The method is also very computationally demanding.
In this paper, the principal component methodology is applied only to infer LD between the set of markers and the disease-susceptibility gene. I do not intend to interpret the principal components themselves, nor do I seek to refine further the disease gene location, given the eigenvalues and the eigenvectors. The reason is simple. The principal components are derived from the covariance matrix of the marker transmission patterns. The eigenvectors (the coefficients to combine the markers) will be a complex function of the marker allele frequencies, the LD between markers, the population structure, etc. Even if a large weight is found to be attached to a particular marker in the linear combination for the most substantial principal component, it may not necessarily imply that the marker is physically close to the disease-susceptibility gene. It may simply mean that the marker is more polymorphic (allele frequency closer to 0.5). Alternatively, it could also mean that there exists a large difference in the allele frequencies between the two subpopulations for the marker. Thus, the principal component approach as described in this paper lacks the ability to pinpoint the precise location of the disease-susceptibility gene. However, it can still be considered a fine (although not exact) mapping technique, because we considered an array of SNPs tightly packed within an extremely short candidate gene/region at the outset. (By contrast, if a loose array of markers with spacing, say >10 cM, is used, the markers will generally be in equilibrium, ie, independent, and the principal components test will simply pick out the more informative markers one by one. Because there is no pooling of information, APRICOT is not expected to outperform the conventional methods in such a situation.)
The proposed APRICOT could have further improvements and extensions. First, further studies are warranted to examine the optimal choice of the threshold value (c), or possibly even to refine the adaptive algorithm of APRICOT altogether. Second, APRICOT in its present form can deal only with the simplest family data of case-parents triads. In real practice, however, we could encounter more complex situations, such as families with multiple affected siblings, with affected as well as unaffected siblings, with parental information missing, or with multiple generations. Extension of the present methodology to the conventional case-control design is also interesting to consider. A case-control design does not have the missing-parent problem, but, on the other hand, it could produce an excess of false positive results if the study population has a hidden structure. 4 (Recent methodologic papers have suggested ways to recognize and adjust for such a confounding population structure.) 27–33 Finally, in this paper, APRICOT is used as an LD test for a preselected candidate gene/region. However, it can also be used in a genome-wide–scan scenario. In that case, one can divide the whole genome into a multitude of gene segments and then apply APRICOT to scan over them. Because the gene segments are many, a Bonferroni correction is mandatory. However, this would be done on a segment-by-segment basis rather than on the marker-by-marker basis that was originally suggested. 1 Further studies should examine the statistical properties of this approach.
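The difference between segment-by-segment and marker-by-marker correction in a genome-wide scan is simple arithmetic, sketched below. The number of segments (30,000, roughly a 3,000-Mb genome cut into 100-kb pieces) and the 100 markers per segment are hypothetical values chosen only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


alpha = 0.05
n_segments = 30_000        # hypothetical: ~3,000 Mb in 100-kb segments
markers_per_segment = 100  # hypothetical marker density

per_segment = bonferroni_threshold(alpha, n_segments)
per_marker = bonferroni_threshold(alpha, n_segments * markers_per_segment)

print(f"segment-by-segment threshold: {per_segment:.2e}")
print(f"marker-by-marker threshold:   {per_marker:.2e}")
```

Under these assumed numbers the per-segment threshold is less stringent than the per-marker threshold by a factor equal to the number of markers per segment.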
1. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996; 273: 1516–1517.
2. Khoury MJ, Yang Q. The future of genetic studies of complex human diseases: an epidemiologic perspective. Epidemiology 1998; 9: 350–354.
3. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 1993; 52: 506–516.
4. Ewens WJ, Spielman RS. The transmission/disequilibrium test: history, subdivision and admixture. Am J Hum Genet 1995; 57: 455–464.
5. Weinberg CR. It's time to rehabilitate the P-value. Epidemiology 2001; 12: 288–290.
6. Wang DG, Fan JB, Siao CJ, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 1998; 280: 1077–1082.
7. The International SNP Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001; 409: 928–933.
8. McIntyre LM, Martin ER, Simonsen KL, Kaplan NL. Circumventing multiple testing: a multilocus Monte Carlo approach to testing for association. Genet Epidemiol 2000; 19: 18–29.
9. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 2nd ed. Englewood Cliffs: Prentice-Hall, 1992.
10. Wilkinson JH, Reinsch C. Linear Algebra, Handbook for Automatic Computation. Vol. 2. New York: Springer-Verlag, 1971.
11. Graybill FA. Matrices with Applications in Statistics. 2nd ed. Belmont: Wadsworth Publishing Company, 1983.
12. Martin ER, Monks SA, Warren LL, Kaplan NL. A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet 2000; 67: 146–154.
13. McPeek MS, Strahs A. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am J Hum Genet 1999; 65: 858–875.
14. Devlin B, Risch N. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 1995; 29: 311–322.
15. McKeigue PM. Mapping genes underlying ethnic differences in disease risk by linkage disequilibrium in recently admixed populations. Am J Hum Genet 1997; 60: 188–196.
16. Kaplan NL, Martin ER, Morris RW, Weir BS. Marker selection for the transmission/disequilibrium test, in recently admixed populations. Am J Hum Genet 1998; 62: 703–712.
17. Wilson SR. On extending the transmission/disequilibrium test (TDT). Ann Hum Genet 1997; 61: 151–161.
18. Chiano MN, Clayton DG. Fine genetic mapping using haplotype analysis and the missing data problem. Ann Hum Genet 1998; 62: 55–60.
19. Clayton D. A generalization of transmission/disequilibrium test for uncertain-haplotype transmission. Am J Hum Genet 1999; 65: 1170–1177.
20. Clayton D, Jones H. Transmission/disequilibrium tests for extended marker haplotype. Am J Hum Genet 1999; 65: 1161–1169.
21. Bourgain C, Genin E, Quesneville H, Clerget-Darpoux F. Search for multifactorial disease susceptibility genes in founder populations. Ann Hum Genet 2000; 64: 255–265.
22. Dudbridge F, Koeleman BPC, Todd JA, Clayton DG. Unbiased application of the transmission/disequilibrium test to multilocus haplotypes. Am J Hum Genet 2000; 66: 2009–2012.
23. MacLean CJ, Martin RB, Sham PC, Wang H, Straub RE, Kendler KS. The trimmed-haplotype test for linkage disequilibrium. Am J Hum Genet 2000; 66: 1062–1075.
24. Zhao H, Zhang S, Merikangas KR, et al. Transmission/disequilibrium tests using multiple tightly linked markers. Am J Hum Genet 2000; 67: 936–946.
25. Morris AP, Whittaker JC. Fine scale association mapping of disease loci using simplex families. Ann Hum Genet 2000; 64: 223–237.
26. Hodge SE, Boehnke M, Spence MA. Loss of information due to ambiguous haplotyping of SNPs. Nat Genet 1999; 21: 360–361.
27. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 1999; 65: 220–228.
28. Devlin B, Roeder K. Genomic control for association studies. Biometrics 1999; 55: 997–1004.
29. Bacanu S-A, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet 2000; 66: 1933–1944.
30. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet 2000; 67: 170–181.
31. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics 2000; 155: 945–959.
32. Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol 2001; 20: 4–16.
33. Satten GA, Flanders WD, Yang Q. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet 2001; 68: 466–477.
Keywords: case-parents triads; epidemiologic methods; genetic epidemiology; principal component analysis; single nucleotide polymorphism; transmission/disequilibrium test
© 2002 Lippincott Williams & Wilkins, Inc.