Epidemiologic enquiry to uncover new risk factors for cancer continues at a fast pace. In recent years, the focus has been on genetic risk factors. A few high penetrance genes have been identified, but current thinking is that the risk in individuals is likely to be polygenic in nature, with no way to predict the prevalence of these unknown variants, or the strength of their influences.1 Current array technology permits the simultaneous study of large numbers of single nucleotide polymorphisms (SNPs), but these focus on relatively common variants.2,3 Future sequencing technology is likely to redirect attention towards rarer variants.
In searching the genome for genes and variants that affect risk, it is necessary to use available resources of subjects as efficiently as possible. An idea that has been advocated by a few commentators is to make greater use of patients with multiple primary malignancies.4–6 The fundamental idea is based on the premise that patients with 2 or more independent cancers of the same anatomic site represent an exceptionally high-risk group. As such, they must possess considerably increased prevalence of all factors that increase the risk of the cancer under investigation. The use of multiple primaries is especially attractive for studying rare risk factors. It has been shown that, under an assumption that the relative risk is unrelated to the background risk, studies that compare patients with double malignancies to patients with single malignancies can be more powerful for studying rare risk factors than conventional case–control studies.4 The increase in statistical power is due simply to the greatly increased relative frequency of the risk variants in the study participants. The power advantages of this design are eliminated when the risk variant is common.
A related idea has been to take advantage of patients with double malignancies by creating case groups that represent an extreme phenotype, and to compare these directly with population controls.5,7–9 This “comparison of extremes” approach affects study power by increasing the detectable “signal.” Specifically, if, as before, the relative risk is assumed to be independent of background risk, the odds ratio from the “extremes” design will be the square of the odds ratio for a conventional design. That is, if the odds ratio comparing patients with double primaries to those with single primaries is the same as the odds ratio comparing patients with single primaries to population controls, then the odds ratio from a comparison of double primaries with population controls will be the square of this baseline odds ratio. If true, this increase in signal has the potential to increase the efficiency/power of case–control studies for detecting common variants, such as low penetrance SNPs on a SNP array. In a recent study, Fletcher et al10 tested this idea empirically by assembling data from studies of the role of the CHEK2*1100delC variant in causing breast cancer. This large pooled study of 1828 cases of contralateral breast cancer and 7030 population controls produced an odds ratio estimate of 6.4, a result that is consistent with the square of the odds ratio estimate of 2.3 obtained from meta-analyses of conventional studies of the CHEK2*1100delC variant.11
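The squaring argument can be made explicit with a short derivation. As a sketch, write p, q, and r for the prevalence of the variant among population controls, single-primary cases, and double-primary cases, and ψ and ϕ for the odds ratios of single primaries versus controls and of double versus single primaries. This notation is chosen to match the definitions given later in the paper.

```latex
\[
\frac{q}{1-q} = \psi\,\frac{p}{1-p}, \qquad
\frac{r}{1-r} = \phi\,\frac{q}{1-q} = \psi\phi\,\frac{p}{1-p}
\]
\[
\mathrm{OR}_{s\ \mathrm{vs}\ c}
  = \frac{r/(1-r)}{p/(1-p)}
  = \psi\phi
  = \psi^{2} \quad \mathrm{when}\ \phi = \psi
\]
```

With ψ = 2.3, the squared value is about 5.3, which is the benchmark against which the pooled estimate of 6.4 is being judged.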
The purpose of the present study is to provide a broader evaluation of the validity of these novel designs. We review the breast cancer literature to uncover reported studies of other known or suspected breast cancer susceptibility genes that have used patients with contralateral breast cancer as cases (BRCA1, BRCA2, and FGFR2). Breast cancer appears to be the only malignancy that has been comprehensively analyzed in both its single (unilateral) and multiple (bilateral) forms. Bilateral breast cancer occurs frequently, and as a result is an especially attractive target for this study design. The results are examined to see whether the observed prevalences of the variants among these cases are consistent with the hypothesis that the odds ratio of bilateral breast cancer is the square of the odds ratio of unilateral breast cancer. We also examine the mathematical consequences of this assumption, to determine the potential magnitude of the gains in statistical power of studies that use second primaries, contrasting the design implications for common low penetrance variants with rare high penetrance variants. The power advantages of studying second primaries are considered in the light of other theoretical and practical strengths and limitations of these designs.
Alternative Design Options and Relative Statistical Power
The study of second primaries as cases has the potential to reduce the required sample size in comparison to a conventional case–control study. This theory was initially presented to justify a study design in which cases comprise incident occurrences of a second primary malignancy.4 The ideal control group for such a study would be survivors of a first primary cancer of the same type. The key assumption of this strategy is that the relative risk due to the risk factor is unrelated to the baseline risk in the population under study, in which case the relative risk would be the same in cancer survivors (a group with a high absolute risk) as in the general population at risk for a first primary. We test this assumption empirically.
We consider 3 design options and the sample sizes required for each design to deliver equivalent statistical power. Let nfc represent the total sample size for a conventional case–control study, ie, a study in which first primary cases are compared with population controls. Let nsf represent the sample size for a study of second primary cancers (the “second cancers” design, with first primary cancers as controls) that will deliver equivalent power to the conventional study with sample size nfc. Thus, this design is more powerful if nsf is smaller than nfc. Finally, let nsc represent the corresponding sample size for a study in which the cases are second primaries and the controls are unaffected population controls, which we term the “enriched” design. Let the risk factor under investigation have a relative risk in the population denoted by ψ and let its prevalence be p. Further, for generality we assume that the corresponding relative risk contrasting first and second primaries is ϕ. That is, the second-cancers design involves an attenuated relative risk if ϕ < ψ. Then the sample sizes in the second-cancers design and the enriched design required to deliver the same power as a sample size of nfc for the conventional design are given by
Further details of the derivation of these formulas are provided in the Appendix. The formulas allow us to specify any sample size (nfc) that will provide an arbitrarily large power for the conventional study, and to use this to compute the required sample sizes for the other design options that will possess equivalent power.
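The calculation described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the usual large-sample variance of a log odds ratio with equal numbers of cases and controls, and it equates the ratio of signal to standard error across designs as outlined in the Appendix. The function names `case_term` and `equivalent_sample_sizes` are illustrative, and the arguments `n_fc`, `p`, `psi`, and `phi` stand for nfc, p, ψ, and ϕ.

```python
import math


def case_term(p, rr):
    """Return p(1-p) / (q(1-q)), where q = p*rr / (1 - p + p*rr) is the
    prevalence of the risk factor in a case group with odds ratio rr."""
    return (1 - p + p * rr) ** 2 / rr


def equivalent_sample_sizes(n_fc, p, psi, phi):
    """Sample sizes (n_sf, n_sc) giving the same approximate power as a
    conventional study of size n_fc (assumes psi != 1 and phi != 1).

    n_sf: second-cancers design (second primaries vs first primaries)
    n_sc: enriched design (second primaries vs population controls)
    """
    # Variance factors, each scaled by n * p * (1 - p); the control term is 1.
    conventional = 1 + case_term(p, psi)
    second = case_term(p, psi) + case_term(p, psi * phi)
    enriched = 1 + case_term(p, psi * phi)

    # Equate signal^2 / variance between each design and the conventional one.
    n_sf = n_fc * (math.log(psi) / math.log(phi)) ** 2 * second / conventional
    n_sc = n_fc * (math.log(psi) / math.log(psi * phi)) ** 2 * enriched / conventional
    return n_sf, n_sc


# Example: common variant, modest effect, no attenuation (phi equal to psi).
n_sf_common, n_sc_common = equivalent_sample_sizes(1000, 0.3, 1.5, 1.5)

# Example: very rare variant, strong effect, no attenuation.
n_sf_rare, n_sc_rare = equivalent_sample_sizes(1000, 0.001, 5.0, 5.0)
```

Because only ratios of variances enter, the result does not depend on whether nfc is read as a total or a per-group sample size, provided the same convention is used for all 3 designs.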
Studies of mechanisms of breast cancer predisposition have found several genetic variants reproducibly associated with risk.12,13 These can be broadly categorized as high-risk gene defects (BRCA1, BRCA2), deleterious gene mutations with moderate breast cancer predisposition (CHEK2*1100delC), and low-penetrance gene polymorphisms (for example, FGFR2 rs2981582). We also considered a polymorphic haplotype suggested as a possible risk factor (ATM composite allele ins38(-8) and 5557A), but after reviewing the very limited literature,14–16 we elected not to present the results of the ATM haplotype because there is no persuasive evidence that this variant affects breast cancer risk. For BRCA1 and BRCA2, we reviewed studies that involved mutations with proven deleterious effect on the gene's function. We examined the common BRCA1 variant 5382insC separately, as a few studies examined this mutation in isolation. Each of these categories of predisposing alleles has been studied with various levels of scrutiny in bilateral breast cancer cases, unilateral breast cancer cases, and nonaffected controls.
We identified case–control studies of these selected genes and the risk of breast cancer published before 1 March 2009, through computer-based searches of PubMed. Using FGFR2 as an example, we performed 2 consecutive searches as follows: (1) (“breast cancer” OR “breast neoplasms” OR “breast carcinoma”) AND (FGFR2 OR rs2981582) NOT review (pt); (2) (“bilateral breast cancer” OR “multiple breast cancer” OR “multiple primary cancer”) AND (FGFR2 OR rs2981582) NOT review (pt). This was repeated for the other genes and variants by replacing “FGFR2 OR rs2981582” with “BRCA1 OR BRCA2,” and “CHEK2.” Studies were included if they involved women with unilateral breast cancer or bilateral breast cancer as cases and unilateral breast cancer or breast cancer-free as controls. We further restricted the studies to those for which there was no evidence that the participants were selected on the basis of a family history of breast cancer. We also included case-series studies (without controls) or studies consisting solely of unaffected individuals, as these contribute to our estimates of the population prevalences of the genes in 1 of the 3 comparison groups (ie, population controls, unilateral breast cancer cases, bilateral breast cancer cases) even though these studies do not contribute to our summary odds ratio estimates. We extracted information on study design, geographic location, ethnicity, age, and numbers of cases and controls from each study (eTable 1, http://links.lww.com/EDE/A360).
The mutation frequencies were compared for heterogeneity across studies using the Fisher-Freeman-Halton Test (StatXact).17 We applied conventional meta-analytic techniques to groups of studies employing the same study design (ie, conventional, second-cancers, or enriched).18 This involved checks for between-study heterogeneity19 and publication bias,20 and calculation of summary odds ratios (ORs) and 95% confidence intervals (CIs). We used the statistical software STATA (version 10.0; STATA Corp, College Station, TX) and StatXact (Cytel Corporation, Cambridge, MA).
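For readers unfamiliar with the pooling step, the following is a minimal sketch of a random-effects summary odds ratio in the style of DerSimonian and Laird, with Cochran's Q as the heterogeneity statistic. It is not the STATA procedure used for the analyses reported here, and the 2 × 2 counts in `hypothetical_tables` are invented purely for illustration.

```python
import math


def log_or_and_var(a, b, c, d):
    """Log odds ratio and its large-sample variance for one study.

    a, b: carrier and non-carrier cases; c, d: carrier and non-carrier controls.
    If any cell is zero, 0.5 is added to every cell (a common convention).
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pooled_or(tables):
    """Return (summary OR, lower 95% limit, upper 95% limit, Cochran's Q)."""
    ys, vs = zip(*(log_or_and_var(*t) for t in tables))
    w = [1 / v for v in vs]
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

    # Cochran's Q measures the spread of study estimates around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))
    k = len(tables)

    # Moment estimate of the between-study variance, truncated at zero.
    tau2 = 0.0
    if k > 1:
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (k - 1)) / c)

    w_star = [1 / (v + tau2) for v in vs]
    est = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return math.exp(est), math.exp(est - 1.96 * se), math.exp(est + 1.96 * se), q


# Hypothetical counts: (carrier cases, non-carrier cases, carrier controls, non-carrier controls)
hypothetical_tables = [(12, 788, 4, 796), (20, 1480, 8, 1592), (9, 591, 3, 597)]
summary_or, lower, upper, q_stat = pooled_or(hypothetical_tables)
```

Q is referred to a chi-square distribution with k − 1 degrees of freedom; with only a handful of studies per design, that test has limited power.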
Our search identified 21 studies of CHEK2*1100delC, 20 studies of various BRCA1 truncating mutations (7 in Ashkenazi populations, 13 in other populations), 12 reports involving solely the BRCA1 5382insC variant (5 in Ashkenazi populations, 7 in other populations), 17 studies of various BRCA2 truncating mutations (7 in Ashkenazi populations, 10 in other populations), and 3 studies of FGFR2 rs2981582. Detailed results from each of these studies are provided in eTable 2 (http://links.lww.com/EDE/A360) along with the results of the various statistical analyses. These include heterogeneity tests of the prevalence estimates of the variants in each of the 3 key groups (population controls, unilateral breast cancer, and bilateral breast cancer), summary odds ratios from meta-analyses of the studies for each of the different study types (conventional case–control design, second-cancers design, enriched design), and tests of heterogeneity and publication bias for the various summary odds ratio estimates.
Summary results by gene are provided in Table 1. The overall relative frequencies show the anticipated pattern of increasing prevalence from population controls to unilateral cases to bilateral cases. For example, for CHEK2*1100delC the estimated frequencies of the variant rise from an average of 0.5% in the populations studied to 1.5% in unilateral cases to 2.8% in bilateral cases. These prevalence estimates vary considerably from study to study (note heterogeneity P values), reflecting possible differences in frequencies among ethnically distinct populations. However, for genes with large effects, such as BRCA1 and BRCA2, these variations are minor relative to the differences between the case groupings.
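As a rough check on this pattern, the average CHEK2*1100delC frequencies quoted above can be converted into crude odds ratios. This is only a sketch: it uses the rounded averages from the text rather than the study-level data, so the values will not match the stratified summary estimates in Table 1. The helper name `odds` is illustrative.

```python
def odds(prevalence):
    """Convert a carrier prevalence into odds."""
    return prevalence / (1 - prevalence)


# Average CHEK2*1100delC frequencies quoted in the text.
controls, unilateral, bilateral = 0.005, 0.015, 0.028

or_conventional = odds(unilateral) / odds(controls)     # unilateral vs controls
or_second_cancers = odds(bilateral) / odds(unilateral)  # bilateral vs unilateral
or_enriched = odds(bilateral) / odds(controls)          # bilateral vs controls

print(f"conventional   {or_conventional:.2f}")
print(f"second cancers {or_second_cancers:.2f}")
print(f"enriched       {or_enriched:.2f}")
```

By construction the enriched odds ratio is the product of the other two, so it equals the square of the conventional odds ratio only when the second-cancers odds ratio is not attenuated.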
In analyzing the odds ratios, these variations across studies justify stratification by individual study, as is conventional in any meta-analysis. However, the heterogeneity tests for the meta-analyses of odds ratios are all nonsignificant. This provides some reassurance about the validity of the summary odds ratio estimates (right 3 columns of Table 1), although the relatively small numbers of component studies provide limited power for these heterogeneity tests. Note that these analyses have combined studies of Ashkenazi and non-Ashkenazi populations for both BRCA1 and BRCA2, because the odds ratios appear to be consistent. These summary odds ratios confirm our thesis that associations in conventional case–control investigations (unilateral cases vs. control column) will also be detected in studies that compare bilateral cases with unilateral cases.
The odds ratio for BRCA1 appears to be attenuated in the bilateral versus unilateral design, although for the other 3 genes the estimates with those 2 designs are very similar. The results also support the corollary to this thesis, that the enriched design (bilateral versus control column) provides a stronger signal. For CHEK2*1100delC and FGFR2, the estimates from this design are close to the square of the odds ratio estimates from the corresponding conventional studies. For BRCA1 and BRCA2, the odds ratio estimates from this design actually exceed the square, although there are few studies and the confidence intervals are wide. All of these comparisons are limited by the relatively few studies available for meta-analysis, especially for BRCA2 and FGFR2, and the wide confidence intervals throughout.
The power implications of these findings are shown in Table 2. This table provides the relative effective sample sizes of the 3 study designs required to achieve a given level of statistical power, using a conventional study with 1000 subjects as the reference. The parameters represent various risk-factor prevalences similar to the ones under investigation, including a common variant (prevalence 0.3), an uncommon variant (prevalence 0.01) and a very uncommon variant (prevalence 0.001), and for modest (odds ratio of 1.5) and high strength (odds ratio of 5.0) associations. For example, under the first set of parameter assumptions, a study comparing double primaries with population controls would require only 240 participants to achieve equivalent power. (The absolute power differs from row to row in the table, so it is not meaningful to compare results between rows within a column.)
In the top half of the table, it is assumed that the relative risk of disease for a given risk factor for the ratio of second primaries to first primaries is not attenuated compared with the ratio of first primaries to ordinary controls. The bottom half of the table repeats the results assuming that the relative risk of disease for a given risk factor is reduced 25% for the ratio of second primaries to first primaries. If there is no attenuation of relative risk with second primaries (top half of table), the second-cancers design is more efficient than a conventional design for rarer and stronger risk factors (ie, fewer participants are required to achieve equivalent study power). This advantage is reduced if the signal is attenuated (bottom half of table). Across the board, the enriched design has greater power than the conventional design.
In using second primary cancers for epidemiologic studies of cancer risk, the fundamental premise is that any risk factor that increases the risk of the cancer under study in people previously unaffected with the disease will also increase the risk of a second primary among cancer survivors. Our results support this hypothesis for genetic risk factors associated with breast cancer. This hypothesis assumes there is nothing unique or etiologically distinct about the occurrence of double malignancies (ie, that they represent 2 independent occurrences of the disease), and 2 occurrences will typically occur in people at high risk. Given this presumption, we have explored the further hypothesis that the degree of elevation of risk for any risk factor is similar (on a multiplicative scale) in the setting of cancer survivors to what is observed in the general population. Our results suggest that on the relative scale (ie, using relative risks as opposed to, say, risk differences) the risk elevation among cancer survivors is typically similar, although with possible attenuation in some cases. Of the 4 genes investigated, only one (BRCA1) had a summary relative risk in the second cancers design substantially smaller than the summary estimate from the case–control studies. In contrast, the 3 enriched studies of BRCA1 produced a summary odds ratio higher than expected (62). A modest attenuation of the relative risk is consistent with an investigation of known risk factors for melanoma in a similar setting.21 The possibility of such attenuation has implications for statistical power. Our calculations show that studies involving second primaries have major advantages in terms of statistical power if the relative risk can be considered to be constant. Attenuation of the relative risk can alter this balance of power, depending on the magnitude of the attenuation. 
However, studies that compare second primaries with population controls have power advantages even with attenuation, and across the spectrum of risk factor characteristics.
Our results provide only an imprecise investigation of these phenomena. Many of the individual studies in the literature are vague with respect to criteria for case and control selection. Just as with conventional case–control studies, the other 2 designs would ideally be constructed using population-based sampling of both cases and controls. Also, these designs have practical merit only for selected cancer sites, ie, those for which second primaries in the same organ type are common and clearly distinguishable from metastatic lesions. This includes cancers of the breast, lung, colon-rectum, and skin, but excludes rare cancers and those for which much of the primary site is typically removed by surgery, eg, prostate cancer. Also, multiple primaries are common for head and neck cancers, but reliable discrimination between multiple primaries and superficial metastases is complicated.
In what circumstances might the relative risk of any risk factor be different in cancer survivors than in the general population? One possibility (as just discussed) is diagnostic error whereby metastases are misdiagnosed as second primaries. There is a substantial literature of studies evaluating this issue, but the consensus for breast cancer is that most contralateral occurrences are indeed independent occurrences of the disease.22–24 Recent studies support this conclusion for melanoma, but suggest that misdiagnoses may be common for multiple primary lung cancers.25–27 Another possibility is interaction with treatment. Common treatments for primary breast cancer include agents such as tamoxifen that reduce the incidence of the disease by 50%. If the subtypes of tumors that are prevented by treatment are associated with a genetic risk factor, then the impact of this risk factor overall will be different in second primaries compared with first primaries. These influences could affect the relative risk, but they are unlikely to affect the detectability of any risk factor. Another possibility is simply that the relative effect of individual risk factors diminishes as the background risk increases. Indeed, studies showing an approximately constant risk in BRCA1/2 carriers by age would seem to support this thesis, in that the “background” risk increases markedly with age.6 A final possibility is that the variant may be associated with case survival. If so, the odds ratio could be either attenuated or enhanced, depending on the direction of this association.
Many investigators studying genetic risk factors elect to “enrich” the case selection by restricting attention to cases with a family history of the disease. This approach is similarly designed to increase power by genetically enriching the case base. These studies are rarely population-based, and it is more difficult to make a precise estimate of the extent to which the power is likely to be enhanced. Interestingly, Antoniou and Easton28 have studied this issue using a polygenic model with a normally distributed polygenic component estimated from a large population-based study; they conclude that if one restricts case selection to breast cancer cases with an affected mother and sister, the study will deliver increased power of a similar order of magnitude to using cases with bilateral breast cancer. Of course, a consideration in the use of any enriched design (whether using cases with multiple primaries or cases with a strong family history of cancer) is the added difficulty in identifying and recruiting these enriched cases. Our statistical power comparisons assume equivalent sample sizes but do not address the relative ease with which these can be obtained in practice.
The conventional case–control study has a long history in cancer epidemiology. The gold standard is the population-based design, whereby incident cases from a defined population are compared with controls randomly selected from the same population. However, this ideal is increasingly challenged by the difficulty in enrolling population controls with a high response rate.29 While it is suitable for relatively common risk factors, the design is problematic for important but rare risk factors (ie, those that confer a high relative risk). An enriched design based on second primaries offers an attractive alternative in that it provides substantial power advantages across a broad spectrum of risk-factor prevalences and relative risks. However, like the conventional approach, it requires population controls with the attendant difficulties. An enriched design may be especially attractive for genome-wide association studies of candidate SNPs due to its power advantages for a broad spectrum of SNP prevalences. The second-cancers design, by its case-only nature, promises higher participation rates, especially when biologic samples are required (as for genetic analysis). This design has more statistical power than the conventional design for rare risk factors, although its advantages can be muted if there is risk attenuation. Its power is competitive with that of the enriched design for very rare, strong risk factors. This is likely to be an increasingly important area as knowledge of strong genetic risk factors emerges. Major genes such as BRCA1 and BRCA2 possess hundreds of individual variants, many extremely rare, and consequently individual studies require a high level of power to distinguish harmful rare variants from the harmless ones.30
In summary, our study provides strong empirical evidence that studies of second primary cancers are capable of detecting cancer risk factors, in many circumstances with greatly improved power. This underused resource could be employed to facilitate the ongoing search for cancer risk factors.
We thank Peter Devilee and Petra Huijts (Leiden University Medical Center) for access to original data from their study.
1. Pharoah PD, Antoniou A, Bobrow M, Zimmern RL, Easton DF, Ponder BA. Polygenic susceptibility to breast cancer and implications for prevention. Nat Genet
2. Easton DF, Pooley KA, Dunning AM. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature
3. Hunter DJ, Kraft P, Jacobs KB. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet
4. Begg CB, Berwick M. A note on the estimation of relative risks of rare genetic susceptibility markers. Cancer Epidemiol Biomarkers Prev
5. Imyanitov EN, Cornelisse CJ, Devilee P. Searching for susceptibility alleles: emphasis on bilateral breast cancer. Int J Cancer
6. Peto J, Mack TM. High constant incidence in twins and other relatives of women with breast cancer. Nat Genet
7. Kuligina ES, Togo AV, Suspitsin EN, et al. CYP17 polymorphism in the groups of distinct breast cancer susceptibility: comparison of patients with the bilateral disease vs. monolateral breast cancer patients vs. middle-aged female controls vs. elderly tumor-free women. Cancer Lett
8. Suspitsin EN, Buslov KG, Grigoriev MY, et al. Evidence against involvement of p53 polymorphism in breast cancer predisposition. Int J Cancer
9. Buslov KG, Iyevleva AG, Chekmariova EV, et al. NBS1 657del5 mutation may contribute only to a limited fraction of breast cancer cases in Russia. Int J Cancer. 114:585–589. PMID: 15578693.
10. Fletcher O, Johnson N, Dos Santos Silva I, et al. Family history, genetic testing, and clinical risk prediction: pooled analysis of CHEK2 1100delC in 1,828 bilateral breast cancers and 7,030 controls. Cancer Epidemiol Biomarkers Prev
11. CHEK2 Breast Cancer Case-Control Consortium. CHEK2*1100delC and susceptibility to breast cancer: a collaborative analysis involving 10,860 breast cancer cases and 9,065 controls from 10 studies. Am J Hum Genet.
12. Antoniou AC, Pharoah PD, McMullan G, et al. A comprehensive model for familial breast cancer incorporating BRCA1 and other genes. Br J Cancer
13. Garcia-Closas M, Hall P, Nevanlinna H, et al. Heterogeneity of breast cancer associations with five susceptibility loci by clinical and pathological characteristics. PLoS Genet
14. Angèle S, Romestaing P, Moullan N, et al. ATM haplotypes and cellular response to DNA damage: association with breast cancer risk and clinical radiosensitivity. Cancer Res
15. Langholz B, Bernstein JL, Bernstein L, et al; The WECARE Study Collaborative Group, Concannon P. On the proposed association of the ATM variants 5557G>A and IVS38–8T>C and bilateral breast cancer. Int J Cancer
16. Tommiska J, Jansen L, Kilpivaara O, et al. ATM variants and cancer risk in breast cancer patients from Southern Finland. BMC Cancer
17. Freeman GH, Halton JH. Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika
18. Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for Meta-Analysis in Medical Research. New York: Wiley; 2000.
19. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials
20. Begg CB, Mazumdar M. Operating characteristics of a rank correlation test for publication bias. Biometrics
21. Begg CB, Hummer AJ, Mujumdar U, et al. A design for cancer case-control studies using only incident cases: experience with the GEM study of melanoma. Int J Epidemiol
22. Saad RS, Denning KL, Finkelstein SD, et al. Diagnostic and prognostic utility of molecular markers in synchronous bilateral breast carcinoma. Mod Pathol
23. Tse GM, Kung FY, Chan AB, Law BK, Chang AR, Lo KW. Clonal analysis of bilateral mammary carcinomas by clinical evaluation and partial allelotyping. Am J Clin Pathol
24. Imyanitov EN, Suspitsin EN, Grigoriev MY, et al. Concordance of allelic imbalance profiles in synchronous and metachronous bilateral breast carcinomas. Int J Cancer
25. Orlow I, Tommasi DV, Bloom B, et al. Evaluation of the clonal origin of multiple primary melanomas using molecular profiling. J Invest Dermatol
26. Girard N, Ostrovnaya I, Lau C, et al. Genomic and mutational profiling to assess clonal relationships between multiple non-small cell lung cancers. Clin Cancer Res
27. Wang X, Wang M, MacLennan GT, et al. Evidence for common clonal origin of multifocal lung tumors. J Natl Cancer Inst
28. Antoniou AC, Easton DF. Polygenic inheritance of breast cancer: implications for design of association studies. Genet Epidemiol
29. Olson SH, Voigt LF, Begg CB, Weiss NS. Reporting participation in case-control studies. Epidemiology
30. Capanu M, Orlow I, Berwick M, Hummer AJ, Thomas DC, Begg CB. The use of hierarchical models for estimating relative risks of individual genetic variants: an application to a study of melanoma. Stat Med
The goal is to develop concise formulas to characterize the relative power (efficiency) of the 3 candidate designs, denoted conventional, second primaries, and enriched. We use the following notation.
- Conventional healthy controls (subscript c): prevalence of risk factor = p
- First primary cases (subscript f): prevalence of risk factor = q
- Second primary cases (subscript s): prevalence of risk factor = r
We assume that the “true” relative risk of the risk factor in the general population at risk is ψ, but that this could be different (eg, attenuated) in the population of cancer survivors. Setting this attenuated relative risk to be ϕ, the detectable “signals” for the various designs are as follows:
- Conventional Design: comparison of f versus c: signal = log(ψ)
- Second Primaries Design: comparison of s versus f: signal = log(ϕ)
- Enriched Design: comparison of s versus c: signal = log(ψ) + log(ϕ)
Each design ultimately involves calculation of the odds ratio linking the risk factor with case–control status. The variances of these odds ratios are in effect the variances of the estimates of the signals above. These variances are functions of the prevalences of the risk factor in the relevant comparison groups, and also of the underlying relative risks, ψ and ϕ. Recognizing that q = pψ/(1 – p + pψ) and r = pψϕ/(1 – p + pψϕ), it follows that the 3 variances are as defined below:
where nfc is the number of subjects per group (cases, controls), assumed to be of equivalent size for convenience;
Second Primaries Design:
The power of any of these designs depends on the ratio of the signal to the standard error of the estimate of the signal. This is because the power takes the form
where v is the relevant variance from above. Thus, to obtain the relative efficiency of any 2 designs we need to determine the relative sample sizes required to achieve equivalent signal-to-noise targets. These are obtained simply by equating the corresponding signal-to-noise ratios. Consequently, the sample sizes required to achieve equivalent power are given by
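The expressions referred to in this Appendix can be written out as follows. This is a sketch under stated assumptions rather than a transcription: it uses the standard large-sample variance of a log odds ratio with n subjects in each comparison group, a two-sided normal approximation for power, and the relations for q and r given above. The symbols v_fc, v_sf, v_sc, α, and z are notation introduced here; n_fc, n_sf, and n_sc correspond to nfc, nsf, and nsc in the text.

```latex
% Variances of the estimated signals (n subjects per group)
\[
v_{fc} = \frac{1}{n_{fc}}\left[\frac{1}{p(1-p)} + \frac{1}{q(1-q)}\right]
       = \frac{1}{n_{fc}\,p(1-p)}\left[1 + \frac{(1-p+p\psi)^{2}}{\psi}\right]
\]
\[
v_{sf} = \frac{1}{n_{sf}}\left[\frac{1}{q(1-q)} + \frac{1}{r(1-r)}\right]
       = \frac{1}{n_{sf}\,p(1-p)}\left[\frac{(1-p+p\psi)^{2}}{\psi}
         + \frac{(1-p+p\psi\phi)^{2}}{\psi\phi}\right]
\]
\[
v_{sc} = \frac{1}{n_{sc}}\left[\frac{1}{p(1-p)} + \frac{1}{r(1-r)}\right]
       = \frac{1}{n_{sc}\,p(1-p)}\left[1 + \frac{(1-p+p\psi\phi)^{2}}{\psi\phi}\right]
\]

% Approximate power for a two-sided test at level alpha
\[
\mathrm{power} \approx \Phi\!\left(\frac{|\mathrm{signal}|}{\sqrt{v}} - z_{1-\alpha/2}\right)
\]

% Sample sizes giving the same signal-to-noise ratio as the conventional design
\[
n_{sf} = n_{fc}\left(\frac{\log\psi}{\log\phi}\right)^{2}
         \frac{\dfrac{(1-p+p\psi)^{2}}{\psi} + \dfrac{(1-p+p\psi\phi)^{2}}{\psi\phi}}
              {1 + \dfrac{(1-p+p\psi)^{2}}{\psi}}
\]
\[
n_{sc} = n_{fc}\left(\frac{\log\psi}{\log\psi + \log\phi}\right)^{2}
         \frac{1 + \dfrac{(1-p+p\psi\phi)^{2}}{\psi\phi}}
              {1 + \dfrac{(1-p+p\psi)^{2}}{\psi}}
\]
```

With p = 0.3 and ψ = ϕ = 1.5, the last expression gives roughly one quarter of the conventional sample size for the enriched design, which is in line with the example quoted for Table 2.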