Registers of cancer and other diseases are widely recognized for their contributions to epidemiologic research when linked to other population registers (such as birth, death, or census information) to create study cohorts. Additional information on family relationships provides a valuable resource for genetic epidemiology, and the few countries with family registers have made major contributions to this area of research.1–5 Despite the abundance of familial risk estimates from register data, little work has been done to evaluate how these estimates might be affected by various familial disease models, trends in incidence and demographic changes. Even when these registers cover nation-wide populations over a long period, inferences based on these data may be limited by truncation due to start-up date and censoring dates,6 ascertainment bias from inclusion and exclusion criteria,7–9 length-time bias,10,11 and broken family links due to unknown parents.12 Various statistical methods for modeling such data to estimate familial disease risk9,13–15 do not provide easy solutions.
By generating an “ideal” family register where all relationships are known, simulation tools can explore how standard estimates of disease aggregation in families are affected by the nature of available data and the analytic approach. By modeling the observed patterns of disease risk in relatives, such tools can also provide important clues to disease etiology.
Since the 1980s, demographers have created electronic populations to study factors affecting household formation,16 forecast kin counts,17 and predict kinship networks.18 Numerous computer programs have been developed,19–23 but these do not fit the kind of data of interest in the study of familial aggregation of disease.
In this paper, we present a simulation method for creating a virtual population of complete families. The method allows this simulated population to evolve dynamically over time, not only providing a correct representation of the present but approximating the population in all its intermediary states. We illustrate the utility of this environment for epidemiologic investigations by simulating several familial models of female breast cancer. The input demographic and incidence data and the disease model of the desired population are controlled by the user. The software package, Population Lab, that emerged from this work is written for personal computers, and is available for free download from the authors' website24 or on request.
METHODS
Data Sources
Three readily available data resources for Sweden were used to simulate our population: age-specific population counts, fertility rates, and mortality rates. The earliest population data were those of 1955, available online from Statistics Sweden25 as 1-year age specific counts. Thus we chose 1955 as the base year for our simulations. The fertility and mortality rates were prepared from information from the same web site, augmented by data from older publications.26
We add to this simulation incident female breast cancer that aggregates in families. We do this using several models described in detail below. To simplify our analysis of familial aggregation, we use the Swedish 1980 age-specific breast cancer incidence rates27 for the entire calendar period simulated. We then compare the simulated population with the real Swedish population using the Swedish MultiGeneration Register,28 which identifies the parents of index persons born in Sweden since 1932 together with the sex, year of birth, and year of death of these individuals.
Software
We chose R29 (version 2.0.1 for Windows) as the programming environment due to its flexibility in data handling and storage, especially the ease of using matrix structures. R is freely available and comes with a large collection of statistical techniques and graphical facilities for data analysis, and benefits from regular add-ons from a wide community of users.
OVERVIEW OF SIMULATION
The simulated population is stored in a pedigree structure, ie, a matrix in which each row represents an individual with his or her vital information: ID number (ID), year of birth (YOB), sex, mother's ID number (ID.M), father's ID number (ID.F), and year of death (YOD). These individuals are called index persons. If death has not yet occurred, YOD is set to missing (coded as 0 in our data). With every simulated calendar year, the data matrix grows, as it contains all individuals who ever belonged to the population at any time from the base year to the latest simulated year.
For each new birth, the baby is added to the pedigree file with the parents correctly identified. When a person “dies”, they are not removed from the pedigree; the year of death is simply recorded. Figure 1 highlights a baby girl (ID = 99117) “born” in 2002. Note also the death event (ie, year 2002 is recorded as YOD) for individual number 50001. The pedigree structure enables dynamic storage of all individuals and straightforward identification of kinship at any time point.
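As a rough illustration of the pedigree structure described above, the following Python sketch stores one record per individual, codes missing values as 0, and never deletes a row. It is not the authors' R code: the names `Person`, `add_birth`, `record_death`, and `siblings` are hypothetical, and a dictionary keyed by ID stands in for the matrix.

```python
from dataclasses import dataclass

MISSING = 0  # unknown parent or death not yet occurred, coded as 0


@dataclass
class Person:
    id: int
    yob: int                # year of birth
    sex: str                # "F" or "M"
    id_m: int = MISSING     # mother's ID
    id_f: int = MISSING     # father's ID
    yod: int = MISSING      # year of death


def add_birth(pedigree, year, sex, mother_id, father_id):
    """Append a newborn with both parents identified and return the new ID."""
    new_id = max(pedigree) + 1 if pedigree else 1
    pedigree[new_id] = Person(new_id, year, sex, mother_id, father_id)
    return new_id


def record_death(pedigree, person_id, year):
    """A death only fills in YOD; the row stays in the pedigree."""
    pedigree[person_id].yod = year


def siblings(pedigree, person_id):
    """IDs of individuals who share both parents with person_id."""
    p = pedigree[person_id]
    if p.id_m == MISSING or p.id_f == MISSING:
        return []
    return [q.id for q in pedigree.values()
            if q.id != p.id and q.id_m == p.id_m and q.id_f == p.id_f]


# Toy pedigree: two founders, one existing child, then the events of 2002.
pedigree = {
    1: Person(1, 1970, "F"),
    2: Person(2, 1968, "M"),
    3: Person(3, 1995, "M", id_m=1, id_f=2),
}
baby = add_birth(pedigree, 2002, "F", mother_id=1, father_id=2)
record_death(pedigree, 2, 2002)
```

Because deceased individuals keep their rows, kinship queries such as `siblings` remain valid for any calendar year.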
The simulation starts with the creation of the baseline population of related individuals for the base year (1955). The birth and death processes are applied each year and the population augmented and updated accordingly.
Creation of the Baseline Population
The baseline population is a pedigree structure containing a scaled version of the real population for the base year; it constitutes the input population for the calendar period to be simulated. To construct a correct representation of the base year we need to generate a population of related individuals of a realistic age profile.
Starting with 50,000 females and 50,000 males with the age distributions of Swedish females and males in 1955, we assigned unique ID numbers and an indicator for sex. We “set back the clock” 100 years, assuming these individuals were alive in 1855, and set their year of birth to the difference between 1855 and their age, and their year of death to missing. Their parents' ID numbers were set to missing: in genetic epidemiology, such individuals would be referred to as “founders.”
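The founder step can be sketched as follows, under the assumption that the age distribution is supplied as a mapping from age to the number of founders of that age. The function name `make_founders` and the toy counts are hypothetical; the real input would be the scaled 1955 Swedish age distributions.

```python
RUN_IN_START = 1855  # base year 1955 minus the 100-year run-in
MISSING = 0          # missing parent IDs and year of death coded as 0


def make_founders(age_counts_f, age_counts_m, start_year=RUN_IN_START):
    """Create unrelated founders, assumed alive in start_year at the given ages."""
    founders = []
    next_id = 1
    for sex, counts in (("F", age_counts_f), ("M", age_counts_m)):
        for age in sorted(counts):
            for _ in range(counts[age]):
                founders.append({
                    "ID": next_id,
                    "YOB": start_year - age,   # "set back the clock"
                    "sex": sex,
                    "ID.M": MISSING,
                    "ID.F": MISSING,
                    "YOD": MISSING,
                })
                next_id += 1
    return founders


# Toy input: two newborn girls, one 30-year-old woman, one 45-year-old man.
founders = make_founders({0: 2, 30: 1}, {45: 1})
```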
We programmed a “run-in” simulation of 100 years, where these founders and their descendants give birth with the 1955 fertility rates, and die with the 1955 mortality rates. The details of the fertility and mortality processes are presented in Appendix 1 (procedures GiveBirth() and AssignDeath(), respectively). This run-in produces a population in which no one is left unrelated; even the oldest individuals (100 years old) in the final population were born during the run-in simulation, thus linking them to their parents. Finally, we trim the created population to yield the real age profile for the base year, using the algorithm presented in Appendix 2. Because our aim was to create a baseline population of complete families, we complete the trimmed population with any parents that were removed by the trimming.
Simulation of the Evolving Population
The evolution of the baseline population over time is achieved by 2 simulated events, birth and death, executed by the procedures GiveBirth() and AssignDeath(), which take as input parameters the current simulated calendar year and the current living population from the pedigree file, and update the pedigree accordingly. All simulated events are the results of Bernoulli experiments with probabilities determined by the fertility or death rates for that calendar year.
Simulation of Evolving Populations With a Disease That Aggregates in Families
To demonstrate our simulation tool, we include female breast cancer incidence during the simulation, adding a year-of-incidence (YOI) column to the pedigree file. Incident cancers are assigned each year during run-in and evolution of the population, using a Bernoulli process that operates on individuals who are alive and cancer free (see procedure AssignCancer() in Appendix 1). By including disease incidence in the run-in period, the baseline population has prevalent cancer cases with the year of incidence recorded. As explained above, we used constant incidence rates over calendar time throughout the simulated calendar period. In simulating death in this population, the age-specific mortality applied by procedure AssignDeath() is increased by a constant factor (2) for diseased individuals.
We simulate several models of familial association: (i) the parental relative-risk model of disease aggregation, where a woman's age-specific risk of disease incidence is increased by a constant factor if her mother is a case, (ii) the parental odds-ratio model, where the odds of disease are increased by a constant factor in daughters of cases, and (iii) a model where the relative risk is modified by maternal age at incidence. For each of the first 2 models, we simulate separately a “null hypothesis” population of no familial aggregation of disease, and an “alternative hypothesis” population where the risk and the odds, respectively, are doubled in daughters of affected mothers. In our third model, we increase the risk of disease by a factor of 4 for women whose mothers were younger than 50 years of age at diagnosis, compared with daughters of unaffected mothers, and by a factor of 2 for daughters of women diagnosed after the age of 50. In addition, we considered the sibling relative-risk model where a woman's age-specific disease risk is doubled after a diagnosis in any of her sisters.
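To make the difference between these models concrete, the sketch below converts a baseline yearly probability `p0` into the probability for a daughter of an affected mother under each model. The function names are hypothetical, the cap at 1.0 is a safeguard added here, and treating a maternal diagnosis at exactly age 50 as belonging to the factor-2 group is an assumption, since the text does not specify that boundary.

```python
def risk_relative(p0, rr=2.0):
    """Relative-risk model: the probability itself is multiplied."""
    return min(1.0, rr * p0)


def risk_odds(p0, odds_ratio=2.0):
    """Odds-ratio model: the odds p/(1-p) are multiplied (requires p0 < 1)."""
    odds = odds_ratio * p0 / (1.0 - p0)
    return odds / (1.0 + odds)


def risk_by_maternal_age(p0, mother_age_at_dx):
    """Third model: factor 4 if the mother was diagnosed before 50, else 2.

    mother_age_at_dx is None for daughters of unaffected mothers.
    """
    if mother_age_at_dx is None:
        return p0
    factor = 4.0 if mother_age_at_dx < 50 else 2.0
    return min(1.0, factor * p0)


p0 = 0.002  # illustrative baseline yearly probability, not a Swedish rate
p_rr = risk_relative(p0)
p_or = risk_odds(p0)
```

For a rare outcome the two models nearly coincide: with `p0 = 0.002`, the relative-risk model gives 0.004 and the odds-ratio model gives 0.004/1.002, slightly less.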
Statistical Analysis for Estimating the Familial Risk of Disease
We analyze incident breast cancer using appropriate statistical models to verify that the familial association parameter employed in the simulation can be “extracted” accurately from the pedigrees. Depending on the analyzed population, familial exposure enters analyses either as a binary variable (parental- and sibling-risk and parental-odds models) or a categorical variable with the reference group consisting of unexposed daughters (the model where familial risk changes with maternal age at incidence).
Poisson Regression
Data are first summarized for each calendar year to yield the total number of persons at risk and the total number of cases in each stratum defined by the familial exposure and age group. If age-specific incidence rates are constant over the entire simulated time interval (as in our illustration), the data can be further collapsed to obtain the total number of cases and total number of individuals at risk in each age stratum over the entire period.
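The collapsing step can be sketched in Python as below. Fitting the Poisson regression itself would require a GLM library, so this sketch substitutes a Mantel-Haenszel rate ratio as a simple age-adjusted stand-in for the incidence rate ratio; that substitution, the function names, and the toy counts are all assumptions made for illustration.

```python
from collections import defaultdict


def collapse(yearly_rows):
    """Sum cases and persons at risk over calendar years per stratum.

    yearly_rows holds tuples (year, age_group, exposed, cases, at_risk).
    """
    totals = defaultdict(lambda: [0, 0])
    for _year, age_group, exposed, cases, at_risk in yearly_rows:
        totals[(age_group, exposed)][0] += cases
        totals[(age_group, exposed)][1] += at_risk
    return {key: tuple(value) for key, value in totals.items()}


def mantel_haenszel_irr(strata):
    """Age-adjusted rate ratio, exposed versus unexposed."""
    num = den = 0.0
    for age_group in {key[0] for key in strata}:
        a, t1 = strata.get((age_group, 1), (0, 0))  # exposed cases, person-years
        b, t0 = strata.get((age_group, 0), (0, 0))  # unexposed cases, person-years
        total = t1 + t0
        if total == 0:
            continue
        num += a * t0 / total
        den += b * t1 / total
    return num / den


# Toy data in which the exposed rate is twice the unexposed rate in each stratum.
rows = [
    (1955, "40-49", 1, 4, 1000), (1955, "40-49", 0, 20, 10000),
    (1956, "40-49", 1, 4, 1000), (1956, "40-49", 0, 20, 10000),
    (1955, "50-59", 1, 6, 1000), (1955, "50-59", 0, 45, 15000),
]
strata = collapse(rows)
irr = mantel_haenszel_irr(strata)
```

In these toy data the exposed rate is exactly double the unexposed rate in both age groups, so the estimator returns 2.0.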
Nested Case–Control Analysis
For each cancer case incident in 1955–2002, we randomly select 3 controls born in the same year who are alive and cancer-free in the year of incidence of the case. The data are analyzed by means of conditional logistic regression.
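The control-sampling step might look like the following sketch; the conditional logistic regression is not shown. The names `is_at_risk` and `sample_controls` are hypothetical. Two choices are assumptions not stated in the text: a person who dies or is diagnosed in the case's year of incidence is excluded, and a case with fewer than 3 eligible controls receives all of them.

```python
import random


def is_at_risk(person, year):
    """Born by `year`, still alive, and cancer free in that year (0 = missing)."""
    born = person["YOB"] <= year
    alive = person["YOD"] == 0 or person["YOD"] > year
    cancer_free = person["YOI"] == 0 or person["YOI"] > year
    return born and alive and cancer_free


def sample_controls(women, case, n_controls=3, rng=random):
    """Randomly pick controls matched to the case on year of birth."""
    year = case["YOI"]
    eligible = [w for w in women
                if w["ID"] != case["ID"]
                and w["YOB"] == case["YOB"]
                and is_at_risk(w, year)]
    return rng.sample(eligible, min(n_controls, len(eligible)))


def woman(i, yob, yod=0, yoi=0):
    return {"ID": i, "YOB": yob, "YOD": yod, "YOI": yoi}


women = [
    woman(1, 1940, yoi=1990),   # the case
    woman(2, 1940),             # eligible
    woman(3, 1940, yod=1985),   # died before 1990
    woman(4, 1940, yoi=1980),   # already a case in 1990
    woman(5, 1941),             # different birth year
    woman(6, 1940), woman(7, 1940), woman(8, 1940),
    woman(9, 1940, yoi=1995),   # future case, still eligible in 1990
]
controls = sample_controls(women, women[0], rng=random.Random(1))
control_ids = sorted(c["ID"] for c in controls)
```

A woman who becomes a case later (ID 9) can still serve as a control in 1990.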
Cox Regression Analysis
The study population consists of all individuals who were alive and cancer-free at the beginning of follow-up (1955). The entry time is age in 1955 or zero for those born later. The exit time is age at incidence, death, or end of follow-up (2002), whichever comes first. If, during the follow-up of any individual, their relative becomes an incident case, family history enters the analysis as a time-varying covariate.
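One way to prepare such data is to split each woman's follow-up at the age when her mother is diagnosed, giving one interval per exposure state. The sketch below is a hypothetical helper, not the authors' code, and assumes ages are computed as differences of calendar years. It returns tuples (entry age, exit age, exposed, event) and an empty list for anyone not at risk at the start of follow-up.

```python
def follow_up(person, mother_yoi, start=1955, end=2002):
    """Counting-process intervals for one woman; 0 codes a missing year."""
    yob = person["YOB"]
    entry = max(0, start - yob)
    stops = [end] + [y for y in (person["YOI"], person["YOD"]) if y != 0]
    exit_year = min(stops)
    exit_age = exit_year - yob
    if exit_age <= entry:
        return []                      # dead, diseased, or unborn throughout
    event = int(person["YOI"] == exit_year)
    if mother_yoi == 0:
        return [(entry, exit_age, 0, event)]
    switch = mother_yoi - yob          # daughter's age at mother's diagnosis
    if switch <= entry:
        return [(entry, exit_age, 1, event)]
    if switch >= exit_age:
        return [(entry, exit_age, 0, event)]
    return [(entry, switch, 0, 0), (switch, exit_age, 1, event)]


# Born 1950, diagnosed 1995, mother diagnosed 1980.
split = follow_up({"YOB": 1950, "YOI": 1995, "YOD": 0}, 1980)
```

Here the woman contributes unexposed time from age 5 to 30 and exposed time from age 30 to 45, with the event at the end of the second interval.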
RESULTS
The Baseline Population of 1955
Figure 2A illustrates the age profile of the untrimmed simulated female population and the real 1955 Swedish female population. Figure 2B compares the age profile of the female baseline (trimmed and completed) population with the real 1955 population. The male baseline population exhibits the same good agreement (data not shown).
Table 1 compares the sibship size distribution for individuals younger than 15 years of age in the 1955 baseline population with the earliest year available (1960) from the Swedish MultiGeneration Register. When a parent from the real population had more than one partner we considered only those children born to this parent and their first partner. It has been documented elsewhere28 that parents' IDs are more often missing in the early years of the MultiGeneration Register, so our comparison can be only approximate. Small empirical standard errors from repeated creations of the baseline population (100 times) indicate stability of the sibship-size distribution across simulations (data not shown).
The Population Evolving Over the Calendar Period 1955–2002
Figure 3 shows the simulated population alive in 2002 versus the corresponding real Swedish population. The fertility and mortality processes applied during the simulation resulted in a virtual population with a very realistic age profile, despite our simplistic assumption of a closed population (ie, no migration was considered).
The comparison of sibship size in the real and simulated populations of 2002 (Table 2) indicates reasonable agreement; this was stable across repeated simulations.
Figure 4 plots the average year of first birth recorded in the simulated population and in the MultiGeneration Register, by mother's birth cohort. The 2 sources show excellent agreement for births since 1932 (the inclusion criterion for the MultiGeneration Register).
The Population Evolving With a Disease That Aggregates in Families
We performed standard epidemiologic analyses to validate that the parameters of familial aggregation used in the simulations are faithfully estimated. Table 3 presents the estimated parameters of familial association from single realizations of each of our populations. When analyzing the alternative-hypothesis populations, with a 2-fold increase in risk and odds, respectively, we obtained estimates very close to the true value. For the population simulated under the model where familial risk depends upon maternal age at incidence, we show the results of the analyses that correctly model the level of exposure, and also the results of “naive” analyses where exposure is treated as an indicator variable. The former analyses give confidence intervals that include the true values. We obtained similar results for the sibling relative risk model; for example, the estimate of IRR from a Poisson analysis of the alternative hypothesis population was 1.89, with the 95% CI (1.64–2.18) including the true value of 2.0.
We used the population simulated under the parental relative-risk model to illustrate the bias in familial risk estimates due to truncation. We mimicked the start-up effect of cancer registration by truncating maternal cancers if the mother was an incident case before 1955. Figure 5 plots 10-year period-specific IRRs from Poisson analyses, based on the population where exposure is complete and the truncated population. The estimates of familial risk from the complete population are close to the true value (ie, 2.0). In the truncated population, the bias decreases with time after registry initiation and has disappeared after 20 years of follow-up.
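The truncation rule can be expressed in a few lines. In this sketch the function name `maternal_history_year` and the toy list of maternal incidence years are hypothetical; a value of 0 means the mother is not a known case.

```python
def maternal_history_year(mother_yoi, registry_start=None):
    """Year from which a daughter counts as exposed, or 0 if never.

    With registry_start set, maternal cancers diagnosed before that year
    are unobserved, mimicking the start-up of cancer registration.
    """
    if mother_yoi == 0:
        return 0
    if registry_start is not None and mother_yoi < registry_start:
        return 0
    return mother_yoi


mothers_yoi = [0, 1948, 1960, 1954, 1955]
complete = [maternal_history_year(y) for y in mothers_yoi]
truncated = [maternal_history_year(y, registry_start=1955) for y in mothers_yoi]

# Daughters who are truly exposed but appear unexposed after truncation.
misclassified = sum(1 for c, t in zip(complete, truncated) if c != 0 and t == 0)
```

Daughters misclassified in this way dilute the unexposed group with truly exposed women, which is the source of the bias shown in Figure 5.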
DISCUSSION
Our objective was to create an environment that allows an investigator to assess the performance of population-based studies of familial aggregation of diseases. We have demonstrated that it is possible, by using simple vital statistics, to simulate a realistic virtual population of related individuals. Our population agreed well with the real population on important features such as age profile, sibship size distribution, and average age-gap between mother and first born (ie, age at first birth). These features exhibited very good reproducibility in repeated simulations. For illustration we used breast cancer as a general example of a disease that aggregates in families, with its approximate 2-fold increase in risk for first degree relatives.30,31 We demonstrated how our simulated population can be used to explore various familial disease models and to provide insights into how incomplete family history in register data may affect estimates of familial aggregation.
We began with the creation of a baseline population with desired age profile for Sweden in 1955, using as input the simple population counts and vital rates available in many countries. This stand-alone tool generates the web of family relations at a given point in time, which by itself may be of interest and is the first step in starting our virtual population register. On simulating the evolution of this population over time, an ideal population was mimicked, where disease models and all relationships are fully known, and demographic and disease characteristics can be easily extracted for any given year. The key to the ease of handling the data is the flexible “pedigree file” format in which they are stored. The final simulated population follows closely the age profile of the real population.28 The baseline and evolved populations both exhibit a reasonable sibship-size distribution, but they differ somewhat from the real population. Some of this discrepancy may be due to the incomplete family links in the MultiGeneration Register, especially for earlier birth cohorts. However, the surplus of families of size one and the deficit of families of size 2 are to be expected from simulating each birth as an independent event. In reality, a woman's childbearing is influenced by many factors, including desired family size26 and societal norms. It is noteworthy that the average age at first birth showed excellent agreement between the real and simulated populations, as expected when using the correct age-specific fertility rates for each year.
To create a framework suitable for a wide range of applications, we made a number of simplifying assumptions: (i) For each mother, we chose a spouse close in age (from 1 year younger to 4 years older), which is realistic in our Swedish data. Although this could be extended to a stochastic model, it is not necessary here where spouses' age is not the primary research interest. (ii) In applying fertility rates we assigned each new birth as an independent event. Future extensions could accommodate more realistic fertility patterns (for example, influenced by parity and gap between offspring) and other family structures such as half-siblings and adoptions, though this would require the availability of appropriate population data and additional programming. (iii) In our simulation, diseased women cease reproduction. This assumption would be reasonable for almost all adult cancers since age at onset is after reproductive years for the great majority of cases. (iv) We modeled age-specific disease incidence as constant over time to simplify our illustrative analyses, but the software can accept year-specific incidence rates. (v) We simulated a closed population without immigration or emigration. Thus our method is suitable for studies of homogeneous populations; it could be extended to include immigration/emigration, provided the data are available.
There are several immediate applications of our simulation tool. For example, researchers can specify their own input data to create realistic settings for exploration of epidemiologic hypotheses. Various diseases can be modeled by specifying the disease parameters and choosing from several familial models. The simulation can also be run to a specified time point in the future (using projected vital rates or the latest rates available) to extrapolate the estimates of population disease burden or other features of interest. The influence of family structure on definitions of exposure and estimates of familial risk can also be explored. Studies of disease aggregation in families are complicated by the nature of the “exposure” of interest (ie, affected family members, such as parents or siblings), whose definition is affected by family size and age-gap between relatives.32 The power of a research study under various models of familial aggregation can also be calculated. Such investigations are especially relevant when needing to incorporate expensive genetic data, such as biomarker or molecular information. Another useful application is in assessing the magnitude of biases of familial aggregation estimates due to truncation of disease events or family history in the context of population registers, and the performance of methods proposed to deal with them. For example, Figure 5 illustrates the bias in the estimates of the familial aggregation at the beginning of registration due to the loss of family history information.
Our main disease model assigns a relative risk of incident cancer to individuals with a family history, in keeping with the usual practice of analyzing cohort data using Poisson or Cox regression to estimate IRRs and HRs. However, since it is common practice to extract nested case–control studies from population cohorts, we also investigated an odds ratio model. In addition, we explored models where family risk varied with age at onset in the affected mother (a known feature of many heritable diseases), and a familial risk posed by an affected sibling. A simulation tool equipped with several disease models enables the investigation of the usual analytical strategies for their robustness to the assumed underlying disease model. One can also investigate the performance of new methods proposed for handling special data structures: for example, the analysis of independent family clusters9 for correction of the standard errors of familial risk estimates due to the correlation between related individuals, or the case-cohort families approach recently proposed for applications where analysis of all affected families in the population is computationally intensive.33,34
This simulation tool can be extended in various ways to deepen our understanding of familial disease risk. For example, the disease model can specify an increased risk for individuals having an affected parent from the time they reach that parent's age at incidence, a genetic “doom” (from birth) for the descendants of a subpopulation of cases, or offspring risk modified by sex of parent (ie, affected mother versus affected father). In addition, differential mortality for familial versus nonfamilial cases could be easily accommodated. Features of reproductive history known to be associated with disease, such as delayed childbearing, age at first birth and parity, can be included in future risk models to study their role in familial aggregation. For example, in breast cancer studies one could model the input risk parameter as a function of age at first birth, parity and age at menopause.35
The practicality of applying our methodology is enhanced by both the simple input data (ie, vital statistics that are easily available for many countries) and the provision of our software for free download. With a running version of R29 and minimal programming effort, the user can investigate the performance of various analytic approaches to family data. Extensions of the package to incorporate additional features specific to various research questions can create valuable tools for experimentation and investigation.
ACKNOWLEDGMENTS
We gratefully acknowledge the crucial advice of Stefano Calza in making Population Lab a fast and feasible software for personal computers. Warm thanks to Margareta Larsson for providing us with copies of mortality statistics in old publications from Statistics Sweden.
REFERENCES
1. Hemminki K, Li X, Plna K, et al. The nation-wide Swedish family-cancer database–updated structure and familial rates. Acta Oncol. 2001;40:772–777.
2. Daugherty SE, Pfeiffer RM, Mellemkjaer L, et al. No evidence for anticipation in lymphoproliferative tumors in population-based samples. Cancer Epidemiol Biomarkers Prev. 2005;14:1245–1250.
3. Nielsen NM, Westergaard T, Rostgaard K, et al. Familial risk of multiple sclerosis: a nationwide cohort study. Am J Epidemiol. 2005;162:774–778.
4. Esplin MS, Fausett MB, Fraser A, et al. Paternal and maternal components of the predisposition to preeclampsia. N Engl J Med. 2001;344:867–872.
5. Lie RT, Rasmussen S, Brunborg H, et al. Fetal and maternal contributions to risk of preeclampsia: population based study. Br Med J. 1998;316:1343–1347.
6. Cupples LA, Risch N, Farrer LA, et al. Estimation of morbid risk and age at onset with missing information. Am J Hum Genet. 1991;49:76–87.
7. Smith DG, Sing CF. Sampling biases in longitudinal genetic-epidemiologic surveys. Hum Biol. 1976;48:529–539.
8. Burton PR, Palmer LJ, Jacobs K, et al. Ascertainment adjustment: where does it take us? Am J Hum Genet. 2000;67:1505–1514.
9. Pfeiffer RM, Gail MH, Pee D. Inference for covariates that accounts for ascertainment and random genetic effects in family studies. Biometrika. 2001;88:933–948.
10. Davidov O, Zelen M. Referent sampling, family history and relative risk: the role of length-biased sampling. Biostatistics. 2001;2:173–181.
11. Lindström L, Pawitan P, Reilly M, et al. Estimation of genetic and environmental factors for age-of-onset of disease from population-based family data. Stat Med. 2006;25:3110–3123.
12. Paltiel O, Schmit T, Adler B, et al. The incidence of lymphoma in first-degree relatives of patients with Hodgkin disease and non-Hodgkin lymphoma: results and limitations of a registry-linked study. Cancer. 2000;88:2357–2366.
13. Rao DC, Wette R. Nonrandom sampling in genetic epidemiology: maximum likelihood methods for multifactorial analysis of quantitative data ascertained through truncation. Genet Epidemiol. 1987;4:357–376.
14. Andersen EW, Andersen PK. Adjustment for misclassification in studies of familial aggregation of disease using routine register data. Stat Med. 2002;21:3595–3607.
15. Pfeiffer RM, Goldin LR, Chatterjee N, et al. Methods for testing familial aggregation of diseases in population-based samples: application to Hodgkin lymphoma in Swedish registry data. Ann Hum Genet. 2004;68:498–508.
16. Wachter KW, Hammel EA, Laslett P. Statistical Studies of Historical Social Structure. New York: Academic Press; 1978.
17. Hammel EA, Wachter KW, McDaniel CK. The Kin of the Aged in A.D. 2000. In: Kiesler S, Morgan J, Oppenheimer V, eds. Aging. New York: Academic Press; 1981:11–40.
18. Hammel EA, Mason C, Wachter KW, et al. Rapid population change and kinship: the effects of unstable demographic changes on Chinese kinship networks, 1750–2250. In: Tapinos G, Blanchet D, Horlacher D, eds. Consequences of Rapid Population Growth in Developing Countries. New York: Taylor and Francis; 1991:243–271.
19. Smith JE. The computer simulation of kin sets and kin counts. In: Bongaarts J, Burch T, Wachter K, eds. Family Demography: Methods and Their Applications. Oxford: The Clarendon Press; 1987:249–266.
20. Hampe J, Wienker T, Schreiber S, Nurnberg P. POPSIM: a general population simulation program. Bioinformatics. 1998;14:458–464.
21. Hammel EA, Hutchinson D, Wachter KW, et al. The SOCSIM Demographic-Sociological Microsimulation Program Operating Manual. Institute of International Studies Monograph No. 27. California: University of California, Berkeley; 1976.
22. Hammel EA, Mason C, Wachter KW. SOCSIM II, a Sociodemographic Microsimulation Program, Rev. 1.0, Operating Manual. Program in Population Research Working Paper No. 29. California: Institute of International Studies, University of California, Berkeley; 1990.
26. Johansson L, Finnas F. Fertility of Swedish Women Born 1927–1960. Orebro: Statistics Sweden; 1983.
27. CANCERMondial Statistical Information System [database online]: International Agency for Research on Cancer. Available at: http://www-dep.iarc.fr/ Updated March 23, 2006. Accessed June 30, 2006.
30. Collaborative Group on Hormonal Factors in Breast Cancer. Familial breast cancer: collaborative reanalysis of individual data from 52 epidemiological studies including 58,209 women with breast cancer and 101,986 women without disease. Lancet. 2001;358:1389–1399.
31. Hemminki K, Granstrom C, Czene K. Attributable risks for familial breast cancer by proband status and morphology: a nationwide epidemiologic study from Sweden. Int J Cancer. 2002;100:214–219.
32. Zelen M. Risks of cancer and families (editorial). J Natl Cancer Inst. 2005;97:1556–1557.
33. Moger TA, Pawitan Y, Borgan O. Case-Cohort Methods for Survival Data on Families From Routine Registers. University of Washington Biostatistics Working Paper Series, Paper 227. 2006. Available at http://www.bepress.com/uwbiostat/paper277 Accessed March 14, 2007.
34. Li H, Yang P, Schwartz AG. Analysis of age at onset data from case-control family studies. Biometrics. 1998;54:1030–1039.
35. Lambe M, Hsieh CC, Tsaih SW, et al. Parity, age at first birth and the risk of carcinoma in situ of the breast. Int J Cancer. 1998;77:330–332.
APPENDIX 1: SIMULATION STEPS
For every year to be simulated, procedures GiveBirth(), AssignDeath() and AssignCancer() are applied to the index individuals in the pedigree. The main features of these procedures are as follows:
Procedure GiveBirth()
1. Select all women alive.
2. From these women, select only those who are in the childbearing age interval (assumed to be between 15 and 46 years).
3. From this selection, 2 sets are created: those who are already mothers and those who do not yet have children.
4. For women who are already mothers, identify their partners (ie, the fathers of their children). If this father has died, the woman is no longer considered a potential mother. For each of the remaining couples, generate a Bernoulli event based on the age-specific probability of a woman giving birth to a child in that calendar year.
5. For the women who do not already have children, generate a Bernoulli event based on the age-specific probability of giving birth to a child in that calendar year. For each woman chosen to become a mother (ie, the Bernoulli outcome was a “success”) assign a partner from the living males who have not been fathers before, and who are at most 1 year younger or 4 years older than the mother.
6. For each couple chosen as parents, add a new child to the pedigree, recording the newly generated ID, year of birth (ie, current simulated calendar year), sex (using a Bernoulli experiment with probability 0.5), and the ID of the mother and the father. The year of death is set to missing.
Note 1: In simulations involving cancer incidence, Step 1 of the GiveBirth() procedure selects women who are alive and cancer free, ie, it is assumed that women will not bear children after a cancer diagnosis.
Note 2: The age-gap between spouses (from −1 to 4 years) was chosen after inspecting the real Swedish population.
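The steps of this procedure can be sketched in Python as below. This is an illustration, not the authors' R implementation. The pedigree is assumed to be a list of dictionaries, `fertility` a mapping from age to yearly birth probability, and a woman who finds no eligible partner simply has no birth that year, which is an assumption made here.

```python
import random


def give_birth(pedigree, year, fertility, rng, min_age=15, max_age=46):
    """Apply one year of births; append newborns and return them."""
    by_id = {p["ID"]: p for p in pedigree}
    partner_of, fathers = {}, set()
    for p in pedigree:                      # recover existing couples
        if p["ID.M"] != 0 and p["ID.F"] != 0:
            partner_of[p["ID.M"]] = p["ID.F"]
            fathers.add(p["ID.F"])
    newborns, next_id = [], max(by_id) + 1
    for w in pedigree:
        # Steps 1-2: living (and cancer-free, Note 1) women of childbearing age.
        if w["sex"] != "F" or w["YOD"] != 0 or w.get("YOI", 0) != 0:
            continue
        age = year - w["YOB"]
        if not min_age <= age <= max_age:
            continue
        p_birth = fertility.get(age, 0.0)
        if w["ID"] in partner_of:           # Step 4: already a mother
            father_id = partner_of[w["ID"]]
            if by_id[father_id]["YOD"] != 0 or rng.random() >= p_birth:
                continue
        else:                               # Step 5: first birth
            if rng.random() >= p_birth:
                continue
            candidates = [m["ID"] for m in pedigree
                          if m["sex"] == "M" and m["YOD"] == 0
                          and m["ID"] not in fathers
                          and -1 <= w["YOB"] - m["YOB"] <= 4]
            if not candidates:
                continue
            father_id = rng.choice(candidates)
            fathers.add(father_id)
        # Step 6: record the child with both parents identified.
        newborns.append({"ID": next_id, "YOB": year,
                         "sex": "F" if rng.random() < 0.5 else "M",
                         "ID.M": w["ID"], "ID.F": father_id, "YOD": 0})
        next_id += 1
    pedigree.extend(newborns)
    return newborns


def person(i, yob, sex):
    return {"ID": i, "YOB": yob, "sex": sex, "ID.M": 0, "ID.F": 0, "YOD": 0}


rng = random.Random(0)
certain = {age: 1.0 for age in range(15, 47)}   # toy rates: a birth every year
ped = [person(1, 1930, "F"), person(2, 1928, "M"),
       person(3, 1900, "M"), person(4, 1950, "F")]
first = give_birth(ped, 1955, certain, rng)
second = give_birth(ped, 1956, certain, rng)
ped[1]["YOD"] = 1956                            # the father dies
third = give_birth(ped, 1957, certain, rng)
```

With the toy fertility of 1.0, woman 1 is paired with man 2 (the only male within the allowed age gap), has a second child with him the following year, and has no further births once he has died.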
Procedure AssignDeath()
Procedure AssignDeath() does not erase the individuals from the population when they “die”, but rather assigns the current (simulated) calendar year as their year of death.
1. Select all individuals alive in the pedigree during the current year.
2. For every individual in this selection, assign a period-, age-, and sex-specific probability of dying.
3. Generate a Bernoulli process based on this probability for each living person and assign the current calendar year as the year of death where this was a success. Individuals who reach 100 years of age are automatically assigned the current simulated year as their year of death.
Note: In our illustration, the age-specific mortality rates of diseased individuals are increased by a constant factor of 2.
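A minimal Python sketch of this procedure follows. It is not the authors' R code: `death_prob` is a hypothetical callable returning the period-, age-, and sex-specific probability, and the factor of 2 for diseased individuals is applied to that probability (capped at 1), which is a simplification of multiplying the mortality rate.

```python
import random


def assign_death(pedigree, year, death_prob, rng, max_age=100, cancer_factor=2.0):
    """Apply one year of deaths in place; return the number of deaths."""
    n_deaths = 0
    for p in pedigree:
        if p["YOD"] != 0 or p["YOB"] > year:     # Step 1: living individuals
            continue
        age = year - p["YOB"]
        if age >= max_age:                       # automatic death at 100
            p["YOD"] = year
            n_deaths += 1
            continue
        prob = death_prob(year, age, p["sex"])   # Step 2
        if p.get("YOI", 0) != 0:                 # diseased: increased mortality
            prob = min(1.0, cancer_factor * prob)
        if rng.random() < prob:                  # Step 3: Bernoulli event
            p["YOD"] = year                      # the row is kept, only YOD changes
            n_deaths += 1
    return n_deaths


def toy_death_prob(year, age, sex):
    """Illustrative probabilities only, not Swedish mortality."""
    return 0.5 if age == 55 else 0.0


people = [
    {"ID": 1, "YOB": 1855, "sex": "F", "YOD": 0, "YOI": 0},     # aged 100
    {"ID": 2, "YOB": 1930, "sex": "M", "YOD": 0, "YOI": 0},     # probability 0
    {"ID": 3, "YOB": 1900, "sex": "F", "YOD": 0, "YOI": 1950},  # diseased, 0.5 x 2
]
deaths = assign_death(people, 1955, toy_death_prob, random.Random(0))
```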
Procedure AssignCancer()
1. Identify all women alive and cancer free in the pedigree in the current year and assign to each of them an age-specific probability of an incident cancer. Generate a Bernoulli outcome based on this probability and if this is a success, assign the current calendar year as this woman's year of incidence.
2. If we wish to simulate a pedigree where a positive family history (affected mother) changes risk, we apply the additional step of using different incidence rates for individuals with an affected mother. For example, if a woman's mother is known to be a cancer case, we double the woman's risk of cancer compared to a woman the same age whose mother is cancer free.
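The two steps above can be sketched in Python under the parental relative-risk model. This is an illustration rather than the authors' R code: `incidence` is assumed to map age to a yearly probability, and new cases are recorded only after all women have been processed, so that a mother diagnosed in the current year does not alter her daughter's risk until the next year (an assumption made here).

```python
import random


def assign_cancer(pedigree, year, incidence, rng, familial_rr=1.0):
    """Apply one year of incidence in place; return the IDs of new cases."""
    by_id = {p["ID"]: p for p in pedigree}
    new_cases = []
    for w in pedigree:
        # Step 1: women who are born, alive, and cancer free this year.
        if (w["sex"] != "F" or w["YOD"] != 0 or w["YOI"] != 0
                or w["YOB"] > year):
            continue
        prob = incidence.get(year - w["YOB"], 0.0)
        # Step 2: modified risk for daughters of affected mothers.
        mother = by_id.get(w["ID.M"])            # None for founders (ID.M = 0)
        if mother is not None and mother["YOI"] != 0:
            prob = min(1.0, familial_rr * prob)
        if rng.random() < prob:
            new_cases.append(w["ID"])
    for case_id in new_cases:
        by_id[case_id]["YOI"] = year
    return new_cases


def person(i, yob, sex, id_m=0, yoi=0):
    return {"ID": i, "YOB": yob, "sex": sex, "ID.M": id_m, "YOD": 0, "YOI": yoi}


people = [
    person(1, 1920, "F", yoi=1980),   # affected mother, no longer at risk
    person(2, 1950, "F", id_m=1),     # exposed daughter, aged 40 in 1990
    person(3, 1950, "M", id_m=1),     # men are not at risk in this model
    person(4, 1920, "F"),             # unexposed, aged 70, toy probability 0
]
toy_incidence = {40: 0.5, 70: 0.0}    # illustrative, not real incidence rates
cases = assign_cancer(people, 1990, toy_incidence, random.Random(0), familial_rr=2.0)
```

The toy probabilities are chosen so that the doubled risk of the exposed daughter equals 1, which makes the outcome deterministic for the example.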
APPENDIX 2: IMPOSING THE REAL AGE DISTRIBUTION ON THE SIMULATED POPULATION
Choosing the modal age in the simulated population as reference, we denote the relative numbers of individuals in any other age group as $\alpha$ (for the real population) and $\beta$ (for the simulated population). In order to have sufficient individuals in all age groups in the simulated population to impose the real age distribution, we may need to first adjust (ie, reduce) the reference age count. We identify the age $a$ for which the ratio $\beta/\alpha$ is a minimum and denote this ratio as $\beta_a/\alpha_a$. Let $s_a$ denote the simulated count for age $a$ and $s_r$ the simulated count for the reference age. Denoting the relative numbers of individuals for any other (nonreference) age as $\alpha^*$ and $\beta^*$, then clearly $\beta_a/\alpha_a < \beta^*/\alpha^*$.
We adjust the reference count $s_r$ by a factor $f$ so that $\alpha_a f s_r = s_a$. Thus
$$f = \frac{s_a}{\alpha_a s_r}$$
and the number of individuals required in any other group will be
$$\alpha^* f s_r = \frac{\alpha^*}{\alpha_a}\, s_a.$$
The adjusted reference count ensures that there are enough individuals in all age categories to impose the real age distribution.
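As a worked sketch of this trimming rule, the Python function below computes the target count for each age from the condition that the adjusted reference count times the relative size of age a equals the simulated count for age a. The function name and toy counts are hypothetical. Two choices are assumptions made here: the reference age is included in the search for the minimum ratio, so the reference count is never increased, and targets are rounded down to whole individuals. All real counts must be positive.

```python
def trim_targets(real, sim):
    """Target count per age so the trimmed population has the real age profile.

    real, sim: dicts mapping age -> count (real and simulated populations).
    """
    ref = max(sim, key=sim.get)                  # modal age of the simulation

    def ratio(age):
        # beta/alpha, with alpha and beta taken relative to the reference age
        return (sim[age] * real[ref]) / (real[age] * sim[ref])

    a = min(sim, key=ratio)                      # age with the smallest ratio
    # alpha_age * f * s_r simplifies to real[age] * s_a / real[a]
    return {age: real[age] * sim[a] // real[a] for age in sim}


real = {0: 100, 1: 80, 2: 50}      # toy real counts
sim = {0: 400, 1: 200, 2: 150}     # toy simulated counts, modal age 0
targets = trim_targets(real, sim)
```

In this toy case the ratios are 1, 0.625, and 0.75, so age 1 is the limiting age and the factor is 0.625. The targets 250, 200, and 125 reproduce the real proportions 100:80:50 without exceeding any simulated count; individuals in each age group would then be sampled at random down to these targets.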