Background: The National Center for Health Statistics conducts the National Health and Nutrition Examination Survey and other national surveys with probability-based complex sample designs. Goals of national surveys are to provide valid data for the population of the United States. Analyses of data from population surveys present unique challenges in the research process but are valuable avenues to study the health of the United States population.
Objective: The aim of this study was to demonstrate the importance of using complex data analysis techniques for data obtained with complex multistage sampling design and provide an example of analysis using the SPSS Complex Samples procedure.
Methods: Challenges and solutions specific to secondary data analysis of national databases are illustrated using the National Health and Nutrition Examination Survey as the exemplar.
Results: Oversampling of small or sensitive groups provides necessary estimates of variability within small groups. Use of weights without complex samples accurately estimates population means and frequency from the sample after accounting for over- or undersampling of specific groups. Weighting alone leads to inappropriate population estimates of variability, because they are computed as if the measures were from the entire population rather than a sample in the data set. The SPSS Complex Samples procedure allows inclusion of all sampling design elements, stratification, clusters, and weights.
Discussion: Use of national data sets allows use of extensive, expensive, and well-documented survey data for exploratory questions but limits analysis to those variables included in the data set. The large sample permits examination of multiple predictors and interactive relationships. Merging data files, availability of data in several waves of surveys, and complex sampling are techniques used to provide a representative sample but present unique challenges. Sophisticated data analysis techniques optimize use of these data.
Jennifer Saylor, PhD, RN, ACNS-BC, is Assistant Professor, School of Nursing, University of Delaware, Newark.
Erika Friedmann, PhD, is Professor, School of Nursing, University of Maryland, Baltimore.
Hyeon Joo Lee, PhD, RN, CRNP, is Nurse Practitioner, Bay West Endocrinology Associates, Baltimore, Maryland.
Accepted for publication February 28, 2012.
No source of funding was provided for any authors stated above with regard to this manuscript.
The authors have no conflicts of interest to disclose.
Corresponding author: Jennifer Saylor, PhD, RN, ACNS-BC, School of Nursing, University of Delaware, McDowell Hall, Newark, DE 19716 (e-mail: firstname.lastname@example.org).
The National Center for Health Statistics (NCHS) and other federal agencies conduct surveys to provide statistical information to guide actions and policies to improve the health of the American people. National health surveys provide data that can be publicly accessed online or available via limited data set access policies (Table 1). National surveys allow researchers to ask and answer questions on a population level from previously collected and well-documented data without requiring replication of effort or the prohibitive cost of obtaining primary data. However, using data from national health surveys presents challenges, including finding appropriate variables, missing data, and making generalizations to the population from surveys that used complex sampling techniques. Researchers must first locate a data set that includes the variables needed to address a research question. In different waves of data collection, variables are added or deleted. Even the same variable obtained in successive years may be assessed with a variety of instruments and in various ways. This limits the available variables to the researchers and complicates the data analysis. Prior to conducting the data analysis, researchers must merge multiple data files to create a working analytical file. Missing data must be addressed and managed before completing data analysis. Rue, Thompson, Rivara, Mackenzie, and Jurkovich (2008) and Feng, Cong, and Silverstein (2012) published extensive discussions of techniques for managing missing data.
A major analytical challenge occurs when a national health survey uses probability-based complex sample designs to represent the variability within a population. Complex sampling designs are more efficient than simple random samples both in the choice of participants and in the feasibility of collecting data. Complex sample cluster-based designs require enumeration of only clusters that are chosen for sampling and do not require complete enumeration of the population, which would be time consuming at best and impossible at worst (Lohr, 1999). Designs with cluster sampling also allow researchers to visit compact areas rather than very scattered individual residences to obtain in-person interview or laboratory data. In sparsely populated areas, the cost of travel to obtain in-person data from scattered individuals would be prohibitive. The stratified component of complex sampling designs allows researchers to oversample small or sensitive subgroups to represent the variability of these populations (Ciol et al., 2006; Cunningham & Huguet, 2012; Lohr, 1999). Simple random sampling requires considerably larger overall samples to obtain adequate representation of variability within smaller subgroups, for example, the elderly, or racial and ethnic subgroups (Henry, 1990; Lohr, 1999).
Secondary analysis of national survey data presents many challenges for a researcher. These challenges include finding appropriate surveys to measure the variables of interest, dealing with missing data, organizing data, and analysis of data (Huguet, Cunningham, & Newson, 2012). The analysis of data from complex national surveys is especially difficult because the data may be drawn from samples that include groups that are not representative of the population. Oversampling of certain groups or characteristics to ensure inclusion within the sample may create this problem (Henry, 1990; Huguet et al., 2012). Oversampling may be necessary to capture information, but it can introduce bias or error if the data are not analyzed correctly.
Complex Sample Analysis of National Survey Data
The statistical analysis of national survey data must reflect the survey’s complex sample design. Complex sample data analysis programs such as SUDAAN, SPSS Complex Samples analysis, and the survey procedure in SAS are designed to address the sampling design elements such as weights, stratification, and clusters.
Data analysis using weights without complex sample data analysis adjusts for over- and undersampling of groups within the population (Ciol et al., 2006; Cunningham & Huguet, 2012; Lohr, 1999) and provides estimates for groups based on their proportion in the population rather than the proportion in the sample. However, with weighting alone, estimates are computed as if the measures were obtained from the number of cases in the entire population rather than the number of cases in the sample. This can lead to biased results because of low variance estimates and the effects of clusters within the sample (Cunningham & Huguet, 2012; Lohr, 1999).
Complex sample data analysis adjusts for weights, cluster, and stratification of the sampling design to produce unbiased national estimates of population means and frequencies from the sample after taking into account weights for over- or undersampling of specific groups. Complex sample analysis provides estimates of variability based on the number of cases in the sample rather than the number of cases in the population (Cunningham & Huguet, 2012; Lohr, 1999). These larger estimates of variability are appropriate for testing hypotheses from data obtained through complex sample designs and reduce bias in the results. In addition to addressing the problem of estimating variability from an inflated sample size, complex sample data analysis addresses issues related to independence of cases. Clusters include participants who are more similar to one another than to cases from another cluster (Lohr, 1999), so data from clusters of cases violate the assumption of independence common to most statistical techniques (Tabachnick & Fidell, 2001). Complex sample data analysis addresses issues related to independence of cases by accounting for the correlations within clusters.
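The way a design-based analysis derives variability from the sample's strata and clusters can be sketched in a few lines of Python. This is an illustration only, not the SPSS Complex Samples procedure: it assumes a Taylor linearization estimator with clusters treated as sampled with replacement within strata, at least two clusters per stratum, and no finite population correction. The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean(values, weights):
    """Weighted estimate of the population mean."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def design_based_se(values, weights, strata, clusters):
    """Linearized standard error of the weighted mean.

    Assumes clusters are sampled with replacement within strata, so the
    variance comes from the spread of cluster totals inside each stratum.
    """
    total_weight = sum(weights)
    mean = weighted_mean(values, weights)

    # Sum each case's linearized contribution within its stratum and cluster.
    totals = defaultdict(lambda: defaultdict(float))
    for y, w, stratum, cluster in zip(values, weights, strata, clusters):
        totals[stratum][cluster] += w * (y - mean) / total_weight

    variance = 0.0
    for stratum, by_cluster in totals.items():
        n_clusters = len(by_cluster)
        if n_clusters < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two clusters")
        center = sum(by_cluster.values()) / n_clusters
        squared = sum((z - center) ** 2 for z in by_cluster.values())
        variance += n_clusters / (n_clusters - 1) * squared
    return sqrt(variance)


# Hypothetical toy data: two strata, two clusters each.
values = [1, 2, 3, 4, 5, 7]
weights = [1, 1, 1, 1, 2, 2]
strata = ["A", "A", "A", "A", "B", "B"]
clusters = [1, 1, 2, 2, 1, 2]

print(weighted_mean(values, weights))
print(design_based_se(values, weights, strata, clusters))
```

Because the variance is built from differences between cluster totals, cases that resemble one another within a cluster do not count as independent observations, and the standard error depends on the number of sampled clusters rather than on the sum of the weights.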
The purpose of this presentation is to illustrate the problems associated with analyzing data from complex surveys using a secondary analysis of data obtained from a large national survey to assess risk factors for metabolic syndrome. The illustrative example of the metabolic syndrome study uses data from the National Health and Nutrition Examination Survey (NHANES) to demonstrate differences in results of analysis with normal statistical procedures, weighted data with normal statistical procedures, and complex sample data analysis.
Metabolic Syndrome Research
Metabolic syndrome is a cluster of medical disorders (obesity, hypertension, dyslipidemia, and insulin resistance or glucose intolerance) that characteristically occur together in individuals (Lorenzo, Williams, Hunt, & Haffner, 2007) and are risk factors for the development of cardiovascular disease. Individuals with metabolic syndrome are twice as likely to develop cardiovascular disease (Cannon, 2008; Church et al., 2009; Lorenzo et al., 2007; McNeill et al., 2004; Mottillo et al., 2010).
The metabolic syndrome study was designed to examine potential contributions of biopsychosocial factors to the presence of metabolic syndrome. This cross-sectional descriptive study of risk factors associated with metabolic syndrome used a secondary data analysis from NHANES. The NHANES data were collected between January 2007 and December 2008.
The NHANES is a combination of health and nutrition questionnaires and physical examination designed to assess the health and nutritional status of adults and children in the United States (NCHS, 2007). Research utilization of NHANES includes estimating the prevalence of major diseases and risk factors for these diseases in the U.S. population and selecting subgroups and analyzing risk factors for selected disease (NCHS, 2009b).
The NHANES data obtained include demographic, socioeconomic, dietary, and health-related questions. The data are collected via interview using a computer-assisted personal interview methodology and physical examination. Most of the physical examinations are performed in mobile examination centers (MEC) conveniently located throughout the US. The MEC contain high-tech medical equipment for the collection of data that includes medical, dental, physiological measurements, and laboratory tests depending on the participants’ age and gender (NCHS, 2008). Data and documentation files from NHANES surveys are released on an ongoing basis and those conducted after 1971 are available online at http://www.cdc.gov/nchs/nhanes/nhanes_questionnaires.htm.
Probability-based complex sample designs are used in NHANES to represent the civilian, noninstitutionalized U.S. population. Individuals who are residing in nursing homes, members of the armed forces, those who are institutionalized, and U.S. nationals living outside the US are excluded (NCHS, 2009a; Figure 1). Similar to other national surveys, the NHANES data files contain the stratification and cluster variable information necessary for conducting complex sample data analysis.
The study design includes a representative sample of the civilian, noninstitutionalized U.S. population by age, gender, and income level. To increase reliability and precision of health status indicator estimates, persons who were over 60 years, African American, with low income, or Hispanic (NCHS, 2009a) were oversampled in the NHANES 2007–2008. Each participant represents approximately 50,000 U.S. residents.
Sample of the Metabolic Syndrome Study
The sample of this study consisted of participants in NHANES 2007–2008 who (a) completed the interview and medical examination and fasting laboratory test, (b) were older than 20 years, and (c) were not pregnant. In NHANES 2007–2008, among the participants who completed both the interview and examination (n = 9,762), 6,917 were over 20 years. Out of those 6,917 participants, 330 women were pregnant at the time of the interview and were excluded from the study. The participants who did not complete fasting laboratory tests were excluded, yielding a final sample size of 2,583 participants.
Data Analysis Preparation
An analytical file was created before beginning statistical analysis. The NHANES data are provided in a number of separate files; each contains information from one data form completed at one administration time. For example, the baseline demographic variables and weights are included in one data file. The laboratory data required for the metabolic syndrome study were included in three separate files. Each case is identified by a unique case sequence identification number that is used to match information for cases from different data files. Data are included as individual items in the data files and summary scores as necessary. In other instances, researchers must calculate summary scores from the raw data for a particular variable. For the metabolic syndrome study, an analytical data file was created by using variables from 11 individual data files and scoring questionnaires when required (Table 2). A detailed list of the data file name, variable name, and unit of measure is shown in Table 3.
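Matching cases across separate data files by their sequence identification number can be sketched as follows. This is a minimal illustration, assuming each file has already been read into a list of row dictionaries. The helper name merge_on_id and the field names other than the identifier are hypothetical, and the identifier is written here as SEQN, the name NHANES uses for the respondent sequence number.

```python
def merge_on_id(key, *tables):
    """Inner-join lists of row dictionaries on a shared identifier.

    Only cases present in every table are kept, which mirrors building an
    analytical file restricted to cases with complete components.
    """
    merged = {row[key]: dict(row) for row in tables[0]}
    for table in tables[1:]:
        lookup = {row[key]: row for row in table}
        merged = {
            case_id: {**row, **lookup[case_id]}
            for case_id, row in merged.items()
            if case_id in lookup
        }
    return [merged[case_id] for case_id in sorted(merged)]


# Hypothetical rows standing in for three separate data files.
demographics = [
    {"SEQN": 1, "age": 30},
    {"SEQN": 2, "age": 65},
    {"SEQN": 3, "age": 41},
]
glucose = [{"SEQN": 1, "glucose": 95}, {"SEQN": 3, "glucose": 110}]
lipids = [{"SEQN": 3, "triglycerides": 150}, {"SEQN": 1, "triglycerides": 120}]

analytic_file = merge_on_id("SEQN", demographics, glucose, lipids)
for row in analytic_file:
    print(row)
```

An inner join is used so that a case missing from any one file drops out of the analytical file; a study that needs to keep such cases and handle the missing values explicitly would use a different join.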
The purpose of survey weights is to account for oversampling, survey nonresponse, and poststratification. Sample weights are assigned to each person based on the number of people they represent within the U.S. Census noninstitutionalized civilian population. Three different types of weights are provided in the NHANES, in 2- and 4-year increments (NCHS, 2004). The interview weights pertain to the participants who are interviewed, and the medical examination weights relate to participants who completed the interview and medical examination. Finally, fasting laboratory weights are applied to participants who completed the interview, medical examination, and fasting laboratory testing. The variables used in each research study determine which weight is appropriate for statistical analyses; the weight for the most restrictive subgroup is used.
In SPSS Complex Samples analysis, a complex sample plan file is created with NHANES 2007–2008 2-year fasting laboratory weight (WTSAF2YR) and design variables: strata (SDMVSTR) and cluster (SDMVPSU). In the metabolic syndrome study, the 2-year fasting laboratory weight (WTSAF2YR) was chosen over other available weights, interview weight (WTINT2YR) and physical examination weight (WTMEC2YR). The 2-year fasting laboratory weight (WTSAF2YR) is the most restrictive weight and included appropriate weighting for cases with interview, medical examination, and fasting laboratory data. In contrast, the examination weight includes cases that had physical examinations with or without fasting laboratory data.
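The rule of using the weight for the most restrictive subgroup can be written as a small lookup. The sketch below is illustrative: the weight variable names are those given in the text, whereas the component labels and the function name select_weight are hypothetical.

```python
# Ordered from most to least restrictive; weight names are those in the text.
WEIGHTS = [
    ("fasting_laboratory", "WTSAF2YR"),
    ("medical_examination", "WTMEC2YR"),
    ("interview", "WTINT2YR"),
]


def select_weight(components_used):
    """Return the weight variable for the most restrictive component used."""
    for component, weight_variable in WEIGHTS:
        if component in components_used:
            return weight_variable
    raise ValueError("no recognized survey component in the analysis")


# The metabolic syndrome study used all three components.
print(select_weight({"interview", "medical_examination", "fasting_laboratory"}))
```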
Three sets of data analysis were performed to illustrate differences in results while making different assumptions about the data. Unweighted analyses assumed that the sample of cases was a representative sample of a population in proportion to how they were sampled. Weighted analyses assumed that the sample of cases was corrected so that the proportions of cases in the sample were representative of the proportions of cases in the population and the number of cases was the number of cases in the population. Complex sample analyses assumed that the sample of cases was corrected so that the proportions of cases in the sample were representative of the proportions of cases in the population and that the number of cases was the number of cases in the sample rather than the number of cases in the population. Descriptive and inferential analyses were performed with each sampling assumption to illustrate different statistical outcomes for various types of analyses. Descriptive analyses were provided for frequencies, proportions, and means, and inferential statistics were used to demonstrate logistic regression and linear regression results based on the three sampling assumptions.
The descriptive and inferential results of the metabolic syndrome study, obtained with three different statistical approaches, demonstrate the differences between analyses using unweighted data, weighted data, and complex sample analysis. Weighting and complex sample analysis yield the same frequencies and proportions (Table 4). However, these frequencies and proportions differ considerably from those obtained with simple unweighted data analysis. Racial minorities account for 52% of the sample without weighting, because they were oversampled in the 2007–2008 survey. However, racial minorities accounted for only 30% using complex sample analysis, which is more representative of the U.S. population. The NHANES 2007–2008 oversampled the low-income population, which is evident in the unweighted data; 55.3% of the sample has less than a high school education. When using complex sample analysis, this proportion decreased by about 11 percentage points (to 44.2%), which represents the U.S. population.
When comparing the results of continuous variables, the mean for each variable changed when the variable was weighted as compared with the unweighted data (Table 5). However, the mean remained the same in weighting and complex sample analysis because the proportion of cases with each value remains constant. The standard error of the mean is almost nonexistent when using weighting because variability is calculated with the sample size of the entire population. The unbiased mean and standard error are obtained from the complex sample data analysis because the mean is estimated for the entire population based on calculations of the number of cases in the sample. The mean age from the unweighted analysis was 51.15 years (SE = 0.348 year), and the complex sample data analysis mean was 46.91 years (SE = 0.595 year). The unweighted mean age was slightly higher because participants over 60 years were oversampled in the NHANES 2007–2008 data.
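The near disappearance of the standard error under weighting alone follows directly from the arithmetic. The sketch below assumes that the software treats each weight as a count of cases, as described above, and uses the figure of roughly 50,000 U.S. residents per participant given earlier. The ages and the function name are hypothetical.

```python
from math import sqrt


def frequency_weight_se(values, weights):
    """Standard error of the mean when weights are treated as case counts."""
    n_apparent = sum(weights)  # the apparent number of cases
    mean = sum(w * y for y, w in zip(values, weights)) / n_apparent
    variance = sum(
        w * (y - mean) ** 2 for y, w in zip(values, weights)
    ) / (n_apparent - 1)
    return sqrt(variance / n_apparent)


ages = [34, 47, 52, 61, 70]   # hypothetical ages for five cases
per_case = 50_000             # residents represented by each case

se_sample = frequency_weight_se(ages, [1] * len(ages))
se_weighted = frequency_weight_se(ages, [per_case] * len(ages))

print(se_sample)
print(se_weighted)
print(se_sample / se_weighted)
```

The mean is identical in both calls because every case carries the same relative weight; only the apparent number of cases changes, and the standard error shrinks by roughly the square root of the weight.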
The bias that occurs in descriptive statistics is exacerbated in inferential statistical analyses. In the metabolic syndrome study, education and metabolic syndrome were dichotomous variables (Table 6). Logistic regression analysis was used to examine differences in odds of metabolic syndrome occurrence according to educational status. Because this analysis addresses frequencies that are not affected by dispersion, there are only small differences in the odds ratios from the three analytical approaches (Table 6). In the unweighted analysis, those who have less than a high school education have 62% greater odds of having metabolic syndrome compared with those who have continued their education beyond high school. When using complex sample analysis, the increase in odds rises to 74%. The odds ratio is the same for the weighted and complex sampling analysis, but the 95% confidence intervals are unrealistically narrow for the weighted analysis (1.733, 1.736) as compared with the complex sample analysis (1.264, 2.381). These narrow intervals would also result in deflation of significance levels, which will lead to results from inferential statistics indicating significant differences between groups when none exist.
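Why the odds ratio is unchanged while its confidence interval collapses can be seen with a simple two-by-two table. The sketch below is not the logistic regression reported in Table 6; it uses the Wald (Woolf) interval for a single odds ratio as a stand-in, with hypothetical counts and a hypothetical inflation factor of 50,000 cases represented per participant.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) 95% confidence interval for a 2x2 table.

    a, b: cases with and without the outcome in the exposed group
    c, d: cases with and without the outcome in the comparison group
    """
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


table = (60, 40, 45, 55)   # hypothetical sample counts
scale = 50_000             # hypothetical cases represented per participant

print(odds_ratio_ci(*table))
print(odds_ratio_ci(*(scale * cell for cell in table)))
```

Multiplying every cell by the same factor leaves the ratio of odds untouched but divides the standard error of the log odds ratio by the square root of that factor, which is the pattern seen in the weighted analysis.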
Linear regression analysis of the metabolic syndrome study revealed biased results when the complex sample design was not taken into account. In the complex sample data analysis, depressive symptoms did not predict (p = .151) the number of calories consumed per day (diet). In contrast, with unweighted and weighted data analysis, having depressive symptoms was a significant predictor of the number of calories consumed per day (p = .006 and p < .001, respectively). The unbiased and valid conclusion from the metabolic syndrome study is that depressive symptoms do not predict caloric intake (Table 7).
Use of national data sets allows use of extensive, expensive, well-documented survey data for exploratory questions but limits analysis to those variables included in the data set. Researchers conducting secondary data analysis must be aware of the methods used in the original study, and the methods must be appropriate for addressing the aims of the secondary analysis. Attention must be paid to the population, inclusion and exclusion criteria, tools, and sampling design (Ciol et al., 2006; Cunningham & Huguet, 2012; Henry, 1990; Huguet et al., 2012). A cross-sectional design limits the investigator’s ability to determine causality. Researchers are unable to control variable definition, measurement, data collection, and other crucial aspects of design (Aaronson & Kingry, 1988; Rue et al., 2008; Tabachnick & Fidell, 2001). For example, researchers are unable to exclude subjects with a history of psychosis because this is not collected in NHANES.
The large sample permits examination of multiple predictors and interactive relationships. Unique challenges of a study using national databases include merging data files, availability of data in several waves of surveys, and use of complex sampling design techniques to provide a representative sample. Oversampling of small or sensitive groups provides necessary estimates of variability within small groups. Without complex sampling analysis, relationships or problems of the oversampled group would appear stronger or more pervasive than they are at the population level (Cunningham & Huguet, 2012). Use of weights without complex samples accurately estimates population means and frequency from the sample after accounting for over- or undersampling of specific groups. Weighting alone leads to inappropriate population estimates of variability, because the estimates are computed as if the measures were from the entire population rather than the sample in the data set (Cunningham & Huguet, 2012). The SPSS Complex Samples procedure allows inclusion of all sampling design elements, stratification, clusters, and weights. Sophisticated complex sample data analysis methodology optimizes use of these data to produce unbiased population estimates and inference.
Aaronson L. S., Kingry M. J. (1988). A mixed-method approach for using cross-sectional data for longitudinal inferences. Nursing Research, 37, 187–189.
Cannon C. P. (2008). Mixed dyslipidemia, metabolic syndrome, diabetes mellitus, and cardiovascular disease: Clinical implications. The American Journal of Cardiology, 102, 5L–9L.
Church T. S., Thompson A. M., Katzmarzyk P. T., Sui X., Johannsen N., Earnest C. P., Blair S. N. (2009). Metabolic syndrome and diabetes, alone and in combination, as predictors of cardiovascular disease mortality among men. Diabetes Care, 32, 1289–1294.
Ciol M. A., Hoffman J. M., Dudgeon B. J., Shumway-Cook A., Yorkston K. M., Chan L. (2006). Understanding the use of weights in the analysis of data from multistage surveys. Archives of Physical Medicine and Rehabilitation, 87, 299–303.
Cunningham S. D., Huguet N. (2012). Weighting and complex samples design adjustments in longitudinal studies. In Newson J. T., Jones R. N., Hofer S. M. (Eds.), Longitudinal data analysis: A practical guide for researchers in aging, health, and social sciences (pp. 43–69). New York, NY: Routledge Taylor & Francis Group.
Feng D., Cong Z., Silverstein M. (2012). Missing data and attrition. In Newson J. T., Jones R. N., Hofer S. M. (Eds.), Longitudinal data analysis: A practical guide for researchers in aging, health, and social sciences (pp. 71–96). New York, NY: Routledge Taylor & Francis Group.
Henry G. T. (1990). Practical sampling. Newbury Park, CA: Sage.
Huguet N., Cunningham S. D., Newson J. T. (2012). Existing longitudinal data sets for the study of health and social aspects of aging. In Newson J. T., Jones R. N., Hofer S. M. (Eds.), Longitudinal data analysis: A practical guide for researchers in aging, health, and social sciences (pp. 1–42). New York, NY: Routledge Taylor & Francis Group.
Lohr S. L. (1999). Sampling: Design and analysis. Pacific Grove, CA: Brooks/Cole.
Lorenzo C., Williams K., Hunt K. J., Haffner S. M. (2007). The National Cholesterol Education Program—Adult Treatment Panel III, International Diabetes Federation, and World Health Organization definitions of the metabolic syndrome as predictors of incident cardiovascular disease and diabetes. Diabetes Care, 30, 8–13.
McNeill A. M., Rosamond W. D., Girman C. J., Heiss G., Golden S. H., Duncan B. B., Ballantyne C. (2004). Prevalence of coronary heart disease and carotid arterial thickening in patients with the metabolic syndrome (The ARIC study). The American Journal of Cardiology, 94, 1249–1254.
Mottillo S., Filion K. B., Genest J., Joseph L., Pilote L., Poirier P., Eisenberg M. J. (2010). The metabolic syndrome and cardiovascular risk: A systematic review and meta-analysis. Journal of the American College of Cardiology, 56, 1113–1132.
Rue T., Thompson H. J., Rivara F. P., Mackenzie E. J., Jurkovich G. J. (2008). Managing the common problem of missing data in trauma studies. Journal of Nursing Scholarship, 40, 373–378.
Tabachnick B. G., Fidell L. S. (2001). Using multivariate statistics (4th ed.). Needham Heights, MA: A Pearson Education Company.
Keywords: data analysis; healthcare surveys; methods
© 2012 Lippincott Williams & Wilkins, Inc.