Self-reported Health Status Differs for Amazon’s Mechanical Turk Respondents Compared With Nationally Representative Surveys

Mortensen, Karoline, PhD*; Alcalá, Manuel, G.†; French, Michael, T., PhD‡; Hu, Tianyan, PhD§

doi: 10.1097/MLR.0000000000000871
Brief Reports

Background: Amazon’s Mechanical Turk (MTurk) platform has become a data source for peer-reviewed academic research publications, with over 24,000 Google Scholar search results. Although well-developed and supportive in other disciplines, the literature in health and medicine comparing results from samples generated on MTurk to gold standard, nationally representative health and medical surveys is beginning to emerge.

Objective: To compare the demographic, socioeconomic, and self-reported health status variables in an MTurk sample to those from 2 prominent national probability surveys, including the Medical Expenditure Panel Survey (MEPS) and the Behavioral Risk Factor Surveillance System (BRFSS).

Research Design: We analyze weighted and unweighted tabulations of the MTurk, MEPS, and BRFSS. Wald tests identify statistical significance.

Measures: Demographic, socioeconomic, and health status variables in an adult MTurk sample collected in 2016 (n=1916), the 2015 MEPS household survey component (n=21,210), and the 2015 BRFSS (n=283,502).

Results: Our findings indicate statistically significant differences in the demographic, socioeconomic, and self-perceived health status tabulations in the MTurk sample relative to the unweighted and weighted MEPS and BRFSS. The MTurk sample is more likely to be female (65.8% in MTurk, 50.9% in MEPS, 50.2% in BRFSS), white (80.1% in MTurk, 76.9% in MEPS, and 73.9% in BRFSS), non-Hispanic (91.1%, 82.4%, and 81.4%, respectively), younger, and less likely to report excellent health status (6.8% in MTurk, 28.3% in MEPS, and 20.2% in BRFSS).

Conclusions: We find significant differences across variables that warrant hesitation in using MTurk data as a replacement for the gold standard datasets in health services research.

*Department of Health Sector Management and Policy, University of Miami School of Business Administration, Coral Gables, FL

†Department of Economics, University of Pennsylvania, Philadelphia, PA

‡Department of Health Sector Management and Policy, and Sociology, University of Miami, Coral Gables

§Department of Health Policy and Management, Florida International University, Miami, FL

The authors declare no conflict of interest.

Reprints: Karoline Mortensen, PhD, Department of Health Sector Management and Policy, University of Miami School of Business Administration, 417L Jenkins Building, Coral Gables, FL 33146. E-mail: kmortensen@bus.miami.edu.

Crowdsourced data collection initiatives have increasingly been used in academic research, as studies consistently conclude they can improve the quality, efficiency, and timeliness of a research project at a significantly lower cost.1 Amazon’s Mechanical Turk (MTurk) is the most analyzed nonprobability crowdsourced platform used by researchers.2 Internet-based surveys are now the predominant platform for survey data collection.3 Researchers have few choices beyond online panels to address academic issues when large scale primary data or diverse samples are necessary, and as random digit dial techniques become “nonapplicable and no longer useful.”3 There is rapid growth in the literature across a variety of disciplines that compares the generalizability of MTurk survey responses to other data collection methods. Systematic reviews conducted in political science, psychology, economics, and other fields conclude that the data obtained via MTurk surveys are at least as reliable as those obtained through traditional methods.4,5

A growing body of literature supports and validates the use of online crowdsourced samples for health services research, and researchers have begun to enroll and analyze nonprobability samples recruited online.5,6 With the goal of creating a taxonomy to characterize past uses of crowdsourcing, a systematic review of primary data collection efforts in peer-reviewed health and medical journals found that 4 of the 21 papers used crowdsourcing to administer surveys, and all of these papers used the MTurk platform.1 The studies in the systematic review conclude that MTurk data closely mirror those gathered from other sources. However, a study comparing chronic conditions among MTurk and Behavioral Risk Factor Surveillance System (BRFSS) respondents found significant differences in the samples, and the authors concluded that researchers should approach crowdsourced data with caution.6

Despite the increasing popularity of MTurk as a survey tool in the health services research literature, no study has established whether the self-perceived health characteristics of an MTurk sample are comparable to responses gathered through more conventional data collection methods. Moreover, few of the papers that use crowdsourced data report summary statistics on their demographic variables, so it is difficult to assess the quality of their data.1 Only one study has compared responses to demographic, socioeconomic, and a few health-related variables gathered from an MTurk survey to those of BRFSS.6 Ours is the first to compare an MTurk sample with 2 national surveys widely used in the health services research literature, with a focus on self-reported health status, current alcohol consumption, and current smoking status (these were not explored in the prior study). The present study seeks to contribute to the literature by comparing responses gathered via a recent survey launched on MTurk to data obtained from the Medical Expenditure Panel Survey (MEPS) and the BRFSS, 2 widely used gold standard surveys in health services research.

We focus on the self-reported health status measure given its high predictive and concurrent validity, as has previously been measured by its association with subsequent mortality and a variety of measures of health services utilization, morbidity, and disability.7 It is pervasively used in the health services literature as a summary measure of overall health status and predictor of mortality.8 It captures health-related quality of life in a simple, single measure that takes little time to collect and can be measured routinely.9 Despite concerns that self-reported health may not be reliable,7 it is the health-related quality of life surveillance measure with the highest retest reliability in the BRFSS.10

METHODS

Data

We compare summary statistics from 3 sources: a dataset collected via a survey instrument we designed and administered on MTurk in July 2016 (n=1916), adult respondents to the 2015 MEPS household survey component (n=21,210), and adult respondents to the 2015 BRFSS (n=283,502). For comparability across surveys, we limit the samples to adults in the United States aged 18–64 years. The design of each survey is different, which poses some challenges. MTurk is a convenience sample generated from the internet. MEPS is a nationally representative survey collected through an overlapping panel design using computer-assisted personal interviewing on a laptop computer.11 BRFSS is a cross-sectional telephone (landline and cellular) survey conducted monthly by state health departments.12 The distinct differences in sampling frames and mode of administration may influence the findings.

To partially address sampling design issues, we provide both weighted and unweighted estimates for the 2 national surveys. Survey weights are incorporated for the 2 national probability surveys to adjust for strata, primary sampling unit, and oversampling of certain populations. Because the nationally representative survey data are designed to be used with weights, the unweighted results are provided for reference only and are not described in the Results section. MTurk data are not weighted, as the purpose of the study is to compare the crowdsourced sample in a manner similar to how researchers would incorporate the data into their analyses.6 Similar to Yank and colleagues, we apply a weight equal to one for all respondents in the MTurk sample. All analyses are conducted in Stata 14.1. The University of St. Andrews Ethics Committee approved the survey instrument. The demographic, socioeconomic, and health variables are constructed similarly and follow a similar pattern to those in the MEPS and BRFSS, albeit in a much shorter survey. The MTurk survey instrument is available from the authors.
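The distinction between weighted and unweighted tabulations can be illustrated with a minimal Python sketch. This is not the Stata code used for the analyses; the function name `weighted_proportion` and the toy values are hypothetical and are chosen only to show how a survey weight changes a point estimate, and how a weight of one for every respondent reduces to the simple sample proportion.

```python
from typing import Sequence


def weighted_proportion(indicator: Sequence[int], weights: Sequence[float]) -> float:
    """Return the weighted share of respondents whose indicator equals 1."""
    if len(indicator) != len(weights):
        raise ValueError("indicator and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * x for x, w in zip(indicator, weights)) / total_weight


# Hypothetical toy data (NOT from MTurk, MEPS, or BRFSS): 1 = female, 0 = male.
female = [1, 0, 1, 1, 0]
survey_weights = [1200.0, 3400.0, 800.0, 1500.0, 2900.0]  # assumed person weights
unit_weights = [1.0] * len(female)  # weight of one, as applied to the MTurk sample

weighted = weighted_proportion(female, survey_weights)
unweighted = weighted_proportion(female, unit_weights)

print(f"weighted share female:   {weighted:.3f}")
print(f"unweighted share female: {unweighted:.3f}")
```

The sketch covers point estimates only. Standard errors for the national surveys also depend on strata and primary sampling units, which this example does not model.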

Analyzing these 3 datasets allows us to make a unique contribution to the literature by comparing common demographic measures (sex, age, race, ethnicity, marital status, education, and region of residence) as well as socioeconomic measures (eg, employment status and income). We also address a gap in the existing literature by examining health-related variables including self-reported health status, current alcohol use, and current tobacco consumption in MTurk and 2 widely used surveys.

RESULTS

Demographic and Socioeconomic Characteristics

The MTurk sample differs from the weighted data in the MEPS and BRFSS along several demographic dimensions: the MTurk sample is more likely to be female, younger, white, and non-Hispanic than the other samples after survey weights have been included in the analysis (Table 1). Specifically, females make up 65.8% of respondents in the MTurk sample compared with 50.9% in the MEPS and 50.2% in the BRFSS (P<0.001). The age profile of the MTurk sample skews younger than the respondents to the other surveys (P<0.001). Almost 21.0% of the MTurk sample is age 18–24 years, compared with 14.9% of MEPS and 16.1% of BRFSS respondents. At the other end of the distribution, 13.0% of the MTurk sample is age 50–64 years, relative to 32.5% of the MEPS and 32.5% of the BRFSS respondents.
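As a rough illustration of how a Wald test compares two proportions, the following Python sketch applies the test to the reported shares of female respondents in MTurk and MEPS. It is a simplification under stated assumptions: both samples are treated as simple random samples with the reported sample sizes, the weighted MEPS percentage is used as if it were a raw proportion, and the survey design (strata, primary sampling units) is ignored. The exact test specification used in Stata is not given in the text, and the function name `wald_two_proportions` is hypothetical.

```python
import math
from typing import Tuple


def wald_two_proportions(p1: float, n1: int, p2: float, n2: int) -> Tuple[float, float, float]:
    """Wald test of H0: p1 == p2 using unpooled variances.

    Returns the z statistic, the Wald chi-square (z squared, 1 df),
    and the two-sided p value from the standard normal distribution.
    """
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / standard_error
    wald_chi2 = z ** 2
    # Two-sided tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, wald_chi2, p_value


# Reported figures: 65.8% female in MTurk (n=1916), 50.9% in MEPS (n=21,210).
# Assumption: simple random sampling, no design effect.
z, wald_chi2, p_value = wald_two_proportions(0.658, 1916, 0.509, 21210)

print(f"z = {z:.2f}, Wald chi-square = {wald_chi2:.1f}, P < 0.001: {p_value < 0.001}")
```

Because the design effect is ignored, the standard error here is likely smaller than a design-based one, so the sketch should be read as showing the mechanics of the test rather than reproducing the reported P values.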

TABLE 1

The racial composition of the MTurk sample roughly reflects that of the US population, as measured in the weighted MEPS and BRFSS, but the ethnic diversity does not. At 80.1% white, MTurk respondents are 3.2 and 6.2 percentage points (P<0.001) more likely to be white than those of MEPS (76.9%) and BRFSS (73.9%). African-Americans (9.3%) are underrepresented in MTurk by fewer than 5 percentage points compared with the MEPS (12.9%) and BRFSS (13.7%) (P<0.001). The proportion of Asians in the MTurk sample is within a 1-point difference from both national surveys, which signals appropriate representation, but is statistically significantly different (P=0.03). The MTurk sample is almost 10 percentage points less likely to be Hispanic (8.9%) than respondents in MEPS (17.6%) and BRFSS (18.6%) (P<0.001).

Given the younger age structure in the MTurk sample, it is not surprising that these respondents are less likely to be married and have lower incomes than other samples, yet they have higher levels of education. Our results show that the MTurk sample is 12 and 9 percentage points less likely to be married (40.6%) than MEPS (52.1%) and BRFSS (49.7%) respondents (P<0.001). The MTurk sample is more educated, with 55.8% having a college degree or more, compared with 30.0% in the MEPS and 26.9% in the BRFSS (P<0.001). About 36% of MTurk respondents report having an annual income of <$10,000 compared with 6.5% in BRFSS. This disparity is less extreme when compared with the MEPS sample, with 28.9% of subjects earning <$10,000 a year. It is worth noting that even the nationally representative surveys have disparate statistics on annual income. Respondents in the MTurk sample are more likely than those in the other samples to be in the labor force, while no major differences are present in terms of geographic dispersion.

Self-perceived Health Status

The MTurk sample reports substantially lower self-perceived health status than the respondents in the other surveys (Fig. 1). While 28.3% and 20.2% of subjects in MEPS and BRFSS report having excellent health, only 6.8% of the MTurk sample rate their health status as excellent (P<0.001). The proportion of individuals who report having poor health was roughly similar across all 3 samples. Collapsing categorical self-reported health status into a binary variable lessens the disparity. Excellent/very good/good health status is perceived by 76.4% of the MTurk sample, still lower than the 88.4% in MEPS and 84.2% in BRFSS. These health status differences are somewhat surprising, given that MTurk subjects are younger and therefore expected to score higher on health indicators. As the MTurk sample has a greater proportion of females, and females report poorer health than men,13 we stratified by sex and reran the analysis (results available upon request). Differences by sex are similar to those for the full sample, likely due to the higher proportion of female respondents.
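The collapsing step described above can be sketched in Python as follows. The response labels, the function name `share_good_or_better`, and the toy response list are illustrative assumptions and do not come from the survey instruments. The second part of the sketch uses only the percentages reported in the text to show the gap, in percentage points, between MTurk and each national survey before and after collapsing.

```python
from typing import Dict, Sequence

GOOD_OR_BETTER = {"excellent", "very good", "good"}
ALL_LEVELS = GOOD_OR_BETTER | {"fair", "poor"}


def share_good_or_better(responses: Sequence[str]) -> float:
    """Collapse five-level self-reported health into a binary share."""
    if not responses:
        raise ValueError("responses must not be empty")
    unknown = set(responses) - ALL_LEVELS
    if unknown:
        raise ValueError(f"unexpected response levels: {sorted(unknown)}")
    return sum(r in GOOD_OR_BETTER for r in responses) / len(responses)


# Hypothetical toy responses (NOT survey data), used only to show the recoding.
toy_responses = ["excellent", "good", "fair", "very good", "poor", "good", "good", "fair"]
toy_share = share_good_or_better(toy_responses)

# Percentages reported in the text for each measure and sample.
reported: Dict[str, Dict[str, float]] = {
    "excellent": {"MTurk": 6.8, "MEPS": 28.3, "BRFSS": 20.2},
    "good_or_better": {"MTurk": 76.4, "MEPS": 88.4, "BRFSS": 84.2},
}

# Gap in percentage points: national survey minus MTurk.
gaps = {
    measure: {
        survey: round(values[survey] - values["MTurk"], 1)
        for survey in ("MEPS", "BRFSS")
    }
    for measure, values in reported.items()
}

print(f"toy share good or better: {toy_share:.3f}")
for measure, by_survey in gaps.items():
    print(measure, by_survey)
```

In percentage-point terms the gap is smaller for the binary measure than for the excellent category alone, although the MTurk sample remains lower than both national surveys on each measure.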

FIGURE 1

To investigate potential explanatory factors for the differences in self-reported health, we tabulated current alcohol consumption and smoking responses and found that MTurk participants have similar behaviors in these areas relative to nationally representative surveys (Fig. 2). Roughly similar percentages report being a current smoker in the MTurk (18.9%), MEPS (16.9%; P<0.001), and BRFSS (18.8%) samples. The MEPS does not ask about current alcohol consumption, but 55.0% of the BRFSS sample report being a current drinker compared with 62.9% in MTurk (P<0.001).

FIGURE 2

DISCUSSION

Our findings suggest that convenience samples collected via crowdsourcing platforms such as MTurk are not necessarily equivalent to nationally representative samples generated through more conventional approaches. We find some meaningful differences in demographic, socioeconomic, and health-related variables in the MTurk sample relative to the nationally representative samples from MEPS and BRFSS. The MTurk sample is more likely to be female, younger, white, non-Hispanic, single, higher educated, lower income, and in the workforce than respondents in the MEPS and BRFSS. This finding is consistent with the demographic and socioeconomic differences documented between MTurk and BRFSS.6

Although the demographic and socioeconomic comparisons to 2 gold standard datasets build upon the literature, the primary goal of our investigation was to examine responses to self-perceived health status across MTurk, MEPS, and BRFSS surveys, as this question has not been addressed in the literature. We find that the MTurk sample is much less likely to report excellent health status, or even excellent/very good/good health status, relative to respondents in the MEPS and BRFSS. The results from our study are consistent with findings comparing morbidity in an MTurk sample to the BRFSS.6

Despite the numerous advantages MTurk offers to a wide range of disciplines (eg, low cost, ease of administration, large sample sizes), its validity, reliability, and generalizability as a source for high-quality data still require scrutiny by health services researchers. This issue takes on added importance as the National Institutes of Health (NIH) Precision Medicine Initiative plans to use nonprobability-based sampling to recruit and monitor a longitudinal patient cohort.14 Our findings reveal some meaningful differences in demographic, socioeconomic, and health status–related characteristics relative to national surveys, suggesting the need for additional research comparing other health-related variables such as chronic diseases, health insurance, exercise, and risky behaviors. Consequently, the surge in use of data obtained from MTurk and other crowdsourcing platforms warrants caution and close examination in the field of health services research. Our results suggest that crowdsourced surveys are not adequate to replace nationally representative surveys if researchers are seeking generalizable findings. The challenge going forward is to determine when crowdsourced data are appropriate, and when to improve upon the design and administration of the surveys so their features match the research value of the platform.

REFERENCES

1. Ranard BL, Ha YP, Meisel ZF, et al. Crowdsourcing—harnessing the masses to advance health and medicine, a systematic review. J Gen Intern Med. 2014;29:187–203.
2. Chandler J, Shapiro DN. Conducting clinical research using crowdsourced convenience samples. Annu Rev Clin Psychol. 2016;12:53–81.
3. Smith SM, Roster CA, Golden LL, et al. A multi-group analysis of online survey respondent data quality: comparing a regular USA consumer panel to MTurk samples. J Bus Res. 2016;69:3139–3148.
4. Berinsky AJ, Huber GA, Lenz GS. Evaluating online labor markets for experimental research: Amazon.com’s Mechanical Turk. Polit Anal. 2012;20:351–368.
5. Bardos J, Friedenthal J, Spiegelman J, et al. Cloud based surveys to assess patient perceptions of health care: 1000 respondents in 3 days for US $300. JMIR Res Protoc. 2016;5:e166.
6. Yank V, Agarwal S, Loftus P, et al. Crowdsourced health data: comparability to a US national survey, 2013-2015. Am J Public Health. 2017;107:1283–1289.
7. Zajacova A, Dowd JB. Reliability of self-rated health in US Adults. Am J Epidemiol. 2011;174:977–983.
8. Idler EL, Benyamini Y. Self-rated health and mortality: a review of twenty-seven community studies. J Health Soc Behav. 1997;38:21–37.
9. Robert Wood Johnson Foundation. Poor or fair health. County Health Rankings & Roadmaps. 2017. Available at: www.countyhealthrankings.org/measure/poor-or-fair-health. Accessed November 10, 2017.
10. Andresen EM, Catlin TK, Wyrwich KW, et al. Retest reliability of surveillance questions on health related quality of life. J Epidemiol Community Health. 2003;57:339–343.
11. Agency for Healthcare Research and Quality. MEPS-HC panel design and collection process. Available at: https://meps.ahrq.gov/survey_comp/hc_data_collection.jsp. Accessed November 10, 2017.
12. Centers for Disease Control and Prevention. The BRFSS data user guide. 2013. Available at: www.cdc.gov/brfss/data_documentation/pdf/UserguideJune2013.pdf. Accessed November 10, 2017.
13. Caroli E, Weber-Baghdiguian L. Self-reported health and gender: the role of social norms. Soc Sci Med. 2016;153:220–229.
14. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372:793–795.
Keywords:

self-perceived health status; gold standard dataset; nonprobability samples; probability surveys; MTurk; crowdsourcing

Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.