Big Data and Population Health: Focusing on the Health Impacts of the Social, Physical, and Economic Environment

Hu, Howarda; Galea, Sandrob; Rosella, Laurac,d; Henry, Davidc,d,e

doi: 10.1097/EDE.0000000000000711
From the ISEE

We are at the dawn of a data deluge in health that carries extraordinary promise for improving the health of populations. However, current associated efforts, which generally center on the "precision medicine" agenda, may well fall short in terms of their overall impact. The main challenges, it is argued, are less technical than the following: (1) identifying the data that matter most; (2) ensuring that we make better use of existing data; and (3) extending our efforts from the individual to the population by exploiting new, complex, and sometimes unstructured, data sources. Advances in epidemiology have shown that policies, features of institutions, characteristics of communities, living and environmental conditions, and social relationships all contribute, together with individual behaviors and factors such as poverty and race, to the production of health. Examples are discussed, leading to recommendations that focus on core priorities for data linkage, including those relating to marginalized populations, better data on socioeconomic status, micro- and macro-environments, collaborating with researchers in the fields of education, environment, and social sciences to ensure the validity and accuracy of multilevel data, aligning research aims with policy decisions that must be made, and heightening efforts to protect privacy.

From the aOffice of the Dean, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; bOffice of the Dean, Boston University School of Public Health, Boston, MA; cDivision of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; dInstitute for Clinical Evaluative Sciences, Toronto, ON, Canada; and eInstitute for Health Policy Management & Evaluation and the Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.

A number of the ideas in this manuscript were originally conceived and discussed by HH in the John R. Goldsmith Lecture at the 2015 Annual Meeting of the International Society for Environmental Epidemiology, August 2015, São Paulo, Brazil.

The authors report no conflicts of interest.

This commentary was sponsored by the International Society of Environmental Epidemiology (ISEE).

The contents are the sole responsibility of the author(s) and do not necessarily reflect the official views of the ISEE.

Correspondence: Howard Hu, Office of the Dean, Dalla Lana School of Public Health, University of Toronto, 155 College Street, Toronto, ON, M5T 3M7, Canada. E-mail:

The growing abundance of data on the factors that produce health, and the capacity to link these to data from individuals, afford tremendous potential for improving population health. Although public health is built on a long history of creatively using health data going back to the pioneering work of Langmuir1 and Snow,2 the potential of today's rich data sources is not being fully realized. Instead, the impact of this revolution is being seen mainly in the burgeoning precision medicine agenda, globally, and through the Precision Medicine Initiative (PMI) in the United States.3 In that context, linkage of omics data to phenotypic information from health records is being exploited to create new research platforms, with the hope that this will transform clinical practice.4 Similar momentum has not yet been achieved for the population health sciences.

We propose that there are two challenges for the population health agenda. The first is to ensure that we make better use of existing data and, in some cases, enhance data linkage and the methods for doing so. The second is to extend our efforts from the individual to the population by exploiting new, complex, and sometimes unstructured, data sources.

We recognize that there are major technological challenges when dealing with the massive and rapid flows of data that come from both traditional data sources, such as large administrative databases, and new sources, such as genomics, land-use, neighborhood and climate data, and unstructured social media feeds. However, we argue that the more important challenge is conceptual and perhaps ideological. How do we identify the data that matter most to improve health for the whole population? In some cases, these are existing data and we need to determine how to enrich, link, and analyze these routinely collected data to maximize their value. We also need to think about new sources of big data that can improve our understanding of health, and how we better integrate these in our approaches to studying population health while avoiding the temptation to analyze everything that may present itself as a new opportunity. Distinct from personalized medicine, the “what matters most” question can be addressed through an understanding of the pressing health challenges of our time and an understanding of how factors across economic, social, behavioral, and biological domains may interact.

The past 2 decades have shown how factors at multiple “levels of influence” are associated with both individual and population health.5 Multilevel causal frameworks suggest that policies, features of institutions, characteristics of neighborhoods and communities, living conditions, and social relationships all contribute, together with individual behaviors and individual factors such as genotype, poverty, and race, to the production of health.6 For example, quality of the built environment has been shown to be associated with mental ill health and diabetes.7,8 Certain social network characteristics are associated with the risk of obesity.9 It is a logical extension of the goals of big data collection to introduce measures that can capture potential risks at multiple levels of influence. The feature that turns these into “big data” is the coverage and complexity of this information, which enables the population health approach, across geographies, time, and the life course.

Two recent papers illustrate the importance of linking health, ethnicity, and personal financial data. Chetty and colleagues10 linked personal taxation and social security data of the US population to study the independent effects of income and place on life expectancy. This entailed analyses of 1.4 billion person-years of observations (big data). Being poor in Detroit is more hazardous to health than being similarly disadvantaged in New York or San Francisco. Case and Deaton11 documented alarming rises in death rates in middle-aged non-Hispanic white males and females after the global financial crisis. Both studies show that physical and social factors interact strongly to mitigate or enhance the effects of poverty. Even partial mitigation of these effects could be life saving on a scale that would compare well with any impact of precision medicine interventions.

Consider how the value of such studies would be enhanced by more complex data that enable longitudinal analyses to determine the effects of migration (e.g., moving from Detroit to New York), the role of behavioral factors, genomics and epigenomic modifications, and access to healthcare, etc. This would enable the study of gene/environment interactions that determine health, the biologic changes induced by stresses associated with poverty and marginalization, the role of health systems, and mediating factors that could potentially afford new opportunities for disease prevention. There is a growing number of centers where such linkages and studies will be possible. One notable example is the Big Data Institute at the University of Oxford, where data from the UK Biobank can be linked to administrative data and electronic health records.

These types of data and research can also inform health services planning and delivery. Just as genomic and phenotypic data can predict an individual’s response to treatment, linked socioeconomic, environmental, and health services data can be used to predict population risk for certain diseases. Examples include a population diabetes risk prediction tool for policy makers, and evidence that the development of chronic disease in low socioeconomic circumstances identifies individuals who will become very high cost users of the healthcare system.12,13

Linkage with data on education, crime, and occupations can also help elucidate the wider impacts of health status. Such linkage and longitudinal follow-up enabled researchers in Manitoba to document, among teen mothers, the full negative impact of their health status on the educational attainment of their offspring.14 In British Columbia, researchers used routinely collected data from the Early Development Instrument (EDI) to measure the negative impact of early childhood vulnerability (a composite of physical health, social competence, emotional maturity, language and cognitive development, communication skills, and general knowledge in the majority language and culture) on school achievement, standardized test scores, and criminality.15

With regard to environmental health, the availability of repeated, increasingly detailed, and geospatially mapped measures of air pollution has allowed studies linking living near major roads and/or exposure to fine particulate matter and other air toxics with cardiovascular disease, diabetes, autism, and very recently, neurodegenerative diseases.16,17 Linkage with social data, which currently occurs rarely, would help researchers determine, e.g., the environmental contribution to the health risks of living in high-poverty areas.

Finally, we highlight the current disconnect between big data for health agendas and the health of marginalized populations. Undocumented immigrants, migrant workers, the homeless, and indigenous populations have some of the worst health outcomes in societies.18 This is in contrast to documented immigrants, who typically have better health than resident populations.19 Marginalized groups should benefit from big data efforts that identify new targets for prevention and patterns of health care utilization and inform studies of the epigenomic and other biologic impacts of poverty and social exclusion. These populations are rarely included in large cohort studies; however, advances in electronic medical records, data linkage and data security, and socialization make it possible to track and document the health experience of individuals who may be migratory or homeless while protecting their privacy. As an example with respect to indigenous populations, the administratively linked data on over 13.5 million Ontarians held at the Institute for Clinical Evaluative Sciences, which are enabling an accelerated “Big Data for Population Health” initiative, now include data on over 200,000 First Nations individuals. This was made possible by especially close collaboration with, and the approval of, indigenous communities and their associated governance bodies,20 a strategy and principle that must extend to research on other disadvantaged groups.

At a time when there is so much interest and activity in big data analytics and precision medicine, allowing such efforts to diverge from the interests of population health is unwise. Rather, we have an opportunity to use the momentum that has been created by this movement to do two things: first, to enhance our uses of large population datasets to gain a better understanding of the determinants of health and health inequities to support policies that will optimize health, and second, to push for more complex data that more closely reflect a person’s context from the social, demographic, economic, and biologic perspectives over time. In so doing, we need to be strategic about how we prioritize data efforts and achieve consensus on where to start to build the momentum to a level that is at least as high as the precision medicine realm. Our challenge is not so much employing all types of data and methods, as avoiding the temptation to do everything.

We identify five priorities:

  • (1) Data on socioeconomic status must be improved and linked to health services and health outcomes data. It is possible to securely link income tax data if revenue agencies can be assured that client privacy will not be breached. In keeping with the big data agenda, as much as possible these data should be longitudinal and should reflect the dynamic and contextual nature of socioeconomic status and environmental factors.
  • (2) We must strive to include data that identify the most vulnerable members of the population, including indigenous peoples, migrants, refugees, and the homeless. Governance of the use of such data must involve those affected.
  • (3) We should accelerate the shift from ecological studies by securing linkage of data at the level of the individual, which enables analyses at different levels, including micro- and macro-environments. Examples include occupational and community-level exposures to pollution, neighborhood walkability, “food desert” measures, occupational data, and social isolation, among others.
  • (4) Collaborations are needed with the following: (a) researchers in the fields of education, environment, and social sciences to ensure the validity and accuracy of multilevel data and (b) decision-makers in the public health and healthcare sectors to ensure that research questions and findings are those likely to have maximal impact on health.
  • (5) All reasonable efforts must be made to protect privacy. Analyses should use de-identified data in facilities that provide the necessary combination of policies, staff training, and physical and electronic security.
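
To make priorities (1), (3), and (5) concrete, the following is a minimal sketch of individual-level linkage on de-identified data. It assumes, purely for illustration, that a trusted data custodian replaces direct identifiers with a keyed hash (HMAC-SHA256) before income and health records are joined, and that analysts then work only with the pseudonymized output. The field names (`person_id`, `income_quintile`, `neighborhood`, `has_diabetes`), the key, and the toy records are all hypothetical; this is not the procedure used by any institution named in this commentary, and real linkage also depends on governance, secure facilities, and handling of imperfect identifiers.

```python
import hashlib
import hmac
from collections import defaultdict


def pseudonymize(person_id: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, person_id.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(income_rows, health_rows, secret_key):
    """Join income and health records on the pseudonym; the raw identifier is not kept."""
    income_by_pid = {pseudonymize(r["person_id"], secret_key): r for r in income_rows}
    linked = []
    for h in health_rows:
        pid = pseudonymize(h["person_id"], secret_key)
        inc = income_by_pid.get(pid)
        if inc is None:
            continue  # no matching income record, so the row cannot be linked
        linked.append({
            "pid": pid,
            "neighborhood": inc["neighborhood"],
            "income_quintile": inc["income_quintile"],
            "has_diabetes": h["has_diabetes"],
        })
    return linked


def prevalence_by(linked, key):
    """Aggregate individual-level linked rows to any level (e.g., quintile or neighborhood)."""
    counts = defaultdict(lambda: [0, 0])  # level -> [cases, total]
    for row in linked:
        counts[row[key]][0] += int(row["has_diabetes"])
        counts[row[key]][1] += 1
    return {level: cases / total for level, (cases, total) in counts.items()}


# Hypothetical toy data; the key would be held only by the data custodian.
KEY = b"hypothetical-secret-held-by-custodian"
income = [
    {"person_id": "A", "income_quintile": "Q1", "neighborhood": "N1"},
    {"person_id": "B", "income_quintile": "Q1", "neighborhood": "N1"},
    {"person_id": "C", "income_quintile": "Q5", "neighborhood": "N2"},
    {"person_id": "D", "income_quintile": "Q5", "neighborhood": "N2"},
]
health = [
    {"person_id": "A", "has_diabetes": True},
    {"person_id": "B", "has_diabetes": False},
    {"person_id": "C", "has_diabetes": False},
    {"person_id": "E", "has_diabetes": True},
]

linked = link_records(income, health, KEY)
by_quintile = prevalence_by(linked, "income_quintile")
by_neighborhood = prevalence_by(linked, "neighborhood")
```

Because the join happens at the level of the individual, the same linked rows can be aggregated by income quintile, by neighborhood, or by any other contextual variable, which is what distinguishes this approach from a purely ecological analysis.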

It is only by addressing these priorities that we can assure that big data initiatives that pertain to health are not limited to the relatively narrow allopathic goals of the precision medicine agenda. By pursuing, instead, a more expansive Big Data for Population Health vision, initiatives so aligned carry the promise of generating insights on the drivers of health that can lead to interventions and policies that promote health on a population scale: the very core mission of public health.

1. Langmuir AD. William Farr: founder of modern concepts of surveillance. Int J Epidemiol. 1976;5:13–18.
2. Snow J. Cholera and the water supply in the south districts of London in 1854. J Public Health Sanitary Rev. 1856;2:239–257.
3. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372:793–795.
4. Weber GM, Mandl KD, Kohane IS. Finding the missing link for big biomedical data. JAMA. 2014;311:2479–2480.
5. Roos LL, Magoon J, Gupta S, Chateau D, Veugelers PJ. Socioeconomic determinants of mortality in two Canadian provinces: multilevel modelling and neighborhood context. Soc Sci Med. 2004;59:1435–1447.
6. Siddiqi A, Nguyen QC. A cross-national comparative perspective on racial inequities in health: the USA versus Canada. J Epidemiol Community Health. 2010;64:29–35.
7. Creatore MI, Glazier RH, Moineddin R, et al. Association of neighborhood walkability with change in overweight, obesity, and diabetes. JAMA. 2016;315:2211–2220.
8. Latkin CA, Curry AD. Stressful neighborhoods and depression: a prospective study of the impact of neighborhood disorder. J Health Soc Behav. 2003;44:34–44.
9. Christakis NA, Fowler JH. The spread of obesity in a large social network over 32 years. N Engl J Med. 2007;357:370–379.
10. Chetty R, Stepner M, Abraham S, et al. The association between income and life expectancy in the United States, 2001-2014. JAMA. 2016;315:1750–1766.
11. Case A, Deaton A. Rising morbidity and mortality in midlife among white non-Hispanic Americans in the 21st century. Proc Natl Acad Sci U S A. 2015;112:15078–15083.
12. Rosella LC, Manuel DG, Burchill C, Stukel TA; PHIAT-DM team. A population-based risk algorithm for the development of diabetes: development and validation of the Diabetes Population Risk Tool (DPoRT). J Epidemiol Community Health. 2011;65:613–620.
13. Fitzpatrick T, Rosella LC, Calzavara A, et al. Looking beyond income and education: socioeconomic status gradients among future high-cost users of health care. Am J Prev Med. 2015;49:161–171.
14. Jutte DP, Roos NP, Brownell MD, Briggs G, MacWilliam L, Roos LL. The ripples of adolescent motherhood: social, educational, and medical outcomes for children of teen and prior teen mothers. Acad Pediatr. 2010;10:293–301.
15. Kershaw P, Warburton B, Anderson L, Hertzman C, Irwin LG, Forer B. The economic costs of early vulnerability in Canada. Can J Public Health. 2010;101(suppl 3):S8–S12.
16. Feng S, Gao D, Liao F, Zhou F, Wang X. The health effects of ambient PM2.5 and potential mechanisms. Ecotoxicol Environ Saf. 2016;128:67–74.
17. Chen H, Kwong JC, Copes R, et al. Living near major roads and the incidence of dementia, Parkinson’s disease, and multiple sclerosis: a population-based cohort study. Lancet. 2017;389:718–726.
18. Hwang SW. Homelessness and health. CMAJ. 2001;164:229–233.
19. Kennedy S, Kidd MP, McDonald JT, Biddle N. The healthy immigrant effect: patterns and evidence from four countries. J Int Migr Integr. 2015;16:317–332.
20. Walker J, Jones C. First Nations Data Sovereignty in action: Creation of the largest First Nations health research study cohort in Canada. Director’s Seminar Series, Australian National University. 3 November 2016. Available at: Accessed 6 February 2017.
Copyright © 2017 Wolters Kluwer Health, Inc. All rights reserved.