Role of big data in reducing the bias of a cohort study

Lu, Zuhong; Zhu, Lei; Jia, Zhongwei

doi: 10.3760/cma.j.issn.0366-6999.20141683

Department of Biomedical Engineering, Peking University, Beijing 100871, China (Lu ZH); National Institute of Drug Dependence, Peking University, Beijing 100191, China (Zhu L and Jia ZW)

Correspondence to: Dr. Jia Zhongwei, National Institute of Drug Dependence, Peking University, Beijing 100191, China (Email:

This work was supported by grants from National Natural Science Foundation (No. 81372958) and 973 Program (No. 2012CB316501).

(Received July 17, 2014)

Edited by Cui Yi

Validity in research refers to the closeness of the observed results to the truth, or the degree to which an inference drawn from a study is warranted.1 Ensuring validity is a challenge throughout the whole course of a study. What keeps research from reaching validity? Bias. Generally speaking, the more complicated the study, the larger the bias and the lower the validity. The cohort study, especially the prospective cohort study, is a kind of epidemiological study that can provide strong evidence for medical practice and policy-making. However, such a study usually needs long-term observation and a large investment of human and material resources, and a rigorous, systematic design is required to avoid large biases at every stage of the long follow-up period.2 In a cohort study, bias can generally be classified into three categories: selection bias, information bias and confounding bias, all of which are attributed to imperfect information on a specific issue.1 “Big data” is bringing opportunities to obtain information at various scales at low cost, which may improve or even change traditional epidemiological studies, including how bias is handled in a cohort study. “Big data” refers to large and complex datasets gathered from different sources, for example climate data, remote sensing data, geographic data and social network data, which differ from purely clinical data but are undoubtedly related to people's health.

Reduction of selection bias

Selection bias occurs when the characteristics of people who are selected for a study differ systematically from those of people who are not. “Big data” can give an overview of the population before sampling, including guidance on how to stratify the study population. For instance, software embedded in large internet social media platforms can extract and analyze large amounts of data efficiently, which provides additional information for traditional epidemiological surveys. In particular, for studies of sensitive populations (e.g. men who have sex with men (MSM) or patients with sexually transmitted diseases (STD)), it can be impossible to obtain a representative sample, but the additional information helps researchers interpret their work if they know where the bias lies and how to account for it. The increasing adoption of electronic medical records (EMR) in healthcare can also provide much more information about the backgrounds and details of individuals, such as real-time mood, behavioral characteristics and disease history. Such private information is always important for estimating people's health status, but might not be suitable for direct inclusion in a study. With these public information technologies and available data, investigators can gain insight into population characteristics before conducting a substantial study, which helps reduce selection bias at the sampling step.
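As an illustration of how a population overview might inform sampling, the following minimal Python sketch allocates a sample across strata in proportion to their population size and then compares a recruited sample with the population. The age bands, counts and function names are hypothetical and chosen only for illustration; proportional allocation is one common option, not a method prescribed in this article.

```python
def proportional_allocation(stratum_counts, sample_size):
    """Split sample_size across strata in proportion to population counts.

    Uses the largest-remainder rule so the allocations sum to sample_size.
    """
    total = sum(stratum_counts.values())
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    exact = {k: sample_size * n / total for k, n in stratum_counts.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc


def representation_gap(sample_counts, stratum_counts):
    """Sample share minus population share; positive means over-represented."""
    n_sample = sum(sample_counts.values())
    n_pop = sum(stratum_counts.values())
    return {
        k: sample_counts.get(k, 0) / n_sample - stratum_counts[k] / n_pop
        for k in stratum_counts
    }


# Hypothetical population overview (e.g. derived from an external data source).
population = {"18-29": 42000, "30-44": 35000, "45-59": 18000, "60+": 5000}
allocation = proportional_allocation(population, 500)

# Hypothetical recruited sample, skewed towards younger participants.
recruited = {"18-29": 300, "30-44": 150, "45-59": 40, "60+": 10}
gaps = representation_gap(recruited, population)

if __name__ == "__main__":
    print(allocation)
    for band, gap in gaps.items():
        print(f"{band}: {gap:+.2f}")
```

Even when a representative sample cannot be obtained, a comparison such as `representation_gap` shows in which strata the sample departs from the population, which is the information a researcher needs to discuss the direction of the bias.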

“Big data” always comes with its supporting technologies. For example, telemedicine and wearable medical devices make it possible to collect continuous data wirelessly from a population of interest. Internet tools such as Facebook, Twitter and QQ can provide a relaxed platform for online communication between researcher and subject. Through such a platform, the researcher can interview subjects and deliver interventions by message, note or voice as preferred, and subjects can pick a suitable time to answer questions, so they need not drop out of the cohort because of the constraints of time and place in a traditional interview. For sensitive topics, studies might recruit more representative subjects in this way because an internet-based platform has no geographic bias. In fact, subjects are probably more willing to take part in a study that concerns them when they need not worry about being identified later because of a face-to-face interview. In other words, information technologies integrated with biomedical technologies strengthen people's capacity to produce and collect “big” medical data in different ways, which makes communication more comfortable and helps avoid bias to some extent, especially on sensitive issues. These new medical practices not only retain the merits of the traditional cohort study, but are also changing the ideas and style of medical research.

Reduction of information bias

Information bias, also known as observation bias or measurement bias, results from systematic differences in data collection, measurement or classification.3 One major reason the prospective cohort study can provide relatively strong evidence is that it repeatedly measures the observed factors during follow-up and can detect causal associations between outcomes and exposures. However, the observed values are discrete and cannot cover the whole observation period, so some information may be missed while other information is magnified. For example, a few years ago the only way to monitor blood glucose was by finger stick, which provides intermittent assays of the glucose level. Because the test is done in the daytime, a very high or very low blood glucose level during the night would probably go unnoticed. Even during the day, glucose levels can vary considerably in response to exercise, food intake, fluid status, and medications, such as insulin, that are taken to manage the disease.4 Biomedical technologies make it possible to collect continuous observations, which avoids omissions and improves the quality of a study. For example, with the assistance of wearable medical devices, individual-based, non-interventional and dynamic measurement of blood glucose is now available. The dynamic, continuous profile of blood glucose provides more values and more accurate information, so measurement bias is substantially reduced in this situation. Recall bias is another major issue that arises from subjects during follow-up, because subjects might forget the exact details or be unwilling to tell the truth. In the former case, the researcher can trace a subject's health history through records stored in medical information systems, or through background data in population systems, which include geographic, behavioral and cultural information. In the latter case, especially for sensitive issues, discreet follow-up assisted by technology, such as internet interviews, might be a good choice, supplemented by the subject's background information.
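The difference between intermittent and continuous measurement can be sketched with a small simulation. The Python example below uses a synthetic 24-hour glucose curve with a night-time dip and three post-meal peaks, and compares four daytime finger-stick readings with readings taken every five minutes. The curve, the sampling times and the 3.9 mmol/L low-glucose threshold are assumptions made for illustration only; they are not data from this article or from any device.

```python
import math

LOW_GLUCOSE_MMOL_L = 3.9  # assumed threshold for a low reading


def glucose_mmol_per_l(minute):
    """Synthetic 24 h glucose curve (illustrative, not physiological data)."""
    hour = minute / 60.0
    baseline = 6.0
    # Night-time dip centred at 03:00.
    night_dip = -2.8 * math.exp(-((hour - 3.0) ** 2) / (2 * 0.8 ** 2))
    # Post-meal peaks centred at 08:30, 13:30 and 19:30.
    meals = sum(
        3.5 * math.exp(-((hour - t) ** 2) / (2 * 0.7 ** 2))
        for t in (8.5, 13.5, 19.5)
    )
    return baseline + night_dip + meals


def summarize(values):
    """Minimum, maximum and mean of a series of readings."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


# Continuous monitoring: one reading every 5 minutes.
continuous = [glucose_mmol_per_l(m) for m in range(0, 24 * 60, 5)]

# Intermittent finger sticks at 07:00, 12:00, 18:00 and 22:00 (hypothetical).
finger_stick_minutes = [7 * 60, 12 * 60, 18 * 60, 22 * 60]
finger_sticks = [glucose_mmol_per_l(m) for m in finger_stick_minutes]

cgm = summarize(continuous)
spot = summarize(finger_sticks)
low_seen_continuous = cgm["min"] < LOW_GLUCOSE_MMOL_L
low_seen_finger_stick = spot["min"] < LOW_GLUCOSE_MMOL_L

if __name__ == "__main__":
    print("continuous  :", {k: round(v, 2) for k, v in cgm.items()})
    print("finger stick:", {k: round(v, 2) for k, v in spot.items()})
    print("low glucose detected:", low_seen_continuous, low_seen_finger_stick)
```

In this synthetic setting the four daytime readings all fall in a narrow band, so both the night-time low and the post-meal peaks go unrecorded, whereas the five-minute series captures them.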

Reduction of confounding bias

Confounding occurs when the effects of two associated exposures have not been separated, resulting in the interpretation that a result is due to one variable rather than the other,3 so that the estimated association differs from the true effect. A potential confounding variable must meet three conditions: (1) the variable must be independently associated with the outcome (i.e. be a risk factor); (2) the variable must be associated with the exposure under study in the source population; (3) it should not lie on the causal pathway between exposure and disease.3 Obviously, failure to recognize important confounders can change the outcome of a study, and a strong confounder can even reverse the result. The best way to avoid confounding bias in a prospective study is randomization, as in a randomized controlled trial (RCT), in which participants are randomly assigned to the treatment and control groups. The difficulty is that an RCT is impossible or unethical in many studies. Moreover, randomized or not, a prospective cohort study is inevitably affected by loss to follow-up. If those who drop out of the study differ from those who stay, confounding arises. As mentioned, big data with its supporting technologies can help prevent drop-out or reduce its effect in a cohort, especially in a study on a sensitive issue. In a retrospective cohort study, including every variable possibly related to the outcomes or exposures sounds like a scientific and reasonable design. However, because too many variables make the analysis complex and difficult, the traditional design generally covers only several critical factors chosen on the basis of professional experience, which may miss potential confounders when the question is new. As noted above, the increasing adoption of EMR in healthcare can provide much more information about an individual (which absolutely must be secured under a specific mechanism). This large body of shared information is much more helpful than professional experience alone for finding the real factors related to the outcome. Big data can also help find latent confounders that were omitted from previous data collections. In addition, the technologies inspired by big data hold promise for mining and analyzing complicated data to discover hidden associations and reduce bias to an acceptable level.
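A small worked example may help show what an unrecognized confounder does to an estimate. The Python sketch below compares the crude risk ratio with a Mantel-Haenszel risk ratio stratified by a third variable. The counts are hypothetical and were constructed so that the third variable (here labelled as an age group) is associated with both the exposure and the outcome; the Mantel-Haenszel estimator is a standard adjustment technique swapped in for illustration, not a method described in this article.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """Cohort counts within one level of a potential confounder."""
    name: str
    exposed_cases: int
    exposed_total: int
    unexposed_cases: int
    unexposed_total: int


def crude_risk_ratio(strata):
    """Risk ratio after pooling all strata, i.e. ignoring the confounder."""
    a = sum(s.exposed_cases for s in strata)
    n1 = sum(s.exposed_total for s in strata)
    c = sum(s.unexposed_cases for s in strata)
    n0 = sum(s.unexposed_total for s in strata)
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Mantel-Haenszel summary risk ratio across strata."""
    numerator = 0.0
    denominator = 0.0
    for s in strata:
        n = s.exposed_total + s.unexposed_total
        numerator += s.exposed_cases * s.unexposed_total / n
        denominator += s.unexposed_cases * s.exposed_total / n
    return numerator / denominator


# Hypothetical cohort: the older group is more often exposed (400 of 500
# versus 100 of 500) and has a higher risk (0.20 versus 0.05) whatever
# its exposure status.
cohort = [
    Stratum("younger", exposed_cases=5, exposed_total=100,
            unexposed_cases=20, unexposed_total=400),
    Stratum("older", exposed_cases=80, exposed_total=400,
            unexposed_cases=20, unexposed_total=100),
]

crude_rr = crude_risk_ratio(cohort)
adjusted_rr = mantel_haenszel_risk_ratio(cohort)

if __name__ == "__main__":
    print(f"crude RR    = {crude_rr:.3f}")
    print(f"adjusted RR = {adjusted_rr:.3f}")
```

In these constructed counts the exposure has no effect within either stratum, yet the crude estimate suggests that the exposed group has roughly twice the risk. The apparent effect disappears once the analysis is stratified, which is only possible if the confounder was recorded in the first place.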

Challenge of “big data” in a cohort study

Big data in healthcare can improve the validity of a cohort study by reducing bias at each stage, but no study is perfect, and big data faces several challenges in application. First, big data is not collected for a specific study, and so is not always well suited to a cohort study. For instance, an EMR is a system built for clinical and billing purposes rather than for cohort research, so the information needed in a specific cohort study may not be clearly recorded, and it is left to the researchers to dig out what is useful. In fact, recognizing the value of big data requires researchers to have integrated knowledge of health and to gain insight into health issues from a mountain of information. Second, there are no mechanisms for information sharing or management. An individual's medical records are fragmented across different healthcare settings, which makes it difficult to assemble a complete record. Public datasets supported by public funds are also scattered across different departments, and there are no regulations on how to obtain or use them. Generally, researchers have to negotiate for these data through private channels; success depends on luck, and requests are usually declined by the dataset owners. Data sharing is not only necessary for human health, but is also a way of making full use of limited health resources by deeply mining existing medical data, especially the huge datasets that already exist. For example, many genome and proteome projects have produced huge amounts of data, but few of these data have been combined, analyzed and translated into clinical practice. Finding ways to share and combine these data to mine hidden information might be more urgent than producing new data. Third, technologies and ethics for privacy protection are urgently needed. “Big data” in a cohort study, built on a large amount of information collected on a person at different scales over a period of follow-up, is a double-edged sword. While it can be used to improve the quality of a cohort study, it also makes it possible, and easy, to leak an individual's personal information, which might cause trouble for the people involved. The consequences are at least as serious as, and possibly much more serious than, the damage caused by hacker attacks. In the summer of 2012, senior writer Mat Honan reported that hackers had destroyed his entire digital life in the span of an hour, compromising his Apple, Twitter and Gmail accounts in order to gain access to his Twitter handle.5 In a specific research project, the collected data are systematic, clustered and causally structured by design, so the damage would be hard to imagine if privacy were breached in a cohort study supported by “big data”.

In summary, “big data” is very promising for reducing bias at each stage of a cohort study and is likely to have a profound impact on future epidemiological studies. However, before it is put to use in practice, priority should be given to a series of laws and regulations on how to share data and how to protect privacy.

1. Zhang Siyan. Epidemiology. Beijing: People's Medical Publishing House, 2012; 144.
2. Beaglehole R, Bonita R, Kjellström T. Basic epidemiology. World Health Organization; 2006: 46-48.
3. Maria Kirwan. Epidemiology for Practitioners. Public Health Action Support Team. (Accessed at
4. Topol E. The creative destruction of medicine: how the digital revolution will create better health care. New York: Basic Books; 2011.
5. Honan M. Kill the password: why a string of characters can't protect us anymore. (Accessed at

Keywords: big data; bias; prospective cohort

© 2014 Chinese Medical Association