Informative Presence in Electronic Health Record Data: A Challenge in Implementing Study Exclusion Criteria : Epidemiology

Secondary Data Sources

Chubak, Jessica; Dalmat, Ronit R.; Weiss, Noel S.; Doria-Rose, V. Paul; Corley, Douglas A.; Kamineni, Aruna

Epidemiology 34(1):p 29-32, January 2023. | DOI: 10.1097/EDE.0000000000001542

Informative presence occurs in electronic health record (EHR) data when some patients have “more opportunities to receive diagnoses (correctly or erroneously)” than others based on their level of interaction with the healthcare system.1 McGee et al. provided the example of autism spectrum disorder (ASD) and clinical diagnoses of comorbidities.1 Patients with ASD have more interactions with the healthcare system than patients without ASD and therefore have a greater opportunity for comorbidities to be diagnosed and documented.1 This can lead to bias in analyses of associations between ASD and comorbidities. Adjusting for the number of healthcare visits may seem to be a straightforward solution to this problem; however, because the number of visits may be a consequence of both diagnosed disease and diagnosed comorbidities (i.e., a collider2 on the path between these two variables), controlling for the number of healthcare visits can introduce bias. In this commentary, we describe a common related challenge posed by informative presence: identification of the study population of interest. Specifically, we discuss this challenge in the context of conducting EHR-based analyses that require exclusion of potential participants with a particular condition.
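The collider argument above can be made concrete with a small exact calculation. The sketch below is illustrative only: the variable names and all probabilities are hypothetical, and it assumes two diagnoses that are independent in the full population, each of which raises the chance of frequent healthcare visits. It computes the odds ratio between the two diagnoses overall and within strata of visit frequency.

```python
from itertools import product

# Hypothetical inputs, chosen only for illustration (not taken from the article).
P_EXPOSURE = 0.3     # P(A = 1), e.g., a diagnosed condition
P_COMORBIDITY = 0.2  # P(C = 1), independent of A by construction


def p_many_visits(a: int, c: int) -> float:
    """P(V = 1 | A = a, C = c): either diagnosis makes frequent visits more likely."""
    return 0.1 + 0.4 * a + 0.4 * c


def joint(visit_stratum=None) -> dict:
    """Probability mass of each (a, c) cell, optionally within one visit stratum."""
    cells = {}
    for a, c in product((0, 1), repeat=2):
        mass = (P_EXPOSURE if a else 1 - P_EXPOSURE) * (
            P_COMORBIDITY if c else 1 - P_COMORBIDITY
        )
        if visit_stratum == 1:
            mass *= p_many_visits(a, c)
        elif visit_stratum == 0:
            mass *= 1 - p_many_visits(a, c)
        cells[a, c] = mass
    return cells


def odds_ratio(cells: dict) -> float:
    """Odds ratio between A and C from a dict of (a, c) -> probability mass."""
    return (cells[1, 1] * cells[0, 0]) / (cells[1, 0] * cells[0, 1])


or_crude = odds_ratio(joint())
or_frequent = odds_ratio(joint(visit_stratum=1))
or_infrequent = odds_ratio(joint(visit_stratum=0))

print(f"crude OR:                    {or_crude:.2f}")
print(f"OR among frequent visitors:  {or_frequent:.2f}")
print(f"OR among infrequent visitors: {or_infrequent:.2f}")
```

Under these assumed inputs the crude odds ratio is 1, whereas both stratum-specific odds ratios are about 0.36: stratifying on visit frequency creates an association that does not exist in the full population. The size and direction of the induced association depend entirely on the assumed visit probabilities.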

In an ideal world, one could assume that a patient whose comprehensive EHR contains no diagnosis code for the condition of interest does not have that condition. In reality, this is unlikely to be true for several reasons. Patients without healthcare encounters do not have opportunities for conditions to be recorded in their EHR. Among patients with healthcare encounters, the condition of interest might not be recorded in the chart if the provider does not diagnose the condition or fails to record it. The likelihood that a condition that is truly present will not be diagnosed and recorded in the EHR probably depends on the number and types of clinical encounters a patient has. To identify a cohort without a particular condition, a researcher might restrict their study to patients with a prior visit during a specified period in which the condition, if present, would have been diagnosed and recorded. (The relevant type of prior visit might depend on the condition of interest.) However, restricting a study to patients with a recent healthcare visit (of any type or a specific type, like a primary care visit) might have undesirable consequences. We explore this challenge in the context of studying the effectiveness of cancer screening.


By definition, only tests that are conducted among people who are free of signs and symptoms of a particular disease are classified as screening tests. Thus, observational studies of cancer screening effectiveness must exclude people who, when testing status is assessed, have signs or symptoms of cancer. Doing this with EHR data can be challenging.3,4 Patients who receive in-person cancer screening and diagnostic tests (e.g., mammography, colonoscopy, prostate-specific antigen test) will likely have had an opportunity for symptoms, if present, to be documented at some point, that is, when the test was administered or at the visit at which they were referred for testing. However, many people without testing will not have had a recent healthcare visit at which symptoms could have been documented in the EHR. Under such circumstances, exclusion criteria would be implemented unevenly: symptomatic patients with cancer will be excluded to a greater extent from the screened group than from the unscreened group. Screening could thus be associated with a reduced risk of cancer mortality (or a more reduced risk than is truly the case) because, at the start of the study, more patients with symptomatic cancer would be removed from the tested group than from the untested group. Thus, even after exclusion of people with documented symptoms, the association between screening and cancer mortality could be biased.
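The uneven-exclusion mechanism described above can be sketched as a short expected-value calculation. Everything in the example is a hypothetical assumption made for illustration: testing has no effect on mortality, symptomatic and asymptomatic people have fixed mortality risks, symptoms are never documented in asymptomatic people, and the only difference between groups is the probability that true symptoms are documented (and therefore lead to exclusion).

```python
def risk_after_exclusion(p_symptomatic, p_documented, risk_symptomatic, risk_asymptomatic):
    """Expected cancer mortality risk among people who remain after
    excluding everyone with documented symptoms.

    Assumes symptoms are never documented in asymptomatic people.
    """
    remaining_symptomatic = p_symptomatic * (1 - p_documented)
    remaining_asymptomatic = 1 - p_symptomatic
    share_symptomatic = remaining_symptomatic / (
        remaining_symptomatic + remaining_asymptomatic
    )
    return (
        share_symptomatic * risk_symptomatic
        + (1 - share_symptomatic) * risk_asymptomatic
    )


def risk_ratio(p_doc_tested, p_doc_untested,
               p_symptomatic=0.05, risk_symptomatic=0.10, risk_asymptomatic=0.01):
    """Tested-vs-untested risk ratio when testing has no effect on mortality.

    All default values are hypothetical.
    """
    tested = risk_after_exclusion(
        p_symptomatic, p_doc_tested, risk_symptomatic, risk_asymptomatic
    )
    untested = risk_after_exclusion(
        p_symptomatic, p_doc_untested, risk_symptomatic, risk_asymptomatic
    )
    return tested / untested


# Symptoms equally likely to be documented in both groups: no bias.
rr_even = risk_ratio(p_doc_tested=0.9, p_doc_untested=0.9)
# Symptoms less likely to be documented among the untested: spurious benefit.
rr_uneven = risk_ratio(p_doc_tested=0.9, p_doc_untested=0.3)

print(f"risk ratio, even documentation:   {rr_even:.2f}")
print(f"risk ratio, uneven documentation: {rr_uneven:.2f}")
```

With these assumed values the risk ratio is 1 when documentation is equally likely in both groups and roughly 0.79 when symptoms are documented less often among untested people, even though testing has no effect in the sketch.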

The structure of this bias can be illustrated in a directed acyclic graph. We define “true screening” as testing among patients without signs or symptoms of the cancer under study. Ideally, complete and accurate information on cancer signs and symptoms would be available and used to classify screening status. The Figure shows that symptoms and screening at the start of follow-up in a cohort study are both measured with error, as indicated by variables with asterisks. Patients without a healthcare encounter cannot have symptoms recorded. Symptoms may also be missed or miscoded in patients with healthcare encounters, especially if patients can self-schedule testing without a referral visit. It follows, therefore, that screening is measured with error due to incomplete or erroneous documentation of symptoms. Although the causal relationship of interest is the one between true screening and cancer mortality (and is only of interest in people without signs or symptoms of cancer), EHR-based observational studies estimate the association between measured screening* and cancer mortality (assuming, for present purposes, that there is no outcome misclassification). There are several backdoor paths from measured screening* to cancer mortality that bias their association (e.g., screening* to true screening to recent healthcare visit to true symptoms to cancer to cancer mortality). Restricting a study to people without documented symptoms, a necessary step in identifying the population of interest, blocks some, but not all, backdoor paths. Thus, the directed acyclic graph matches the intuition that restriction to people without documented symptoms is not sufficient to remove bias.

DAG for cancer screening, symptoms, healthcare utilization, and cancer mortality. Asterisks indicate variables that are measured with error. DAG, directed acyclic graph.
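Because the Figure itself is not reproduced here, the following sketch encodes an assumed, simplified edge set that is consistent with the two paths named in the text; the published DAG may contain different or additional arrows. The node names and the helper functions `descendants` and `path_is_open` are hypothetical. The code applies the standard d-separation rules to a given path under a chosen conditioning set.

```python
# Assumed edges (parent, child); a reconstruction for illustration only.
EDGES = [
    ("cancer", "true_symptoms"),
    ("cancer", "cancer_mortality"),
    ("true_symptoms", "recent_visit"),
    ("unmeasured", "recent_visit"),
    ("unmeasured", "cancer_mortality"),
    ("true_symptoms", "symptoms*"),
    ("recent_visit", "symptoms*"),
    ("true_symptoms", "true_screening"),
    ("recent_visit", "true_screening"),
    ("true_screening", "screening*"),
    ("symptoms*", "screening*"),
    ("true_screening", "cancer_mortality"),
]


def descendants(node, edges):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def path_is_open(path, conditioned, edges):
    """Return True if `path` is unblocked given the set `conditioned`."""
    edge_set = set(edges)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, node) in edge_set and (nxt, node) in edge_set
        if is_collider:
            # A collider blocks the path unless it, or one of its
            # descendants, is in the conditioning set.
            if not ({node} | descendants(node, edges)) & conditioned:
                return False
        elif node in conditioned:
            # Conditioning on a non-collider blocks the path.
            return False
    return True


# The two paths described in the text.
PATH_VIA_VISIT = ["screening*", "true_screening", "recent_visit",
                  "true_symptoms", "cancer", "cancer_mortality"]
PATH_VIA_COLLIDER = ["screening*", "true_screening", "true_symptoms",
                     "recent_visit", "unmeasured", "cancer_mortality"]

for label, conditioned in [
    ("nothing", set()),
    ("symptoms*", {"symptoms*"}),
    ("symptoms* and recent_visit", {"symptoms*", "recent_visit"}),
]:
    print(f"conditioning on {label}:")
    print("  path via visit open:   ", path_is_open(PATH_VIA_VISIT, conditioned, EDGES))
    print("  path via collider open:", path_is_open(PATH_VIA_COLLIDER, conditioned, EDGES))
```

In this assumed graph, the path through the recent visit stays open when only documented symptoms are conditioned on and is blocked once the visit is also conditioned on, whereas the path on which the visit is a collider is blocked with no conditioning and opened by conditioning on the visit. Note that documented symptoms* are a descendant of the visit in this edge set, so conditioning on them alone already opens the collider path under the d-separation rules.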


Further restricting the study population to patients with a recent healthcare visit (of any type or a specific type, such as primary care) can help ensure that all patients, not just those who were screened, had an opportunity for any symptoms present to be documented. It also blocks some (but not all) backdoor paths through unmeasured variables to cancer mortality. However, restricting a study to people with a recent healthcare visit introduces the potential for selection bias. People who come in for visits and are not screened may have come in for some other reason. If that other reason is related to cancer mortality risk, a study restricted to healthcare utilizers might not accurately estimate the association between screening and cancer mortality. For example, let us suppose (albeit unrealistically) that each healthcare visit is either for prevention (which could include cancer screening) or for diagnosis or treatment of a medical condition but not for both screening and diagnosis or treatment of another condition. Under this assumption, someone who comes in for diagnosis or treatment will not be screened. If their underlying risk of cancer mortality does not reflect what the cancer mortality risk would have been in screeners had they not been screened, there will be bias. In many situations, the reasons for healthcare visits among nonscreeners will be unrelated to their cancer mortality risk conditional on measured variables. However, it is possible that some reasons for healthcare visits may be related to cancer risk. For example, assume people with diabetes have a higher risk of colorectal cancer mortality than people without diabetes. If people with diabetes come in for primary care visits to manage their disease and are not referred to screening, they will enrich the unscreened group with people at higher risk of colorectal cancer mortality. Screening would thus appear falsely to be associated with a reduced risk of cancer mortality (or more so than it truly is).
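The diabetes example can be sketched numerically. The calculation below is a hypothetical illustration, not an estimate: it assumes that screening has no effect on colorectal cancer mortality, that screening uptake is unrelated to diabetes, and that unscreened people with diabetes are more likely than other unscreened people to have a recent visit. All probabilities and function names are invented for the example.

```python
def diabetes_share_among_visitors(p_diabetes, p_visit_diabetes, p_visit_no_diabetes):
    """Prevalence of diabetes among unscreened people who had a recent visit."""
    with_diabetes = p_diabetes * p_visit_diabetes
    without_diabetes = (1 - p_diabetes) * p_visit_no_diabetes
    return with_diabetes / (with_diabetes + without_diabetes)


def expected_risk(diabetes_share, risk_diabetes, risk_no_diabetes):
    """Expected cancer mortality risk in a group with the given diabetes prevalence."""
    return diabetes_share * risk_diabetes + (1 - diabetes_share) * risk_no_diabetes


# Hypothetical inputs.
P_DIABETES = 0.10            # prevalence, same in screened and unscreened people
RISK_DIABETES = 0.02         # cancer mortality risk with diabetes
RISK_NO_DIABETES = 0.01      # cancer mortality risk without diabetes
P_VISIT_DIABETES = 0.8       # P(recent visit | unscreened, diabetes)
P_VISIT_NO_DIABETES = 0.2    # P(recent visit | unscreened, no diabetes)

risk_screened = expected_risk(P_DIABETES, RISK_DIABETES, RISK_NO_DIABETES)
risk_unscreened_all = expected_risk(P_DIABETES, RISK_DIABETES, RISK_NO_DIABETES)

share_visitors = diabetes_share_among_visitors(
    P_DIABETES, P_VISIT_DIABETES, P_VISIT_NO_DIABETES
)
risk_unscreened_visitors = expected_risk(share_visitors, RISK_DIABETES, RISK_NO_DIABETES)

rr_unrestricted = risk_screened / risk_unscreened_all
rr_restricted = risk_screened / risk_unscreened_visitors

print(f"diabetes prevalence among unscreened visitors: {share_visitors:.2f}")
print(f"risk ratio without restriction: {rr_unrestricted:.2f}")
print(f"risk ratio restricted to recent visitors: {rr_restricted:.2f}")
```

Under these assumptions, restriction raises the prevalence of diabetes in the unscreened comparison group from 0.10 to about 0.31, and the risk ratio moves from 1 to about 0.84 even though screening does nothing in the sketch. The calculation deliberately ignores the documentation problem that motivates restriction in the first place, so it shows only one of the two competing biases.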


The foregoing sections describe a potential problem of competing biases due to informative presence that can arise in the implementation of study exclusion criteria. Structurally, the Figure shows that conditioning on a recent healthcare visit blocks some of the backdoor paths from screening* to cancer mortality that were not blocked simply by conditioning on no measured symptoms*. However, having a recent healthcare visit is a collider on the path from measured screening* through true screening, true symptoms, recent healthcare visit, unmeasured variables, to cancer mortality (Figure). Previous work has suggested that in many scenarios in which a variable is both a confounder and a collider, it is better to adjust for that variable than to ignore it5–10; however, the direction and relative magnitude of competing biases depend on many factors.

Restricting an EHR-based study of a healthcare intervention to people with a recent healthcare visit (of any type or a particular type) may be necessary to increase confidence in making study exclusions. But conditioning on a recent healthcare visit (a collider) can also introduce selection bias. Within a healthcare system, people who have encounters are likely different from those who do not, but the nature and importance of those differences may depend on the scientific question of interest. How should researchers balance these potentially competing considerations? As others have advocated, we too believe it is important to understand subject-specific EHR data provenance.11 We also encourage empirical studies to collect data directly from patients regarding characteristics (e.g., symptoms, medical history) of people without recent health encounters. These data can be used in simulations to help epidemiologists gauge the potential for different types of bias. One must be open to the possibility that uncertainty will remain or that neither approach (restricting vs. not restricting to people with any recent healthcare visits or a certain type of healthcare visit) will sufficiently reduce bias in some applications. In the best-case scenario, it will be clear which approach is likely to be least biased. In some cases, it may be unclear. We encourage authors to explicitly consider and discuss these tradeoffs and perform quantitative bias analyses or sensitivity analyses12 to evaluate the appropriateness of restriction.

About the Authors

Dr. Chubak is a scientific investigator at Kaiser Permanente Washington Health Research Institute and Affiliate Professor in the Department of Epidemiology at the University of Washington. Her methodologic interests include using electronic healthcare data for epidemiologic research.

Dr. Dalmat is a recent graduate of the University of Washington, Department of Epidemiology and will be a postdoctoral researcher at the International Clinical Research Center at the University of Washington. Her interests include epidemiologic methods for studying diagnostic tests and tools.

Dr. Weiss is an Emeritus Professor in the Department of Epidemiology at the University of Washington. His primary research interests have been in the areas of cancer epidemiology, clinical epidemiology, and epidemiologic methods.

Dr. Doria-Rose is an epidemiologist and the Chief of the Healthcare Assessment Research Branch at the National Cancer Institute. His research interests are focused on the delivery of cancer screening in community healthcare settings.

Dr. Corley is a gastroenterologist and epidemiologist at the Kaiser Permanente, Northern California, Division of Research, where he also directs the Delivery Science and Applied Research Program. His interests include cancer screening, discovery epidemiology, and methodologic development of pragmatic research.

Dr. Kamineni is a scientific investigator at Kaiser Permanente Washington Health Research Institute. Her research focuses on cancer screening, addressing evidence gaps for clinical guidelines, and developing infrastructure and methods for research using electronic clinical and administrative health care data.

Acknowledgments

We thank Dr. Barbra Dickerman for helpful conversations and Dr. Sebastien Haneuse for comments on the article.

References

1. McGee G, Haneuse S, Coull BA, et al. On the nature of informative presence bias in analyses of electronic health records. Epidemiology. 2022;33:105–113.
2. Cole SR, Platt RW, Schisterman EF, et al. Illustrating bias due to conditioning on a collider. Int J Epidemiol. 2009;39:417–420.
3. Weiss NS. Observational studies that seek to emulate a randomized trial of screening to reduce the incidence of cancer: do they address the question to which we’d like to have an answer? Am J Epidemiol. 2019;188:1761–1763.
4. Weiss NS. Commentary: cohort studies of the efficacy of screening for cancer. Epidemiology. 2015;26:362–364.
5. Goldstein BA, Bhavsar NA, Phelan M, Pencina MJ. Controlling for informed presence bias due to the number of health encounters in an electronic health record. Am J Epidemiol. 2016;184:847–855.
6. Wallmark J. Selection Bias When Estimating Average Treatment Effects in the M and Butterfly Structures. Umea Universitet; 2019.
7. Kelcey B, Carlisle J. The Threshold of Embedded M Collider Bias and Confounding Bias. Society for Research on Educational Effectiveness; 2011.
8. Ding P, Miratrix LW. To adjust or not to adjust? Sensitivity analysis of M-bias and butterfly-bias. J Causal Inference. 2015;3:41–57.
9. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology. 2003;14:300–306.
10. Liu W, Brookhart MA, Schneeweiss S, et al. Implications of M bias in epidemiologic studies: a simulation study. Am J Epidemiol. 2012;176:938–948.
11. Haneuse S, Daniels M. A general framework for considering selection bias in EHR-based studies: what data are observed and why? EGEMS (Wash DC). 2016;4:1203.
12. Lash TL, Fox MP, MacLehose RF, et al. Good practices for quantitative bias analysis. Int J Epidemiol. 2014;43:1969–1985.
Copyright © 2022 Wolters Kluwer Health, Inc. All rights reserved.