
Review Article

A Root Cause Analysis Into the High Error Rate in Clinical Immunohistochemistry

Bogen, Steven A. MD, PhD

Applied Immunohistochemistry & Molecular Morphology: May/June 2019 - Volume 27 - Issue 5 - p 329-338
doi: 10.1097/PAI.0000000000000750


A root cause analysis (RCA) is an investigative method designed to not only identify a problem and explain how it came about, but also determine why it happened. The goal is to identify 1 or more underlying identifiable causes that can be remedied. This RCA will follow the 4 steps of: (1) data collection (from the published literature), (2) identification of possible causal factors, (3) elucidation of the underlying root cause(s), and (4) generation of recommendations. This review’s perspective is different from previously published Clinical Immunohistochemistry (IHC) reviews. It is written from the vantage point of well-established clinical laboratory practice in other laboratory disciplines, such as Clinical Chemistry, Immunology, and Hematology. Preanalytic and analytic factors associated with Clinical IHC testing are benchmarked against those of other clinical laboratory disciplines. The entire analysis is summarized in Figure 1. Also, the text is interspersed with frequently asked questions based on feedback from colleagues.

Figure 1. Schematic illustration of the various test components (top) and the corresponding root cause analysis process. IHC indicates immunohistochemistry; LDTs, laboratory developed tests; PT, proficiency testing.


Definition of an Error

For this paper, a Clinical IHC “error” is the reporting of an incorrect test result, regardless of the reason. “Regardless of the reason” means that the cause of the error may be preanalytic, analytic, or postanalytic, conforming to the definition of laboratory error in the ISO/TS 22367 technical specification1 and others.2 The phrase “reporting of an incorrect test result” implies that quality control (QC) is acceptable and that there is nothing about the stained slide or the day’s batch of slides that gives rise to concern that an error may have occurred. For example, if the slide were part of a proficiency testing (PT) survey, the slide passes the laboratory’s QC checks, appears to be correct, and the test result (and slide, if appropriate) would be submitted to the PT organization. This definition of error acknowledges that some errors will not actually impact the patient, if the pathologist recognizes that the test result does not fit with the rest of the case. The stained tissue’s appearance may give rise to an erroneous “readout” even if the “interpretation” is unaffected (as these terms are defined by Cheung et al).3

Clinical IHC error rate data are based on published split-sample studies where 2 or more laboratories perform an immunohistochemical test on the same formalin-fixed paraffin-embedded tissue block. Disagreement among laboratories is interpreted as one of the laboratories having erred. Two types of data sources describe these error rates: (1) interlaboratory studies such as are often associated with clinical trials, and (2) PT programs where participants stain serial sections of the same samples.
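The discordance measure used in these split-sample studies can be sketched in a few lines. This is a minimal illustration only: the function name `discordance_rate` and the six paired results are hypothetical, and the published studies each apply their own criteria for what counts as a disagreement.

```python
from typing import Sequence


def discordance_rate(local: Sequence[str], central: Sequence[str]) -> float:
    """Fraction of paired results on which the two laboratories disagree."""
    if len(local) != len(central):
        raise ValueError("results must be paired, one per tissue block")
    if not local:
        raise ValueError("at least one paired result is required")
    disagreements = sum(a != b for a, b in zip(local, central))
    return disagreements / len(local)


# Hypothetical results for six tissue blocks (illustrative values only).
local_lab = ["pos", "neg", "pos", "pos", "neg", "neg"]
central_lab = ["pos", "neg", "neg", "pos", "neg", "pos"]

rate = discordance_rate(local_lab, central_lab)
print(f"discordance: {rate:.1%}")
```

Because both laboratories test sections from the same block, a disagreement cannot be assigned to one laboratory from this number alone; it indicates only that at least one of the two results is wrong.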

Interlaboratory Studies Error Data

The split-sample discordance rates from clinical trials are summarized in Table 1. Many of these studies involve referral of oncology patients into a clinical study or to another hospital. Paraffin tissue blocks are transferred from a local hospital laboratory to a central one for verification of the test result. Serial sections from the same paraffin tissue block are stained and the 2 test results compared. The discordance rates vary widely, from 1.6% to 75%. Several studies separately examined whether test result discrepancies were due to the IHC test protocol or the IHC test readout.9,10,12 They generally found that both factors are important. The broad variability in error rate estimates reflects differences associated with the particular protein being detected, the criteria for measuring discordance or detecting error, and variability in performance of participating laboratories. The conclusion from Table 1 is that error rates are in the single- and double-digit range.

Table 1. Analytic Error Rates in Clinical IHC: Interlaboratory Comparisons

PT Survey Error Data

The error rates from national PT surveys fall into a similar high range. The most useful data derive from national PT organizations that perform central review of stained slides, because a single group of pathologist scorers evaluates each stained slide. This format minimizes readout variability, so the discrepancies are primarily due to differences in the IHC test protocol (staining and antigen retrieval). For example, the NordicQC PT program’s cumulative experience from 2003 to 2015 is that ∼10% of laboratories scored as “poor” and another 10% to 20% as “borderline.”18 Although there is no recent cumulative compilation of UK-NEQAS ICC data, the organization routinely cites single-digit failure rates and double-digit borderline laboratory performance rates. For example, a series of 4 PT anaplastic lymphoma kinase IHC surveys conducted by UK-NEQAS disclosed an average proficiency test failure rate of 17%.19 Although the UK-NEQAS data showed that part of this error rate was due to a number of laboratories opting to use laboratory developed tests (LDTs), an unacceptable level of failure, 7%, was also seen with the FDA-approved IHC test. Similar error rates were identified in a PT scheme organized by the European Society of Pathology. Error rates for anaplastic lymphoma kinase and ROS1 IHC ranged up to 9.0% and 13.3%, respectively.20 The majority of PT errors are associated with false-negative test results.18

As the PT survey slides in the above-described studies are all scored by the same expert panel, the readout component is common to all laboratories. Consequently, errors detected by PT surveys that incorporate central review of slides are mainly attributable to problems with the IHC protocol (antigen retrieval and/or the IHC staining process), not the readout.

FAQ: These high error rate data don’t reflect the performance of the best Clinical IHC laboratories, those with high test volumes and experienced staff that follow best practices.

Reply: These data represent the field at large. The fact that expert Clinical IHC laboratories are able to achieve lower error rates does not contradict the conclusion of this RCA. Instead, it speaks to the fact that expert laboratories are able to accurately validate and maintain their immunohistochemical tests without the existence of international standards, calibrators, traceable units of measure, and quantitative controls. This is more fully addressed later (see the Elucidation of the underlying root causes section).

Error Rates in Other Clinical Laboratory Disciplines

In fields such as Clinical Chemistry, Immunology, and Hematology, the rate of laboratory errors due to the analytic portion of the test is <1%.21–25 Consequently, the focus of attention for improvement has shifted to understanding and preventing preanalytic and postanalytic errors, which are more frequent than analytic errors.26–29

Regardless of the exact rate of Clinical IHC errors, all of the split-sample data—from both PT surveys and interlaboratory studies—argue for single to double-digit analytic error rates in Clinical IHC. In contrast, the rate of analytic errors in other laboratory testing disciplines is <1%. This striking difference cannot be ascribed to trivial causes such as the difference in the definition of errors. If anything, the field of Clinical IHC ought to have lower error rates because test results are qualitative. With qualitative tests, significant deviations in the stain intensity can occur and still yield the right positive or negative test result. Accurately reporting a broad category should be easier than a quantitative test reported as a continuous variable.


The Role of Preanalytic Variables

Preanalytic variables describe all of the possible permutations in the handling of a tissue sample, from the patient until the start of the IHC test. Most attention has focused on cold ischemic time and fixation time, as summarized in several 2018 reviews.30–32 Preanalytic factors can lead to erroneous IHC test results and are justifiably an important focus of attention for Clinical IHC laboratories. However, preanalytic variables are not a causal factor in the aforementioned published high error rates. The previously cited split-sample study designs for measuring error rates in Clinical IHC are inherently blind to preanalytic variables (Fig. 1, far left). As each laboratory is measuring the same sample, from the same paraffin-embedded tissue block, the preanalytic sample conditions are common to all of the study’s participating laboratories. Preanalytic specimen attributes cannot account for different test results from laboratories testing the same specimen. Therefore, for this root cause analysis, preanalytic factors are discounted as a causal factor. All of the errors detected in split-sample tests can only be attributed to analytic factors. Although variability in the depth of the tissue block is a potential explanation for different test results, it seems unlikely to account for the consistently high error rates as measured by split-sample testing over the years.

Analytic Variables

As already mentioned, the causes of errors described in the published studies fall into the analytic variables category. Analytic variables include the IHC test protocol (antigen retrieval, reagents, protocol, and instrument) and the test readout. For Clinical IHC, the test readout is judged by a pathologist with or without the assistance of computer image analysis. These variables are represented at the top center of Figure 1. Examples of a reagent problem include an expired reagent or one that was used at an inappropriate concentration. Examples of an instrument problem include a reagent dispenser that misses a dispense, a faulty heating element, or an operator using the instrument incorrectly. These are just a few examples. Per Murphy’s Law, there are innumerable ways that analytic problems can arise, given enough time, leading to a test result error. Any and all of these are potential causal factors. It is not the intent here to review the many potential causal factors and describe best practices. This has already been done by others.3,33–37 Instead, the important question for this root cause analysis is: what makes the analytic component of Clinical IHC more error-prone than the analytic components of other laboratory disciplines?

Although each laboratory testing discipline has its own unique technical details, the analytic test components and procedures in Clinical IHC appear to be of comparable complexity and quality to those of other laboratory testing disciplines. For example, Clinical IHC reagents are generally manufactured under current Good Manufacturing Practices, just like other in vitro diagnostic (IVD) tests in Clinical Chemistry or Immunology. The level of instrument reliability appears similar to the reliability of instruments in other laboratory testing fields. Antigen retrieval is somewhat unique but other IVD tests have their own similarly unique sample preparation features. For example, viral load testing requires an initial nucleic acid extraction. In short, the analytic features associated with Clinical IHC do not appear to account for the higher error rate.

Against this backdrop, there are 2 analytic test aspects that are comparatively unique in Clinical IHC. These 2 aspects (described below) likely increase the error rate. However, the most important finding is that, regardless of the analytic problem, it is surprising that the errors are not detected before entry into the medical record. No matter what the exact cause of the analytic error, the common denominator is a change in the test result. This should be detected. Depending on the problem, it would be noticed either during the initial test validation or with the daily control. This suggests that the underlying root cause relates to the quality assurance (QA)/QC systems used in Clinical IHC for detecting errors.

Analytic Differences From Other Laboratory Disciplines

There are at least 2 unique aspects of Clinical IHC—causal factors—that may disproportionately contribute to the analytical error rate. First, Clinical IHC has a comparatively high proportion of LDTs. Many IHC tests require the Clinical IHC laboratory staff to decide on various IHC protocol conditions. This includes making decisions, for example, on an appropriate reagent dilution, incubation time, pH of an antigen retrieval solution, and (for some instruments) the dispense volume. The situation is somewhat unique to Clinical IHC. Clinical Chemistry laboratory staff, in comparison, generally do not encounter the need to configure an assay. If LDTs are more prone to generating errors, the next question in a RCA is “why?” Why are LDTs not configured correctly? An important underlying reason will be discussed in the Elucidation of underlying root causes section.

The other significant difference relates to the test readout. Clinical IHC is dependent on a pathologist’s readout. Clinical Chemistry tests, by comparison, usually generate a signal that is measured by an instrument. For example, colorimetric reactions are measured with a spectrophotometer that is incorporated into the chemistry analyzer. The subjectivity (in test result readout) for Clinical IHC stands in contrast to the objectivity of Clinical Chemistry. Published data demonstrate that IHC test results at the threshold between 2 categories are at the greatest risk of readout error.38,39 Studies measuring interlaboratory variability due to interpretation versus variability due to the IHC test itself indicate that both contribute to discordance between laboratories.9,12

National PT data argue that even if readout error were eliminated, there would still be significant interlaboratory variability (errors). The previously cited data from NordicQC and UK-NEQAS ICC all involved central review of stained slides. A single group of expert readers grades all of the participants’ stained slides. Consequently, this PT evaluation protocol promotes consistently applied scoring criteria. For this reason, the errors measured in PT studies are overwhelmingly IHC test protocol errors, not associated with the visual readout of slides. To the extent that pathologist errors in immunohistochemical slide readout are a causal factor contributing to the error rate, a root cause analysis would ask why they occur. Issues such as training and subjectivity of the diagnostic criteria will arise. If pathologist readout errors are a significant contributor to the error rate, a RCA would also ask about the role of quantitative image analysis in mitigating errors (where appropriate).


The reader will likely already be familiar with the aforementioned causal factors. Sooner or later, analytic test problems arise even in the best laboratories. No matter what the analytic problem is, it manifests as an inappropriate stain that should be detectable on test controls or during test validation. Therefore, a RCA should also consider the effectiveness of methods for detecting and preventing analytic errors through the use of regular accurate feedback. “Regular accurate feedback” into analytic test performance refers to well-established methods of clinical laboratory QA, including QC. The word “regular” in this context means that feedback regarding analytic sensitivity occurs not just at the initial test validation but also on every subsequent day or every subsequent sample. It pertains to both the development and maintenance of the test. The word “accurate” in this context implies a high level of sensitivity to detecting perturbations in the test. In the fields of Clinical Chemistry, Immunology, Hematology, and others, the methods for creating regular accurate feedback require: (1) monitoring of controls with known analyte concentrations, (2) calibrators, (3) reference standards, and (4) traceable units of measure. This RCA finds that the nature of regular accurate feedback is so completely different in Clinical IHC as to likely account for the dramatically higher error rates.

Terminology and QA/QC practices that are fundamental to other types of clinical laboratories are not practiced in Clinical IHC. For example, Clinical Chemistry, Immunology, and Hematology laboratories monitor test performance using Levey-Jennings charts, also termed “control charts.” This practice is so integral to QC that their absence (without some type of comparable substitute) would be grounds for suspension of a clinical laboratory’s accreditation. Yet Levey-Jennings charts (or a comparable substitute) are unknown in Clinical IHC. Clinical IHC also lacks another important element for standardization: traceable units of measure. It is difficult to align the analytic sensitivity of IHC test results without a common yardstick. Traceable units of measure, in turn, cannot exist without calibrators that are traceable to a standard. Calibrators, standards, controls of known analyte concentrations, and traceable units of measure comprise basic elements of laboratory QA/QC. In these aspects of laboratory practice, Clinical IHC stands apart from other clinical laboratory disciplines. These QA/QC elements are described in greater detail in the following sections.

Historical Laboratory Precedent

This is not the first review article identifying this underlying root cause of laboratory test errors:

The reliability of analytical results from clinical laboratories has been questioned by many individuals, and several independent surveys have more than justified the suspicion. In these surveys, blame has been placed on poor supervision of personnel, poorly trained and insufficient personnel, poor equipment, poor choice of methods available, and so on …. It is the purpose of this paper to discuss the importance of running routine standards …40

This excerpt actually refers to Clinical Chemistry, not Clinical IHC. It is from a published 1952 review, before modern QA/QC practices were well developed and widely adopted. The review advocated the use of control (Levey-Jennings) charts, which are now mandatory elements of laboratory QC. At the time, control charts were new, having only been published by Levey and Jennings41 2 years prior, in 1950. Modern definitions of standards did not emerge until the 1960s.42 The modern hierarchical model of measurement traceability comprising reference methods and materials in Clinical Chemistry was described by Tietz43 in 1979. Control rules (Westgard rules) were described by Westgard et al44 in 1981. All of these aspects of laboratory QA/QC are presently considered essential for producing accurate laboratory test results. They are all absent from Clinical IHC.

History demonstrates that the adoption of modern QA/QC practices in clinical laboratory testing coincided with a dramatic drop in the analytic error rate.45 Analytical errors in Clinical Chemistry are now generally cited as <1/20th of what they were 50 years ago.22 For example, a seminal 1947 paper studying errors in Clinical Chemistry found an analytic error rate of 16.2%.46 A 1955 Canadian survey of glucose testing found error rates as high as 40%.47 Wootton and King48 found up to 50% variability in test values in a split-sample survey of 36 UK laboratories. Similarly high error rates were described by Shuey and Cebel.49 Yet today, analytic error rates are <1%.

Those early days of clinical laboratory testing have been termed the “ancient” era.45 They were characterized by high analytic error rates, wide dispersion of test data among different clinical laboratories, and “gross errors” even on tests that were relatively simple. It is important to highlight the fact that the description is a snapshot of the entire field in those days. Like Clinical IHC today, there were exceptions, that is, laboratories that adopted the highest standards of practice, with a high level of expertise, and consequently had low error rates. By the 1990s, after these methods were widely adopted, analytic error rates in laboratory medicine broadly declined to 1% to 2%.25,26,50 The incidence of preanalytic and postanalytic errors now far outstrips the rate of analytic errors.26–29

The previously cited published error rates in Clinical IHC mirror those of clinical laboratory medicine 50 years ago. The similarity to the “ancient era” of Clinical Chemistry does not prove causation but the relative absence of the key features of regular accurate feedback from both certainly suggests the possibility of shared root cause(s). If so, then the same solutions may be applicable. There have been many changes in clinical laboratories over the decades: automated instruments, improved reagents, PT, information systems, etc. Any and all of these changes may have helped lower error rates. However, these modern aspects of laboratory testing are also found in Clinical IHC. The one aspect not yet adopted into Clinical IHC is in the area of standards, calibrators, traceable units of measure, and controls with known analyte concentrations coupled with Levey-Jennings charts. Collectively, these are the most effective methods for generating regular accurate feedback into analytic test performance.

Reference Standards, Calibrators, and Traceable Units of Measure

All American football fields are 100 yards long. This length can be verified with a tape measure purchased at a local hardware store. Naturally, we expect that the 100-yard measurement is consistent regardless of the brand of tape measure that we purchase. Manufacturers of tape measures verify the accuracy of their products by checking them against a calibrator, such as at the National Institute of Standards and Technology (NIST). NIST provides a service for verifying accurate length. The NIST calibrator, in turn, is verified against an even higher level standard at the Bureau International des Poids et Mesures (BIPM). Therefore, the local hardware store’s tape measure is ultimately traceable to the measure at BIPM. There is a hierarchy of traceability ensuring measurement reproducibility. The definition of a traceable measurement is one that has an unbroken chain of calibrations going back to the primary reference standard. Uncertainty increases the further we get from the primary standard.
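The growth of uncertainty along a calibration chain can be illustrated numerically. The sketch below assumes that each calibration step contributes an independent standard uncertainty and that the contributions combine by root-sum-of-squares, which is the usual treatment for uncorrelated contributions. The step names follow the tape measure analogy, and the percentages are hypothetical values chosen only for illustration.

```python
import math


def combined_uncertainty(step_uncertainties):
    """Combine independent standard uncertainties by root-sum-of-squares."""
    return math.sqrt(sum(u * u for u in step_uncertainties))


# Hypothetical relative standard uncertainties (%) added at each calibration
# step; the step names and numbers are illustrative assumptions only.
chain = [
    ("primary standard (BIPM)", 0.01),
    ("national calibrator (NIST)", 0.05),
    ("manufacturer's reference", 0.20),
    ("hardware-store tape measure", 0.50),
]

cumulative = []
for i, (name, _) in enumerate(chain, start=1):
    # Uncertainty accumulated over the first i links of the chain.
    u_total = combined_uncertainty(u for _, u in chain[:i])
    cumulative.append(u_total)
    print(f"{name}: cumulative uncertainty {u_total:.3f}%")
```

Under these assumptions the cumulative figure can only grow with each added link, which is the sense in which uncertainty increases with distance from the primary standard.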

A similar arrangement exists in clinical laboratory medicine. There are hundreds of primary reference standards in the field. These standards are provided by organizations such as NIST, the International Federation of Clinical Chemistry, and the World Health Organization. Most commercial calibrators sold by clinical laboratory vendors are traceable to these international standards. Hospital laboratories purchase these calibrators to create calibration curves on their instruments. This system of measurement traceability thereby ensures standardization of each hospital’s calibration curve to the primary reference standard. Without traceable measures, it is difficult to foster accuracy, resulting in laboratory testing errors.

The field of Clinical IHC is different from all other clinical laboratory disciplines because there are no traceable units of measure. Scores of 0 to 3+, such as used for HER2 scoring, are not traceable units of measure. Instead, they are morphologic descriptions incorporating estimates of circumferential staining and stain intensity. Stain intensity is not traceable to any higher order standard and, therefore, varies among commercial vendors’ kits. Similarly, counting the percentage of positive cells, such as for PD-L1 or estrogen receptor (ER), is not a traceable unit of measure. This is because the analytic sensitivity of the IHC test is not traceable to a higher standard. The same tissue sample will show few or many positive cells, depending on the analytic sensitivity of the IHC test.
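The dependence of a percent-positive score on analytic sensitivity can be shown with a toy calculation. Everything in this sketch is hypothetical: the per-cell expression values, the two detection thresholds, and the function name `percent_positive` are illustrative assumptions, not data from any assay.

```python
def percent_positive(cell_expression, detection_threshold):
    """Percentage of cells whose analyte level is at or above the threshold."""
    positive = sum(1 for e in cell_expression if e >= detection_threshold)
    return 100.0 * positive / len(cell_expression)


# Hypothetical analyte levels (arbitrary units per cell) in ONE tissue sample.
cells = [50, 200, 800, 1500, 3000, 6000, 12000, 25000, 50000, 100000]

# Two hypothetical IHC tests that differ only in analytic sensitivity.
more_sensitive = percent_positive(cells, detection_threshold=1000)
less_sensitive = percent_positive(cells, detection_threshold=10000)

print(f"more sensitive test: {more_sensitive:.0f}% positive cells")
print(f"less sensitive test: {less_sensitive:.0f}% positive cells")
```

If a clinical cutoff were set at, say, 50% positive cells, this one sample would be reported as positive by the first test and negative by the second, even though nothing about the tissue changed.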

FAQ: Clinical IHC testing is qualitative in nature. Why should quantitative QA/QC measures such as standards, calibrators, and traceable units (as used in Clinical Chemistry) apply to a qualitative laboratory testing field like Clinical IHC?

Reply: Well-established principles of laboratory medicine require that even qualitative assays are calibrated so as to reproducibly identify a threshold separating positive from negative test results. There are many clinical laboratory tests that are reported as positive or negative, such as serology tests for infectious diseases. Even these qualitative tests have calibrators, many of which are traceable to World Health Organization (WHO) certified reference materials. Examples include antibodies to hepatitis B core and surface antigen, hepatitis A virus, human immunodeficiency virus, T. pallidum, measles, rubella, parvovirus, T. cruzi, cytomegalovirus, and others. Therefore, if the field of Clinical IHC were to follow other clinical laboratory disciplines, qualitative IHC tests would at least be calibrated at the limit of detection (LOD). For each IHC test, a sensitivity threshold would be defined by the appropriate regulatory bodies or as part of a manufacturer’s FDA submission. This sensitivity threshold would then be confirmed in each Clinical IHC laboratory as part of IHC test validation.

Current IHC Test Validation Challenges

The absence of standards, calibrators, and traceable units in Clinical IHC leaves Clinical IHC laboratories generally without the tools that other laboratory testing disciplines consider essential for test validation. Clinical IHC laboratories are instructed to use alternate methods of IHC test validation:

For qualitative assays such as IHC, validation usually requires comparing a new assay’s results with a reference standard and calculating estimates of analytic sensitivity and specificity; however, because there are no gold standard referent tests for most IHC assays, laboratories must use another means of demonstrating that the assay performs as expected.51

Current guidelines for ensuring accurate patient test results generally depend on patient-derived samples, where IHC test results are compared against an expected outcome. An important limitation, as quoted above, is that a gold standard is usually not available. Laboratories must find “another means of demonstrating that the assay performs as expected.” Consequently, creative solutions have been devised to descriptively verify analytic sensitivity, such as the use of Immunohistochemistry Critical Assay Performance Controls (iCAPCs).36,52 This type of descriptive low-level calibrator comprises normal tissues where low levels of expression are expected for various analytes. They are valuable for evaluating a Clinical IHC test’s analytic sensitivity. Approximately 2 dozen different iCAPCs (to various analytes) have been described.36 Cell lines with defined (low) levels of expression are a similar type of descriptive calibrator for verifying analytic sensitivity.53,54 These are termed “descriptive” calibrators because the analyte concentration is unknown (albeit at a usefully low level) and is not traceable to a national or international standard. Another source of feedback on analytic test performance is participation in PT surveys, especially if the surveys incorporate central review of the stained slides. Such surveys, however, are only performed 2 to 3 times per year (per analyte) and are not intended as a laboratory’s primary source of analytic test performance feedback.

FAQ: If it is possible to enjoy low error rates using existing Clinical IHC best practices, then isn’t the root cause the fact that not all Clinical IHC laboratories adopt best practices? Should the focus not be towards ensuring that all laboratories adopt current best practices?

Reply: The two paths forward are not mutually exclusive. Nonetheless, the use of international standards, calibrators with traceable units of measure, and quantitative controls represents best practice in the field of laboratory medicine. If complying with such QA/QC practices (that incorporate traceable measures) is also simpler, then it is a win all around. Clinical Chemistry controls and calibrators (incorporating traceable measures) are simpler because they are commercially available at reasonable prices. The test readouts are also straightforward, performed by any trained Medical Technologist. Current best practices for Clinical IHC, on the other hand, require a higher level of expertise in creating and interpreting controls and descriptive calibrators. The Author’s focus is towards simpler Clinical IHC controls and calibrators, like those found in Clinical Chemistry, that also incorporate analytic traceability of measurement. These attributes foster greater reproducibility, inter-laboratory standardization, higher precision, and easier adoption (see the Generation of recommendations section).

Quantitative Controls

QC is an integral part of creating regular accurate feedback on the analytic part of the test. Many guidelines recommend controls on every slide, at least for companion diagnostics.55–61 The question for this RCA is whether the type of feedback provided by Clinical IHC controls differs significantly from controls in other laboratory disciplines.

Controls must sensitively detect deviations in the analytic components of the test. For example, consider a serum glucose assay with an analytic measurement range (AMR) of 5 to 500 mg/dL. For a typical glucose assay, a Clinical Chemistry laboratory will run at least 2 controls—a low (eg, 80 mg/dL) and a high (eg, 300 mg/dL) concentration. The laboratory establishes acceptable ranges for each control and verifies that the control values are within those bounds every day. Moreover, laboratories quantitatively monitor the control data for subtle trends using Levey-Jennings graphs. This is a sensitive system for detecting assay deviations. In comparison, QC samples for Clinical IHC are typically homemade, prepared from excess diagnostic tissue. There are no units of measure analogous to mg/dL. Pathologists typically observe the stained QC sample, decide if it matches what they think the QC should look like, and pass or fail it. There are 2 reasons why this QC protocol can be profoundly insensitive to problems that can nonetheless affect patient test results: (1) controls with overly high analyte concentrations and (2) the absence of quantitative monitoring.

Controls With Overly High Analyte Concentrations

In the hypothetical glucose assay with an AMR of 5 to 500 mg/dL, it would be illogical to use a QC with a concentration of 1000 mg/dL. If the assay can only measure up to 500 mg/dL, then a 1000 mg/dL test result would produce a maximal signal and be reported as >500 mg/dL. The QC value would need to drop below 500 mg/dL before assay problems could be detected. Therefore, it would be a profoundly insensitive control. For this reason, QC test values must be within the AMR.
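The insensitivity of an out-of-range control can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: an assay problem is modeled as a simple proportional loss of signal (`recovery`), the reported value is capped at the upper AMR limit, and a 10% shift in the control is arbitrarily taken as the alarm threshold. The function names and the tolerance are hypothetical, and the lower AMR limit is ignored for simplicity.

```python
AMR_UPPER = 500.0  # mg/dL, upper limit of the hypothetical glucose assay


def reported_value(true_conc, recovery):
    """Reported result when the assay recovers only a fraction of the signal.

    Results above the analytic measurement range are capped at AMR_UPPER,
    mirroring a report of ">500 mg/dL".
    """
    return min(true_conc * recovery, AMR_UPPER)


def detects_drift(control_conc, recovery, tolerance=0.10):
    """True if the control shifts by more than `tolerance` from its baseline."""
    baseline = reported_value(control_conc, 1.0)
    observed = reported_value(control_conc, recovery)
    return abs(observed - baseline) / baseline > tolerance


# A 30% loss of assay signal (recovery = 0.7):
print(detects_drift(300, 0.7))   # control within the AMR
print(detects_drift(1000, 0.7))  # control far above the AMR
# The out-of-range control reacts only after a much larger loss:
print(detects_drift(1000, 0.4))
```

In this model the 300 mg/dL control flags the 30% loss, while the 1000 mg/dL control still reads at the cap and flags nothing until more than half of the signal is gone.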

Yet this situation likely happens on a regular basis in Clinical IHC. College of American Pathologists (CAP) and Clinical and Laboratory Standards Institute (CLSI) guidance documents recommend the selection of controls expressing a low or intermediate concentration of analyte.56,62 This guidance has 2 limitations. First, it is a recommendation, not a requirement. Second, even if it were a requirement, tissue controls do not have numeric test values. Clinical IHC lacks QC samples that are analogous to the previously described 80 and 300 mg/dL glucose controls. Consequently, laboratory staff are supposed to select controls that stain weakly or at an intermediate stain intensity level.36 This process is subjective and depends on the availability of suitable samples. Experimental testing of IHC controls with analyte concentrations beyond the AMR confirms that they are insensitive measures of test failure.63

Absence of Levey-Jennings Graphs

The subjectivity in determining stain intensity of tissue controls creates another opportunity for missing analytic error. Evaluating Clinical IHC QC subjectively as pass or fail does not conform to established QC methods found in other clinical laboratory disciplines. Even for qualitative (eg, serologic) tests, other clinical laboratories use quantitative QC parameters, such as relative light units or the signal to cutoff ratio. The QC parameter is monitored in a Levey-Jennings graph. In contrast, Clinical IHC laboratories do not quantitatively evaluate QC. If they did, they would identify IHC test problems that currently go undetected. That is exactly what happened when Levey-Jennings graphing of IHC controls was tested in 3 Boston-area hospitals over the course of 1 to 2 months.64 In each of the 3 Clinical IHC laboratories, the procedure identified an otherwise undetected problem that altered stain intensity during the monitoring period.64 None of these problems had been detected by routine visual inspection alone. For example, in one laboratory, stain intensity declined during the weeks that the regular IHC histotechnologist was on vacation. The replacement histotechnologist operated the instrument in a different manner, trying to conserve reagent but inadvertently lowering the stain intensity. This decline was not noticed by visual inspection but was readily detected after image quantification and plotting of the data on a Levey-Jennings graph. Stain intensity returned to normal when the regular histotechnologist returned from vacation. This story illustrates the fact that the human eye is not highly adept at judging color intensity in absolute terms, without a comparison in the same field of view. In summary, this study demonstrated what is already known to be true in other clinical laboratory disciplines: the Levey-Jennings chart is an invaluable tool for detecting assay problems.
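A sustained decline of the kind described above can stay within the usual control limits at every time point and still be caught by a run rule. The sketch below applies one such rule from the multi-rule scheme of Westgard et al44 (4 consecutive values more than 1 SD from the mean on the same side) to hypothetical stain intensity values in arbitrary units. It is offered only as an illustration and is not a description of the analysis performed in the cited study.

```python
from statistics import mean, stdev

def rule_4_1s(values, center, sd):
    """Return the index at which 4 consecutive values lie beyond 1 SD on the
    same side of the mean, or None if the rule is never violated."""
    run_side, run_len = 0, 0
    for i, v in enumerate(values):
        z = (v - center) / sd
        side = 1 if z > 1 else (-1 if z < -1 else 0)
        if side != 0 and side == run_side:
            run_len += 1                      # run continues on the same side
        else:
            run_side = side                   # start a new run (or reset)
            run_len = 1 if side != 0 else 0
        if run_len >= 4:
            return i
    return None

# Hypothetical baseline stain intensities (arbitrary units) for one IHC control.
baseline = [100, 104, 97, 101, 99, 103, 96, 100, 102, 98]
center, sd = mean(baseline), stdev(baseline)

# Hypothetical monitoring period with a modest, sustained drop in intensity.
monitored = [101, 99, 96, 97, 96, 95, 97, 100, 102]

within_2sd = all(abs((v - center) / sd) <= 2 for v in monitored)
flag_index = rule_4_1s(monitored, center, sd)
print(within_2sd, flag_index)
```

In this example every value would pass a simple 2 SD check, yet the run rule flags the decline. A pass or fail judgment by eye has no equivalent mechanism for detecting such a shift.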


A Traceable System of Measurement for IHC

The growth of precision medicine and new Clinical IHC predictive tests has made reducing errors more important than ever. Although new approaches and concepts for IHC QA are improving the landscape,3,33–35 the field of Clinical IHC has not followed the path of other clinical laboratory disciplines with respect to standards, calibrators, traceable units of measure, and quantitative controls. Until now, it was technically not possible. In 2003, there was a consensus call to develop a HER2 standard reference material (SRM).65 The call was issued in response to studies showing poor harmonization of HER2 testing. It was supported by representatives of the CAP, NIST, FDA, and the Cancer Diagnosis Program of the National Cancer Institute. The report envisioned HER2 cell line standards expressing defined protein and DNA concentrations. Initial data toward the development of such standards were published in 200966 and a DNA standard (NIST SRM 2373) was released in 2016.67 No IHC test standard ever emerged.

To address this need, my colleagues and I developed a system of IHC calibrators and controls traceable to an already-existing standard—NIST SRM 1932, a fluorescein solution. Commercial microbead calibrators traceable to this primary standard are presently used for flow cytometry linearity verification. The system of measurement traceability depends on a physical constant—the molar extinction coefficient of fluorescein. We incorporated this system of measurement traceability into peptide epitope controls attached to cell-sized glass microbeads.68,69 A schematic representation is shown in Figure 2. The peptide epitopes serve as IHC test analytes.70 Each peptide epitope comprises the small linear portion of the native protein to which the antibody binds. The epitopes are detected in the IHC test just like the native protein. Unlike the purified native protein, the peptides are inexpensive, stable, resistant to heat and solvents, and reproducible over time. Like tissues and cells, they can be fixed in formaldehyde, thereby requiring antigen retrieval.71 In addition, they provide for traceability of measurement. Each peptide is manufactured with a single fluorescein (Fig. 2). By measuring fluorescence intensity, the fluorescein concentration on each microbead can be calculated using a standard curve traceable to SRM 1932. As there is 1 fluorescein per analyte (Fig. 2), the fluorescein concentration corresponds to the analyte concentration on the microbeads. Analyte concentration is expressed as “molecules of equivalent fluorochrome” (MEF). An advantage of this system of measurement traceability is that separate primary standards are not required for each analyte. Traceability of measurement can be created for any IHC test using the same primary standard (SRM 1932). A recent (2018) published pilot PT survey using peptide epitope-coated microbeads successfully identified HER2, ER, and PR test outliers.72
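The conversion from fluorescence intensity to MEF can be illustrated with a simple standard curve. The sketch below assumes a log-log linear fit through calibrator beads with assigned MEF values, which is a common approach in flow cytometry but is an assumption here. The calibrator values and the measured intensity are hypothetical.

```python
import math

def fit_loglog(mef_values, intensities):
    """Least-squares fit of log10(MEF) = slope * log10(intensity) + intercept."""
    xs = [math.log10(i) for i in intensities]
    ys = [math.log10(m) for m in mef_values]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def intensity_to_mef(intensity, slope, intercept):
    """Read a measured fluorescence intensity off the standard curve."""
    return 10 ** (slope * math.log10(intensity) + intercept)

# Hypothetical calibrator beads: assigned MEF and measured fluorescence (arbitrary units).
cal_mef = [1e3, 1e4, 1e5, 1e6]
cal_intensity = [20.0, 200.0, 2000.0, 20000.0]
slope, intercept = fit_loglog(cal_mef, cal_intensity)

# Hypothetical peptide-coated microbead with a measured intensity of 640 units.
bead_mef = intensity_to_mef(640.0, slope, intercept)

# With 1 fluorescein per peptide, the MEF value is also the analyte count per bead.
peptides_per_bead = bead_mef
print(round(peptides_per_bead))
```

Because the curve depends only on the fluorescein calibrators, the same fit could be reused for any peptide analyte.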

Schematic illustration of a peptide analyte with an attached fluorescein, represented as the small yellow glowing sphere attached to a lysine (K). Each sphere represents a single amino acid using the single letter amino acid code. The peptide is covalently linked to a glass microbead (left). The fluorescein facilitates calculation of the peptide concentration per bead as the microbead’s fluorescein concentration equals the peptide concentration. Adapted from Vani et al.69

The immediate impact of measurement traceability is the ability to measure the lower limit of detection (LOD), a parameter of analytic sensitivity. A laboratory stains a series of calibrators (traceable to SRM 1932) spanning 1000 to 1,000,000 MEF. The LOD is the lowest concentration that still produces a detectable stain. With image quantification, an exact LOD can be calculated. This measurement can be replicated in laboratories the world over. The ability to objectively and quantitatively measure analytic sensitivity will foster interlaboratory consistency, mirroring the type of validation that Clinical Chemistry and Immunology laboratories routinely perform.
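One way such an LOD calculation might look is sketched below. The detection threshold (mean of blank measurements plus 3 SD) and the interpolation on a log scale between the flanking calibrator levels are assumptions made for illustration and are not specified in the text. All intensity values are hypothetical.

```python
import math
from statistics import mean, stdev

def detection_threshold(blank_intensities, k=3.0):
    """Assumed criterion: a stain is detectable if it exceeds blank mean + k SD."""
    return mean(blank_intensities) + k * stdev(blank_intensities)

def lowest_detected(levels, threshold):
    """levels: list of (MEF, stain intensity). Lowest MEF whose stain exceeds the threshold."""
    detected = [mef for mef, stain in sorted(levels) if stain > threshold]
    return detected[0] if detected else None

def interpolated_lod(levels, threshold):
    """Estimate where the response crosses the threshold, interpolating in log10(MEF)."""
    ordered = sorted(levels)
    for (m0, s0), (m1, s1) in zip(ordered, ordered[1:]):
        if s0 <= threshold < s1:
            frac = (threshold - s0) / (s1 - s0)
            return 10 ** (math.log10(m0) + frac * (math.log10(m1) - math.log10(m0)))
    return None

# Hypothetical quantified stain intensities (arbitrary units).
blanks = [2, 4, 3, 3, 2, 4]                                   # uncoated beads
calibrators = [(1e3, 3.1), (1e4, 4.0), (1e5, 12.0), (1e6, 60.0)]

threshold = detection_threshold(blanks)
lod_level = lowest_detected(calibrators, threshold)           # visual-style readout
lod_estimate = interpolated_lod(calibrators, threshold)       # quantified estimate
print(lod_level, lod_estimate)
```

The first function mirrors the readout available by eye (the lowest calibrator level that stains), and the second shows how image quantification could refine it to a value between calibrator levels.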


The absence of standards, calibrators, and traceable units of measure means that there is no analytic gold standard, at least in terms of the conventional analytic parameters that are used in other laboratory disciplines. This absence fosters test errors. There is a clinical gold standard, such as the one established by a clinical trial assay. However, the clinical trial standard is fleeting because, without well-defined analytic test performance parameters, it is difficult to reproduce accurately and consistently over time. This makes LDTs especially prone to error. Accurate validation and maintenance of an LDT without the traditional tools for creating regular, accurate feedback requires a higher level of expertise and greater resources. The fact that LDTs are more widely used in Clinical IHC thereby compounds the problem.

Standardization of analytic sensitivity will likely have its greatest impact on companion diagnostics. There is a challenge, however, in retroactively identifying the appropriate analytic sensitivity cutoffs after the pivotal clinical trials have long since been completed. Going forward, companion diagnostic IHC tests will ideally incorporate reliable measures of analytic test performance before commencing clinical trials. In this way, there is analytic consistency in the IHC test from the clinical trial assay to commercial manufacture to the hospital laboratory. Day-to-day patient test results will then optimally correlate with the clinical trial data as they were presented for regulatory approval.


The author is grateful for the helpful manuscript critiques by Dr Seshi Sompuram, Ms Vani Kodela, Ms Anika Schaedle, Dr Ron Zeheb, and Dr Gary Horowitz.


1. ISO/TS22367. Medical laboratories—reduction of error through risk management and continual improvement. In: Standardization IOf, ed. 2008. Available at: Accessed February 15, 2019.
2. Plebani M. Errors in clinical laboratories or errors in laboratory medicine? Clin Chem Lab Med. 2006;44:750–759.
3. Cheung C, D’Arrigo C, Dietel M, et al. Evolution of quality assurance for clinical immunohistochemistry in the era of precision medicine: part 1: fit for purpose approach to classification of clinical immunohistochemistry biomarkers. Appl Immunohistochem Mol Morphol. 2017;25:4–11.
4. Paik S, Bryant J, Tan-Chiu E, et al. Real-world performance of HER2 testing—national surgical adjuvant breast and bowel project experience. J Natl Cancer Inst. 2002;94:852–854.
5. Roche P, Suman V, Jenkins R, et al. Concordance between local and central HER2 testing in the breast intergroup trial N9831. J Natl Cancer Inst. 2002;94:855–857.
6. Perez E, Suman V, Davidson N, et al. HER2 testing by local, central, and reference laboratories in specimens from the North Central Cancer Treatment Group N9831 intergroup adjuvant trial. J Clin Oncol. 2006;24:3032–3038.
7. Reddy J, Reimann J, Anderson S, et al. Concordance between central and local laboratory HER2 testing from a community-based clinical study. Clin Breast Cancer. 2006;7:153–157.
8. Badve S, Baehner F, Gray R, et al. Estrogen- and progesterone-receptor status in ECOG 2197: comparison of immunohistochemistry by local and central laboratories and quantitative reverse transcription polymerase chain reaction by central laboratory. J Clin Oncol. 2008;26:2473–2481.
9. Polley M, Leung S, McShane L, et al. An international Ki67 reproducibility study. J Natl Cancer Inst. 2013;105:1897–1906.
10. Huang D, Lu N, Fan Q, et al. HER2 status in gastric and gastroesophageal junction cancer assessed by local and central laboratories: Chinese results of the HER-EAGLE study. PLoS One. 2013;8:e80290.
11. Kaufman P, Bloom K, Burris H, et al. Assessing the discordance rate between local and central HER2 testing in women with locally determined HER2-negative breast cancer. Cancer. 2014;120:2657–2664.
12. McCullough A, Dell’orto P, Reinholz M, et al. Central pathology laboratory review of HER2 and ER in early breast cancer: an ALTTO trial [BIG 2-06/NCCTG N063D (Alliance)] ring study. Breast Cancer Res Treat. 2014;143:485–492.
13. Griggs J, Hamilton A, Schwartz K, et al. Discordance between original and central laboratories in ER and HER2 results in a diverse, population-based sample. Breast Cancer Res Treat. 2017;161:375–384.
14. Orlando L, Viale G, Bria E, et al. Discordance in pathology report after central pathology review: implications for breast cancer adjuvant treatment. Breast. 2016;30:151–155.
15. Pinder S, Campbell A, Bartlett J, et al. Discrepancies in central review re-testing of patients with ER-positive and HER2-negative breast cancer in the OPTIMA prelim randomised clinical trial. Br J Cancer. 2017;116:859–863.
16. Canda T, Yavuz E, Özdemir N, et al. Immunohistochemical HER2 status evaluation in breast cancer pathology samples: a multicenter, parallel-design concordance study. Eur J Breast Health. 2018;14:160–165.
    17. Rosa M, Khazai L. Comparison of HER2 testing among laboratories: our experience with review cases retested at Moffitt Cancer Center in a two-year period. Breast J. 2018;24:139–147.
    18. Vyberg M, Nielsen S. Proficiency testing in immunohistochemistry—experiences from Nordic Immunohistochemical Quality Control (NordiQC). Virchows Arch. 2016;468:19–29.
    19. Ibrahim M, Parry S, Wilkinson D, et al. ALK immunohistochemistry in NSCLC: discordant staining can impact patient treatment regimen. J Thorac Oncol. 2016;11:2241–2247.
    20. Keppens C, Tack V, Hart N, et al. A stitch in time saves nine: external quality assessment rounds demonstrate improved quality of biomarker analysis in lung cancer. Oncotarget. 2018;9:20524–20538.
    21. Carraro P, Plebani M. Errors in a stat laboratory: types and frequencies 10 years later. Clin Chem. 2007;53:1338–1342.
    22. Lippi G, Plebani M, Simundic A-M. Quality in laboratory diagnostics: from theory to practice. Biochem Medica. 2010;20:126–130.
    23. Szecsi P, Ødum L. Error tracking in a clinical biochemistry laboratory. Clin Chem Lab Med. 2009;47:1253–1257.
    24. Lapworth R, Teal T. Laboratory blunders revisited. Ann Clin Biochem. 1994;31:78–84.
    25. Witte D, VanNess S, Angstadt D, et al. Errors, mistakes, blunders, outliers, or unacceptable results: how many? Clin Chem. 1997;43:1352–1356.
    26. Plebani M, Carraro P. Mistakes in a stat laboratory: types and frequency. Clin Chem. 1997;43:1348–1351.
    27. Hammerling J. A review of medical errors in laboratory diagnostics and where we are today. Lab Med. 2012;43:41–44.
    28. Bonini P, Plebani M, Ceriotti F, et al. Errors in laboratory medicine. Clin Chem. 2002;48:691–698.
    29. Howanitz P. Errors in laboratory medicine: practical lessons to improve patient safety. Arch Pathol Lab Med. 2005;129:1252–1261.
    30. Khoury T. Delay for formalin fixation (cold ischemia time) effect on breast cancer molecules. Am J Clin Pathol. 2018;149:275–292.
    31. Agrawal L, Engel K, Greytak S, et al. Understanding preanalytical variables and their effects on clinical biomarkers of oncology and immunotherapy. Semin Cancer Biol. 2018;52:26–38.
    32. Neumeister V, Juhl H. Tumor pre-analytics in molecular pathology: impact on protein expression and analysis. Curr Pathobiol Rep. 2018;6:265–274.
    33. Torlakovic E, Cheung C, D’Arrigo C, et al. Evolution of quality assurance for clinical immunohistochemistry in the era of precision medicine—part 2: immunohistochemistry test performance characteristics. Appl Immunohistochem Mol Morphol. 2017;25:79–85.
    34. Torlakovic E, Cheung C, D’Arrigo C, et al. Evolution of quality assurance for clinical immunohistochemistry in the era of precision medicine. Part 3: technical validation of immunohistochemistry (IHC) assays in clinical IHC laboratories. Appl Immunohistochem Mol Morphol. 2017;25:151–159.
    35. Cheung C, D’Arrigo C, Dietel M, et al. Evolution of quality assurance for clinical immunohistochemistry in the era of precision medicine. Part 4: tissue tools for quality assurance in immunohistochemistry. Appl Immunohistochem Mol Morphol. 2017;25:227–230.
    36. Torlakovic E, Nielsen S, Francis G, et al. Standardization of positive controls in diagnostic immunohistochemistry: recommendations from the international ad hoc expert committee. Appl Immunohistochem Mol Morphol. 2015;23:1–18.
    37. Gown A. Diagnostic immunohistochemistry: what can go wrong and how to prevent it. Arch Pathol Lab Med. 2016;140:893–898.
    38. Perez E, Press M, Dueck A, et al. Immunohistochemistry and fluorescence in situ hybridization assessment of HER2 in clinical trials of adjuvant therapy for breast cancer (NCCTG N9831, BCIRG 006, BCIRG 005). Breast Cancer Res Treat. 2013;138:99–108.
    39. Dowsett M, Hanna W, Kockx M, et al. Standardization of HER2 testing: results of an international proficiency-testing ring study. Mod Pathol. 2007;20:584–591.
    40. Henry R, Segalove M. Running of standards in clinical chemistry and the use of the control chart. J Clin Pathol. 1952;5:305–311.
    41. Levey S, Jennings E. The use of control charts in the clinical laboratory. Am J Clin Pathol. 1950;20:1059–1065.
    42. Radin N. What is a standard? Clin Chem. 1967;13:55–76.
    43. Tietz N. A model for a comprehensive measurement system in Clinical Chemistry. Clin Chem. 1979;25:833–839.
    44. Westgard J, Barry P, Hunt M, et al. A multi-rule Shewhart chart for quality control in clinical chemistry. Clin Chem. 1981;27:493–501.
    45. Plebani M. Exploring the iceberg of errors in laboratory medicine. Clin Chim Acta. 2009;404:16–23.
    46. Belk W, Sunderman F. A survey of the accuracy of chemical analyses in clinical laboratories. Am J Clin Pathol. 1947;17:853–861.
    47. Tonks D, Allen R. The accuracy of glucose determinations in some Canadian hospital laboratories. Can Med Assoc J. 1955;72:605–607.
    48. Wootton I, King E. Normal values for blood constituents inter-hospital differences. Lancet. 1953;261:470–471.
    49. Shuey H, Cebel J. Standards of performance in clinical laboratory diagnosis. Bull US Army Med Dep. 1949;9:799–815.
    50. Steindel S, Howanitz P, Renner S. Reasons for proficiency testing failures in clinical chemistry and blood gas analysis: a College of American Pathologists Q-Probes study in 665 laboratories. Arch Pathol Lab Med. 1996;120:1094–1101.
    51. Fitzgibbons P, Bradley L, Fatheree L, et al. Principles of analytic validation of immunohistochemical assays: guideline from the College of American Pathologists Pathology and Laboratory Quality Center. Arch Pathol Lab Med. 2014;138:1432–1443.
    52. Torlakovic E, Nielsen S, Vyberg M, et al. Getting controls under control: the time is now for immunohistochemistry. J Clin Pathol. 2015;68:879–882.
    53. Rhodes A. Developing a cell line standard for HER2/neu. Cancer Biomark. 2005;1:229–232.
    54. Rhodes A, Jasani B, Anderson E, et al. Evaluation of HER-2/neu immunohistochemical assay sensitivity and scoring on formalin-fixed and paraffin-processed cell lines and breast tumors: a comparative study involving results from laboratories in 21 countries. Am J Clin Pathol. 2002;118:408–417.
    55. Wolff A, Hammond M, Hicks D, et al. Recommendations for human epidermal growth factor receptor 2 testing in breast cancer: American Society of Clinical Oncology/College of American Pathologists Clinical Practice Guideline Update. J Clin Oncol. 2013;31:3997–4013.
    56. Hewitt S, Robinowitz M, Bogen S, et al. Quality Assurance for Design Control and Implementation of Immunohistochemistry Assays; Approved Guideline—Second Edition, CLSI Document I/LA28-A2. Wayne, PA: Clinical and Laboratory Standards Institute; 2011.
    57. CAP-ACP National Standards for High Complexity Laboratory Testing C. CAP-ACP Clinical Immunohistochemistry Checklists: Part I and Part II. Ottawa, Canada: Canadian Association of Pathologists/Association Canadienne Des Pathologistes. Available at: Accessed September 10, 2015.
    58. Rakha E, Pinder S, Bartlett J, et al. Updated UK recommendations for HER2 assessment in breast cancer. J Clin Pathol. 2015;68:93–99.
    59. Lambein K, Guiot Y, Galant C, et al. Update of the Belgian guidelines for HER2 testing in breast cancer. Belg J Med Oncol. 2014;8:109–115.
    60. Rasmussen O, Jorgensen R. Chapter 14: controls. In: Taylor C, Rudbeck L, eds. Immunohistochemical Staining Methods. Glostrup, Denmark: Dako Corporation; 2014:160–169.
    61. Cheung C, Taylor C, Torlakovic E. An audit of failed immunohistochemical slides in a clinical laboratory: the role of on-slide controls. Appl Immunohistochem Mol Morphol. 2017;25:308–312.
    62. Anatomic Pathology Checklist. In: College of American Pathologists, ed. Checklist question ANP22550. 2015.
    63. Vani K, Sompuram S, Schaedle A, et al. The importance of epitope density in selecting a positive IHC control. J Histochem Cytochem. 2017;65:463–477.
    64. Vani K, Sompuram S, Naber S, et al. Levey-Jennings analysis uncovers unsuspected causes of immunohistochemistry stain variability. Appl Immunohistochem Mol Morphol. 2016;24:688–694.
    65. Hammond M, Barker P, Taube S, et al. Standard reference material for Her2 testing: report of a National Institute of Standards and Technology-sponsored consensus workshop. Appl Immunohistochem Mol Morphol. 2003;11:103–106.
    66. Xiao Y, Gao X, Maragh S, et al. Cell lines as candidate reference materials for quality control of ERBB2 amplification and expression assays in breast cancer. Clin Chem. 2009;55:1307–1315.
    67. Lih C, Si H, Das B, et al. Certified DNA reference materials to compare HER2 gene amplification measurements using next-generation sequencing methods. J Mol Diagn. 2016;18:753–761.
    68. Sompuram S, Vani K, Tracey B, et al. Standardizing immunohistochemistry: a new reference control for detecting staining problems. J Histochem Cytochem. 2015;63:681–690.
    69. Vani K, Sompuram S, Schaedle A, et al. Analytic response curves of clinical breast cancer IHC tests. J Histochem Cytochem. 2017;65:273–283.
    70. Sompuram S, Kodela V, Ramanathan H, et al. Synthetic peptides identified from phage-displayed combinatorial libraries as immunodiagnostic assay surrogate quality control targets. Clin Chem. 2002;48:410–420.
    71. Sompuram S, Vani K, Schaedle A, et al. Selecting an optimal positive IHC control for verifying antigen retrieval. J Histochem Cytochem. 2019. doi: 10.1369/0022155418824092.
    72. Sompuram S, Vani K, Schaedle A, et al. Quantitative assessment of immunohistochemistry laboratory performance by measuring analytic response curves and limits of detection. Arch Pathol Lab Med. 2018;142:851–862.

    immunohistochemistry; error; root cause; standard; calibrator; quality control

    Copyright © 2019 The Author(s). Published by Wolters Kluwer Health, Inc