PD-L1 testing by immunohistochemistry (IHC) has had, and continues to have, a significant impact on determining patient eligibility for specific immunotherapy drugs targeting the PD-1/PD-L1 checkpoint. CIQC proficiency testing (PT) for PD-L1 as a predictive biomarker was initiated to support Canadian laboratories with information on the quality of predictive testing for PD-L1 status in NSCLC and possibly other tumors as new indications for the testing evolve. The scope of the CIQC PT for PD-L1 testing was to compare the results of testing performed by participating laboratories (CIQC member laboratories) with the current “gold standard”/reference standard results. Diagnostic accuracy, as described by the STARD statement,1 is particularly pertinent to PT for predictive biomarkers, where the aim is to demonstrate that a participant laboratory’s testing protocol generates results with high diagnostic accuracy compared with those of a designated gold standard/reference testing protocol, on the same set of samples scored by the same readers using the same readout criteria. Similarly, the 2008 Clinical Laboratory Standards Institute (CLSI) EP12-AE Approved Guideline recommends calculation of sensitivity and specificity for determination of qualitative assay performance in which a “candidate test” is compared against a “comparator test.”2 Although this approach is not traditionally employed for PT, two pressures push PT programs to adopt new models of testing that are “fit-for-purpose”: the development of precision medicine, with its ever-increasing expectation of high accuracy in predictive biomarker testing, and the increasing expectation of clinical laboratories that materials used for PT runs should contribute to the validation of their biomarker assays.
PT programs for IHC testing have historically focused on analytical sensitivity, analytical specificity, and reportable range. These test performance characteristics alone will no longer be sufficient as measures of quality in IHC testing: as predictive biomarker testing in precision medicine expands, evidence of quality will demand proof of diagnostic accuracy. This is especially true for PT challenges, for which no evidence has yet been generated that links analytical sensitivity and specificity with diagnostic sensitivity and specificity. Because this link has not been established, diagnostic sensitivity and specificity need to be assessed directly, rather than indirectly through analytical sensitivity and specificity.
PT should be able to answer the question of whether testing results of participating laboratories are considered “acceptable” when compared to results obtained by a designated gold standard/reference laboratory. For predictive biomarkers, the goal is not to demonstrate whether a participant laboratory’s testing results are similar/comparable to the testing results of other laboratories (ie, concordance), but rather whether they are similar/comparable to the designated gold standard/reference laboratory testing results using clinically relevant test performance characteristics (ie, diagnostic sensitivity and specificity). Hence, it is accuracy, rather than concordance or other measures of test similarity [eg, Pearson correlation, intraclass correlation (ICC), etc.], that is the relevant measure of test quality for predictive biomarkers. The main objective of the CIQC PD-L1 educational PT challenge was to learn how to design and analyze the results of PT challenges in order to provide fair evaluation of PD-L1 testing performance of participating laboratories. The principles of test comparison presented in this paper are generally applicable to IHC assays for predictive biomarkers.
MATERIALS AND METHODS
Tissue Microarray (TMA) Design for Diagnostic Accuracy
Proficiency testing materials consisted of the following elements: (1) TMA tissue cores from 104 randomly selected primary NSCLC with no specific preselection bias towards cases designated as “PD-L1 positive” or “PD-L1 negative” and (2) Tissue cores of tonsil.
Member Laboratories: Participant and Reference Laboratories
All CIQC member laboratories that expressed interest in participating in the educational PD-L1 PT exercise received slides mounted with unstained sections of the PT tissue tools developed for this purpose. All participating laboratories were asked to apply their PD-L1 IHC protocol(s), complete the self-assessment, and return the stained slides to CIQC for the expert assessment. The choice of reference laboratory was based on the fit-for-purpose “3D” (disease-drug-diagnostic assay) approach underlying PD-L1 testing as a predictive biomarker, whereby the results generated by the laboratory of the manufacturer of a United States Food and Drug Administration (FDA)-approved and/or a Conformité Européenne (CE)-marked PD-L1 diagnostic assay that has been clinically validated to predict potential response to a particular drug in a particular disease are used as the reference (gold standard) result (RefLabRefKit). As such, 2 reference laboratories were included: Roche Tissue Diagnostics (Ventana) and Agilent Technologies (Dako). Roche Tissue Diagnostics (Ventana) was the reference laboratory for protocols using SP263 and SP142 as primary antibodies (“SP263 RefLabRefKit” and “SP142 RefLabRefKit”). Agilent Technologies (Dako) was the reference laboratory for protocols using 22C3 and 28-8 as primary antibodies (“22C3 RefLabRefKit” and “28-8 RefLabRefKit”). No reference laboratory was available/applicable for protocols using E1L3N as the primary antibody; therefore, the consensus results for each tumor sample from the 3 laboratories using E1L3N were used for calculations. Sample tissue cores with no consensus on expert assessment readout (eg, one laboratory positive, one negative, and one tissue unsatisfactory) were excluded from calculations.
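The handling of E1L3N cores described above can be illustrated with a minimal sketch. The exact consensus rule is not stated in the text; the sketch assumes, purely for illustration, that a core has a consensus when a strict majority of the laboratories report the same evaluable call. The function name `consensus_call` and the string labels are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

EVALUABLE = ("positive", "negative")

def consensus_call(calls: Sequence[str]) -> Optional[str]:
    """Return the majority call for one core, or None if there is no consensus.

    Assumption (not from the source): consensus means that more than half
    of all submitted calls agree on "positive" or on "negative".
    """
    counts = Counter(c for c in calls if c in EVALUABLE)
    if not counts:
        return None  # no evaluable call at all
    call, n = counts.most_common(1)[0]
    return call if n > len(calls) / 2 else None

# The example from the text: one positive, one negative, one unsatisfactory.
# There is no majority, so the core would be excluded from calculations.
excluded = consensus_call(["positive", "negative", "unsatisfactory"]) is None
```

Under a stricter rule (for example, requiring all three laboratories to agree), only the comparison in the last line of the function would change.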
All laboratories submitted their own readout of the IHC slides stained with their laboratory’s protocol using TMA Scorer, the CIQC on-line reporting application (http://www.tmascorersystem.ca/login.php). The results were submitted as “positive,” “negative,” or “tissue unsatisfactory for evaluation.” “Equivocal” was also allowed for the self-assessment in the educational run since it is a default category for other CIQC PT runs but will be excluded from future CIQC PT runs for PD-L1 as this category is not included/required in any of the currently recognized readout schemes.
Self-assessment readout results were evaluated for the agreement between positive results obtained by participant laboratories compared to those obtained by the reference laboratory (ie, diagnostic sensitivity) and the agreement between negative results obtained by participant laboratories compared with those obtained by the reference laboratory (ie, diagnostic specificity) using self-assessment results for the same primary antibody clone as the one used by the respective reference laboratory. This approach was employed with the assumption, as a working hypothesis, that the use of a specific PD-L1 primary antibody clone by the participant laboratory either as part of a laboratory developed test (LDT), or a FDA-approved kit and/or a CE-marked kit (RefKit), reflected the same purpose as the reference laboratory’s reference kit (RefLabRefKit) for the same clone.
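As an illustration of this calculation, the following sketch computes diagnostic sensitivity and specificity from paired per-core calls. It assumes that cores reported as anything other than “positive” or “negative” by either laboratory (eg, “tissue unsatisfactory for evaluation” or “equivocal”) are left out of the 2×2 table; the text does not state how such cores were treated, and the function name `diagnostic_accuracy` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

EVALUABLE = ("positive", "negative")

def diagnostic_accuracy(
    participant: Sequence[str], reference: Sequence[str]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (sensitivity, specificity) of participant calls vs reference calls.

    Cores that are not evaluable in either readout are skipped (an assumption).
    A value is None when the reference has no positive (or no negative) cores.
    """
    tp = fn = tn = fp = 0
    for p, r in zip(participant, reference):
        if p not in EVALUABLE or r not in EVALUABLE:
            continue
        if r == "positive":
            if p == "positive":
                tp += 1
            else:
                fn += 1
        else:
            if p == "negative":
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity

# Toy data (not study data): 4 evaluable cores, 1 unsatisfactory core.
lab = ["positive", "negative", "negative", "positive", "unsatisfactory"]
ref = ["positive", "positive", "negative", "negative", "positive"]
sens, spec = diagnostic_accuracy(lab, ref)
```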
Calculations for E1L3N were performed using the consensus results from the participant laboratories and against the 4 designated gold standards since the participant laboratories could have adjusted their readout for any of the 4 different purposes linked to the PD-L1 RefKits.
For the expert assessment, all cores for all slides submitted by participating laboratories and reference laboratories were evaluated for consensus tumor proportion score (TPS) as a continuous variable (0 to 100%) by 2 experienced expert assessors (CCC, EET) who had also received formal training in the readout of PD-L1 RefKits employing the TPS for non-small cell lung cancer (NSCLC). TPS with 1%, 5%, 10%, 25%, and 50% cut-off points were analyzed in order to evaluate laboratory performance when the readout changes for different purposes/indications. Other readout types were not performed but will be incorporated into future CIQC PT challenges for PD-L1 where participating laboratories indicate use of such scoring for their clinical practice, in order to reflect a fit-for-purpose approach.
As the expert assessment removed “noise” from different readers, the TPS readout results for different protocols at different cut-off points facilitated assessment of the technical performance of the IHC protocol. Therefore, the analysis of the expert assessment results reflected diagnostic accuracy of the IHC protocol, rather than the whole IHC assay.
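The conversion of a continuous TPS into a positive/negative call at each cut-off point can be sketched as follows. The cut-off points are those reported in the Results (1%, 5%, 10%, 25%, and 50%). The sketch assumes that a core is called positive when its TPS is greater than or equal to the cut-off; the names `dichotomize` and `calls_by_cutoff` are hypothetical.

```python
from typing import Dict, List, Sequence

CUTOFFS = (1, 5, 10, 25, 50)  # TPS cut-off points, in percent

def dichotomize(tps: float, cutoff: float) -> str:
    """Call a core positive when TPS >= cutoff (assumed convention)."""
    return "positive" if tps >= cutoff else "negative"

def calls_by_cutoff(
    tps_values: Sequence[float], cutoffs: Sequence[float] = CUTOFFS
) -> Dict[float, List[str]]:
    """Map each cut-off point to the list of per-core calls."""
    return {c: [dichotomize(t, c) for t in tps_values] for c in cutoffs}

# Toy TPS values (not study data) for four cores.
calls = calls_by_cutoff([0, 3, 30, 60])
```

The same set of stained cores therefore yields a different set of positive/negative calls at each cut-off point, which is why protocol accuracy has to be evaluated separately for each one.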
Feedback to Laboratories
Laboratories received feedback on diagnostic sensitivity and specificity for self-assessment against relevant designated gold standard results as well as for the results of the expert assessment. “Pass” and “fail” results were not issued to laboratories since there is currently no international expert consensus on what thresholds should be used for accuracy (sensitivity and specificity) of PD-L1 IHC protocols for different purposes.
Hypothetical Pass/Fail Analysis
In general, CIQC does not calculate “pass” or “fail” rates for educational runs. Although “pass/fail” designations were not returned to participants, for this manuscript we have extended the analysis of the data from both the self and expert assessments in order to hypothetically determine whether the diagnostic accuracy of the protocols would be considered successful according to criteria set out for two analogous Type 2 predictive IHC biomarkers,3 namely estrogen receptor (ER) and progesterone receptor (PgR) in breast cancer, by the American Society of Clinical Oncology and the College of American Pathologists (ASCO/CAP).4,5 For both self and expert assessments, we used results from the RefKits that exist for four PD-L1 clones (22C3, 28-8, SP142, SP263) in NSCLC, performed by the relevant RefLabs, as the designated gold standard for diagnostic accuracy of the IHC protocols. We then determined whether the achieved diagnostic accuracies met the ASCO/CAP criteria (for ER and PgR in breast cancer) of 90% agreement for positive cores and 95% agreement for negative cores.4,5
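The hypothetical pass/fail rule amounts to a simple check against two thresholds, sketched below. The thresholds are the ASCO/CAP values quoted above; the function name `meets_criteria` is hypothetical, and sensitivity and specificity are expressed as fractions.

```python
SENSITIVITY_THRESHOLD = 0.90  # agreement for positive cores
SPECIFICITY_THRESHOLD = 0.95  # agreement for negative cores

def meets_criteria(sensitivity: float, specificity: float) -> bool:
    """Return True only when both thresholds are met."""
    return (
        sensitivity >= SENSITIVITY_THRESHOLD
        and specificity >= SPECIFICITY_THRESHOLD
    )

# Values reported in the Results at the 50% cut-off point:
# 22C3 RefKit against the 28-8 gold standard (86% / 99%) does not meet the rule;
# 28-8 RefKit against the 22C3 gold standard (92% / 98%) does.
result_22c3_vs_288 = meets_criteria(0.86, 0.99)
result_288_vs_22c3 = meets_criteria(0.92, 0.98)
```

Because both conditions must hold, a protocol with very high specificity still fails the rule when its sensitivity falls below 90%.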
Descriptive statistics were used to summarize the data. Sensitivity and specificity analyses were conducted to examine accuracy of results compared with relevant designated gold standard.5
CIQC received results for PD-L1 protocols that included 5 primary anti-PD-L1 antibody clones: 22C3, 28-8, SP263, SP142, and E1L3N.
The results from the self-assessment for participant laboratories are summarized in Table 1. For laboratories submitting results using primary antibody clones with corresponding RefKits for NSCLC (ie, 28-8, 22C3, SP263, SP142), 2 of 9 (22%) achieved a diagnostic sensitivity of ≥90% against their relevant gold standards; 6 of 9 (67%) achieved a diagnostic specificity of ≥95% against their relevant gold standards. Only 2 of 9 (22%) laboratories achieved both diagnostic sensitivity ≥90% and diagnostic specificity ≥95% against their relevant gold standards; both laboratories used the RefKit for the 22C3 clone.
Three laboratories submitted results using clone E1L3N, which has no corresponding RefKit for NSCLC. Against the gold standard of the RefKit containing the 28-8 clone, none of the 3 laboratories (0%) achieved a diagnostic sensitivity of ≥90%, whereas 2 of the 3 (67%) achieved a diagnostic specificity ≥95%. Against the gold standard of the RefKit containing the SP263 clone, one of the 3 laboratories (33%) achieved a diagnostic sensitivity of ≥90%, whereas none of the 3 (0%) achieved a diagnostic specificity ≥95%. Against the gold standard of the RefKit containing the 22C3 clone, all 3 laboratories (100%) achieved a diagnostic sensitivity of ≥90% while none of the 3 (0%) achieved a diagnostic specificity ≥95%. Against the gold standard of the RefKit containing the SP142 clone, none of the 3 laboratories (0%) achieved a diagnostic sensitivity of ≥90%, whereas one of the 3 (33%) achieved a diagnostic specificity ≥95%. None (0%) of the laboratories using the E1L3N clone achieved both diagnostic sensitivity ≥90% and diagnostic specificity ≥95% against any designated gold standard.
As there are 4 designated RefLabRefKit combinations for the educational run, each for one of the RefKits for NSCLC using 4 different PD-L1 clones (ie, 22C3, 28-8, SP142, SP263), each reference laboratory also acts as a participant laboratory when it is not acting as a reference laboratory. When these results are analyzed against each other, and with the assumption that each RefLab submitted self-assessment results from their own RefKit in a fit-for-purpose manner, the diagnostic sensitivity and specificity change depending on the selected gold standard (data not shown). These results indicate that the SP263 RefKit had acceptable diagnostic sensitivity and specificity against the 22C3 RefKit when the readout was also generated by the RefLab, but not the other way around because of lower sensitivity of the 22C3 RefKit protocol/readout combination when the SP263 RefKit with the RefLab readout acts as the gold standard. None of the other combinations reached the desired diagnostic sensitivity and specificity when being reported by the RefLabs using what are presumably different cut-off points based on the specific purpose of each of their RefKits.
Expert Assessment Results
The results of the expert assessment readout for eleven protocol/laboratory combinations are shown in Figure 1 as percent agreement for positive and negative cores using four different designated gold standards each with five different cut-off points. Using this approach allowed the analysis to encompass four potential different purposes.
As described in Methods, the expert assessment analyses revealed accuracy of the IHC protocols. Figure 1 summarizes results from the expert assessment and shows that diagnostic accuracy of the IHC protocols changes depending on (1) the protocol that is designated as the gold standard and (2) the cut-off point being used.
If we use results from the 22C3 RefKit for NSCLC performed by the designated reference laboratory as the gold standard (Fig. 1, cells highlighted in gold), and follow the ASCO/CAP criteria for breast cancer biomarkers of 90% agreement for positive cores and 95% agreement for negative cores, then we see that 4, 3, 1, 0, 0 protocols achieved both diagnostic sensitivity ≥90% and diagnostic specificity ≥95% at TPS cut-offs of 50%, 25%, 10%, 5%, and 1%, respectively (Fig. 1, cells highlighted in green). In contrast, using results from the 28-8 RefKit for NSCLC performed by the designated reference laboratory as the gold standard, we see that 2, 0, 0, 0, 0 protocols achieved both diagnostic sensitivity ≥90% and diagnostic specificity ≥95% at TPS cut-offs of 50%, 25%, 10%, 5%, and 1%, respectively. Using results from the SP142 RefKit for NSCLC performed by the designated reference laboratory as the gold standard, we see that 5, 2, 0, 0, 1 protocols achieved both diagnostic sensitivity ≥90% and diagnostic specificity ≥95% at TPS cut-offs of 50%, 25%, 10%, 5%, and 1%, respectively. Lastly, using results from the SP263 RefKit for NSCLC performed by the designated reference laboratory as the gold standard, we see that 1, 0, 2, 0, 0 protocols achieved both diagnostic sensitivity ≥90% and diagnostic specificity ≥95% at TPS cut-offs of 50%, 25%, 10%, 5%, and 1%, respectively.
The analyses of the results for the different RefLabRefKit combinations when compared against each other are also included in Figure 1. For example, the accuracy of the IHC protocol results from the 28-8 RefLab, the SP142 RefLab, and the SP263 RefLab are shown against the designated gold standard results from the 22C3 RefLab. At the 50% cut-off point, using results achieved by the 28-8 RefKit as the gold standard, the 22C3 RefKit achieved 86% sensitivity and 99% specificity, while the other way around, the 28-8 RefKit achieved 92% sensitivity and 98% specificity when 22C3 RefKit was used as gold standard. Similarly, the 22C3 RefKit did not reach “pass” scores at 90% sensitivity and 95% specificity for any of the evaluated cut-off points. Careful study of this exercise revealed that from the perspective of determining accuracy of the IHC protocol, none of the PD-L1 RefKits are “interchangeable” with each other at all TPS cut-off points for the designated “pass-fail” criteria.
The conclusions that may be drawn from any PT IHC challenge depend on the design and construction of samples used for the PT challenge, how the results are analyzed, and what criteria are used to determine whether participant laboratories successfully met the challenge. PT challenges must align the scope of IHC PT testing with the scope of IHC clinical testing by participating laboratories in order to design tissue tools appropriate for the evaluation and to define relevant criteria for setting “pass” versus “fail” statuses. Currently, different PT programs/interlaboratory assessment schemes use different tissue tools and assess different test performance characteristics; therefore, the results between different programs are not truly comparable.6–10 Although only a small number of laboratories currently perform PD-L1 testing in Canada and an even smaller number of laboratories participated in the CIQC PD-L1 PT educational run, inclusion of a large number of samples and the availability of designated gold standard results from four RefKits for NSCLC enabled meaningful analyses of protocol accuracy for PD-L1 testing in NSCLC for participating laboratories.
Purpose in PT: Accuracy, Appraisal, and the “3D” Approach
For predictive biomarker testing, especially in the context of companion/complementary diagnostics, purpose can be defined by the “3D” approach whereby a specific drug receives regulatory approval for use in patients with a specific disease condition provided that a valid result from a specific diagnostic assay is available. Each approved “drug-disease-diagnostic assay” combination is unique and the 3 components are typically linked together by clinically validated results (ie, response data from clinical trials). The “3D” approach can also be applied to purpose in PT. The main purpose of a PT program is to evaluate the accuracy of protocols (diagnostic assay) that in the clinical care setting, are meant to appropriately stratify patients with specific conditions (disease) for a specific targeted therapy (drug). However, there also exists a different aspect of purpose in PT, that being whether the purpose for which a participant laboratory is performing (or wants to start performing) any particular IHC protocol in the clinical care setting, aligns with the purpose of the IHC protocol that is designated as being the gold standard result by the PT program. Only if these purposes are aligned, can the PT program provide a legitimate appraisal of whether participant laboratories can be deemed as having “passed” or “failed” a PT challenge.
PT programs typically use pre-established criteria to determine whether the performance of a laboratory is acceptable or not. At present, none of the major PT programs defines its so-called “pass” criteria for IHC in terms of test performance characteristics. However, the introduction of PD-L1 IHC testing, with some of its unique demands, has highlighted the need for PT programs to refine both the methodology and terminology utilized for PT in IHC. In addition, there is a clear requirement that PT challenges be designed, and the results evaluated, in such a manner that appraisals of “pass” or “fail” are relevant to the purpose of the IHC testing.
Although the lack of information from participating laboratories regarding the specific purpose(s) of their testing prevented CIQC from setting fair “pass/fail” thresholds for this educational run, the expert assessment results nevertheless allowed for determination of protocol accuracy using TPS at 5 different cut-off points that are potentially relevant to currently available PD-L1 RefKits. We recommend that participating laboratories examine their results and consider applying the ASCO/CAP recommendation of 90% agreement for positive results (sensitivity) and 95% agreement for negative results (specificity)4,5 as these criteria can reasonably be considered as “acceptable laboratory performance” for any predictive biomarker relevant to targeted therapy; this recommendation would apply until such time as fit-for-purpose guidelines that stipulate the levels of diagnostic sensitivity and specificity to be designated as being clinically acceptable for PD-L1 testing become available.
Evaluating Proficiency in PD-L1 Testing: IHC Assay Results Versus IHC Protocol Results Versus IHC Readout Results
Proficiency testing for PD-L1 can be demanding because there needs to be clear consideration of what is being evaluated: are we evaluating for calibration of the IHC protocol (ie, IHC protocol results, their analytical sensitivity and specificity and overall quality of signal-to-noise ratio), IHC protocol diagnostic sensitivity and specificity (ie, diagnostic accuracy of the IHC protocol which excludes pathologist readout and hence, cannot be determined using the self-assessment approach), IHC assay diagnostic sensitivity and specificity (ie, diagnostic accuracy of the IHC assay which includes pathologist readout and hence, can be determined using the self-assessment approach, but is not informative regarding potential causes of failure), or accuracy±precision of pathologist scoring (ie, IHC readout results)? In general, the CIQC PT design enables evaluation of the IHC protocol (based on expert assessment results) and the IHC assay (based on self-assessment results). Although PT challenges may address the readout of the IHC protocol to varying degrees, in principle, the focus of PT is the performance of a participating laboratory’s IHC protocol. For determination of the accuracy of the IHC protocol (either its analytical sensitivity/specificity or diagnostic sensitivity/specificity) in proficiency testing, 3 elements are necessary: (1) PT materials that are distributed by the PT program to eliminate potentially confounding effects of local pre-analytical differences, (2) results of participant protocols that are generated by expert assessment readout to eliminate potential variability of local readout results, and (3) gold standard results against which the accuracy of participant results can be determined. 
As part of the expert assessment for the educational run, participants were asked to stain the PT materials sent out by CIQC, submit their readout, and then to return the slides to CIQC so that an expert consensus readout of all the stained slides could be performed. In this way, inter-participant readout variability was eliminated, readout “noise” minimized, and IHC protocol accuracy could be evaluated. However, since the specific purpose of the PD-L1 testing by participating laboratories was not articulated for the educational run a priori, a more extensive analysis of the data was performed in order to account for the possibility that participants may be performing testing for different purposes. We chose to use TPS for the expert assessment readout to determine protocol accuracy because of its commonality as a readout component of all PD-L1 assays that currently have approval as a companion/complementary diagnostic assay for predicting potential response to immunotherapies in NSCLC.
For the self-assessment, participants were asked to stain slides sent out by CIQC and submit a locally performed readout. Although the participants could have employed various different readout methods, the results were analyzed on the assumption that participants submitted readout results based on the readout/cutoff requirements of the RefKits for NSCLC containing the PD-L1 primary antibody clone that the participant used to stain the CIQC PT materials. Future CIQC PD-L1 PT challenges will offer more direction to participants with respect to the self-assessment portion of the results in order to determine whether the results of the PT challenge are “fit-for-purpose.”
Since the determination of overall laboratory performance was better with expert assessment results than with self-assessment results, this may suggest that: (1) the IHC slide staining (IHC protocol) results are better than suggested by self-assessment results, (2) CIQC may have incorrect assumptions about the purpose of the testing performed by participating laboratories that did not use RefKits, and (3) separate evaluation of pathologist readout of PD-L1 testing is required.
Diagnostic Accuracy in Proficiency Testing: A Metric That Matters for Laboratory Performance
For most predictive biomarkers, it is diagnostic accuracy, as measured by diagnostic sensitivity and specificity, that matters to clinical practice. PT programs often assess technical/analytical sensitivity and specificity by using small numbers of tissue samples containing a designated descriptive limit of detection (LOD) with an iCAPC–like design or cell lines.11 However, for PD-L1 predictive assays, the desired analytical sensitivity/specificity that would ensure acceptable diagnostic sensitivity and specificity has not been determined for any of the existing assays. Hence, it is not possible to link laboratory performance in proficiency testing that relies on analytical sensitivity/specificity with laboratory performance in clinical practice (diagnostic accuracy).
The availability of designated gold standard results enables evaluation of accuracy (comparison of a candidate protocol to a gold standard comparative protocol), rather than evaluation of concordance (comparison of results to each other without a designated gold standard).5 In addition, because PD-L1 is a qualitative assay with cut-off points tailored to specific purposes, analysis of results without the consideration of such cut-off points would not enable a meaningful evaluation of diagnostic accuracy. If a gold standard is not available, which on occasion may be the case, then this should be clearly stated by the PT program, and calculations of diagnostic sensitivity and specificity should be replaced by percent positive agreement (PPA) and percent negative agreement (PNA) if the number of samples allows for it. If only analytical sensitivity/specificity are explored, then this should also be stated. Our data indicate that an IHC protocol deemed accurate for one purpose may be inaccurate (because of lower diagnostic sensitivity and specificity) for another purpose (eg, another cut-off point) (Fig. 1).
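For reference, the quantities named above can be written out for a 2×2 comparison of a candidate protocol against a comparator. The cell labels $a$, $b$, $c$, and $d$ are introduced here only for illustration and do not appear in the original analysis: $a$ is the number of cores positive by both, $b$ positive by the candidate only, $c$ positive by the comparator only, and $d$ negative by both.

```latex
\[
\text{Sensitivity} = \frac{a}{a + c} \times 100\%,
\qquad
\text{Specificity} = \frac{d}{b + d} \times 100\%
\]
\[
\text{PPA} = \frac{a}{a + c} \times 100\%,
\qquad
\text{PNA} = \frac{d}{b + d} \times 100\%
\]
```

The arithmetic is the same in both cases. The difference lies in the status of the comparator: the terms sensitivity and specificity are used when the comparator is a designated gold standard, whereas PPA and PNA are used when it is not.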
Therefore, the aim of determining whether laboratories are “right” or “wrong” regarding stratification of patients by using PD-L1 IHC assay as a predictive marker can only be fulfilled by determining diagnostic accuracy of the IHC protocol for specific purposes (in our exercise this is achieved by designating gold standards and using different cut-off points). Although it is feasible to do so, systematic error, which is addressed by Bland-Altman plots, was not evaluated by this PT run. Various other approaches and calculations may provide interesting and potentially relevant information; however, diagnostic accuracy addresses clinically relevant test performance and was thus deemed “fit-for-purpose” for this PT challenge.
“Fit-for-purpose” in PT entails that:
- The purpose(s) of each PT challenge is declared a priori. For predictive biomarkers, this is exemplified by the “3D” approach (drug-disease-diagnostic assay).
- The PT program has selected/designated purpose-specific gold standard results for the PT challenge.
- The PT materials for the PT challenge are designed and constructed to enable properly powered statistical calculations of relevant test performance characteristics (eg, diagnostic accuracy by diagnostic sensitivity/specificity).