Pembrolizumab, a humanized antibody targeting programmed death receptor 1, was recently approved by the Food and Drug Administration (FDA) for first-line therapy of patients with advanced non–small cell lung carcinoma.1 The programmed death-ligand 1 (PD-L1) immunohistochemistry 22C3 pharmDx Kit (Agilent-Dako) is an approved companion diagnostic that detects the corresponding ligand, PD-L1. The companion diagnostic test is an on-label requirement for prescribing pembrolizumab in the first-line setting and uses a clinical scoring approach to classify patients for treatment on the basis of tumoral PD-L1 expression. In lung cancers, PD-L1 expression is evaluated by the Tumor Proportion Score (TPS). A specimen is considered to express PD-L1 if the TPS is ≥1% and to have high PD-L1 expression if the TPS is ≥50%; in clinical trials, pembrolizumab showed a high response rate in patients with over 50% ligand expression.2,3 The test is optimized to a threshold value based upon the number of cells showing PD-L1 expression on tumor cell (TC) membranes4 at or above the threshold intensity. In lung cancer, the 22C3 pharmDx test does not evaluate non-TCs, although it now does so when used in gastric or cervical cancer. The stromal cells, also called “immune cells,” are defined on the basis of the complementary diagnostic test for atezolizumab but are now also evaluated in diagnostic tests associated with other programmed death 1 axis drugs.
PD-L1 expression in lymphoid and other immune effector cells is a special challenge for pathologists. Recent publications have shown that interpretation of the “programmed death-ligand 1 immunohistochemistry score” (percentage of cells staining) by pathologists is reproducible for estimates of TC staining but not for immune cell staining.3,5 Even with extensive training, pathologists were unable to score immune cell staining reproducibly.6
A challenge for immunohistochemistry assays is that they are optimized to provide a binary outcome, but in reality, they detect biomarkers that are innately expressed as a continuous variable. This scenario poses a problem for pathologists, who are asked to assess target protein expression visually across a population of many thousands of cells in a tissue section and to arrive at a score (percentage of biomarker-expressing cells over a nonstandardized threshold) for TCs, or for any other admixed cell population (such as lymphocytes and/or macrophages), that can predict a response to the targeted therapy.5 Rapid technological advances and the advent of computer-aided platforms have permitted development of machine-based scoring algorithms for automated assessment of immunohistochemistry expression. These algorithms show promise in overcoming these challenges and provide a possible approach to accurate, rapid, and reproducible analysis. This manuscript describes an automated image analysis approach for evaluation of PD-L1 immunohistochemistry expression on both tumor and immune cells in non–small cell lung carcinoma that may be reliably used to assess PD-L1 expression in a more objective manner. The purpose of this study was to validate and assess the reproducibility of the Optra automated PD-L1 image analysis solution for non–small cell lung cancer. The primary goal was to investigate whether PD-L1 immunohistochemistry can be validated and standardized using automated machine-based scoring on whole slides. Additional goals included: (i) the evaluation of differences in the assessment of PD-L1 expression on TCs versus immune infiltrates; (ii) the comparison of the interpretation of results between different pathologists; and (iii) the assessment of the reproducibility of automated machine measurement of PD-L1 expression on tumor and immune cell populations compared with pathologists.
Patient Cohort for Tissue Procurement and Immunohistochemistry
Immunohistochemistry slides that had been stained with the FDA-approved PD-L1 immunohistochemistry assay using the anti-PD-L1 22C3 mouse monoclonal primary antibody were obtained from the Yale School of Medicine Department of Pathology archives from June 2016 to April 2017. These slides represented 230 cases of non–small cell lung cancer. An additional training and evaluation set of slides, including customized index tissue microarrays containing representative lung cases with variable PD-L1 expression, was used for training and validation of algorithm thresholds. No personalized health information was obtained or required for this study. Cases were selected and retrieved with permission under the Yale Human Investigation Committee protocol #9505008219.
Slide Scanning and Generation of Pathologists’ and Automated Optra Image Analysis PD-L1 Scores
Sections (5-μm) were scanned using OptraSCAN (OptraSCAN Inc., California Corporations) at ×20 with a resolution of 0.5 μm/pixel. Three board-certified pathologists (C.R.T., J.H., D.L.R.) from different institutes across the United States, all with extensive recent experience in interpreting the assay, independently recorded the percentages of tumor and immune cells staining positively on the basis of the scoring guidelines (see below). A subset of cases representative of the entire cohort was selected to make the study feasible for the participating pathologists, to compare interpathologist scores, and to validate the automated scores generated for tumor and immune cells. Sample size was calculated from the overall published prevalence of PD-L1 positivity, which varies between 20% and 70% across cohorts.7,8 In the absence of similar published studies to provide a basis, we adopted a 50% prevalence (p) for TC positivity, with an error of 10% (L). The sample size, calculated with the formula 4pq/L² (where q = 1 − p), was rounded to 100. To select a representative sample of 100 cases across the scoring categories, a proportionate-to-size sampling technique was used after stratifying all 230 lung cancer cases into the previously described National Comprehensive Cancer Network score categories (Table 1).
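The sample size calculation and the proportionate-to-size selection can be illustrated with a short Python sketch. The function names and the per-category case counts below are hypothetical (the actual counts are those in Table 1), and the largest-remainder rounding rule is an assumption made only so that the allocations sum to the target; it is not stated in the study.

```python
def sample_size(p, L):
    """Approximate sample size n = 4pq / L^2, with q = 1 - p."""
    q = 1.0 - p
    return round(4.0 * p * q / (L * L))


def proportional_allocation(strata, n):
    """Split n cases across strata in proportion to stratum size.

    Uses integer arithmetic and the largest-remainder rule (an assumption)
    so that the per-stratum allocations always sum to n.
    """
    total = sum(strata.values())
    base = {k: (n * size) // total for k, size in strata.items()}
    remainders = {k: (n * size) % total for k, size in strata.items()}
    shortfall = n - sum(base.values())
    # Hand the leftover cases to the strata with the largest remainders.
    for k in sorted(remainders, key=remainders.get, reverse=True)[:shortfall]:
        base[k] += 1
    return base


# Hypothetical category counts that sum to 230; NOT the values from Table 1.
hypothetical_strata = {"A": 60, "B": 30, "C": 20, "D": 30, "E": 30, "F": 60}

n = sample_size(p=0.5, L=0.10)  # 4 * 0.5 * 0.5 / 0.01
allocation = proportional_allocation(hypothetical_strata, n)
print(n, allocation)
```

With p = 0.5 and L = 0.10 the formula gives 100, which matches the sample size adopted in the study.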
The following analytical components were assessed: TC staining (an estimate of the percentage of TCs exhibiting partial or complete membranous staining) and immune cell staining (an estimate of the percentage of immune cells, including macrophages and lymphocytes, within the tumor, exhibiting staining). Scoring was based upon the entire slide. The digital images were rendered viewable using the Optra web-based image management system for viewing and recording individual whole slide image (WSI) scores. All images were hosted on a dedicated server with a user login for individual pathologists.
Training and Evaluation of Algorithm (Feedback Machine Learning Process)
For development of Optra’s membrane algorithm for PD-L1, the training data set consisted of 40 WSIs of chromogenic slides stained with the 22C3 antibody, which provided a rich variation in tumor and immune cell positivity for lung cancer as the input data set (Fig. 1). Whole slide tissue images usually consist of 30 to 800 fields of view (FOVs), and each FOV was an input image for the algorithm. To ensure that we did not miss hot spots for tumor and immune cells, areas negative for tumor or immune cell membrane staining, or variation in the density of immune cells, we subjectively selected FOVs from this set of WSIs to represent a range of intensity and percent positivity for tumor and immune cells. Multiphase, intensity-based, and morphology-based segmentation was used to identify all viable nucleated cells in each FOV and, ultimately, in the WSI. The nuclear step of the algorithm primarily identified all viable tumor and immune cell nuclei in the blue plane of the input image on the basis of the intensity values for the hematoxylin counterstain. A distance transform algorithm was also applied to the segmented image to obtain seed points for further segmentation, separating as many connected viable nucleated cells as possible. Cells exhibiting membrane positivity were identified in the red plane of the input image. The intensity values for both the membrane and the nuclear stain were made configurable in the algorithm to account for subtle variations in membrane positivity and hematoxylin counterstaining, which differ between laboratories even when they use the same assay. The next step classified and segregated tumor and immune cells on the basis of morphologic characteristic thresholds such as size and circularity. The immune cell and TC nuclei were further classified as positive or negative using texture-feature-based classification (viz., size, shape, intensity, and nucleo:cytoplasmic ratio) to identify and differentiate macrophages from TCs among the mixed population of TCs and immune cells in the WSI. The membrane-positive candidates were validated, and any background, residual, and/or cytoplasmic stain was filtered out by a unique stroke detection algorithm. The cells assessed were then quantified as TCs and immune cells as follows:
- Total number of TCs with positive and negative membrane staining
- Total number of immune cells with positive and negative membrane staining
- Total number of TCs with membrane staining
- Total number of immune cells with membrane staining
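The morphology-based separation of tumor and immune cell nuclei by size and circularity can be sketched as follows. This is an illustration only, not the Optra implementation: the function names, the threshold values, and the rule that small, round nuclei are treated as immune cells are all assumptions, since the study does not report its thresholds. Circularity is computed with the standard definition 4πA/P², which equals 1 for a perfect circle.

```python
import math

# Hypothetical thresholds for illustration; the study does not report values.
MAX_IMMUNE_AREA_UM2 = 40.0     # assumed upper bound on immune cell nuclear area
MIN_IMMUNE_CIRCULARITY = 0.85  # assumed lower bound on immune cell roundness


def circularity(area, perimeter):
    """Standard shape descriptor 4*pi*A / P^2 (1.0 for a perfect circle)."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4.0 * math.pi * area / (perimeter ** 2)


def classify_nucleus(area_um2, perimeter_um):
    """Label a segmented nucleus 'immune' or 'tumor' from size and shape.

    Assumption: small, nearly round nuclei are treated as immune cells and
    everything else as tumor cells.
    """
    is_small = area_um2 <= MAX_IMMUNE_AREA_UM2
    is_round = circularity(area_um2, perimeter_um) >= MIN_IMMUNE_CIRCULARITY
    return "immune" if (is_small and is_round) else "tumor"


# A round nucleus of radius 3 um versus a larger, irregular one.
small_round = classify_nucleus(math.pi * 3 ** 2, 2 * math.pi * 3)
large_irregular = classify_nucleus(120.0, 50.0)
print(small_round, large_irregular)
```

In practice such thresholds would need to be tuned per laboratory, in the same way the stain intensity values are described as configurable.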
Algorithm training was refined further through iterative training and evaluation sessions with a board-certified pathologist experienced in scoring PD-L1 antibody staining for the assay used. After these training iterations, validation set I, comprising 50 WSIs that were not part of the training set, was run through the fully trained algorithm. Receiver operating characteristic curves were plotted to assess the performance of the algorithm at a 1% cutoff for both tumor and immune cell positivity at any intensity (Figs. 2A, B).
The automated PD-L1 algorithm quantifies the segmented cells for percent membrane positivity, using nuclei as the denominator, as well as staining intensity according to predefined categories. This results in separate percentage scores for tumor and immune cells, and a processed overlay image is generated on the input image to reflect the analysis results (Figs. 3A, B).
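A minimal sketch of the percent-positive calculation with nuclei as the denominator is shown below. The class and function names are hypothetical, and pooling the counts of all FOVs before dividing is an assumption about how a whole slide score could be formed; the study does not describe the aggregation step.

```python
from dataclasses import dataclass


@dataclass
class CellCounts:
    """Membrane-positive and membrane-negative nuclei for one cell type."""
    positive: int
    negative: int

    @property
    def total(self):
        # Every segmented nucleus counts toward the denominator.
        return self.positive + self.negative


def percent_positive(counts):
    """Percentage of nuclei whose cell shows membrane staining."""
    if counts.total == 0:
        return 0.0  # no cells of this type were detected
    return 100.0 * counts.positive / counts.total


def whole_slide_score(fov_counts):
    """Pool counts from all FOVs (assumption), then compute one percentage."""
    pooled = CellCounts(
        positive=sum(c.positive for c in fov_counts),
        negative=sum(c.negative for c in fov_counts),
    )
    return percent_positive(pooled)


# Two hypothetical FOVs for tumor cells: 10/100 and 30/100 positive.
tumor_score = whole_slide_score([CellCounts(10, 90), CellCounts(30, 70)])
print(tumor_score)
```

Pooling counts weights each FOV by its cell number; averaging per-FOV percentages instead would give sparse fields the same influence as dense ones.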
The fully developed quantitative PD-L1 algorithm was then run on each WSI of validation set II (N=99) to generate an automated whole slide percent-positive membrane score for TCs and immune cells separately, according to the preselected cutoff for the above-mentioned standard PD-L1 assay, without any human intervention.
For validation set II, pathologists’ scores were recorded on a 6-point scale as per the published National Comprehensive Cancer Network PD-L1 score categories,5 with each value representing a range of PD-L1 expression by percentage of TCs showing positive membrane staining, as follows: category A, negative or <1%; B, 1% to 4%; C, 5% to 9%; D, 10% to 24%; E, 25% to 49%; F, ≥50%.
Immune Cell Scoring
The following categories were used: A, <1%; B, ≥1% but <10%; C, ≥10%. Adopting this unified scoring system, which had been used in the previously published study, facilitated assessment of the TPS and immune cell score levels assigned by the participating pathologists.
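The tumor cell and immune cell score categories can be expressed as simple lookup functions. The function names are hypothetical, and treating the published integer ranges as half-open intervals (so that, for example, 4.5% falls in category B) is an assumption for handling fractional percentages from an automated score.

```python
def tumor_category(tps_percent):
    """Map a tumor cell percentage to the 6-point categories A to F."""
    if not 0.0 <= tps_percent <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    # Upper bounds are exclusive; F covers everything from 50% upward.
    for upper_bound, label in ((1, "A"), (5, "B"), (10, "C"), (25, "D"), (50, "E")):
        if tps_percent < upper_bound:
            return label
    return "F"


def immune_category(immune_percent):
    """Map an immune cell percentage to the 3-point categories A to C."""
    if not 0.0 <= immune_percent <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if immune_percent < 1:
        return "A"
    if immune_percent < 10:
        return "B"
    return "C"


print([tumor_category(v) for v in (0.5, 1, 7, 24, 49, 50)])
print([immune_category(v) for v in (0.2, 5, 10)])
```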
Lin’s concordance correlation coefficient (CCC) estimates (ranging from 0 to 1, with 1 representing perfect reproducibility) were computed for the PD-L1 immunohistochemistry results for two purposes: to assess variability between pathologists’ digital manual reads when scoring both tumor and immune cell positivity (percentage of positive cells as per the scoring guidelines), and to assess concordance between pathologists’ digital manual reads and the PD-L1 image analysis scores for tumor and immune cells separately.
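For readers unfamiliar with the statistic, a minimal sketch of the point estimate of Lin's concordance correlation coefficient is given below. It uses the usual definition, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²) with 1/n moments; the function name is hypothetical, the example scores are invented, and the confidence intervals reported in this study are not reproduced here.

```python
def lins_ccc(x, y):
    """Point estimate of Lin's concordance correlation coefficient.

    x and y are paired numeric scores (for example, two readers' percentages
    for the same slides). Variances and covariance use the 1/n form.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both readers are constant and equal")
    return 2.0 * cov_xy / denominator


# Invented example: reader B scores every slide exactly 1 point higher.
reader_a = [1, 2, 3, 4]
reader_b = [2, 3, 4, 5]
print(lins_ccc(reader_a, reader_a))  # identical reads
print(lins_ccc(reader_a, reader_b))  # perfectly correlated but shifted
```

The second call illustrates why the coefficient suits agreement studies: a systematic offset between readers lowers it even though the two sets of reads are perfectly correlated.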
Training, Evaluation, and Validation (Set I) Data
Figures 2A and B show receiver operating characteristic curves for tumor and immune cells separately, at a 1% cutoff for predicting PD-L1 expression on tumor and immune cells. To evaluate the ability of the algorithm to correctly predict positive and negative membrane PD-L1 expression on tumor and immune cells, as analogs of sensitivity and specificity, respectively, the criterion standard was the pathologist’s digital manual reads. The accuracy of the algorithm in detecting membrane positivity was 88% for tumor cells and 84% for immune cells. The algorithm performed well, with an area under the curve of 0.8 for tumor cells and 0.7 for immune cells.
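The kind of evaluation described here can be sketched in a few lines of Python: both the algorithm score and the pathologist read are dichotomized at the 1% cutoff, the pathologist read serves as the criterion standard, and the area under the curve is computed as the probability that a pathologist-positive case receives a higher algorithm score than a pathologist-negative one. The function names and the six example cases are invented for illustration and are not study data.

```python
def confusion_at_cutoff(algorithm_pct, pathologist_pct, cutoff=1.0):
    """Count TP, FP, TN, FN with the pathologist read as the reference."""
    tp = fp = tn = fn = 0
    for algo, ref in zip(algorithm_pct, pathologist_pct):
        predicted, truth = algo >= cutoff, ref >= cutoff
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy (NaN if a denominator is zero)."""
    nan = float("nan")
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "accuracy": (tp + tn) / total if total else nan,
    }


def auc(scores, labels):
    """Rank-based AUC: P(score of a positive > score of a negative), ties 0.5."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Invented percentages for six cases; not data from this study.
algorithm = [0.0, 0.5, 2.0, 10.0, 60.0, 0.8]
pathologist = [0.0, 2.0, 5.0, 20.0, 70.0, 0.0]

counts = confusion_at_cutoff(algorithm, pathologist)
metrics = summarize(*counts)
area = auc(algorithm, [ref >= 1.0 for ref in pathologist])
print(counts, metrics, area)
```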
The clinical characteristics of this patient cohort are summarized in Table 2. Three pathologists interpreted the diaminobenzidine immunohistochemistry staining obtained with antibody 22C3 for 99 of the 100 cases. One case was not scored because the image resolution was too poor for scoring. The staining percentages were recorded on digital WSIs for tumor and for immune cells for each case as per the National Comprehensive Cancer Network publication.5
Pathologists’ Concordance Study Findings
An analysis was performed on a categorical level (instead of a continuous 0% to 100% scale), considering categories A to F for TC scoring and A to C for immune cell scores as described. Figure 4 shows the distribution of PD-L1 scores among the 3 pathologists and the automated algorithm for TCs (Figs. 4A, C) and immune cells (Figs. 4B, D). We used the unified scoring system so that we could assess the pathologists’ ability to score at the various levels used in this study; the test itself provides a categorical outcome. The FDA-approved cut points for assay positivity are ≥50% and ≥1% for high expression and low expression, respectively. Hence, an important aim of the present study was to compare the concordance of pathologists’ and automated scores around these approved cut points, at which clinicians prescribe drugs. Figure 5 illustrates the distribution of cases scored by each pathologist for tumor and immune cell positivity around this 3-category scoring system. The distribution of cases scored for TC positivity around the assay-specific threshold of 1% was 27 versus 72 for pathologist 1, 18 versus 81 for pathologist 2, and 19 versus 80 for pathologist 3. The immune cell score distribution observed around the 1% cutoff was 20 versus 79 for pathologist 1, 7 versus 92 for pathologist 2, and 14 versus 85 for pathologist 3.
Lin’s CCC Estimates
The concordance between pathologists for tumor and immune cells (ranging from 0 to 1, with 1 representing perfect reproducibility) was computed. Pairwise CCCs between pathologists for PD-L1 expression in TCs showed a strong positive correlation. Lin’s CCCs were 0.81 (95% confidence interval [CI], 0.76-0.86), 0.78 (95% CI, 0.73-0.84), and 0.68 (95% CI, 0.60-0.75) between pathologists 1 and 3, pathologists 2 and 3, and pathologists 1 and 2, respectively (Table 3). The readings for immune cell scoring indicated substantial discordance between pathologists. Table 4 shows a high level of variation between pathologists for immune cell scoring; Lin’s CCCs for immune cell scores were 0.56 (95% CI, 0.45-0.67), 0.48 (95% CI, 0.36-0.60), and 0.48 (95% CI, 0.37-0.58) between pathologists 1 and 3, pathologists 2 and 3, and pathologists 1 and 2, respectively.
Comparison of Manual Scores With Automated Quantitative Measurement for Tumor and Stromal PD-L1 Expression
Finally, the quantitative information computed by the algorithm was compared with the pathologists’ reads by calculating Lin’s CCCs between each pathologist’s reads and the machine scores. For TCs, the CCCs between the automated scores and the reads of pathologists 1, 2, and 3 were 0.82 (95% CI, 0.77-0.87), 0.66 (95% CI, 0.58-0.74), and 0.90 (95% CI, 0.87-0.93), respectively (Table 3). For immune cells, they were 0.62, 0.53, and 0.88, respectively (Table 4), which is better than the interpathologist concordance for immune cell scoring. To compare the pathologists’ scores as a group with the automated system, the scores from the 3 pathologists were averaged to produce a single read score. Lin’s CCC between this averaged read and the machine score was 0.80 (95% CI, 0.74-0.85) for TCs and 0.70 (95% CI, 0.60-0.76) for the immune cell population. This is considered excellent agreement for TCs and good concordance for immune cells.
In this work, we hypothesize that adoption of an automated image analysis method can achieve reproducibility and accuracy in the interpretation of immunohistochemistry staining percentages for PD-L1. A study conducted by Rehman and colleagues revealed high concordance among pathologists when they scored TCs, but concern was raised about the lack of concordance in reporting immune cell scores. Similar findings were described in 2 multi-institutional studies,3,5,6 where concordance was good for TPSs but not for Immune Cell Percentage Scores. In the present study, the concordance between the pathologists and the algorithm for TCs is comparable to that seen in the large multi-institutional studies. A strong positive correlation between the pathologists’ scores and the Optra PD-L1 image analysis algorithm scores allowed us to evaluate and validate the tumor and immune cell quantification and scoring algorithm. Using the unified scoring system allowed us to assess the pathologists’ ability to score at various TPS levels and provided us with the “truth” for validating the algorithm scores across all scoring categories in a robust manner. It is also notable that the concordance between the automated scoring algorithm and any pathologist is no worse than the concordance between pathologists. Thus, our study provides support for future studies to evaluate the automated algorithm in predicting response to immune therapy.
In this study, we assessed TC and immune cell populations expressing programmed death 1/PD-L1 in non–small cell lung carcinoma specimens using completely automated image analysis on a WSI without a pathologist’s assistance. Previous subjective studies have shown that evaluation of immunohistochemistry-stained tissue sections is characterized by significant intraobserver and interobserver variability, both with respect to direct “manual” scoring and in the pathologist-selected areas (subsets of the overall tissue section) for digital analysis.9 The reduced reproducibility could be overcome by using automated digital analysis of the WSI, which is potentially more reliable and objective. Although reproducibility was not tested by comparing multiple runs of the algorithm, the concordance of this algorithm with pathologist reads is comparable to the concordance between pathologists.
This study has several limitations. Most significantly, the study cannot assess the predictive value of the algorithm because the cohort tested does not include treatment or follow-up information. However, this is typical for a pilot study, where an algorithm must be shown to be comparable to pathologist scores before being granted access to more valuable outcome cohorts. After publication of this work, we hope to receive access to specimen collections from clinical trials to evaluate accuracy. A second limitation of this work is the absence of a standard against which the immune cell algorithm can be evaluated. As in previous studies, pathologist concordance on immune cell assessment was low. As a result, there is no easy way to evaluate the algorithm’s accuracy for immune cell scoring. We plan to evaluate the immune cell algorithm in future studies where patient outcome is more tightly predicted by immune cell components, perhaps in bladder cancer or gastroesophageal cancer.10,11
In summary, the algorithm generated on a training set performed well on a validation set. The interpathologist concordance seen in this study is similar to that in previously reported studies, where agreement is higher for evaluation of TCs than for immune cells. The present study shows that the Optra PD-L1 image analysis achieved concordance with the pathologists’ scores that was comparable to the levels observed between pathologists for PD-L1 on tumor and immune cells. This finding suggests promise for the use of automated WSI analysis in non–small cell lung cancer for accurately generating scores for tumor and immune cell populations.
The authors are grateful to Dr David L Rimm, Department of Pathology, Yale University School of Medicine, New Haven, Connecticut, United States for supporting and collaborating with OptraSCAN for the present study.
1. Gettinger SN, Horn L, Gandhi L, et al. Overall survival and long-term safety of nivolumab (anti–programmed death 1 antibody, BMS-936558, ONO-4538) in patients with previously treated advanced non–small-cell lung cancer. J Clin Oncol. 2015;33:2004–2012.
2. Carbognin L, Pilotto S, Milella M, et al. Differential activity of nivolumab, pembrolizumab and MPDL3280A according to the tumour expression of programmed death-ligand-1 (PD-L1): sensitivity analysis of trials in melanoma, lung and genitourinary cancers. PLoS ONE. 2015;10:e0130142.
3. Rehman JA, Han G, Carvajal-Hausdorf DE, et al. Quantitative and pathologist-read comparison of the heterogeneity of programmed death-ligand 1 (PD-L1) expression in non-small cell lung cancer. Mod Pathol. 2017;30:340–349.
4. Hirsch FR, McElhinny A, Stanforth D, et al. PD-L1 immunohistochemistry assays for lung cancer: results from phase 1 of the blueprint PD-L1 IHC assay comparison project. J Thorac Oncol. 2017;12:208–222.
5. Rimm DL, Han G, Taube JM, et al. A prospective, multi-institutional, pathologist-based assessment of 4 immunohistochemistry assays for PD-L1 expression in non–small cell lung cancer. JAMA Oncol. 2017;3:1051–1058.
6. Tsao MS, Kerr KM, Yatabe Y. PL 03.03 Blueprint 2: PD-L1 immunohistochemistry comparability study in real-life clinical samples (Blueprint 2). J Thorac Oncol. 2017;12:S1606.
7. Kerr KM, Tsao M-S, Nicholson AG, et al. Programmed death-ligand-1 immunohistochemistry in lung cancer: in what state is this art? J Thorac Oncol. 2015;7:985–989.
8. Galon J, Pagès F, Marincola FM, et al. The immune score as a new possible approach for the classification of cancer. J Transl Med. 2012;10:1.
9. Varghese F, Bukhari AB, Malhotra R, et al. IHC profiler: an open source plugin for the quantitative evaluation and automated scoring of immunohistochemistry images of human tissue samples. PLoS ONE. 2014;9:e96801.
10. Wu Y, Cao D, Qu L, et al. PD-1 and PD-L1 co-expression predicts favorable prognosis in gastric cancer. Oncotarget. 2017;8:64066–64082.
11. Bellmunt J, Powles T, Vogelzang NJ. A review on the evolution of PD-1/PD-L1 immunotherapy for bladder cancer: the future is now. Cancer Treat Rev. 2017;54:58–67.