It is estimated that there will be 102,480 new cases of colon cancer in the United States in 2013.1 Approximately 25% of colon cancer cases present with stage II disease.2 Unlike stage I colon cancer, where surgical resection alone is widely accepted as the standard treatment, or stage III tumors, in which surgery followed by oxaliplatin-based adjuvant chemotherapy is currently the standard of care and has been shown to provide a benefit in survival, the use of adjuvant chemotherapy for stage II disease is not as clearly defined.3–6 Even though current guidelines advise against the standard use of adjuvant chemotherapy for all stage II colon cancer patients, some studies demonstrated that adjuvant chemotherapy after surgery can provide some moderate benefit to a proportion of higher risk patients.4,6 Therefore, it is clinically important to have a reliable method to identify high-risk patients with stage II colon cancer to select for adjuvant chemotherapy treatment.
A 634-probe prognostic gene expression–based assay has been recently developed and clinically validated for assessing a patient’s likelihood for tumor recurrence in stage II colon cancer by Almac Diagnostics Ltd. (Craigavon, UK). Using formalin-fixed paraffin-embedded (FFPE) tissue, the assay divides stage II colon cancer patients into high-risk and low-risk groups. This assay was rigorously developed using 215 stage II colon cancers and subsequently validated in an independent set of 144 stage II colon cancers, demonstrating a statistically significant ability to identify high-risk patients with a hazard ratio of 2.5 [95% confidence interval (CI), 1.536-4.154, P<0.001].7 An additional clinical validation using stage II colon cancers is underway to further demonstrate its clinical significance.
In addition to clinical validation, a laboratory-developed test must also undergo a rigorous analytical validation to ensure that the assay can generate reliable assay results. All laboratories in the United States that perform clinical testing on human samples are regulated by the Clinical Laboratory Improvement Amendments (CLIA). Although this genomic assay is technically complex and novel in nature, the current study evaluates the analytical performance, including accuracy, sensitivity, reproducibility, and reportable range of the assay by following CLIA guidelines and those published by Clinical and Laboratory Standards Institute (CLSI).8–13 Some examples of quality control measures implemented in the assay process are also described. The results from the current study demonstrate the strong analytical performance of this 634-probe gene expression assay and support its use as an objective and validated prognostic test for stage II colon cancer.
A total of 52 stage II colon cancer FFPE samples were used in the study. Forty-three samples were sourced from clinical centers and commercial tissue banks in Sweden, Ireland, Northern Ireland, Canada, and the United States; these samples all have associated IRB approval and patient informed consents from the clinical centers and the commercial tissue banks and were also granted local ethical approval within Northern Ireland by the Office of Research Ethics Committees Northern Ireland (ORECNI). Eight samples were obtained from a bank of excess human tissue (Precision Therapeutics Inc., Pittsburgh, PA), all of which were deidentified and exempted from Informed Consent Forms (ICF).14 One sample was obtained from a clinical trial and has associated IRB approval and patient informed consent.15
Gene Expression Profiling from FFPE Tissue
A minimum of 50% tumor cells is required to be present in the FFPE tissue section. In cases where there was <50% tumor cellularity, a pathologist-guided macrodissection was performed to ensure each sample contained a minimum of 50% tumor cellularity and was free of necrotic tissue. The 634-probe gene expression assay (GeneFx Colon, Precision Therapeutics, Inc., Pittsburgh, PA) was performed as previously described.7 Total RNA was extracted from these FFPE tissue sections using the Roche High Pure RNA paraffin kit (Roche, Basel, Switzerland). Samples with a minimum RNA concentration of 9.5 ng/μL proceeded to RNA amplification using the WT Ovation FFPE cDNA amplification V2 kit (NuGEN Technologies Inc., San Carlos, CA). The amplified product was fragmented and labeled with the Encore Biotin Module (NuGEN Technologies Inc., San Carlos, CA), hybridized to the Almac Diagnostics’ proprietary Colorectal Cancer DSA chip (Affymetrix, Santa Clara, CA), and scanned on the Affymetrix Dx 7G scanner (Affymetrix, Santa Clara, CA). The resulting microarray data then went through a series of quality control and processing steps: microarray QC (evaluation of surface artifacts, background artifacts, background noise, spike controls, housekeeping controls, and average signal (A)), normalization with RefRMA, extraction of the signature genes, and risk score calculation based on a fixed formula. A risk score was thus generated for each specimen assayed and was then used to assign the patient to a high-risk or low-risk prognostic group using the predetermined risk score cutoff of 0.4377.7
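The input requirements and the final dichotomization described above can be summarized in a short sketch. The function names are hypothetical, and the handling of a score exactly equal to the cutoff is an assumption, since the text does not specify it; the direction (higher score means higher risk) follows from the high-risk and low-risk control ranges reported later in the paper.

```python
MIN_TUMOR_CELLULARITY_PCT = 50.0   # minimum tumor cellularity stated in the text
MIN_RNA_NG_PER_UL = 9.5            # minimum RNA concentration stated in the text
RISK_SCORE_CUTOFF = 0.4377         # predetermined risk score cutoff stated in the text


def meets_input_requirements(tumor_pct, rna_ng_per_ul):
    """Return True if a sample satisfies both stated input minimums."""
    return (tumor_pct >= MIN_TUMOR_CELLULARITY_PCT
            and rna_ng_per_ul >= MIN_RNA_NG_PER_UL)


def classify_risk(risk_score, cutoff=RISK_SCORE_CUTOFF):
    """Dichotomize a risk score into 'high-risk' or 'low-risk'.

    Assumption: scores strictly above the cutoff are high-risk; the
    treatment of a score exactly at the cutoff is not given in the text.
    """
    return "high-risk" if risk_score > cutoff else "low-risk"
```

The risk score itself comes from the assay's fixed formula, which is not reproduced here.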
Generation of Pools of RNA With Different Percentages of Tumor Cellularity
In the sensitivity study, 5 samples that spanned both low-risk and high-risk scores were selected. Each sample was then assayed over a range of tumor content in the sample: 30%, 50%, 70%, and 90%. To achieve these RNAs with different tumor cellularity percentages, multiple sections from each of the 5 patient samples underwent pathologist-guided macrodissection to completely remove the nontumor tissue from the sections. The nontumor tissue was placed in a 1.5 mL microcentrifuge tube, followed by RNA extraction to generate RNA from 100% nontumor (0% tumor) tissue. The remaining tissue on the tissue sections was collected into a separate 1.5 mL microcentrifuge tube, followed by RNA extraction to generate RNA from 100% tumor (0% nontumor) tissue. Both the 100% tumor and 0% tumor RNAs were dissolved in equal amounts of nuclease-free water. Thereafter, differing volumes of RNA from 100% tumor and from 0% tumor were mixed together to generate the 30%, 50%, 70%, and 90% tumor RNA mixtures that were used in the sensitivity study.
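The mixing step can be illustrated with a small calculation. This is a sketch under an assumption the text does not state: that the target percentage refers to the fraction of RNA mass contributed by the tumor stock. The function name and parameters are hypothetical. When both stocks have the same concentration, the calculation reduces to mixing by volume fraction.

```python
def mixing_volumes(target_tumor_fraction, total_volume_ul,
                   tumor_conc=1.0, nontumor_conc=1.0):
    """Volumes (tumor_ul, nontumor_ul) giving the target tumor RNA mass fraction.

    tumor_conc and nontumor_conc are the stock concentrations in the same
    units (for example ng/uL). Equal defaults mean mixing by volume fraction.
    """
    f = target_tumor_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("target_tumor_fraction must be between 0 and 1")
    if tumor_conc <= 0 or nontumor_conc <= 0:
        raise ValueError("stock concentrations must be positive")
    # Solve v_t*c_t / (v_t*c_t + v_n*c_n) = f with v_t + v_n = total volume.
    tumor_ul = (f * nontumor_conc * total_volume_ul
                / (tumor_conc * (1.0 - f) + f * nontumor_conc))
    return tumor_ul, total_volume_ul - tumor_ul


# Volumes for the four tumor levels in a hypothetical 10 uL mixture.
for pct in (30, 50, 70, 90):
    print(pct, mixing_volumes(pct / 100.0, 10.0))
```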
Generation of FFPE HCT116 Cell Line Block
A human colorectal carcinoma cell line, HCT116 (ATCC catalog # CCL-247), was used as a high-risk control based on its gene expression profile. The cells were cultured in McCoy’s complete growth media supplemented with 10% FBS, trypsinized, and pelleted with centrifugation. The resulting cell pellet, containing a minimum of 2×10^7 cells, was then incubated in 10% buffered neutral formalin for 1 hour at room temperature and was embedded in HistoGel at room temperature overnight. The resulting block was then sectioned into 5 µm sections for RNA extraction. The RNAs were aliquoted, labeled with one lot number, and stored at −80°C for downstream gene expression profiling and risk score generation. RNAs from new sections from the existing block or RNAs from a new block were labeled with a new lot number and were quality control tested following the controlled lot-to-lot validation Standard Operating Procedures (SOPs) established by Precision Therapeutics Inc. Only a new lot of RNAs that fulfills the established acceptable risk score range is deemed validated and acceptable for use in the assay process as the positive control material.
Statistical Analysis Methods
The reportable range (also called the measuring interval or range) is defined as “a set of values of quantities of the same kind that can be measured by a given measuring instrument or measuring system with specified instrumental uncertainty, under defined conditions” by the Joint Committee for Guides in Metrology (JCGM).16 It is also defined as “the range of test result values over which the laboratory can establish or verify the accuracy of the instrument, kit, or test system measurement response” by CLSI MM22-Draft.8 The first definition from JCGM defines the reportable range based on the concentration of the analyte; therefore, samples of known concentrations (reference standards) have to be available. Because of the multifactorial nature of the assay and because the true risk for each patient is unknown (there is no other reference method available), the second definition was followed, which defines the reportable range based on the test result, in this case, the risk score. Thus, the reportable range was defined as the interval between the maximum and the minimum risk score values that are theoretically possible under the assay.
As a result of using RefRMA quantile normalization, each gene can take a finite number of possible values. Consequently, the signature can take a finite number of possible risk score values. As it is not computationally feasible to obtain the exact range, we estimated it by randomly permuting the reference quantiles sufficiently many times (we used 1.008×10^6 times) and computing the corresponding risk scores and their sample range.
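The permutation idea can be sketched as follows. The assay's actual risk score formula and reference quantiles are not given in the text, so this example assumes a hypothetical linear score (a bias plus a weighted sum of the signature probe values); `weights`, `bias`, and the toy quantiles are illustrative only.

```python
import random


def estimate_score_range(reference_quantiles, weights, bias,
                         n_permutations, seed=0):
    """Estimate the (min, max) risk score by random permutation.

    Each iteration randomly assigns distinct reference quantiles to the
    signature probes and evaluates a hypothetical linear score. The sample
    range can only underestimate the true theoretical range.
    """
    rng = random.Random(seed)
    k = len(weights)
    lo, hi = float("inf"), float("-inf")
    for _ in range(n_permutations):
        # Equivalent to permuting all quantiles and reading off the
        # values that land on the k signature probes.
        values = rng.sample(reference_quantiles, k)
        score = bias + sum(w * v for w, v in zip(weights, values))
        lo = min(lo, score)
        hi = max(hi, score)
    return lo, hi


# Toy example: 10 reference quantiles and a 2-probe hypothetical signature.
print(estimate_score_range(list(range(1, 11)), [1.0, -1.0], 0.0, 2000))
```

Because the estimate is a sample range, increasing the number of permutations can only widen it toward the true interval.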
Following CLSI EP9-A2, Deming regression was used to test for a bias between the production and reference labs and between the 2 replicates within the production lab.9 Pearson correlation was used to estimate the strength of the relationship between the risk scores, and percent agreement was used to estimate the strength of the relationship between the categorical assignments. A simulation study was employed to address the percent agreement and the correlation coefficient when the sampled data covered the entire reportable range of the assay; the variability of the assay was assumed to be the same as that observed in the accuracy study (Supplementary material, Supplemental Digital Content 1, http://links.lww.com/PDM/A47).
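For readers unfamiliar with these three statistics, a minimal sketch follows. It gives point estimates only; the confidence intervals and the significance tests on the slope and intercept used in the study are not reproduced. The error-variance ratio `delta` defaults to 1, which is an assumption here and not a value taken from the paper.

```python
import math


def _moments(x, y):
    """Return means and (n - 1)-scaled variances and covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return mx, my, sxx, syy, sxy


def deming(x, y, delta=1.0):
    """Deming regression; delta is the y-error to x-error variance ratio."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    d = syy - delta * sxx
    slope = (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)
    return slope, my - slope * mx   # (slope, intercept)


def pearson(x, y):
    """Pearson correlation coefficient."""
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / math.sqrt(sxx * syy)


def percent_agreement(x, y, cutoff=0.4377):
    """Fraction of pairs assigned to the same risk category."""
    same = sum((a > cutoff) == (b > cutoff) for a, b in zip(x, y))
    return same / len(x)
```

A slope near 1 and an intercept near 0 correspond to the absence of proportional and constant bias, respectively.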
The precision study consisted of 2 substudies. The first substudy was designed to evaluate the variation of the risk score because of day, operator, and measurement error; the second substudy was designed to evaluate the variation due to reagent lots.
The first substudy used 6 patient samples, 3 of which were of low risk and the other 3 were of high risk. Three operators were assigned to each sample. Triplicate measures were made for each sample at each combination of factors. For the first sample, each operator performed 3 replicates of the assay on each of 3 days. For the remaining samples only 1 replicate per day per operator was performed. A 4-way nested analysis of variance (ANOVA) model was used to estimate the variance components of this substudy. The repeatability of the assay was estimated by the SD of the error term in the model.
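The variance-component idea can be illustrated with a deliberately simplified case. The sketch below is not the 4-way nested model used in the study; it assumes a balanced 1-way random-effects layout (for example, replicates grouped by operator) and uses the standard method-of-moments estimators. The function name is hypothetical.

```python
import math


def one_way_variance_components(groups):
    """Method-of-moments variance components for a balanced 1-way layout.

    groups: list of equal-length lists of risk scores, one list per level
    of the factor. Returns (between_variance, within_variance); the
    between-group estimate is truncated at 0 if it comes out negative.
    """
    k = len(groups)
    n = len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least 2 balanced groups of size >= 2")
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    between = max(0.0, (ms_between - ms_within) / n)
    return between, ms_within


between, within = one_way_variance_components([[1, 2, 3], [4, 5, 6]])
repeatability_sd = math.sqrt(within)  # SD of the error term
```

In this simplified setting, the square root of the within-group variance plays the role of the repeatability SD described above.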
Two essential reagent kits were tested in the second substudy: WT Ovation FFPE cDNA amplification V2 kit for RNA amplification and Encore Biotin Module for RNA labeling. Three lots of the cDNA amplification (WT Ovation) kit and 2 lots of the fragmentation and labeling (Encore) kit were assayed. Each lot of the WT Ovation kit was combined with lot 1 of the Encore kit to produce 4 to 6 replicates of the assay. The remaining Encore lot was combined with only 1 WT Ovation lot to produce 2 replicates of the assay. A 2-way nested ANOVA model was used to estimate the variance components of this substudy. Following the CLSI EP21 guideline,10 the total error was used to evaluate the reproducibility, which includes the lot-to-lot variability of the 2 reagent kits, the operator-to-operator variability, and day-to-day variability (Supplementary material, Supplemental Digital Content 1, http://links.lww.com/PDM/A47).
A 1-way ANOVA model was used to evaluate the difference in the mean risk scores among 5 concentration levels (5.5 ng/μL, 7.5 ng/μL, 9.5 ng/μL, 11.5 ng/μL, and 13.5 ng/μL, with each concentration assessed in triplicate) in the RNA concentration study. A mixed model was used to test whether the percentage mix of tumor and nontumor tissue (30%, 50%, 70%, and 90%) has an effect on the risk score in the tumor concentration study (Supplementary material, Supplemental Digital Content 1, http://links.lww.com/PDM/A47).
Development and Validation of Gene Expression–based QC Classifier for Evaluating a New Lot of the Amplification Reagent (NuGEN WT Ovation FFPE cDNA Amplification V2 Kit)
The training set for this classifier contained 28 samples processed with a lot (lot 1176-B) that was outside-assay specifications and 57 samples processed with 2 lots (lots 5147-A and 5147-B) that were within-assay specifications. All of the above assays were performed on FFPE sections of the HCT116 cell line. The distinction between the lots that are within-assay specifications and the lots that are outside-assay specifications was made based on a previously observed lot effect on the individual gene expressions using principal components analysis. A lot effect was also observed on the final risk scores.
To develop a classifier based on gene expression profiles, the normalized gene expressions of all 634 probes that make up the assay signature were considered. Ten genes that overexpressed and 10 genes that underexpressed under the outside-assay specifications lot when compared with the within-assay specifications lot were selected. These 20 genes also produced a clear separation between lots on bivariate scatter plots. It was expected that the separation be preserved in higher dimensions. Canonical discriminant analysis was then performed on the set of 20 gene expressions and the lot class.
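The probe selection step might look like the sketch below. The text does not state the ranking criterion, so this example assumes probes are ranked by the difference in mean normalized expression between the outside-specification and within-specification lots; the function name and the dictionary input format are hypothetical. The canonical discriminant analysis itself is not reproduced here.

```python
def select_lot_sensitive_probes(within, outside, n_each=10):
    """Pick the most over- and under-expressed probes in the outside lot.

    within, outside: dicts mapping probe ID to a list of normalized
    expression values from arrays run with the respective lot.
    Assumption: ranking by difference in means (outside minus within).
    Returns (overexpressed, underexpressed), each most extreme first.
    """
    def mean(values):
        return sum(values) / len(values)

    diffs = {p: mean(outside[p]) - mean(within[p]) for p in within}
    ranked = sorted(diffs, key=diffs.get)    # most negative difference first
    under = ranked[:n_each]
    over = ranked[-n_each:][::-1]            # most positive difference first
    return over, under
```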
Two sets of arrays from FFPE HCT116 cell line samples were used to test the performance of the classifier. The first set (n=77) was used for both testing and learning purposes. This was a learning set in the sense that it was used to help establish the final cutoff for class assignment. The classifier performed as expected on the training data, and because the within-assay specifications lot and the outside-assay specifications lot data were well separated, a unique classifier cutoff could not be determined. This set consisted of normalized gene expression data from 77 arrays of FFPE HCT116 cell line samples compiled over several years and processed with one of 11 different amplification reagent lots. Five of the 77 experiments were processed with lot 1176-B, which was the outside-assay specifications lot in the training set. The remaining 72 arrays were considered to originate from within-assay specifications lots. Several cutoffs were investigated:
The cutoff (−3.9) that minimizes the total number of misclassified replicates in the combined training and test sets (the misclassification error rate was 1%).
The cutoff (−1.7) that minimizes the number of misclassified replicates in 2-fold cross-validation using the training set only (the misclassification error rate was 0.4%).
The cutoff (−2.0) that minimizes the number of misclassified lots in the combined data using the “4-6” rule (the misclassification error rate was 8%).
The “4-6” rule is taken from a common acceptance/rejection rule found in the QC literature.11 When running 6 replicates of a process, at most 2 are allowed to fail to deem the process acceptable. This approach was reasonable for this case, given the observed intralot variability. It is of note that the 5 outside-assay specifications lot experiments from the testing set received scores similar to those assigned to the outside-assay specifications lot group of the training set and were correctly classified by all 3 cutoffs. Overall, the last cutoff of −2.0 was selected for having a very low misclassification rate in both the training and testing sets.
The final QC criterion was formulated as follows: process 6 arrays using FFPE HCT116 cell line sample. Evaluate the gene expression canonical variable for each array. If the gene expression canonical variable is above −2.0 for at least 4 of the 6 arrays, pass the new lot. Otherwise, fail the lot.
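The pass/fail rule translates directly into a few lines of code. The function name is hypothetical, and treating a canonical variable exactly equal to −2.0 as not passing is an assumption, since the text says only "above −2.0".

```python
def passes_lot_qc(canonical_scores, cutoff=-2.0,
                  required_passes=4, n_arrays=6):
    """Apply the 4-of-6 rule to one new amplification reagent lot.

    canonical_scores: gene expression canonical variable for each of the
    6 FFPE HCT116 arrays processed with the new lot.
    """
    if len(canonical_scores) != n_arrays:
        raise ValueError("expected exactly %d arrays" % n_arrays)
    n_pass = sum(1 for s in canonical_scores if s > cutoff)
    return n_pass >= required_passes


# Hypothetical canonical scores for two lots.
print(passes_lot_qc([0.5, 1.2, -0.3, 0.8, 1.1, 0.2]))
print(passes_lot_qc([-3.1, -2.8, -4.0, -2.5, -3.3, 0.4]))
```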
After the final QC criterion was determined, a second set of 12 arrays was used solely for testing purposes. This set comprised 6 FFPE HCT116 cell line replicates processed with the known within-assay specifications lot and 6 replicates processed with the known outside-assay specifications lot. The classifier correctly classified all 12 replicates.
The reportable range of this assay was estimated to be [−1.129, 1.414].
Following CLSI guideline EP9-A2, “Method Comparison and Bias Estimation Using Patient Samples: Approved Guideline-Second Edition,” and the CLSI guideline GP29-A2, “Assessment of Laboratory Test When Proficiency Testing is Not Available,” the accuracy of the assay was assessed by comparing the results of replicates within the production lab and by comparing results from the production lab with those generated by a reference lab.12
Forty-three samples were included in the study. Deming regression was used to compare the technical replicates performed within the production lab (Precision Therapeutics Inc., Pittsburgh, PA) (Fig. 1). The final sample size for this comparison was reduced from 43 to 39 because of QC failures. Deming regression was also used to compare the risk scores generated in the production lab with those generated from the same samples in the reference lab (Almac Diagnostics Ltd., Craigavon, Northern Ireland, UK) (Fig. 2). The final sample size for this comparison was reduced from 43 to 36 because of QC failures. In both comparisons, the intercept was not significantly different from 0, and the slope was not significantly different from 1. Therefore, there was no evidence of constant or proportional bias, either between the replicates or between the 2 labs.
The Pearson correlation and the high-risk/low-risk categorical percent agreement after dichotomizing the risk scores using the predetermined risk category cutoff (0.4377) between the 2 replicates in the production lab were 0.72 (Fig. 1; 95% CI, 0.52-0.84) and 79% (Table 1; 95% CI, 64%-91%), respectively. The correlation and the categorical agreement between the 2 labs were 0.69 (95% CI, 0.47-0.83; Fig. 2) and 78% (Table 2; 95% CI, 61%-90%), respectively.
The observed range of the risk scores was [−0.248, 0.895] in the production lab and [−0.003, 0.904] in the reference lab. Both ranges were approximately half of the reportable range of the assay, which was calculated to be [−1.129, 1.414]. This narrower range and the fact that the majority of the data points centered around the cutoff of 0.4377 contributed to the low categorical agreement and the low Pearson correlation coefficient between the 2 labs. The strength of the correlation is likely affected by both the sample range and the assay variability because of measurement error. To this end, a simulation study was conducted to assess what the percent of agreement and the correlation coefficient would be if the sampled risk scores had covered the entire reportable range of the assay. In this simulation, the measurement error variability of the assay was assumed to be the same as that observed in the accuracy study.
The results of the simulation are provided in Table 3. The last row corresponds approximately to the range observed in the accuracy study. Keeping the analytical SDs fixed, wider data ranges yielded larger correlation coefficients. For example, when the full reportable range was sampled, the concordance was above 95%. Note that the uniform selection of τi (see the definition of τi in the Supplementary material, Supplemental Digital Content 1, http://links.lww.com/PDM/A47) also leads to a higher correlation than the one observed for the sample range in the accuracy study [−0.003, 0.904]. Thus, had the sample range been wider, the estimated correlation and concordance would have been larger.
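The effect of the sample range on correlation and concordance can be reproduced qualitatively with a small simulation. This is a sketch, not the model in the Supplementary material: it assumes true scores drawn uniformly over a range, independent Gaussian measurement error on each of 2 measurements, and an illustrative error SD of 0.13. It is not intended to reproduce the values in Table 3.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_agreement(score_range, error_sd, cutoff=0.4377,
                       n_samples=5000, seed=0):
    """Return (correlation, percent agreement) for 2 noisy measurements."""
    rng = random.Random(seed)
    lo, hi = score_range
    first, second = [], []
    for _ in range(n_samples):
        true_score = rng.uniform(lo, hi)       # assumed uniform true scores
        first.append(true_score + rng.gauss(0.0, error_sd))
        second.append(true_score + rng.gauss(0.0, error_sd))
    agree = sum((a > cutoff) == (b > cutoff)
                for a, b in zip(first, second)) / n_samples
    return pearson(first, second), agree


narrow = simulate_agreement((-0.003, 0.904), error_sd=0.13)  # observed range
wide = simulate_agreement((-1.129, 1.414), error_sd=0.13)    # reportable range
print(narrow, wide)
```

With the error SD held fixed, the wider range yields both a higher correlation and a higher agreement, because a smaller share of samples lies close to the cutoff.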
As the 634-probe assay was originally developed and clinically validated within the reference lab (Almac Diagnostics Ltd, Craigavon, UK), RNA from the FFPE HCT116 cell line was included in each run for quality control. The range of the risk score for in-control studies was [0.4900, 1.4000], based on 72 replicates of the FFPE HCT116 assays run in the reference lab during the development of the 634-probe assay. We used the same cell line and the same quality control criteria to select in-control assays for the analytical performance characteristics. In the production lab, all 62 FFPE HCT116 cell line samples compiled across all analytical validation studies fell within this range; therefore, all the included analytical validation studies were deemed in-control. The analysis of the analytical performance characteristics described below was based on these in-control studies.
The first substudy in the precision study was designed to evaluate the variation of the risk score because of day, operator, and measurement error; the second substudy was designed to evaluate the variation due to reagent lots. Other factors such as production sites and equipment lines were not addressed as this was a one-lab, one-equipment-line environment. The total error due to operator, day, and reagent lot was used to evaluate the assay reproducibility. The measurement error was used to evaluate the assay repeatability.
On the basis of diagnostic plots, the errors showed a homogeneous pattern across the samples and the risk scores. The repeatability error was estimated to be [equation not shown] based on the first substudy. The day-to-day variability estimate was [equation not shown], the operator-to-operator variability estimate was [equation not shown], and the Encore and WT Ovation kit lot-to-lot variability were both estimated to be [equation not shown].
Moreover, the study found no evidence of significant variability because of days (P value=0.6) or reagent lots (Encore kit P value=0.6, WT Ovation kit P value=0.5). The reproducibility error of the assay was estimated to be 0.0548.
According to the CLSI/NCCLS guideline EP17, the lower limit of quantitation (LLoQ, for quantitative assays) can be used to define assay analytical sensitivity.13 For the 634-probe gene expression assay under evaluation in this study, extracted total RNA was the analyte and all signature developmental work was performed using a total RNA input concentration of 9.5 ng/μL. The sensitivity study was conducted using a range of total RNA concentrations (5.5 ng/μL, 7.5 ng/μL, 9.5 ng/μL, 11.5 ng/μL, and 13.5 ng/μL, with each concentration assessed in triplicate) to demonstrate that the concentration of 9.5 ng/μL is equal to or above the LLoQ. The results showed that there is no evidence of a concentration level effect on the risk scores (1-way ANOVA F-test P-value=0.3) among the 5 tested concentrations (Fig. 3).
An additional sensitivity study was conducted to ensure that assay results do not change when the ratio of tumor and nontumor tissue varies within the required limit. Five samples with risk scores that span both high and low risk were selected. Each sample was then assayed over a range of tumor content in the sample: 30%, 50%, 70%, 90%, and each tumor content percentage was evaluated in triplicate. This study found no evidence that the tumor content level has an effect on the risk scores (P value=0.6; Fig. 4).
Assay Quality Control
To ensure the strong analytical performance of this 634-probe gene expression assay, numerous quality control steps have been implemented into the assay process to augment those already in place following development of the assay.7 These quality control measures will continue being employed during commercial clinical sample testing. A few examples are included below.
A pathologist-guided manual macrodissection process for removal of normal epithelial cells, normal nonepithelial material, and any necrotic tissue is included as a step in the assay. This step ensures that there is a minimum of 50% tumor cellularity and no residual necrotic tissue in the FFPE tissue samples for downstream RNA extraction and gene expression analysis. During the development and initial validation of the assay, a minimum of 50% tumor cellularity was one of the sample inclusion criteria.7 However, a tumor and nontumor ratio study was still undertaken as part of the analytical sensitivity study, which demonstrated that samples with tumor cellularity ranging from 30% to 90% generate similar risk scores and categorical risk predictions. Nevertheless, the minimum required tumor cellularity percentage at the end of the macrodissection process will remain at 50% to account for subjectivity in tumor cellularity determination and to conservatively ensure assay performance.
To monitor environmental amplicon contamination, a no-template control (NTC), in which nuclease-free water (without any RNA or DNA) is added into the cDNA amplification reaction, is included. This control is included in each run of the assay. If the NTC is above the predetermined QC limit (50 ng/μL), all the patient samples included in the same run will be failed. A thorough decontamination of the lab will be undertaken, followed by re-run of the NTC control to confirm the success of the decontamination. This decontamination and retesting process will continue until the NTC control passes the QC criterion. New cDNA amplification and gene expression profiling for the failed patient samples will then be performed, if sufficient patient material exists.
In addition, high-risk and low-risk controls have been included in each run of the assay. The high-risk control uses the total RNA extracted from sections of FFPE HCT116 cell line blocks. Sixty-two replicates from assay runs in the production lab that were deemed in-control generated risk scores in the range of 0.6868 to 1.3464, a consistent high-risk prediction by the assay. This range was then used as the control limits for FFPE HCT116 cell line QC samples in the production lab. If a FFPE HCT116 cell line QC sample result falls outside this established range, all test samples within this run will be failed and retested. The low-risk control was Universal Human Reference RNA (UHR) (Stratagene, Cedar Creek, TX). The UHR is composed of the total RNA from 10 human cell lines, and it is designed to be used as a reference for microarray gene-profiling experiments.18 Fifty-two UHR samples at an input of 2 ng from all the in-control analytical validation studies generated risk scores within the range of −0.0997 to 0.3315, a consistent low-risk prediction by the assay. This range was then used as the control limits for UHR QC samples in the production lab. If a UHR result falls outside this established range, all test samples within this run will be failed and retested. In the commercial assay process, the risk scores from the positive controls and negative controls from each commercial sample batch run are plotted on a Shewhart control chart and are monitored regularly. In the event of detection of an abnormality in the monitoring process, the cause(s) of the abnormality will be investigated and appropriate corrective actions will be undertaken to ensure consistent assay performance.
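The run-level control checks described in this section can be collected into one function. The limits are those stated in the text; the function name, the return format, and the treatment of values exactly at a range boundary as passing are assumptions.

```python
NTC_LIMIT_NG_PER_UL = 50.0          # no-template control QC limit
HCT116_LIMITS = (0.6868, 1.3464)    # high-risk control risk score limits
UHR_LIMITS = (-0.0997, 0.3315)      # low-risk control risk score limits


def failed_run_controls(ntc_ng_per_ul, hct116_score, uhr_score):
    """Return the names of failed controls; an empty list means the run passes.

    Any failed control means all patient samples in the run are failed.
    Assumption: values exactly on a limit are treated as passing.
    """
    failures = []
    if ntc_ng_per_ul > NTC_LIMIT_NG_PER_UL:
        failures.append("NTC")
    if not HCT116_LIMITS[0] <= hct116_score <= HCT116_LIMITS[1]:
        failures.append("HCT116")
    if not UHR_LIMITS[0] <= uhr_score <= UHR_LIMITS[1]:
        failures.append("UHR")
    return failures


# Hypothetical run: contaminated NTC, both risk controls in range.
print(failed_run_controls(62.0, 0.95, 0.10))
```

The Shewhart control chart monitoring mentioned above is a separate, longitudinal check and is not covered by this sketch.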
Furthermore, a QC process was also developed for evaluating the functionality and performance of a new lot of the amplification reagent (NuGEN WT Ovation FFPE cDNA amplification V2 kit) using a classifier that is based on the expressions of a subset of genes that make up the 634-probe assay signature. Even though every lot of the amplification reagent kit goes through the manufacturer’s own QC check before its release to the market, we have encountered a reagent lot with performance outside the test specifications for this 634-probe assay. Therefore, a QC process was developed and implemented into the assay process for evaluating new lots of the amplification reagent kit. The newly developed QC criterion was applied to the selection of a new amplification reagent lot to be used in a clinical study. Two new lots were tested. One lot was failed after 5 of 6 cell line replicates received scores below −2.0 on the canonical variable. The second lot was passed after 6 of 6 cell line replicates had scores well above −2.0. To confirm this result, 3 more replicates were run using a different cell line stock. All 3 received scores above −2.0. As a result, this second lot was selected for use in future studies. This QC classifier will continue to be used to quality control future amplification reagent lots for the 634-probe assay.
In addition to the NuGEN amplification reagent, other critical reagents in the assay process such as the Encore Biotin Module, Agencourt RNA Clean XP beads (Beckman Coulter, Brea, CA), GeneChip Poly-A RNA Control kit (Affymetrix, Santa Clara, CA), High Pure RNA paraffin kit, and the assay control materials including the positive control material, RNAs from FFPE HCT116 cell line block, and the negative control material, UHR, undergo comprehensive, controlled lot-to-lot testing in the production lab to demonstrate that each new lot fulfills its intended performance before use in the production lab for the 634-probe prognostic assay.
A number of single-featured prognostic molecular markers have been under investigation in colon cancer, including loss of heterozygosity at 18q and presence of microsatellite instability.19 DNA microarray technology can measure more mRNA transcripts at once and may be able to capture the complex biology that underlies colon cancer recurrence better than single gene markers.20 Several prognostic signatures have been developed using DNA microarray or RT-PCR technology with the intention to identify the high-risk population that may benefit from adjuvant chemotherapy.21–28 However, some of these signatures are offered in the fresh-frozen tissue format21 or are inclusive of rectal cancer.23 The assay described herein has been clinically developed and validated using FFPE tissue from stage II colon cancer patients only. The assay has been validated to identify high-risk patients at a level of significance that is superior to other FFPE tissue-based stage II colon cancer prognostic signatures.7,24
Despite the complex nature of the assay process, a series of studies were developed to evaluate its analytical performance. Because this is a novel test that no other laboratory currently offers commercially, there is no well-established gold standard method or gold standard reference lab for comparison. The accuracy of the assay was therefore evaluated by both a comparison of internal replicates within the production lab and a split-sample comparison between the production lab and a reference lab. The reference lab (Almac Diagnostics Ltd) is a large, UK-based, CLIA-registered and CAP-accredited laboratory that specializes in clinical trials and biomarker development. The 634-probe signature was originally developed within their laboratory,7 and it is the only other laboratory in the world with the capability to perform this assay. Therefore, Almac Diagnostics was selected as the reference lab for the accuracy study. The current study demonstrates that there is good concordance and no evidence of bias between the production lab and the reference lab measurements, or between the technical replicates within the production lab. Further, through a simulation study, it was shown that a larger correlation coefficient and concordance would have been possible had the sampled specimens covered a wider risk score range.
The sensitivity study results demonstrated that neither the tumor percentage ranging from 30% to 90% nor the total RNA concentration level ranging from 5.5 to 13.5 ng/μL had any effect upon the risk scores. Therefore, the requirement of minimal 50% tumor cellularity at the end of the tissue macrodissection and the input total RNA concentration of 9.5 ng/μL currently instructed by the assay are conservative and appropriate amounts of analyte.
The repeatability and reproducibility error estimates are reported in risk score units rather than as a coefficient of variation percent (CV%), because the sample average of the risk scores in the low-risk group could be very close to 0, which makes the CV% unsuitable for quantifying assay repeatability and reproducibility. To judge whether the repeatability error estimate of [equation] and the reproducibility error estimate of [equation] (for the definition of [equation], see Supplementary material, Supplemental Digital Content 1, http://links.lww.com/PDM/A47) were acceptable, these values were compared with the repeatability and reproducibility obtained from the accuracy study. The accuracy study was used as a benchmark because it produced an acceptable level of concordance (ie, the assay has acceptable technical error to produce consistent clinical results). In the precision study, the variability due to operators accounts for the entire observable variance under reproducibility. Hence, when comparing reproducibility between the 2 studies, we compare the variability due to operators obtained in each study. However, because the accuracy study was not designed to estimate assay reproducibility and repeatability, in retrospect it allows only the estimation of a variance component in which the variability due to operators and the variability due to replicate error are confounded (Supplementary material, Supplemental Digital Content 1, http://links.lww.com/PDM/A47). This standard error was estimated to be 0.13. We compare this estimate with the combined reproducibility and repeatability obtained in the precision study, which was [equation]. The absolute difference between the 2 estimates does not indicate that the precision study had considerably more variable measurements. With the accuracy study giving a lab-to-lab concordance of 78% (95% CI, 62%-90%), the repeatability and reproducibility estimates obtained in the precision study were deemed acceptable. Note that all 4 concordance estimates between the individual lab replicates fall within the above confidence interval. We report the concordance between the averaged replicates instead, because averaging diminishes the effect of within-lab measurement error on the concordance estimate.
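The reason for reporting error in risk score units rather than as CV% can be illustrated with a short sketch. The replicate values below are hypothetical and are not data from the study; the two sets have the same spread and differ only in their mean.

```python
import statistics


def sd_and_cv_percent(scores):
    """Return (sample SD, CV%) for a list of replicate scores."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    # CV% divides by the mean, so it is undefined at 0 and unstable near 0.
    cv = float("inf") if mean == 0 else 100.0 * sd / abs(mean)
    return sd, cv


# Hypothetical replicate risk scores (illustration only).
near_zero = [-0.08, 0.02, 0.12, -0.03, 0.07]  # mean close to 0
away_from_zero = [s + 1.0 for s in near_zero]  # same spread, mean close to 1

sd_near, cv_near = sd_and_cv_percent(near_zero)
sd_away, cv_away = sd_and_cv_percent(away_from_zero)

print(f"near zero:      SD={sd_near:.3f}, CV%={cv_near:.0f}")
print(f"away from zero: SD={sd_away:.3f}, CV%={cv_away:.1f}")
```

The SD in score units is the same for both sets, while the CV% is far larger for the set whose mean is near 0, even though the measurement error is identical.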
Quality control measures have also been built into the assay process. Examples include tissue processing steps that ensure the minimum tumor cellularity required to generate reliable and consistent risk scores and risk category predictions, as well as high-risk and low-risk controls and a nontemplate control in the assay process. As new technologies and instrumentation become available, more and more biomarker assays employ multiple reaction steps with multiple reagent components and controls. Quality control of these reagent components, as well as of the control materials, is essential to ensure consistent assay performance. For this purpose, comprehensive lot-to-lot testing procedures for the critical reagents (including the preanalytical and analytical reagents and the positive and negative control materials) and regular monitoring of positive and negative control performance over time have been incorporated into the commercial assay process for the 634-probe prognostic assay. For example, a gene expression–based classifier was developed to assess the quality of the NuGEN WT Ovation FFPE cDNA amplification kit, a key reagent in the assay process. This QC classifier will continue to be used to assess the quality of future WT Ovation FFPE cDNA amplification V2 kits used for commercial patient sample profiling and for clinical validation studies of the 634-probe prognostic assay. A more economical 2-stage QC classifier that considers both the Bioanalyzer profile and microarray gene expression is currently in development.
In summary, the analytical studies presented here demonstrate that the 634-probe gene expression assay is robust and reproducible and is suitable for use as a clinical prognostic biomarker to identify patients at higher risk of recurrence after surgery for stage II colon cancer.
The authors thank Dr Stacey Brower for her review and proof-reading of the manuscript.
© 2014 Lippincott Williams & Wilkins, Inc.