# Biomarker Guidelines for High-Dimensional Genomic Studies in Transplantation: Adding Method to the Madness

^{1} Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA.

^{2} Department of Surgery, University of Virginia, Charlottesville, VA.

^{3} Department of Transplant Nephrology, Mayo Clinic, Phoenix, AZ.

^{4} Comprehensive Transplant Center, Feinberg School of Medicine, Northwestern University, Chicago, IL.

^{5} Department of Surgery, Mayo Clinic, Phoenix, AZ.

Received 17 November 2016. Revision received 12 December 2016.

Accepted 16 December 2016.

Professor Daniel R. Salomon passed away on November 10, 2016.

In tribute to a true scientist, brilliant mind, wonderful friend and collaborator.

*Daniel R. Salomon (1953-2016).*

Correspondence: Bruce Kaplan, Departments of Nephrology and Transplantation, Mayo Clinic 13400 E. Shea Blvd., Scottsdale, AZ 85259.

The rapid expansion of high-dimensional “omic” technologies and their rapidly falling costs have ushered in the era of precision medicine. However, large data sets are required for the training, discovery, and validation of molecular signatures. These requirements create significant challenges as data sets are translated into diagnostic and predictive signatures of important clinical phenotypes, causing some to conclude metaphorically that “a thousand dollar genome may require a million dollar interpretation.”^{1} The pace of innovation and growth in this space has not allowed sufficient time to generate a set of standardized and established approaches to the bioinformatic needs necessary to process genetic/genomic data toward validated signatures. Instead, it seems that a robust literature on bioinformatics is emerging in parallel with biomarker discovery, leading to published claims that might not have been vetted sufficiently through accepted bioinformatics approaches. As a result, there appears to be some difficulty reaching consensus around the validation of published biomarkers. This issue is particularly relevant to the peer review process, in which reviewers frequently disagree about the clinical validity and true predictive performance of studies, adding to public skepticism and accusations of “hype” versus “hope.”

Although many other related fields have rigid, established criteria for validity, such as *P* values less than 10^{−8} in Genome Wide Association Studies or False Discovery Rate corrected values in gene expression studies, biomarker discovery and validation have benefited from an apparent free rein, with studies accepted despite having too few samples for high-dimensional data. This enthusiastic support for meaningful biomarkers has resulted in recent concerns. For instance, the reductionist approach of using smaller feature sizes for diagnosis and/or prediction blatantly ignores the “curse of dimensionality” when it is applied to high-dimensional data comprising the simultaneous measurement of thousands of genes/proteins. The very nature of high-dimensional omics data is that there are vastly more measurable features (probe sets/genes/proteins) than there are available samples, such that a reductionist analysis of feature selection is prone to substantial type I error (where the null hypothesis is true and you reject it) and does not appropriately account for individual differences within a patient cohort. Unfortunately, well-powered studies that follow proper statistical recommendations are infrequent in general, and rare in transplantation, where well-intended studies have fallen short of these criteria. Moreover, the “gold standard” continues to be the biopsy despite its well-accepted shortcomings,^{2} including sampling, interobserver variability, and the reliance on nonalloimmune-specific patterns of inflammation and injury.^{3}
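To make the scale of the type I error problem concrete, the following minimal Python sketch computes the expected number of false positives when thousands of features are tested at an uncorrected threshold, and applies the Benjamini-Hochberg step-up procedure as one common False Discovery Rate correction. The function names and the figure of 20 000 features are illustrative assumptions, not values taken from any study discussed here.

```python
def expected_false_positives(n_features, alpha):
    """Expected count of false positives if every null hypothesis is true."""
    return n_features * alpha


def bonferroni_threshold(n_features, alpha):
    """Per-feature P value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_features


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of p_values declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose sorted P value satisfies p_(k) <= (k / m) * q.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Hypothetical array with 20 000 features tested at an uncorrected 0.05.
    print(expected_false_positives(20_000, 0.05))
    print(bonferroni_threshold(20_000, 0.05))
```

With 20 000 features and an uncorrected threshold of 0.05, roughly 1000 features would be expected to pass by chance alone, which is why a corrected threshold is needed before any feature is treated as a candidate biomarker.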

*“There are two kinds of statistics, the kind you look up and the kind you make up.” Rex Stout, writer, intellectual, and activist.*

Thus, although we and others have outlined the dire need for genomic biomarkers in organ transplantation,^{1,4,5} there are clear risks associated with candidate biomarkers as potential surrogates of clinical phenotypes without a thorough understanding of, and a conceptual framework to guide, statistically sound discovery and validation in the context of high-dimensional data. The objective of this commentary is to provide a set of guiding principles for the translation of high-dimensional genomic data into discovery and validation approaches that can result in clinically relevant biomarkers.

### The Need for Biomarker Study Guidelines

Statistical models are used often to estimate the probability of the presence (ie, diagnosis), or the development (ie, prognosis) of a particular disease (ie, clinical phenotype). These methodologies and the reporting of such models have been extensively catalogued in fields including diabetes and cancer.^{6-8} A growing number of publications have also outlined the need to include biomarkers in clinical guidelines, based on the recommendations of several working groups.^{9-11}

This enthusiasm for diagnostic and predictive molecular phenotyping has given rise to a proliferation of published reports, fueled by the increased feasibility of generating high-dimensional genetic/genomic data that has resulted from the rapid evolution of technical platforms coupled with lower costs. Unfortunately, our ability to validate these findings, even for the purpose of peer review, has been severely compromised by the lack of standardized bioinformatic methods to process the mounds of data that these technologies can generate. In fact, it is often difficult to determine whether a true “classifier” has been discovered or whether it is simply an artifact of spurious data mining methodologies. There is a growing concern that some bioinformatic approaches may represent overoptimistic models of prediction. For example, although mass spectroscopy studies had reported 100% sensitivity, leading to the rapid development of the Ova-Check blood test, independent data analyses eventually uncovered major flaws with data and study designs.^{12,13} In another well-publicized example, a study claimed that microarray signatures derived from human cancer cell lines predicted in vitro sensitivity or resistance to a particular drug, yet the signatures proved unable to predict in vivo chemotherapeutic response to different drugs.^{14} Independent analyses exposed several flaws in the methodology, including the lack of independence between the training and test sets.^{15,16} The fact that test data were able to alter the model resulted in findings that could not be reproduced by others. This type of outcome forms the basis for some of the skepticism around molecular biomarkers.
Underlying this problem is the acknowledged fact that the rules to assess the validity of biomarkers are both underdeveloped and not routinely applied in clinical studies^{17-19}; as a result, many published studies are not adequately scrutinized for their overinflated claims of validity.^{4} Such studies tend to “overfit” discovery data sets and neglect to properly assess the reproducibility and generalizability of classifiers. At a more basic level, the entire concept of classifier validation is sometimes misinterpreted. It is also important to reiterate that the rationale behind validation is not to (1) validate the exact genes chosen from the training set in the test set or (2) retrain the discovery cohort genes in the validation cohort data. Validation in its true sense requires locking or freezing the equations derived in the discovery cohort, without any change in the set parameters, and evaluating their performance in the validation cohort. An incorrect method observed in many studies is to perform discovery on one platform and then use a “reductionist” approach that validates smaller subsets of the original genes on a different technology platform, thereby adding biases such as batch effects and normalization nightmares. Finally, a review process in which editors and reviewers fail to insist on internal and external validation of classifiers after the algorithms have been locked down easily explains the inability to reproduce these published reports.

### Biomarker Discovery and Validation Strategies

Studies that have a “low” statistical power (due to small sample sizes, small effect sizes, or both) also have a very low likelihood that a statistically significant finding actually reflects a true surrogate or biological effect. A straightforward safeguard is to assess the statistical power of a given study, a priori, using a power calculation. Any study should ideally calculate the power using either the existing literature to estimate the desired size of effect (or event rate) or, if available, data from a “pilot study.” If the study is underpowered due to time, sample accessibility, or financial constraints, this needs to be clearly acknowledged during the interpretation and publication of such results.
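As an illustration of an a priori power calculation, the sketch below uses the standard normal-approximation formula for comparing the means of two equally sized groups on a single feature. The function name, the default significance level, and the default power are our assumptions; a real omics study would also need to tighten `alpha` to account for the number of features tested.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). This uses the
    normal approximation, so it slightly underestimates the n a t test needs.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = standard_normal.inv_cdf(power)           # quantile for desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    # A moderate effect (d = 0.5) at alpha = 0.05 and 80% power.
    print(samples_per_group(0.5))
    # The same effect with alpha divided across 20 000 hypothetical features.
    print(samples_per_group(0.5, alpha=0.05 / 20_000))
```

For a single feature with a moderate effect size, this gives 63 samples per group; the requirement grows substantially once the significance threshold is corrected for thousands of simultaneous tests.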

Provided that adequate sample sizes based on power calculations are effectively achieved, the next challenge is to harness the power of a data set to successfully discover and validate candidate signatures or classifiers as clinically relevant molecular biomarkers. The ideal method to test the error of a diagnostic model would be to use all the data (from “n” samples), choosing the model that provides the lowest error rate on the entire population. However, the major limitation of such an approach is that, in real-world clinical studies, investigators do not always have access to an unlimited “n” of samples, which would ideally need to be in the thousands. The fundamental drawback of using such an approach on an underpowered “n” of samples is that the final chosen diagnostic model will almost always “overfit” the training data. The error rate from a single data set will be overly optimistic, and it is very common to get 100% correct classification accuracy on training or discovery data sets. Of note, the same model may perform dismally on external or “holdout” data sets that were not used for training purposes (independent/external validation). An additional constraint is imposed when using assays that have a complex set of parameters (classifiers), as is typically the case for high-throughput techniques such as microarrays and next-generation sequencing, where many RNA molecules are measured simultaneously. It has been shown that for small studies (n < 100), where features are selected from multidimensional assays composed of thousands of candidate biomarkers, the simple split-sample estimate can be flawed and extremely biased. Resampling alternatives include leave-one-out cross-validation (LOOCV) and fivefold or tenfold cross-validation (CV), which can be accurate but are subject to high variability, and bootstrap approaches (including the 0.632+ method). Resampling reduces apparent bias and usually gives better error estimates for a given diagnostic or predictive model.^{20}

### Split Sample Methods

The split sample approach is a common study design implemented with high-dimensional data. This method usually randomly divides the full data set into a training set and a test set. The training set is used to select features (classifiers) and model type while determining parameters and cutoff thresholds. Moreover, this approach selects an optimal model (algorithm), which is then fixed or “locked” before it can be applied to an external (independent) data set. The test set is a proportion of the full data that has been withheld until a single model is fully specified using the training set. The locked and fully specified model from the training set is applied to the expression profiles in the test set to predict class labels (phenotypes). However, a major flaw in implementing this methodology is that, occasionally, prediction models are improperly developed without explicitly separating the training data used for model development from the test data used for establishing the prediction accuracy.^{21} A recent study used real microarray data to simulate approaches to data splitting. Here, 2 different approaches evaluated the data splits: one based on simulations designed to qualitatively understand the relationships among data set characteristics and optimal split proportion, and another based on a nonparametric method that does not rely on distributional assumptions.^{22} This study used various data sets with sample sizes from 72 to 240 and concluded that the predictive accuracy of the full data sets improved when the proportion of samples **(%t)** allotted to the training set was increased. The authors suggest that the commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized data sets (n ≥ 100) with strong signals.
In addition, they also proposed an alternative nonparametric method using bootstrapping on the whole data set, concluding that the optimal proportion of cases for the training set tended to be in the range of 40% to 80% influenced by the conditions studied and most importantly by the biology.
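The split-sample workflow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the "model" is a single score with a cutoff placed midway between the class means, the helper names are ours, both classes are assumed to be present in the training set, and the two-thirds training fraction is only an illustrative default. The point is the order of operations: the cutoff is derived from training samples only, frozen, and then applied unchanged to the withheld test samples.

```python
import random


def split_indices(n_samples, train_fraction=2 / 3, seed=0):
    """Randomly partition sample indices into disjoint training and test sets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def fit_threshold(scores, labels):
    """Derive a cutoff from TRAINING data only: midpoint of the two class means."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def accuracy(scores, labels, threshold):
    """Apply a locked cutoff to samples and report the fraction classified correctly."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


if __name__ == "__main__":
    rng = random.Random(42)
    labels = [i % 2 for i in range(90)]                    # toy phenotypes
    scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]  # toy biomarker

    train_idx, test_idx = split_indices(len(labels))
    locked = fit_threshold([scores[i] for i in train_idx],
                           [labels[i] for i in train_idx])
    # The cutoff is not refit or adjusted after this point.
    print(accuracy([scores[i] for i in test_idx],
                   [labels[i] for i in test_idx], locked))
```

Any feature selection, normalization parameter, or cutoff that is re-estimated after the test samples have been seen breaks the separation that this design depends on.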

Other common data splitting methods include the resampling-based CV techniques that essentially use the whole data set. The K-fold CV is one suggested method which tries to improve the limitation of the split sample method by dividing the full data set into *k* subsets. For each iteration (1, 2,…*k*), 1 subset is used as the test set and the other *k-1* subsets are used to form a training set. The error calculated for the test set in each of the *k* iterations is then averaged to generate the estimated error for the full data set. The advantage of this method is that it matters less how the data get divided. Every data point gets to be in a test set exactly once, and gets to be in a training set *k-1* times. The primary disadvantage is the potential for large differences in the estimated error based on how the samples get distributed to the *k* subsets. Repeated K-fold CV is used to alleviate this problem of low reproducibility and high variability, but this approach can require substantial computational time and resources. Another disadvantage of this method is what most statisticians would believe is its strength, namely that all the examples in the data set are eventually used for both training and testing which leaves no independent validation data set(s) thus lacking reproducibility and generalizability. The LOOCV can be called “K-fold CV on steroids” because it takes the k-fold to its logical extreme. For example, in a data set of 100 samples, each iteration of the LOOCV uses 99 training samples and 1 test sample and the number of iterations will be 100. Therefore, it is natural that LOOCV tends to have a higher variance than other CV methods because the training sets in LOOCV have a significant overlap, making the estimates from different folds very dependent. 
The limitation of most re-sampling strategies is that they have been distorted for “fit-for-purpose” studies, resulting in biased estimates of prediction accuracy,^{23,24} leading to a deep distrust of such techniques by journals and reviewers who insist on validation on a sample data set that was not used for initial model development.
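The bookkeeping behind K-fold CV and LOOCV can be sketched as follows. This is a minimal index generator with names of our choosing; it assumes the samples have already been shuffled and that all model fitting, including feature selection, happens inside `error_fn` on the training indices only.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_error(n_samples, k, error_fn):
    """Average the test error returned by error_fn(train_idx, test_idx) over k folds."""
    errors = [error_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Tenfold CV on 100 samples: 10 iterations of 90 training and 10 test samples.
    for train, test in k_fold_indices(100, 10):
        print(len(train), len(test))
    # LOOCV is the special case k = n: 100 iterations of 99 training and 1 test sample.
    print(sum(1 for _ in k_fold_indices(100, 100)))
```

Setting `k` equal to the number of samples reproduces the LOOCV case described above, where each iteration trains on 99 samples and tests on 1.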

### Bootstrapping Techniques

There is strong evidence that using the split-sample approach, especially on smaller data sets with weaker signals (fold change), is inefficient for 2 reasons: (i) there is a substantial loss of estimation precision from developing a given predictive model in only a certain proportion of the data, and (ii) unless sample sizes are extremely large (>200), very little can actually be learned about robust and generalizable predictive models from a single split sample.^{25} The bootstrap technique, in contrast, is a method for estimating the distribution of a test statistic by resampling (with replacement) the full data set, thereby treating the full data as the “universe” for the purpose of evaluating the distribution of interest. Therefore, the bootstrap is a computational substitute in the absence of very large real data sets. In fact, the bootstrap is often more accurate in data sets with small sample sizes, but it cannot capture the complexity of larger data sets that add dimensions of both clinical and technical variability. Thus, although bootstrapping techniques represent a practical method for approximating real-world complex data sets, they are no substitute for independently collected cohorts of samples for validation. Modified bootstrap methods that have improved on CV performance, such as the 0.632+ bootstrap method, have also shown strong promise, outperforming other similar techniques.^{26} Like repeated K-fold CV, the bootstrap reduces variability by repeated iteration of the resampling algorithm. However, the technique also shares the disadvantage of being expensive in time and computational resources. Several other techniques, such as the Jackknife,^{27} Akaike's Information Criterion,^{28} and the Schwarz Bayesian Information Criterion,^{29} have been proposed.
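The resampling step at the heart of the bootstrap can be sketched as below. The helper names are ours, and the estimator shown is the plain 0.632 bootstrap, which weights the apparent (training) error and the out-of-bag error; the 0.632+ refinement mentioned above adds a further correction based on the no-information error rate and is not implemented here.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def expected_unique_fraction(n_samples):
    """Probability that a given sample appears at least once in a bootstrap draw."""
    return 1 - (1 - 1 / n_samples) ** n_samples


def bootstrap_632_error(apparent_error, oob_error):
    """Plain 0.632 estimator: weighted mix of apparent and out-of-bag error."""
    return 0.368 * apparent_error + 0.632 * oob_error


if __name__ == "__main__":
    rng = random.Random(0)
    in_bag, out_of_bag = bootstrap_split(100, rng)
    # Train on in_bag (duplicates included), test on out_of_bag, repeat many times.
    print(len(set(in_bag)), len(out_of_bag))
    print(expected_unique_fraction(100))
```

On average about 63.2% of the distinct samples land in each bootstrap training set, which is where the weights in the 0.632 estimator come from; the remaining out-of-bag samples serve as that iteration's test set.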

Although it is accepted in the statistical community that a nonparametric strategy for biomarker discovery using bootstrapping on the whole data set is sufficient for robust biomarker discovery, the reality in clinical practice is that manuscript and grant reviewers are likely to require some form of independent external validation. Moreover, there have been calls for a necessity to perform external validation of any classifier in an external, independent data set, different from the cohort/s where it was developed.^{30-32} This external validation is critical especially to establish generalizability of a prediction model for patients that have not been part of the cohort used for the discovery process. This aspect is particularly relevant for biomarker studies where overinflated accuracies are reported that fail to generalize when tested on another set of similar patients collected at a different time or site. Table 1 lists some of the elegantly carried out “omics” biomarker studies in kidney transplantation with a note of caution on their apparent limitations.

### The Relevance of “Prevalent Incidence”

The fundamental measures of diagnostic accuracy of a test are sensitivity and specificity. The sensitivity of a test quantifies its ability to correctly identify subjects with the disease, whereas the specificity measures its ability to correctly identify subjects without the condition. However, 2 other metrics are crucial, namely the positive predictive value (PPV) and the negative predictive value (NPV). Both PPV and NPV depend on sensitivity and specificity and also on the prevalent incidence of the disease or condition (phenotype). The PPV reflects the probability that the disease is present given a positive test result, and it tends to be higher when the disease is overrepresented in the test/training data set. When a disease/condition is rare (ie, low prevalent incidence), a greater specificity is needed to achieve a higher PPV. Conversely, when a disease is common (ie, prevalent incidence is high), a greater sensitivity is needed to achieve a higher NPV. For example, discovering a biomarker for rejection (assuming 10% prevalent incidence) using a training/discovery data set with 50% of samples from patients with rejection would inflate the PPV. Instead, the discovery data set should reflect the prevalent incidence of the phenotype (in this case rejection). Figure 1 illustrates the effect of disease prevalence on PPV and NPV when both sensitivity and specificity are fixed at 0.95. Increasing the disease prevalence will lead to an increasing PPV but a decreasing NPV for a diagnostic test. This is of consequence because in most study designs, researchers tend to have equal numbers of cases and controls in a roughly 50:50 split, leading to an overinflated presumption of the true prevalence. For example, the prevalence of clinical acute rejection (cAR), calculated using the flawed albeit accepted gold standard of histology, is well-established.
Similarly, recent large studies of the incidence of sub-cAR, which is more appropriate for an early diagnostic, range from 32% during the first 6 months^{35} to approximately 13% at a year posttransplant.^{36} Thus, equal numbers of cAR or sub-cAR subjects versus the baseline (ie, well-functioning transplant patients) in a study would naturally have a higher PPV if not adjusted for true prevalence. In fact, there are accepted calculations for adjustment of the PPV and NPV based on known prevalence to help equate the performance of a diagnostic test to real-world settings, and 1 formula is shown below:
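One commonly used form of this adjustment follows directly from Bayes' theorem. The notation is ours: $Se$ is sensitivity, $Sp$ is specificity, and $p$ is the true prevalence of the phenotype in the population where the test will be used.

$$
\mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}, \qquad
\mathrm{NPV} = \frac{Sp\,(1 - p)}{Sp\,(1 - p) + (1 - Se)\,p}
$$

A minimal Python sketch of the same calculation is given below. The function names are ours, and the prevalence values in the usage example are the illustrative figures mentioned in the text (10%, 13%, 32%, and a 50:50 study design), evaluated at the sensitivity and specificity of 0.95 used for Figure 1.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value at a given population prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value at a given population prevalence."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


if __name__ == "__main__":
    for prevalence in (0.10, 0.13, 0.32, 0.50):
        print(prevalence,
              round(ppv(0.95, 0.95, prevalence), 3),
              round(npv(0.95, 0.95, prevalence), 3))
```

With sensitivity and specificity both at 0.95, the PPV is 0.95 in a 50:50 case-control design but falls to about 0.68 when the prevalence is 10%, which is the size of the inflation that an unadjusted balanced study can introduce.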

## CONCLUSIONS

Credit has to be given to initiatives such as the “Enhancing the QUAlity and Transparency Of health Research network” that have strived to provide guidelines for the conduct and proper reporting of clinical, genomic, and diagnostic studies. This initiative proposed standardized guidelines such as Consolidated Standards of Reporting Trials, Standards for Reporting Qualitative Research, and the Standards for Reporting Diagnostic Accuracy statements from an impressive list of 339 standards that have vastly improved research principles over the last few years. Although the Standards for Reporting Diagnostic Accuracy guidelines are aimed at remedying incomplete reporting to reduce avoidable waste in biomedical research by the identification, critical appraisal, and replication of studies, there are no established and widely used guidelines for appropriate tools for the statistical analysis of such studies other than well-written reviews on the various statistical tools for discovery and validation of biomarkers.^{8,22,25,26} In Figure 2, we attempt to provide a graphical format of essential checklists as a minimum guideline for studies that are designed to discover and validate clinical biomarkers of transplantation. It is important to stress that newly discovered associations that appear inflated in discovery data sets (areas under the curve in the high 90s, 100% sensitivity, specificity, and so on) should be red flags that have to be scrutinized carefully. When investigators with underpowered sample and effect sizes make such claims, and the evidence is based on crossing a certain threshold of statistical significance or false discovery rate that satisfies only the given data set, this is an automatic recipe for inflated results. The results of studies such as these usually will not be generalizable in truly independent external cohorts and should be viewed with a skeptical eye.

It is now up to the transplantation community, and the editorial boards of journals that report biomarker research in transplantation, to better understand and use these guidelines for the proper conduct and evaluation of clinical trials and studies of diagnostic and prognostic biomarkers by ensuring that appropriate statistical rigor is applied to the analysis and interpretation of the voluminous amounts of data derived from high-dimensional strategies. This will help advance the field, rather than create further confusion and hype, both of which are sure to stifle the development of the next generation of diagnostics that will help inform therapeutics and clinical management.

## REFERENCES

*Kidney Int*. 2013;83:201–206.

*Transplantation*. 2007;84:671–676.

*Nat Rev Nephrol*. 2015;11:204–205.

*Proc Natl Acad Sci U S A*. 2007;104:15448–15453.

*PLoS One*. 2015;10:e0125045.

*Transplantation*. 2011;91:657–665.

*Am J Transplant*. 2013;13:2865–2874.

*J Clin Oncol*. 2016;34:1134–1150.

*PLoS Med*. 2014;11:e1001759.

*Lancet*. 2016;388:983–993.

*Bioinformatics*. 2004;20:777–785.

*BMC Bioinformatics*. 2003;4:24.

*Nat Med*. 2006;12:1294–1300.

*Ann Appl Stat*. 2009;3:1309–1334.

*Nat Med*. 2007;13:1276–1277; author reply 1277–8.

*Nat Clin Pract Oncol*. 2005;2:466–472.

*J Proteome Res*. 2005;4:1053–1059.

*Nat Rev Cancer*. 2004;4:309–314.

*Bioinformatics*. 2005;21:3301–3307.

*J Natl Cancer Inst*. 2007;99:147–157.

*BMC Med Genomics*. 2011;4:31.

*Proc Natl Acad Sci U S A*. 2002;99:6562–6566.

*J Natl Cancer Inst*. 2003;95:14–18.

*Stat Med*. 1996;15:361–387.

*J Am Stat Assoc*. 1997;92:548–560.

*Biometrika*. 1956;43:353–360.

*Cc/Eng Tech Appl Sci*. 1981:22.

*Ann Stat*. 1978;6:461–464.

*Brief Bioinform*. 2011;12:189–202.

*J Clin Oncol*. 2005;23:7332–7341.

*BMC Bioinformatics*. 2009;10:53.

*Am J Transplant*. 2012;12:2710–2718.

*Am J Transplant*. 2014;14:1164–1172.

*Am J Transplant*. 2006;6:747–752.

*J Am Soc Nephrol*. 2015;26:1721–1731.