New biomarkers exploring the cardiovascular system, kidney, central nervous system, inflammation, and sepsis are under the scrutiny of bioengineering companies, and we are witnessing a biomarker revolution similar to the imaging technique revolution.^{2} Remarkably, this revolution has already occurred for cancer drugs.^{1,3} Recommendations concerning the reporting of diagnostic studies, the Standards for Reporting of Diagnostic Accuracy (STARD) initiative, have been published recently,^{4} several years after the first recommendations concerning the reporting of randomized trials.^{5} However, these recommendations do not encompass all issues of this rapidly evolving domain.^{6}

#### Role of a Biomarker

In contrast, it is considered only as a severity biomarker in pulmonary embolism,^{9} whereas procalcitonin is considered both a diagnostic and a severity biomarker of infection.^{8} Biomarkers are often used for risk stratification; for example, blood lactate levels have been proposed for risk stratification of sepsis.^{14} However, the purposes of the diagnostic and prognostic settings differ markedly: in the diagnostic setting, the outcome (the disease), although unknown, has already occurred, whereas in the prognostic setting the outcome remains to be determined, can only be estimated as a probability or a risk, and its uncertain nature should be considered.^{7}

Assessment of usefulness^{15} mainly involves both characteristics of the test itself (cost, invasiveness, technical difficulty, rapidity) and characteristics of the clinical context (prevalence of the disease, consequences of the outcome, and cost and consequences of the therapeutic options).^{16} However, intervention studies are lacking for many novel biomarkers or give conflicting results for others.^{1,17} For example, procalcitonin has been advocated to guide the clinician in deciding the duration of antibiotic therapy,^{13} and genetic determinants of the metabolic activation of clopidogrel have been shown to modulate the clinical outcome of patients treated with clopidogrel after an acute myocardial infarction.^{12} Finally, a biomarker can be used as a surrogate endpoint in a clinical trial,^{1,18} but this issue is beyond the scope of this review.

#### The Bayesian Approach

*e.g.*, disease state, prognosis). In this regard, Bayesian statistical methods provide a powerful framework for updating existing information about the likelihood of the occurrence of some disease or prognosis. A comprehensive introduction to these methods is far beyond the scope of this review, but the interested reader is referred to one of the several textbooks on applying Bayesian statistical methods to medical problems.^{19}

*i.e.*, the ability of the test to discriminate between disease states) to adjust our prediction of the likelihood of the outcome. Stated simply, the predicted probability of a patient having the disease (posttest probability) can be calculated as^{20}: posttest probability = (pretest probability) × (predictive power of the evidence).^{21} In these examples, it can be seen how disease prevalence (pretest probability) is used in conjunction with the LHR (strength of evidence) to calculate an updated (posttest) probability of the disease.
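This updating rule can be sketched in a few lines of Python. Note that the exact Bayesian relation multiplies pretest *odds* (not probabilities) by the LHR; the probability form quoted above is a convenient approximation. The numbers below are illustrative, not drawn from any cited study.

```python
def posttest_probability(pretest_prob, lhr):
    """Update a pretest probability with a likelihood ratio (LHR).

    The exact Bayesian relation operates on odds:
        posttest odds = pretest odds * LHR,
    which is then converted back to a probability.
    """
    pretest_odds = pretest_prob / (1.0 - pretest_prob)
    posttest_odds = pretest_odds * lhr
    return posttest_odds / (1.0 + posttest_odds)

# Illustrative: a strong positive result (LHR = 10) raises a 20%
# pretest probability to roughly 71%.
p = posttest_probability(0.20, 10.0)
```

A test with LHR = 1 leaves the probability unchanged, which matches the "no information" case discussed later in the Likelihood Ratios section.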

#### Statistical Tools

##### Decision Matrix

Sensitivity is the ability to detect the disease in patients in whom the disease is truly present (*i.e.*, a true positive), and specificity is the ability to rule out the disease in patients in whom the disease is truly absent (*i.e.*, a true negative). Calculation of these indices requires knowledge of a patient's “true” disease state and a dichotomous prediction based on the biomarker (*i.e.*, disease is predicted to be present or absent) to construct a 2 × 2 contingency table. Table 2 displays how the frequency of predictions from a sample of patients can be used in conjunction with their known disease state to calculate sensitivity and specificity.

Diagnostic accuracy is the proportion of correct results among all tests performed (*i.e.*, the sum of true-positive and true-negative tests). Perhaps because it is the most intuitive index of diagnostic performance, diagnostic accuracy is sometimes reported as a global assessment of the test. However, the use of this index for this purpose is inherently flawed and produces unsatisfactory estimates in a range of situations, such as when the prevalence of the disease deviates substantially from 50%.^{22} It is therefore recommended that authors report more than just a single estimate of diagnostic accuracy.^{23} (Sometimes called the “regret,” defined as the utility loss because of uncertainty about the true state.)^{24} Interestingly, accuracy is actually a weighted average of sensitivity and specificity, weighted by the prevalence of the disease. It should be clear that these five indices (sensitivity, specificity, negative and positive predictive values, and accuracy) are partially redundant, because knowing three of them enables calculation of the rest.
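The indices above can be computed directly from the cells of the 2 × 2 decision matrix. The counts below are hypothetical; the final assertion checks the statement that accuracy is the prevalence-weighted average of sensitivity and specificity.

```python
def diagnostic_indices(tp, fp, fn, tn):
    """Classic indices derived from a 2 x 2 decision matrix."""
    n = tp + fp + fn + tn
    return {
        "sens": tp / (tp + fn),   # detect disease when truly present
        "spec": tn / (tn + fp),   # rule out disease when truly absent
        "ppv":  tp / (tp + fp),   # positive predictive value
        "npv":  tn / (tn + fn),   # negative predictive value
        "acc":  (tp + tn) / n,    # diagnostic accuracy
        "prev": (tp + fn) / n,    # prevalence in the sample
    }

ix = diagnostic_indices(tp=90, fp=30, fn=10, tn=70)
# Accuracy equals the prevalence-weighted average of sensitivity and specificity.
assert abs(ix["acc"] -
           (ix["prev"] * ix["sens"] + (1 - ix["prev"]) * ix["spec"])) < 1e-12
```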

##### Influence of Prevalence

*et al.*^{25} reported that a procalcitonin concentration of 1 ng/ml had a positive predictive value of 0.63 for predicting postoperative infection after thoracic surgery. However, in that study, the prevalence of infection was 16%. If a *post hoc* analysis were conducted restricting inclusion to patients meeting systemic inflammatory response syndrome criteria, the prevalence would have been 63% and the positive predictive value 0.90.^{26,27} These indices can be influenced by case mix, disease severity, or risk factors for the disease.^{27} For example, a biomarker is likely to be more sensitive among severe than among milder cases of the disease: the sensitivity of procalcitonin for diagnosing bacterial infection is greater in patients with meningitis than in patients with pyelonephritis.^{28}
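The dependence of the positive predictive value on prevalence follows directly from Bayes' rule, as a short sketch shows. The sensitivity and specificity of 0.90 used here are illustrative values chosen to reproduce the trend, not the operating characteristics of the cited study.

```python
def ppv(sens, spec, prev):
    """Positive predictive value from Bayes' rule."""
    true_pos = sens * prev
    false_pos = (1.0 - spec) * (1.0 - prev)
    return true_pos / (true_pos + false_pos)

# With sensitivity = specificity = 0.90 held fixed, PPV climbs with
# prevalence alone:
low = ppv(0.90, 0.90, 0.16)    # about 0.63 at 16% prevalence
high = ppv(0.90, 0.90, 0.63)   # about 0.94 at 63% prevalence
```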

##### Likelihood Ratios

The LHR quantifies the relative probability of a given test result in diseased *versus* nondiseased populations. Two dimensions of accuracy have to be considered: the LHR for a positive test (positive LHR) and the LHR for a negative test (negative LHR). One of the most interesting features of LHRs is that they quantify the increase in knowledge about the presence of disease that is gained through the diagnostic test. Thus, LHRs can also be regarded as Bayes factors, as shown by the following formulas: posttest probability of disease = (positive LHR) × (pretest probability of disease), and posttest probability of nondisease = (negative LHR) × (pretest probability of nondisease).^{29} Experts usually consider that tests with a positive LHR greater than 10 (or a negative LHR less than 0.1) have the potential to alter clinical decisions. Although it may be tempting to follow definitive rules of thumb for interpretation (such as those provided in table 3), we must primarily consider the clinical setting to determine what level of increased likelihood is clinically relevant to improving the management of patients.

Two properties deserve emphasis. First, when a test provides no information (*e.g.*, LHR = 1.0), the pretest probability equals the posttest probability. Second, the pretest probability greatly influences what can be learned from using even a very predictive biomarker (*e.g.*, LHR = 10.0): very high (or low) pretest probabilities result in smaller adjustments of expectations than less extreme ones.
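Both LHRs can be computed from sensitivity and specificity, and a quick numerical check illustrates the second property above: the same LHR of 10 moves a mid-range pretest probability far more, in absolute terms, than an extreme one. All values here are illustrative.

```python
def lhrs(sens, spec):
    """Positive and negative likelihood ratios from sensitivity/specificity."""
    positive = sens / (1.0 - spec)
    negative = (1.0 - sens) / spec
    return positive, negative

def update(pretest, lhr):
    """Posttest probability via the exact odds form of Bayes' rule."""
    odds = pretest / (1.0 - pretest) * lhr
    return odds / (1.0 + odds)

pos, neg = lhrs(sens=0.90, spec=0.91)   # pos = 10.0, neg ~ 0.11
# Absolute gain in probability for pretest probabilities 5%, 50%, and 95%:
gains = [update(p, pos) - p for p in (0.05, 0.50, 0.95)]
# The mid-range pretest probability gains the most.
```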

##### ROC Curve

##### Basics of the ROC Curve.

A ROC curve plots sensitivity against 1 − specificity for every possible cutoff value of the diagnostic test.^{30,31} In other words, the positive LHRs calculated at the various values of the diagnostic test can be plotted to produce a ROC curve; the ROC curve is thus a graphical presentation of the information contained in a table of LHRs. Graphically, the positive LHR is the slope of the line through the origin (sensitivity = 0; 1 − specificity = 0) and a given point on the ROC curve, whereas the negative LHR is the slope of the line through the point opposite the origin (sensitivity = 1; 1 − specificity = 1) and that given point.

The area under the ROC curve (AUC_{ROC}), also called the c statistic or c index, is equivalent to the probability that the biomarker is higher for a diseased patient than for a control and thus is a measure of discrimination. By convention, ROC curves should be presented above the identity line (fig. 2), which represents a test without any value, performing no better than chance. It is important to note that the following points belong to the identity line: sensitivity = 0.50 and specificity = 0.50, of course, but also sensitivity = 0.90 and specificity = 0.10, and sensitivity = 0.10 and specificity = 0.90. This makes clear that sensitivity cannot be interpreted without specificity. The AUC_{ROC} should be reported with confidence intervals (CIs) to allow statistical evaluation *versus* the identity line or statistical comparison *versus* other diagnostic tests (see Comparison of ROC Curves). Usually, biomarkers are considered to have good discriminative properties when the AUC is greater than 0.75 and excellent properties when it is greater than 0.90 (table 3). The ROC curve is a global assessment of test accuracy that makes no *a priori* hypothesis concerning the cutoff, is relatively independent of prevalence, and is a simple plot that is easily appreciated visually. However, the cutoff points and the number of patients are not typically presented (although a small sample size is easily detected by a jagged, bumpy ROC curve). Generating a ROC curve is no longer cumbersome, because most statistical software provides the calculation and display of the relevant parameters.

Besides the full AUC_{ROC}, another summary measure is the area under a portion of the curve (partial area) for a prespecified range of values. Interpreting the AUC_{ROC} is somewhat problematic because a substantial portion of the variance in this index comes from values of the biomarker that have no clinical relevance. One ROC curve may have a higher proportion of false positives than another in the region of clinical interest, and yet the two curves may cross, leading to different conclusions when they are compared on the basis of the entire area. Therefore, it is recommended that examination of the ROC curve be conducted in the context of the partial area, or of the average sensitivity over a range of clinically relevant false-positive proportions, in addition to the AUC_{ROC}.^{32}
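The interpretation of the AUC_{ROC} as "the probability that the biomarker is higher for a diseased patient than for a control" can be computed directly by comparing all diseased-control pairs (the Mann-Whitney interpretation), as in this minimal sketch with made-up biomarker values.

```python
import itertools

def auc_concordance(diseased, controls):
    """AUC as the probability that a diseased patient's biomarker value
    exceeds a control's; ties count as 1/2 (Mann-Whitney interpretation)."""
    pairs = list(itertools.product(diseased, controls))
    wins = sum(1.0 if d > c else 0.5 if d == c else 0.0 for d, c in pairs)
    return wins / len(pairs)

# Hypothetical biomarker values for four diseased patients and four controls:
diseased = [4.1, 5.0, 6.2, 7.8]
controls = [1.2, 2.0, 4.1, 3.3]
auc = auc_concordance(diseased, controls)   # 15.5 "wins" out of 16 pairs
```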

##### Comparison of ROC Curves.

Visual inspection of ROC curves allows a first comparison of their AUC_{ROC} and detection of situations where the curves cross. However, formal statistical testing is required to assess differences between the curves. Several different approaches are possible, and all must take into consideration the nature of the collected data. When the predictive value of a new biomarker is compared with one or more existing standards, two or more empirical curves are constructed from tests performed on the same individuals. Statistical analysis of differences between these curves must take into account the fact that each individual contributes two scores to the analysis. Most biomarker studies collect data that are paired (*i.e.*, measurements are correlated) in nature. Parametric approaches to these comparisons assume that there is a continuous spectrum of possible values of the biomarker for both diseased and nondiseased patients (generally true with biomarkers) and that the underlying distribution is Gaussian (normal). However, this assumption is often not tenable in biomarker studies. Despite this, paired parametric methods of ROC comparison, using the approach described by Hanley and McNeil,^{33} are often used to evaluate biomarkers. An alternative nonparametric paired method, described by DeLong *et al.*,^{34} is based on the Mann–Whitney U statistic. The two approaches yield similar estimates even in nonbinormal models.^{35}

Methods for comparing the AUC_{ROC} within a specific range of specificity for two correlated ROC curves have also been developed and may be worth considering for some biomarkers.^{36}
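The DeLong method cited above is the standard analytic approach; as a simpler illustration of why the pairing matters, the sketch below bootstraps a confidence interval for the difference between two correlated AUCs by resampling patients jointly, so that both biomarker values of a resampled patient travel together. This is a generic resampling sketch, not the cited method.

```python
import random

def auc(scores, labels):
    """Mann-Whitney AUC: P(diseased score > control score), ties = 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_auc_diff_ci(a, b, labels, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for AUC(a) - AUC(b).

    Patients are resampled jointly so that the pairing (both biomarkers
    measured in the same individuals) is preserved in every replicate."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        ix = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in ix]
        if 0 < sum(ys) < n:  # each replicate needs both classes
            diffs.append(auc([a[i] for i in ix], ys) -
                         auc([b[i] for i in ix], ys))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```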

##### Determination of Cutoff.

In some situations, we do not wish to (or cannot) privilege either sensitivity (identifying diseased patients) or specificity (excluding control patients), and thus the cutoff point is chosen as the one that minimizes misclassification.^{37} Two techniques are often used to choose such an “optimal” cutoff. The first (I) minimizes the mathematical distance between the ROC curve and the ideal point (sensitivity = specificity = 1) and thus intuitively minimizes misclassification. The second (J) maximizes the Youden index (sensitivity + specificity − 1) and thus intuitively maximizes appropriate classification (fig. 3).^{38} Interestingly, Perkins *et al.*^{39} present a sophisticated argument that the J point should be preferred, because I does not rely solely on the rate of misclassification but also on an unperceived quadratic term that is responsible for the observed differences between I and J.^{39}

Another approach explicitly incorporates costs.^{40} The researcher assigns a relative cost (financial or health cost, from the patient's, care provider's, or society's point of view) of a false-positive *versus* a false-negative result and considers the prevalence P; these elements can be combined to calculate a slope m: m = (false-positive cost/false-negative cost) × ([1 − P]/P), the operating point on the ROC curve being the one that maximizes the function (sensitivity − m × [1 − specificity]).^{15,41} Other methods include the ratio of the net cost of treating controls to the net benefit of treating diseased individuals, together with the prevalence.^{42,43}
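The three cutoff strategies above (distance to the ideal corner, Youden index, and the cost-based slope m) can be sketched over an empirical ROC as follows. The data and the cost ratio are hypothetical, and the "positive if value >= cutoff" convention is an assumption of this sketch.

```python
import math

def roc_points(values, labels):
    """(cutoff, sensitivity, specificity) using 'positive if value >= cutoff'."""
    pts = []
    for c in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= c)
        fn = sum(1 for v, y in zip(values, labels) if y == 1 and v < c)
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < c)
        fp = sum(1 for v, y in zip(values, labels) if y == 0 and v >= c)
        pts.append((c, tp / (tp + fn), tn / (tn + fp)))
    return pts

def best_cutoffs(values, labels, m=None):
    """Cutoffs by the I (distance), J (Youden), and cost-based criteria."""
    pts = roc_points(values, labels)
    # I: minimal Euclidean distance to the ideal point (sens = spec = 1)
    i_pt = min(pts, key=lambda p: math.hypot(1 - p[1], 1 - p[2]))
    # J: maximal Youden index (sens + spec - 1)
    j_pt = max(pts, key=lambda p: p[1] + p[2] - 1)
    out = {"I": i_pt[0], "J": j_pt[0]}
    if m is not None:
        # cost-based operating point: maximize sens - m * (1 - spec)
        out["cost"] = max(pts, key=lambda p: p[1] - m * (1 - p[2]))[0]
    return out
```

On a cleanly separated toy sample the three criteria agree; on real data they can differ, which is exactly why the chosen rule should be stated *a priori*.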

*a priori* decided; (3) the ROC curve should be provided, to allow readers to form their own opinion; and (4) the cutoff that maximizes the Youden index should also be indicated. It remains clear that a data-driven choice of cutoff tends to exaggerate the diagnostic performance of the biomarker.^{44} This bias should be recognized and probably affects many published studies.^{45,46} The absence of CIs around reported cutoffs probably reflects the fact that more sophisticated statistical methods are required to compute them. The principle of all these methods is to resample the studied population repeatedly, providing a large sample of different populations and thus a large sample of cutoff points, from which a mean (or median) cutoff with its 95% CI can be derived. Several resampling techniques can be used (bootstrap, jackknife, leave-one-out, n-fold sampling).^{47,48} In a recent study, Fellahi *et al.*^{49} used a bootstrap technique to provide medians and 95% CIs for the cutoff points of troponin Ic in patients undergoing various types of cardiac surgery, enabling comparison of these different cutoff points. Here again, reporting CIs enables the researcher and the reader to honestly communicate or interpret the values presented, taking the sample size into account.
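The resampling principle described above can be sketched as a percentile bootstrap around the Youden-optimal cutoff. The data are hypothetical, and the percentile method shown is only one of the possible resampling schemes mentioned in the text.

```python
import random

def youden_cutoff(values, labels):
    """Cutoff maximizing sensitivity + specificity - 1 ('positive if >= c')."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    best_c, best_j = None, -2.0
    for c in sorted(set(values)):
        sens = sum(v >= c for v in pos) / len(pos)
        spec = sum(v < c for v in neg) / len(neg)
        if sens + spec - 1 > best_j:
            best_c, best_j = c, sens + spec - 1
    return best_c

def cutoff_ci(values, labels, n_boot=1000, seed=0):
    """Percentile bootstrap: median and 95% CI of the optimal cutoff."""
    rng = random.Random(seed)
    n = len(values)
    cuts = []
    while len(cuts) < n_boot:
        ix = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in ix]
        if 0 < sum(ys) < n:  # keep replicates containing both classes
            cuts.append(youden_cutoff([values[i] for i in ix], ys))
    cuts.sort()
    return cuts[n_boot // 2], (cuts[int(0.025 * n_boot)],
                               cuts[int(0.975 * n_boot)])
```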

##### The Gray Zone.

The first cutoff is chosen to exclude the diagnosis with near-certainty (*i.e.*, privilege sensitivity), whereas the second cutoff is chosen to include the diagnosis with near-certainty (*i.e.*, privilege specificity). When the value of the biomarker falls into the gray zone between the two cutoffs, uncertainty exists, and the physician should pursue the diagnosis using additional tools. This approach is probably more useful from a clinical point of view and is now more widely used in clinical research. Moreover, the two cutoffs delimit three intervals of the biomarker, each of which can be associated with its own LHR. In that case, the positive LHR of the highest values of the biomarker is considered to include the diagnosis, and the negative LHR of the lowest values to exclude it. This option, often called the interval LHR,^{50} results in less loss of information and less distortion than choosing a single cutoff, providing an advantage in interpretation over a binary outcome and allowing the clinician to interpret the results more thoroughly, thereby improving clinical decision-making. It is recommended that the cutoffs delimiting the gray zone be determined using resampling techniques^{47,48} and that the rules for choosing them be determined *a priori* and clearly explained and justified.
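Interval LHRs generalize the single positive/negative LHR pair to the three zones defined by the two cutoffs: for each zone, the LHR is the probability of falling in that zone given disease divided by the same probability given no disease. A minimal sketch, with hypothetical data and cutoffs:

```python
def interval_lhrs(values, labels, low, high):
    """LHR of each zone: P(value in zone | disease) / P(value in zone | no disease)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    zones = {"below": lambda v: v < low,          # diagnosis excluded
             "gray":  lambda v: low <= v <= high, # uncertainty: test further
             "above": lambda v: v > high}         # diagnosis included
    out = {}
    for name, in_zone in zones.items():
        p_disease = sum(map(in_zone, pos)) / len(pos)
        p_no_disease = sum(map(in_zone, neg)) / len(neg)
        out[name] = (p_disease / p_no_disease if p_no_disease > 0
                     else float("inf"))
    return out
```

A gray-zone LHR near 1 confirms that values in this interval carry almost no diagnostic information, which is precisely why additional tools are needed there.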

##### Reclassification Table

The increase in AUC_{ROC} for a model containing the new biomarker is defined simply as the difference in AUC_{ROC} calculated using models with and without the biomarker of interest. This increase, however, is often very small in magnitude. Ware *et al.*^{51} and Pepe *et al.*^{52} describe examples in which large odds ratios are required to meaningfully increase the AUC_{ROC}. As a consequence, many risk factors that we know to be clinically important may not affect the c statistic very much. Thus, the ROC curve approach may be insensitive for evaluating the gain provided by new biomarkers.^{52} Furthermore, ROC curves are frequently not helpful for evaluating biomarkers because they do not provide information about the actual risks or the proportion of participants with high- or low-risk values. Moreover, when ROC curves for two biomarkers are compared, the models are aligned according to their false-positive rates (that is, different risk thresholds are applied to the two models to achieve the same false-positive rate), which may be considered inappropriate.^{53} In addition, the AUC_{ROC} or c statistic has poor clinical relevance: clinicians are never asked to compare the risks of a pair of patients of whom one will eventually have the event and one will not. To complement the results obtained with ROC curves, new approaches to evaluating risk prediction have been proposed. One of the most interesting is the risk stratification table. This methodology better addresses the key purpose of risk prediction, which is to classify individuals into clinically relevant risk categories. Pencina *et al.*^{54} have recently proposed two ways of assessing improvement in model performance using reclassification tables: the Net Reclassification Improvement (NRI) and the Integrated Discrimination Improvement (IDI).^{55} The NRI combines four components: the proportions of individuals with events who move up or down a risk category and the proportions of individuals without events who move up or down a category. Because the NRI and its four components may be affected by the choice of risk strata, the lack of clear agreement on which categories are clinically important can be problematic when using the NRI to assess new biomarkers; this concern also applies to the Hosmer–Lemeshow test. Again, prevalence, predictive values, costs, and benefits should probably be considered to reach clinically relevant decisions.^{56} In contrast, the IDI does not require predefined strata: it can be seen as a continuous version of the NRI in which differences in predicted probabilities are used instead of predefined strata. Alternatively, it can be defined as the difference in mean predicted probabilities between events and nonevents, compared between the models with and without the new biomarker.

A biomarker producing only a small increase in AUC_{ROC} may nonetheless lead to substantial improvement in reclassification as measured by the NRI and/or the IDI. This suggests that a very small increase in AUC_{ROC} may still reflect a meaningful improvement in risk prediction and that the exclusive use of the ROC curve is not sufficient to demonstrate that a biomarker is not useful. This is clearly an evolving domain of biostatistics,^{53} which deserves particular attention in perioperative medicine and risk stratification.
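The two reclassification measures can be sketched as follows, given predicted probabilities from the models with and without the new biomarker. The risk-category boundaries (10% and 20%) are hypothetical, which is exactly the arbitrariness the text warns about for the NRI.

```python
def nri(p_old, p_new, events, cuts=(0.1, 0.2)):
    """Category-based Net Reclassification Improvement.

    cuts are hypothetical risk-category boundaries; the NRI sums upward
    moves minus downward moves among events, plus downward moves minus
    upward moves among nonevents."""
    def cat(p):
        return sum(p >= c for c in cuts)
    up_e = dn_e = up_n = dn_n = 0
    n_e = sum(events)
    n_n = len(events) - n_e
    for po, pn, y in zip(p_old, p_new, events):
        move = cat(pn) - cat(po)
        if y == 1:
            up_e += move > 0
            dn_e += move < 0
        else:
            up_n += move > 0
            dn_n += move < 0
    return (up_e - dn_e) / n_e + (dn_n - up_n) / n_n

def idi(p_old, p_new, events):
    """Integrated Discrimination Improvement: the change (new minus old
    model) in mean predicted probability among events, minus the same
    change among nonevents. No predefined strata are required."""
    mean = lambda xs: sum(xs) / len(xs)
    e_new = mean([p for p, y in zip(p_new, events) if y == 1])
    e_old = mean([p for p, y in zip(p_old, events) if y == 1])
    n_new = mean([p for p, y in zip(p_new, events) if y == 0])
    n_old = mean([p for p, y in zip(p_old, events) if y == 0])
    return (e_new - e_old) - (n_new - n_old)
```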

#### Common Pitfalls of the Evaluation of a Biomarker

##### Intrinsic Properties

*vs.* other types of surgery).^{49}

The analytical performance of an assay must be distinguished from its diagnostic performance.^{57} The terms “limit of detection,” “limit of quantitation,” and “minimal detectable concentration” are synonyms for analytical sensitivity. Polymerase chain reaction (PCR) is considered a very sensitive test because it can detect a very low number of copies of a gene or gene fragment. However, despite this exquisite analytical sensitivity, its diagnostic sensitivity may not be perfect when the target DNA is absent from the biologic material analyzed: this could be the case in a patient with endocarditis whose withdrawn blood samples do not contain any bacteria. In the same way, PCR can be considered an assay with exquisite analytical specificity, but its diagnostic specificity may be imperfect simply because of contamination.^{57}

##### Numerical Expression of Diagnostic Variables

The lower and upper limits of the 95% CI inform the reader about the interval in which 95% of all estimates of the measure (*e.g.*, sensitivity, area under the curve) would fall if the study were repeated over and over again.^{58} When LHRs are reported, CIs that include 1 indicate that the study has not shown convincing evidence of any diagnostic value of the investigated biomarker. Therefore, the reader does not know whether a test with a positive LHR of 20 but a 95% CI of 0.7–43 is useful. A study reporting a positive LHR of 5.1 with a 95% CI of 4.0–6.0 provides more precise evidence than one reporting a positive LHR of 9.7 with a 95% CI of 2.3–17. The sample size in critical care medicine studies is usually small, leading to wide CIs. Likewise, studies of diagnostic tests are too often underpowered to allow statistically sound inferences about differences in test accuracy.^{58}
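An approximate CI for a positive LHR can be obtained with the standard log method: the standard error of ln(LHR+) is computed from the four cells of the 2 × 2 table and back-transformed. The counts below are hypothetical.

```python
import math

def positive_lhr_ci(tp, fp, fn, tn, z=1.96):
    """Approximate 95% CI for the positive LHR via the log method:
    se(ln LHR+) = sqrt(1/TP - 1/(TP+FN) + 1/FP - 1/(FP+TN))."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lhr = sens / (1.0 - spec)
    se = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    log_lhr = math.log(lhr)
    return lhr, math.exp(log_lhr - z * se), math.exp(log_lhr + z * se)

lhr, lower, upper = positive_lhr_ci(tp=90, fp=30, fn=10, tn=70)
# Here LHR+ = 3.0; the CI width shrinks as the cell counts grow.
```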

##### Role of Time

The importance of timing can be crucial in perioperative medicine,^{59} particularly in the postoperative period, because the timing of the insult (anesthesia/surgery) is precisely known. For example, in cardiac surgery, Fellahi *et al.*^{11} suggested that troponin should be measured 24 h after cardiopulmonary bypass to gain the maximum information. In contrast, the time profile of another biomarker such as BNP may be completely different in the same surgical setting.^{60}

In other settings such as sepsis, however, the relevant time origin (onset of infection *vs.* onset of severe infection *vs.* onset of shock) remains vague.

##### Different Populations

The performance of a biomarker may also vary with the spectrum of the disease (acute *vs.* chronic illness, pathologic location or form).^{61} Moreover, a diagnostic test may work well in a global population but not in a given subgroup. For example, procalcitonin may not be a good biomarker of infection in pyelonephritis^{28} or intraabdominal abscess.^{62} Procalcitonin is not a good biomarker of infection in a population exposed to heatstroke, even though half of these patients are truly infected, simply because heatstroke itself increases procalcitonin.^{63} In the perioperative period, the type of surgery may be an important cause of variation. The properties of cardiac troponin I for diagnosing postoperative myocardial infarction are fundamentally different in noncardiac *versus* cardiac surgery, simply because cardiac surgery itself is responsible for an important postoperative release of cardiac troponin with multiple causes: surgical cardiac trauma, extracorporeal circulation, and defibrillation.^{60} Even within cardiac surgery, different cutoff points of cardiac troponin for predicting major postoperative cardiac events are observed when comparing coronary artery bypass, valve, and combined surgery.^{49}

Methods for covariate adjustment of ROC curves have been developed.^{64} When the covariate does not modify ROC performance, the covariate-adjusted ROC curve is an appropriate tool for assessing classification accuracy and is analogous to the adjusted odds ratio in an association study. In contrast, when a covariate affects ROC performance, ROC curves for the specific covariate groups should be used. Covariate adjustment may also be important when comparing biomarkers, even under a paired design, because unadjusted comparisons can be biased.^{64}

##### Importance of the Biomarker Kinetics

This effect could be of paramount importance in the postoperative period or in the intensive care unit, because these patients are more likely to present organ failures.^{65} When comparing two biologic forms derived from BNP, the active form and its prometabolite N-terminal prohormone brain natriuretic peptide (NT-proBNP), Ray *et al.*^{66} observed that the diagnostic properties of NT-proBNP were decreased compared with those of BNP, probably because of the differential impact of renal function on these two biomarkers in an elderly population.^{10} Because we are caring for elderly patients ever more frequently, it is important that biomarkers be tested not only in middle-aged but also in elderly populations.

##### Other Biases

Apart from differences concerning the cutoff point, which should be considered a definitional issue,^{67} one of the most important reasons for this wide variation is that diagnostic studies are plagued by numerous biases^{68}:

This bias (also called spectrum bias) may be associated with the largest bias effect.^{3,69}

This bias might be particularly confounding when the decision to perform the reference test is based on the result of the studied test.^{3,70}

##### Statistical Power Issue

Statistical power considerations are as important for studies examining the diagnostic performance of a biomarker as for other types of research^{71} (*e.g.*, clinical trials). Thus, all biomarker studies should include an *a priori* calculation of the number of patients to be included. The exact statistical power considerations relevant for interpreting a biomarker study depend on the nature or purpose of the study but generally focus on demonstrating that the sensitivity and/or specificity of the biomarker is superior to some stated value (*e.g.*, sensitivity > 0.75). Note that the focus is on sensitivity and specificity even when the predictive values are of greater interest (as they often are), because the predictive values are also dependent on the prevalence of the underlying disease (fig. 2).

The calculation is typically framed as a one-sample test of a proportion (*i.e.*, a one-sample difference in proportions) using a one-sided CI. The sensitivity or specificity of the test is treated as a proportion and compared with a minimally acceptable value: the null hypothesis is that the sensitivity or specificity of the test equals this minimally acceptable value, with the alternative hypothesis that it is greater. To test this hypothesis, a type-I error rate must be specified (usually α = 0.05) to construct a one-sided CI (1 − α). Further, because of the nature of the inference being conducted, the desired statistical power is conservatively set to 95% (1 − β = 0.95). Finally, the expected sensitivity or specificity of the biomarker must be anticipated so that the difference between the two proportions can be used in the calculation.
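The calculation just described can be sketched with the usual normal approximation for a one-sided, one-sample test of a proportion. The expected sensitivity (0.90), the minimally acceptable value (0.75), and the prevalence are illustrative inputs, not values from the review.

```python
import math
from statistics import NormalDist

def n_for_sensitivity(expected, minimal, alpha=0.05, power=0.95,
                      prevalence=1.0):
    """Diseased patients needed to show sensitivity > a minimally
    acceptable value (one-sided one-sample proportion test, normal
    approximation). Dividing by prevalence gives the total to enroll."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided type-I error
    z_beta = NormalDist().inv_cdf(power)        # 1 - beta
    numerator = (z_alpha * math.sqrt(minimal * (1 - minimal))
                 + z_beta * math.sqrt(expected * (1 - expected))) ** 2
    n_diseased = math.ceil(numerator / (expected - minimal) ** 2)
    return n_diseased, math.ceil(n_diseased / prevalence)

# Example: expected sensitivity 0.90 vs. minimal 0.75, prevalence 25%:
n_diseased, n_total = n_for_sensitivity(0.90, 0.75, prevalence=0.25)
```

With these inputs, 65 diseased patients are required, hence 260 patients overall at a 25% prevalence, which illustrates why the prevalence term matters so much in planning diagnostic studies.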

*et al.*^{72} have recently provided an extensive overview of the process and have even provided tables for values that are routinely encountered. Finally, a growing list of internet sites host statistical power calculators for a variety of applications. Although many of these sites are not formally vetted, several are hosted by universities and are thus quite useful.

Sample size methods exist not only for sensitivity and specificity^{73} but also for the AUC_{ROC},^{73–75} including the partial AUC_{ROC},^{73,76} and for the reclassification indices.^{54} Moreover, the objective of a study may also be to determine the value of a cutoff or to compare two or more biomarkers. Diagnostic assessments of biomarkers thus involve numerous forms of statistical analysis, but sample size calculation, although sometimes difficult, is always feasible. Clinicians therefore have first to define the aim of the planned research and second to evaluate their ability to conduct the corresponding power calculation. These techniques are not yet available in most of the usual statistical software applications, and the more advanced statistical software# may be dissuasive for occasional use by clinical researchers; the advice of a biostatistician may be very helpful.

##### Imperfect Reference Test

The reference test may be inherently uncertain (*e.g.*, cardiac failure), may not have been performed in many patients (*e.g.*, autopsy), or logistically could not be performed concurrently. For example, when evaluating BNP, echocardiography for heart failure is not always performed in the emergency department but is usually performed later during hospitalization.^{34} Moreover, in many situations, biomarkers are compared with scores derived from several clinical metrics of unknown reliability in place of a confirmed diagnosis. This practice has been seen with the Framingham score for heart failure and the biomarker BNP, with the criteria of the systemic inflammatory response syndrome and sepsis and the use of procalcitonin,^{77} and with the Risk, Injury, Failure, Loss, and End-stage Kidney (RIFLE) score for acute renal failure.^{78}

An imperfect reference standard biases the apparent performance of the biomarker.^{79} Glueck *et al.*^{80} showed that when inappropriate reference standards are used, the observed AUC_{ROC} can be greater than the true area, the typical direction of the bias being a strong inflation of sensitivity with a slight deflation of specificity. Taken together, this information warrants the use of reliable reference standards that are not prone to such bias.

One option is to have a panel of experts adjudicate the diagnosis.^{81} These experts should have complete access to all available information except that concerning the biomarker test, to which they should be blinded, and the statistical agreement between experts should be quantified and reported. A second option is to assign a probability value (*i.e.*, 0–1) corresponding to a subjective or derived (logistic regression using dedicated variables) probability that the patient has the disease. Third, one can use covariance information to estimate a model of the multivariate normal distributions of disease-positive and disease-negative patients when several accurate tests are being compared. Finally, one can transform the diagnostic problem into a clinical outcome problem.^{82}

*et al.*^{83} proposed a ROC-type nonparametric measure of diagnostic accuracy: a discrimination test in which the diagnostic test is compared with a continuous reference test to determine how well it distinguishes the outcome of the reference test.^{84} For example, cardiac troponin has progressively modified the definition of the diagnosis of myocardial infarction.^{70} Glasziou *et al.*^{84} have proposed three main principles that may assist the replacement of a current reference test: the consequences of the new test can be understood through disagreements between the reference and the new test; resolving disagreements between the new and reference tests requires a fair, but not necessarily perfect, “umpire” test; and possible umpire tests include causal exposures, concurrent testing, prognosis, or the response to treatment. A fair umpire test does not favor either the reference or the new test and is thus considered unbiased.

#### STARD Statement for Diagnosis Studies

Complete and accurate reporting of biomarker studies allows the reader to detect potential biases and to judge the clinical applicability and generalizability of the results.^{4} The STARD recommendations follow the template of the Consolidated Standards of Reporting Trials (CONSORT) statement for the reporting of randomized controlled trials (RCTs).^{5} The STARD guideline attempts to improve the reporting of several factors that may threaten the internal or external validity of the results of a study, including design deficiencies, selection of patients, execution of the index test, selection of the reference standard, and analysis of the data. That these reporting improvements are needed is evidenced by a survey of diagnostic accuracy studies published in major medical journals between 1978 and 1993, which found generally poor methodologic quality and underreporting of key methodologic elements.^{85} Similar shortcomings have been observed in most specialized journals.^{86}

#### Associated Clinical Predictors and/or Multiple Biomarkers

After cardiac surgery, a multiple-biomarker approach has been shown to improve the prediction of poor long-term outcome compared with the classic clinical EuroSCORE (fig. 7).^{60} There are two main approaches: (1) several biomarkers testing the same pathophysiologic process; and (2) different biomarkers testing different pathologic processes. For example, C-reactive protein may assess the postoperative inflammatory response, BNP cardiac strain, and troponin any ischemic myocardial damage, all of them influencing the final outcome of cardiac surgery.^{11,88}

#### Meta-analysis of Diagnostic Studies

In this regard, meta-analyses of biomarker diagnostic studies are similar to other types of meta-analyses.^{67} For that reason, the reporting of meta-analyses of biomarker diagnostic studies should generally follow existing guidelines for meta-analysis, such as the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).^{3,90}

To assist in the evaluation of study quality, several specialized tools have been created, such as the STARD guidelines (table 4), and the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool has been included in systematic reviews.^{91} The accurate characterization of a biomarker's performance in a particular setting for a specific population depends on sorting through the available evidence to focus primarily on relevant studies of high quality.

A meta-analysis of diagnostic studies must pool two indices (*e.g.*, sensitivity and specificity), as opposed to the single index pooled in the meta-analysis of an RCT.^{92} It is also expected that heterogeneity in these indices will be observed from several different sources,^{3} and this heterogeneity must be considered in the statistical model used to pool the estimates.^{93} The choice of model and estimation strategy is not trivial, with several novel techniques, such as the hierarchical summary ROC^{94} and multivariate random-effects meta-analysis,^{95} offering distinct advantages over traditional approaches. For the interested reader, Deeks *et al.*^{92} offer an informative illustration of the meta-analytic process.

#### Conclusions

Two main reasons may explain this situation.^{4} First, there is a widely recognized delay between the development of biostatistical techniques and their implementation in medical journals. Second, even within biostatistics, this domain has not been thoroughly explored and developed. Thus, there is an urgent need to accelerate the improvement of our methods of analyzing biomarkers, particularly concerning the use of the ROC curve, the choice of the cutoff point (including the definition of a gray zone), the appropriate *a priori* calculation of the number of patients to include, and the extensive use of validation techniques. Admittedly, if we look retrospectively at one of our own recent studies,^{66} we now realize that considerably more information could have been provided (fig. 8); this should be considered a promising and encouraging signal.^{96}