# Statistical Primer on Biosimilar Clinical Development

A biosimilar is a biological product that is highly similar to a licensed reference (originator) biological product, has no clinically meaningful differences from it in terms of safety, purity, and potency, and is approved under specific regulatory approval processes. Because both the originator and the potential biosimilar are large and structurally complex proteins, biosimilars are not generic equivalents of the originator. Thus, the regulatory approach for a small-molecule generic is not appropriate for a potential biosimilar, and different study designs and statistical approaches are used in its assessment. This review covers concepts and terminology used in statistical analyses in the clinical development of biosimilars so that clinicians can understand how similarity is evaluated and make informed prescribing decisions when an approved biosimilar is available.

^{1}Biotechnology Clinical Development Statistics, Pfizer Inc, Cambridge, MA; and

^{2}Global Established Pharma Medicines Development Group, Pfizer Inc, New York, NY.

Address for correspondence: Biotechnology Clinical Development Statistics, Pfizer Inc, 10 Fawcett St, Cambridge, MA 02138. E-mail: leah.isakov@pfizer.com

Medical writing support was provided by Christina McManus, PhD of Engage Scientific Solutions and funded by Pfizer Inc.

All authors are employees of Pfizer and hold stock or stock options in Pfizer.

This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CCBY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially without permission from the journal.

## INTRODUCTION

A biosimilar is a biological therapy, which is highly similar to a licensed biological product (ie, originator biologic) and approved under a specific regulatory approval process, such as those enacted by the European Medicines Agency (EMA), US Food and Drug Administration (FDA), and the World Health Organization.^{1–3} According to the definition in the FDA guidelines, a biosimilar has “no clinically meaningful differences between the biological product and the reference (originator) product in terms of safety, purity, and potency.”^{2}

Biosimilars are not generic equivalents of the originator because both the originator and the potential biosimilar are large and structurally complex proteins, which can undergo posttranslational modifications that may affect safety and potency (even if the differences are minor).^{2} Thus, although regulatory approval of small-molecule generic medicines can occur after demonstration of bioequivalence to the originator small-molecule drug, this approach is not sufficient for approval of potential biosimilars.^{3} Instead, biosimilars are evaluated in a stepwise fashion that includes structural and functional studies, nonclinical assessments of pharmacokinetics (PK) and toxicity, and clinical evaluation of PK, efficacy, and safety (including immunogenicity) to demonstrate similarity of the potential biosimilar to the originator biologic.^{1–4} Approval of a potential biosimilar is based on the totality of the evidence in demonstrating similarity of the biosimilar to the reference product.^{1–4} Totality of the evidence refers to the extensive comparative structural, functional, nonclinical, and clinical data required to establish biosimilarity. Totality of the evidence is an important concept for biosimilars because of the different paradigm for developing and regulating biosimilar products. Data from all stages of development (not just clinical) are important for regulatory approval and to gain medical acceptance.

The number of biosimilars approved by regulatory agencies is expected to increase as patents on licensed originator biologics expire. Thus, it is important for prescribers to understand the statistical analyses used in biosimilar clinical trials so that they can make informed decisions about appropriate ways to incorporate biosimilars into treatment regimens. The purpose of this review is to give clinicians a high-level overview of the concepts and terminology used in statistical analyses in the clinical development of biosimilars so that they can understand how similarity is evaluated. Two case studies are also presented to illustrate margin derivation and sample size calculation for efficacy equivalence trials for potential biosimilars.

## TERMINOLOGY AND DEFINITIONS

Clinicians should become familiar with the traditional statistical approaches used in the approval process for evaluation of a novel therapeutic (eg, the New Drug Application or the Biologics License Application processes), which generally use *superiority trial* study designs. A superiority trial has a primary objective of showing that response to the investigational product is superior to response to a comparative agent (active or placebo control).^{5} In many cases, the comparative agent is the current standard of care, especially if treatment with a placebo would be considered unethical.

In contrast, biosimilar products must demonstrate similar efficacy and safety to the originator. Studies of a potential biosimilar typically use an *equivalence trial* design.^{2,3} An equivalence trial has a primary objective of showing that differences in response to 2 or more treatments are clinically unimportant.^{5} In other words, clinical responses are close enough that neither the biosimilar nor the comparator (originator) is superior or inferior to the other.^{6} This is usually shown by demonstrating that the true treatment difference is likely to lie within a specific range of clinically acceptable differences.^{5} Equivalence is demonstrated when the entire confidence interval (CI) for a given parameter falls within the lower and upper equivalence margins set before the experiment or study (Figure 1).

Although the equivalence trial is commonly used, some assessments of a biosimilar may use a *noninferiority trial* design if adequate scientific justification is provided (Figure 2).^{2,3} In a noninferiority trial, the primary objective is to show that response to the investigational product is not clinically inferior to a comparative agent (active or placebo control).^{5} A noninferiority design uses only one margin (the lower or upper limit, depending on what is appropriate for the specific study or endpoint) and tends to require a smaller sample size than an equivalence trial design.^{3}

The CI expresses the degree of uncertainty associated with a statistical parameter, such as the difference between 2 treatment effects (eg, risk difference) or their ratio (eg, risk ratio). For example, if overall response rate (ORR) is used as the primary endpoint, risk ratio could serve as a primary efficacy parameter. The CI is different from the *point estimate*, which is a single value that estimates an unknown population parameter. The sample mean, ORR, and median survival time are examples of point estimates.

The *margins* for demonstration of equivalence (and noninferiority) are generally based on the effect size, should be justified on both clinical and statistical grounds, and use robust statistical rationale and clinical criteria to determine a margin value.^{7} In terms of demonstrating biosimilarity, the margin is the largest difference between the potential biosimilar and originator that is judged as clinically acceptable (Table 1) but should be smaller than the minimum difference reported between the originator and a placebo for clinically relevant results.^{7,8} When available, effect size, magnitude, and variability of effect size of the originator in placebo-controlled trials may be used to calculate margins.^{7,8} In some cases, such as in clinical PK studies, which lack established acceptance criteria for biologicals, regulators suggest a traditional equivalence range of 80%–125%.^{3}

The *two one-sided test* (TOST) procedure is the simplest and most widely used approach to test equivalence.^{4,6} It analyzes the composite null and alternative hypotheses (H_{0} and H_{1}, respectively): H_{0}: μ ≤ θ_{L} or μ ≥ θ_{U} (the treatment difference is outside the equivalence margins) and H_{1}: θ_{L} < μ < θ_{U} (the treatment difference is within the margins), where θ_{L} and θ_{U} are the lower and upper margins, respectively, and μ is the analysis criterion (eg, mean ratio or mean difference).^{4,9,10} The composite hypotheses are split into 2 sets of 1-sided hypotheses that are tested simultaneously (H_{a0}: μ ≤ θ_{L} with H_{a1}: μ > θ_{L} and H_{b0}: μ ≥ θ_{U} with H_{b1}: μ < θ_{U}).^{4,9} The H_{0} is rejected in favor of H_{1} only if both 1-sided null hypotheses are rejected, which is equivalent to the CI for μ being contained completely within the range (θ_{L}, θ_{U}). When each 1-sided test is performed at significance level α, the corresponding CI is the 2-sided 100(1 − 2α)% CI (eg, a 90% CI for α = 0.05).
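The TOST logic described above can be sketched in a few lines of Python. This is a minimal illustration under a normal approximation, not a validated analysis routine: the function name `tost_equivalence`, its arguments, and the numeric inputs are hypothetical and chosen only to show how the 2 one-sided tests and the CI-containment rule lead to the same decision.

```python
from statistics import NormalDist


def tost_equivalence(estimate, std_error, lower_margin, upper_margin, alpha=0.05):
    """Two one-sided z-tests for equivalence (normal approximation).

    Returns both one-sided p-values, the 100(1 - 2*alpha)% confidence
    interval, and whether equivalence is concluded at level alpha.
    """
    norm = NormalDist()
    # H_a0: mu <= lower_margin  versus  H_a1: mu > lower_margin
    p_lower = 1.0 - norm.cdf((estimate - lower_margin) / std_error)
    # H_b0: mu >= upper_margin  versus  H_b1: mu < upper_margin
    p_upper = norm.cdf((estimate - upper_margin) / std_error)
    # Equivalent rule: the 100(1 - 2*alpha)% CI lies inside the margins
    z = norm.inv_cdf(1.0 - alpha)
    ci = (estimate - z * std_error, estimate + z * std_error)
    equivalent = max(p_lower, p_upper) < alpha
    return {"p_lower": p_lower, "p_upper": p_upper, "ci": ci, "equivalent": equivalent}


# Hypothetical inputs: estimated difference 0.05, standard error 0.10,
# margins of -0.35 and 0.35
result = tost_equivalence(0.05, 0.10, -0.35, 0.35)
print(result)
```

Because the overall decision uses the larger of the 2 one-sided *P* values, equivalence is declared only when both 1-sided null hypotheses are rejected.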

In any statistical test, 2 types of errors can result: *type I error* or *type II error*. A type I error occurs when the null hypothesis is rejected although it is true, whereas a type II error occurs when the null hypothesis is not rejected although it is false.^{11} The probability of committing a type I error (denoted by α) is the significance level for the test (eg, α = 0.05), whereas the probability of committing a type II error is denoted by β.^{11} The probability of not committing a type II error is the *power* of the test, which is equal to 1 minus β.^{11} In the context of the TOST procedure, the type I error rate is the larger of the type I error rates of the 2 one-sided tests, and the power is the probability of not committing a type II error in either of the 2 sets of hypotheses tested.

## STATISTICAL CONCEPTS FOR CLINICAL STUDIES OF BIOSIMILARS

Biosimilars are evaluated in a stepwise fashion requiring demonstration of similar structural, functional, and nonclinical PK and nonclinical toxicity to the originator before clinical trials begin.^{1–4} The specific nature of comparative clinical trials depends on the nature and extent of residual uncertainty about biosimilarity.^{2} The EMA, FDA, and World Health Organization guidelines for biosimilar development note that clinical PK is a critical basic characteristic of any medicinal product and should be evaluated as the first step of any clinical program.^{1–3} These studies should be comparative in nature and designed to enable detection of potential differences between the potential biosimilar and the originator.^{3} If feasible or if relevant pharmacodynamic measures are available, initial clinical studies may include pharmacodynamics in addition to PK.^{1,2}

Although the exact nature of statistical analyses depends on trial design, there are some principles that generally apply in most clinical studies evaluating potential biosimilars. A single-dose crossover study is not appropriate for most biosimilar clinical PK studies because of the long half-life and the potential for formation of antidrug antibodies. Instead, parallel treatments are necessary.^{3} This means that to show bioequivalence, sample size needs to be increased relative to crossover studies.^{3}

Regulatory agencies generally expect that equivalence trials will be used unless there is sufficient scientific justification for using noninferiority designs.^{1} For equivalence trials, statistical analysis is generally based on the use of 2-sided CIs (typically at the 90% level) for the difference between treatments.^{3} In many situations, no established acceptance criteria exist for standard clinical PK comparability (bioequivalence) studies for biological drugs because the criteria were developed for chemically derived orally administered products and, thus, may not be applicable for biologics.^{3} In this case, regulatory agencies often recommend use of the traditional 80%–125% equivalence range with 90% CIs of the ratio of the population geometric means (test to reference) for studies evaluating clinical PK.^{3,12} However, this is not the default range and proposed limits need to be justified.^{13} Regardless, similarity limits are determined before initiation of studies.
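As an illustration of the PK comparison described above, the following sketch computes the ratio of geometric means (test to reference) and its 90% CI on the log scale for a parallel-group design, then checks the interval against the 80%–125% range. The AUC values are made up, the function names are hypothetical, and a normal quantile is used as a simplifying assumption (a t quantile and a prespecified statistical model would normally be used in practice).

```python
from math import exp, log, sqrt
from statistics import NormalDist, fmean, variance


def geometric_mean_ratio_ci(test_values, reference_values, level=0.90):
    """Ratio of geometric means (test / reference) with a two-sided CI.

    Parallel-group sketch on the log scale. A normal quantile is used as a
    simplifying assumption; a t quantile is the usual choice in practice.
    """
    log_test = [log(v) for v in test_values]
    log_ref = [log(v) for v in reference_values]
    diff = fmean(log_test) - fmean(log_ref)
    std_error = sqrt(variance(log_test) / len(log_test)
                     + variance(log_ref) / len(log_ref))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return exp(diff), exp(diff - z * std_error), exp(diff + z * std_error)


def within_range(ci_low, ci_high, low=0.80, high=1.25):
    """True when the whole CI lies inside the prespecified range."""
    return low <= ci_low and ci_high <= high


# Made-up AUC values, for illustration only
auc_test = [90.0, 100.0, 110.0, 95.0, 105.0, 100.0, 98.0, 102.0]
auc_reference = [92.0, 101.0, 108.0, 97.0, 103.0, 99.0, 100.0, 104.0]
ratio, ci_low, ci_high = geometric_mean_ratio_ci(auc_test, auc_reference)
print(round(ratio, 3), round(ci_low, 3), round(ci_high, 3),
      within_range(ci_low, ci_high))
```

The `low` and `high` arguments are left adjustable because, as noted above, the 80%–125% range is not a default and proposed limits need to be justified.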

Outcomes can be measured as the treatment effect, which is either an absolute measure or a relative measure.^{6,8} For example, the *risk difference* is an absolute measure of effect and is calculated by subtracting the risk of the outcome in one group of individuals from the risk in another (eg, those treated with the potential biosimilar and those treated with the originator). The *relative risk* (RR) (also called *risk ratio*) is a relative measure of effect calculated as the ratio between 2 incidence proportions and is generally presented with the CI as a measure of precision of the estimate.^{14} According to the FDA Statistical Approaches to Establishing Bioequivalence Guidance for Industry, the primary comparison in bioequivalence studies is expected to be the ratio between the average data for each parameter (eg, area under the curve) between the test (biosimilar) and the reference (originator), which can be conducted as a general comparison via logarithmic transformation, rather than the difference between the averages.^{15}
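The absolute and relative measures can be computed side by side, as in the sketch below. It uses Wald-type CIs (with the RR interval built on the log scale), which is one common choice and is assumed here for illustration; the function name and the responder counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_difference_and_ratio(events_test, n_test, events_ref, n_ref, level=0.95):
    """Risk difference and relative risk (test vs reference) with Wald-type CIs."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    p_test = events_test / n_test
    p_ref = events_ref / n_ref

    # Absolute measure: risk difference
    rd = p_test - p_ref
    se_rd = sqrt(p_test * (1 - p_test) / n_test + p_ref * (1 - p_ref) / n_ref)
    rd_ci = (rd - z * se_rd, rd + z * se_rd)

    # Relative measure: risk ratio, with the CI built on the log scale
    rr = p_test / p_ref
    se_log_rr = sqrt(1 / events_test - 1 / n_test + 1 / events_ref - 1 / n_ref)
    rr_ci = (exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr))
    return {"rd": rd, "rd_ci": rd_ci, "rr": rr, "rr_ci": rr_ci}


# Hypothetical counts: 150 of 380 responders vs 144 of 380 responders
summary = risk_difference_and_ratio(150, 380, 144, 380)
print(summary)
```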

The patient population evaluated in clinical trials is also important. The intent-to-treat population includes every subject who is randomized and does not remove patients with noncompliance, protocol deviations, withdrawal, or anything that happens after randomization from the analysis.^{16} Although considered a conservative analysis in the superiority trial setting and therefore appropriate for the primary analysis, the intent-to-treat population is not conservative in equivalence and noninferiority trials because of a tendency for biasing the results toward equivalence (or noninferiority).^{8} However, the per-protocol population, which excludes data from patients with major protocol violations, can also bias results.^{8,16} Therefore, both populations should be evaluated and reported in equivalence (and noninferiority) trials.^{8} It is expected that at the end of the trial analysis, both intent-to-treat and per-protocol populations lead to the same conclusion for potential biosimilars.

## STATISTICS IN COMPARATIVE EFFICACY AND SAFETY STUDIES

After demonstration of comparable clinical PK, the next step in the development program is usually to evaluate the clinical efficacy and safety in adequately powered and randomized controlled clinical trial(s).^{1,3} The goal of these studies is to demonstrate that there are no clinically meaningful differences in efficacy, safety, and immunogenicity of the potential biosimilar compared with the originator.^{1–3} These studies should include a sensitive patient population and endpoints with appropriate scientific justification.^{1–3}

For comparisons of efficacy, selection of an appropriate primary endpoint depends on several factors, including the originator, intended use of the potential biosimilar, disease prevalence, and target population.^{3} The primary endpoint is expected to be relevant and sensitive to detect potential differences between the potential biosimilar and originator.^{1} This means that the choice of clinical endpoints and/or time points for analyses may differ from those used for regulatory approval of the originator or from those traditionally applied for a novel agent in the same therapeutic area.^{1,2} When possible, some secondary endpoints used for approval of the originator should also be included to facilitate comparisons between drugs.^{1,2} Data collection for secondary endpoints may require an additional follow-up period. Although some regulatory agencies have provided class-specific guidelines defining appropriate endpoints for biosimilars, not all areas are covered by these guidelines.^{1} One approach to identifying sensitive populations, margins, and appropriate endpoints in clinical efficacy is to perform a meta-analysis of clinical data for the originator.^{17} Regardless of the endpoints selected, equivalence designs are generally preferred for comparisons and margins must be prespecified and include the largest differences that would not be clinically relevant.^{2,3} Generally, the prespecified margins are symmetric, although an asymmetric margin can be used if dose-related toxicities occur or if the dose used is near the plateau of the dose–response curve and there is little likelihood of dose-related effects.^{2}

Investigation of safety and tolerability is multidimensional because of the wide range of possible adverse effects and because new and unforeseeable effects can always occur. Nevertheless, safety analyses for potential biosimilars are expected to compare the severity and frequency of adverse events after administration of the potential biosimilar versus the originator, with specific interest in adverse events listed in the originator product label.^{1–3,5} In addition, a key element of biosimilar safety is assessment of potential differences in incidence and severity of human immune responses because these may affect safety and effectiveness.^{1–3} Recommendations for statistical analyses of safety and tolerability in clinical trials include descriptive statistics of the data with provision of CIs and/or *P* values when useful for interpretation.^{5}

## CASE STUDIES: ILLUSTRATING MARGIN DERIVATION AND SAMPLE SIZE CALCULATION FOR EFFICACY EQUIVALENCE TRIALS

As described earlier, biosimilar products must demonstrate comparable efficacy and safety to the originator rather than superiority. This is usually done using an equivalence trial design with appropriate margins and sample sizes to show that the treatment difference lies within the calculated range of clinically acceptable differences. The calculation of equivalence margins and sample size depends on the selection of appropriate and sensitive clinical endpoints, and this, in turn, depends on the therapeutic indication of interest. Two case studies illustrate the approaches used when designing clinical trials to evaluate a potential biosimilar with indications for cancer or inflammatory diseases.

### Case study 1: oncology

In general, the amount of clinical trial data supporting registration of monoclonal antibodies used in the treatment of cancer is limited. Rather than being developed as novel monotherapies for specific types of cancer, monoclonal antibodies are commonly evaluated as “add-on” therapy to a current cytotoxic chemotherapy regimen used as the standard of care. It is rare to find large randomized controlled clinical trials that all use the same patient population and cytotoxic chemotherapy. Although randomized trials may have been conducted, they are often smaller studies, in different patient populations, and/or with different chemotherapy regimens. This leads to considerable variability in reported outcomes and inconsistent treatment effects. These differences pose a challenge for the identification of a robust and consistent treatment effect margin that could be used in the design of a biosimilar equivalence study.

According to the FDA and EMA biosimilar guidelines, selection of the most sensitive population is a critical step in the design of an efficacy trial.^{1,2} Therefore, the selection of an appropriate study population usually occurs after discussions with regulators and is based on historical studies, practical considerations, and input from clinicians to address changes in clinical practice and current knowledge of the disease.^{1,2} Whenever possible, the potential for confounding because of co-medication or previous treatment must be avoided. Thus, the selection of the sensitive patient population may be based on an indication in which the originator is approved for first-line use (with or without concomitant chemotherapy). For example, patients with previously untreated, advanced non–small-cell lung cancer (NSCLC) are well characterized for safety and efficacy and therefore are considered a sensitive population for the evaluation of potential biosimilars to bevacizumab (in combination with paclitaxel and carboplatin) per the bevacizumab-approved indications in the United States and European Union.^{17–19}

Identifying appropriate endpoints for comparisons between a potential biosimilar and the originator must take into consideration how to identify any potential product-related differences in efficacy and safety while removing potential confounding factors.^{20} Thus, although survival endpoints are preferred in clinical trials evaluating novel biological therapies for oncology indications, these endpoints may not be appropriate in clinical trials evaluating biosimilars. Instead, a measure of response (such as ORR) or activity may be more sensitive for clinical comparisons of a potential biosimilar and therefore a better choice for the primary endpoint.^{1,20,21}

When an appropriate study population and primary endpoint are identified, the next step is to derive the equivalence margin to demonstrate biosimilarity between the potential biosimilar and the originator. This case study is a hypothetical phase 3 comparative clinical trial of a potential biosimilar to bevacizumab being evaluated for safety and efficacy in a study population of patients with NSCLC and a primary endpoint of ORR. The equivalence margin can be calculated based on the meta-analysis described earlier that evaluated the addition of bevacizumab to chemotherapy in patients with NSCLC (Figure 3).^{17} Margin construction is commonly designed to preserve at least 50% of the treatment effect.^{10} In this example, using the 4 clinical trials identified as appropriate for this patient population and primary endpoint,^{22–25} we applied a log scale so that the margins are symmetrical around 1. Log scale transformation is commonly used when a ratio (eg, RR or hazard ratio) serves as the primary efficacy parameter. Computing the margins for a risk ratio is therefore a 2-step procedure. First, the upper margin is generated for Ln(RR), and the exponential of this value is computed to give the upper margin for the RR. Second, to construct a lower margin that is symmetrical on the log scale, the reciprocal of the upper margin is taken. Here, the margin for Ln(RR) is 1/2 × Ln(1.79) = 0.291, so the upper margin is exp(0.291) = 1.34 and the lower margin is 1/1.34 = 0.75. Thus, an equivalence margin of (0.75, 1.34) will demonstrate clinical similarity for the intended indication while maintaining 50% of the 2-sided 70% CI lower bound of the effect size based on the historical ORR data.
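The 2-step margin calculation can be reproduced with a short sketch. The function name and the `preserved_fraction` argument are hypothetical; the value 1.79 is taken from the text and is interpreted here, as an assumption, as the confidence bound on the historical RR from which the margin is derived.

```python
from math import exp, log


def symmetric_rr_margins(historical_rr_bound, preserved_fraction=0.5):
    """Equivalence margins for a risk ratio, symmetric around 1 on the log scale.

    historical_rr_bound: bound on the reference-vs-control risk ratio taken
        from historical data (1.79 in the text).
    preserved_fraction: share of the historical effect to be preserved.
    """
    # Step 1: margin on the log scale, then back-transform for the upper margin
    log_margin = (1.0 - preserved_fraction) * log(historical_rr_bound)
    upper = exp(log_margin)
    # Step 2: the reciprocal gives the lower margin
    lower = 1.0 / upper
    return lower, upper


lower, upper = symmetric_rr_margins(1.79)
print(round(lower, 2), round(upper, 2))  # 0.75 1.34
```

Lowering `preserved_fraction` (eg, to 0.45) moves both margins further from 1, which is why preserving less of the historical effect yields a wider equivalence range.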

Treatment effect is a critical component for sample size calculation. In this example, the fixed-effect model estimates that the ORR (95% CI) for chemotherapy plus bevacizumab is 38% (35%, 40%; Figure 4). This relatively conservative assumption of an ORR of 38% is used to calculate the total sample size. If the type I error rate is 5% and power is 85%, the total sample size would be 764 (without accounting for dropout; Table 1). Under an assumption of 10% drop-out rate, the target sample size would be 849 patients. Although other assumptions about power will change the sample size (eg, using power of 80% results in a decreased total sample size of 688 patients and increasing power to 90% leads to an increase in the required sample size to 870 patients), the estimate of the treatment effect has a greater impact on sample size. For example, if the assumed ORR is 40% (which is potentially justifiable based on the fact that this value is the upper limit of the 95% CI for the fixed-effect model), the sample size will be 633 patients for power of 80%, 704 patients at power of 85%, and 800 patients if the power used is 90% without accounting for dropout. Using a 45% preservation of treatment effect (instead of the standard preservation of 50% of treatment effect) will lead to wider margins and, as a result, to smaller sample size under the same set of assumptions. It is important to remember that other approaches for treatment effects and sample size calculations could apply if clinical interpretations suggest that it is more appropriate to use nonsymmetrical margins, linear scale, or noninferiority margins.
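The direction of these effects can be explored with a rough sample size sketch. The text does not state the exact method behind the quoted figures, so the normal approximation on the log risk ratio used below is an assumption and is not expected to reproduce those figures exactly; the function names are hypothetical, and `alpha` is taken to be the level of each 1-sided test. The dropout adjustment, by contrast, is simple arithmetic and does reproduce the enrollment target given above.

```python
from math import ceil, log
from statistics import NormalDist


def approx_total_n_risk_ratio(orr, log_margin, alpha=0.05, power=0.85):
    """Approximate total sample size (two equal arms) for a TOST on the log risk ratio.

    Assumes the same true ORR in both arms (true risk ratio of 1), margins of
    +/- log_margin on the log scale, and a normal approximation.
    alpha is the level of each one-sided test.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha)
    z_beta = norm.inv_cdf(1.0 - (1.0 - power) / 2.0)
    # Per-subject variance contribution to log(RR), summed over the two arms
    variance_term = 2.0 * (1.0 - orr) / orr
    n_per_arm = ceil((z_alpha + z_beta) ** 2 * variance_term / log_margin ** 2)
    return 2 * n_per_arm


def inflate_for_dropout(n, dropout_percent):
    """Enrollment target so that n subjects remain after dropout (integer ceiling)."""
    return -(-n * 100 // (100 - dropout_percent))


log_margin = 0.5 * log(1.79)  # margin on the log scale, from the text
for assumed_orr in (0.38, 0.40):
    for power in (0.80, 0.85, 0.90):
        total = approx_total_n_risk_ratio(assumed_orr, log_margin, power=power)
        print(assumed_orr, power, total)

print(inflate_for_dropout(764, 10))  # 849, the enrollment target quoted in the text
```

Under this approximation, the required sample size grows with power and shrinks as the assumed ORR increases, matching the pattern described in the text.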

### Case study 2: rheumatoid arthritis

Clinical efficacy of a therapeutic agent used in the treatment of inflammatory diseases, such as rheumatoid arthritis, can be evaluated through a number of measures, including composite measures of disease activity. In this case study, a hypothetical study of a potential rituximab biosimilar (with methotrexate) being evaluated for safety and efficacy in a study population of patients with rheumatoid arthritis, the primary endpoint selected is the disease activity in 28 joints score (DAS28). DAS28 is a validated, continuous composite measure of disease activity included as an endpoint in clinical trials in rheumatoid arthritis, including those evaluating rituximab plus methotrexate.^{26–32} DAS28 is considered a sensitive continuous endpoint to detect potential differences in treatment effect in a potential biosimilar.^{27,33}

Using a meta-analysis approach similar to one used by Volkmann et al to assess clinical efficacy, an equivalence margin for DAS28 in terms of change from baseline at week 24 can be derived (see Volkmann et al, Figure 3).^{32} The FDA guidance for noninferiority trials suggests that the margin for a noninferiority trial preserve at least 50% of the efficacy observed in historical data.^{10} In this case, the noninferiority margin would be 50% of the upper bound of the 95% CI (−0.70) for the treatment effect difference (vs control) from historical data (ie, −0.35), and the symmetric margin for an equivalence trial is then (−0.35, 0.35).

Using this margin of (−0.35, 0.35) and assuming that the true difference is zero, a sample size of 356 patients will provide 90% power to establish equivalence with type I error of 0.05 in terms of standardized DAS28 change from baseline at week 24. If a 20% attrition rate by week 24 is assumed, the trial will need to enroll a total of 445 patients.
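A normal-approximation sketch of this calculation is shown below. It assumes a standard deviation of 1 for the standardized endpoint, equal allocation to 2 arms, and `alpha` as the level of each 1-sided test; these are assumptions made for illustration, and the function name is hypothetical. Under these assumptions the approximation gives a total close to, but not identical to, the 356 patients quoted above, which may rest on a different or more exact method.

```python
from math import ceil
from statistics import NormalDist


def approx_total_n_mean_difference(margin, sd=1.0, alpha=0.05, power=0.90):
    """Approximate total sample size (two equal arms) for a TOST on a mean difference.

    Assumes a true difference of zero, symmetric margins of +/- margin, a
    common standard deviation sd, and a normal approximation. alpha is the
    level of each one-sided test.
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha)
    z_beta = norm.inv_cdf(1.0 - (1.0 - power) / 2.0)
    n_per_arm = ceil(2.0 * (sd * (z_alpha + z_beta) / margin) ** 2)
    return 2 * n_per_arm


margin = 0.5 * 0.70  # 50% of the magnitude of the historical bound, ie 0.35
total_n = approx_total_n_mean_difference(margin)
print(total_n)

# Attrition adjustment applied to the figure quoted in the text (integer ceiling)
print(-(-356 * 100 // (100 - 20)))  # 445
```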

## CONCLUSIONS

Statistical analyses used for clinical assessments of potential biosimilars are different from those used for regulatory approval of the originator or those applied for approval of a small-molecule generic drug. Understanding the different analyses, comparisons, endpoints, and populations used, as well as the calculation of margins and sample size in these studies, should allow the clinician to make informed decisions when prescribing biological agents once a biosimilar is approved by the relevant regulatory authorities.

## REFERENCES

**Keywords:** biosimilar; statistical analyses; equivalence