Normand, Sharon-Lise PhD*; Marinac-Dabic, Danica MD, PhD†; Sedrakyan, Art MD, PhD†; Kaczmarek, Ronald MD, MPH†
Current approaches for integrating clinical information in clinical trials and real-world settings of medical devices require updating. This need arises from the recognition of at least 2 facts. First, randomized controlled trials (RCTs), when serving as the basis for new device approval, are small, short term,1 and generalizable to an increasingly smaller percentage of patients. The reason for decreased generalizability is 2-fold: (1) the population is aging and has more chronic diseases, so older and sicker patients make up a larger portion of routine practice yet are often excluded from trials; and (2) trials increasingly include less sick patients who are less likely to benefit.2
Second, postmarket studies are often voluntary, have design limitations, and are difficult to execute.3 Although these problems are not new, they have become increasingly important during the last decade because device technology is changing at a rapid pace, therapies are used outside their intended populations, and more representative groups of patients are likely to have differential responses to the same therapy.4 A broader more inclusive group of patients means wider ranges of disease severity, of sociodemographic characteristics, of genetic characteristics, and of health-related behaviors. Consequently, the device effectiveness will be more heterogeneous.
Some of the most common and challenging devices are those used in hip replacement. A total hip replacement involves cutting off the top of the femur, inserting a stem (with a femoral ball) into the femur, and replacing the hip's socket, which will articulate with the femoral ball. Patient enrollment and retention in the pre- or postapproval study setting pose unique problems in assessing hip replacement systems because long-term follow-up, generally 10 years postimplantation, is required. Blinding and allocation concealment in RCTs are difficult, and the numerous potential comparators require very large numbers of patients to be studied. Device characteristics, such as mechanical performance and reliability of the prostheses, can directly affect patient outcomes such as hip dislocation and failure.5 Finally, it is not practical to design thousands of clinical studies with long-term follow-up to assess all comparators, particularly when technology is rapidly evolving.
Because of these particular challenges, combining pre and postmarket information may provide an additional mechanism to learn about outcomes of medical device use. In this article, we propose an approach to the integration of multiple and diverse data sources to obtain more precise, and potentially faster, information about postmarket performance of hip implants. The key assumption underpinning our approach is that device performance characteristics and clinical outcomes obtained from 1 study, whether randomized or observational, are related to performance characteristics and outcomes of the same device or similar devices observed in other studies.
Current Paradigm for Device Approval
Premarket approval (PMA) of medical devices requires a number of key pieces of evidence including biocompatibility, preclinical testing, and pivotal clinical trials. Clinical trial populations are relatively homogeneous to demonstrate safety and efficacy/effectiveness, and maximize internal validity of the study. These particular strengths are also weaknesses for using them for postmarket surveillance: compromised generalizability of study findings. More recent clinical trials have attempted to broaden the inclusion criteria for their study populations. As the spectrum of included patients has broadened, so has the variation of individual treatment effects. Subgroup analyses are often used to explore evidence of treatment heterogeneity but suffer several difficulties: many trials are not sufficiently powered for subgroup testing; many subgroup analyses are undertaken on a posthoc basis; the most appropriate statistical methods for making inferences about subgroups are often not used; and many subgroup results are overinterpreted.6
What about after approval? In the United States, oversight of medical devices is undertaken by the Center for Devices and Radiological Health (CDRH) at the Food and Drug Administration (FDA). CDRH uses 2 general approaches to monitor the safety of medical devices. First, a passive reporting surveillance system, the Manufacturer and User Facility Device Experience (MAUDE) Database, is monitored. By regulation, manufacturers and importers of medical devices are required to submit reports of device-related deaths, serious injuries, and malfunctions. Facilities that use the devices are required to report deaths to the FDA, and deaths and serious injuries to the manufacturer. The MAUDE database currently houses more than 2 million reports of device-related adverse events, comprising nationwide and non-US data. It is 1 of a few sources available for assisting in the early identification and characterization of emerging medical device problems and related public health issues. For example, the reports are used for health hazard evaluations, product assessments, trend analysis, regulatory actions, or education program development.
However, there are inherent limitations with passive surveillance systems. The adverse event reports are often incomplete or difficult to understand and interpret. Underreporting continues to be a problem due to lack of detection or attribution of device to specific events, lack of knowledge about reporting system, liability concerns, or lack of motivation to report. Because MAUDE does not include information about all devices, estimation of population-based adverse event rates is not possible.
The second surveillance tool used by CDRH is mandated postmarket studies (ie, postapproval studies) conducted as conditions of approval [21 CFR 814.82(a)]. These postapproval studies are important epidemiologic tools that enable the FDA to evaluate performance of medical devices as they transition from the clinical trial environment to broader patient and clinician populations. Postapproval study questions may relate to assessment of longer term outcomes under general conditions of use, assessment of the effectiveness of training programs, and device performance in specific subgroups. Unfortunately, the design and conduct of effective and least burdensome postapproval studies remain a challenge. In the postmarket environment, recruitment of clinical investigators, of health care facilities, and of patients is often more difficult because access to a given approved device does not require study participation.
The public health utility of existing databases external to the FDA, including various types of registries, has not been systematically evaluated in the context of complementing or providing a vehicle for conduct of FDA-mandated postapproval studies. The successful integration of all information will benefit the FDA, the sponsors of medical devices, and the public through an enhanced understanding of the diffusion path and device performance in understudied populations. In its evolving role as a public health agency, the FDA has begun engaging its public health partners and leveraging existing data sources to address threats to the safety of medical products through novel approaches designed to augment more traditional regulatory tools.7
Hip Arthroplasty in the United States
Therapies for joint failure caused by osteoarthritis include supplements, antiinflammatory medications, physical therapy, hip resurfacing, and partial or total hip replacement (Table 1). Artificial hips are used in hundreds of thousands of patients to alleviate pain caused by osteoarthritis, aseptic necrosis, and trauma, and to restore patients' mobility and improve their quality of life. Osteoarthritis is the reason for hip replacement in most device recipients. In the United States, the number of total or partial hip replacements rose from 290,700 in 1997 to 383,500 in 2004; the average hospital cost for total or partial hip replacement in 2004 was $14,500, with 62% of the hospital stays attributed to females and 68% to those 65 years of age and older.8 As the prevalence of osteoarthritis continues to increase with age and as the elderly proportion of the US population continues to increase, it is predicted that, by 2020, 18.2% of Americans will be affected by osteoarthritis.9 At the same time, the growing prevalence of obesity will continue to contribute to an overall rise in arthroplasty procedures.10 Moreover, because women are at a greater risk of developing degenerative arthritis and have greater longevity, they constitute a disproportionately sizeable cohort of hip device recipients.
The options for surgeons and patients have also increased. The bearings (surfaces) that form the key components of the hip replacement system have changed substantially over time and have been grouped into 4 broad categories: metal femoral heads with a polyethylene liner in the acetabular cup (metal-on-polyethylene); ceramic femoral heads with a polyethylene liner [ceramic-on-polyethylene (COP)]; metal femoral heads and metal liner (metal-on-metal); and, more recently, ceramic femoral heads and ceramic liner (ceramic-on-ceramic). As of June 30, 2009, there are approximately 960 metal-on-polyethylene or COP hip replacement systems cleared for use and 136 metal-on-metal hip replacement systems cleared for use. Both these types of hip systems were initially introduced before 1976, so their approval pathway was through the premarket notification review process denoted as “510(k).” Ceramic-on-ceramic hip replacement systems were first approved for use in the US market in 2003 through the PMA pathway. As of June 30, 2009, there are 10 ceramic-on-ceramic hip replacement systems approved for use in the United States.
Why do the bearings matter? The answer is simple: through use, wear particles are generated by the interaction of the hip components. Particulate wear debris leads to the formation of tissue, which may eventually cause aseptic loosening of the hip implant. Clinical studies have documented that aseptic loosening is a major cause of revision surgery after total hip arthroplasty.11 Although various surface combinations have been developed in an effort to decrease the generation of wear particles, each has its own drawbacks. For example, metal-on-metal total hip systems generate metal ions such as chromium and cobalt. No adverse health effects have been definitively linked to the increased serum metal ion levels, but they remain a concern. Ceramic-on-ceramic hips have been associated with audible squeaking.
Because multiple sources of information are assembled to get a comprehensive assessment of device performance, we may crossclassify these data sources by at least 2 of the features: study design (experimental versus nonexperimental) and species (human versus nonhuman).
In the preclinical development phase of implants, laboratory bench and animal studies are conducted to determine preliminary device safety and effectiveness. Several tests may be undertaken in controlled settings: microbiological, toxicological, immunologic, and biocompatibility testing. Device performance is quantified by mechanical wear, corrosion, and shelf life testing. For wear and fatigue testing, investigators evaluate components of the hip device under conditions that represent the in vivo environment.
In the clinical phase, a randomized clinical trial may be conducted. Table 2 presents design and summaries of the pivotal trial data for the 10 ceramic-on-ceramic devices approved for use on the US market. Several features are worth noting. First, many of the hip systems have been on the market outside the United States before approval within the United States. Second, some of the studies are randomized, whereas others are historically controlled studies. Moreover, 5 of the ceramic-on-ceramic hip systems' PMAs were granted, in part, due to a licensing agreement with the company that owns the rights to the PMA for the TRANSCEND Ceramic Hip System. This company distributes the ceramic components used in these 5 hip systems and the TRANSCEND System. Rather than new premarket clinical data, component comparisons and preclinical test results of the new systems were used to demonstrate that the subject components performed similarly to the TRANSCEND System. Third, and not surprising, the nature of the device characteristics and clinical outcomes are very similar across the studies.
Nonexperimental (Observational) Data
There are several sources of nonexperimental data that may be informative for surveillance of hip replacement systems. A natural source of information is the FDA-mandated postapproval study. Ongoing postapproval studies include studies of 10 ceramic-on-ceramic total hip replacement systems and 2 hip resurfacing systems.
Several registries also exist in the United States that contain detailed information on patients treated with hip arthroplasty. A registry is an “organized system that uses observational study methods to collect uniform data on populations defined by an exposure (including to a manufactured product or a healthcare procedure), disease, or condition.”12 In a report to the FDA reviewing US orthopedic device implant registries, 9 were identified as meeting specific FDA inclusion/exclusion criteria.13 Table 3 summarizes key features of the 9 US registries: GLORY Registry (University of Massachusetts); Harris Joint Registry; HealthEast Orthopedic Joint Registry; the Hospital for Special Surgery/Center for Education and Research on Therapeutics Total Joint Replacement Registry (Weill Cornell Medical College); Kaiser Permanente National Total Joint Replacement Registry; Mayo Clinic Joint Replacement Database; MaineHealth Total Joint Replacement Registry; Rush University Medical Center Joint Registry; and Virginia Joint Registry. Limitations of the registries relate to a definition of a well-defined population; the selection of participants, of centers, and of surgeons; longitudinal follow-up; and geographic diversity. No national orthopedic registry for total hip replacements exists in the United States.
Because hip arthroplasty is an expensive inpatient procedure, administrative data, such as hospital billing data, are a potential supplemental source for assessing comparative effectiveness. Inpatient databases can be used to quantify utilization, patient characteristics, surgeon characteristics (for some state inpatient databases), and some outcome information such as in-hospital complications, discharge status, and reoperations (in those databases having unique patient identifiers). The advantage of the hospital billing data is the inclusion of all payers: there is no restriction to health insurance status of the patient. The usual disadvantages include the lack of clinical, radiographic, functioning, and quality-of-life data; the lack of a device-specific identifier; and difficulty in follow-up. Similarly, Medicare claims can be used to track patients longitudinally, quantify their inpatient and outpatient care, and in some cases, pharmaceutical utilization.
Typical clinical outcomes after hip replacement include activity and function improvement [the Harris Hip Score (HHS), the University of California at Los Angeles activity score, the Western Ontario and McMaster Universities score, and the Short-Form-12 score]; thigh pain; and survivorship (time to hip revision or revision-free success). Complications include infection; device dislocation (radiographic evidence of component loosening or breakage; wear; cup dissociation; and linear dislodgement); and femoral stem fracture. Device design and surgical technique have both been linked to dislocation. Thromboembolism after hip replacement is rare but possible, and although government and payers have mandated prophylaxis against venous thromboembolic disease, the ideal agent, timing, and duration of administration are unknown.
Our underlying assumption is that device performance obtained from 1 “study” is related to performance characteristics and outcomes of the same device or occasionally similar devices observed in other studies, particularly when studied populations are similar. This strategy assumes that, although there will be diversity in terms of clinical characteristics of enrolled populations, surgical skills, and participating hospitals, there is some relationship of device performance across the study populations. We also assume that, although studies can vary in their designs, such as use of randomization, blinding, and other study methodological quality indicators, each provides valid information about safety and effectiveness. Thus, we use the term “study” generically—it can be experimental or nonexperimental and involve human subjects or animals.
We wish to assemble all the available evidence to reduce uncertainty about performance of hip replacement systems in the postmarket setting. However, although we believe particular relationships exist, we are uncertain about their relevance: how device performance characteristics of different hip replacement systems relate to patient outcomes; how devices that have been compared in other studies on similar outcomes but not to each other are related; and how devices that have been compared in other studies on different outcomes but not to each other are related.
We begin by positing an underlying mechanism that generates the observed data. Let Yijkm denote the mth outcome associated with the kth “treatment” in the jth group of patients within the ith study. The set of outcomes includes the patient clinical outcomes and the device outcomes; the set of treatments could include metal-on-metal hip systems, metal-on-polyethylene hip systems, ceramic-on-ceramic hip systems, and antiinflammatory medications; groups refer to the distinct patient groups in the study; and studies comprise specific trials or observational cohorts. For example, Yijkm may represent the number of patients implanted with ceramic-on-ceramic hip replacement systems who had a HHS ≥80 in the ith FDA-mandated postapproval study. We assume that, although some outcomes in cells defined by outcome × treatment × group × study may be missing, the set of observed outcomes is connected.14 Specifically, letting g(·) denote a link function, we assume
$$g\{E(Y_{ijkm})\} = \alpha_m + \beta_k + \gamma_{mk} + a_i + b_{j(i)} + c_{k(i)} + d_{m(i)} + \varepsilon_{ijkm} \qquad (1)$$
where Yijkm: Outcome
αm: Mean outcome or average base rate for mth outcome
βk: Effect of treatment k for average study and average outcome
γmk: Deviation from average of treatment k on outcome m
ai: Main effect of ith study
bj(i): Study-specific effect of jth group within ith study
ck(i): Study-specific effect of treatment k within ith study
dm(i): Study-specific effect of outcome m within ith study
εijkm: Sampling error (known)
The notation E(Yijkm) refers to the “expected” or average value of the outcome. We have also assumed the available data are summaries, eg, the average failure rate, rather than at the patient-level, although it is a simple modification to include patient-level data. The function g(·) may represent the logit link. Assuming the observed outcomes are connected means that any observed (i,j,k,m) can be reached from any other (i*,j*,k*,m*). Connectedness would imply that, for example, a HHS could be estimated for studies based on administrative billing data by extrapolation via the relationship represented in Eq. (1). The statistical model described earlier is a hierarchical generalized linear model where outcomes are nested within studies.
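The connectedness condition can be checked mechanically before a model of this kind is fitted. The sketch below is a minimal illustration and not the authors' procedure: it assumes that two observed cells are linked when they share a study, a patient group within a study, a treatment, or an outcome, and it uses a union-find structure to test whether all observed cells fall into a single component. The function name `is_connected` and the example cells are hypothetical.

```python
def is_connected(cells):
    """Return True if every observed cell can be reached from every other.

    Each cell is a (study, group, treatment, outcome) tuple. Two cells are
    treated as adjacent when they share a level of at least one factor
    (an assumed, simplified notion of connectedness). Groups are nested
    within studies, so a group is identified by the pair (study, group).
    """
    cells = list(cells)
    if not cells:
        return True
    parent = list(range(len(cells)))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # factor level -> index of the first cell that had it
    for idx, (study, group, treatment, outcome) in enumerate(cells):
        levels = [("study", study), ("group", study, group),
                  ("treatment", treatment), ("outcome", outcome)]
        for level in levels:
            if level in first_seen:
                parent[find(idx)] = find(first_seen[level])
            else:
                first_seen[level] = idx
    return len({find(i) for i in range(len(cells))}) == 1


# Hypothetical cells: the two studies are linked through a shared treatment.
linked = [("study_A", "g1", "COC", "HHS"),
          ("study_A", "g2", "COP", "HHS"),
          ("study_B", "g1", "COC", "revision_free")]
# Hypothetical cells with no shared study, treatment, or outcome.
split = [("study_A", "g1", "COC", "HHS"),
         ("study_C", "g1", "MOM", "metal_ions")]
print(is_connected(linked), is_connected(split))
```

Reachability in this sense is only a necessary screen; it does not by itself guarantee that every parameter in Eq. (1) is well estimated.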
Heterogeneity among outcomes is permitted by assuming γmk, ai, bj(i), ck(i), and dm(i) are random effects. An additional random effect, ψkk*m, may be included to permit inconsistency in the evidence for a pair of treatments (hip system k compared with hip system k*) on the mth outcome. Fixed characteristics of the study (eg, randomized versus observational), of the treatment (ie, type of bearing surface, relative motion of the implant, and bone under cyclic loading), and of the patient groups (ie, sex, age, and obesity) can be easily included in the model. Expected differences in outcomes for 1 hip replacement system compared with another system, averaged over all patient groups, can be obtained as functions of the parameters in Eq. (1). Because of the complex structure of the data due to nesting, estimation of the parameters can be achieved using a Bayesian approach. Useful summaries of specific comparisons may be presented using posterior distributions. A particular advantage of the Bayesian approach is computation of Bayes factors. A Bayes factor explicitly quantifies the relative evidence for the performance of 2 hip replacement systems and may be obtained straightforwardly from the model specified in Eq. (1).
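For reference, the standard definition of a Bayes factor comparing 2 hypotheses is given below. The labels H_1 and H_2 (for example, "hip system k performs better than hip system k*" and its complement) are illustrative assumptions and are not notation used in this article.

```latex
\[
\mathrm{BF}_{12} \;=\; \frac{p(y \mid H_1)}{p(y \mid H_2)},
\qquad
\frac{P(H_1 \mid y)}{P(H_2 \mid y)} \;=\; \mathrm{BF}_{12} \times \frac{P(H_1)}{P(H_2)}
\]
```

A value of BF_12 above 1 means the observed data y are more probable under H_1 than under H_2; the second identity shows that the Bayes factor is the multiplier that converts prior odds into posterior odds.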
We use data from 3 pivotal studies of 5 ceramic-on-ceramic total hip systems (Table 4), namely ABC I, ABC II, Trident, Transcend, and Reflection, to illustrate ideas. These data comprise part of the information contained in the Summary of Safety and Effectiveness for each artificial hip, which is publicly available on the FDA website.15 Three of the ceramic-on-ceramic hips were evaluated using randomized designs (ABC I, ABC II, and Reflection) and 2 used observational designs (Trident and Transcend). Of note, the Trident system was an observational arm of the ABC study and thus nested within that study. Furthermore, 630 additional subjects implanted with the Transcend system, denoted Transcend Extended, were recruited after completion of enrollment for the primary subjects implanted with Transcend (Transcend Original). In total, 3 COP hips served as comparators in the studies. There are 9 observations in total reflecting information from 1851 subjects implanted with 1 of 8 unique hip systems. Patients recruited into the 3 studies were all 21 years of age or older and presented with noninflammatory degenerative joint disease. The primary reasons for the implant included osteoarthritis, avascular necrosis, traumatic arthritis, and congenital hip dysplasia. The mean age across the specific artificial hips ranged from 50 for the Reflection ceramic-on-ceramic artificial hip to 63 for subjects implanted with the COP hip used in the TRANSCEND study. Our illustration is limited in that the artificial hips are completely nested within study; the overall goal is to supplement the data in Table 4 with information from other studies that contain outcomes about the same hip systems.
We examine 2-year HHSs, Yc, assumed continuous, and “counts” of subjects who are revision free at 2 years, Yb. We let Ycij represent the jth mean 2-year HHS from the ith study and let ς2cij denote the reported variance of Ycij. We permit the means of the HHSs to depend on the type of device (ceramic-on-ceramic coded as 1 versus 0 for COP) and on the type of design (randomized coded as 1 versus 0 for observational). We further assume, as in a meta-analysis, that the variances of the means of the HHSs, ς2cij, are known. Next, we let Ybij denote the number of the nij subjects who are revision free 2 years after implant and assume that this number arises from a Binomial distribution with success probability pij. We permit the log odds or logits of the probability of 2-year revision-free success to depend on the same covariates as the means of the HHSs. Although we permit the same covariates to be associated with both outcomes, this is not a requirement. Importantly, we allow the association parameters of the covariates to differ for the 2 outcomes. To accommodate correlation between outcomes measured within the same study, we assume a latent variable, ui, having a normal distribution with mean 0 and variance, ς2u. With only 3 studies, we anticipate that the posterior distribution for ς2u will not be well estimated, particularly when assuming vague prior distributions. A joint model for the 2 types of outcomes is given in Eqs. (2)–(3):
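One way to write a joint model that matches this verbal description is sketched below. It is a reconstruction under stated assumptions rather than the authors' displayed equations: the intercepts β_c0 and β_b0, the HHS coefficients β_c1 and β_c2, the covariate symbols x_1ij (1 for ceramic-on-ceramic, 0 for COP) and x_2ij (1 for randomized, 0 for observational), and the way u_i enters the logit are all assumed. Only β_b1 and β_b2 are named elsewhere in the text.

```latex
\begin{align}
Y^{c}_{ij} \mid u_i &\sim \mathrm{N}\!\left(\beta_{c0} + \beta_{c1} x_{1ij} + \beta_{c2} x_{2ij} + \varsigma_{cij}\, u_i,\; \varsigma^{2}_{cij}\right) \tag{2} \\
Y^{b}_{ij} \mid u_i &\sim \mathrm{Binomial}\!\left(n_{ij},\, p_{ij}\right), \qquad
\operatorname{logit}(p_{ij}) = \beta_{b0} + \beta_{b1} x_{1ij} + \beta_{b2} x_{2ij} + u_i \tag{3} \\
u_i &\sim \mathrm{N}\!\left(0,\; \varsigma^{2}_{u}\right) \notag
\end{align}
```

In this form the 2 outcomes share the study-level latent variable u_i, which is what induces correlation between them within a study.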
The term, ςcij, multiplying the latent variable in Eq. (2) is required for model identifiability.16 Some important assumptions are worth emphasizing. First, we assume the mean HHS and the logit of revision-free success reported for the jth observation in the ith study are independent when we know ui. This is referred to as conditional independence. A key assumption relates to the level at which we assume the information can be considered similar or “exchangeable.” We have assumed that the “true” unknown mean HHS within each study is a random quantity drawn from a distribution that depends additively on whether the artificial hip is ceramic-on-ceramic or COP and on whether the study design was randomized or observational. Specifically, this implies that if we are considering 2 separate artificial hips and both are ceramic-on-ceramic and both involved subjects recruited into a randomized study, we cannot say which of the 2 artificial hips would have a higher mean HHS.
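A small simulation can make the role of the latent variable concrete. The sketch below is illustrative only: it assumes a mean HHS that is normal around a linear predictor plus ςcij·ui, and a revision-free count that is binomial with logit equal to a linear predictor plus ui. Every coefficient value, the sample size, and the function names are hypothetical. It shows that the 2 outcomes are correlated across studies when ςu > 0 and essentially uncorrelated when ςu = 0, even though they are independent given ui.

```python
import math
import random


def simulate_study(rng, sigma_u, ceramic=1, randomized=1,
                   se_hhs=1.0, n_subjects=200):
    """Draw one (mean HHS, revision-free count) pair sharing a latent u.

    All coefficient values are hypothetical, chosen only for illustration.
    """
    beta_c = (90.0, 0.5, 4.0)   # intercept, ceramic, randomized (HHS scale)
    beta_b = (3.0, 0.0, 0.0)    # intercept, ceramic, randomized (logit scale)
    u = rng.gauss(0.0, sigma_u)  # study-level latent variable
    mu = (beta_c[0] + beta_c[1] * ceramic + beta_c[2] * randomized
          + se_hhs * u)
    y_c = rng.gauss(mu, se_hhs)  # reported mean HHS, known standard error
    logit_p = beta_b[0] + beta_b[1] * ceramic + beta_b[2] * randomized + u
    p = 1.0 / (1.0 + math.exp(-logit_p))
    # Binomial draw: count subjects who remain revision free.
    y_b = sum(rng.random() < p for _ in range(n_subjects))
    return y_c, y_b


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def outcome_correlation(sigma_u, n_studies=2000, seed=0):
    """Correlation between the two outcomes across simulated studies."""
    rng = random.Random(seed)
    draws = [simulate_study(rng, sigma_u) for _ in range(n_studies)]
    return pearson([d[0] for d in draws], [d[1] for d in draws])


print(outcome_correlation(sigma_u=1.0))  # clearly positive
print(outcome_correlation(sigma_u=0.0))  # close to zero
```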
We also note that there is no report of 2-year HHSs or of revision success for the Trident system, and we thus treat these 2 observations as parameters to be estimated. Two comparisons are also estimated: (1) the effectiveness of the ABC COP system compared with the Transcend COP system and (2) the ABC I Ceramic-on-Ceramic compared with the Reflection Ceramic-on-Ceramic.
Vague but conjugate prior distributions are assumed for all regression coefficients. We vary the prior distribution for ς2u, however. We begin by assuming the standard deviation (SD), ςu, arises from a uniform distribution on the interval (0,100); we then assume a uniform distribution over a smaller interval (0,10) and finally we assume the precision, 1/ς2u, arises from a vague gamma distribution. We estimate this model in the WinBUGS software implemented in R using 3 parallel chains with a burn in of 12,500 iterations and inference based on 2028 iterations of the remaining 12,500 draws. Convergence was determined with use of the Gelman-Rubin statistic that examines the ratio of the between-chain to within-chain variability.
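For readers unfamiliar with the convergence check, the sketch below computes the textbook form of the Gelman-Rubin potential scale reduction factor for a single scalar parameter from several parallel chains. It is a minimal illustration under stated assumptions: the diagnostic reported by the software may be a variant of this formula, and the function name `gelman_rubin` and the simulated chains are hypothetical.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists of post-burn-in draws, one per chain.
    Values near 1 suggest the chains are sampling the same distribution.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)   # W: mean within-chain variance
    between = n * variance(chain_means)          # B: between-chain variance
    pooled = (n - 1) / n * within + between / n  # pooled variance estimate
    return (pooled / within) ** 0.5


# Hypothetical example: 3 chains drawn from the same normal distribution.
rng = random.Random(1)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Hypothetical example: 3 chains stuck around different values.
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0, 5, 10)]
print(gelman_rubin(mixed), gelman_rubin(stuck))
```

The statistic compares the pooled variance estimate, which includes between-chain variability, with the average within-chain variance; large values indicate that the chains have not yet mixed.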
Because of the lack of variation in 2-year revision success, no relationship of type of hip (ceramic or polyethylene), represented by βb1, or of randomization (βb2) to revision success was observed (Table 5). In contrast, regardless of the prior assumed for ς2u, subjects participating in randomized studies scored 4 points higher (posterior SD = 0.6) on the HHS at 2 years than subjects not randomized. No meaningful differences in HHSs were observed for ceramic compared with polyethylene hip systems.
Figure 1 displays summaries of the posterior draws for 4 additional parameters. Each graph contains density estimates assuming 3 different prior distributions for the variance parameter. The upper left-hand graph indicates patients implanted with the COP hip system used in the ABC study had, on average, a 4-point higher 2-year HHS than those implanted with the COP hip system used in the Reflection study (posterior mean difference = 4.2, posterior SD = 0.6). There were no differences in HHSs between those implanted with the ABC I Ceramic-on-Ceramic system compared with those implanted with the Reflection Ceramic-on-Ceramic system (posterior mean difference = 0.7, posterior SD = 0.6) although the 25th percentile is 0.1. The lower left-hand graph indicates that had the Trident participants been followed out to 2 years, the average HHS would be 91.5 (posterior SD = 1.0). Finally, Figure 1 demonstrates both the lack of information about the values of ς2u given the large interval covered and the dependence of the posterior distribution on the assumed prior (lower right hand). The inclusion of additional studies or cohorts guided by expertise from clinicians will enable more precise estimation.
Our idea combines statistical methods proposed for several distinct problems: meta-analysis17; crossdesign synthesis18; and network meta-analysis.19 Meta-analysis involves combining information across related but independent studies and typically is restricted to randomized studies. Although combining evidence from several RCTs is a powerful strategy to increase power and to study additional hypotheses, because RCTs are limited in their generalizability, we believe they should not be used as the only source to learn about real-world comparative effectiveness. Crossdesign synthesis extends meta-analysis by combining results of studies of different designs such as randomized and observational studies. This type of synthesis has been used to assess various treatment strategies, including the effectiveness of bisphosphonates in reducing osteoporotic fractures,20 and is a useful tool when important information about device effectiveness arises from studies of fundamentally different designs. A network meta-analysis is a method used to assess the comparative effectiveness of experimental treatments among similar patient populations that have not been compared directly in a randomized clinical trial. This methodology permits estimation of heterogeneity in the effect of any treatment and also estimation of the inconsistency of effectiveness from different pairs of treatments. The use of Bayes factors rather than P values in the surveillance setting has the advantage of directly measuring the strength of evidence for 1 treatment compared with another.
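As a point of comparison for the simplest of these building blocks, the sketch below pools study-level estimates with known variances by classical inverse-variance weighting, with a DerSimonian-Laird moment estimate of between-study variance. This is a swapped-in, non-Bayesian illustration of conventional meta-analysis and not the hierarchical model used in this article; the function names and the input numbers are hypothetical.

```python
def pool_fixed(estimates, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, 1.0 / total


def pool_random(estimates, variances):
    """Random-effects pooling with the DerSimonian-Laird estimate of tau^2.

    Returns (pooled estimate, its variance, tau^2).
    """
    weights = [1.0 / v for v in variances]
    fixed, _ = pool_fixed(estimates, variances)
    # Cochran's Q measures the spread of estimates around the fixed pool.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    df = len(estimates) - 1
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Re-pool after adding the between-study variance to each study.
    pooled, var = pool_fixed(estimates, [v + tau2 for v in variances])
    return pooled, var, tau2


# Hypothetical study-level mean differences and their known variances.
estimates = [1.0, 3.0]
variances = [1.0, 1.0]
print(pool_fixed(estimates, variances))
print(pool_random(estimates, variances))
```

The random-effects variance is never smaller than the fixed-effect variance, which reflects the additional uncertainty from heterogeneity among studies.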
We believe that by combining these 3 methodological approaches into 1 approach, we will move the field of medical device surveillance forward significantly. First, our approach reflects the need to view evidence on a continuum rather than silos of information where a device either passes or fails at each step. Second, a model for combining all available information is clear and unambiguous. Assumptions are explicitly stated and their plausibility can be assessed. Third, empirical results from the model will make explicit where uncertainties remain. Our approach provides a unifying framework for obtaining and updating evidence. We believe that this approach will advance the regulatory science by introducing a transparent and dynamic new paradigm for knowledge management throughout the total product life cycle.
© 2010 Lippincott Williams & Wilkins, Inc.