# A Quality-Effects Model for Meta-Analysis

## Abstract

We introduce a quality-effects approach that combines evidence from a series of trials comparing 2 interventions. This approach incorporates the heterogeneity of effects in the analysis of the overall interventional efficacy. However, unlike the random-effects model based on observed between-trial heterogeneity, we suggest adjustment based on measured methodological heterogeneity between studies. We propose a simple noniterative procedure for computing the combined effect size under this model and suggest that this could represent a more convincing alternative to the random-effects model.

Meta-analysis is used in a wide range of disciplines—notably in epidemiology and evidence-based medicine, where results of some meta-analyses have led to changes in clinical practice and health care policies. Meta-analyses use statistical methods to combine the results of several studies that address similar research hypotheses. The basic premise is that the combined results from a group of studies can allow a more precise and balanced estimate of an effect than the individual studies.

One problem with meta-analyses of clinical trials is that differences among trials are not addressed appropriately by current meta-analysis models.^{1} There are several reasons for these interstudy differences, including chance, different definitions of treatment effects, design-related heterogeneity (quality), and finally unexplainable and real differences, all of which may introduce bias.^{2} The most important of the explainable differences is quality, which refers to the likelihood that the trial design has generated unbiased results with sufficient precision to allow clinical application.^{3} The quality of individual studies will also affect the quality of the combined estimates, as well as the magnitude of the results. If the quality of the primary material is inadequate, the conclusions of the review may be invalid regardless of the use of a random-effects model. Such inadequacies may occur in the randomization process, in the masking to the allocated treatment, in the random generation of number sequences, or in the analysis. There is therefore a need to assess the quality of studies in a more explicit way than simply to insert a random term based on heterogeneity,^{4} as is done with the random-effects model.

One way of dealing with this problem is to include only the most methodologically sound studies in the meta-analysis, a practice often termed “best-evidence meta-analysis.”^{5} Another approach is to include weaker studies, and add a study-level predictor variable that reflects the methodologic quality of the studies. This allows an assessment of the effect of study quality on the effect size.^{6} Yet another possibility is to incorporate the quality scores as weights.^{7} However, the latter has been done intuitively, without a clear methodologic basis for optimizing the estimate. Our objective here is to present a new method of weighting studies in a meta-analysis using a quality score (akin in some ways to weighting using the homogeneity statistic). We compare the results obtained from applying the old and new methods in a summary of clinical trials that were included in a systematic review of the efficacy of radioactive iodine for the ablation of thyroid remnants after surgery for thyroid cancer.^{8}

## METHODOLOGY

Because the results from different studies investigating different independent variables are measured on different scales, the dependent variable in a meta-analysis is some standardized measure of effect size. When the outcome of the experiments is dichotomous (success vs. failure), one of the commonly used effect measures in clinical trials is a relative risk (RR). The approach frequently used is the “inverse-variance method” based on Woolf.^{9} The average effect size across all studies is computed, with the weights equal to the inverse variance of each study's effect estimator. Larger studies and studies with less random variation are given greater weight than smaller studies. In the case of studies reporting a RR, the log RR has a standard error (se) given by

$$
se_i = \sqrt{\frac{1 - P_{iT}}{n_{iT} P_{iT}} + \frac{1 - P_{iC}}{n_{iC} P_{iC}}}
$$

where P_{iT} and P_{iC} are the risks of the outcome in the treatment group and control groups, respectively, of the *i*th study and n_{iT} and n_{iC} are the numbers of patients in the respective groups. The weights (*w*) allocated to each of the studies are inversely proportional to the square of the standard error; thus,

$$
w_i = \frac{1}{se_i^2}
$$

which gives greater weight to those studies with smaller standard errors. The combined effect size is computed by the weighted average as

$$
\widehat{ES} = \frac{\sum_{i} w_i \, ES_i}{\sum_{i} w_i}
$$

where $\widehat{ES}$ is the overall effect size measure (here, the pooled log RR) and $ES_i$ is the log RR of the *i*th study; it has a standard error given by

$$
se_{ES} = \sqrt{\frac{1}{\sum_{i} w_i}}
$$

Assuming these estimates are distributed normally, the 95% confidence intervals (CIs) are easily obtained as

$$
\widehat{ES} \pm 1.96 \times se_{ES}
$$
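As a worked illustration of the inverse-variance method above, the following Python sketch pools log relative risks from a few invented trials (the data are hypothetical and not taken from the review discussed later):

```python
import math

# Invented example data: (events_T, n_T, events_C, n_C) per study
studies = [(30, 100, 45, 100), (12, 60, 20, 60), (50, 200, 70, 200)]

log_rrs, weights = [], []
for eT, nT, eC, nC in studies:
    pT, pC = eT / nT, eC / nC
    log_rr = math.log(pT / pC)
    # se of log RR: sqrt((1-pT)/(nT*pT) + (1-pC)/(nC*pC))
    se = math.sqrt((1 - pT) / (nT * pT) + (1 - pC) / (nC * pC))
    log_rrs.append(log_rr)
    weights.append(1 / se**2)  # inverse-variance weight

# Pooled log RR as the weighted average, with its standard error
es = sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights)
se_es = math.sqrt(1 / sum(weights))
rr, lo, hi = (math.exp(v) for v in (es, es - 1.96 * se_es, es + 1.96 * se_es))
print(f"pooled RR = {rr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Exponentiating the pooled log RR and its confidence limits returns the result to the RR scale.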

As can be seen above, the current approach for combining effects is to incorporate variability in studies by adjustment based on the variance of the estimates in each individual study. Thus, the lower the variance of a RR estimate the greater its weight in the final combined estimate. This approach, although statistically very appealing, does not take into account the further variability that exists among studies arising from differences in the study protocols and how well they were executed. This limitation gave rise to the random-effects model approach.^{10} However, because of the limitations of the random-effects model,^{10} a statistical adjustment for heterogeneity will still produce invalid estimates when used in a meta-analysis of badly designed studies. Furthermore, adjustments based on an artificially inflated variance lead to a widened CI, supposedly to reflect heterogeneity. This added imprecision does not have much clinical relevance.^{4}

We propose a new approach to adjustment for interstudy variability by incorporating a relevant component (quality) that differs among studies, in addition to the weight based on the intrastudy variance used in fixed-effects meta-analysis. We do this by introducing a correction, called $\hat{\tau}_i$, for the quality-adjusted weight of the *i*th study. This is a composite based on the quality of studies other than the study under consideration, and redistributes quality-adjusted weights based on the quality-adjusted weights of other studies. For example, if study *i* is of good quality and other studies are of poor quality, a proportion of quality-adjusted weights from the other studies is mathematically redistributed to study *i*, giving it more weight in the overall effect size. As studies increase in quality, redistribution becomes progressively less and ceases when all studies are of perfect quality. To accomplish this, we first have to adjust weights for quality. One way to incorporate quality scores into such an analysis is as follows^{11–14}:

$$
\widehat{ES} = \frac{\sum_{i} Q_i w_i \, ES_i}{\sum_{i} Q_i w_i}
$$

where $Q_i$ is the judgment of the probability (0–1) that study *i* is credible, based on the study methodology. The variance of this weighted average is then^{11}:

$$
var(\widehat{ES}) = \frac{\sum_{i} (Q_i w_i)^2 \, var(ES_i)}{\left(\sum_{i} Q_i w_i\right)^2}
$$

However, this probabilistic viewpoint on quality-adjusted weights has limitations, and we expand on this system of incorporating quality by adjusting the weight as well as redistributing weights based on quality. This is done as follows. Given that $Q_i$ is our quality adjustor for the *i*th study and *N* is the number of studies in the analysis, then the quality-effects modified weight $w_i'$ is given by:

$$
w_i' = Q_i w_i + \hat{\tau}_i
$$

where

$$
\hat{\tau}_i = \sum_{j \neq i} \hat{\tau}_{ij}
$$

and

$$
\hat{\tau}_{ij} = \frac{(1 - Q_j)\, w_j}{N - 1}
$$

so that the weight forfeited by each study through its quality score is shared equally among the remaining *N* − 1 studies. The final summary estimate is then given by:

$$
\widehat{ES}_{QE} = \frac{\sum_{i} w_i' \, ES_i}{\sum_{i} w_i'}
$$

while the variance of this weighted average is then

$$
var(\widehat{ES}_{QE}) = \frac{\sum_{i} (w_i')^2 \, var(ES_i)}{\left(\sum_{i} w_i'\right)^2}
$$

Although it may seem that $\hat{\tau}_i$ is a function of $w_i$, given that

$$
w_i' = Q_i w_i + \hat{\tau}_i,
$$

multiplying $Q_i$ with $w_i$ means that we are actually adjusting the product of quality and weight by $\hat{\tau}_i$, which, by our definition in the text, is a function of the quality and weights of the other studies excluding this *i*th study.
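To make the redistribution concrete, here is a minimal Python sketch. The specific rule implemented, in which each study forfeits the fraction (1 − Q_j) of its weight and that amount is shared equally among the other N − 1 studies, is one formulation consistent with the description above (it preserves the total weight and vanishes when all studies have perfect quality); the weights and quality scores are invented:

```python
# Invented inverse-variance weights and quality scores Q_i in [0, 1]
w = [10.0, 25.0, 5.0]
q = [0.9, 0.5, 1.0]
N = len(w)

# Assumed redistribution rule: the weight forfeited by the other studies,
# (1 - Q_j) * w_j, is shared equally among the remaining N - 1 studies
tau = [sum((1 - q[j]) * w[j] for j in range(N) if j != i) / (N - 1)
       for i in range(N)]

# Quality-effects modified weights: w_i' = Q_i * w_i + tau_i
w_qe = [q[i] * w[i] + tau[i] for i in range(N)]

# The redistribution preserves the total weight; with q = [1, 1, 1]
# it would leave the fixed-effects weights unchanged
print(w_qe, sum(w_qe))  # total equals sum(w) = 40.0
```

Note that the best-quality study here (the third, with Q = 1) more than doubles its weight, while the large but poor-quality second study loses weight to the others.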

Our suggested adjustment has a parallel to the random-effects model, where a constant is generated from the homogeneity statistic

$$
Q = \sum_{i} w_i \left(ES_i - \widehat{ES}\right)^2
$$

Using this and other study parameters, a constant ($\hat{\tau}^2$) is generated as

$$
\hat{\tau}^2 = \max\left(0,\; \frac{Q - (N - 1)}{\sum_{i} w_i - \sum_{i} w_i^2 \big/ \sum_{i} w_i}\right)
$$

The inverse of the sampling variance, plus this constant representing the variability across the population effects, is then used as the weight

$$
w_i^* = \frac{1}{var(ES_i) + \hat{\tau}^2}
$$

In effect, as $\hat{\tau}^2$ gets bigger, the $se_{ES}$ increases, thus widening the CI; the weights, however, become progressively more equal. In essence, this is the basis for the random-effects model: a form of redistribution of the weights so that outlier studies do not unduly influence the pooled effect size. This is what our method does as well, the differences being that we use a method based on quality rather than statistical heterogeneity, and that the variance is not as artificially inflated as in the random-effects model. The random-effects model adds a single constant to the weights of all studies in the meta-analysis based on the statistical heterogeneity of the trials. Our method redistributes the quality-adjusted weights of each trial based on the measured quality of the other trials in the meta-analysis. Because

$$
se_{ES} = \sqrt{\frac{1}{\sum_{i} w_i^*}},
$$

the addition of an external constant will inflate the variance much more than a redistribution of the weights (assuming the studies demonstrate varying effects). Obviously, if a random variable is inserted to inflate the variance based on heterogeneity, it is not clear what aspect of between-trial differences is being assessed. Senn^{4} has provided an analytic demonstration of this.
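For comparison, the DerSimonian-Laird random-effects computation described above can be written out as follows (the effect sizes and weights are invented for illustration):

```python
import math

# Invented per-study log effect sizes and inverse-variance weights
es = [-0.8, 0.1, -0.9]
w = [10.0, 25.0, 5.0]
N = len(es)

pooled_fe = sum(wi * e for wi, e in zip(w, es)) / sum(w)

# Cochran's homogeneity statistic Q
Q = sum(wi * (e - pooled_fe) ** 2 for wi, e in zip(w, es))

# DerSimonian-Laird between-study variance, truncated at zero
tau2 = max(0.0, (Q - (N - 1)) / (sum(w) - sum(wi**2 for wi in w) / sum(w)))

# Random-effects weights: inverse of (sampling variance + tau^2)
w_re = [1 / (1 / wi + tau2) for wi in w]

pooled_re = sum(wi * e for wi, e in zip(w_re, es)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))
print(f"tau^2 = {tau2:.3f}, pooled RE effect = {pooled_re:.3f} (se {se_re:.3f})")
```

With these heterogeneous inputs, the constant tau^2 is positive, the pooled standard error exceeds the fixed-effects one, and the weights become noticeably more equal, which is exactly the behavior discussed above.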

The computations for the quality effects model were placed into an Excel spreadsheet where the user can substitute trial and quality information to automatically generate a pooled effect size under this model. The spreadsheet is available with the online version of this paper.

## APPLICATION

We have previously published a meta-analysis of radioactive iodine dosage for the ablation of thyroid remnants.^{8} We included 22 studies in this meta-analysis; 6 were randomized controlled trials (RCTs) with mixed surgical status (group 2),^{15–20} 4 were cohorts with near-total thyroidectomy (group 1),^{21–24} and 12 were cohorts with mixed surgical status (group 3).^{23,25–35} We carried out a quality assessment based on our suggested method (Table 1); study characteristics are given in Table 2. We calculated the fixed-effects RR for all 22 studies as 0.79 (95% CI = 0.72–0.88), while the random-effects RR for the 22 studies was 0.73 (0.62–0.85). After applying the quality-effects model, however, we obtained a pooled RR for all 22 studies (Fig. 1) of 0.69 (0.58–0.82). The weights for each meta-analysis method are shown in Table 3.

The difference between the random-effects and quality-effects weights is that the latter include an assessment of differences in study quality. Within the mixed surgical status cohort (Table 3), the study by Angelini et al^{26} accounted for 65% of the weight under both the fixed-effects and the random-effects models, as there were no large discordances in results among this group of studies. The quality-effects weight for the Angelini study was, however, down to 18%, reflecting the fact that it was a 10-year-old abstract for which the full study was never published. The new weights, depicted in Table 3, suggest that the quality-effects model works in a fashion similar to the random-effects model, except that it includes a nonrandom adjustment of weights driven by a numerical assessment of methodologic quality, which has greater clinical relevance. Another benefit of this model is that it avoids the artificial inflation in variance seen with the random-effects model. With a perfect quality score for all studies, the quality-effects model defaults to the fixed-effects model, just as the random-effects model does for homogeneous trials. If we look at the subgroup with large heterogeneity (the RCTs), the CI for the random-effects pooled effect size is substantially wider than for the quality-effects model, although weights are redistributed approximately equally in both models. Again, despite the redistribution, the abstract by Sirisalipoch et al^{20} gets the lowest weight in the quality-effects model, suggesting that quality differences are being used rather than a random statistical factor. Indeed, Hackshaw et al^{36} have also reported a meta-analysis of the association we examined here. Those authors concluded that it was not possible to draw a conclusion from the combined studies for several reasons: the trials were heterogeneous and therefore could not be combined into 1 stratum; studies with different designs also could not be combined; and results were based on subgroup analyses. The random-effects model would have been unable to provide an answer because of the problem of inflated variance. However, the quality-effects model suggests benefit for a higher dose even in the most heterogeneous subgroup (the 6 RCTs), in keeping with the trend in the rest of the studies (Table 3).

## DISCUSSION

In this report we suggest a method whereby the between-study variability is adjusted based on an assessment of the varying quality of the studies, rather than depending on the distribution of the effect estimates from different studies to adjust for heterogeneity. The choice of model includes a judgment of how much weight should be accorded to a trial by virtue of its numbers and effect size alone. In a heterogeneous set of studies, a random-effects meta-analysis will award relatively more weight to smaller studies than such studies would receive in a fixed-effects meta-analysis. However, if results of smaller studies are systematically different from results of larger ones, which can happen as a result of publication bias or low study-quality bias,^{37,38} then a random-effects meta-analysis will exacerbate the effects of the bias. A fixed-effects analysis will be affected less, although it would be inappropriate because it gives a higher weight to a very large trial simply because of its higher precision. However, sample size by itself does not make the estimate more valid or more generalizable. Increased size may be accompanied by simplification of recruitment and data collection in a way that increases the risks of protocol deviation, poor data quality, misclassification, and nontrial use of trial treatments—all of which tend to create a bias towards the null. We demonstrated an alternative approach with a real-life application in which neither a fixed-effects nor a random-effects model of classic meta-analysis adequately described the results.

It has been reported that there is no correlation between quality scores and variation in treatment difference in RCTs.^{39} However, the quality-effects model does not relate the quality score directly to the effect size; such a correction is possible only if we know the effects of study imperfections on the outcome measures. Unfortunately, for many biases the precise effects will not be known and hence cannot be corrected for.^{40} What can be done (and what is done in our quality-effects model) is to redistribute study weights by quality, so that the effect estimates more likely to be accurate receive relatively more of the redistributed weight than weighting by precision alone would give them.

The decision to use a random-effects model is often based on a simple test of homogeneity^{41} of the studies involved. It has even been suggested that a random-effects model should be routinely adopted because of the demonstration that, in the presence of even slight between-study heterogeneity, the fixed-effects model results in inferences that substantially underestimate the variation in the data and in the parameter estimates.^{42} Nevertheless, despite this widespread perception, it is now understood that the choice of fixed-effects or random-effects meta-analysis should not be made on the basis of perceived heterogeneity but on the basis of purpose.^{4} It is always valuable to perform a fixed-effects meta-analysis because this tests the null hypothesis that treatments were identical in all trials.^{4} If this null is rejected, then the alternative hypothesis may be asserted, ie, that there is at least 1 trial in which the treatments differed. In other words, the random-effects analysis works as a check on the robustness of conclusions from a fixed-effects model to failure in the assumption of homogeneity^{43}; to go beyond this causal “finding” requires strong assumptions.^{4} If a random variable is inserted to inflate the variance based on heterogeneity, it is not clear what aspect of between-trial differences is being assessed. This approach fails to take into account quality differences among the individual studies. The strength of our quality-effects meta-analysis is that it allows available methodologic evidence to influence subjective random probability.

For a quality-effects meta-analysis, a reproducible and effective scheme of quality assessment is required. The scheme we used in our illustration was developed, in part, by the Delphi method,^{44} in which 206 items associated with study quality were reduced to 9 by means of the Delphi consensus technique. The final set of items assesses 3 dimensions of the quality of studies (internal validity, external validity, and statistical analysis), and focuses on clinical trials. Compared with assessment of randomized clinical trials, the tools for quality assessment of observational designs in systematic reviews are far less well developed.^{45} The feasibility of creating 1 quality checklist to apply to various study designs has been explored.^{46} Research has gone into developing an instrument to measure the methodologic quality of observational studies,^{47} and a scale to assess the quality of observational studies in meta-analyses.^{48} Nevertheless, there is no consensus on how to synthesize information about quality from a range of study designs within a systematic review.

This paper focuses on a statistical model for incorporation of quality information into a meta-analysis, rather than development of a quality score. There are many quality scores available; Moher et al^{49} reported at least 25 quality assessment scores by 1995, and more have been proposed since. In theory, any quality score can be used with our quality-effects meta-analytic approach. Our statistical method makes use of *Q _{i}*, and any score can be converted to *Q _{i}* simply by dividing the measured score by its maximum possible score. The scheme we suggest in Table 1 is based on a combination of the Newcastle-Ottawa quality assessment scale for observational studies and the Delphi model for experimental studies. Adoption of the quality-effects meta-analysis should encourage further development of such scoring systems.
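The conversion of any raw quality score to a probability-like value is a single normalization step, sketched below with invented checklist scores:

```python
# Invented raw scores from a 9-item quality checklist (maximum score = 9)
raw_scores = [7, 9, 4.5]
MAX_SCORE = 9

# Q_i = measured score / maximum possible score, a value in [0, 1]
Q = [s / MAX_SCORE for s in raw_scores]
print(Q)
```

The resulting values can be fed directly into the quality-effects weighting described in the Methodology section.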

## ACKNOWLEDGMENTS

We thank Leon Bax of Kitasato University for agreeing to make this method available in the next update of the MIX program, comprehensive free software for meta-analysis of causal research data, available on the web at http://www.mix-for-meta-analysis.info.

## REFERENCES

*BMJ*. 1994;309:789–792.

*Stat Med*. 1987;6:351–360.

*J Clin Epidemiol*. 2001;54:651–654.

*Stat Med*. 2007;26:1417–1430.

*J Eval Clin Pract*. 1995;1:29–36.

*J Clin Epidemiol*. 1992;45:255–265.

*J Clin Epidemiol*. 1998;51:801–807.

*Clin Med Res*. 2007;5:87–90.

*Ann Hum Genet*. 1955;19:251–253.

*Control Clin Trials*. 1986;7:177–188.

*Stat Med*. 1999;18:2135–2145.

*Cancer*. 1986;58:1378–1386.

*J Clin Epidemiol*. 1991;44:127–139.

*Stat Med*. 1995;14:531–544.

*Eur J Nucl Med*. 1987;12:500–502.

*J Nucl Med*. 1991;32:252–254.

*Cancer*. 1996;77:2574–2580.

*Wiad Lek*. 2001;54(suppl 1):278–288.

*J Clin Endocrinol Metab*. 2004;89:1666–1673.

*World J Nuclear Med*. 2004;3:S36.

*Nucl Med Commun*. 2004;25:1077–1081.

*Clin Nucl Med*. 1985;10:791–795.

*Clin Endocrinol (Oxf)*. 2000;52:765–773.

*Eur J Nucl Med Mol Imaging*. 2004;31:499–506.

*Radiother Oncol*. 1998;47:325–330.

*Report of the 24th Annual Meeting of the European Thyroid Association*. Darmstadt, Germany: Merck KgaA, No. 6, Thyroid International; 1997.

*Int J Radiat Oncol Biol Phys*. 2004;59:1330–1336.

*J Nucl Med*. 1992;33:1132–1136.

*Br J Radiol*. 1994;67:1127–1131.

*Taiwan Yi Xue Hui Za Zhi*. 1987;86:524–528.

*Br J Radiol*. 1998;71:307–313.

*J Nucl Med*. 1982;23:483–489.

*Am J Med*. 1976;61:52–58.

*Am J Clin Oncol*. 1998;21:77–81.

*Ann Intern Med*. 1982;96:51–53.

*J Clin Endocrinol Metab*. 2007;92:28–38.

*Am J Epidemiol*. 1999;150:469–475.

*Ann Intern Med*. 2001;135:982–989.

*Control Clin Trials*. 1990;11:339–352.

*Stat Med*. 2002;21:1503–1511.

*J R Stat Soc*. 1937;4:102–118.

*Stat Med*. 2001;20:825–840.

*Stat Med*. 1998;17:841–856.

*J Clin Epidemiol*. 1998;51:1235–1241.

*Health Technol Assess*. 2003;7:iii-x,1-173.

*J Epidemiol Community Health*. 1998;52:377–384.

*ANZ J Surg*. 2003;73:712–716.

*Control Clin Trials*. 1995;16:62–73.