## Introduction

A systematic review aims to systematically identify, critically appraise, and summarize all relevant studies that match predefined criteria and answer predefined questions.^{1–5} The most common type of systematic review is that assessing the effectiveness of an intervention or therapy.

Conducting a systematic review addressing a question about the effectiveness of an intervention or therapy is a complex research process. In this article, we discuss and provide guidance for some of the common methodological issues that arise when conducting systematic reviews and meta-analyses of effectiveness data.

## Inclusion of quasi-experimental and observational studies

Evidence about effects of interventions may come from three main categories of studies: experimental, quasi-experimental, and observational studies (the latter can be further split into analytical and descriptive studies). Ideally, evidence about the effectiveness of interventions should come from good-quality randomized controlled trials (RCTs) which explore final clinical endpoints such as morbidity, mortality, and quality of life (rather than surrogate endpoints).^{6} However, for many clinical interventions and conditions, RCTs are not available.

There are three equally reasonable approaches to deciding which study designs to include when conducting an effectiveness systematic review.

- Option 1: consider only RCTs and quasi-experimental studies. This is the option that was favored in the past by the Cochrane Collaboration.
- Option 2: if there are good-quality RCTs exploring the interventions, comparators, and outcomes of interest for a systematic review of effectiveness evidence, reviewers may give priority to these RCTs over quasi-experimental or observational studies and include only the RCTs in their review. If there are no (or limited) RCTs available following the search, reviewers may consider quasi-experimental studies for inclusion; if there are no (or limited) RCTs or quasi-experimental studies, reviewers may opt to include observational studies. This is the option that has historically been favored by the Joanna Briggs Institute.
- Option 3: include all study designs (RCTs, quasi-experimental, and observational studies). This inclusive approach is acceptable as it allows examination of the totality of empirical evidence and may provide invaluable insights regarding the agreement or disagreement of the results from different study designs.

In any case, the approach to be taken should be detailed *a priori* in the systematic review protocol. Wherever feasible, we prefer and suggest reviewers consider option 3, the most inclusive approach.

It is important to note that the issues related to the agreement or disagreement of results from experimental and observational studies are complex. Empirical research has found that sometimes the results of RCTs contradict results from observational studies.^{6} However, meta-analyses based on observational studies can produce estimates of effect that are similar to those from meta-analyses based on RCTs.^{7}

## Inclusion or exclusion of studies with risk of bias

Evidence about the effects of interventions may come from studies with diverse risk of bias. There are two equally reasonable approaches when conducting an effectiveness systematic review.

- Option 1: include only studies with low or moderate risk of bias, and exclude all studies considered at high risk of bias. In this case, reviewers have to provide in the review protocol clear and explicit justification of how the risk of bias will be ascertained, what represents low, moderate, and high risk of bias, and whether any 'cutoff' scores will be used.
- Option 2: include all studies regardless of their risk of bias and explicitly consider the risk of bias of different studies during data synthesis, presentation of the results, conclusions, and implications.

Reviewers should be aware that statistical tools are available for incorporating risk of bias into data synthesis; we recommend two such approaches, the quality effects model proposed by Doi *et al.*^{8} and the bias adjustment approach proposed by Thompson *et al.*^{9} However, the issue of incorporating risk of bias in the conduct of systematic reviews is complex.^{10} In any case, the approach to be taken should be detailed *a priori* in the systematic review protocol.

## Issues related to cross-over trials, pre-post studies, and cluster randomized trials

Some systematic reviewers may not be aware that crossover trials, pre-post studies, and cluster randomized trials have specific characteristics, arising from both their design and statistical analyses, that should be carefully considered whenever these study designs are included in systematic reviews and meta-analyses. We recommend reviewers consult appropriate references regarding the design and analysis of these types of studies.^{11–15} When including these types of studies in a systematic review, it is essential that reviewers consider the existing statistical guidance regarding meta-analysis of these types of studies.^{16–22}

## Use and interpretation of effect sizes

Effect sizes refer to quantitative indicators of the direction and magnitude of the effect of the intervention on outcomes. Despite the vast amount of information available on interpreting effect sizes, we have found that some systematic reviewers remain confused about the differences between risk (probability) and odds, and between relative risk (RR) and odds ratio (OR), and use incorrect narrative descriptions for these effect sizes. These issues are discussed further below; additional resources such as the *Users' Guides to the Medical Literature* and the *Tips for Learners of Evidence-Based Medicine* are also useful guides for systematic review authors.^{23,24}

Common effect sizes reported in effectiveness systematic reviews are the OR, RR and risk difference (RD). Risk is defined as the probability that an event will occur; RR (also known as the risk ratio) is the ratio of risk in one group (e.g., intervention group) divided by the risk in another group (e.g., control group). Risk difference is defined as the difference between the risk in one group and the risk in the other group. Odds are the ratio of the probabilities of the two possible states of a binary variable, and an OR is the ratio of the odds for a binary variable in two groups of patients. If we consider the probabilities of the outcome being present and the probabilities of the outcome being absent (i.e., 1 minus the probability of the outcome present) in an intervention group and a control group then:

- The odds of the outcome being present in the intervention group is the probability of the outcome being present in the intervention group divided by 1 minus the probability of the outcome being present in the intervention group.
- The odds of the outcome being present in the control group is the probability of the outcome being present in the control group divided by 1 minus the probability of the outcome being present in the control group.
- The OR of the outcome being present is the ratio of the odds of the outcome being present in the intervention group and the odds of the outcome being present in the control group.
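The definitions above can be sketched in a few lines of code. The event counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Illustrative computation of risk, odds, relative risk (RR), risk difference
# (RD), and odds ratio (OR) from a hypothetical 2x2 table.

def risk(events, total):
    """Probability of the event: events / total."""
    return events / total

def odds(p):
    """Odds corresponding to a probability p: p / (1 - p)."""
    return p / (1 - p)

# Hypothetical trial: 40/100 events in the intervention group, 60/100 in control.
p_int = risk(40, 100)   # 0.40
p_ctl = risk(60, 100)   # 0.60

rr = p_int / p_ctl                  # relative risk
rd = p_int - p_ctl                  # risk difference
or_ = odds(p_int) / odds(p_ctl)     # odds ratio

print(f"RR = {rr:.2f}")   # RR = 0.67
print(f"RD = {rd:.2f}")   # RD = -0.20
print(f"OR = {or_:.2f}")  # OR = 0.44
```

Note that with a common outcome like this one, the OR (0.44) is more extreme than the RR (0.67), which is one reason the two must not be conflated.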

## Absolute and relative effect sizes for meta-analysis of binary data

Relative effect sizes such as the RR can be misleading in the absence of appropriate contextual information regarding the risk (probability) of the outcome in the absence of any intervention, and the absolute difference in the risk (probability) of the outcome between those receiving an intervention and those receiving a different intervention. The RR only indicates the risk (probability) in the intervention group relative to the control group; for example, an RR of 0.5 indicates that the risk (probability) is halved in the intervention group. The RD helps put the RR into context. For example, an RR of 0.5 may mean the risk (probability) has been reduced from 80% in the control group to 40% in the experimental group, corresponding to a considerable RD of 40%. Alternatively, an RR of 0.5 could mean that the risk has been reduced from 0.8% in the control group to 0.4% in the experimental group, an RD of only 0.4%. Therefore, if reviewers choose to use the RR, ideally they should also report the absolute RD effect size. Reviewers should provide correct and clear interpretation of the computed effect sizes (Tables 1 and 2). It is important to note that the RR is not symmetrical: the RR for a positive outcome can be very different from the RR for its complementary negative outcome.^{25} Reviewers should be aware of this issue as it can affect the presentation and interpretation of the results.
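The two points above (identical RRs can hide very different RDs, and the RR is not symmetrical) can be verified numerically. The probabilities below echo the hypothetical figures in the text.

```python
# Same RR, very different RD; and the asymmetry of RR under outcome reversal.

def rr(p1, p0):
    """Relative risk: risk in group 1 divided by risk in group 0."""
    return p1 / p0

def rd(p1, p0):
    """Risk difference: risk in group 1 minus risk in group 0."""
    return p1 - p0

# Scenario A: common outcome (80% -> 40%). Scenario B: rare outcome (0.8% -> 0.4%).
print(rr(0.40, 0.80), rd(0.40, 0.80))      # RR 0.5, RD -0.40
print(rr(0.004, 0.008), rd(0.004, 0.008))  # RR 0.5, RD -0.004

# Asymmetry: RR for "outcome present" vs RR for "outcome absent" in scenario A.
print(rr(0.40, 0.80))              # 0.5
print(rr(1 - 0.40, 1 - 0.80))      # 0.6/0.2 = 3.0, not simply 1/0.5
```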

## Odds ratio: preferred effect size for the computation phase of meta-analysis of binary data

Often the terms 'odds' and 'risk' are used interchangeably; however, the OR and RR are calculated in different ways, and it is important to understand this when interpreting the results of a meta-analysis. Fleiss^{26} discussed the statistical properties of the OR and concluded that the OR is the preferred effect size for the computation phase of the meta-analysis of binary data, regardless of the design of the included studies. Fleiss and Berlin^{27} likewise recommended the OR as the preferred effect size for this purpose, a view shared by others.^{28–30} We agree that the OR should be used, whenever possible, as the preferred effect size for the computation phase of the meta-analysis of binary data. However, there is no universal agreement on this issue, and others prefer the RR over the OR.^{31–33}

Reviewers should be aware that OR is not easily interpretable, and they should be mindful of providing accurate and explicit interpretation of the ORs computed in meta-analysis.
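One source of misinterpretation is that the OR approximates the RR only when the outcome is rare, and diverges from it as the outcome becomes common. The illustrative probabilities below are hypothetical.

```python
# Sketch: OR versus RR as the control-group risk increases, holding RR at 0.5.
# Reading an OR as if it were an RR overstates the effect for common outcomes.

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

for p_ctl in (0.01, 0.10, 0.50):
    p_int = p_ctl / 2                      # intervention halves the risk: RR = 0.5
    rr = p_int / p_ctl
    or_ = odds(p_int) / odds(p_ctl)
    print(f"control risk {p_ctl:.2f}: RR = {rr:.2f}, OR = {or_:.2f}")
```

With a 1% control risk the OR is about 0.50, close to the RR; with a 50% control risk the OR falls to about 0.33, noticeably more extreme than the RR of 0.5.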

In Tables 1 and 2, we provide examples of computation and interpretation of probability (risk), odds, RR, RD, and OR (examples for RR and OR >1 and examples for RR and OR <1).

## Reporting the results in natural (clinical) units for meta-analysis of continuous data

There are several 'difference' effect sizes for continuous data, such as the weighted mean difference (WMD) and the standardized mean difference (SMD).

The WMD is used in meta-analysis of continuous data if all studies included in the meta-analysis measured the same outcome on the same measurement instrument. For meta-analysis computation, the difference in means from each included study is used, and the results are expressed in the natural (clinical) units of the measurement instrument. For example, the WMD may be used if all studies included in a meta-analysis measured blood pressure in mmHg, or if all studies measured intensity of pain on the same scale of measurement from zero to 100 units.

The SMD is used in the meta-analysis of continuous data if the studies included in the analysis measured the same outcome but on different measurement instruments. The results are expressed in units of standard deviation. To facilitate interpretation, reviewers should convert the results into natural (clinical) units by multiplying the results expressed in units of standard deviation by the standard deviation of the scores on a known or commonly used measurement instrument. Results will ideally be reported in both formats: in units of standard deviation and in the natural (clinical) units of one measurement instrument. For example, there are diverse scales used to measure pain. Suppose that in one study pain was measured on a scale from zero to 10; in another study, on a scale from zero to 40; and in yet another study, on a scale from zero to 100. The SMD makes it possible to report results from all these studies in a standardized form, in units of standard deviation, and later to convert the results back to clinical units using one or each of these scales.
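The back-conversion described above is a single multiplication. The pooled SMD and the instrument standard deviation below are hypothetical values chosen for illustration.

```python
# Converting a pooled SMD back to natural (clinical) units by multiplying it by
# the standard deviation of scores on a familiar instrument. Both numbers here
# are hypothetical.

pooled_smd = -0.50       # pooled effect in SD units (intervention lowers pain)
sd_pain_0_100 = 20.0     # assumed SD of scores on a 0-100 pain scale

mean_diff_natural = pooled_smd * sd_pain_0_100
print(f"{mean_diff_natural:.0f} points on the 0-100 pain scale")  # -10 points
```

Reporting "-10 points on a 0-100 pain scale" alongside "-0.50 standard deviations" gives readers a clinically interpretable anchor.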

If WMD is used, reviewers should provide explanations regarding the interpretation of the results expressed in units used for the measurement instrument. The minimum score and the maximum score that are possible on the measurement instrument should be specified together with their interpretation. Suppose that intensity of pain was measured on a scale of measurement from zero to 100 units. In this example, the minimum score is zero, and the maximum score is 100. A score of zero may be interpreted as absence of pain; a score of 100 may be interpreted as unbearable pain.

In addition, reviewers should provide explanations regarding the interpretation of positive and negative scores. For example, sometimes, positive scores are used to express specific characteristics or degrees of these characteristics such as medication compliance, or existence of self-help skills, or positive patient behaviors, and negative scores are used for lack of medication compliance, absence of skills, or negative behaviors.

In Table 3, we provide a concise summary of the statistical properties of the effect sizes considered in this article.

## Meta-analysis: objectives of meta-analysis

Essentially, in a systematic review of effectiveness, there are two synthesis options: meta-analysis and narrative summary or synthesis. Meta-analysis refers to the statistical synthesis of quantitative results from two or more studies. Many reviewers appear to adopt a narrow approach to meta-analysis, focusing exclusively on calculating estimates of effects. However, reviewers should be aware that there are different, legitimate objectives for a meta-analysis: to improve statistical power to detect a treatment effect, provide the closest estimate of an unknown real effect, identify subsets of studies (subgroups) associated with a beneficial effect, and explore if there are differences in the size or direction of the treatment effect associated with study-specific variables. Reviewers should explicitly consider and state the objective(s) of meta-analysis for their review.

## Clinical and methodological heterogeneity

Meta-analysis is only appropriate when studies are considered similar enough from a clinical and methodological point of view (homogeneous studies). If studies are heterogeneous from a clinical (i.e., population, intervention, comparator, and outcome) or methodological (i.e., study design and risk of bias) point of view, then it is uncertain whether it is appropriate to synthesize the respective studies with meta-analysis. The judgment that studies are homogeneous enough, and that it is appropriate to combine them statistically, should be based on an understanding of the review question, the characteristics of the studies, and the interpretability of the results. The decision should not be based solely on statistical heterogeneity. Studies that are conceptually similar from a clinical point of view (but not necessarily identical) with regard to the participants, interventions, comparators, settings, outcomes, study design, and risk of bias may be combined in meta-analysis. Where there are clinically similar but methodologically dissimilar studies (such as in study design or risk of bias), subgroup analyses may be useful to determine whether these differences have an impact on the overall effect size.

## Meta-analysis: statistical models (fixed-effects and random-effects models)

Fixed-effects and random-effects models are the most commonly employed statistical models for meta-analysis. In Table 4, we provide a concise summary of comparative characteristics of the fixed-effects and random-effects model. In Fig. 1, we provide a decision flow chart for the selection of the statistical model for meta-analysis. The decision to use one statistical model or another is complex and often subjective; however, there are criteria that can guide decisions about which model to use.

As the first criterion, reviewers should consider the goal of statistical inference: is there an intention to generalize the results beyond the included studies (generalization inference)? If the answer is 'yes', then the random-effects model is the appropriate statistical model for meta-analysis. Is the intention to apply the results only to the included studies (no generalization)? If the answer is 'yes', then the fixed-effects model is appropriate. As we assume that reviewers usually want to generalize their conclusions beyond the studies included in a meta-analysis, we suggest that the default model for meta-analysis in reviews should be the random-effects model. However, all decision criteria should be considered, and the statistical model used should be appropriate from this multicriteria perspective. A second criterion refers to the number of studies included in the meta-analysis. The fixed-effects model is the appropriate model when the number of studies is small; random-effects models are appropriate when the number of studies is large enough to support generalization inferences beyond the included studies. It has been suggested that the fixed-effects model should be used when the number of studies included in a meta-analysis is less than five.^{34} A third criterion refers to statistical heterogeneity. The fixed-effects model assumes that all studies included in a meta-analysis are estimating a single true underlying effect. If there is statistical heterogeneity among the effect sizes, then the fixed-effects model is not appropriate; the random-effects model should be considered when it cannot be assumed that true homogeneity exists.

Similarly, a fourth criterion refers to the likelihood of a common effect size. The fixed-effects model assumes that there is one common effect; the random-effects model assumes that each study estimates a different underlying true effect and that these effects follow a distribution (usually a normal distribution). The fixed-effects model should be used only if it is reasonable to assume that all studies share the same common effect. If it is not reasonable to assume a common effect size, then the random-effects model should be used. If the studies are heterogeneous from a clinical and methodological point of view, it is unreasonable to assume that they share a common effect. Another criterion refers to the heterogeneity of the sample sizes of the included studies. The fixed-effects model is preferable when one study is much larger (and usually presumed more trustworthy) than one or more smaller studies.^{34}
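The difference between the two models can be seen in how they weight studies. The sketch below pools hypothetical log odds ratios with inverse-variance (fixed-effect) weights and then with DerSimonian-Laird random-effects weights; the effect estimates and standard errors are invented for illustration.

```python
# Minimal sketch: inverse-variance fixed-effect pooling versus DerSimonian-Laird
# random-effects pooling. Hypothetical log ORs and standard errors from four
# deliberately heterogeneous studies.
import math

effects = [-0.6, -0.1, -0.8, 0.2]   # hypothetical log ORs
ses     = [0.15, 0.20, 0.25, 0.30]  # their standard errors

# Fixed-effect: weights are inverse variances.
w = [1 / se**2 for se in ses]
fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
df = len(effects) - 1
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects: weights also absorb the between-study variance.
w_re = [1 / (se**2 + tau2) for se in ses]
random_eff = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
se_re = math.sqrt(1 / sum(w_re))

print(f"fixed-effect estimate:   {fixed:.3f}")
print(f"tau^2 (DL):              {tau2:.3f}")
print(f"random-effects estimate: {random_eff:.3f} (SE {se_re:.3f})")
```

With heterogeneous studies, tau^2 is positive, the random-effects weights are more even across studies, and the pooled estimate shifts relative to the fixed-effect result, with a wider standard error.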

The use of the fixed-effects and random-effects models presented here is based on a careful examination of the international literature^{34–52} and represents an accurate reflection of the classical or traditional view of the two meta-analysis models. It is worth acknowledging, however, that this traditional approach to meta-analysis has been critiqued by statisticians who suggest that it is flawed and should be replaced.^{53–65}

The traditional approach to meta-analysis as described in this article should be viewed as an acceptable simplification for novice reviewers without sophisticated statistical skills. The complexity of study designs, analysis approaches, and considerations related to risk of bias and the influence of moderators and mediators encountered in real statistical practice may require more complex models for meta-analysis, some of which include mixed-effects models, hierarchical models, and factorial models.^{37,42,44,45,47,49,66–70} These approaches are complex and require sophisticated statistical skills. However, some of the newer approaches, including the inverse variance heterogeneity model and the quality-effects model proposed by Doi *et al.*,^{55,56} may be more accessible to novice reviewers and should be used if possible.

## Statistical significance, practical significance, and clinical significance

Many review authors exclusively focus on statistical significance of the results. We recommend that the significance of the results should be considered from three different perspectives: statistical significance, practical significance, and clinical significance. Different authors use the terms practical significance and clinical significance with different meanings; our use of the terms is summarized in Table 5. Details with regard to these types of significance and a summary of international literature regarding guidance for interpretation of results (what is considered a significant OR, RR, RD, etc., from a practical point of view) are provided by Tufanaru *et al.*^{71}

## Conclusion

Conducting a systematic review of effectiveness can be a difficult undertaking. The commentary presented in this article aims to assist reviewers and, it is hoped, to improve the quality of the systematic review and meta-analysis process and the presentation and interpretation of the results.

## Acknowledgements

The authors report no conflicts of interest.

## References

*Am J Nurs* 2014; 114:47–55.

*Am J Nurs* 2014; 114:53–56.

*Am J Nurs* 2014; 114:49–56.

*Am J Nurs* 2014; 114:47–52.

*Am J Nurs* 2014; 114:49–54.

*Am J Epidemiol* 2007; 166:1203–1209.

*Contemp Clin Trials* 2015; May 21. pii: S1551-7144(15)30008-2. doi: 10.1016/j.cct.2015.05.010. [Epub ahead of print].

*Int J Epidemiol* 2011; 40:765–777.

*J Epidemiol Community Health* 2015; 69:189–195.

*Int J Epidemiol* 2002; 31:140–149.

*Psychol Methods* 2002; 7:105–125.

*Stat Med* 2002; 21:2131–2144.

*Stat Med* 2002; 21:2145–2159.

*Stat Med* 2002; 21:2161–2173.

*Stat Med* 2002; 21:2971–2980.

*Stat Methods Med Res* 2001; 10:325–338.

*CMAJ* 2004; 171:353–358.

*J Clin Epidemiol* 2014; 67:364–367.

*Evidence-Based Med* 1998; 3:71.

*Evidence-Based Med* 1998; 3:71.

*Evidence-Based Med* 1998; 3:71.

*Evidence-Based Med* 1998; 3:71–72.

*Arch Pediatr Adolesc Med* 2009; 163:438–445.

*Stata J* 2009; 9:175–196.

*Stat Med* 1999; 18:321–359.

*J Educ Behav Stat* 1992; 17:279–296.

*Res Synth Methods* 2010; 1:97–111.

*Value Health* 2014; 17:629–633.

*Stat Med* 2015; 34:2116–2127.

*Contemp Clin Trials* 2015; May 21. pii: S1551-7144(15)30007-0. doi: 10.1016/j.cct.2015.05.009. [Epub ahead of print].

*Contemp Clin Trials* 2015; May 21. pii: S1551-7144(15)30008-2. doi: 10.1016/j.cct.2015.05.010. [Epub ahead of print].

*J Eval Clin Pract* 2013; 19:653–657.

*Contemp Clin Trials* 2011; 32:288–298.

*J Clin Epidemiol* 2011; 64:119–123.

*Epidemiology* 2008; 19:94–100.

*Stat Med* 2007; 26:1417–1430.

*Ann Intern Med* 2014; 160:267–270.

*Stat Med* 2011; 30:3304–3312.

*Stat Med* 2007; 26:4531–4543.

*Stat Med* 2001; 20:825–840.

*Stat Med* 1999; 18:643–654.

*Stat Med* 1999; 18:2343–2351.

*Stat Med* 2000; 19:3417–3432.

*Behav Res Methods Instrum Comput* 2001; 33:102–107.

*Biometrics* 1996; 52:536–544.

**Keywords:**

effect size; effectiveness; fixed effects; meta-analysis; random effects; systematic review