Cervical spondylotic myelopathy (CSM) is a complex, multifaceted disorder characterized by progressive compression of the spinal cord due to extradural spinal degenerative pathology. Many aspects of the pathophysiologic mechanisms, diagnosis, progression, and treatment of CSM are not well understood, and the overall quality of evidence for some questions has not been well described using formal systematic review methods. Given this critical knowledge gap, the AOSpine North America Clinical Research Network convened an expert, multidisciplinary panel to critically evaluate the literature, undertake preliminary analyses related to the impact of spinal deformity on CSM, and conduct an international survey of clinical perspectives on the management of mild CSM. The following summarizes the methods used to address these objectives.
TOPICS, CLINICAL QUESTIONS, AND ARTICLE TYPES
Topics were developed and agreed upon by an expert panel. After discussion, the ones considered the most clinically relevant in the field of CSM were translated into key clinical questions. These clinical questions were assigned to expert author groups to refine as necessary and to specify the inclusion and exclusion criteria that would best address the clinical questions considering both clinical relevancy and the potential highest strength of evidence. Once the key clinical questions were finalized and the final inclusion/exclusion criteria were agreed upon, systematic literature searches were conducted.
A number of different article types and corresponding methods were needed to address the clinical questions and objectives proposed by the focus issue editors and clinical experts. This focus issue includes 15 unique articles discussing important issues surrounding the diagnosis, treatment, prognosis, and outcome measurement of CSM. These areas were addressed using several different study designs including a research protocol (n = 1), primary research studies (n = 2), narrative literature reviews (n = 2), systematic reviews (n = 7), and articles with a combination of a systematic review component with either a narrative section (n = 2) or provider survey (n = 1). In general, the strength of evidence ratings ranged from insufficient to moderate. Summary statements or clinical recommendations were made according to available evidence and study type: 16 summary statements were made across 8 articles, and 17 clinical recommendations were made across 9 articles. Three articles had both summary statements and clinical recommendations, 5 had summary statements only, 6 had clinical recommendations only, and 1 (the research protocol) was not amenable to either.
This article describes the primary methods used to address the key questions and objectives for these articles. Readers should refer to the individual articles for specific methodological details.
METHODS FOR PRIMARY RESEARCH STUDIES
Two primary research studies are included in this focus issue. Smith et al1 examined the association between cervical alignment, sagittal balance, and myelopathy, which has not been well characterized to date. New methods to determine in vivo cervical spinal cord length, surface area, and volume in relation to vertebral column alignment were developed using a custom version of Surgimap Spine (Nemaris Inc., New York, NY) and a dedicated Matlab (MathWorks Inc., Natick, MA) function. These methods were then applied to the preoperative images of patients from the AOSpine North America myelopathy study, and the resulting spinal cord dimensions were correlated with preoperative radiographical alignment, myelopathy severity, and disability scores. See the article by Smith et al1 for further details of the methodology.
As part of a narrative review of cervical spinal alignment, assessment techniques, and their potential importance in myelopathy, Ames et al2 reported results from a new study that assessed spinopelvic correlations from the cervical spine to the pelvis to determine the regional postural adaptations needed to maintain optimal sagittal position; maintenance of horizontal gaze was also reported within the narrative review. Fifty-five asymptomatic volunteers (27 men, 28 women; mean age, 45 yr) were evaluated by full-length standing radiographs using a standardized protocol. All radiographs were analyzed using validated image analysis software for C2–C7 cervical lordosis, T4–T12 thoracic kyphosis, L1–S1 lumbar lordosis, pelvic tilt, sagittal vertical axis, pelvic incidence, and pelvic incidence–lumbar lordosis mismatch. Statistical analysis was performed for the whole study population and after stratifying the population by age (20–39, 40–59, and ≥60 yr). See the article by Ames et al2 for further details.
METHODS FOR CLINICIAN SURVEY
In addition to a systematic review portion, 1 article included the results from an international survey. An English-language, 18-question survey was designed to address clinical and/or radiographical features that may influence treatment decisions in patients with asymptomatic cervical spinal cord compression/canal stenosis. The questions were sent to the authors of this focus issue for testing, and modifications were made accordingly. AOSpine International members were invited via e-mail to participate in the online survey, which was available on SurveyMonkey (https://www.surveymonkey.com) for 21 days, with 2 reminder e-mails sent 10 days apart. All surveys were completed anonymously. Descriptive statistics were used to summarize the data.
METHODS FOR NARRATIVE REVIEWS
Narrative reviews, often called literature reviews, are summaries of selected literature that describe and discuss the current science on a particular topic or theme from a contextual or theoretical perspective. They cover a wider range of issues for a given topic than is possible in a systematic review. These articles were guided by general questions or objectives that were broad and qualitative in nature. Unlike systematic reviews, they do not describe or follow a specified methodology for searching literature sources or have set inclusion/exclusion criteria that would permit replication. Nonetheless, limited searches were conducted to assemble appropriate literature for the individual topics. Narrative reviews do not lead to evidence-based clinical recommendations, so none were made. These reviews did not undergo rigorous individual article risk of bias evaluation or overall strength of evidence assessment as was done for systematic reviews.
METHODS FOR SYSTEMATIC REVIEWS
Literature Search and Article Selection
Articles for inclusion were identified in 3 stages (Figure 1). In the first 2 stages, a systematic search of the PubMed/MEDLINE and Cochrane Central Register of Controlled Trials electronic databases was conducted to identify studies of CSM on the basis of the clinical questions identified in each systematic review. The search period ended September, October, or November 2012, depending on the systematic review. Searches were conducted using standard MeSH terms (controlled vocabulary) as well as specific free-text terms and combinations of terms related to the clinical conditions. Bibliographies of key articles were hand searched to ensure that each topic was comprehensively examined. All possible relevant articles were screened using titles and abstracts by a single individual. Articles that did not meet a set of a priori retrieval criteria for each topic area were excluded. In the third stage, the full-text articles of the remaining citations were retrieved and individually examined independently by 2 individuals. Again, articles that did not meet a set of a priori inclusion criteria for each topic area were excluded. Any disagreement between screeners was resolved through discussion. The articles selected form the evidence base for the individual reports.
Most systematic reviews attempted to include articles on adults with CSM and, depending on the article focus, may have also included persons with ossification of the posterior longitudinal ligament. Inclusion of an article depended on whether the study question was descriptive, one of therapy/treatment (whether one intervention or technology is superior to another), or one of prognosis that identified risk of CSM or prognostic factors for the development of CSM (Figure 2A, B). In general, the focus for article inclusion and synthesis was on studies with the least risk of bias with a focus on comparative studies. For systematic reviews evaluating the treatment of CSM, we sought to include randomized trials or comparative cohort studies but included case series if no comparative studies evaluating the treatment of CSM were identified. For questions on the comparative safety of an intervention or technology, randomized controlled trials or comparative cohort studies were sought and included. Comparative cohort studies were defined as those clinical studies comparing the treatment or technology of interest with another treatment or technology in the same underlying patient population. For systematic reviews evaluating risks and risk factors for the development of or treatment outcomes for CSM, we sought to include articles from the highest quality observational studies (e.g., cohort or case control studies) that included raw data where we could calculate effect estimates or results from multivariate analyses so that we could report adjusted effect estimates for individual risk factors.
In general, systematic reviews excluded articles including subjects with pathologies other than CSM (e.g., tumors, infection). Other exclusions included reviews, editorials, case reports, articles not written in English, cadaver studies, animal studies, and laboratory simulations. Individual articles may have had other specific exclusion criteria that are specified in their methods.
Investigators extracted the following data from included clinical articles into evidence tables: study design, study population characteristics, study interventions for therapeutic studies, prognostic factors for studies of prognosis, follow-up time, follow-up rates, and rates or prevalence of CSM, depending on the focus of the article. An attempt was made to reconcile conflicting information among multiple articles presenting the same data.
The analysis methods for each individual systematic review are distinct and, therefore, not described here. A detailed explanation of the analysis methods, from descriptive statistics to meta-analysis, is provided in the methods section of each systematic review.
Rating the Evidence of Each Article
For reviews that were more descriptive in nature, the class of evidence of individual articles may or may not have been rated depending on the purpose. For formal systematic reviews, articles selected for inclusion were given a class-of-evidence rating. The method used for assessing the quality of evidence of individual articles incorporates aspects of the rating scheme developed by the Oxford Centre for Evidence-Based Medicine3 and used with modification by The Journal of Bone & Joint Surgery, American Volume4 and additional modifications to delineate criteria associated with methodological quality and risk of bias on the basis of recommendations made by the Agency for Healthcare Research and Quality (AHRQ).5,6 The appraisal system used in this focus issue accounts for features of methodological quality and important sources of bias by combining epidemiologic principles with characteristics of study design to determine the class of evidence.
Class-of-evidence ratings were assigned to each included article independently by 2 reviewers on the basis of preset criteria that resulted in an evidence quality rating (class of evidence I, II, III, or IV). Disagreements were resolved through discussion. Criteria used to assess an article's evidence varied according to the study type (therapy or prognosis) (Tables 1 and 2). Treatment (therapeutic) studies were assessed on the basis of the study design, how patients were selected for treatment (concealed allocation vs. nonconcealed), if the analysis included intention to treat, the presence of blinding during outcome assessment, whether cointerventions were applied equally, the completeness of follow-up, whether the study sample size was large enough to demonstrate statistical significance, and whether the investigators made an attempt at controlling potential confounding. Prognostic studies were classified by study design: a cohort study, a case-control study, or a cross-sectional study. Cohort studies were assessed on the basis of whether patients were at a similar point in the course of their disease or treatment, the completeness of follow-up, if patients were followed long enough for the outcome of interest to occur, and whether the investigators made an attempt at controlling potential confounding. Assessment of case-control studies included whether the incident cases were from a defined population and included all eligible cases during a specified time period, if controls represented the population from which the cases arose, if exposure preceded the outcome of interest, and whether the investigators accounted for other prognostic factors. 
For cross-sectional studies, we sought to determine whether the study population was a representative sample of the population of interest, whether the exposure preceded the outcome of interest, whether the investigators accounted for other prognostic factors, and, for surveys, whether there was a return rate of at least 80%. Whether or not criteria were met is displayed for each author as illustrated in Table 3 for therapeutic studies and in Table 4 for prognostic studies. All criteria met are marked. A blank for a criterion indicates that the criterion was not met, could not be determined, or was not reported by the author.
One systematic review in this issue7 evaluated the genetics and heritability of CSM and ossification of the posterior longitudinal ligament using gene association studies. To assess the risk of bias in studies included in that systematic review, we evaluated each article against criteria outlined in a 2009 article by Attia et al8 for genetic association studies. Studies were appraised by the following 5 standards. (1) The disease phenotype should be properly defined and accurately recorded by one blind to the genetic information. (2) Authors must have included similar disease and control groups especially in the categories of age, sex, weight, ancestry, or metabolic conditions. (3) The study should provide sufficient information to evaluate measurement accuracy and potential measurement bias. (4) The authors must have stated that they observed Hardy-Weinberg equilibrium in the population. (5) Statistical correction for multiple comparisons, if 3 or more single-nucleotide polymorphisms or haplotypes were examined, should be performed. See the article and its Supplemental Digital Content material for individual article critical appraisal.
Overall Strength of Evidence
After individual article evaluation, the strength of the overall body of evidence with respect to each outcome was determined on the basis of precepts outlined by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group9,10 and recommendations made by the AHRQ.5,6 For all analytical systematic reviews (some descriptive reviews did not require this rigor), qualitative analysis was performed according to the AHRQ-required and additional domains (Table 5).11 In general, risk of bias was evaluated during the individual article evaluation described previously in the section “Rating the Evidence of Each Article.” The initial or “baseline” strength of the overall body of evidence was considered high if the majority of the studies were class I or II and low if the majority of the studies were class III or IV. Criteria for downgrading published evidence 1 or 2 levels included the following:
- Inconsistency of results refers to the degree of similarity in the effect sizes of different studies within an evidence base. If effect sizes indicated the same direction of effect and the range of effect sizes was narrow, an evidence base was judged to be consistent. Single-article evidence bases were judged “consistency unknown (single study)” and downgraded.
- Indirectness of evidence concerns whether the evidence being assessed reflects a single, direct link between the interventions of interest and the ultimate health outcome, that is, whether the most clinically relevant outcome was measured or a surrogate outcome was assessed. Directness also applied to indirect comparisons of treatment when head-to-head comparisons of interest could not be made within individual studies. If proxy respondents were used for surveys, outcomes instruments, or interviews, these data were considered indirect.
- Imprecision of the effect estimates pertains to the degree of certainty surrounding an estimate of effect for a specific outcome. Determination was based in part on whether the estimate of effect reached statistical significance and/or on the width of the confidence interval. When there were only 2 subgroups, the overlap of the confidence intervals of the summary estimates of the 2 groups was considered. No overlap of the confidence intervals indicates statistical significance, but the confidence intervals can overlap to a small degree and the difference can still be statistically significant. Sufficiency of the sample size and rarity of the outcome (number of events) were considered. The extent to which a clinically useful conclusion was possible was also considered.
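The point about overlapping confidence intervals in the imprecision criterion can be illustrated with a small numerical sketch. The estimates and standard errors below are hypothetical values chosen for illustration only, and the calculation assumes two independent groups, a normal approximation, and 95% confidence intervals; it is not an analysis from any article in this focus issue.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value under a normal approximation


def ci95(estimate, se):
    """Return the 95% confidence interval as a (lower, upper) tuple."""
    return (estimate - Z_95 * se, estimate + Z_95 * se)


def intervals_overlap(a, b):
    """True if the two (lower, upper) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


def z_for_difference(est1, se1, est2, se2):
    """z statistic for the difference between two independent estimates."""
    return (est2 - est1) / math.sqrt(se1 ** 2 + se2 ** 2)


# Hypothetical summary estimates for 2 subgroups (illustrative values only).
ci_a = ci95(10.0, 1.0)
ci_b = ci95(13.0, 1.0)

overlap = intervals_overlap(ci_a, ci_b)
z = z_for_difference(10.0, 1.0, 13.0, 1.0)
significant = abs(z) > Z_95

print("CI A:", ci_a, "CI B:", ci_b)
print("Intervals overlap:", overlap)
print("z for difference:", round(z, 3), "significant at 5%:", significant)
```

With these illustrative numbers the two intervals overlap slightly, yet the test of the difference still exceeds the 5% threshold, which is why overlap alone was not treated as proof of a nonsignificant difference.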
For some studies, an additional downgrade for “risk of bias” was applied to trials that had multiple limitations in aspects of study design and execution that would call the internal validity of the trial into question.
Alternatively, the body of evidence could be upgraded 1 or 2 levels on the basis of the following factors: (1) large magnitude of effect or (2) dose-response gradient.
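The baseline, downgrade, and upgrade steps described above can be summarized in a short sketch. This is an illustration under stated assumptions rather than the procedure actually applied by the review teams: the function names are hypothetical, a tie between class I/II and class III/IV studies is treated as "low" (the text does not say how ties were handled), and the result is clamped to the range from "insufficient" to "high".

```python
# Ordered from weakest to strongest, using the rating labels in the text.
LEVELS = ["insufficient", "low", "moderate", "high"]


def baseline_strength(classes):
    """Baseline rating from class-of-evidence values (1-4 for classes I-IV).

    "high" if the majority of studies are class I or II, otherwise "low".
    Assumption: ties count as "low"; no studies means "insufficient".
    """
    if not classes:
        return "insufficient"
    n_class_1_or_2 = sum(1 for c in classes if c in (1, 2))
    return "high" if n_class_1_or_2 > len(classes) / 2 else "low"


def overall_strength(classes, downgrades=0, upgrades=0):
    """Apply the total downgrade and upgrade levels to the baseline rating.

    downgrades: total levels removed (inconsistency, indirectness,
                imprecision, risk of bias).
    upgrades:   total levels added (large magnitude of effect,
                dose-response gradient).
    Assumption: the result is clamped to the available rating range.
    """
    start = LEVELS.index(baseline_strength(classes))
    final = max(0, min(start - downgrades + upgrades, len(LEVELS) - 1))
    return LEVELS[final]


# Hypothetical body of evidence: two class II studies and one class III.
print(overall_strength([2, 2, 3]))                # baseline only
print(overall_strength([2, 2, 3], downgrades=1))  # one-level downgrade
print(overall_strength([3, 4, 4], upgrades=1))    # one-level upgrade
```

In practice each downgrade or upgrade is a judgment made per outcome; the sketch only shows how those judgments move the rating along the scale.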
The final overall strength of the body of literature expresses our confidence that the effect size lies close to the true effect and the extent to which it is thought to be stable on the basis of the adequacy or deficiencies in the body of evidence. An overall strength of “high” means that we are very confident that the true effect lies close to that of the estimated effect. A “moderate” rating means that we are moderately confident in the effect estimate; the true effect is likely to be close to the estimated effect, but there is a possibility that it is substantially different. An overall strength of “low” means that our confidence in the effect estimate is limited: The true effect may be substantially different from the estimate. Finally, a rating of “insufficient” means that we have very little confidence in the effect estimate: the true effect is likely to be substantially different than the estimated effect. In addition, this rating may be used if there is no evidence or it is not possible to estimate an effect.
GRADE does not have a system for rating overall strength of evidence across studies of genetics and heritability. For this study,7 a modified GRADE was used to determine overall strength of evidence on the basis of the scores from the following 3 domains: (1) the amount of evidence, (2) replication, and (3) protection from bias, as suggested by the HuGENet Working Group in the Venice Interim Guidelines,12 which assign an ordered score (A, B, or C) to each domain, with A representing the highest credibility and C representing the lowest credibility of that domain. Details on how the evidence was graded for each domain are summarized in Table 6.
The scores from each domain in this system are combined and an algorithm applied to produce an overall strength of evidence rating of strong, moderate, or weak. A “strong” evidence rating is given if all 3 domains receive an A score (i.e., AAA). An evidence rating of “moderate” is assigned if the ratings are a combination of A and B scores (e.g., ABA, BBA). A rating of “weak” is designated for the evidence if any of the domains receives a C score (e.g., AAC, CBB). To provide consistency among the key questions for this study and among the other publications for this focus issue, we modified the HuGENet guidelines by substituting the GRADE terminology (“high” instead of “strong,” “low” instead of “weak”) and added an “insufficient” evidence category if each of the 3 domains received a score of C (CCC) or if there is only a single study without replication and the sample size is less than 1000 (CCA, CCB, BCA, BCB, BCC) (Table 7). The interpretation of the overall strength of evidence was the same as described previously.
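The scoring algorithm described above can be sketched as a lookup on the 3-letter domain code (amount of evidence, replication, protection from bias, in that order). The function name is hypothetical, and two points are assumptions rather than statements from the text: the listed codes (CCC, CCA, CCB, BCA, BCB, BCC) are treated as sufficient on their own to trigger "insufficient," and a code of BBB is treated as "moderate." Table 7 remains the authoritative mapping.

```python
# Codes listed in the text as "insufficient" under the modified guidelines.
INSUFFICIENT_CODES = {"CCC", "CCA", "CCB", "BCA", "BCB", "BCC"}


def modified_venice_rating(code):
    """Map a 3-letter domain code to a GRADE-style rating.

    code: one letter (A, B, or C) per domain, in the order
          amount of evidence, replication, protection from bias.
    """
    code = code.upper()
    if len(code) != 3 or any(letter not in "ABC" for letter in code):
        raise ValueError("code must be 3 letters drawn from A, B, and C")
    if code in INSUFFICIENT_CODES:
        return "insufficient"
    if "C" in code:
        return "low"       # "weak" in the original HuGENet wording
    if code == "AAA":
        return "high"      # "strong" in the original HuGENet wording
    return "moderate"      # remaining A/B combinations (assumes BBB too)


for example in ["AAA", "ABA", "BBA", "AAC", "CBB", "BCA", "CCC"]:
    print(example, "->", modified_venice_rating(example))
```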
Actionable clinical recommendations were made using the GRADE system, which separates strength of evidence from strength of recommendation. After evaluation of the quality of individual studies, the overall strength of the body of evidence across studies was determined and classified as high, moderate, low, or insufficient, which provides an estimate of the confidence we have in the effect estimates as described previously. After the strength of evidence was evaluated, an actionable recommendation was made and the strength of that recommendation was classified as either “strong” or “weak.” Factors that influence the strength of recommendation include the overall quality of the body of evidence, the balance between desirable and undesirable effects, patient values and preferences, and considerations for resource allocation. Evidence-based clinical recommendations were considered appropriate only for evidence based on systematic review; thus, no clinical recommendations were made on the basis of primary studies, surveys, or narrative literature reviews. Recommendations and their potential implications were discussed at a meeting of primary authors, and an informal modified Delphi process was used to finalize them.
Systematically evaluating the strength of evidence across included articles for a given study question provides a focused, comprehensive, and unbiased assessment of the evidence that allows readers to easily ascertain both the conclusions and the quality of the evidence that supports them. The strength of evidence and its associated conclusions can be used to guide the decisions of clinicians and patients, as well as inform those who may use the evidence to develop clinical guidelines or coverage policies. To this end, the authors have used GRADE in this focus issue to evaluate overall quality of evidence for key outcomes of interest and to generate clinical recommendations. The GRADE Working Group, which consists of a wide range of guideline developers from across the globe, has proposed that GRADE should function as the single system for grading the quality of evidence, because it was designed to avoid shortcomings of other systems that grade evidence. Compared with other systems that are available to evaluate the overall strength of evidence across studies, GRADE has a number of advantages.13 First, GRADE results in strength of evidence ratings that are distinct from clinical recommendations, which is important because a high strength of evidence rating does not necessarily correlate with a strong clinical recommendation (and vice versa). Second, GRADE uses an explicit and comprehensive methodology for arriving at strength of evidence ratings and includes criteria for downgrading and upgrading these ratings. The clearly defined ratings of “high,” “moderate,” “low,” and “insufficient” aid in conveying what the quality of the evidence means to the reader. Third, the well-defined methodology of GRADE moves from quality of evidence ratings to focused clinical recommendations using a transparent process. Fourth, GRADE uses a straightforward interpretation of strong versus weak clinical recommendations.13 Finally, GRADE has been adopted by more than 20 organizations, including The Cochrane Collaboration, the World Health Organization, the American College of Physicians, the Canadian Agency for Drugs and Technology in Health, the United Kingdom's National Institute for Health and Clinical Excellence, and BMJ Clinical Evidence.14 Other organizations have adopted variations of GRADE, including the AHRQ.6
Summary Statements. The objectives of this focus issue were met using a variety of manuscript and study designs, each of which has some unique methodological aspects. The reader should refer to the full article in this issue for additional details specific to each topic. The methods for systematic review follow accepted standards for rigor and, together with the application of GRADE, are intended to allow for transparency in the process of creating the clinical recommendations.
- Systematic reviews, narrative reviews, primary research studies, and a research protocol were undertaken to understand the mechanisms, diagnosis, progression, and treatment of CSM and ossification of the posterior longitudinal ligament.
- This article reports the methods used in the studies in this focus issue.
- This article also provides the methodology used to arrive at summary statements and clinical recommendations.
1. Smith JS, Ryan DJ, Shaffrey CI, et al. Association of myelopathy scores with cervical sagittal balance and normalized spinal cord volume: Analysis of 56 preoperative cases from the AOSpine North America myelopathy study. Spine 2013;38:S161–70.
2. Ames CP, Shaffrey CI, Massicotte EM, et al. Cervical Radiographical Alignment: Comprehensive Assessment Techniques and Potential Importance in Cervical Myelopathy. Spine 2013;38:S149–60.
3. Phillips B, Ball C, Sackett D, et al. Levels of evidence and grades of recommendation. Available at: http://www.cebm.net/?o–1025. Updated 2001. Accessed December 2, 2006.
4. Wright JG, Swiontkowski MF, Heckman JD. Introducing levels of evidence to the journal. J Bone Joint Surg Am 2003;85-A:1–3.
5. West S, King V, Carey TS, et al. Systems to Rate the Strength of Scientific Evidence. Rockville, MD: Agency for Healthcare Research and Quality; 2002. Evidence Report/Technology Assessment No. 47 (Prepared by the Research Triangle Institute-University of North Carolina Evidence-Based Practice Center, Contract No. 290-97-0011).
6. Methods Guide for Effectiveness and Comparative Effectiveness Reviews. Rockville, MD: Agency for Healthcare Research and Quality; 2012. AHRQ Publication No. 10(12)-EHC063-EF. Available at: www.effectivehealthcare.ahrq.gov
7. Wilson JR, Fehlings MG, Patel AA, et al. Genetics and Heritability of Cervical Spondylotic Myelopathy and Ossification of the Posterior Longitudinal Ligament: Results of a Systematic Review. Spine 2013;38:S123–46.
8. Attia J, Ioannidis JP, Thakkinstian A, et al. How to use an article about genetic association: B: are the results of the study valid? JAMA 2009;301:191–7.
9. Balshem H, Helfand M, Schunemann HJ, et al. GRADE guidelines: 3. Rating the quality of evidence. J Clin Epidemiol 2011;64:401–6.
10. Atkins D, Best D, Briss PA, et al. Grading quality of evidence and strength of recommendations. BMJ 2004;328:1490.
11. Owens DK, Lohr KN, Atkins D, et al. AHRQ series paper 5: grading the strength of a body of evidence when comparing medical interventions—Agency for Healthcare Research and Quality and the effective health-care program. J Clin Epidemiol 2010;63:513–23.
12. Ioannidis JP, Boffetta P, Little J, et al. Assessment of cumulative evidence on genetic associations: interim guidelines. Int J Epidemiol 2008;37:120–32.
13. Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924–6.
14. Higgins JP, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0. The Cochrane Collaboration; 2011. Available at: http://www.cochrane-handbook.org. Accessed May 20, 2013.
Keywords: cervical spondylotic myelopathy; CSM; ossification of the posterior longitudinal ligament; OPLL
© 2013 by Lippincott Williams & Wilkins