Development of the Scientific, Transparent and Applicable Rankings (STAR) tool for clinical practice guidelines : Chinese Medical Journal

Original Article

Development of the Scientific, Transparent and Applicable Rankings (STAR) tool for clinical practice guidelines

Yang, Nan1,2; Liu, Hui2,3; Zhao, Wei4; Pan, Yang5; Lyu, Xiangzheng6; Hao, Xiuyuan7; Liu, Xiaoqing8; Qi, Wenan9; Chen, Tong10; Wang, Xiaoqin11; Zhang, Boheng12; Zhang, Weishe13; Li, Qiu14; Xu, Dong15; Gao, Xinghua16; Jin, Yinghui17; Sun, Feng18; Meng, Wenbo19; Li, Guobao20; Wu, Qijun21; Chen, Ze1; Wang, Xu22; Estill, Janne1,2; Norris, Susan L.23; Du, Liang24; Chen, Yaolong1,25,26; Wei, Junmin27

Chinese Medical Journal, May 16, 2023. | DOI: 10.1097/CM9.0000000000002713



Clinical practice guidelines (hereafter referred to as "guidelines") are critical tools for guiding physicians in clinical practice.[1] High-quality guidelines can help to standardize medical practice, improve the quality of healthcare, and reduce health costs.[1–4] In the past 30 years, over 1000 guidelines have been published in China, with more than 200 produced every year over the last 3 years.[5–7]

Guideline researchers have evaluated guidelines from different perspectives using various tools, most notably the Appraisal of Guidelines for Research and Evaluation II (AGREE-II) tool[8] and Reporting Items for Practice Guidelines in Healthcare (RIGHT).[9–14] However, the existing evaluation tools have several limitations. First, the evaluation tools lack some of the key elements of guideline quality, such as guideline applicability,[15] transparency of the processes and methods used for development,[15] and prospective registries.[16] Second, some of the evaluation tools have not been adequately assessed for reliability and validity.[17–20] Third, most evaluation tools have a narrow focus such as methodological quality, reporting quality, or implementation of the guidelines. Thus, a comprehensive evaluation of a guideline is time-consuming as it requires an assessment with multiple tools encompassing different dimensions. In addition, there may be an overlap of items across different evaluation tools. Fourth, interpreting the results of multiple tools in combination and comparing them across different guidelines is challenging.

To overcome these barriers and to improve the quality of Chinese guidelines, we formed a working group to develop a unified, comprehensive, and practical evaluation tool for guidelines, named the Scientific, Transparent and Applicable Rankings (STAR) tool. STAR was intended for use by a wide range of users, including healthcare providers, policymakers, and guideline methodologists and researchers.


STAR was developed according to the following steps: (1) establishment of working groups; (2) scoping review; (3) Delphi survey of potential items; (4) hierarchical analysis of items; (5) consensus meeting; (6) collection of guidelines for evaluation by STAR and testing of STAR; (7) selection of evaluation methods and evaluators; and (8) testing for reliability, validity, and ease of use.

Establishment of working groups

The STAR working groups included the development group and the testing group. The development group was composed of 39 members, including, amongst others, guideline methodologists, statisticians, journal editors, and clinicians from 15 provincial-level administrative divisions in China. The development group was further divided into four subgroups (with some overlap of members): Guidance committee (3 members), Secretary group (6 members), Delphi group (20 members), and Consensus expert group (25 members).

The testing group was composed of 90 clinical evaluators and 35 methodological evaluators. We recruited methodological evaluators with extensive experience in guideline development and evaluation, and Chinese clinicians as clinical evaluators. All clinical evaluators had attended and passed the final examination of the guideline methodology training course organized by the Chinese Medical Association Publishing House and the World Health Organization Collaborating Centre for Guideline Implementation and Knowledge Translation in 2021.[21] All members completed a conflict of interest disclosure form.

Scoping review

Referring to a previous survey of guideline evaluation tools,[20] the Secretary group searched Medline (PubMed), China Biomedical Literature Database, Wanfang Data Knowledge Service Platform, and China National Knowledge Infrastructure (CNKI) for quality evaluation tools and methodological literature on guideline quality published from January 2017 to December 2021 [Supplementary file 1]. Key data were extracted by two independent members of the Secretary group, and disagreements were resolved through discussion. The Secretary group then formed a list of initial items for STAR by categorizing and de-duplicating the items based on the content and attributes of the surveyed tools and literature.

Delphi survey

The Secretary group created the questionnaire and collected experts' comments and suggestions using the WenJuanXing platform. The members of the Delphi group were invited to rate each initial item as either "agree", "disagree", or "not sure". Items with more than 75% "agree" responses were included in the tool; other items were modified according to suggestions made by the respondents. Items that did not reach consensus after two rounds of Delphi surveys were removed.[22] The Secretary group then collated the items and grouped them by domains.
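The inclusion rule described above can be illustrated with a minimal Python sketch. The function name `delphi_decision`, the "revise" label, and the example ratings are hypothetical and are not taken from the study; the only element drawn from the text is the rule that an item is included when more than 75% of responses are "agree".

```python
from collections import Counter

def delphi_decision(ratings, threshold=0.75):
    """Return (agree_share, decision) for one item.

    ratings: list of "agree" / "disagree" / "not sure" strings, one per expert.
    The item is included only if the share of "agree" is strictly above
    the threshold; otherwise it is sent back for revision.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    agree_share = counts["agree"] / len(ratings)
    decision = "include" if agree_share > threshold else "revise"
    return agree_share, decision

# Hypothetical ratings from 13 experts for two items (not study data).
item_a = ["agree"] * 11 + ["not sure"] * 2
item_b = ["agree"] * 9 + ["disagree"] * 3 + ["not sure"]
```

Under these assumptions, `item_a` (11 of 13 in agreement) would be included and `item_b` (9 of 13) would go back for modification.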

Hierarchical analysis

The Secretary group used hierarchical analysis to determine the weights for each item.[23] STAR was divided into three levels, with the entire STAR tool as the top level, the domains as the middle level, and the items contained in the domain as the bottom level. The Delphi expert group was surveyed by email about the importance of the domains and items. Each expert was asked to score the importance of each domain (second level) directly.[24] For the items in the same domain (third level), a pairwise comparison of importance was performed.[25] The item that was considered more important than its counterpart was given a score from one to nine where one implied that the two items were the same or equally important, and nine implied that the item was far more important than the counterpart. The less important item was then given the inverse as a score: for example, if the score of the more important item was nine, the counterpart item was given the score of 1/9. Based on the results of the importance survey, the Secretary group used hierarchical analysis software (Yaanp V2.3, Shanxi Meta-Decision Software Technology Co. Ltd., Taiyuan, Shanxi, China) to construct a judgment matrix to obtain the weight of each domain and the weight of each item within the domain.
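As an illustration of the pairwise comparison step, the following Python sketch builds a reciprocal judgment matrix and derives item weights from it. It uses normalised row geometric means, a common approximation in hierarchical analysis; this is an assumption, since the exact algorithm applied by the Yaanp software is not described here. The function names and the example judgments are hypothetical.

```python
import math

def reciprocal_matrix(n, judgments):
    """Build an n x n pairwise comparison matrix.

    judgments maps (i, j) -> score on the 1-9 scale, meaning item i is
    that many times as important as item j; the mirrored cell receives
    the inverse score. Unlisted pairs default to 1 (equal importance).
    """
    m = [[1.0] * n for _ in range(n)]
    for (i, j), score in judgments.items():
        m[i][j] = float(score)
        m[j][i] = 1.0 / score
    return m

def ahp_weights(matrix):
    """Approximate priority weights by normalised row geometric means."""
    geo = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(geo)
    return [g / total for g in geo]

# Hypothetical judgments for a three-item domain (not study data).
judgments = {(0, 1): 3, (0, 2): 5, (1, 2): 2}
weights = ahp_weights(reciprocal_matrix(3, judgments))
```

The resulting weights sum to 1 within the domain, which matches the way the item weights within each domain are reported.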

Consensus meeting

The Secretary group presented the results of the scoping review, Delphi survey, and hierarchical analysis at an in-person meeting held on July 03, 2021, in Lanzhou, China. All members of the Consensus expert group discussed the items and scoring methods, and all comments were recorded and filed. The Secretary group revised the items based on the comments and drafted the first version of the STAR tool, which was subsequently approved by all members of the Secretary group.

Samples for evaluation and testing

This study collected two samples of guidelines to evaluate and test STAR. The first sample included the top 50 Chinese guidelines (ranked according to a composite score from three established evaluation tools, described in Section "Criterion Validity") published in the journal series of the Chinese Medical Association in 2020 (hereinafter referred to as "Top 50 guidelines").[26] "Top 50 guidelines" were used in the evaluation of inter-rater reliability and criterion validity [Supplementary file 2].

The second sample included all guidelines (n = 352) and consensus statements (n = 929) from China published in English and Chinese journals in 2021 (hereinafter referred to as "2021 guidelines and consensus statements"). This was used to test the intrinsic reliability of STAR.

Evaluators and evaluation methods

Four of the 35 methodological evaluators and 30 of the 90 clinical evaluators evaluated the "Top 50 guidelines". Thirty of the 35 methodological evaluators and 67 of the 90 clinical evaluators evaluated the "2021 guidelines and consensus statements".

Ten guidelines from the "Top 50 guidelines" sample were randomly selected for evaluation by the four methodological evaluators. The 30 clinical evaluators were randomly assigned to 1 of 5 subgroups, and each subgroup of 6 clinicians evaluated 10 guidelines in "Top 50 guidelines". To ensure the reliability of the clinical evaluators' assessments, the scores of one clinical evaluator in each subgroup with the lowest rate of agreement with others were excluded from the analysis.

Each guideline or consensus statement from the "2021 guidelines and consensus statements" was independently evaluated by one methodological evaluator and one clinical evaluator, who subsequently discussed the results of the evaluation to resolve the disagreements.

The Secretary group designed the evaluation approach and data collection table for STAR through discussion. Upon review of each guideline and its accompanying documents, the evaluators independently judged whether the guideline adhered to each STAR item fully (indicated as "1"), partially (indicated as "0.5"), or not at all (indicated as "0"). The total score of a guideline was defined as the weighted sum of the item scores, calculated as the sum of 100 × domain weight × item weight × item score across all 39 items.
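The weighted sum can be sketched in Python as follows. The weights for the two domains shown are copied from Table 1; the function name `star_total`, the data layout, and the example ratings are hypothetical, and the remaining nine domains are omitted for brevity.

```python
def star_total(domains, ratings):
    """Weighted STAR total: sum of 100 * domain weight * item weight * rating.

    domains: {domain name: (domain weight, {item number: item weight})}
    ratings: {item number: 1, 0.5 or 0}; unrated items count as 0.
    """
    total = 0.0
    for domain_weight, items in domains.values():
        for item_no, item_weight in items.items():
            total += 100 * domain_weight * item_weight * ratings.get(item_no, 0)
    return total

# Two domains copied from Table 1; the other nine are omitted here.
domains = {
    "Registry": (0.050, {1: 0.293, 2: 0.707}),
    "Conflicts of interest": (0.092, {13: 0.474, 14: 0.526}),
}
# Hypothetical ratings for one guideline (not study data).
ratings = {1: 1, 2: 0.5, 13: 1, 14: 0}
partial_score = star_total(domains, ratings)
```

Because the item weights sum to 1 within each domain, a guideline that fully adheres to every item in a domain earns 100 × the domain weight for that domain, for example 5.0 points for "Registry".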

The evaluation results for "Top 50 guidelines" were used for assessment of inter-rater reliability and criterion validity [Supplementary file 2]. The evaluation results for "2021 guidelines and consensus statements" (published as an article[27]) were used for assessing intrinsic reliability.

Assessment of reliability, validity, and usability

Intrinsic reliability

Intrinsic reliability (internal consistency) was evaluated in the "2021 guidelines and consensus statements" sample using Cronbach's α coefficient, calculated with SPSS 26.0 software (IBM, Carlsbad, CA, USA). A Cronbach's α coefficient ≥0.7 was considered to indicate acceptable consistency. If the coefficient of an item was <0.3, indicating a low correlation between the item and the overall score, the item should be deleted from STAR at the next STAR meeting if at least two-thirds of the participants agree. This study also excluded each item in turn and recalculated the Cronbach's α coefficient of the remaining items in the domain.[28]
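For readers who want to reproduce this kind of analysis outside SPSS, a minimal Python sketch of Cronbach's α and of the "α after removing each item" check is given below. It applies the standard formula to a small hypothetical data set; the function names and ratings are illustrative only, and the sketch does not claim to match SPSS output options in detail.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-guideline item-score lists."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across guidelines, and of the per-guideline totals.
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_item_deleted(rows):
    """Alpha of the remaining items after dropping each item in turn."""
    k = len(rows[0])
    return [
        cronbach_alpha([row[:i] + row[i + 1:] for row in rows])
        for i in range(k)
    ]

# Hypothetical ratings (1, 0.5, 0) of five guidelines on a three-item domain.
scores = [
    [1, 1, 0.5],
    [0, 0, 0],
    [1, 0.5, 1],
    [0.5, 0.5, 0],
    [1, 1, 1],
]
alpha = cronbach_alpha(scores)
```

Note that `alpha_if_item_deleted` needs at least three items per domain, which is consistent with the fact that this check cannot be performed for two-item domains.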

Inter-rater reliability

This study evaluated inter-rater reliability between the methodological evaluators, and between the clinical evaluators, for the "Top 50 guidelines" using Cohen's kappa coefficient with SPSS 26.0 software (IBM). The coefficient ranged from -1 to 1, with values >0.6 indicating high consistency and values ≤0.2 poor consistency.[29]
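A minimal Python sketch of Cohen's kappa for one pair of raters is shown below. It computes the unweighted coefficient; whether the study used unweighted or weighted kappa, and how pairs of raters were combined within a group, are not specified in this section, so both are assumptions here. The example ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal rating frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical STAR ratings (1, 0.5, 0) by two evaluators on ten items.
a = [1, 1, 0.5, 0, 0, 1, 0.5, 0, 1, 1]
b = [1, 1, 0.5, 0, 0.5, 1, 0, 0, 1, 0.5]
kappa = cohen_kappa(a, b)
```

With these hypothetical ratings the two evaluators agree on 7 of 10 items, and the resulting kappa falls between the ≤0.2 and >0.6 thresholds given above.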

Content validity

The content validity index (CVI) of each STAR item and the overall content validity of the STAR score were calculated using Microsoft Excel 16.60 (Microsoft Corp., Redmond, WA, USA) for the sample of "Top 50 guidelines". After completing the evaluations of the 10 guidelines assigned to their group, the clinical evaluators assessed the relative importance of each item, and suggested changes, deletions, or additions to the items. The degree of importance of each item to guideline evaluation was rated on a scale of 1–5, with 5 indicating the greatest importance. The mean score across the 30 clinical evaluators was then calculated for each item; items that received a mean score <2 were excluded from the tool after approval by the Consensus group. The item-level CVI (I-CVI) was calculated as the number of evaluators who gave an assessment of 4 or 5 (indicating high or greatest importance of the item) divided by the total number of clinical evaluators. The mean of the I-CVI of all items was defined as the scale-level CVI (S-CVI). The content validity of the item and the tool were considered good when I-CVI was ≥0.78 and S-CVI was ≥0.90, respectively; I-CVI <0.78 indicated that researchers needed to carefully revise, delete, or add items according to the comments,[30] and the item should be deleted from STAR at the next STAR meeting if at least two-thirds of the participants agree.
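The two indices defined above can be sketched in a few lines of Python. The function names and the importance ratings are hypothetical; only the definitions (share of ratings of 4 or 5 for I-CVI, mean of I-CVI values for S-CVI) follow the text.

```python
def item_cvi(ratings):
    """I-CVI: share of evaluators rating the item's importance 4 or 5."""
    return sum(r >= 4 for r in ratings) / len(ratings)

def scale_cvi(ratings_by_item):
    """S-CVI: mean of the I-CVI values over all items."""
    values = [item_cvi(r) for r in ratings_by_item.values()]
    return sum(values) / len(values)

# Hypothetical importance ratings from 30 evaluators for two items.
ratings_by_item = {
    "item_x": [5] * 20 + [4] * 8 + [3] * 2,   # 28 of 30 rate it 4 or 5
    "item_y": [5] * 10 + [4] * 12 + [3] * 8,  # 22 of 30 rate it 4 or 5
}
```

In this hypothetical example, `item_x` would pass the 0.78 threshold while `item_y` would be flagged for revision or deletion.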

Criterion validity

This study derived a composite score from the U.S. Institute of Medicine (IOM) guideline standards,[1] the Appraisal of Guidelines for Research and Evaluation in China (AGREE-China),[31] and the RIGHT[9] to evaluate criterion validity in the "Top 50 guidelines" sample. The composite score was calculated as the total score of AGREE-China × 0.4 + the overall reporting rate of RIGHT (converted from a percentage into a score out of 100) × 0.4 + the proportion of items selected as "yes" in the IOM (converted from a percentage into a score out of 100) × 0.2.[26] This study calculated the Pearson's correlation coefficient r between the composite score and the STAR score using Microsoft Excel 16.60 (Microsoft Corp.). The closer r was to 0, the weaker the linear relationship. The closer r was to -1 or 1, the stronger the negative or positive linear relationship, respectively. r ≥0.8 indicated a good fit and thus suggested a high criterion validity of the STAR tool.[32]
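The composite score and the correlation can be sketched in Python as follows. The 0.4/0.4/0.2 weighting follows the formula above; the function names and all scores are hypothetical, and the sketch assumes that the AGREE-China total is already expressed on a 0 to 100 scale.

```python
from math import sqrt

def composite_score(agree_china, right_rate, iom_rate):
    """Composite of three tools, all expressed as scores out of 100."""
    return agree_china * 0.4 + right_rate * 0.4 + iom_rate * 0.2

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for four guidelines (not study data).
composites = [composite_score(70, 60, 50), composite_score(50, 55, 40),
              composite_score(80, 75, 70), composite_score(40, 35, 30)]
star_scores = [62.0, 48.5, 77.0, 33.0]
r = pearson_r(composites, star_scores)
```

With real data, `r` would then be compared against the 0.8 threshold stated above.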

Usability survey

A usability survey was also administered to the clinical evaluators with assessment of each item on a scale ranging from one to five. A score of one was considered to be an extremely complex item and difficult to interpret and apply, whereas a score of five was considered to be the easiest to assess and could be mastered quickly. The usability of each item was then calculated as the mean value across the respondents. In addition, the clinical evaluators reported the average time spent evaluating each guideline in minutes and provided suggestions for STAR use and promotion.


STAR items

The scoping review included seven evaluation tools[1,8,9,15,33–35] and two methodological articles.[16,36] Based on the identified evaluation tools, the Secretary group initially compiled 42 items related to the three dimensions of guideline scientificity, transparency, and applicability. These items were categorized into 11 domains.

Thirteen members of the Delphi group participated in the Delphi survey. In the first round of the Delphi survey, a total of 39 items reached the threshold for consensus agreement. After the Secretary group modified some items based on 36 comments from the experts, a second round of the Delphi survey was conducted. The remaining three items did not reach the threshold for consensus agreement during the second round either.

The weights of the 11 domains and 39 items are presented in Table 1. The highest weight of 0.170 was assigned to the domains "Clinical questions", "Evidence", and "Recommendations"; the lowest weight of 0.012 was assigned to "Other".

Table 1 - The STAR (Scientific, Transparent and Applicable Rankings) tool checklist.
| Domain | Domain weight | Items | Item weight | Item score |
| --- | --- | --- | --- | --- |
| Registry | 0.050 | 1. Register the guideline on an appropriate platform. | 0.293 | 1.5 |
| | | 2. Provide information about the registry platform and registry ID of the guideline. | 0.707 | 3.5 |
| Protocol | 0.050 | 3. Provide details of the guideline protocol. | 0.377 | 1.9 |
| | | 4. Identify how the guideline protocol is accessible from an open-source platform (e.g., guideline registry platform or website). | 0.623 | 3.1 |
| Funding | 0.031 | 5. Describe the sources of funding for the development of the guideline. | 0.305 | 1.0 |
| | | 6. Describe the role of funder(s) in the guideline development. | 0.289 | 0.9 |
| | | 7. Declare that the funder(s) did not influence the guideline's recommendations. | 0.406 | 1.3 |
| Guideline development groups | 0.073 | 8. List the institutional affiliations of all individuals involved in developing the guideline. | 0.128 | 0.9 |
| | | 9. Describe the composition of the development groups. | 0.137 | 1.0 |
| | | 10. Describe the responsibilities of all individuals or sub-groups involved in developing the guideline. | 0.175 | 1.3 |
| | | 11. Identify experts from at least two disciplines in addition to the guideline's topic who took part in the development. | 0.182 | 1.3 |
| | | 12. Identify guideline methodologists or experts in evidence-based medicine who took part in the development. | 0.378 | 2.8 |
| Conflicts of interest | 0.092 | 13. Describe whether conflicts of interest existed. | 0.474 | 4.4 |
| | | 14. Indicate information about the evaluation and management of conflicts of interest. | 0.526 | 4.8 |
| Clinical questions | 0.170 | 15. Identify the clinical questions that the guideline focuses on. | 0.377 | 6.4 |
| | | 16. Introduce the methods of collecting clinical questions, such as literature search, survey of users, or consultation of experts. | 0.146 | 2.5 |
| | | 17. Indicate how the clinical questions were selected and sorted. | 0.197 | 3.4 |
| | | 18. Format clinical questions in PICO (population/patients, intervention, control/comparator, and outcome) or other formats. | 0.281 | 4.8 |
| Evidence | 0.170 | 19. Identify the references for evidence supporting the main recommendations. | 0.098 | 1.7 |
| | | 20. State the details of the systematic search (e.g., names of databases, selection criteria, search strategies). | 0.131 | 2.2 |
| | | 21. Indicate the inclusion and exclusion criteria of research evidence. | 0.090 | 1.5 |
| | | 22. Assess the risk of bias or methodological quality of the included studies. | 0.113 | 1.9 |
| | | 23. Summarize and analyze the research evidence. | 0.125 | 2.1 |
| | | 24. Indicate the standard used to grade the evidence quality. | 0.132 | 2.2 |
| | | 25. Provide the GRADE evidence profile or summary of the results of evidence grading. | 0.139 | 2.4 |
| | | 26. Provide reference to the full text of systematic reviews. | 0.101 | 1.7 |
| | | 27. Identify the clinical questions with insufficient evidence (low quality) and indicate future research directions to collect more evidence. | 0.072 | 1.2 |
| Consensus method | 0.107 | 28. Indicate the specific method(s) used to reach consensus, such as the Delphi method, Nominal group technique, or informal approaches. | 0.478 | 5.1 |
| | | 29. Describe the criteria to inform decisions other than the certainty of the evidence (e.g., resource requirements, preferences and values of patients, cost–benefit balance, accessibility, health equity, acceptability). | 0.355 | 3.8 |
| | | 30. Provide the records of the consensus process. | 0.167 | 1.8 |
| Recommendations | 0.170 | 31. Make the recommendations clearly identifiable, e.g., in a table, or using enlarged or bold fonts. | 0.240 | 4.1 |
| | | 32. Indicate the strength of all recommendations. | 0.367 | 6.3 |
| | | 33. Provide the explanations for all recommendations. | 0.231 | 3.9 |
| | | 34. Indicate the considerations (e.g., adverse effects) in clinical practice when implementing the recommendations. | 0.162 | 2.8 |
| Accessibility | 0.073 | 35. Make the guideline accessible through multiple platforms (such as guideline libraries, conference presentations, and websites). | 0.349 | 2.5 |
| | | 36. Provide tailored editions of the guidelines for different groups of target users (e.g., patients, public, primary care physicians). | 0.186 | 1.4 |
| | | 37. Present the guideline or recommendations visually, such as with figures or videos. | 0.152 | 1.1 |
| | | 38. Make the full guideline downloadable free of charge. | 0.314 | 2.3 |
| Other | 0.012 | 39. Provide a flowchart of clinical pathways reflecting the recommendations. | 1.000 | 1.2 |

GRADE: Grading of Recommendations, Assessment, Development and Evaluation; ID: Identification number; STAR: Scientific, Transparent and Applicable Rankings.

None of the experts who participated in the consensus meeting suggested deleting any of the initial 39 items or adding any new items. The Secretary group revised the wording of some items based on the experts' comments.


Intrinsic reliability

Across the 10 domains with at least two items, the Cronbach's α coefficients ranged from 0.078 to 0.920 (mean: 0.588; 95% confidence interval [CI]: 0.414–0.762). The "Registry" domain had the highest Cronbach's α coefficient (0.920), followed by the "Evidence", "Clinical questions", "Funding", "Recommendations", "Consensus method", and "Protocol" domains, with coefficients ranging from 0.515 to 0.807, suggesting acceptable consistency [Table 2]. The "Guideline development groups", "Conflicts of interest", and "Accessibility" domains had the lowest coefficients (<0.5), suggesting poor consistency. The item coefficients of Item 6 in the "Funding" domain, Items 8 and 9 in the "Guideline development groups" domain, Items 19 and 27 in the "Evidence" domain, Item 29 in the "Consensus method" domain, and all items in the "Conflicts of interest" and "Accessibility" domains were <0.3, suggesting that these items could be deleted from the tool to improve consistency in their respective domains.

Table 2 - Intrinsic reliability, importance, and usability of STAR domains and items.
| Domain | Intrinsic reliability (domain coefficient) | Item number | Item coefficient | Domain coefficient after removing the respective item | Importance score | I-CVI | Usability score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Registry* | 0.920 | 1 | 0.859 | | 4.5 | 0.867 | 4.7 |
| | | 2 | 0.859 | | 4.3 | 0.833 | 4.8 |
| Protocol* | 0.515 | 3 | 0.432 | | 4.6 | 0.900 | 4.7 |
| | | 4 | 0.432 | | 4.4 | 0.867 | 4.6 |
| Funding | 0.684 | 5 | 0.850 | 0.243 | 4.3 | 0.733 | 4.6 |
| | | 6 | 0.168 | 0.876 | 4.1 | 0.733 | 4.3 |
| | | 7 | 0.910 | 0.143 | 4.3 | 0.767 | 4.4 |
| Guideline development groups | 0.471 | 8 | 0.015 | 0.545 | 4.7 | 0.933 | 4.7 |
| | | 9 | 0.275 | 0.407 | 4.7 | 0.967 | 4.6 |
| | | 10 | 0.369 | 0.367 | 4.5 | 0.900 | 4.5 |
| | | 11 | 0.294 | 0.410 | 4.7 | 0.933 | 4.4 |
| | | 12 | 0.395 | 0.315 | 4.8 | 0.967 | 4.6 |
| Conflicts of interest* | 0.374 | 13 | 0.258 | | 4.8 | 0.967 | 4.6 |
| | | 14 | 0.258 | | 4.5 | 0.833 | 4.2 |
| Clinical questions | 0.774 | 15 | 0.447 | 0.782 | 4.8 | 0.933 | 4.2 |
| | | 16 | 0.638 | 0.704 | 4.8 | 1.000 | 4.2 |
| | | 17 | 0.694 | 0.661 | 4.7 | 0.900 | 4.1 |
| | | 18 | 0.606 | 0.719 | 4.7 | 0.967 | 4.1 |
| Evidence | 0.807 | 19 | 0.130 | 0.820 | 4.9 | 1.000 | 4.5 |
| | | 20 | 0.740 | 0.753 | 4.9 | 1.000 | 4.2 |
| | | 21 | 0.647 | 0.776 | 4.8 | 1.000 | 4.3 |
| | | 22 | 0.585 | 0.786 | 4.9 | 1.000 | 4.2 |
| | | 23 | 0.794 | 0.751 | 4.6 | 0.933 | 3.9 |
| | | 24 | 0.513 | 0.791 | 4.9 | 0.967 | 4.4 |
| | | 25 | 0.476 | 0.797 | 4.9 | 0.967 | 4.3 |
| | | 26 | 0.755 | 0.750 | 4.7 | 0.967 | 4.0 |
| | | 27 | 0.190 | 0.849 | 4.4 | 0.867 | 3.7 |
| Consensus method | 0.623 | 28 | 0.544 | 0.353 | 4.7 | 0.967 | 4.3 |
| | | 29 | 0.265 | 0.762 | 4.7 | 0.967 | 4.1 |
| | | 30 | 0.522 | 0.411 | 4.1 | 0.800 | 4.0 |
| Recommendations | 0.634 | 31 | 0.675 | 0.415 | 4.8 | 0.967 | 4.8 |
| | | 32 | 0.455 | 0.579 | 4.9 | 1.000 | 4.6 |
| | | 33 | 0.647 | 0.416 | 4.9 | 1.000 | 4.4 |
| | | 34 | 0.049 | 0.760 | 4.7 | 0.967 | 3.9 |
| Accessibility | 0.078 | 35 | 0.063 | 0.009 | 4.4 | 0.767 | 4.4 |
| | | 36 | 0.062 | 0.060 | 4.0 | 0.700 | 4.3 |
| | | 37 | 0.153 | -0.005 | 4.0 | 0.733 | 4.4 |
| | | 38 | -0.019 | 0.265 | 4.6 | 0.900 | 4.3 |
| Other† | | 39 | | | 4.4 | 0.833 | 4.4 |

*The number of items in the domain is two, and thus consistency could not be calculated after deleting a single item. †The number of items in the domain is <2, and thus the consistency of the domain could not be calculated. I-CVI: Item-level content validity index; STAR: Scientific, Transparent and Applicable Rankings.

Inter-rater reliability

The Cohen's kappa coefficients of the four methodological evaluators ranged from 0.716 to 0.802 (mean: 0.774; 95% CI: 0.740–0.807), suggesting strong consistency [Supplementary file 3-1]. Cohen's kappa coefficients within the groups of clinical evaluators ranged from 0.320 to 0.924 (mean: 0.586; 95% CI: 0.564–0.607), indicating moderate consistency. After excluding one clinical rater per group, the range was 0.406 to 0.924 (mean: 0.618; 95% CI: 0.587–0.648), indicating strong consistency. Cohen's kappa coefficients for all groups of evaluators are shown in Supplementary files 3-2 to 3-6.


Content validity

The importance of the STAR items was assessed by the clinical evaluators. Items 32 and 33 had the highest importance (score 4.9 out of 5.0), while Items 36 and 37 had the lowest importance (score 4.0) [Table 2]. The mean of the importance scores was 4.6. The clinical evaluators made a total of 21 comments for changes and deletions to the initial set of items. The I-CVI values of three items (Items 5, 6, and 7) in the "Funding" domain and three items (Items 35, 36, and 37) in the "Accessibility" domain were <0.78, suggesting that the Secretary group should consider revising or deleting these items, following which the Consensus expert group would discuss and vote on any changes at the next consensus meeting. The overall S-CVI of STAR was 0.905, indicating good overall content validity.

Criterion validity

Pearson's correlation coefficient was 0.885 (95% CI: 0.804–0.932, P <0.001), suggesting good criterion validity of STAR.

Usability survey

The usability scores for the STAR items ranged between 3.7 and 4.8, with a mean of 4.6 [Table 2]. The score for Item 2 was the highest and Item 27 was the lowest. The clinical evaluators took between 10 min and 60 min to evaluate each guideline, with a median of 20 min. The clinical evaluators made 55 comments for the use and promotion of STAR, and the Secretary group made corresponding changes and improvements.


To meet the need for a comprehensive evaluation of Chinese guidelines, the STAR working groups developed a multidimensional ranking tool for clinical practice guidelines. The tool included 39 items grouped into 11 domains with different weights. The STAR successfully assigns weights to the domains and items. Evaluators can calculate an overall STAR score for a guideline based on the weights and this overall score can be used to rank and compare a group of guidelines.

Only 17% of existing guideline evaluation tools reported the assessment of the reliability or validity of the tools.[8,17] The National Guideline Clearinghouse (NGC) published the Extent of Adherence to Trustworthy Standards (NEATS) tool in 2019, based on the AGREE and IOM standards. NEATS was a relatively comprehensive tool with both reliability and validity verified. However, in contrast to STAR, NEATS did not conduct assessments of intrinsic reliability, structural validity, criterion validity, or usability.[37]

In the intrinsic reliability assessment of STAR, unsurprisingly the best agreement was found in the "Registry" domain, because the guidelines that were registered also reported registration information. The items within each of the domains "Clinical questions", "Working group", "Evidence" and "Protocol" reflected aspects of the same information and were also consistent with each other. The first two items in the "Recommendations" domain reflected the transparency of reporting, and the last two items reflected the rigor of the guideline methodology, so the consistency across items in this domain was low. The consistency in the "Conflicts of interest" domain was also poor, mainly because the proportion of guidelines with a conflict of interest management strategy in the sample was low but the proportion reporting no conflicts of interest was high. Based on the results of the intrinsic reliability test, the deletion of certain items could be considered, but any change to STAR needs the agreement of two-thirds of the Consensus expert group members and must wait until the next consensus meeting, where the results will be presented and discussed. The inter-rater consistency of the guideline evaluation results was higher among the methodological evaluators than among the clinical evaluators of STAR (weighted kappa value: 0.783 vs. 0.579), probably because the methodological evaluators were more familiar with guideline methodology and had more experience in guideline development and evaluation than clinicians.

According to the content validity assessment, at least 70% of the clinical evaluators felt that each item in STAR should be included. The overall CVI reached 90%, which reflected that the experts generally agreed on the high importance of most items. The content validity in NEATS was similar: between 80% and 100%. The low content validity of six items in the domains "Funding" and "Accessibility" reflected the limited attention paid to funding and accessibility in Chinese guidelines, and this finding was consistent with the results of previous guideline evaluations.[14,26] The removal of these items from STAR would, however, also require agreement by two-thirds of the Consensus expert group members.

The strong correlation between STAR rating results and a composite score of three established guideline checklists also demonstrated that STAR could be considered as an alternative to the composite scores of multiple evaluation tools. An evaluation using a composite score from different tools was inefficient: it required the use of multiple tools at the same time, and assigning weights to the results of different evaluation tools when calculating the composite score was also problematic.[38–40] For example, the published "Evaluation report of published guidelines in the 2020 Chinese Medical Association series" needed to be completed by a total of 58 methodological evaluators, clinical evaluators, and quality control evaluators, and each guideline needed to be evaluated by three members using three different evaluation tools, as well as an additional member for results verification. In contrast, STAR allowed a comprehensive evaluation using only one tool, with each guideline evaluated by only two evaluators. The median time spent by the clinical evaluators was only 20 min, which was substantially less than the 2 h to 3 h spent by the methodological evaluators using NEATS to evaluate one guideline.[37] However, as the guidelines selected for evaluation with STAR tended to be on average shorter and of higher quality than those selected for evaluation with NEATS, these durations may not be directly comparable.

Item 27 had the lowest usability score, possibly because the relevant content can appear anywhere in the body of the guideline text, and the item lacked clearly identifiable keywords to guide the evaluator. Item 2, on the other hand, had clearly identifiable keywords ("registration") and Item 31 required clearly identifiable recommendations: both items were thus relatively easy to assess and also had the highest usability scores. The STAR working group will develop a user manual and provide training courses, with the aim of helping users to easily obtain the pertinent information.

STAR has the following limitation: the weights of its domains and items were determined subjectively, and the overall score and ranking may be sensitive to the weighting. In conclusion, the STAR guideline evaluation tool has good reliability and validity, and clear advantages in efficiency compared to existing tools. STAR is therefore well suited for the comprehensive evaluation of clinical practice guidelines.


Nan Yang was funded by China Scholarship Council (Grant No. 202206180007); Hui Liu was funded by China Scholarship Council (Grant No. 202206180006).

Conflicts of interest

Declared no conflict of interest. Yaolong Chen is the Co-Founder and Co-Chair of RIGHT (Reporting Items for Practice Guidelines in Healthcare) working group.


1. Institute of Medicine (US). Clinical practice guidelines we can trust. Washington, DC: The National Academies Press, 2011. ISBN: 0309164222.
2. Grimshaw JM, Russell IT. Effect of clinical guidelines on medical practice: A systematic review of rigorous evaluations. Lancet 1993;342: 1317–1322. doi: 10.1016/0140-6736(93)92244-n.
3. Grimshaw J, Russell I. Achieving health gain through clinical guidelines. I: Developing scientifically valid guidelines. Qual Health Care 1993;2: 243–248. doi: 10.1136/qshc.2.4.243.
4. Graham RP, James PA, Cowan TM. Are clinical practice guidelines valid for primary care? J Clin Epidemiol 2000;53: 949–954. doi: 10.1016/s0895-4356(99)00224-3.
5. Yang N, Chen YL. Survey and evaluation of clinical practice guidelines in China in 2019. Med J PUMCH 2021;12: 407–410. doi: 10.12290/xhyxzz.2021-0323.
6. Chen Y, Wang C, Shang H, Yang K, Norris SL. Clinical practice guidelines in China. BMJ 2018;360: j5158. doi: 10.1136/bmj.j5158.
7. Wang ZJ, Shi QL, Liu YL, Ren MJ, Zhao SY, et al. Evaluation of Chinese Clinical Practice Guidelines Published in Medical Journals in 2019: status of the Authorship and Guideline Development Group. Med J PUMCH 2021;12: 552–559. doi: 10.12290/xhyxzz.2021-0438.
8. Brouwers MC, Kho ME, Browman GP, Burgers JS, Cluzeau F, Feder G, et al. AGREE II: Advancing guideline development, reporting and evaluation in health care. CMAJ 2010;182: E839–E842. doi: 10.1503/cmaj.090449.
9. Chen Y, Yang K, Marušić A, Qaseem A, Meerpohl JJ, Flottorp S, et al. A reporting tool for practice guidelines in health care: The RIGHT statement. Ann Intern Med 2017;166: 128–132. doi: 10.7326/M16-1565.
10. Chen YL, Yao L, Xiao XJ, Wang Q, Wang ZH, Liang FX, et al. Quality assessment of clinical guidelines in China: 1993–2010. Chin Med J 2012;125: 3660–3664. doi: 10.3760/cma.j.issn.0366-6999.2012.20.011.
11. Wei D, Wang SQ, Wu QF, Yao L, Wang Q, Wang TF, et al. Quality evaluation on Chinese clinical practice guidelines in 2011. Chin J Evid Based Med 2013;13: 760–763. doi: 10.7507/1672-2531.20130134.
12. Li N, Yao L, Wu QF, Wei D, Wang Q, Wang XQ, et al. Quality evaluation of clinical practice guidelines published in mainland Chinese journals from 2012–2013. Chin J Evid Based Med 2015;15: 259–263. doi: 10.7507/1672-2531.20150045.
13. Zhou Q, Wang Z, Shi Q, Zhao S, Xun Y, Liu H, et al. Clinical epidemiology in China series. Paper 4: The reporting and methodological quality of Chinese clinical practice guidelines published between 2014 and 2018: A systematic review. J Clin Epidemiol 2021;140: 189–199. doi: 10.1016/j.jclinepi.2021.08.013.
14. Liu YL, Zhang JY, Shi QL, Yang N, Wang ZJ, Luo XF, et al. Investigation and evaluation of Chinese clinical practice guidelines published in medical journals in 2019: Methodological and reporting quality. Med J PUMCH 2021;12: 324–331. doi: 10.12290/xhyxzz.2022-0027.
15. Yang L, Long YL, Cheng YF, Hu T, Yang ZX, Liu LC, et al. Evidence-based construction of a transparency evaluation tool for clinical practice guidelines. Chin J Evid Based Med 2021;21: 869–875. doi: 10.7507/1672-2531.202106027.
16. Chen Y, Guyatt GH, Munn Z, Florez ID, Marušić A, Norris S, et al. Clinical Practice Guidelines Registry: Toward reducing duplication, improving collaboration, and increasing transparency. Ann Intern Med 2021;174: 705–707. doi: 10.7326/M20-7884.
17. Vlayen J, Aertgeerts B, Hannes K, Sermeus W, Ramaekers D. A systematic review of appraisal tools for clinical practice guidelines: Multiple similarities and one common deficit. Int J Qual Health Care 2005;17: 235–242. doi: 10.1093/intqhc/mzi027.
18. Siering U, Eikermann M, Hausner E, Hoffmann-Eßer W, Neugebauer EA. Appraisal tools for clinical practice guidelines: A systematic review. PLoS One 2013;8: e82915. doi: 10.1371/journal.pone.0082915.
19. Zhang Y, Zhang S, Zhou Z, Wu M. Development and insights of international clinical guideline evaluation tools. Chin J Med Libr Inf Sci 2015;24: 11–16. doi: 10.3969/j.issn.1671-3982.2015.01.003.
20. Wang Q. Study on the quality evaluation of clinical practice guidelines in China. Lanzhou: Lanzhou University, 2017.
21. Chinese Medical Association Publishing House. Strengthening research and evaluation of guideline methodology to promote the development of China's medical and health care–The first training course on clinical practice guideline methodology for 2021 was successfully held in Beijing. China Med News 2021;36: 2. doi: 10.3760/cma.j.issn.1000-8039.2021.08.102.
22. Diamond IR, Grant RC, Feldman BM, Pencharz PB, Ling SC, Moore AM, et al. Defining consensus: A systematic review recommends methodologic criteria for reporting of Delphi studies. J Clin Epidemiol 2014;67: 401–409. doi: 10.1016/j.jclinepi.2013.12.002.
23. Thomas PG, Doherty PC. The analytic hierarchy process: Planning, priority setting, resource allocation. New York, NY: McGraw-Hill; 1980.
24. Luo ZQ. A new method of judgment matrix construction in hierarchical analysis. J Univ Electron Sci Technol 1999;5: 557–561. doi: 10.3969/j.issn.1001-0548.1999.05.027.
25. Saaty RW. Decision making in complex environments: The analytic network process (ANP) for dependence and feedback; A manual for the ANP software SuperDecisions. Pittsburgh, PA: Creative Decisions Foundation; 2002.
26. Guidelines and Standards Research Center of Chinese Medical Association Publishing House; WHO Collaborating Center for Guideline Implementation and Knowledge Translation. Evaluation report of guidelines published in the Chinese Medical Association Journal Series 2020. Chin Med J 2021;101: 1839–1847. doi: 10.3760/cma.j.cn112137-20210402-00803.
27. Guidelines and Standards Research Center of Chinese Medical Association Publishing House; WHO Collaborating Center for Guideline Implementation and Knowledge Translation. Evaluation and ranking for scientificity, transparency and applicability of Chinese guidelines and consensus published in the medical journals in 2021. Chin Med J 2022;102: 2319–2328. doi: 10.3760/cma.j.cn112137-20220602-01232.
28. George D, Mallery P. SPSS for Windows step by step: A simple guide and reference. 11.0 update. 4th ed. Boston: Allyn & Bacon; 2003. ISBN: 0205515851.
29. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20: 37–46. doi: 10.1177/001316446002000104.
30. Lynn MR. Determination and quantification of content validity. Nurs Res 1986;35: 382–385. doi: 10.1097/00006199-198611000-00017.
31. Wang JY, Wang Q, Wang XQ, Jin XJ, Zhang BY, Chen SY, et al. Development and initial validation of AGREE China. Chin Med J 2018;98: 1544–1548. doi: 10.3760/cma.j.issn.0376-2491.2018.20.004.
32. Hair JF, Ringle CM, Sarstedt M. PLS-SEM: Indeed a silver bullet. J Mark Theory Pract 2011;19: 139–152. doi: 10.2753/MTP1069-6679190202.
33. Brouwers MC, Kerkvliet K, Spithoff K, AGREE Next Steps Consortium. The AGREE Reporting Checklist: A tool to improve reporting of clinical practice guidelines. BMJ 2016;352: i1152. doi: 10.1136/bmj.i1152.
34. Brouwers MC, Spithoff K, Kerkvliet K, Alonso-Coello P, Burgers J, Cluzeau F, et al. Development and validation of a tool to assess the quality of clinical practice guideline recommendations. JAMA Netw Open 2020;3: e205535. doi: 10.1001/jamanetworkopen.2020.5535.
35. Kashyap N, Dixon J, Michel G, Brandt C, Shiffman R, et al. GuideLine implementability appraisal v. 2.0. New Haven, CT: Yale Center for Medical Informatics; 2011.
36. Schünemann HJ, Wiercioch W, Etxeandia I, Falavigna M, Santesso N, Mustafa R, et al. Guidelines 2.0: Systematic development of a comprehensive checklist for a successful guideline enterprise. CMAJ 2014;186: E123–E142. doi: 10.1503/cmaj.131237.
37. Jue JJ, Cunningham S, Lohr K, Shekelle P, Shiffman R, Robbins C, et al. Developing and testing the Agency for Healthcare Research and Quality's National Guideline Clearinghouse Extent of Adherence to Trustworthy Standards (NEATS) instrument. Ann Intern Med 2019;170: 480–487. doi: 10.7326/M18-2950.
38. Wayant C, Cooper C, Turner D, Vassar M. Evaluation of the NCCN guidelines using the RIGHT Statement and AGREE-II instrument: A cross-sectional review. BMJ Evid Based Med 2019;24: 219–226. doi: 10.1136/bmjebm-2018-111153.
39. Yao X, Ma J, Wang Q, Kanters D, Ali MU, Florez ID, et al. A comparison of AGREE and RIGHT: Which clinical practice guideline reporting checklist should be followed by guideline developers? J Gen Intern Med 2020;35: 894–898. doi: 10.1007/s11606-019-05508-3.
40. Zhao S, Lu S, Wu S, Wang Z, Guo Q, Shi Q, et al. Analysis of COVID-19 guideline quality and change of recommendations: A systematic review. Health Data Science 2021;2021: 9806173. doi: 10.34133/2021/9806173.

Keywords: Practice guideline; Evidence-based practice; Quality control

Copyright © 2023 The Chinese Medical Association, produced by Wolters Kluwer, Inc. under the CC-BY-NC-ND license.