Consensus methods have been used increasingly over the last 60 years as a way to gather general agreement on topics that do not yet have empirical evidence to support future decisions or actions; often, these topics are ambiguous or controversial. Consensus methods can also be used to forecast future events or to create decision protocols, such as diagnostic algorithms and treatment plans in clinical settings. One such setting is the National Institutes of Health (NIH), whose consensus development panels, and the reports that researchers produce from them, serve to gather consensus on how to diagnose and treat common morbidities in the United States.
Generally speaking, three main consensus methods are in use by researchers across multiple fields: nominal group process, consensus development panels, and the Delphi technique.1 Each of these methods has been used increasingly in health care settings over the past 40 years. However, despite the widespread utility of consensus methods and the variety of approaches available, there are no established guidelines for conducting such studies. This lack of stringency has led to variability not only in how results are reported but in how the studies themselves are conducted. We set out to determine best practices for conducting such research, as well as for reporting results, in the hope that future studies will be more reliable and valid. To do this, we performed an extensive literature review of studies using consensus methodology both in academic medicine/health care and in multiple fields outside of medicine, focusing on several aspects of the studies and how results were reported.
The initial question for our literature review was, what are the standards for conducting a well-run consensus study, whether using the Delphi, nominal group process, or consensus development panel approach? We searched several electronic research publication databases, including PubMed, ERIC, PsycARTICLES, PsycINFO, Web of Science, and Google Scholar, with the following terms: Delphi, nominal group, consensus panel, and method. Our main interest was in articles that either formally focused on the nature of the technique being used or described the technique employed in sufficient detail for us to fully identify its components. We also attempted to contact authors of articles aimed at informing potential users about the specific methods, their advantages, and their drawbacks. Articles that only briefly described the consensus methods used, did not describe their specific adaptations of the method, or focused only on the outcomes of applying the method were not included.
Our initial review documented a lack of rigor in methodological discussions surrounding consensus methodology, so we shifted our focus and reviewed the same publications to determine how a majority of researchers were conducting consensus studies, analyzing their results, and reporting findings. We also looked for how they measured the success of their studies—did they evaluate for reliability or do follow-up studies to measure how well their newly implemented criteria were holding up? We felt that this was important because different consensus methodologies are commonly used when creating clinical diagnostic criteria and in making decisions that affect medical education in the United States. Without a rigorous set of guidelines for these studies, it is difficult to speak to the reliability and validity of such studies and findings, and thus difficult to justify their use in making decisions related to clinical and academic medicine.
We synthesize the various approaches to conducting consensus methodology research and give a brief overview of each method in Table 1. Given our findings, we have developed a set of guidelines and suggestions to assist researchers hoping to use consensus methodology in their work, which can be found in the Conclusions section of this article. To develop these guidelines, we reviewed 50 publications dealing with the three consensus methods mentioned above, ultimately using 32 of those publications in our final assessment. We discuss how the authors of each study approached several key aspects of study design. Publications that gave supporting arguments, whether in the form of statistical analysis (e.g., reliability as a function of group size and number of rounds) or literature support, were considered for use in the synthesis and development of our guidelines.
Table 1: Overview of Three Approaches to Conducting Consensus Methodology Researcha
Nominal Group Process
The nominal group process (also called the expert panel method) is a consensus method that allows a group of experts to develop and suggest ideas or solutions within a group; these ideas are privately formed and presented anonymously to the group as a whole.2 In this method, the panel is composed exclusively of experts. Typically, a panel of experts meets and is asked to generate ideas and solutions in response to a specific question; this is the first phase. Depending on the subject matter, the respondents may or may not be presented with pertinent literature to inform their initial decisions. During the second phase, each participant submits her or his solutions to a moderator, who then presents them to the entire panel. During this time, each member sits and listens to each idea or solution, and no discussion occurs. In the third phase, a moderator-led discussion covers and clarifies each suggested solution. In the fourth and final phase, panel members anonymously rank each of the ideas on a predetermined scale; those with the highest rankings are kept. The cutoff for consensus is predetermined by the researchers running the nominal group process meeting.1,3
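The fourth phase lends itself to a simple aggregation step. Below is a minimal sketch in Python; the 1-to-5 scale, the mean-based tally, and the 3.5 cutoff are illustrative assumptions, since the method specifies only that the scale and the consensus cutoff be predetermined by the researchers:

```python
from statistics import mean

def rank_nominal_group(ratings, cutoff):
    """Aggregate the anonymous final-phase ratings of a nominal group
    and keep only the ideas whose mean rating meets a preset cutoff.

    ratings: dict mapping each idea to the list of scores its raters
             gave anonymously (here, hypothetically, on a 1-5 scale).
    cutoff:  the consensus threshold chosen by the researchers before
             the meeting.
    """
    kept = {idea: mean(scores) for idea, scores in ratings.items()
            if mean(scores) >= cutoff}
    # Highest-ranked ideas first.
    return sorted(kept.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical panel of 5 experts rating three suggested solutions.
ratings = {
    "idea A": [5, 4, 5, 4, 5],
    "idea B": [2, 3, 2, 1, 2],
    "idea C": [4, 4, 3, 5, 4],
}
print(rank_nominal_group(ratings, cutoff=3.5))
# idea B falls below the cutoff and is dropped
```

Mean ratings are only one possible tally; rank sums or medians would fit the method equally well.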
Though this method is used across multiple fields of study, it has been used to examine the appropriateness of interventions in health care, ultimately impacting the decision-making process that occurs in treating patients.4 Scott and Black5 used this method to determine the appropriateness of gallbladder removal (cholecystectomy), a procedure commonly performed following acute cholecystitis. However, there are less common conditions that may warrant a cholecystectomy (e.g., cancer of the gallbladder), and there was no consensus concerning the appropriateness of removal in these instances. Ultimately, a panel of clinicians served as an expert panel to determine the level of appropriateness of removing the gallbladder for various conditions.5 This process can be extrapolated for further use in academic medical settings—for instance, testing the appropriateness of current standards—or for developing standards in general (e.g., length of stay following certain surgeries).
This particular method is of use in academic medical research because it relies on anonymity, which reduces outside influences on true expert opinion. In addition, the panel is given time to clarify thought processes and suggestions without hindering the timeliness of the study. Lastly, the method is time efficient: because the panel of experts is kept on task and focused throughout the meeting, the process is relatively fast.
Despite the many advantages of this particular method, some limitations do exist. First, the process is expensive and complicated to organize; it requires resources that may not be available, such as time, a meeting place, and money. In addition, the face-to-face meeting may pose a problem for credibility among the panel of experts: while the generation of ideas is independent, the discussion that ensues has the potential to expose beliefs or practices that are not in line with those of the majority of panel members. This risk can, in turn, discourage collaboration and thus potentially hinder the quality of the results.
Although several components of the nominal group process can be modified according to the needs and design of the study, Clayton2 states that the initial prompt to the members of the panel must be as clear and concise as possible to elicit the intended variety of responses from the panel. Of the components that may be modified, perhaps the most variable is the size of the panel itself. Horton6 suggests that the ideal size is between 7 and 10 members, while Delp et al7 recommend 5 to 9 members, suggesting that any fewer than 5 members would limit the quality and diversity of opinions. On the other hand, they suggest that larger groups (10 or more) inherently have more differences of opinion, unnecessarily lengthening the process without a compensatory increase in the quality of results.7
Delp et al7 also discuss how the composition of the group in the nominal group process may impact the results. Those groups that were known to be more diverse and heterogeneous came up with more varied suggestions and ultimately approached the nominal question in a more creative manner (compared with homogeneous groups). However, as groups grew larger and more heterogeneous, communication became more difficult, and obvious interpersonal differences in the group were thought to have hindered outcomes. Ultimately, the goal of the nominal group process is to generate and then prioritize hypotheses and suggestions in response to the nominal question, so these varying approaches may be necessary to accommodate for the nature of the question itself.7
After reviewing several approaches and explanations for these modifications, we have decided on a set of guidelines for using the nominal group process in research related to academic medicine as well as suggestions for reporting:
- The research question used to prompt the panel must be clear and concise to obtain valid suggestions from panel members;
- The size of the panel should range from 5 to 10 members; any fewer than 5 members would lead to a lack of novel suggestions, and any more than 10 would lead to lack of consensus (i.e., too many cooks in the kitchen);
- Groups should be heterogeneous to allow for more creative solutions; and
- To allow for reproducible results, explain how groups were chosen as well as the criteria used to determine how and when a consensus was met.
Consensus Development Panels
Consensus development panels (also known as consensus development conferences) are organized meetings of experts in a given field and, depending on the topic of the conference, may require a mixture of experts from various fields to create a multidisciplinary approach to the topic at hand. This consensus method is commonly used in health care research—namely, to formulate policies and develop strategic plans. This approach is useful in health care because it allows a multidisciplinary approach to solving a problem or creating a policy. There are a variety of specific methods for conducting consensus development panels, tailored to the type of topic and the type of experts deemed appropriate to make recommendations. The most well-developed method has been used by the NIH.
The NIH is known to hold conferences to evaluate current scientific literature surrounding pertinent biomedical issues. Ultimately, researchers produce consensus statements that address the topic at hand in a way that is accessible to both laypeople and professionals. In addition, Halcomb et al8 address this tool’s utility in debating the state of science as it pertains to making clinical decisions, but we recognize that this is an underrepresented method in academic medical studies. Despite this, we believe this method is of use in academic medicine because panel members are presented with literature and data that make this particular method more reliant on evidence-based opinions rather than personal experience.
Because of the face-to-face interaction associated with this method, there are some key advantages to using a consensus development conference. The first is the synthesis of the best available information in the field; the second is that the experts take ownership of the material because the topic tends to impact them directly—that is, members of the panel will take the process seriously, which Halcomb et al8 suggest adds to the validity of this consensus method. Lastly, this particular consensus method delivers rapid results.
On the other hand, little research has been done regarding the methods of this particular approach, the advantages and limitations of the method, and the reliability and validity of outcomes from consensus development conferences.8 It is important to note that while the NIH successfully holds consensus development conferences and ultimately produces clear and concise consensus statements, the resources required to run such an event successfully may not be feasible for many researchers, especially when factoring in both cost and time.8 Additionally, because of the logistics of organizing the event and obtaining a moderator and a location for the conference itself, this method can be quite expensive and requires excellent organization.1 Lastly, one important disadvantage of consensus development conferences is the possible introduction of bias by overly vocal members of the panel.
Considering the advantages and limitations of consensus development conferences along with a review of the limited number of articles using this consensus method, we decided on a set of guidelines for using the consensus development conference method in academic medical research as well as suggestions for reporting:
- Panel composition: the panel should be made up of experts in the field; the publication should report on how they were chosen and why;
- Panel size: the panel should have between 8 and 12 members, with Nair et al3 suggesting 10 as the optimum number; and
- Statistical analysis: must be reasonable for the research question, and should be as rigorous as possible. Explain what constituted consensus and how this was assessed.
Delphi Technique
Defined by Bloor and Wood9 as “a method for achieving consensual agreement among expert panelists through repeated iterations of anonymized opinions and of proposed compromise statements from the group moderator,” the Delphi technique is named for the Oracle at Delphi. It uses multiple rounds of questionnaires and ultimately aims to reach a consensus using the opinions and feedback of experts in a given field. Originally developed by the RAND Corporation in 1948 as a way to forecast military events, this consensus method has been used with increasing frequency across multiple fields, including medical education and medicine itself, primarily as a way to develop diagnostic criteria within clinical settings.10,11 With the advent of the Internet and the ease of electronic mailing systems, this method has become overwhelmingly popular with researchers across multiple fields because it is both time- and cost-efficient.12–18
Despite shifts in how questionnaires are distributed, the basic methodology has remained the same since its inception: the respondents are to be experts in their field (and the study may require a multidisciplinary set of experts); their responses are to remain anonymous; there must be two or more rounds before a consensus can be reached; and responses from each round should be analyzed by the researchers and reported back to respondents after each round. Although this is similar to the other consensus methods previously described, in this approach the panel of experts never interacts, and each panel member is unaware of who else is responding to the prompts. Occasionally, researchers report their findings using the term “modified Delphi” to describe their method. It appeared to us that “modified Delphi” is a nonspecific term with no concise definition, because each study uses different modifications. Many of the publications we reviewed described modifications of a traditional Delphi technique, just not explicitly labeled as such.
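The iterative core just described (anonymous experts, at least two rounds, aggregated feedback reported back between rounds) can be sketched as a small driver loop. This is a hypothetical skeleton, not an implementation from the literature; the callable-based panel, the feedback format, and the four-round cap are our assumptions:

```python
def run_delphi(panel, questionnaire, analyze, reached_consensus, max_rounds=4):
    """Skeleton of a Delphi study.

    panel:             one callable per (anonymous) expert, mapping
                       (questionnaire, feedback) to that expert's responses.
    analyze:           aggregates one round's responses into the feedback
                       that is reported back to every respondent.
    reached_consensus: predicate on the aggregated feedback.
    """
    feedback = None  # no feedback exists before the first round
    for round_num in range(1, max_rounds + 1):
        responses = [expert(questionnaire, feedback) for expert in panel]
        feedback = analyze(responses)
        # The method requires at least two rounds before declaring consensus.
        if round_num >= 2 and reached_consensus(feedback):
            return round_num, feedback
    return max_rounds, feedback
```

In a real study, `analyze` would compute the round's descriptive statistics and `reached_consensus` would encode the researchers' predetermined benchmark.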
This method’s utility in academic medicine is primarily the establishment of decision-making criteria/protocol.6 For instance, Lindsay et al19 used two rounds in their modified Delphi with each round containing a literature review of different medical conditions that are commonly seen in emergency departments. In the first round, these conditions were matched to potential outcomes (e.g., acute myocardial infarction paired with mortality); the panel was asked to rank the likelihood that each condition would result in a particular outcome. In the second round, the expert panel met face-to-face in the style of a nominal group to discuss their ratings for the current questionnaire as well as the results from the first round.19 Ultimately, this was to help clinicians make better decisions based on common scenarios encountered in the emergency department; future Delphi studies could be used for similar purposes.
In an example of a modified Delphi technique that deals with the curriculum in a medical school setting, Turner and Weiner20 surveyed a panel of experts on addressing pain management in medical education. The first-round questionnaire asked the expert panel to rate different topics for consideration in a curriculum focusing on pain management. The main modification in their study was the second-round questionnaire’s focus. The second round asked the participants to rate how important knowledge, attitudes, and skills were for each topic that passed the rigors of the first round.20 The purpose of the study was to gain consensus on what the most important topics within the curriculum were and then to address which aspects of each topic deserved the most attention within the medical curriculum.
The Delphi technique can be used for various purposes in an academic medical setting—from developing decision-making protocols to deciding the weight of certain topics in a medical school curriculum. There are several advantages to using the Delphi technique to obtain consensus—namely, that it eliminates the bias and influence that can occur in face-to-face meetings, as the respondents are to remain anonymous. This allows respondents’ opinions to be expressed more freely, without fear of reproach or loss of credibility in their field. In addition, respondents are less likely to “jump on the bandwagon” if their views are not in line with the majority. To ensure an actual consensus of opinions rather than a compromised decision, multiple rounds allow thoughtful consideration, and the ranking of each item by the entire response group helps make the ultimate conclusions more reliable than those of a single meeting.21,22 Lastly, in contrast to the fatigue that can accompany conferences and face-to-face meetings, the Delphi technique does not require specified meeting times, which allows respondents to make their decisions thoughtfully and respond when they are ready.21
Although there are many advantages to using the Delphi technique in academic medicine research, there are arguably some important drawbacks. First, judgments in the second (and subsequent) rounds may be influenced by feedback given by researchers over the course of the rounds because overall feedback is given to each participant. Second, there is a decided lack of collaboration coupled with the increased potential for participant burnout as the number of rounds increases. Lastly, the success of these studies is largely dependent on the quality of the questionnaire design.
Unlike for the nominal group process and consensus development panel methods, we found some literature addressing the methodology of the Delphi technique. For instance, Dagenais23 reported a monotonic increase in the reliability of Delphi results as the size of the panel increased; his study ended at 11 panel members, with a reliability index of 0.76. Although this sort of statistical analysis is valuable in designing Delphi studies, that study did not comment on whether reliability fluctuated as the number of rounds changed.23 Conversely, Nair et al,3 in a publication focused on diagnostic criteria and guideline development, suggest that the number of experts on a panel “can be hundreds, but at least 10–30,” yet note in the same publication that a panel smaller than 6 members has limited reliability and that any group over 12 yields an insignificant increase in reliability.3
In addition to the two studies addressing reliability as it pertains to panel size, one study made a long-term inquiry into the accuracy of the consensus garnered from a group of experts. Anderson et al24 used 600 participants to conduct their study, though they never mention why so large a panel was needed. Another study, aiming to develop a framework of criteria for patient decision aids, cited using 12 separate panels, each with five members, to garner consensus.25 The research question in that study called for a multidisciplinary panel, but the reasoning behind the specific number of panel members is omitted. It seems that panel size in most Delphi studies is highly variable; its reliability can be measured, but the size often appears to reflect the initial research question rather than a decision based on how statistically reliable researchers want their results to be.
Nair et al3 suggest three ways to define consensus: a predetermined agreement percentage (e.g., 80%), a rating scale of 1 to 5 for each topic, or a majority of participants must rate a topic for inclusion.3 Another study determined consensus by measuring the means of the interquartile range but did not give a definitive value that would suggest that a topic had reached consensus.26
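The first two of these definitions can be computed directly from a round's responses. The sketch below uses the 80% threshold from the example above; treating a rating of 4 or 5 on a 1-to-5 scale as agreement is our own illustrative assumption:

```python
def percent_agreement(responses, approve=lambda r: r >= 4):
    """Fraction of panelists whose rating counts as agreement
    (here, hypothetically, a 4 or 5 on a 1-to-5 scale)."""
    return sum(1 for r in responses if approve(r)) / len(responses)

# Ten panelists rate a proposed criterion on a 1-to-5 scale.
ratings = [5, 4, 4, 5, 4, 3, 5, 4, 4, 5]
agreement = percent_agreement(ratings)        # 9 of 10 panelists agree
print("consensus" if agreement >= 0.80 else "no consensus")
# prints "consensus"
```

The same tally with a different `approve` predicate covers the majority-vote definition as well.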
Although there was a decided lack of solid statistical techniques in a majority of the studies we reviewed, some were very descriptive of their methods. Raine et al27 assessed multiple groups’ median scores with a weighted kappa statistic. This particular publication compared outcomes of different groups in concurrent Delphi studies, so they were able to compare and contrast how groups responded; they also used ANOVA to compare the different results garnered by each group as well as a Bonferroni adjustment to analyze significant differences between groups.27
Another study, focusing on the Delphi technique in health sciences education research, described its analytical techniques: Researchers collected the data, computed means and standard deviations, and then flagged topics whose means fell within certain ranges to determine consensus. Although not incredibly complicated, the study details why certain ranges were used and how researchers determined which topics would remain on future questionnaires. Specifically, on a four-point scale, mean values between 2 and 3 indicated uncertainty, values near 3.5 indicated positive certainty, and values near 1 indicated negative certainty; small standard deviations pointed to a degree of certainty (either positive or negative). Topics that scored within the 2–3 range were considered still lacking a consensus and were therefore carried forward to the next questionnaire.10 Another study defined consensus as 100% agreement on criteria; however, this study was rather small and consisted of relatively few, low-stakes topics.28
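A minimal sketch of that mean/standard-deviation flagging scheme follows. The band edges (means of 2–3 uncertain, means near 3.5 and near 1 certain) come from the study described above; the standard-deviation cutoff of 0.75 is a hypothetical choice of ours, since the study reports only that small standard deviations signaled certainty:

```python
from statistics import mean, stdev

def classify_topic(scores, sd_cutoff=0.75):
    """Classify one topic's ratings on a 4-point scale using the
    mean/standard-deviation scheme described above."""
    m, sd = mean(scores), stdev(scores)
    if 2 <= m <= 3:
        # Uncertain band: no consensus yet, so the topic is carried
        # forward to the next questionnaire.
        return "retain for next round"
    if sd > sd_cutoff:
        # A wide spread undermines certainty even outside the 2-3 band.
        return "retain for next round"
    return "positive certainty" if m > 3 else "negative certainty"

print(classify_topic([4, 3, 4, 4, 3, 4]))  # mean ~3.67: positive certainty
print(classify_topic([2, 3, 2, 3, 2, 2]))  # mean ~2.33: retain for next round
```
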
Many studies describe their methods for collecting data and state that they had a benchmark that would indicate consensus, but a description of the analytical techniques is absent from many studies.29–32 Because some form of statistical analysis must be done to determine which items should be reexamined in subsequent rounds, this lack of description, common to many of the studies we found, is particularly troubling. In addition, there is a wide range of criteria said to indicate consensus, and how these particular benchmarks were determined is also not addressed in many of the studies.
Given the lack of current research, we believe that the methodology used in subsequent studies should be described more thoroughly in the manuscript. Below are our recommendations for improving future Delphi studies and their reporting:
- Panel composition: experts in a given field—explain how “expert” was defined and method of obtaining participants;
- Panel size: 6 to 11 members recommended;
- Rounds: two are required, and this is the optimal number; if a study goes beyond two rounds, explain reason for doing so; and
- Statistical analysis: must be reasonable for research question, and should be as rigorous as possible. Explain what constituted consensus and how this was assessed.
Conclusions
Although we set out to determine the standard practice for conducting consensus methodology studies, we quickly realized that no such standard existed and decided to shift our focus to the most common practices, as well as those that were described in the most detail and analyzed most stringently. Ultimately, we hope that future studies will be conducted using more stringent standards, especially as they pertain to the reporting of design, methods, and results. Several publications we reviewed drew on examples from creating medical curricula and decision-making protocols; therefore, it is our belief that each method could be beneficial in addressing topics pertaining to academic medicine. Although a majority of the publications reviewed were from fields outside of academic medicine, we drew several parallels and suggested several potential uses for future research.
For each consensus methodology we have discussed, we developed a set of guidelines for future researchers specifically so that they could report more clearly how they designed their studies and why. For nominal group process, we concluded that the research question given to prompt the panel must be clear and concise, that the size of the panel could range from 5 to 10 members, that the panel should be as heterogeneous as possible, and that they should explain how their groups were chosen in their paper—specifically, the reasoning for each member, and what constituted consensus. In consensus development panels, we suggested that panels should be made primarily of experts in a field (again with reporting on how and why these members were chosen), with a panel size between 8 and 12 members, and that they use a statistical analysis appropriate for their study with an explanation of how and what determined consensus. Finally, for the Delphi technique, we suggested that the panel be made up of experts in a field (with an explanation of how expert was defined and how participants were recruited); a panel size of 6 to 11 members participating in an optimum number of two rounds (and if more rounds were conducted, reasons for doing so); and finally a statistical analysis appropriate for the research question, with a clear explanation of what consensus meant for their study.
References
1. Bowling A. Research Methods in Health: Investigating Health &amp; Health Science. New York, NY: McGraw Hill; 2009.
2. Clayton MJ. Delphi: A technique to harness expert opinion for critical decision-making tasks in education. Educ Psychol. 1997;17:373–386.
3. Nair R, Aggarwal R, Khanna D. Methods of formal consensus in classification/diagnostic criteria and guideline development. Semin Arthritis Rheum. 2011;41:95–105.
4. Jones J, Hunter D. Consensus methods for medical and health services research. BMJ. 1995;311:376–380.
5. Scott EA, Black N. Appropriateness of cholecystectomy in the United Kingdom—a consensus panel approach. Gut. 1991;32:1066–1070.
6. Horton NJ. Nominal group technique: A method of decision-making by committee. Anaesthesia. 1980;35:811–814.
7. Delp P, Thesen A, Motiwalla J, et al. Nominal group technique. In: Systems Tools for Project Planning. Bloomington, Ind: Pasitam; 1977:14–18.
8. Halcomb E, Davidson P, Hardaker L. Using the consensus development conference method in healthcare research. Nurse Res. 2008;16:56–71.
9. Bloor M, Wood F. Keywords in Qualitative Methods: A Vocabulary of Research Concepts. Thousand Oaks, Calif: Sage Publications; 2006.
10. de Villiers MR, de Villiers PJ, Kent AP. The Delphi technique in health sciences education research. Med Teach. 2005;27:639–643.
11. Bloor M, Sampson H, Baker S, Dahlgren K. Useful but no oracle: Reflections on the use of a Delphi group in a multi-methods policy research study. Qual Res. 2015;15:57–70.
12. Hsu C, Sanford BA. The Delphi technique: Making sense of a consensus. Pract Assess Res Eval. 2007;12:1–8.
13. Avouac J, Huscher D, Furst DE, Opitz CF, Distler O, Allanore Y; EPOSS Group. Expert consensus for performing right heart catheterisation for suspected pulmonary arterial hypertension in systemic sclerosis: A Delphi consensus study with cluster analysis. Ann Rheum Dis. 2014;73:191–197.
14. Baumann MH, Strange C, Heffner JE, et al.; AACP Pneumothorax Consensus Group. Management of spontaneous pneumothorax: An American College of Chest Physicians Delphi consensus statement. Chest. 2001;119:590–602.
15. Brown DW, Gauvreau K, Powell AJ, et al. Cardiac magnetic resonance versus routine cardiac catheterization before bidirectional Glenn anastomosis: Long-term follow-up of a prospective randomized trial. J Thorac Cardiovasc Surg. 2013;146:1172–1178.
16. Graham B, Regehr G, Wright JG. Delphi as a method to establish consensus for diagnostic criteria. J Clin Epidemiol. 2003;56:1150–1156.
17. Okoli C, Pawloski SD. The Delphi method as a research tool: An example, design considerations and applications. Inform Manage. 2004;42:15–29.
18. Tammela O. Applications of consensus methods in the improvement of care of paediatric patients: A step forward from a “good guess.” Acta Paediatr. 2013;102:111–115.
19. Lindsay P, Schull M, Bronskill S, Anderson G. The development of indicators to measure the quality of clinical care in emergency departments following a modified-Delphi approach. Acad Emerg Med. 2002;9:1131–1139.
20. Turner GH, Weiner DK. Essential components of a medical student curriculum on chronic pain management in older adults: Results of a modified Delphi process. Pain Med. 2002;3:240–252.
21. Schmidt RC. Managing Delphi surveys using nonparametric statistical techniques. Decis Sci. 1997;28:763–774.
22. Tersine RJ, Riggs WE. The Delphi technique: A long-range planning tool. Bus Relat. 1976;4:51–56.
23. Dagenais F. The reliability and convergence of the Delphi technique. J Gen Psychol. 1978;98:307–308.
24. Anderson JK, Parenté FJ, Gordon C. A forecast of the future for the mental health profession. Am Psychol. 1981;36:848–855.
25. Elwyn G, O’Connor A, Stacey D, et al.; International Patient Decision Aids Standards (IPDAS) Collaboration. Developing a quality criteria framework for patient decision aids: Online international Delphi consensus process. BMJ. 2006;333:417.
26. Landeta J. Current validity of the Delphi method in social sciences. Technol Forecast Soc. 2006;73:467–482.
27. Raine R, Sanderson C, Hutchings A, Carter S, Larkin K, Black N. An experimental study of determinants of group judgments in clinical guideline development. Lancet. 2004;364:429–437.
28. Williams PL, Webb C. The Delphi technique: A methodological discussion. J Adv Nurs. 1994;19:180–186.
29. Wedemeyer DJ. Forecasting Communication Needs, Supplies and Rights for Policy Making and Planning in the State of Hawaii. Los Angeles, Calif: University of Southern California; 1978.
30. Steinert M. A dissensus based online Delphi approach: An explorative research tool. Technol Forecast Soc. 2009;76:291–300.
31. Wiegand DM, Chen PY, Hurrell JJ Jr, et al. A consensus method for updating psychosocial measures used in NIOSH health hazard evaluations. J Occup Environ Med. 2012;54:350–355.
32. Campbell SM, Braspenning J, Hutchinson A, Marshall M. Research methods used in developing and applying quality indicators in primary care. Qual Saf Health Care. 2002;11:358–364.
References cited only in Table 1
33. Fink A, Kosecoff J, Chassin M, Brook RH. Consensus methods: Characteristics and guidelines for use. Am J Public Health. 1984;74:979–983.
34. Rowe G, Wright G. The Delphi technique as a forecasting tool: Issues and analysis. J Forecast. 1999;15:353–375.