In 1995, the Society of Cardiovascular Anesthesiologists (SCA) created a task force with the responsibility of designing and implementing an assessment test for perioperative transesophageal echocardiography (PTE). The task force members, who were chosen for their expertise in perioperative echocardiography, represented a cross-section of geographic and practice diversity. After a pilot examination in 1997, the task force offered the first official examination in 1998. The PTE examination committee is now a subcommittee of the National Board of Echocardiography, which also administers an examination of special competency in general adult echocardiography (1,2). Since it was initially introduced, more than 1200 applicants have taken the examination. This article describes the steps in the development of the certification examination and reviews the relationship between biographic data and performance on the examination over the last 5 yr. The history and rationale for the administration of the examination are reported elsewhere (1). Briefly, the primary objective of the examination is to provide practicing physicians with an opportunity to document a level of proficiency in PTE that can be measured with an objective standard. The test is a method to acknowledge individuals who are proficient, to establish a standard for knowledge and skills in PTE, to stimulate continuing education, and to identify strengths and weaknesses of training. Because both anesthesiologists and cardiologists perform PTE, the only prerequisite to sit for the examination is a valid medical license.
The ability of a test to measure knowledge in a content domain is an important consideration in test development. A test of PTE must measure skills and knowledge that are needed in the practice of PTE. The first step, therefore, in the development of the examination was to create a content outline (test specifications). This outline defined the content domain of the examination and was useful for the examinees, educational program directors, committee members, and item writers. The content outline (Table 1) was developed by the PTE task force, with the distribution of examination material determined by consensus of the committee with advice from a cardiologist (whose focus of practice included perioperative echocardiography). In addition to the basic principles of ultrasound, intraoperative application and interpretation of echocardiography were emphasized.
The second step in test development was writing the questions. Each committee member was assigned a specific category in the content outline. The sections of the content outline were allocated a percentage of emphasis reflected in the number of questions for the section (Table 1). The task force worked closely with invited guests from the American Board of Anesthesiology Examination Committee and other experts in question writing to improve question-writing skills. The questions were then reviewed for accuracy and quality. The National Board of Medical Examiners (NBME) assisted the members with items designed to measure special competencies for the PTE. A technical/medical editor from the NBME with experience in item-writing techniques reviewed the questions at various stages of development. The editor verified that the items were free of grammatical errors, misspellings, or inconsistent terminology. The editor also identified potential ambiguities in the items or duplications that may have cued answers to other items. Subsequently, at two meetings, the items were peer-reviewed for clinical significance, accuracy, relevance, and content domain. Editorial comments and queries were also answered at this time. A total of 201 questions were selected for the pilot test. The questions were distributed over the 23 major categories of the content outline shown in Table 1.
Of the 201 questions for the initial pilot examination, 160 (80%) were A-type questions (one best answer), and 41 (20%) were K-type questions (complex multiple choice). On the basis of recommendations of the NBME and psychometric analysis of question item quality, the PTE examinations eliminated the K-type questions in 1998.
The SCA conducted a pilot test of the PTE examination for 95 participants on May 9, 1997, in Baltimore. The candidates were representative of the intended examination population. After this process, statistical information was obtained on each item. The committee selected questions from the pilot test with the most desirable psychometric characteristics (e.g., discriminating performance level, difficulty level, and clarity) for the examination in the following year, thereby enhancing test validity and reliability.
In the first part of the examination (Book 1), multiple-choice test items accompanied 15 videotaped echocardiography cases; each case had from 2 to 4 test questions associated with it, for a total of 43 test items. At the beginning of each video case, the examinees were given time to review the test questions pertaining to that case. Each case was then presented twice, with a pause after each viewing, to allow the examinees ample time to read and answer the questions. After the 60-min video section, the participants completed one of two different examination books (Book 2), each of which contained multiple-choice questions. The content of the examination was evenly distributed across the two versions of the tests.
In addition to the test questions, pilot test participants were asked to answer a 10-item questionnaire after taking the examination. This questionnaire was designed to enable the committee to evaluate the examination itself. The responses were overwhelmingly positive (Table 2). The lowest ratings were given to the time permitted to view the videotape with each case study (64% positive) and the structure and focus of the K-type questions (69% positive). Both of these concerns have been addressed in subsequent examinations: more time has been allotted for viewing the video cases, and the K-type items have been eliminated. Overall, 91% of the candidates rated the quality of the examination excellent (34%) or good (57%). Information gathered from the questionnaire about the pilot examination was used to modify procedures for subsequent administrations of the examination.
Before scores were produced for the 1997 pilot test, a process termed “key validation” was conducted, in which item responses were recorded and a preliminary analysis of the questions was performed. This process identified questions that were miskeyed or were not functioning as expected. Responses of each participant to each question were recorded and statistically analyzed. Questions that appeared inconsistent with the discriminative power of the overall test to distinguish good from poor performance were identified for further review by the NBME staff and PTE committee members.
During the key validation process, 25 items were identified, on the basis of statistical analysis and content relevance, as candidates for deletion from scoring. For example, if an item was unable to discriminate the good performer from the poor performer because of an editorial or content error, it was eliminated. If an item was reviewed and determined to be acceptable, but perhaps just difficult, it was retained. Of the 25 items, 11 were deleted from the scoring process in the field test: 6 of the 160 A-type questions and 5 of the 41 K-type questions. The committee believed that the deletions enhanced the validity of the final scores.
The examinees’ responses were scored. Raw scores (number of items answered correctly) were converted to percentages and rounded to the nearest whole number. Although two different forms of Book 2 were used, the scores were considered equivalent because the two forms were similar in content, mean difficulty, and score distribution (Table 3).
The examination was considered reliable to the extent that the administration of a different sample of items from the same content domain would have resulted in little or no change to an examinee’s rank order in the group. The reliability of the examination was assessed by using a KR20 coefficient (0.93 for Form 1 and 0.92 for Form 2) that estimated the internal consistency of the scores (3).
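The KR20 coefficient cited above is the standard Kuder-Richardson Formula 20 estimate of internal consistency for dichotomously scored items. A minimal sketch in Python follows; the item-response matrix is purely illustrative, not actual examination data.

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores
    (rows = examinees, columns = items)."""
    k = responses.shape[1]            # number of items
    p = responses.mean(axis=0)        # proportion correct per item
    q = 1.0 - p
    total = responses.sum(axis=1)     # each examinee's raw score
    var_total = total.var(ddof=1)     # sample variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / var_total)

# Illustrative data: 6 examinees x 5 items (assumed values)
scores = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
])
print(kr20(scores))
```

Values near 1 (such as the 0.93 and 0.92 reported for the two forms) indicate that the items rank examinees consistently.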
The first operational PTE was administered to 243 candidates in April 1998. Subsequent tests were offered to 281 candidates in April 1999, to 350 candidates in May 2000, and to 345 candidates in May 2001. The first examination consisted primarily of the A-type items that had been field-tested in 1997 and new items that were included for validation, as well as additional video cases. For each subsequent year, new test items were prepared and validated to expand the pool of questions for the examination.
The video portion of the examination made the administration of the test complex because the committee could not ensure similar conditions each time the test was administered. Seating was aligned so that examinees could view the video monitors in front of them. To permit a good viewing angle and adequate distance between examinees, two tables with three examinees at each were set up in front of each monitor. The monitors were comparable in contrast and brightness. At least two committee members were present at each test to ensure that the color and contrast were appropriate for PTE videos.
Four and a half hours were allotted to complete the examination. The video portion took 60 min, and all examinees watched the same video in the same time frame; therefore, rewinding or pausing the tape was not possible. There was a brief break after the video, followed by the 3-h nonvideo portion of the examination.
Key validation procedures were performed for the 200 items in the first operational PTE examination. The initial item analysis identified six items for review. After several committee members and the NBME staff reviewed the items with performance statistics, five items were deleted from scoring. The examinees’ responses for each of the remaining 195 items were scored. Responses of each candidate to each item were entered into a computer program that applied item response theory (the Rasch model) to calculate two measures: item difficulty and candidate proficiency. The Rasch model gives the probability that a candidate will answer an item correctly by matching the item’s difficulty to the candidate’s proficiency level (4). The candidates’ proficiencies and item difficulties were calibrated on a common scale; therefore, the meaning of the candidates’ scores could be referenced directly to an item. Standardized scores were then calculated by transforming the candidates’ proficiency measures so that the score distribution had a mean of 500 and an sd of 100. The scale on which the 1998 standard scores were placed was the base reference scale. The candidates who took the PTE in 1998 were defined as the base reference group for this program.
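The Rasch model scoring described above can be sketched with its standard logistic form, together with the linear transform that places proficiency measures on a scale with mean 500 and sd 100. The reference-group values in the example are assumptions for illustration, not the actual 1998 calibration.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability that a candidate of proficiency theta answers an item
    of difficulty b correctly (both on the same logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standardize(thetas, ref_mean, ref_sd):
    """Transform proficiency measures so the reference group has
    mean 500 and sd 100 (as with the 1998 base scale)."""
    return [500 + 100 * (t - ref_mean) / ref_sd for t in thetas]

# A candidate whose proficiency equals an item's difficulty has a
# 50% chance of answering it correctly.
print(rasch_probability(0.0, 0.0))
# Illustrative reference values: mean 0.0 logits, sd 1.0 logits
print(standardize([1.0, -0.5], 0.0, 1.0))
```

Because item difficulties and candidate proficiencies share one logit scale, a standard score can be read directly against any calibrated item, as the text notes.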
Descriptive statistics for the examination are presented in Table 4. The P value of a question is the proportion of examinees who answered the question correctly and thus describes the observed difficulty of the question: easy questions have high P values, and difficult questions have low P values. The standard deviation describes the spread of item difficulty around the mean P value.
The discrimination index of a question is the correlation between the question score and the total examination score. Because a question is scored dichotomously (1 for correct, 0 for incorrect) and total scores are continuous, a point-biserial correlation coefficient was used. The discrimination index estimates the degree to which a question discriminates between examinees with high examination scores and examinees with low scores. Items with negative discrimination indices are suspected of content or structural flaws and are typically deleted from final scoring.
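The point-biserial index can be sketched as follows; it is algebraically the Pearson correlation between the 0/1 item score and the total score. The data below are assumed for illustration.

```python
import numpy as np

def point_biserial(item_scores: np.ndarray, total_scores: np.ndarray) -> float:
    """Point-biserial correlation between a dichotomous (0/1) item score
    and the continuous total examination score."""
    p = item_scores.mean()                       # proportion correct (item P value)
    m1 = total_scores[item_scores == 1].mean()   # mean total, correct group
    m0 = total_scores[item_scores == 0].mean()   # mean total, incorrect group
    return (m1 - m0) * np.sqrt(p * (1 - p)) / total_scores.std()

# Illustrative data (not from the examination): the three examinees who
# answered the item correctly also earned the highest total scores.
item = np.array([1, 1, 1, 0, 0, 0])
totals = np.array([90.0, 85.0, 80.0, 60.0, 55.0, 50.0])
print(point_biserial(item, totals))
```

A strongly positive value means high scorers tended to answer the item correctly; a negative value flags the structural or content flaws the text describes.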
NBME staff conducted a standard-setting study with 10 members of the PTE committee to determine the passing standard for the 1998 examination. A psychometrician was required to lead the group of content experts in determining the passing standard. To ensure a consistent definition of an acceptable level of competence, a criterion-referenced (content-based) approach, the modified Angoff procedure, was used, in which the content experts completed an item-by-item analysis of the examination (5). In this procedure, the committee members gave each question an initial estimate of the probability that a borderline candidate would respond correctly. The actual performance of the candidates on the question was then examined, after which the experts had the opportunity to revise their estimates. The probabilities were then averaged across all questions and committee members. After the item-by-item procedure, committee members provided global estimates of the minimum and maximum percentage of correct questions required to pass the examination. They also estimated the maximum and minimum failure rates with which they were comfortable (i.e., the Hofstee procedure). The 1999, 2000, and 2001 examinations were scored by using the process described previously for the initial 1998 examination. The scores on these examinations were placed on the 1998 reference scale by using the Rasch model equating procedure: common (linking) items were used to place items from different tests and different calibrations onto a common metric. The performance of several candidate groups was followed across time, so that scores from the three operational examinations could be compared.
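The averaging step of the modified Angoff procedure described above can be sketched as follows. The judge ratings are invented for illustration; the actual study used 10 judges across the full item set.

```python
def angoff_cut_score(ratings):
    """Modified Angoff: each judge estimates, per item, the probability
    that a borderline candidate answers correctly; the cut score is the
    average over all judges and items (a proportion-correct standard)."""
    per_judge = [sum(r) / len(r) for r in ratings]   # mean estimate per judge
    return sum(per_judge) / len(per_judge)           # average across judges

# Illustrative ratings: 3 judges x 4 items (assumed values)
ratings = [
    [0.70, 0.60, 0.80, 0.75],
    [0.65, 0.70, 0.85, 0.70],
    [0.75, 0.65, 0.80, 0.80],
]
print(angoff_cut_score(ratings))
```

The resulting proportion, considered alongside the Hofstee-style bounds on acceptable failure rates, informs a passing standard such as the 71% figure the committee adopted.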
The results of the standard-setting study and projected failure rates at various total raw and correct scores contributed to the committee’s decision that candidates master at least 71% of the content to pass. Application of the passing standard resulted in the failure of 24% of the 1998 total group. Table 5 shows the failure rates resulting from applying the passing standard for the four operational examinations, 1998 through 2001.
Test performance data were examined to determine the effects of training and experience on performance. The variables were length of dedicated training in echocardiography, type of dedicated training, time spent in echocardiography practice, number of echocardiograms performed or interpreted each week, primary clinical area, and primary setting (Table 6).
In general, higher scores were achieved by participants who had performed or interpreted at least six studies each week, who had dedicated training in echocardiography, who were in postgraduate year (PGY) 5 or more advanced, or who interrupted a practice temporarily for training. These data confirmed the effect of training and experience on performance and suggest that the examination could distinguish levels of skill (Figs. 1–3).
Passing rates for different groups are shown in Table 7. Examinees who performed or interpreted more than six echocardiograms weekly had the highest pass rates. Examinees at higher PGY levels tended to have higher standard scores. Examinees with at least 3 mo of formal training had higher standard scores than examinees with <3 mo of formal training. Analysis of variance showed that both PGY level of training (F = 3.4; P < 0.005) and number of examinations performed (F = 6.9; P < 0.0001) had effects on PTE scores.
Chi-square statistics and the contingency coefficient (measures of association between nominal variables) indicate the degree of association of the passing rate and the levels of the variable. The passing rate on the PTE was significantly correlated with PGY level of training and with the number of echocardiograms performed or interpreted per week (Table 7).
Each candidate was given a report of his or her score, a pass/fail classification, and feedback in the form of percentage of correct scores in the major areas.
Since its inception, the PTE examination has been administered to more than 1200 physicians, mostly anesthesiologists, who have chosen voluntarily to assess their level of knowledge in PTE by using an objective standard. Overall, more than 70% of physicians who have taken the examination have received a passing score. The examination content outline describes 6 major content domains and 23 smaller content areas that include the fundamentals of ultrasound imaging and the assessment of myocardial and valvular function, particularly as this pertains to the perioperative period. The half-day examination has included a 60- to 90-min videotape-interpretation segment and a 3-h multiple-choice written examination, delivered under the administration of the NBME.
Between 1998 and 2001, there was a progressive increase in the examination pass rate, suggesting an improvement in the quality of the candidates, reflected both in the statistical analysis of candidate performance and in fellowship training in cardiac anesthesiology that incorporates distinct curricula in PTE. To reassess the level of transesophageal echocardiography knowledge that should be acquired by a candidate who passes the PTE examination, a repeat standard-setting session was conducted in the fall of 2001. More than a dozen individuals, including members of the NBME staff and the PTE examination committee, participated in the standard setting. A new standard was determined and applied in 2002. This process was intended to maintain an examination that discriminates effectively between those physicians with and those without the requisite knowledge in PTE.
Other efforts to improve the PTE examination include the addition of written materials in color for testing and conversion to computerized testing. This would offer a number of advantages, including improved video presentation of digitized images, with a viewing speed and review capability that are under the control of the examinee. In addition, computerized testing could allow administration at multiple sites, either simultaneously or on different dates, making the examination more convenient, accessible, and affordable for candidates. In conjunction with the National Board of Echocardiography and the NBME, the PTE examination committee is working continually to improve the PTE examination and meet the needs of the medical community and the public.