Can the Strength of Candidates Be Discriminated Based on Ability to Circumvent the Biasing Effect of Prose? Implications for Evaluation and Education

Eva, Kevin W.; Wood, Timothy J.

Papers: Words That Influence Judgment

Purpose. Residents have greater confidence in diagnoses when indicative features are presented in medical terminology. The current study examines the implications of this result by assessing its relationship to clinical ability.

Method. Candidates writing the Medical Council of Canada’s Qualifying Examination completed six questions in which the terminology used was manipulated. The influence of aptitude was examined by contrasting groups based on performance on the medicine section of Part I.

Results. The difference between the candidates was greatest in the mixed conditions in which the features consistent with one diagnosis were presented in medicalese and those consistent with a second diagnosis were presented using lay terminology; weaker candidates were more biased by language than stronger candidates.

Conclusions. The results suggest that the language used in presenting case histories will influence the reliability of medical examinations. Furthermore, they suggest that weaker candidates might benefit from practice in making the translation between lay terminology and medicalese.

Washington, DC, November 9-12, 2003

Moderator: Hilary J. Schmidt, PhD

Correspondence: Kevin Eva, Room 101, T-13, 1280 Main Street West, McMaster University, Hamilton, ON L8S 4K1 Canada; email: evakw@mcmaster.ca.

Various models of clinical expertise have used differences in the use of medical terminology as an indicator of internal knowledge representations. For example, Bordage and Lemieux argued that the use of semantic qualifiers (e.g., acute versus chronic, distal versus proximal) allows clinicians to organize the clinical signs and symptoms of a case onto semantic axes, thereby allowing use of the axes to facilitate accurate diagnosis.1 Similarly, Schmidt and Boshuizen argued that expert physicians are more likely to encapsulate clinical information into broader medical terms than are those with intermediate levels of expertise.2 Often at issue has been the direction of causality: does the use of medical terminology increase diagnostic accuracy, or does diagnostic accuracy increase the likelihood that medical terminology will be used?

To examine this issue, Eva and colleagues studied the influence of “medicalese” on residents’ construal of case histories.3 In one version of a case history that was indicative of both lung cancer and bronchitis, for example, features consistent with lung cancer were presented using lay terminology (e.g., coughing up blood) and features consistent with bronchitis were presented using medicalese (e.g., episodes of fever and chills). In another version, the reverse was true; features consistent with lung cancer were presented using medicalese (e.g., hemoptysis) and features consistent with bronchitis were presented using lay terminology (e.g., feeling hot and cold). When participants were asked to assign probability ratings indicating their degree of belief in the diagnostic hypotheses, participants were more confident in diagnoses when the features were presented using medicalese.

Not studied by Eva and colleagues was the relationship between the magnitude of this effect and the level of expertise; all participants were medical residents, and no attempt was made to differentiate participants on the basis of aptitude. Although medicalese is the language typically used to describe features during training, as students gain more experience, the presentation of features becomes more varied than that found in textbooks, and the language used to describe the features tends to be less technical. This change could cause the effect of terminology to be smaller in residents (or more competent students) relative to novices (or less competent students). Alternatively, medicalese might simply hold greater prestige.4 This explanation also predicts an effect of expertise if we assume that the extent to which medicalese maintains special meaning declines with experience or ability level. Such an effect would potentially provide medical educators with insight regarding the role language plays in medical diagnosis and would have consequences for evaluation.

If language affects the likelihood of candidates demonstrating their “true ability,” language must be taken into account when writing test items. Item-writing techniques used for high stakes licensing examinations have tried to account for language effects by emphasizing the use of clinical vignettes.5,6 The features presented in these vignettes are often described in terms of clinical presentations rather than medical terminology. The rationale for this item-writing technique is that clinical presentation is expected to provide a more valid assessment of candidates’ abilities. Despite this emphasis, the impact language has on the ability of test questions to discriminate performance level has not been adequately demonstrated. Previous research has suggested that language may have some effect on the difficulty of a question but little effect on its reliability.7 This study, however, was designed to investigate the effect of “window dressing” on item performance. As a result, factors like language, length of vignette, and the number of extraneous features were confounded with one another, making it difficult to specify the precise source of the result.

Part I of the Medical Council of Canada’s Qualifying Examination (MCCQE Part I) is a convenient place to test the hypothesis that the language used will impact candidates’ ability to reveal their true competence. Candidates attempt this examination at the end of medical school, and a wide range of diagnostic ability is observed. In study questions, the language used to describe the clinical features was manipulated to determine whether terminology can influence the ability of the question to discriminate performance levels of candidates.

Experimental Hypotheses

We predicted that the mixed conditions, in which features consistent with one diagnosis are presented in lay terminology and those consistent with a competing diagnosis are presented in medicalese, would reveal a significant interaction between cohort and language, with low-ability candidates showing a larger biasing effect of medicalese than high-ability candidates. All-lay and all-medicalese conditions served as baseline conditions.

Method

Participants

The MCCQE Part I is a prerequisite to licensure in Canada. It is a written exam that consists of a computer-based adaptive multiple-choice question exam and a set of key feature short answer and short menu questions called the clinical reasoning skills (CRS) component. Questions were added to the CRS component of the exam for the purpose of this study. Participants included all 2,036 candidates who attempted the Spring 2002 MCCQE Part I.

Material

The six cases used by Eva et al.3 were adapted for use in the current study by adding an all-lay and an all-medicalese version to the existing cases. Each case has been shown to be indicative of two diagnoses.8 Three features consistent with each diagnosis were manipulated. In the all-lay condition, both sets of features were presented in lay terminology; in the all-medicalese condition, both sets of features were presented using medical terminology. In one of the two mixed conditions, the features consistent with diagnosis A were presented using lay terminology whereas the features consistent with diagnosis B were presented in medicalese. In the second mixed condition, the reverse occurred. These two mixed conditions are simply counterbalancing conditions and, therefore, were combined for the purpose of data analysis.

Design and Procedure

The six questions were structured to resemble typical CRS questions. For each case, participants were asked to read the vignette and list up to three potential diagnoses in order of likelihood. French and English versions of each case were created. The conditions were randomly assigned to candidates.

These questions were treated as pilot items and did not contribute to candidates’ examination score. For the purpose of this study, synonyms were recoded with the aid of a clinician, and responses were grouped into diagnostic categories consistent with diagnosis A, diagnosis B, or other. For each candidate, we calculated the number of times the diagnosis indicated by lay features was named as one of the three most plausible diagnoses and the number of times the diagnosis indicated by medicalese features was named as one of the three most plausible diagnoses.
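The per-candidate tally described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' scoring code: the `CaseResponse` record, its field names, and the example responses are hypothetical, and it assumes that responses have already been recoded into the categories "A", "B", or "other".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaseResponse:
    """One candidate's recoded answer to one mixed-condition case (hypothetical record)."""
    lay_diagnosis: str         # category ("A" or "B") indicated by the lay-terminology features
    medicalese_diagnosis: str  # category indicated by the medicalese features
    named: List[str]           # up to three recoded categories, in order of likelihood


def count_named(responses: List[CaseResponse]) -> Tuple[int, int]:
    """Return (lay count, medicalese count) across a candidate's cases.

    A diagnosis counts once per case if it appears anywhere among
    the (up to three) diagnoses the candidate listed for that case.
    """
    lay = sum(r.lay_diagnosis in r.named[:3] for r in responses)
    medicalese = sum(r.medicalese_diagnosis in r.named[:3] for r in responses)
    return lay, medicalese


# Hypothetical candidate who answered three mixed-condition cases.
example = [
    CaseResponse("A", "B", ["A", "other", "B"]),  # both diagnoses named
    CaseResponse("B", "A", ["B"]),                # only the lay-indicated one
    CaseResponse("A", "B", ["other", "other"]),   # neither named
]
lay_count, medicalese_count = count_named(example)
```

Dividing each count by the number of cases gives the proportion of times each type of diagnosis was generated.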

To allow for an examination of the interaction between “language used” and aptitude, aptitude was defined as a function of percentile rank based on the scores candidates received on the internal medicine (IM) component of the MCCQE Part I. Candidates who scored in the 90th percentile or above constituted the high performance group (n = 213) and those in the 10th percentile or below constituted the low performance group (n = 215). Although these cutoffs are arbitrary, the pattern of data described was the same when the analysis was repeated using more encompassing definitions (e.g., 75th percentile or greater and 25th percentile or lower).
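As a rough illustration of this grouping step, the following Python sketch assigns candidates to aptitude groups from percentile ranks. It rests on assumptions that the text does not specify: percentile rank is taken to be the percentage of candidates scoring at or below a given score, and the scores shown are made up for the example.

```python
from bisect import bisect_right
from typing import List, Optional


def percentile_ranks(scores: List[float]) -> List[float]:
    """Percentile rank of each score: percentage of all scores at or below it (assumed definition)."""
    ordered = sorted(scores)
    n = len(ordered)
    return [100.0 * bisect_right(ordered, s) / n for s in scores]


def aptitude_group(rank: float, high_cut: float = 90.0, low_cut: float = 10.0) -> Optional[str]:
    """Label a candidate 'high', 'low', or None (excluded from the aptitude contrast)."""
    if rank >= high_cut:
        return "high"
    if rank <= low_cut:
        return "low"
    return None


# Hypothetical IM component scores for 20 candidates: 400, 410, ..., 590.
scores = list(range(400, 600, 10))
groups = [aptitude_group(r) for r in percentile_ranks(scores)]
```

Passing `high_cut=75.0` and `low_cut=25.0` gives the more encompassing grouping mentioned above.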

Finally, demographic data, including age, gender, graduation year, number of times sitting the MCCQE, and site of training, were collected to allow for an assessment of the extent to which any significant findings might be attributable to these variables.

Results

Participants routinely provided a full complement of three diagnoses in response to each of the six test questions (1,545/2,036 = 75.9%). The MCCQE Part I scores for candidates assigned to each study condition were analyzed to ensure that there were no baseline differences between candidates. The Part I score (medicine component) for candidates in the all-lay condition (M = 506), the mixed condition (M = 503), and the all-medicalese condition (M = 509) did not differ (p > .05). Similarly, after converting the candidates’ score across the six study items to a percentage correct, the overall mean scores for the three conditions did not differ (all-lay: M = 68%, mixed: M = 70%, all-medicalese: M = 70%; p > .05). The internal consistency (Cronbach’s α) for the six study items was 0.24 in the all-lay condition, 0.27 in the mixed conditions, and 0.11 in the all-medicalese condition. Using the Spearman-Brown prophecy formula, it would take 76 items, 66 items, and 194 items, respectively, to reach an internal consistency of 0.80.
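The projected test lengths follow from the Spearman-Brown prophecy formula, under which a test lengthened by a factor k has reliability kα / (1 + (k − 1)α). The Python sketch below solves for k and applies it to the rounded α values reported above; the function and variable names are illustrative only.

```python
def items_needed(alpha: float, n_items: int, target: float = 0.80) -> float:
    """Number of items needed to reach `target` reliability (Spearman-Brown prophecy formula)."""
    # Solve target = k*alpha / (1 + (k - 1)*alpha) for the lengthening factor k.
    k = target * (1.0 - alpha) / (alpha * (1.0 - target))
    return k * n_items


# Cronbach's alpha values as reported (rounded to two decimals) for the six study items.
reported_alpha = {"all-lay": 0.24, "mixed": 0.27, "all-medicalese": 0.11}
projected = {cond: items_needed(a, n_items=6) for cond, a in reported_alpha.items()}

for cond, n in projected.items():
    print(f"{cond}: about {n:.1f} items")
```

With the rounded inputs, this gives about 76 items for the all-lay condition and about 194 for the all-medicalese condition, matching the reported figures, and about 65 for the mixed conditions. The small gap from the reported 66 presumably reflects rounding of α.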

To examine the impact on performance of the interaction between “language used” and aptitude, we first performed a three (condition: all-lay, all-medicalese, mixed) × two (aptitude: 90th percentile versus 10th percentile) analysis of variance (ANOVA) on the number of times in which each diagnosis was generated. This analysis revealed a significant main effect of aptitude [MHigh = 72.9%, MLow = 60.0%; F(1,422) = 111.82, MSE = 0.01, p < .001] and a significant main effect of condition [F(2,422) = 4.08, MSE = 0.01, p < .05], with more “correct” diagnoses named in the all-medicalese group (M = 69.5%) relative to either the all-lay group (M = 65.7%) or the mixed group (M = 65.1%). The all-lay group and mixed group did not differ significantly from one another, and the condition × aptitude interaction was not significant.

To explore the effect of language further, a subsequent analysis of the number of lay diagnoses and medicalese diagnoses that were listed was conducted for the mixed and the consistent conditions separately. The proportion of times in which each diagnosis was generated and the relevant statistics are shown in Figures 1 and 2. We first performed a two (language: lay versus medicalese) × two (aptitude) ANOVA on the consistent conditions, in which all features were presented in the same terminology (i.e., the all-lay and all-medicalese conditions). This is a between-subjects analysis because individuals in the consistent conditions were assigned to view either all-lay terminology or all-medicalese terminology. This analysis revealed a significant main effect of aptitude (i.e., better candidates, defined by performance on the MCCQE, were more likely to answer study items correctly relative to poorer candidates; effect size = 0.9) and a significant main effect of language (i.e., diagnoses were more likely to be generated when indicative features were presented in medicalese relative to when the same features were presented in lay terminology; effect size = 0.2) but no language × aptitude interaction (Figure 1). In contrast, a mixed two × two ANOVA was performed on the mixed conditions, with the language in which features were presented (lay versus medicalese) treated as a within-subject factor (individuals in the mixed conditions contributed data to both the “lay” diagnosis and the “medicalese” diagnosis conditions) and aptitude treated as a grouping factor. This analysis revealed no main effect of language but did reveal a significant main effect of aptitude and a significant language × aptitude interaction (i.e., the difference between good and poor candidates was greater for diagnoses indicated by “medicalese” features, effect size = 1.3, than for those indicated by “lay” features, effect size = 0.7) (Figure 2).

Figure 1. The proportion of times correct diagnoses were named as a function of the Language used to present the diagnostic features, Condition, and Aptitude.* — High (> 89th percentile), – – – Low (< 11th percentile). *Error bars equal standard error of the mean. Consistent Conditions: Main Effect of Aptitude - F(1,230) = 55.8, MSE = 0.51,

Figure 2. The proportion of times correct diagnoses were named as a function of the Language used to present the diagnostic features, Condition, and Aptitude.* — High (> 89th percentile), – – – Low (< 11th percentile). *Error bars equal standard error of the mean. Mixed Conditions: Main Effect of Aptitude - F(1,192) = 69.87, MSE = 0.03,

Graduation year, gender, number of times sitting the MCCQE (first versus repeat), site of training [Canadian Medical Graduates (CMG) versus International Medical Graduates (IMG)], and age were entered into a regression equation to examine the factors predicting aptitude. Older individuals (β = −0.453, p < .001), IMGs (β = 0.303, p < .001) and repeat candidates (β = 0.148, p < .01) were all statistically predictive of being in the bottom decile relative to the top decile. Inclusion of these variables as covariates in the analysis of language and aptitude had no impact on the results.

Discussion

Consistent with the results of Eva et al.,3 when all of the candidates were considered, participants were equally likely to generate diagnostic hypotheses regardless of the language used to describe the features presented in the case history. Interestingly, the overall reliability of the cases was lowest in the all-medicalese condition and highest in the mixed condition. When aptitude was considered, weaker students were more influenced by language than stronger students, although in a different manner than initially hypothesized. In the consistent conditions, weaker candidates and stronger candidates were equally more likely to name the correct diagnosis when medicalese was used to describe the features compared with when lay terminology was used. In the mixed conditions, stronger candidates showed the same tendencies, but weaker candidates were more likely to name the diagnosis consistent with the lay language than the one consistent with the medicalese. This effect is particularly surprising given (a) the high stakes nature of the examination and (b) the lack of a correction factor.

Implications for Evaluation

Item-writing initiatives emphasize the use of clinical vignettes primarily due to the intuition that simpler language will provide a more valid assessment of a candidate’s “true ability.” The results of the current study demonstrate the benefit of using lay language to present the details of a clinical case. The reliability of the six cases was lowest when all features were presented in medicalese. In addition, the presence of a significant interaction between language and aptitude within the mixed conditions supports the idea that the language used in presenting case histories will influence the validity of an examination, albeit in a more sophisticated manner than that previously hypothesized. Relative to the baseline conditions in which the language used was consistent for both sets of features, the poor-performing candidates were more affected by mixed language cases than were high-performing candidates. It appears as though this pattern of results cannot be attributed to demographic differences between the high performance and low performance groups or to the candidates assigned to each condition. There were no significant differences in Part I performance for the candidates assigned to each group, and the inclusion of demographic variables such as IMG versus CMG, repeaters, and age as covariates did not eliminate the significance of the interaction observed in the mixed conditions.

Implications for Education

The greatest difference in mean scores between the high performance group and the low performance group occurred in the ability to generate diagnoses indicated by features presented in medicalese when features indicative of a competing diagnosis are presented in lay terminology; this difference was 50% larger than the other three low versus high comparisons illustrated in Figures 1 and 2. Psychologically, such an effect might occur if poorer candidates have greater difficulty mentally aligning medicalese with diagnostic conditions; poorer candidates might be able to exert the cognitive effort to allow an adequate diagnosis of medicalese features, but the presence of lay terminology supportive of a competing diagnosis appears to preclude expenditure of such effort. This possibility is purely speculative at this point, but it warrants further testing because it might suggest that practice in making the translation between lay terminology and medicalese can help overcome the biasing effect of language.

Finally, it should be noted that although these results are relevant and potentially informative with regard to understanding the organization of internal knowledge structures and their relation to medical expertise, they do not directly inform the specific debate over whether the use of semantic axes or encapsulations contributes to expertise or vice versa. We followed the lead of Eva et al.3 and used four methods of transformation when creating medicalese and lay versions of each manipulated feature, only one of which corresponded to the presence or absence of semantic axes. This strategy increased the flexibility with which features could be described. In principle, it would be interesting to know if some forms of medicalese are more influential than others, but because of limitations in the number of cases that could be used, we opted against manipulating all forms of medicalese independently for this initial investigation into the impact of language on the characteristics of evaluation tools.

Order of authorship on this paper was determined arbitrarily. Both authors would like to thank David Blackmore and Patricia O’Sullivan for many useful comments provided during the completion of this work, John Cunnington for assistance in data coding, and Andre-Phillipe Boulais, Martine Trudeau, and David Miller for their help in adding the study questions to the examination. This work was supported by a Medical Council of Canada Research Grant.

References

1. Bordage G, Lemieux M. Semantic structures and diagnostic thinking of experts and novices. Acad Med. 1991;66:S70–2.
2. Schmidt HG, Boshuizen HP. On the origin of intermediate effects in clinical case recall. Mem Cognit. 1993;21:338–51.
3. Eva KW, Brooks LR, Norman GR. Does “shortness of breath” = “dyspnea”?: The biasing effect of feature instantiation in medical diagnosis. Acad Med. 2001;76:S11–3.
4. Norman GR, Arfai B, Gupta A, Brooks LR, Eva KW. The privileged status of prestigious terminology: impact of “medicalese” on clinical judgments. Acad Med. 2003;78:000–000.
5. The Medical Council of Canada. Objectives for the Qualifying Examination. 2nd ed. Ottawa, Ontario, Canada: Canso Printing Services, 1999.
6. Case SM, Swanson DB. Constructing written test questions for the basic and clinical sciences. 3rd ed. Philadelphia, PA: National Board of Medical Examiners, 2001.
7. Case SM, Swanson DB, Becker DF. Verbosity, window dressing, and red herrings: do they make a better test item? Acad Med. 1996;71:S28–30.
8. Cunnington JPW, Turnbull JM, Regehr G, Marriott M, Norman GR. The effect of presentation order in clinical decision making. Acad Med. 1997;72:S40–2.
© 2003 by the Association of American Medical Colleges