To become an expert in any field, including medicine, an individual must learn content-specific facts and procedures and then understand how and when to apply that knowledge.1–4 However, there is substantial evidence that a critical additional step in the development of expertise is the incorporation of that knowledge into an elaborate and well-integrated framework or knowledge structure.1,2,4,5 In other words, as expertise grows through learning, training, experience, and reflective practice, the components of knowledge grow and become increasingly connected within a knowledge structure that supports higher-order thinking and problem-solving activities. In medical education, we measure content-specific and procedural knowledge throughout the course of physicians’ training using testing modalities such as written and OSCE examinations. On the other hand, knowledge structure, despite its importance, usually goes unmeasured because of the difficulty designing and administering objective assessment measures.
Research in secondary- and college-level science education suggests that concept mapping can provide a representation of a learner’s knowledge structure.4,6–11 In the most common form of concept mapping, learners demonstrate how concepts are related to each other by creating maps that link together related concepts using directed lines (showing the direction of relationship with an arrow) that are labeled with phrases that describe the relationship. Thus, a “concept map” is a network with concept terms as nodes, directed lines representing relationships, and labels on the lines indicating the nature of the relationships.12 Concept maps can provide a teacher a window into how learners organize and understand relationships among concepts.4,13
There is evidence that concept maps can be administered and scored in a reliable way in secondary science education classes.9,14 However, in medical education the reliability of concept maps is unknown. Previous work by one of the authors (D.C.W.) demonstrated that the knowledge structures of first- and third-year pediatric residents were significantly different, using a concept map scoring system that accounted for the structure of the concept map.15,16 However, in that study, the concept mapping system employed cannot be used on a large scale, and the scoring system did not capture the full complexity of the maps. To address these limitations, in this study, we sought to develop an alternative scoring method. In addition, we wanted to determine the reliability of our new and previous scoring systems using generalizability theory (G-theory). This is important because establishing the reliability of concept mapping assessment would allow us to test its validity in future studies and, ultimately, provide a quantitative tool to measure a critically important aspect (knowledge structure) of the development of expertise in physicians-in-training. It is also potentially important for practicing physicians as a measure of the maintenance and refinement of their expertise.
Study participants and design
All second- and third-year (senior) medicine (47 total) and pediatric (22 total) residents and fourth-year medical students (96 total) at the University of California–Davis School of Medicine were offered the opportunity to participate in this study. A total of 56 agreed to participate, which represents approximately 50% of senior residents in internal medicine (24) and pediatrics (12) and 20% of fourth-year medical students (20). Of the 56 participants who started the study, 52 completed all aspects of the study and were included in the analysis. Four participants failed to complete the study because of scheduling conflicts. The final sample consisted of 33 advanced learners (second- and third-year pediatric and internal medicine residents) and 19 early learners (fourth-year medical students). Demographically, the study population was 43% male, 44% white, and 38% Asian, which was very similar to the demographic makeup of the total resident and student population at our institution. Our institutional human subjects protection committee approved the study as exempt.
To learn how to make concept maps, each participant completed a one-hour standardized concept map training session performed by a single person as described previously.16 Briefly, training included an introduction to concept mapping followed by practice making concept maps on nonmedical and medical topics. Immediately after training, learners created concept maps about two medical conditions: asthma (AS) and diabetes mellitus (DM) (occasion one). Participants were given 30 minutes to complete each map. Participants created maps on a second occasion (occasion two) approximately four weeks later on the same two topics. They received no intervening instruction about AS or DM. Maps were coded so that the identity, specialty, and level of training of participants would be unknown to raters.
Study participants created separate concept maps for each subject domain using concepts provided to them (61 concepts for DM and 56 for AS). Concepts were derived from expert maps (two each in DM and AS) and refined through discussions with experts and two authors (D.C.W. and M.S.). Concepts could be classified under the following general categories: diagnostic approach, therapy, pathophysiology, public health, cost-effectiveness, and complications. A complete list of concepts for both AS and DM is available from the authors on request. Participants created the maps by affixing labels with the concepts to 24 × 17-inch sheets of paper and then connecting related concepts together using a directed line with a statement (called linking phrase) describing how the concepts are related written above the line. A concept linked to another concept by a directed line and linking phrase is referred to as a proposition. Participants could use any or all of the provided concepts; however, they could only use concepts provided.
Table 1 summarizes key terms and examples used in concept map training and scoring. The maps were scored using three different scoring systems and a hybrid scoring system that combined unique features of the three.
Structural scoring system.
Maps were scored independently by two of the authors (M.M. and J.S.) using a modification of a structural scoring system (S) previously reported.12,16 Before scoring the maps, raters underwent scoring training with opportunities to practice, followed by periodic meetings to refresh training and discuss problems. Raters scored a random subset (60%) of the same maps to assess reliability.
In S, scores were determined by classifying and counting the number of propositions and determining the maximum level of hierarchy. Points were assigned based on categories (Table 1). A proposition was considered a “cross-link” when it linked together two clearly divergent series of propositions on the map (Figure 1). Propositions or cross-links that lacked linking phrases above the connecting line were counted separately and given less credit. No credit was given for an invalid (i.e., incorrect or wrong) proposition or cross-link. Hierarchical levels were determined by counting the maximum number of concept links from the concept at the apex of the hierarchy (Figure 1). Further details of the scoring system and the points assigned to each category are outlined in Table 1. In the high-scoring map in Figure 1, the proposition including the concept “complications” connected to the concept “macrovascular” with the linking phrase “chronic, poor control or underdiagnosed DM cause” would be scored as a concept link (two points, as noted in Table 1). The proposition including the concepts “DKA” connected to “glucose monitoring” by the phrase “needs careful” would be scored as a cross-link because it connects two different linking strands (one originating from the concept “complications” and the other originating from the concepts “therapy” and “education”).
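The hierarchy count described above can be sketched in a few lines of Python. This is a minimal illustration, not the raters' actual procedure: it assumes each proposition is stored as a (source, target) pair, that the apex concept is known, and that a concept's level is its shortest distance from the apex so that cross-links do not inflate the depth. The function name `max_hierarchy_level`, the apex "diabetes mellitus", and the map fragment are hypothetical.

```python
from collections import deque


def max_hierarchy_level(propositions, apex):
    """Return the largest number of concept links between the apex and any concept.

    propositions: iterable of (source_concept, target_concept) directed links.
    Assumption: a concept's level is its shortest distance from the apex.
    """
    children = {}
    for source, target in propositions:
        children.setdefault(source, []).append(target)

    level = {apex: 0}
    queue = deque([apex])
    while queue:
        concept = queue.popleft()
        for child in children.get(concept, []):
            if child not in level:  # first visit gives the shortest distance
                level[child] = level[concept] + 1
                queue.append(child)
    return max(level.values())


# Hypothetical fragment loosely modeled on the concepts named in the text.
example_map = [
    ("diabetes mellitus", "complications"),
    ("complications", "macrovascular"),
    ("complications", "DKA"),
    ("diabetes mellitus", "therapy"),
    ("therapy", "glucose monitoring"),
    ("DKA", "glucose monitoring"),  # cross-link between two strands
]
depth = max_hierarchy_level(example_map, "diabetes mellitus")
```

Point values for each category are deliberately left out of the sketch, because they are defined in Table 1 rather than in the text.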
Quality and importance scoring system.
We developed an alternative quality scoring system expanding on several previously reported scoring systems.9,14,15 This system used a two-step process that first rated the importance or proximity of the relationship between concepts and then the accuracy (or depth) of understanding reflected in the linking phrases used to describe the relationship between concepts. Using this method, two authors (M.S. and D.C.W.) independently rated each map using the criteria that follow.
Step 1: Importance of concept–concept relationship.
This step assesses the importance of the relationship between two concepts. Before scoring the maps, a matrix was created to capture all possible paired connections between two concepts. Using expert maps and consensus opinion of authors, each possible concept–concept relationship was rated on a three-point scale (zero, one, or two) (Table 1). Incorrect, tangential, or trivial concept relationships were given zero points. Less important, but nevertheless correct, concept relationships were given one point. Closely related concepts that are central to the understanding of AS or DM were given two points. For each map, step 1 scores were assigned to each concept–concept relationship using this matrix. In Figure 1, the concept “complications” connected to the concept “macrovascular” would be scored two points because it was considered “correct” and “closely related” (from Table 1).
Step 2: Accuracy of proposition.
In step 2, the raters independently scored each linking phrase on each map for the accuracy and depth of understanding of the relationship described by the phrase. Linking phrases were scored on a scale of zero to five as described in Table 1. Briefly, zero points were given for propositions describing incorrect relationships. One point was given if the linking line lacked a phrase describing the concept–concept relationship, two points for propositions that were either incomplete or only sometimes true, and three points if they were simple and correct. Four points were given if the propositions demonstrated a more sophisticated understanding by using one conditional modifier. Five points were given for using an even more sophisticated proposition, usually using two or three conditional modifiers. In Figure 1, the proposition including the concept “complications” connected to “macrovascular” with the linking phrase “chronic, poor control or underdiagnosed DM cause” would be scored as complex and correct (five points, Table 1).
Final scores: Quality score and importance/quality score.
On the basis of the two-step scoring system described above, we assigned two scores to each map, called quality (Q score) and importance/quality (I/Q score). The Q score was the sum of step 2 scores; thus, it considered only the accuracy of each proposition and made no assessment of the importance of the linkage. The I/Q score was the sum, across propositions, of each proposition’s step 1 score multiplied by its step 2 score, representing an importance-weighted quality score that takes into account both the importance and the accuracy and depth of understanding demonstrated by the proposition. We chose to multiply step 1 and 2 scores together so that if either score were zero (i.e., the relationship or the proposition was judged incorrect), that proposition’s contribution to the I/Q score would also be zero.
Hybrid scoring system.
Cross-links indicate map complexity and are unique to the structural system. Proposition accuracy and importance are unique to the Q and I/Q systems. In the hybrid scoring (H) system, the I/Q score of an individual proposition was multiplied by two if it had been scored as a cross-link in the structural system.
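The arithmetic behind the Q, I/Q, and H scores can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text: the I/Q score is read as a per-proposition product (step 1 × step 2) summed over the map, which is what the hybrid rule implies; the importance matrix is stored as a dictionary keyed by unordered concept pairs; and pairs missing from the matrix count as zero. The names `Proposition` and `score_map` are hypothetical, as are the scores given to the second proposition.

```python
from dataclasses import dataclass


@dataclass
class Proposition:
    source: str
    target: str
    accuracy: int                # step 2 score, 0 to 5
    is_cross_link: bool = False  # as judged in the structural system


def score_map(propositions, importance):
    """Return the Q, I/Q, and H totals for one concept map.

    importance: dict mapping frozenset({concept_a, concept_b}) to a step 1
    score of 0, 1, or 2. Missing pairs are treated as 0 (an assumption).
    """
    q = iq = h = 0
    for p in propositions:
        step1 = importance.get(frozenset((p.source, p.target)), 0)
        weighted = step1 * p.accuracy   # zero if either step scored zero
        q += p.accuracy                 # Q ignores importance
        iq += weighted
        h += weighted * (2 if p.is_cross_link else 1)  # cross-links doubled
    return {"Q": q, "I/Q": iq, "H": h}


importance_matrix = {
    frozenset(("complications", "macrovascular")): 2,  # scored 2 in the text
    frozenset(("DKA", "glucose monitoring")): 1,       # hypothetical value
}
learner_map = [
    Proposition("complications", "macrovascular", accuracy=5),
    Proposition("DKA", "glucose monitoring", accuracy=3, is_cross_link=True),
]
totals = score_map(learner_map, importance_matrix)
```

With these inputs the first proposition contributes 5 to Q and 10 to both I/Q and H, and the cross-link contributes 3 to Q, 3 to I/Q, and 6 to H.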
To assess the reliability of our method of administering and scoring concept maps, we used G-theory to estimate how much of the total variance in concept map scores was attributable to learners (referred to as universe score variance) and how much was measurement error attributable to rater, occasion, and subject domain (AS or DM). We chose G-theory because classical test theory measures sources of error or variability one at a time (e.g., retest reliability, internal consistency alpha, parallel forms reliability). G-theory allowed us to simultaneously evaluate the relative contribution of each source of variation, and their interactions, to total score variability. We therefore performed a generalizability study (G-study) to estimate the contributions of learner, rater, occasion, domain, and their interactions to total map score variance.
The resulting G-study variance components were then used to estimate a generalizability coefficient (G-coefficient; scale of zero to one), which is analogous to the reliability coefficient in classical test theory. The G-coefficient allowed us to estimate the degree to which we can reliably generalize administration of this assessment method across all possible conditions that contribute to error (e.g., rater, occasion). In a “what-if” type of simulation, called a decision study (D-study), the numbers of raters and occasions were manipulated until arriving at a desired G-coefficient (≥0.8). Increasing the number of raters or occasions sampled lowers measurement error and increases the G-coefficient (cf. Spearman–Brown prophecy formula in classical test theory).17
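The logic of a D-study can be sketched as follows. This is an illustration under stated assumptions rather than the GENOVA computation: it uses the standard relative (norm-referenced) G-coefficient for a fully crossed learner × occasion × rater design with random facets, in which each interaction involving learners is divided by the number of conditions sampled. The function names and the variance components are hypothetical; the numbers are illustrative and are not the values in Tables 3 or 4.

```python
def g_coefficient(var_p, var_po, var_pr, var_por_e, n_occasions, n_raters):
    """Relative G-coefficient for a crossed learner x occasion x rater design.

    var_p      universe score (learner) variance
    var_po     learner x occasion interaction variance
    var_pr     learner x rater interaction variance
    var_por_e  three-way interaction confounded with residual error
    """
    relative_error = (
        var_po / n_occasions
        + var_pr / n_raters
        + var_por_e / (n_occasions * n_raters)
    )
    return var_p / (var_p + relative_error)


def occasions_needed(components, n_raters=1, target=0.8, max_occasions=20):
    """Smallest number of occasions reaching the target, or None if not reached."""
    for n in range(1, max_occasions + 1):
        g = g_coefficient(**components, n_occasions=n, n_raters=n_raters)
        if g >= target:
            return n
    return None


# Illustrative variance components only (arbitrary units, not study data).
illustrative = {"var_p": 40.0, "var_po": 30.0, "var_pr": 2.0, "var_por_e": 4.0}
single_occasion_g = g_coefficient(**illustrative, n_occasions=1, n_raters=1)
n_required = occasions_needed(illustrative, n_raters=1, target=0.8)
```

Because the learner × occasion term is divided by the number of occasions, adding occasions shrinks the dominant error source, whereas adding raters helps little when rater-related components are already small.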
Map concepts, linking phrases, and scores were entered into an Excel database. Only complete datasets were analyzed. Excel was used to generate total scores at occasions one and two for raters one and two in each of the two domains. G-theory analysis was performed using the GENOVA program.18 Moreover, mean scores (over raters) for each scoring method were compared statistically for differences at occasions one and two using two-tailed paired Student t tests.
Concept map scores
Figure 1 shows a partial reproduction of a low- and high-scoring concept map about DM. For all scoring systems, scores for both the AS and DM maps increased significantly from the first to the second occasion according to paired Student t tests (P < .002 for all scoring systems). Additional data about the concepts provided to learners and the use of concepts within learner maps are available on request.
Reliability of scoring systems
To assess the reliability of each scoring system, we performed separate three-facet (rater, occasion, domain) G-studies with learners as the object of measurement. For this initial analysis, the effects of all facets were treated as random. The relative contribution of learners, domain, occasion, and rater to score variance in each scoring system is shown in Table 2. We found that universe score variance (variance attributable to effect of systematic learner differences) was substantial for the Q, I/Q, and H scoring systems (41.3%, 43.9%, and 40.7% of variance, respectively) but low for S (10%). Across all scoring systems, rater effects (both main and interaction effects) were small, accounting for no more than 4% of variance. Residual sources of variance were small (range, 1.2%–2.2%) for all scoring systems. In contrast, we found a large interaction effect between learners and the occasion in which they created the map. This effect was greatest for the S scoring system, accounting for 50% of the estimated variance. In Table 2, the variance results for the Q system were not included because they were essentially identical to those of the I/Q system.
We also found a large learner–occasion–domain interaction effect (ranging from 18.8% to 25.2% across scoring systems). This finding suggested that learner scores varied substantially in different domains from one occasion to another. On the basis of these findings, we decided to examine the reliability of different scoring methods separately for each domain (see below).
The G-coefficient for two domains, two occasions, and two raters was essentially the same for the Q, I/Q, and H scoring systems (0.77, 0.78, and 0.76, respectively). For S, the G-coefficient was only 0.24.
The final G-study examined the variation attributable to learners, raters, occasions, and their interactions in each domain (AS and DM) separately. Recall that in the initial G-study, we treated domain as a random effect, averaging scores in different domains across the universe of domains to examine its impact on score variation. The large learner–occasion–domain interaction effect we observed indicated that learner scores varied substantially in different domains from one occasion to another. This finding is consistent with the expectation that learner performance should differ depending on which domain is being tested. For this reason, we concluded that it was not appropriate to average scores over domains. Instead, we chose to perform the G-study, and all subsequent analyses, on each domain (AS and DM) separately.
Accordingly, we performed a two-facet G-study using occasions and raters as facets for each subject domain separately. Table 3 shows the relative contributions of learner, occasion, rater, and their interactions to total I/Q score variance. We report only the I/Q scoring system results because the reliability of the S scoring system was poor and the variance and reliability of the other two scoring systems (Q and H) were essentially the same as those of I/Q. Our results demonstrated a large learner–occasion (L × O) interaction effect for both AS and DM domains (36.9% and 31.2%, respectively). Learner (universe score) variance was also substantial, accounting for 37.9% of variance in AS and 49.1% in DM. The rater main effect and interaction effects were small, ranging from 0% to 1.6% in AS and 0% to 7.7% in DM. The residual variance was small for both domains.
Among the conditions tested in our model, learner–occasion interaction effects and occasion main effects accounted for most of the measurement error, whereas rater effects were small. On the basis of these findings, we performed a D-study that modeled an increasing number of occasions using a single rater. Table 4 shows the effect of increasing the number of occasions, from one to five, on the G-coefficient. The threshold level of 0.8 is achieved after four occasions in DM and five occasions in AS.
Effect of raters—“interrater reliability”
To assess the effect of raters alone on score variance, we performed a one-facet G-study for each domain separately, using data from the second occasion only. We chose occasion two because we reasoned that our participants had reached asymptote in their concept mapping skills. This analysis is analogous to interrater reliability in classical test theory.
Learner variance (universe score variance) was high for the I/Q scoring system (93.9% [AS] and 89.7% [DM]), whereas rater variation was low (2.2% [AS] and 6.3% [DM]). The residual effect, including the interaction effect between rater and learner, was similarly low (5.0% [AS] and 4.0% [DM]). The G-coefficient (analogous to interrater reliability) was 0.98 for both AS and DM using the I/Q scoring system. Results for the other scoring systems were essentially the same (data not shown, but available on request).
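A one-facet G-study of this kind can be sketched with the usual expected-mean-squares estimators for a fully crossed learner × rater table with one score per cell. This is a minimal illustration, not the GENOVA analysis used in the study; the function name `one_facet_g_study` and the score table are hypothetical, and negative variance estimates are truncated at zero by assumption.

```python
def one_facet_g_study(scores):
    """Estimate variance components from scores[learner][rater] (fully crossed)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    learner_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    # Sums of squares for a two-way layout without replication.
    ss_p = n_r * sum((m - grand) ** 2 for m in learner_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-squares estimators; negatives truncated at zero.
    var_res = ms_res                           # learner x rater, plus error
    var_p = max((ms_p - ms_res) / n_r, 0.0)    # universe score variance
    var_r = max((ms_r - ms_res) / n_p, 0.0)    # rater main effect

    denominator = var_p + var_res / n_r
    g = var_p / denominator if denominator > 0 else float("nan")
    return {"learner": var_p, "rater": var_r, "residual": var_res, "g": g}


# Hypothetical scores: four learners, each scored by the same two raters.
ratings = [[10, 12], [20, 21], [30, 33], [40, 40]]
result = one_facet_g_study(ratings)
```

The relative G-coefficient returned here averages over the raters in the table, so it plays the role of interrater reliability for the mean of the two raters' scores.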
Exploration of variations in scoring systems
For the purpose of hypothesis generation and future study design, we explored the reliability of several variations in all scoring systems. For the structural system, we varied the weighting factors of score components (i.e., varied the factor by which each component [e.g., concept links, cross-links, hierarchy] was multiplied). For the Q and I/Q systems, we explored alternative scales by collapsing step 1 and step 2 score levels. In each situation, we found that universe score variance and G-coefficients decreased rather than increased, compared with the original weighting and expanded scales (data not shown).
For an assessment method to be reliable, observed variation in scores should be primarily attributable to systematic differences among learners’ knowledge structures (universe score variation) rather than other factors, such as rater variation, testing occasions, or interaction effects among these factors. In our initial three-facet G-study, we found that universe score variation was moderate for the Q, I/Q, and H scoring systems but very low for the S scoring system. The next largest sources of variance were the learner–occasion–domain and learner–occasion effects, indicating that learner scores vary depending on the occasion and the domain tested. The interaction effect involving domain suggests that learners have different knowledge structures for each domain. This finding is expected, because one would anticipate that an individual’s level of knowledge and expertise would vary across different subject domains. Variance attributable to raters and to rater interactions with other sources of variation was low regardless of scoring system or occasion, indicating that the concept maps can be scored in a reliable way.
We also demonstrated a large learner–occasion interaction effect and, to a lesser extent, occasion main effect for all scoring systems. The D-study indicated that, under the conditions used in our study, concept map assessment should be administered on four to five occasions (four for DM and five for AS) to achieve sufficient reliability to distinguish between individuals. Similar occasion-related effects were not found in earlier concept mapping studies in secondary school science education assessment.9,14 An important difference between our study and those studies is that we provided learners substantially more concepts (50–60 versus 15–20), which we hypothesize added to the complexity of the mapping task. This, coupled with the observation that scores increased significantly from occasion one to occasion two, suggests that the performance of many learners may have improved through learning how to make the concept maps. However, the learner–occasion effect also indicates that the scores of some learners decreased over time, perhaps because of individual performance factors (such as fatigue). Therefore, using many fewer concepts with adequate opportunity to practice might result in a reduced occasion main effect and, possibly, learner–occasion interaction effect.
The universe score variance and G-coefficient using the S scoring system were very low, indicating that this system was much less reliable than the Q, I/Q, or H scoring systems. This finding contrasts with our earlier work, in which we demonstrated that the S scoring system had good interrater reliability and test–retest reliability.15,16 However, in those studies we used an unconstrained form of concept mapping in which learners provided their own concepts to create the maps, rather than having the concepts provided for them. In addition, we used elements of classical test theory, rather than G-theory, to assess reliability. Our current finding that the quality scoring systems were reliable and the S system was not reliable is consistent with other studies of concept mapping assessment using similar mapping and scoring methods and G-theory analysis.9,14
The Q and I/Q scoring systems that we used are substantial modifications of previously described quality scoring systems.9,15,19 One important difference is that our I/Q scoring system weighted each proposition by the importance of the relationship between its two linked concepts to the depth of understanding of the subject domain. In addition, in both the Q and I/Q systems, we expanded the scale used to rate the proposition, which helped minimize the score ceiling effect we observed in earlier work and provided better discrimination between individuals.15 We also created a hybrid scoring system that combined the I/Q score with the feature of the structural scoring system that indicated map complexity. Although we found that the reliability of this scoring method was equivalent to that of the Q and I/Q methods, the H method necessitates scoring an additional element (cross-links) in the maps, thus adding complexity to the scoring task. We prefer the I/Q method because it is simple and takes into account not only the quality of the proposition but also its importance to understanding the subject domain.
Our study was limited by a relatively small sample size derived from a single institution and the fact that we only assessed two subject domains. Therefore, it remains possible that our findings may be related to an unmeasured confounding variable unique to our study population or that our results might not generalize across different groups of learners or subject domains. However, our sample of learners included those at both early (medical students) and late (senior residents) stages of clinical skill development, and from two different medical disciplines. An additional limitation is that we administered the assessments on only two occasions. In retrospect, given the large occasion main and interaction effects, an ideal study would have administered the assessments on several more occasions. This approach would have allowed us to determine whether the hypothesized learning curve associated with performing the concept mapping task would reach a plateau, as one would expect. Finally, we provided the concepts to learners (constrained maps), rather than allowing them to choose their own concepts (unconstrained maps). Unconstrained maps have the potential to maximize learner knowledge structure variation, but they cannot be scored using automated methods. However, too many constraints can limit the ability to capture learner variation.14 Constraining maps by providing concepts allows for simplified administration and the possibility of automated scoring, and it retains the ability to have substantial variability in response.
We conclude that concept mapping assessment using three scoring systems (Q, I/Q, and H) that assess the quality and importance of concept map propositions and features of map complexity can be reliably administered in medical education. However, learner scores vary substantially depending on the occasion on which the assessment is taken. Using the method of concept mapping described in this study, learners must take the assessment on four to five occasions to achieve a level of reliability sufficient to discriminate among individuals. It is possible that modification of the concept map assessment task or additional learner training could reduce the number of testing occasions required to achieve adequate reliability. The new I/Q scoring system reported in this study weights propositions by their relative importance to the depth of understanding of the subject domain. Although its reliability is similar to that of the Q and H systems, we prefer this method over the others described in this study because it has greater face validity. Future studies are needed to determine whether reducing the number and choice of concepts (e.g., randomly stratified concept sampling) will allow concept mapping assessment to be administered over fewer occasions while maintaining adequate reliability. Finally, the widespread adoption of concept mapping assessment to evaluate learner knowledge structure will depend on validity studies that explore the relationship of knowledge structure to other systems of evaluation and clinical performance outcomes, and on the availability of technology to further automate the scoring process.
The authors thank Nicholas Kenyon, MD, and Hank Swartz, MD (Department of Medicine, UC Davis), and Nicole Glaser, MD, and Jesse Joad, MD (Department of Pediatrics, UC Davis), for making expert maps. The authors also thank Richard Pomeroy, PhD (School of Education, UC Davis), for administering concept map training to participants, and LuAnn Cruys and Sally Colvin (Department of Pediatrics) for administrative support.
Dr. Srinivasan was funded in part by the Robert Wood Johnson Foundation Generalist Physicians Faculty Scholars Program and the UC Davis Health System Award.