A Validity Study of the Writing Sample Section of the Medical College Admission Test


Section Editor(s): Albanese, Mark PhD

PAPERS: Close but No Bananas: Predicting Performance

Correspondence: Mohammadreza Hojat, PhD, Jefferson Medical College, Philadelphia, PA 19107.

The current version of the Medical College Admission Test (MCAT), introduced in 1991, includes four sections: Biological Sciences, Physical Sciences, Verbal Reasoning, and Writing Sample. The Writing Sample assesses skills in organizing thoughts and presenting ideas in a cohesive manner, and provides evidence of analytic thinking and writing skills.1 Scoring is based on two 30-minute essays about general topics. An example of an essay prompt is “In a free society, individuals must be allowed to do as they choose.”

Each essay is holistically scored by two trained reviewers on a six-point scale with regard to specific criteria such as developing the central idea, synthesizing concepts logically, and writing clearly with good grammar, syntax, and punctuation. Essays receiving scores that differ by more than one point are re-evaluated by a third expert reviewer. The scores for the two essays completed by each examinee are summed and converted to an 11-point alphabetical scale ranging from J to T. According to reports by the Association of American Medical Colleges (AAMC), 98% of the essays are given identical scores or scores within one scale point of each other by the independent reviewers.1

The results of multi-institutional studies, conducted by the MCAT Validity Study Advisory Group,2 have been published and presented at professional meetings.2,3,4,5 However, while the need for additional studies of the psychometric properties of the MCAT continues, there is a particular need for study of the predictive power of the Writing Sample. The unique alphabetic scores of the Writing Sample discourage the usual correlational analyses used in validity studies. Although it is possible to convert the alphabetic scores to the integers from 1 to 11 by assuming that the letters constitute an interval scale, such an assumption might not be widely accepted.

We designed the present study to examine the validity of the Writing Sample section of the MCAT for students at Jefferson Medical College in Philadelphia, Pennsylvania. We speculated that the ability to organize and express ideas effectively in writing could have relevance to the analytic and problem-solving skills demanded in clinical performance. Furthermore, such skills might also be related to a better presentation of one's self, and to effective verbal expression of ideas, both of which are critical in promoting interpersonal relationships. Therefore, we hypothesized that scores on the Writing Sample would be associated more closely with indicators of clinical competence than with measures of achievement in basic sciences.

Data for 1,776 matriculants (1,086 men, 690 women) at Jefferson Medical College between 1992 and 1999 were retrieved from the database of the Jefferson Longitudinal Study of Medical Education.6 The students were classified into three groups (top, middle, and bottom) based on their scores on the Writing Sample. The “top” group included 314 students (18% of the sample) who scored R, S, or T. The “middle” group consisted of 1,155 students (65%) who scored N, O, P, or Q. The 307 students (17%) who scored J, K, L, or M made up the “bottom” group.
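The grouping rule described above can be sketched in a few lines of Python. This is only an illustration: the names `LETTERS`, `letter_to_int`, and `writing_sample_group` are hypothetical and do not come from the study, and the integer mapping assumes that the letters J through T form an interval scale, an assumption that might not be widely accepted.

```python
# Hypothetical helper names; the letter ranges follow the grouping in the text.
LETTERS = "JKLMNOPQRST"  # the 11-point alphabetical scale, lowest to highest

TOP = set("RST")
MIDDLE = set("NOPQ")
BOTTOM = set("JKLM")


def letter_to_int(letter: str) -> int:
    """Map J..T to 1..11 (assumes the letters form an interval scale)."""
    if len(letter) != 1 or letter not in LETTERS:
        raise ValueError(f"not a Writing Sample score: {letter!r}")
    return LETTERS.index(letter) + 1


def writing_sample_group(letter: str) -> str:
    """Classify a letter score as 'top', 'middle', or 'bottom'."""
    if letter in TOP:
        return "top"
    if letter in MIDDLE:
        return "middle"
    if letter in BOTTOM:
        return "bottom"
    raise ValueError(f"not a Writing Sample score: {letter!r}")
```

Grouping by letter avoids treating the scale as interval-level data, which is why the comparison of three ordered groups needs no numeric conversion at all.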

Three sets of criteria were used.

  • Admission measures. The first set included the measures typically used for screening applicants, such as undergraduate grade-point averages (UGPAs) in science and non-science courses, admission interview scores, and MCAT scores on Biological Sciences, Physical Sciences, and Verbal Reasoning.
  • Performance in the basic sciences. The second set consisted of achievement measures in the basic science disciplines, including grade-point averages (GPAs) in first- and second-year medical school courses. Scores on Step 1 of the United States Medical Licensing Examinations (USMLE) were also included.
  • Performance in clinical sciences and ratings of clinical competence. Included in this set were scores on written examinations in six core clerkships (family medicine, internal medicine, obstetrics-gynecology, pediatrics, psychiatry, and surgery) in the third year of medical school. Written examinations in basic and clinical sciences are in either multiple-choice or uncued formats,7 with reliability estimates usually over r = .75.

Combined global ratings of clinical competence in the six core clerkships, on a 100-point scale,8 and scores on Step 2 of the USMLE were also included. In addition, medical school class rank (percentile), a composite measure with two thirds weight for clinical competence in the core clerkships and one third weight for the combined first- and second-year GPAs,8,9 was used, as were the ratings of graduates' clinical competence from a 33-item rating form measuring three clinical competence areas of “data-gathering and processing skills” (16 items), “interpersonal skills and attitudes” (ten items), and “socioeconomic aspects of patient care” (seven items). These ratings were made on a four-point Likert scale by program directors near the end of the first postgraduate year. Data have been reported in support of the measurement properties of this rating form, including construct validity (factor structure), the internal consistency aspect of reliability, and criterion-related validity.10,11

Continuous measures were transformed to a distribution with a mean of 100 and a standard deviation of 10 to facilitate comparisons of the magnitudes of differences on a scale with a uniform mean and standard deviation. This transformation was used to mitigate the issue of scale incompatibility within each class and between classes. The numbers of observations vary for different analyses because data were not yet available for the entire sample at the time of this study.
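The transformation described above can be illustrated with a minimal Python sketch. It assumes that scores are standardized separately within each class and that the population standard deviation is used; neither detail is stated in the text, and the function name `rescale` is hypothetical.

```python
from statistics import mean, pstdev


def rescale(scores, target_mean=100.0, target_sd=10.0):
    """Linearly transform scores to the given mean and standard deviation.

    Assumption: the population SD (pstdev) is used; the text does not say
    which estimator was applied.
    """
    m = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores have no variance; cannot standardize")
    return [target_mean + target_sd * (x - m) / sd for x in scores]


# Illustrative values only, not data from the study.
one_class = [70, 80, 90]
rescaled = rescale(one_class)
```

Because every class is mapped to the same mean and standard deviation, a score of 110 always means one standard deviation above that class's average, whatever the original grading scale was.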

The three groups were compared on the criterion measures by using analysis of variance for the continuous measures, followed by the Duncan test for pairwise comparisons, and by using the Kruskal-Wallis test for class rank. Analysis of covariance was also employed to make statistical adjustments for baseline differences in the scores of other MCAT sections.
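As a rough illustration of the first step, a one-way analysis of variance compares the variability between group means with the variability within groups. The Python sketch below computes only the F statistic and its degrees of freedom; the function name and the data are hypothetical, and obtaining a p value would additionally require the F distribution from a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: squared deviations from each group's mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative values only, not data from the study.
bottom, middle, top = [1, 2, 3], [2, 3, 4], [5, 6, 7]
f_stat, df_b, df_w = one_way_anova_f([bottom, middle, top])
```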

Admission Variables. The means and sample sizes for the criterion measures and a summary of the statistical analyses are presented in Table 1. Comparisons of the top, middle, and bottom groups on the Writing Sample showed no significant difference for undergraduate science GPA, or for the Biological and Physical Sciences sections of the MCAT. However, significant differences were observed for undergraduate non-science GPA (p < .05), and the Verbal Reasoning test (p < .01). Duncan tests indicated that the top group's undergraduate non-science GPA was significantly higher than those of the middle and bottom groups (p < .05). As expected, the top group also obtained the highest mean score in Verbal Reasoning, followed by the middle and bottom groups (p < .01).



Performances in Basic Sciences Disciplines in Medical School. Data reported in Table 1 indicate that although the top group consistently outperformed the bottom group in first- and second-year basic science courses, as well as on USMLE Step 1, the differences were not statistically significant.

Performances in Clinical Science Disciplines and Ratings of Clinical Competence. Statistically significant differences were observed among the top, middle, and bottom groups on a number of performance measures in clinical disciplines. Both the top and the middle groups obtained significantly higher mean grades (p < .01) than did the bottom group on written examinations in the six core clerkships. A similar pattern of findings was observed for medical school class rank.

The top group was also rated significantly higher than the middle and bottom groups in global ratings of clinical competence in the third-year core clerkships (p < .01). The difference between the top and bottom groups' Step 2 scores was also statistically significant (p < .05).

Results for the six measures of clinical competence in residency showed that the differences for ratings in interpersonal skills and attitudes were statistically significant (p < .05), with the top group rated significantly higher than the bottom group. Although the differences in other areas of postgraduate competence did not reach the conventional level of statistical significance (p < .05), a consistent pattern was observed in which the highest average ratings were obtained by the top group, and the lowest by the bottom group.

In additional analyses, the two extreme groups (top and bottom) were compared regarding the ratings in other areas of clinical competence in residency, and standardized effect-size estimates (d) were calculated for the significant pairwise differences. The top group was rated higher than the bottom group in data-gathering and processing skills (p < .05, estimated effect size = .52), socioeconomic aspects of patient care (p < .05, effect size = .51), and physician as a patient educator (p < .05, effect size = .56). Effect-size estimates of this magnitude are medium rather than small according to Cohen's definition.12 These differences are therefore not only statistically significant but also of practical significance.
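A standardized effect size of this kind expresses the difference between two group means in standard deviation units. The Python sketch below uses the common pooled-standard-deviation form of Cohen's d; the text does not state which variant was used, so the formula, the function name, and the data are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (an assumption)."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Illustrative values only, not data from the study.
top_ratings = [2, 4, 6]
bottom_ratings = [1, 3, 5]
d = cohens_d(top_ratings, bottom_ratings)  # means differ by half a pooled SD
```

By Cohen's conventions, values near .2, .5, and .8 are usually read as small, medium, and large effects, respectively.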

Controlling for Differences on the Other Sections of the MCAT. Statistical adjustments were made for baseline differences using both the Biological Sciences and the Physical Sciences sections of the MCAT as covariates through analysis of covariance. Each of the previously reported differences among the three groups remained unchanged. This indicates that the previous findings were not confounded by score differences in these two sections of the MCAT.

Further statistical adjustments were made by adding scores on the Verbal Reasoning section of the MCAT to the other two covariates (scores on the Biological and Physical Sciences sections). The differences remained unchanged for the following criterion measures: clinical clerkship examinations (adjusted p = .02), clinical clerkship ratings (adjusted p = .02), and medical school class rank (adjusted p = .008). However, changes in statistical significance levels were observed in the undergraduate non-science GPAs (adjusted p = .10), Step 2 scores (adjusted p = .31), and postgraduate ratings of data-gathering and data-processing skills (adjusted p = .11).

The findings of the present study confirm the research hypothesis that scores on the Writing Sample section of the MCAT are more closely associated with measures of clinical competence than with achievement in the basic sciences.

These findings provide support for the validity of the Writing Sample from a number of perspectives. We hypothesized that high scorers on the Writing Sample would outperform others in clinical sciences evaluations and in ratings of clinical competence. The hypothesis was confirmed, providing support for the predictive validity of the test.

The fact that scores on the Writing Sample were significantly associated with performance in the clinical areas in medical school and residency provides evidence in support of convergent validity, whereas their lack of associations with measures of achievement in science prior to and during medical school supports the discriminant validity of the test. In addition, concurrent validity was demonstrated by the relationships between the Writing Sample and Verbal Reasoning scores.

Clinical grades in medical school are based on the observations of faculty and supervising residents during the actual provision of clinical care to patients, and reflect the ability of students to relate well to others. These dimensions of clinical competence require basic medical knowledge, which may be predicted on the basis of MCAT science scores. However, while necessary, medical knowledge is not sufficient for effective clinical decision making. The significant relationship between the Writing Sample scores and clinical ratings after adjustment for MCAT science scores confirms that the associations between Writing Sample scores and measures of clinical performance are beyond those that would be expected from attainment of knowledge only. Therefore, it can be concluded that the Writing Sample measures a unique skill, different from those measured by the other sections of the MCAT, including the Verbal Reasoning section. It can be speculated that such a unique skill might be attributed more to factors that are not associated with achievement in sciences. Such speculation needs to be verified further by empirical evidence.

The results generally suggest that, for a sample of students at one medical school, Writing Sample scores of J, K, L, or M predicted poorer clinical performance during and after medical school. This particular grouping of the Writing Sample scores should be studied further in samples from other medical schools before implementation in decision making.

Certain aspects of this study could be questioned and deserve comment. It may be argued that the statistically significant findings of this study could have been due to chance as a result of the large number of statistical comparisons that were performed. However, this argument can be refuted based on the findings for the 18 criterion measures reported in Table 1. While only one statistically significant finding would be expected by chance alone at p < .05, seven were reported in this table. Similarly, the internal validity of the findings could be questioned by arguing that the statistically significant findings could be attributed to the large sample size, rather than underlying relationships among the variables. This argument can also be refuted based on the findings that the significant associations were observed only for the conceptually relevant scores, such as Verbal Reasoning, whereas there was no relationship with the less relevant scores such as the Biological and Physical Sciences, despite the fact that the sample size was equally large (n = 1,776) in all analyses. Furthermore, the magnitudes of the effect-size estimates between top and bottom scorers suggest that the obtained differences are of practical importance to decision makers.
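The chance argument above rests on simple arithmetic: with 18 tests at a significance level of .05, about 18 × .05 = 0.9 significant results would be expected if no true differences existed. The Python sketch below reproduces that figure and, under the added assumption that the 18 tests are independent, computes the binomial probability of seven or more chance findings. The criterion measures in the study are correlated, so the independence assumption is a simplification, and the function names are hypothetical.

```python
from math import comb


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of significant results when every null is true."""
    return n_tests * alpha


def prob_at_least(k: int, n_tests: int, alpha: float) -> float:
    """P(X >= k) for X ~ Binomial(n_tests, alpha); assumes independent tests."""
    return sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k, n_tests + 1)
    )


expected = expected_false_positives(18, 0.05)  # roughly one finding
tail = prob_at_least(7, 18, 0.05)              # chance of seven or more
```

Under these assumptions the tail probability is very small, which is consistent with the authors' position that seven significant findings are unlikely to reflect chance alone.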

These findings, coupled with the relatively large sample size and the longitudinal design of this study, provide assurance for the internal validity of the results. However, more data from other medical schools are needed to assure the external validity and the generalization of the findings.

In earlier studies we found that validity coefficients for the MCAT varied for students who graduated from different colleges and universities,13 that the validity of the MCAT varied for different sets of scores when applicants repeated the examination,14 and that different sections of the MCAT have different predictive validity depending upon the criterion measures.15 Empirical evidence also suggests that validity coefficients for the MCAT vary among medical schools.2 It will be essential to consider these factors in future studies of the validity of the MCAT.

1. Association of American Medical Colleges. Use of MCAT Data in Admissions: A Guide for Medical School Admissions Officers and Faculty. Washington, DC: AAMC, 1991.
2. Koenig JA, Wiley A. Medical school admission testing. In: Dillon RF (ed). Handbook of Testing. Westport, CT: Greenwood Press, 1997:274–95.
3. Mitchell K, Haynes R, Koenig JA. Assessing the validity of the updated Medical College Admission Test. Acad Med. 1994;69:394–401.
4. Wiley A, Koenig JA. The validity of the Medical College Admission Test for predicting performance in the first two years of medical school. Acad Med. 1996;71(10 suppl):S83–S85.
5. Koenig JA, Sireci SG, Wiley A. Evaluating the predictive validity of MCAT scores across diverse applicant groups. Acad Med. 1998;73:1095–106.
6. Hojat M, Gonnella JS, Veloski JJ, Erdmann JB. Jefferson Medical College's longitudinal study: a prototype of assessment of changes. Education for Health. 1996;9:99–113.
7. Veloski JJ, Rabinowitz HK, Robeson MR, Young PR. Patients don't present with five choices: an alternative to multiple-choice tests in assessing physicians' competence. Acad Med. 1999;74:539–46.
8. Blacklow RS, Goepp CE, Hojat M. Class ranking models for dean's letters and their psychometric evaluation. Acad Med. 1991;66(9 suppl):S10–S12.
9. Blacklow RS, Goepp CE, Hojat M. Further psychometric evaluations of a class ranking model as a potential predictor of graduates' clinical competence in the first year of residency. Acad Med. 1993;68:295–7.
10. Hojat M, Veloski JJ, Borenstein BD. Components of clinical competence ratings: an empirical approach. Educ Psychol Meas. 1986;46:761–9.
11. Hojat M, Borenstein BD, Veloski JJ. Cognitive and noncognitive factors in predicting the clinical performance of medical school graduates. J Med Educ. 1988;63:323–5.
12. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum, 1987.
13. Zeleznik C, Hojat M, Veloski JJ. Predictive validity of the MCAT as a function of undergraduate institutions. J Med Educ. 1987;62:163–9.
14. Hojat M, Veloski JJ, Zeleznik C. Predictive validity of the MCAT for students with two sets of scores. J Med Educ. 1985;60:911–8.
15. Glaser K, Hojat M, Veloski JJ, Blacklow RS, Goepp CE. Science, verbal, or quantitative skills: which is the most important predictor of physician competence? Educ Psychol Meas. 1992;52:395–405.

Section Description

Research in Medical Education: Proceedings of the Thirty-ninth Annual Conference. October 30 - November 1, 2000. Chair: Beth Dawson. Editor: M. Brownell Anderson. Foreword by Beth Dawson, PhD.

© 2000 by the Association of American Medical Colleges