Reliability in Evaluating Letters of Recommendation

Dirschl, Douglas R., MD; Adams, George L., MD


To measure the interobserver reliability in evaluating letters of recommendation for residency applicants, three letters were collected from each of the application files of 58 residents at one program. The letters were rated by six faculty. Interobserver reliability, calculated using the kappa statistic, was slight. These preliminary results show significant variability in the interpretation of letters of recommendation.

Dr. Dirschl is associate professor of orthopedics, University of North Carolina at Chapel Hill School of Medicine, and executive director, Wake Area Health Education Center, Raleigh, North Carolina. Dr. Adams is an intern in internal medicine at the University of Texas, Southwestern School of Medicine, Dallas, Texas.

Correspondence should be addressed to Dr. Dirschl, Executive Director, Wake Area Health Education Center, 3024 New Bern Avenue, Suite 302, Raleigh, NC 27610.

Letters of recommendation are considered an important source of information about residency applicants,1 but no published study documents the reliability of rating these letters. This study measured the interobserver reliability of evaluating letters of recommendation for applicants to an orthopedics training program.

Three letters of recommendation (from a chair of orthopedics, an orthopedist, and a non-orthopedist) were collected from the application files of 58 orthopedics residents who completed training at the University of North Carolina at Chapel Hill School of Medicine, 1983–1987. The applicant's name, author's name, and author's institution were all masked.

Six orthopedics faculty rated the content of each letter as “outstanding,” “average,” or “poor,” which is the method of evaluation that faculty use to rate such letters. Interobserver reliability was expressed using the kappa statistic.2 Agreement by kappa value was characterized as follows: <0 = poor; 0–.20 = slight; .21–.40 = fair; .41–.60 = moderate; .61–.80 = substantial; and .81–1.00 = almost perfect.2
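The kappa statistic compares the agreement actually observed between raters with the agreement expected by chance. A minimal sketch for one pair of raters follows, assuming the unweighted (Cohen) form of kappa; the report does not say which variant was used. The function names and the example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters who rated the same letters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of letters given the same rating by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each rater's own category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


def landis_koch_label(kappa):
    """Map a kappa value to the agreement labels quoted in the text."""
    if kappa < 0:
        return "poor"
    # Upper bounds follow the quoted scale. Placing values that fall between
    # two bands (e.g. 0.205) in the higher band is an assumption of this sketch.
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical ratings of six letters by two raters (not data from the study).
rater_1 = ["outstanding", "outstanding", "average", "poor", "average", "outstanding"]
rater_2 = ["outstanding", "average", "average", "poor", "outstanding", "outstanding"]

kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f} ({landis_koch_label(kappa)})")  # kappa = 0.45 (moderate)
```

In this made-up example the raters agree on four of six letters (67%), but because 39% agreement would be expected by chance, kappa is only .45.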

The mean kappa values were .17 ± .03 (mean ± standard error) for letters from a chair, .28 ± .03 for letters from an orthopedist, and .20 ± .04 for letters from a non-orthopedist. The differences in kappa values were not statistically significant (p = 0.15).
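The report does not state how the mean kappa and its standard error were derived. One plausible reading, assumed here and not confirmed by the source, is that kappa was computed for each of the 15 possible pairs among the six raters and the pairwise values were then averaged. The sketch below shows that arithmetic with hypothetical kappa values; `mean_and_standard_error` is an illustrative helper name.

```python
import math
import statistics
from itertools import combinations

N_RATERS = 6  # six faculty raters, as in the study
rater_pairs = list(combinations(range(N_RATERS), 2))  # 15 unique rater pairs


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean = statistics.mean(values)
    # Standard error = sample standard deviation / sqrt(number of values).
    return mean, statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical pairwise kappa values, one per rater pair (not the study's data).
pairwise_kappas = [0.05, 0.31, 0.12, 0.22, 0.08, 0.27, 0.15, 0.19,
                   0.02, 0.35, 0.11, 0.24, 0.17, 0.09, 0.21]
assert len(pairwise_kappas) == len(rater_pairs)

mean_kappa, se_kappa = mean_and_standard_error(pairwise_kappas)
print(f"mean kappa = {mean_kappa:.2f} ± {se_kappa:.2f}")
```

Pairwise kappas that share a rater are not independent, so a standard error computed this way should be read as approximate.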

Of the residents whose letters were evaluated, eight had performed at a superior level in the residency program and eight at an inferior level. There was no significant difference in kappa values for rating letters for either the eight superior or the eight inferior residents (p > 0.5). While the superior and inferior groups had equal percentages of letters rated “poor” (15%), the superior group had a greater percentage of letters rated “outstanding” (33% versus 18%).

No previous investigation has established benchmarks to determine acceptable assessments of letters of recommendation. Using the kappa statistic, this preliminary study found slight interobserver reliability.

Although many faculty believe the identity of the author of a letter is important in determining the letter's value, the observers in this study focused solely on each letter's content. Some faculty believe that references to an applicant's positive or negative attributes are the most important information to glean from a letter. However, the study of content yielded only slight interobserver reliability.

Residency programs should be aware of the potential for interobserver variability in the interpretation of letters of recommendation. While use of an objective scoring system for letters might improve reliability, it is possible that much of the variability is due to the letters themselves. Authors tend to give glowing reports, making the letters less discriminating (the phenomenon of range restriction) and the author's true opinion of the applicant less clear. The continued assessment of the reliability of rating letters is necessary.
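A worked example may help show one way that range restriction can depress kappa. The numbers are hypothetical and are not drawn from this study. Suppose two raters each read 100 uniformly glowing letters, both rate 90 of them “outstanding,” and they split on the other 10 (five rated “outstanding” by the first rater and “average” by the second, and five the reverse). Each rater then rates 95 letters “outstanding” and 5 “average.” With p_o the observed agreement and p_e the agreement expected by chance:

```latex
\begin{align*}
p_o &= \frac{90}{100} = 0.90 \\
p_e &= (0.95)(0.95) + (0.05)(0.05) = 0.905 \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.90 - 0.905}{1 - 0.905} \approx -0.05
\end{align*}
```

Although the raters agree on 90% of the letters, kappa falls in the “poor” band because nearly all of that agreement would be expected by chance when almost every letter lands in the same category.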

1. DeLisa JA, Jain SS, Campagnolo DI. Factors used by physical medicine and rehabilitation residency training directors to select their residents. Am J Phys Med Rehab. 1994;73:152–6.
2. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
© 2000 Association of American Medical Colleges