Each year, about 850 papers are submitted for publication to the Nederlands Tijdschrift voor Geneeskunde (Dutch Journal of Medicine), including nearly 250 original articles.1 Peer review in scientific journals is a widely used and well-established method of assessing research reports. Peer review has two principal functions: filtering out incorrect or inadequate work and improving the accuracy and clarity of published reports.2 As the updated Cochrane review on peer review again makes clear, little is yet known about the effectiveness of the peer review process.3 This is due in part to the fact that assessing the peer review process is predominantly a matter of behavioral science, and research on the effects and shortcomings of peer review has begun only recently. Research has shown that readers of the Dutch Journal of Medicine believe that an article’s quality is improved by peer review.4
One of the most important tools for assessing the quality of reviewers’ reports is an internally and externally validated scale. Theoretically, a grading instrument that provides data on the quality of individual peer reviews could aid editors by 1) identifying reviewers who are consistently outstanding or weak, timely or late, and exacting or lenient, and who contribute the most or least to the review process, and 2) providing objective data to use when updating the reviewer list, making merit-based promotions to the editorial board, and providing feedback to reviewers.5 Several instruments have been used to assess the quality of reviewers’ reports. One of them is the Review Quality Instrument developed by Van Rooyen et al.6 The Review Quality Instrument consists of 8 items, each scored on a 5-point scale. Each of the first 7 items reflects a different aspect of the review (importance of the research question, originality of the paper, strengths and weaknesses of the method, presentation, constructiveness of comments, substantiation of comments, and interpretation of results). Item 8 is a global question inquiring about the overall quality of the review. A mean total score is calculated that reflects the mean of the sum of the first 7 items of the instrument.
In this study, our first objective was to evaluate the adequacy and reliability of a simple 5-point scale that has been used for years by Obstetrics & Gynecology but that had not been validated until now. An adequate instrument actually measures what it was developed to measure; reliability concerns the reproducibility of the results when the measurement is repeated and when the instrument is used by others. We believe our 5-point scale to be faster and simpler for daily use than other published instruments. It assesses the objectivity, insight, structure, and constructiveness of the review and its turnaround time, and many of its items overlap with those of more complicated instruments such as the Review Quality Instrument. The second objective of the study was to analyze the relationship between the turnaround time for a review and its quality. Because a journal wants its reviewers to be timely, we would like to see better reviews have a shorter turnaround time. On the other hand, one could imagine that an exacting and outstanding reviewer might need more time to complete a review than a weaker or more lenient one.
MATERIALS AND METHODS
The quality of 247 reviews of 119 original articles submitted to the Dutch Journal of Medicine was assessed. Articles were randomly selected from the archives of the journal and consisted of articles that were rejected in 2003, articles published in 2004 that had been selected for a prize for the best original article of the year, and articles that were still in the review process at the time of the study. At that time (June 2004), 39 of the 119 articles had been rejected for publication, 63 had been accepted for publication, and the remaining 17 articles had been returned to the authors for revision. Quality was assessed using a 5-point scale (see the box: “The 5-Point Score Used by Editors of Obstetrics & Gynecology to Assess the Quality of Reviews”). Each masked review was assessed independently by three editors of our journal (WH, HV, AO). We calculated test-retest reliability, or intraobserver variability, by having a randomly selected subset of 76 reviews of 65 original articles rated a second time by the same editors (WH2, HV2, AO2) at the end of the research period. Intraobserver variability for each of the three internal editors (for 76 reviews) was then computed as an intraobserver intraclass correlation coefficient. An intraclass correlation coefficient is a measure of reliability, or agreement between evaluators, and ranges from 0 to 1; higher values reflect greater reproducibility. It resembles a simple correlation coefficient but can be used to assess agreement among more than two assessors, and it has the advantage of dealing with random errors as well as with systematic differences between methods.7 The time between the first and second rating was about 2 months. For each editor, the number of times that a score of 1 or 5 was given was calculated.
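An intraclass correlation coefficient can be computed from a one-way analysis of variance. The sketch below is illustrative only: the article does not state which ICC form was used, so the one-way random-effects form, ICC(1,1) in the notation of Shrout and Fleiss,7 is an assumption, and the 5-point scores are invented.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1); ratings is a list of rows,
    one row per review, one column per rater (an assumed ICC form)."""
    n = len(ratings)      # number of reviews (targets)
    k = len(ratings[0])   # number of raters per review
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-review and within-review mean squares from a one-way ANOVA.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(ratings, row_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical 5-point scores from three editors for five reviews.
scores = [
    [3, 3, 4],
    [4, 4, 4],
    [2, 3, 2],
    [5, 4, 5],
    [1, 2, 2],
]
print(round(icc_oneway(scores), 2))
```

With perfect agreement among raters the within-review mean square is 0 and the coefficient is exactly 1; disagreement pushes it toward 0.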
We sought to assess the external validity of the English-language 5-point scale in two ways. First, we asked the editors of three other Dutch peer-reviewed medical journals to rate the 247 reviews using the same 5-point scale (Ext1, Ext2, Ext3). The 247 ratings of all six editors were then compared. Interobserver variability was determined by means of an interobserver intraclass correlation coefficient for the three internal editors, for the three external editors, for all six editors, and for the pool of internal versus external editors.
Second, we sent the authors the reviews of their articles with a questionnaire to elicit their opinions of each review’s quality. The questionnaire consisted of 12 yes-or-no questions and one question asking for an overall score (between 1 and 5) for the review (Table 1). If the questionnaires were not returned within 3 weeks, we sent a first reminder to each nonresponding author. The same was done 3 weeks later, this time enclosing the questionnaires and reviews a second time. Of all 247 reviews, 240 were sent to 118 of the 119 authors of the original articles. Seven questionnaires were not sent because the corresponding reviews had been withheld from the authors in the first place on account of their tone (4 reviews of accepted articles and 3 reviews of rejected articles); this is why one author received no questionnaire at all.
The 12 items of the yes-or-no questionnaire were scored as 1 when answered with a “yes” and scored as 0 when answered with a “no.” This resulted in a sum score between 0 and 12. We calculated the mean and median sum scores and overall scores (ie, the score that ranged from 1 to 5) as rated by the authors. Next, we evaluated the Spearman correlation coefficient between the authors’ sum and overall scores. The correlation coefficient is a number between –1 and +1. A negative correlation coefficient reflects a negative association whereas a positive correlation coefficient corresponds to a positive association. If the correlation coefficient is 0 there is no association at all. The closer the correlation coefficient comes to –1 or +1, the stronger the association.
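The scoring and correlation described above can be sketched as follows. The author responses below are invented for illustration; the rank correlation is computed with SciPy’s spearmanr.

```python
from scipy.stats import spearmanr

# Hypothetical author responses: 12 yes/no answers (True/False) plus an
# overall score from 1 to 5. The values are made up for illustration.
responses = [
    ([True] * 10 + [False] * 2, 4),   # sum score 10, overall 4
    ([True] * 5 + [False] * 7, 2),    # sum score 5,  overall 2
    ([True] * 12, 5),                 # sum score 12, overall 5
    ([True] * 7 + [False] * 5, 3),    # sum score 7,  overall 3
]
sum_scores = [sum(answers) for answers, _ in responses]   # range 0-12
overall_scores = [overall for _, overall in responses]    # range 1-5

rho, p = spearmanr(sum_scores, overall_scores)
print(round(rho, 2))  # toy data are perfectly monotonic, so rho = 1.0
```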
The ratings of all six editors were compared with the overall scores of the authors by calculating an interobserver intraclass correlation coefficient. We assessed whether the editorial decision (accepted, revisions needed, or rejected) influenced the authors’ sum and overall scores or the quality rating of all editors by performing an unpaired t test. We assessed the correlation coefficient between the authors’ overall scores and the editors’ assessments for articles in each category of editorial decision. We determined the relationship between each author’s response and the category of editorial decision by variance analysis.
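These group comparisons can be illustrated with standard SciPy routines; the scores below are invented and do not reproduce the study’s data.

```python
from scipy.stats import ttest_ind, f_oneway

# Hypothetical authors' overall scores (1-5), grouped by editorial decision.
accepted = [4, 5, 3, 4, 5, 4]
revision = [3, 4, 3, 4, 3]
rejected = [2, 3, 2, 1, 3, 2]

# Unpaired t test: do accepted and rejected papers receive different scores?
t, p_t = ttest_ind(accepted, rejected)

# One-way analysis of variance across all three decision categories.
f, p_f = f_oneway(accepted, revision, rejected)

print(p_t < .05, p_f < .05)
```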
In addition, the number of days between the request for and the return of the review was noted, and the mean and median turnaround times of the reviewers were calculated. For the date of return, the date as it had been written down by the reviewer on the accompanying letter was used. We then evaluated the correlation coefficient between turnaround time and the quality of the review as indicated by the mean score from all six editors. SPSS 12.0.1 (SPSS Inc, Chicago, IL) was used for the statistical analysis.
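The turnaround-time calculation amounts to simple date arithmetic; a small sketch with made-up request and return dates:

```python
from datetime import date
from statistics import mean, median

# Hypothetical (request, return) date pairs for three reviews.
requests_and_returns = [
    (date(2004, 3, 1), date(2004, 3, 22)),
    (date(2004, 3, 5), date(2004, 4, 2)),
    (date(2004, 3, 10), date(2004, 3, 31)),
]
turnaround_days = [(returned - requested).days
                   for requested, returned in requests_and_returns]
print(turnaround_days, mean(turnaround_days), median(turnaround_days))
```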
RESULTS

The mean and median scores for the three editors of the Dutch Journal of Medicine combined were 3.15 and 3.17, respectively (Table 2). A score of 1 was given 5, 1, and 1 times by WH, HV, and AO, respectively. A score of 5 was given 1, 28, and 11 times by WH, HV, and AO, respectively.
The interobserver intraclass correlation coefficient for the three internal editors was 0.62 (95% confidence interval [CI] 0.50–0.71) (Table 3). For the second rating of 76 reviews, the interobserver intraclass correlation coefficient was 0.62 (95% CI 0.45–0.74), indicating that there was no difference in interobserver variability between the first and second assessments. On the second rating, a score of 1 was given 0, 4, and 1 times by WH, HV, and AO, respectively, and a score of 5 was given 0, 11, and 2 times, respectively. The intraobserver intraclass correlation coefficients for the three internal editors were 0.66 for WH (95% CI 0.51–0.77), 0.72 for HV (95% CI 0.58–0.81), and 0.88 for AO (95% CI 0.81–0.92).

With regard to the external editors, a score of 1 was given 9, 4, and 2 times by Ext1, Ext2, and Ext3, respectively. A score of 5 was given 8, 22, and 15 times, respectively.
The interobserver intraclass correlation coefficient for the external editors was 0.60 (95% CI 0.51–0.68) (Table 3). The interobserver intraclass correlation coefficient for all six editors combined was 0.62 (95% CI 0.55–0.68). The interobserver intraclass correlation coefficient for the pool of external versus internal editors was 0.86 (95% CI 0.82–0.89).
Of the 240 reviews and questionnaires that were sent to the authors, 187 (78%) were returned. Author response was 83% (98 of 118 authors). The mean sum score (ie, the sum of the 12 items on the questionnaires) was 7.7, and the median was 8.0. The mean overall score (ie, the score that ranged from 1 to 5) was 3.3, and the median was 3.5. The correlation between sum score and overall score of the authors by Spearman correlation coefficient was 0.76 (P < .01). The overall score of the authors was significantly higher for accepted than for rejected papers (unpaired t test, P < .01). Both overall and sum scores were significantly higher for “revision needed” than for rejected papers (unpaired t test, P < .01 and P < .05, respectively). Author response was significantly higher for accepted papers (93%) and “revision needed” papers (70%) than for rejected papers (60%) (variance analysis, P < .01).
The interobserver intraclass correlation coefficient for the mean total editorial rating and the overall score of the authors was 0.28 (95% CI 0.14–0.41). For accepted papers, there was a significant correlation between the overall score of the authors and the review quality as determined by all editors (Spearman 0.43, P < .01), the internal editors (Spearman 0.34, P < .01), and the external editors (Spearman 0.47, P < .01). For “revision needed” papers, a significant correlation was found between the overall scores of the authors and review quality as determined by the internal editors (Spearman 0.41, P = .03) but not review quality as determined by the external editors (Spearman 0.22, P = .26) or by all six editors (Spearman 0.35, P = .07). For rejected papers, no significant correlations were found between the overall score of the authors and the review quality as determined by the editors (Spearman 0.36, P = .06, and 0.21, P = .14, respectively). Review quality as determined by the internal editors, the external editors, and all editors combined was significantly higher for “revision needed” than for accepted papers (unpaired t test, P = .04, .04, and .03, respectively) (Table 4).
The mean and median turnaround times for review were 24 and 21 days, respectively (range 1–97). There was no correlation between the speed of return and the quality of the review as determined by the mean score of all six editors (Spearman –0.03, P = .67).
DISCUSSION

The 5-point scale used by Obstetrics & Gynecology proved to be a simple and reliable instrument enabling editors to assess the quality of reviews. The editors of the Dutch Journal of Medicine found it easy to use for the quick evaluation of reviews. The interobserver intraclass correlation coefficient for all six editors combined was 0.62, indicating quite good interrater reliability. The intraobserver intraclass correlation coefficients for the three internal editors ranged from 0.66 to 0.88, demonstrating good test-retest reliability. The scores do not follow a normal distribution; there are some floor and ceiling effects. We made every effort to guarantee both internal and external validation of the instrument as much as possible. Because there is no gold standard for a review, we also asked external editors to give their ratings, using the same instrument, so as to assess the level of agreement between internal and external editors. In addition, we asked the authors of the original articles to fill out a questionnaire concerning the reviews of their articles. There was a significant correlation between the authors’ assessments and those of the editors in terms of the quality of the reviews. Other types of validation are construct, face, content, and criterion-related validity. Criterion-related validity is difficult to evaluate because there is no gold standard. We hypothesized that there might be a correlation between turnaround time and the quality of the review, but we found neither a positive nor a negative correlation.
Several studies have measured the quality of reviewers’ reports under varying circumstances, such as blinding7–10 and training.11 The instruments used to rate review quality ranged from 2-item to 10-item scales. Most items were rated using a 2- to 5-point system, but one instrument used a visual analog scale and one used ratings from 1 to 100.12 None of the rating instruments has a published validation, except for the Review Quality Instrument, which was used in several studies.9–11,13 The Review Quality Instrument is a more complicated instrument that consists of eight items, each scored on a 5-point scale; a mean total score is calculated as the mean of the sum of the first seven items.6 The instrument was validated by asking seven people to assess the quality of 20 reviews, followed by an internal consistency check in which a maximum of 11 raters assessed 11 reviews a second time (test-retest reliability). We believe that the 5-point scale that we tested is a faster and simpler instrument for assessing the quality of reviews. We also think that our method of validation is better for several reasons: not only did it involve more reviews, but it was done in a thorough way by determining both intraobserver and interobserver variability, and external editors as well as authors were asked to evaluate the instrument. We believe that there is a need for simpler instruments because, in a study examining the effect of blinding and unmasking on review quality, for example, the researchers added a global item to the Review Quality Instrument, seeking an overall assessment of the quality of the review.9
The instrument we used is suitable only for original articles and not for other types of papers, such as case reports and meta-analyses. Most of the instruments described have been used for original papers only.6,9,15,16 Some of the rating instruments could be used for other types of manuscripts.4,5
The more often our rating instrument was used, the faster the editors could reach a judgment. We asked the editors to assess the reviews without giving them the original articles, as was done in most other studies except that by Feurer et al.5 Because the authors were obviously familiar with their own articles, they were not blinded, in contrast to the editors. The Review Quality Instrument was also tested without accompanying manuscripts.6 However, although a review may look very good, an editor cannot really judge its accuracy without the original article.
In other studies, the most commonly rated aspects of reviews are those relating to the methodological soundness of the reviewed study and its importance, originality, and presentation. Several studies have also attempted to assess the tone or courteousness of the review.13 One of the criteria evaluated in our instrument was the objectivity of the review.
Some studies have considered the speed of completion of a review. The speed of completion was sometimes measured by the time (as noted by the referee) taken to complete the review.9–11 This is a subjective parameter because a reviewer may write down more or less time than was actually used. To capture the turnaround time as objectively as possible, we calculated the number of days between the date of request for the review and the return date, as was also done in the studies by Feurer et al5 and Weber et al.17
In our study, the turnaround time was shorter than that previously described for the Dutch Journal of Medicine in 1996.18 At that time, the mean and median turnaround times were 41 and 39 days (range 3–106), compared with 24 and 21 days (range 1–97) in our study. Although the turnaround time in our study may have been slightly underestimated, because we took the date written down by the referee rather than the date of receipt by the secretary of the Dutch Journal of Medicine, this cannot explain such a large difference.
Most studies have used the judgment of editors to determine the quality of reviews. Some have included the reader’s judgment in assessing the quality of the manuscript.4,16 Few studies have included an assessment by the authors of the reviewed article, as we did in this study.9,11,17,19 There may be some selection bias because we did not send all 247, but only 240, reviews to the authors. However, of the seven reviews that were not sent, four reviewed accepted articles, and three reviewed rejected articles. The authors were given the review only, without the final judgment of the reviewer or the accompanying letter to the editor.
Most studies of peer review in biomedical journals have been concerned with manuscripts that were accepted for publication.8,12,15,16 Some studies used submitted manuscripts independently of the editorial decision, as we did.10,11,17 A number of important questions can only be answered by studying rejected manuscripts as well as those that are accepted.2
The chief internal editors involved in this research were the same editors who initially sorted incoming articles and reviews. This may have been a source of recall bias if they were able to remember whether an article had been accepted for publication. Because the study was done with articles that had been submitted in the past, many authors already knew whether their articles had been accepted. For this reason, we examined the influence of acceptance, revision, or rejection of the article on the authors’ judgment. The authors’ response in our study was highest for accepted papers (93% versus 60% for rejected papers).
We think there is a need for simpler instruments than the Review Quality Instrument to assess the quality of reviews. Our instrument seems to be an adequate and reliable alternative. Very few studies of review quality have included an assessment by the responsible authors, as we did in this study. We believe that the quality of a review must be determined by both authors and editors because they have different objectives. Accepted, rejected, and “revision needed” articles should be used together to correct for any influence of the status of the article on the assessment of review quality. To calculate a turnaround time as objectively as possible, the number of days between the date of request for the review and the return date should be used.
Some suggestions can be made with regard to future studies on review quality. The instrument we used is only suitable for original articles, but it could be adapted for other types of manuscripts for future studies. The quality of the review may be evaluated better if the original article accompanies the review.
1. Bloemenkamp DGM, Hart W, Overbeke AJPM. The percentage of articles which were accepted or rejected for publication in the Dutch Journal of Medicine in 1997 [in Dutch]. Ned Tijdschr Geneeskd 1999;143:157–9.
2. Jefferson T, Alderson P, Wager E, Davidoff F. Effects of editorial peer review: a systematic review. JAMA 2002;287:2784–6.
3. Jefferson TO, Alderson P, Davidoff F, Wager E. Editorial peer-review for improving the quality of reports of biomedical studies (Cochrane Review). In: The Cochrane Library, Issue 2, 2005. Oxford: Update Software.
4. Pierie JP, Walvoort HC, Overbeke AJ. Readers’ evaluation of effect of peer review and editing on quality of articles in the Nederlands Tijdschrift voor Geneeskunde.
5. Feurer ID, Becker GJ, Picus D, Ramirez E, Darcy MD, Hicks ME. Evaluating peer reviews: pilot testing of a grading instrument. JAMA 1994;272:98–100.
6. Van Rooyen S, Black N, Godlee F. Development of the Review Quality Instrument (RQI) for assessing peer reviews of manuscripts. J Clin Epidemiol 1999;52:625–9.
7. Shrout P, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–8.
8. Godlee F, Gale CR, Martyn CN. Effect on the quality of peer review of blinding reviewers and asking them to sign their reports. JAMA 1998;280:237–40.
9. McNutt RA, Evans AT, Fletcher RH, Fletcher SW. The effects of blinding on the quality of peer review: a randomized trial. JAMA 1990;263:1371–6.
10. Van Rooyen S, Godlee F, Evans S, Smith R, Black N. Effect of blinding and unmasking on the quality of peer review: a randomized trial. JAMA 1998;280:234–7.
11. Van Rooyen S, Godlee F, Evans S, Black N, Smith R. Effect of open peer review on quality of reviews and on reviewers’ recommendations: a randomised trial. BMJ 1999;318:23–7.
12. Schroter S, Black N, Evans S, Carpenter J, Godlee F, Smith R. Effects of training on quality of peer review. BMJ 2004;328:673–5.
13. Jefferson T, Wager E, Davidoff F. Measuring the quality of editorial peer review. JAMA 2002;287:2786–90.
14. Callaham ML, Knopp RK, Gallagher EJ. Effect of written feedback by editors on quality of reviews: two randomized trials. JAMA 2002;287:2781–3.
15. Goodman SN, Berlin J, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med 1994;121:11–21.
16. Justice AC, Berlin JA, Fletcher SW, Fletcher RH, Goodman SN. Do readers and peer reviewers agree on manuscript quality? JAMA 1994;272:117–9.
17. Weber EJ, Katz PP, Waeckerle JF, Callaham ML. Author perception of peer review: impact of review quality and acceptance on satisfaction. JAMA 2002;287:2790–3.
18. Tjon MJ, Sang F, Overbeke AJPM, Lockefeer JHM. What do reviewers look for in “original articles” submitted for publication in the Netherlands Journal of Medicine [in Dutch]? Ned Tijdschr Geneeskd 1996;140:2349–52.
19. Black N, van Rooyen S, Godlee F, Smith R, Evans S. What makes a good reviewer and a good review for a general medical journal? JAMA 1998;280:231–3.