- The development and content of the instrument are sufficiently described or referenced, and are sufficiently detailed to permit the study to be replicated.
- The measurement instrument is appropriate given the study's variables; the scoring method is clearly defined.
- The psychometric properties and procedures are clearly presented and appropriate.
- The data set is sufficiently described or referenced.
- Observers or raters were sufficiently trained.
- Data quality control is described and adequate.
ISSUES AND EXAMPLES RELATED TO CRITERIA
Instrumentation refers to the selection or development and the later use of tools to make observations about variables in a research study. The observations are collected, recorded, and used as primary data.
In the social and behavioral sciences—covering health outcomes, medical education, and patient education research, for example—these instruments are usually “paper-and-pencil” tools. In contrast, the biological sciences and physical sciences usually rely on tools such as microscopes, CAT scans, and many other laboratory technologies. Yet the goals and process in developing and using instruments are the same across the sciences, and therefore each field has appropriate criteria within the overall standards of scientific research. Throughout this section, the focus and examples are from the social sciences and in particular from health professions research, although the general principles of the criteria apply across the sciences.
Instrumentation builds on the study design and problem statement and assumes that both are appropriately specified. In considering the quality of instrumentation and data collection, the reviewer should focus on the rigor with which data collection is executed. Reviewers are looking for or evaluating four aspects of the execution: (1) selecting or developing the instrument, (2) creating scores from the data captured by the instrument, (3) using the instrument appropriately, and (4) ensuring that the methods employed met at least minimum quality standards.
Selection and Development
Describing the instrumentation starts with specifying in what way(s) the variables will be captured or measured. The reviewer needs to know what was studied and how the data were collected. There are many means from which an author can choose. A broad definition is used here that includes, but is not limited to, a wide variety of tools such as tests and examinations, attitude measures, checklists, surveys, abstraction forms, interview schedules, and rating forms. Indeed, scholars recommend that investigators use multiple measures to address the same research construct, a process called triangulation.1 Instrumentation is often relatively direct because existing and well-known tools are used to capture a variable of interest (e.g., Medical College Admission Test [MCAT] for medical school “readiness” or “aptitude”; National Board of Medical Examiners [NBME] subject examinations for “acquisition of medical knowledge”; Association of American Medical Colleges [AAMC] Graduation Questionnaire for “curricular experiences”). But sometimes the process is less straightforward. For example, if clinical competence of medical students after a required core clerkship is the variable of interest, it may be measured from a variety of perspectives. One approach is to use direct observations of students performing a clinical task, perhaps with standardized patients. Another approach is to use a written test to ask them what they would do in hypothetical situations. Another option is to collect ratings made by clerkship directors at the end of the clerkship that attest to students' clinical skills. Other alternatives are peer- and self-ratings of competence, or patient satisfaction data. Choosing among several possible measures of a variable is a key decision when planning a research study.
Often a suitable measurement instrument is not available, and instruments must be developed. Typically, when new instruments are used for research, more detail about their development is expected than when existing measures are employed. Reviewers do not have to be experts in instrument development, but they need to be able to assess that the authors did the right things. Numerous publications describe the methods that should be followed in developing academic achievement tests,2,3 rating and attitude scales,4,6 checklists,7 and surveys.8 There is no single best approach to instrument development, but the process should be described rigorously and in detail, and reviewers should look for citations provided for readers to access this information.
Instrument development starts with specifying the content domain, conducting a thorough review of past work to see what exists, and, if necessary, beginning to create a new instrument. If an existing instrument is used, the reviewer needs to learn from the manuscript the rationale for its selection and its original sources. When new items are developed, the content can be drawn from many sources, such as potential subjects, other instruments, the literature, and experts. What the reviewer needs to see is that the process followed was more rigorous than a single investigator (or two) simply putting thoughts on paper. The reviewer should make sure that the items were critically reviewed for their clarity and meaning, and that the instrument was pilot tested and revised as necessary. For some instruments, such as a data abstraction form, pilot testing might mean as little as trying out the form on a sample of hospital charts. More stringent testing is needed for instruments that are administered to individuals.
For any given instrument, the reviewer needs to be able to discern how scores or classifications are derived from the instrument. For example, how were questionnaire responses summed or dichotomized such that respondents were grouped into those who “agreed” and “disagreed,” or those who were judged “competent” and “not competent”? If a manuscript is about an instrument itself, as opposed to the more typical case in which authors use an instrument to address a research question, investigators might present methods for formal scale development and evaluation, often focusing on subscale definition, reliability estimation, reproducibility, and homogeneity.9 Large development projects for instruments designed to measure individual differences on a variable of interest will also need to pay attention to validity, sensitivity, and stability of scores.10 Other types of instruments do not lend themselves well to aggregated scores. Nevertheless, reviewers need to be clear about how investigators operationalized research variables and judged the technical properties (i.e., reliability and validity) of research data.
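To make the notion of score creation concrete, the following sketch sums responses to a short, hypothetical Likert-type attitude scale and estimates internal-consistency reliability with Cronbach's alpha. The items and responses are invented for illustration; they are not drawn from any instrument discussed here.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a set of scale items.

    `items` is a list of k lists, each holding one item's responses
    across the same n respondents (hypothetical Likert data here).
    """
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]      # per-respondent scale score
    item_var = sum(pvariance(col) for col in items)   # sum of item variances
    total_var = pvariance(totals)                     # variance of summed scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical responses to a 3-item attitude scale (1-5 agreement)
items = [
    [4, 5, 3, 4, 2, 5],   # item 1
    [4, 4, 3, 5, 2, 4],   # item 2
    [3, 5, 2, 4, 1, 5],   # item 3
]
scores = [sum(r) for r in zip(*items)]   # summed scale score per respondent
alpha = cronbach_alpha(items)
```

A manuscript would report the scoring rule (here, a simple sum) and the resulting reliability estimate, so readers can judge whether aggregation was justified.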
Decisions made about cut-scores and classifications also need to be conveyed to readers. For example, in a study on the perceived frequency of feedback from preceptors and residents to students, the definition of “feedback” needs to be reported and justified. Is it, for instance, a report of any feedback within a certain amount of time, or is it feedback at a higher frequency, perhaps more than twice a day? Investigators make many decisions in the course of conducting a study. Not all of these decisions need to be reported in a paper, but enough should be present to allow readers to understand the operationalization of the variables of interest.
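A cut-score decision like the hypothetical feedback definition above can be stated explicitly in a few lines. The threshold and data here are invented solely to illustrate how an operationalization might be made reportable:

```python
# Hypothetical operationalization: "frequent feedback" means a student
# reports receiving, on average, more than two feedback episodes per day.
CUT_SCORE = 2  # episodes per day; the threshold itself is a judgment call

# Invented data: mean self-reported feedback episodes per day, by student
reports = {"A": 0.5, "B": 2.0, "C": 3.5, "D": 6.0}

classified = {
    student: ("frequent" if rate > CUT_SCORE else "infrequent")
    for student, rate in reports.items()
}
```

Note that student B, at exactly the cut-score, is classified "infrequent"; whether the boundary is inclusive or exclusive is precisely the kind of decision a manuscript should state.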
This discussion of score creation applies equally when the source of data is an existing data set, such as the AAMC Faculty Roster or the AMA Master File. These types of data raise yet more issues about justification of analytic decisions. A focus of these manuscripts should be how data were selected, cleaned, and manipulated. For example, if the AMA Master File is being used for a study on primary care providers, how exactly was the sample defined? Was it by training, board certification, or self-reports of how respondents spent their professional time? Does it include research and administrative as well as clinical time? Does it include both family medicine and internal medicine physicians? When researchers do secondary data analyses, they lack the intimate knowledge of the database that its creators had, and yet they must provide enough information for readers to judge their analytic decisions. The reviewer must look for evidence of sound decisions about sample definition and treatment of missing data that preceded the definition of scores.
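The kind of reportable sample-definition rules described above might look like the following sketch. The record fields and inclusion criteria are hypothetical and do not reflect the actual structure of the AMA Master File:

```python
# Invented physician records; field names are illustrative only.
records = [
    {"id": 1, "specialty": "family medicine",   "board_certified": True,  "clinical_pct": 80},
    {"id": 2, "specialty": "internal medicine", "board_certified": True,  "clinical_pct": None},
    {"id": 3, "specialty": "surgery",           "board_certified": True,  "clinical_pct": 95},
    {"id": 4, "specialty": "internal medicine", "board_certified": False, "clinical_pct": 60},
]

PRIMARY_CARE = {"family medicine", "internal medicine"}

# One explicit, reportable rule set: primary care specialty, board
# certification, and non-missing clinical time above 50%. Record 2 is
# excluded for missing data, a decision that should itself be reported.
sample = [
    r for r in records
    if r["specialty"] in PRIMARY_CARE
    and r["board_certified"]
    and r["clinical_pct"] is not None
    and r["clinical_pct"] > 50
]
```

Writing the rules down this explicitly is what allows a reviewer to verify that the sample definition and handling of missing data were sound.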
Use of the Instrument
Selecting or developing an instrument and deriving scores from it are only two parts of instrumentation. The third and complementary part involves the steps taken to ensure that the instrument is used properly. For many self-administered forms, the important information may concern incentives and processes used to gather complete data (e.g., contact of non-responders, location of missing charts). For instruments that may be more reactive to the person using the forms (e.g., rating forms, interviews), it is necessary to summarize coherently the actions that were taken to minimize differences related to the instrument user. This typically involves discussion of rater or interviewer training and computation of inter- or intra-rater reliability coefficients.5
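One common inter-rater reliability coefficient for categorical judgments is Cohen's kappa, which corrects observed agreement for the agreement expected by chance. A minimal sketch, using invented pass/fail ratings from two hypothetical raters:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments of the same cases."""
    n = len(rater_a)
    # Proportion of cases on which the raters agreed
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical post-training check: two raters judge 10 student encounters
a = ["pass", "pass", "fail", "pass", "fail",
     "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "pass", "fail", "pass", "pass",
     "pass", "pass", "fail", "pass", "fail"]
kappa = cohen_kappa(a, b)
```

Here the raters agree on 8 of 10 cases, but because chance agreement is high, kappa is noticeably lower than the raw agreement rate; reporting the coefficient rather than raw agreement is the more defensible choice.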
General Quality Control
In addition to reviewing the details about the actual instruments used in the study, reviewers need to gain a sense that the study was conducted soundly.11 In most cases, it is impractical and unnecessary to report the internal methods that were put in place for monitoring data collection and quality. This level of detail might be expected in a proposal application, but it does not fit in most manuscripts. Still, depending on the methods of the study under review, the reviewer must assess a variety of issues such as unbiased recruitment and retention of subjects, appropriate training of data collectors, and sensible and sequential definitions of analytic variables. The source of any funding must also be reported.
These are generic concerns for any study. It would be unwieldy to consider here all possible elements, but the reviewer needs to be convinced that the methods are sound; sloppiness or incompleteness in reporting (or worse) should raise a red flag. In the end, the reviewer must be convinced that appropriate rigor was used in selecting, developing, and using measurement tools for the study. Without being an expert in measurement, the reviewer can look for relevant details about instrument selection and subsequent score development. Optimally, the reviewer is left confident and clear about the procedures that the author followed in developing and implementing the data collection tools.
1. Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull. 1959;56:81–105.
2. Linn RL, Gronlund NE. Measurement and Assessment in Teaching. 7th ed. Englewood Cliffs, NJ: Prentice-Hall, 1995.
3. Millman J, Green J. The specification and development of tests of achievement and ability. In: Linn RL (ed). Educational Measurement. 3rd ed. New York: Macmillan, 1989:335–66.
4. Medical Outcomes Trust. Instrument review criteria. Med Outcomes Trust Bull. 1995;2:I–IV.
5. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 2nd ed. Oxford, U.K.: Oxford University Press, 1995.
6. DeVellis RF. Scale Development: Theory and Applications. Applied Social Research Methods Series, Vol. 26. Newbury Park, CA: Sage, 1991.
7. McGaghie WC, Renner BR, Kowlowitz V, et al. Development and evaluation of musculoskeletal performance measures for an objective structured clinical examination. Teach Learn Med. 1994;6:59–63.
8. Woodward CA. Questionnaire construction and question writing for research in medical education. Med Educ. 1988;22:347–63.
9. Kerlinger FN. Foundations of Behavioral Research. 3rd ed. New York: Holt, Rinehart and Winston, 1986.
10. Nunnally JC. Psychometric Theory. New York: McGraw-Hill, 1978.
11. McGaghie WC. Conducting a research study. In: McGaghie WC, Frey JJ (eds). Handbook for the Academic Physician. New York: Springer-Verlag, 1986:217–33.
Review Criteria for Research Manuscripts
Joint Task Force of Academic Medicine and the GEA-RIME Committee