Application of Bayesian Methodology to Measure Equivalence Between Two Rating Scales

Pawlowicz, Franchesca E.; Gajewski, Byron J.; Coffland, Valorie; Boyle, Diane K.; Bott, Marjorie J.; Dunton, Nancy

doi: 10.1097/NNR.0b013e318253341b

Background: Little is known about measuring equivalence between two rating scales. Measuring the equivalence between two rating scales requires a study design and analysis conducive to clear interpretation of actual equivalence with simple inferences.

Objective: The aim of this study was to show the use of Bayesian methodology in determining equivalence within a simulated content validity study (to establish equivalence, not content validity).

Methods: Participants were randomized into two groups and responded to the items’ perceived relevance or perceived correlation to a construct, job enjoyment. Items from the National Database of Nursing Quality Indicators were used.

Results: Eighty-seven nursing faculty members from various schools of nursing participated. With an informative prior, all items had a posterior probability of >95% that the rating scales are equivalent; using a weak or flat prior led to a minimal decrease in the posterior probabilities.

Discussion: Prior information and new information collected in this study were used to determine the posterior probability that the mean difference between the relevance and correlation groups is within ±0.5 points, thereby demonstrating equivalence between the two rating scales.

Franchesca E. Pawlowicz, MSN, RN, is Graduate Research Assistant, School of Nursing, University of Kansas Medical Center.

Byron J. Gajewski, PhD, is Associate Professor, School of Nursing and Department of Biostatistics, University of Kansas Medical Center.

Valorie Coffland, MA, BS, is Research Instructor; Diane K. Boyle, PhD, RN, is Associate Professor; Marjorie J. Bott, PhD, MA, RN, is Associate Professor and Associate Dean for Research; and Nancy Dunton, PhD, is Research Professor, School of Nursing, University of Kansas Medical Center.

Accepted for publication January 26, 2012.

Partial funding of this research for all authors was provided by a contract (National Database of Nursing Quality Indicators; PI: N.D.) with the American Nurses Association. The second author was funded partially also by a University of Kansas Research Institute Bridging Grant (PI: B.G.).

The authors have no conflicts of interest to disclose.

Corresponding author: Franchesca E. Pawlowicz, MSN, RN, School of Nursing, University of Kansas Medical Center, Kansas City, KS 66160 (e-mail:

Establishing equivalence in research studies has many essential uses in the care of patients or measurement of variables. Studies to determine if a new intervention equates with a predominant or established intervention often are used to explore and enhance patient care. Alternative methods of measuring equivalence occur during translation of questionnaires, and they are used to determine if two questionnaires measure the same construct using classical statistical methods. However, little is known about measuring equivalence between two rating scales. Theory development, hypothesis generation, and alternative measurement methods may arise from knowledge derived from investigating equivalence between scales.

Measuring the equivalence between two rating scales requires a study design and analysis conducive to clear interpretation of actual equivalence with simple inferences. A Bayesian approach may satisfy these requirements. This methodology allows investigators to make inferences of comparison (equivalence, inferiority, or superiority) between treatments, pharmaceuticals, or diagnostics (Carpenter, Gajewski, Teel, & Aaronson, 2008).

Bayesian methodology allows the use of external data and has an inductive versus deductive approach (Adamina, Tomlinson, & Guller, 2009). Classical statistical methodologies do not include external data, and results are computed from directly observed data. Bayesian methodology is grounded in subjective probability rather than relative frequency probability (Goodman, 2005). Bayesian methodology incorporates external data, which makes this statistical method ideal for accumulation of data for evidence-based guideline development. Classical statistics is viewed as a deductive approach based on hypothesis-driven studies, with outcomes calculated from the observed data. Bayesian methodology differs due to its inductive approach, with probabilities initially stated and then updated with new data.

Two components of Bayesian methodology are the prior distribution and the likelihood function. The prior distribution is derived from external information, not from current data (likelihood). A prior can be informative or uninformative, depending on the level of current knowledge. Types of informative priors (Spiegelhalter, Abrams, & Myles, 2004) are skeptical (reasonable expression of doubt), enthusiastic (counterbalance pessimism expressed by the skeptical prior), data based (previously collected data), or opinion based (reasonable judgment). The prior distribution should never be arbitrary, and an uninformative prior fails to provide a contribution to the posterior distribution. The likelihood reflects the information in the current experimental data. Combining the prior distribution and the likelihood results in a calculated posterior distribution.
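The combination of prior and likelihood can be illustrated with a minimal Python sketch. It assumes a normal prior and a normal likelihood for a mean with a known response standard deviation, and it treats the prior as a number of pseudo-observations; the function name `update_normal_mean` and all numbers are hypothetical and are not taken from the study.

```python
def update_normal_mean(prior_mean, prior_n, sample_mean, sample_n, sigma=1.0):
    """Conjugate normal update for a mean with known response SD `sigma`.

    Assumption: the prior is worth `prior_n` pseudo-observations centered
    at `prior_mean`; `prior_n = 0` corresponds to a flat prior.
    Returns (posterior_mean, posterior_sd).
    """
    total_n = prior_n + sample_n
    if total_n <= 0:
        raise ValueError("need at least one prior or observed observation")
    # Posterior mean is a sample-size-weighted average of prior and data.
    post_mean = (prior_n * prior_mean + sample_n * sample_mean) / total_n
    # Posterior SD shrinks as prior and observed information accumulate.
    post_sd = sigma / total_n ** 0.5
    return post_mean, post_sd


# Hypothetical numbers: a prior worth 4 observations at the scale midpoint,
# combined with 46 observed ratings averaging 3.0.
print(update_normal_mean(2.5, 4, 3.0, 46))
```

With a prior sample size of zero, the posterior mean equals the sample mean, which is the sense in which an uninformative prior contributes nothing to the posterior distribution.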

Sensitivity analysis and the ability to analyze small samples are advantages of Bayesian methodology. Varying the external data or a priori information displays how sensitive the conclusions are to the prior. Sensitivity is determined by incorporating various levels of subjective probabilities (informative priors) to infer how prior information affects the results from the current data. Use of a prior distribution allows for successful analysis of small samples (Weiss, 2010) within the likelihood. With classical statistical analyses, a likelihood based on a small sample may not provide significant information. However, with Bayesian methodology, these small samples retain their value because they are combined with prior information. Combining samples can be time saving and cost effective and makes use of information that may not otherwise be considered.

Application of Bayesian methodology occurs in areas such as nursing outcomes (Gajewski, Mahnken, & Dunton, 2008; Tervo-Heikkinen, Kiviniemi, Partanen, & Vehviläinen-Julkunen, 2009), nursing assistant-resident rapport (Williams, Herman, Gajewski, & Wilson, 2009), efficacy of treatments (Carpenter et al., 2008), adaptation of measurement tools (Gill, Reifsnider, Lucke, & Mann, 2007), interrater reliability (Gajewski, Hart, Bergquist-Beringer, & Dunton, 2007), patient outcomes (Gill, Reifsnider, & Lucke, 2007), and measurement reliability of patient days (Simon, Yankovskyy, Klaus, Gajewski, & Dunton, 2011).

Studies determining the effectiveness of pharmaceutical treatments are typically randomized controlled trials (RCTs). Generally, the intent is to determine superiority (one intervention is better than another), noninferiority (a new intervention is not less effective than a current treatment), or equivalence of treatments (Turan & Senocak, 2007). Examples of investigators determining equivalence status in RCT and non-RCT studies include determining equivalence of an original questionnaire and the translated version (Jones, Mallinson, Phillips, & Kang, 2006), equivalent anesthetic effect between two topical agents (Kleiber, Sorenson, Whiteside, Gronstal, & Tannous, 2002), equivalence in outcomes between an alternative scalp laceration closure procedure completed by a physician or nurse (Ong et al., 2008), and an equivalence study of preoperative assessments completed by nurses and senior house officers (Rushforth et al., 2006). As of June 2010, there were few published articles (Gajewski et al., in press) related to Bayesian analysis in determining equivalence between two rating scales.

The main purpose of this presentation is to examine the Bayesian methodology in the context of a simulated content validity study using an equivalence trial. The objective is to determine equivalence rather than to establish content validity. Content validity is necessary to determine if items are relevant to the construct being examined. Traditional content validity requires experts to evaluate each item’s relevance to the construct using a 4-point Likert-type scale ranging from not relevant to very relevant (DeVon et al., 2007). Content validity is necessary to identify irrelevant items or items that are relevant but missing (Bannigan & Watson, 2009); both situations can affect the validity of the tool.

This case study was conducted using items from the National Database of Nursing Quality Indicators (NDNQI®) RN Survey. Items measuring job enjoyment were the focus of the study. Bayesian methodology was used for an equivalence trial between two differing response scales.

Subjects were identified through a cluster random selection of 40 schools of nursing from the top 100 schools of nursing funded by the National Institutes of Health. A Bayesian power of 90% requires 49 subjects within each group (Gajewski & Mayo, 2006). These subjects were considered experts in general job enjoyment for this study. In practice, this would not be an adequate set of experts. However, the objective was to determine equivalence rather than to establish actual content validity.

From the 40 schools, 120 faculty members volunteered to participate in the survey. The participants were randomized into a relevance group (participants determine the item’s relevance to the construct job enjoyment) or correlation group (participants determine the item’s correlation to the construct job enjoyment) during September and October 2009. Thirty-seven participants submitted a survey within the correlation group, and 50 participants within the relevance group.

Research Design

Experts’ ratings of perceived relevance were hypothesized to carry information about the perceived correlation between the item and the construct, which would make the two rating scales equivalent. Rather than assuming this equivalence, the goal was to show equivalence via a randomized trial of two groups. The first group received response options with relevance categories; the second group received correlation ranges. The theoretical framework for Bayesian methodology is based on Bayes’ rule. Combining prior information with the likelihood (observed data) allows existing knowledge (the prior) to be updated when new knowledge is generated (the likelihood). The combination of the prior and the likelihood creates a posterior distribution, from which a posterior probability can be calculated. New inferences can be made from posterior probabilities, resulting in updated knowledge. Further information on the theoretical framework for Bayesian methodology is available in articles by Carpenter et al. (2008) and Gajewski et al. (in press).

The questionnaire contained basic demographic items, seven items from the NDNQI Job Enjoyment Scale to measure satisfaction with work (Taunton et al., 2004) and one Practice Environment Scale (Gajewski, Boyle, Miller, Oberhelman, & Dunton, 2010) item. Demographics of the participants included RN status, highest level of education, job role, and knowledge in research. Each participant reported their level of expertise on psychometric instruments, quantitative research methods, and qualitative research methods.

Depending on group assignment, the participants identified either the level of perceived relevance or the level of perceived correlation for each of the eight items used to measure job enjoyment based on the provided definition: “The extent to which nurses like their job in general.” The item response options consisted of a 4-point Likert-type scale. The survey measuring correlation contained a 4-point scale (Cohen, 1988), with 1 = no correlation (.0–.10), 2 = small correlation (.11–.30), 3 = medium correlation (.31–.50), and 4 = large correlation (.51–1.0). The Pearson r ranges are widely accepted and used in reporting the magnitude of effect sizes (Cohen, Cohen, Aiken, & West, 1999). The survey used to measure relevance contained a 4-point scale, with 1 = not relevant, 2 = somewhat relevant, 3 = quite relevant, and 4 = highly relevant. The 4-point scale with relevance labels was supported by Lynn (1986) and Davis (1992). The relevance group represented traditional content validity (Polit & Beck, 2006), and the correlation group was based on the hypothesis that content validity also collects information on item-construct correlation.
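The correlation response options can be expressed as a small lookup, sketched here in Python. The function name `correlation_option` is hypothetical, and treating each range's upper bound as inclusive (so that values between the printed ranges, such as .105, fall into the higher option) is an assumption, because the survey lists the ranges only to two decimals.

```python
def correlation_option(r):
    """Map a perceived Pearson r in [0, 1] to the survey's 4-point option.

    Ranges follow the survey text: 1 = none (.0-.10), 2 = small (.11-.30),
    3 = medium (.31-.50), 4 = large (.51-1.0). Upper bounds are treated
    as inclusive, which is an assumption of this sketch.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie between 0 and 1")
    if r <= 0.10:
        return 1
    if r <= 0.30:
        return 2
    if r <= 0.50:
        return 3
    return 4


# Hypothetical perceived correlations and the option each would select.
for r in (0.05, 0.30, 0.31, 0.75):
    print(r, correlation_option(r))
```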

The participants were contacted initially by e-mail after receiving institutional review board approval; addresses were gathered from 40 school of nursing faculty member Web sites. Addresses of nursing faculty members were obtained from research department listings, if available, or from their biosketch demonstrating previous research experience. We sent a recruitment notice via e-mail describing the study and participation requirements to nursing faculty at institutions in the top 40, as defined by National Institutes of Health funding. After we established a pool of interested persons (consenting by e-mail), we first randomly selected (simple random sample) 180 to participate. Then we randomized (balanced, completely randomized) one third of them to “correlation,” one third to “relevance,” and one third to a “pilot experimental group”—the last is not a part of this study. We then sent them an additional e-mail containing the appropriate Web link to a survey (depending on group assignment). The volunteers received a Web link to different versions of the survey. The survey was hosted on a Web site available through SurveyMonkey.
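The two-stage allocation described above (a simple random sample followed by balanced, completely randomized assignment to three groups) can be sketched in Python as follows. The function name, the group labels used as dictionary keys, the size of the volunteer pool, and the seed are all hypothetical; the study does not report the software used for randomization.

```python
import random


def select_and_randomize(pool, n_selected=180,
                         groups=("correlation", "relevance", "pilot"),
                         seed=None):
    """Draw a simple random sample, then split it evenly across groups."""
    if n_selected % len(groups) != 0:
        raise ValueError("n_selected must divide evenly among the groups")
    rng = random.Random(seed)
    # Sampling without replacement; the result is already in random order,
    # so consecutive slices give a balanced, completely randomized split.
    selected = rng.sample(pool, n_selected)
    per_group = n_selected // len(groups)
    return {
        name: selected[i * per_group:(i + 1) * per_group]
        for i, name in enumerate(groups)
    }


# Hypothetical pool of 300 consenting volunteers.
volunteers = [f"volunteer_{i}" for i in range(300)]
assignment = select_and_randomize(volunteers, seed=2009)
print({name: len(members) for name, members in assignment.items()})
```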

Data Analysis

The model for the data is $x_{jkm} = \mu_{jm} + e_{xjkm}$, wherein $x_{jkm}$ is the response to the $j$th item by the $k$th expert within the $m$th group. The $\mu_{jm}$ is the mean response for the $j$th item in the $m$th group. The $e_{xjkm}$ is the error and has a normal distribution with a mean of zero and a variance $\sigma_j^2$. The intent of the study was to determine if the means are equivalent and to test for $H: |\mu_{j1} - \mu_{j2}| < 0.5$. This hypothesis states that the average responses from the two groups are within 0.5 point of each other. This is half distance from each correlation width and within a .25 correlation of each other. With Bayesian methodology, the prior distribution requires specification. Three specifications—weak, flat, and informative prior—are used to determine equivalence.
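The probability attached to the equivalence hypothesis can be sketched in Python. The sketch assumes that the two group means for an item have independent normal distributions, so that their difference is also normal; the function name `prob_equivalent` and the numerical inputs are hypothetical and do not reproduce the study's results.

```python
from statistics import NormalDist


def prob_equivalent(mean1, sd1, mean2, sd2, margin=0.5):
    """P(|mu1 - mu2| < margin) for independent normal mu1 and mu2."""
    # The difference of independent normals is normal, with the variances added.
    diff = NormalDist(mean1 - mean2, (sd1 ** 2 + sd2 ** 2) ** 0.5)
    return diff.cdf(margin) - diff.cdf(-margin)


# Hypothetical distributions of the relevance and correlation group means.
print(round(prob_equivalent(3.10, 0.14, 3.00, 0.16), 3))
```

A value close to 1 indicates that nearly all of the probability for the mean difference lies inside the ±0.5 margin.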

The posterior distribution uses a Bayesian two-sample comparison (Carpenter et al., 2008). The posterior probability is calculated to determine the truth of the hypothesis. Note that inductive Bayesian reasoning allows a probability to be assigned to a hypothesis. For the weak prior, a normal distribution of $\mu_{jm}$ was assumed, wherein the mean is set at the midpoint (2.5) of the 4-point scale and the standard error is 1/√4. Under this prior, the probability of the hypothesis being true is about that of a flip of a coin. The flat prior incorporates the parameters of the weak prior; however, the sample size is set at zero. No information is provided because there are no subjects; this is also viewed as a skeptical prior. The prior probability that the hypothesis is true is 0%. The informative prior consists of data from a previous study (Gajewski et al., in press) measuring equivalence between perceived relevance and correlation among NDNQI site coordinators (RN contact members of their respective healthcare facility receiving NDNQI services). The prior probability that the hypothesis is true is calculated as equal to or greater than 86%. An example of the posterior probability calculation is found in a previous study using Bayesian methodology (Carpenter et al., 2008; Oberhelman, 2009). The assumption that the mean was distributed normally (the likelihood) was tested using the bootstrap (Gajewski, Hall, & Dunton, 2007). Resamples (10,000) within each group were obtained, and the sample mean was calculated for each resample. Then, normal probability plots for all eight items by group were produced, and the normal assumption was found to be valid. This assumption is consistent with the central limit theorem.
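The bootstrap check of the normality assumption can be sketched in Python. The ratings below are hypothetical, and the correlation between the sorted bootstrap means and standard normal quantiles is used here as a numerical stand-in for inspecting a normal probability plot; the study itself reports plotting, not this summary statistic.

```python
import random
from statistics import NormalDist


def bootstrap_means(ratings, n_resamples=10_000, seed=None):
    """Resample the ratings with replacement and return each resample's mean."""
    rng = random.Random(seed)
    n = len(ratings)
    return [sum(rng.choices(ratings, k=n)) / n for _ in range(n_resamples)]


def qq_correlation(values):
    """Correlation between sorted values and standard normal quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_v = sum(ordered) / n
    mean_q = sum(quantiles) / n
    cov = sum((v - mean_v) * (q - mean_q) for v, q in zip(ordered, quantiles))
    var_v = sum((v - mean_v) ** 2 for v in ordered)
    var_q = sum((q - mean_q) ** 2 for q in quantiles)
    return cov / (var_v * var_q) ** 0.5


# Hypothetical ratings on the 4-point scale from a group of 37 respondents.
ratings = [4] * 15 + [3] * 14 + [2] * 6 + [1] * 2
means = bootstrap_means(ratings, seed=1)
print(round(qq_correlation(means), 3))
```

A correlation near 1 corresponds to points falling close to a straight line on a normal probability plot, which supports treating the sample mean as approximately normal.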

Of those randomized to the correlation and relevance groups, 87 participants completed their assigned survey. There were 37 participants in the correlation group and 50 participants in the relevance group. Of the 87 individuals, 85 participants were RNs. Two of the 87 participants in the study did not complete participant characteristic items on the survey. Participant characteristics are presented in Table 1.

More than 60% of the respondents agreed that each item was either quite or highly relevant (Response Option 3 or 4) or moderately or largely correlated (Response Option 3 or 4) to the construct job enjoyment, except for Item 6 (are clinically competent; Table 2). Item 6 is an item collected by NDNQI as part of the Practice Environment Scale (Gajewski et al., 2010). This item was placed in the survey to determine respondents’ attentiveness in determining item quality and validity to the construct. Missing responses did not exceed 5% for each item within each group.

Analysis of the posterior distribution with the weak prior shows that all items, except for Item 3 (have to force themselves to work much of the time), have posterior probabilities greater than .90. However, Item 3 was close to the probability of .90. Based on the given model, seven of the eight items have >90% probability that the means are within 0.5 point. The analysis with the flat prior shows that six of eight items have posterior probabilities of >.90. The posterior probability for Item 3 decreased, and Item 7 (feel that each day on their job will never end) decreased to .88. All posterior distributions analyzed with the informative prior have a probability of >.90. All posterior probability results are presented in Table 3, and an example of the calculation of the posterior probability is shown in Figures 1 and 2.

The prior information (flat, weak, and informative) indicates respective prior probabilities (0%, 52%, and ≥86%) that the relevance scale and correlation scale are equivalent, with a mean difference within ±0.5. Once the prior is combined with the new information (likelihood), the probability is altered, as shown by the posterior probability (Figure 3). On the basis of the posterior distributions derived from the flat, weak, and informative priors, the probabilities that the correlation and relevance scales are equivalent (group mean difference within ±0.5) are 83%–99%, 87%–99%, and 95%–99%, respectively. Of interest is the strength of the posterior probability (83%–99%) when a 0% prior probability was combined with the likelihood. Item 3 shows a less than desired posterior probability (sensitivity of the item to the construct). The item’s wording differed from the actual scale item for study replication purposes, and this difference may account for the smaller posterior probability. The wording of the item in the Job Enjoyment Scale, have to force themselves to come to work much of the time, and the wording used in the study, have to force themselves to work much of the time, are very different in context. It is possible that commitment to the organization is affected by nurses’ job enjoyment versus commitment to caring for patients. The item in the study may implicate enjoyment with the nursing profession as opposed to a nursing job.
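The prior probabilities for the weak and flat priors can be checked with a short Python sketch. It assumes that the two group means have independent normal priors with the same mean, so that their difference is centered at zero; the function name `prior_prob_equivalent` is hypothetical. Under this assumption, the weak prior's standard error of 1/√4 = 0.5 gives a prior probability of about 52%, and increasingly diffuse priors drive the probability toward the 0% attributed to the flat prior.

```python
from statistics import NormalDist


def prior_prob_equivalent(prior_sd, margin=0.5):
    """Prior P(|mu1 - mu2| < margin) for two independent normal priors
    that share the same mean and the same SD `prior_sd` (an assumption)."""
    # Difference of the two prior means: mean 0, variance 2 * prior_sd**2.
    diff = NormalDist(0.0, prior_sd * 2 ** 0.5)
    return diff.cdf(margin) - diff.cdf(-margin)


# Weak prior: standard error 1/sqrt(4) = 0.5 for each group mean.
print(round(prior_prob_equivalent(0.5), 2))  # 0.52

# Increasingly diffuse priors approach the flat-prior limit of 0.
for sd in (5, 50, 500):
    print(sd, round(prior_prob_equivalent(sd), 4))
```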

Notably, this case study exemplifies how previous knowledge from a small sample size may be included in current data to generate new information (prior and posterior probability). Furthermore, measuring equivalence between rating scales has led to future research in the area of experts possibly interpreting content information as a correlation, because relevance translates to a correlation. Note that this is a partial replication of the study (Gajewski et al., in press), but on a different set of participants and analysis with three different priors. Six percent of that sample had doctorate-level degrees. Posterior probability results incorporating the informative prior were similar to the previous study—all items had a posterior distribution of >.90 (90%) based on the exact model. This study confirms the use of content validity information as a valid basis for a prior distribution in construct validity.

Limitations of this study are the sample size and the classification of scale. The correlation group had smaller sample size than the relevance group and did not have the sample size to achieve the Bayesian power of 90%. Classification of the scales was treated as interval data for this study. Treating the scale values as ordinal data could yield stronger results. The strength of this study shows how prior information has the ability to pull observed data results (likelihood probability) with three different priors and establishes the likelihood’s durability when integrated with the priors.

This study shows an alternative case application of Bayesian methodology in determining equivalence between two rating scales. Although the calculation within Bayesian methodology might be new to some researchers, the posterior probability can be interpreted easily. Application of Bayesian methodology to strategies of content validity measurement of survey tools has potential in the emergence of new knowledge. A content validity study integrated with a Bayesian design and analysis in an equivalence trial has several advantages. Using prior information with an ongoing study may provide new and more efficient processes (e.g., the number of participants within a study may be decreased). The Bayesian methodology is flexible enough to allow for adjustment of evolving prior information to be integrated into the likelihood, creating an extension of knowledge without having to design studies at the ground level. In addition, an equivalence trial enables researchers to decide if measuring validity of items via a 4-point Likert-type scale is equivalent to selecting a perceived correlation range.

Adamina M., Tomlinson G., Guller U. (2009). Bayesian statistics in oncology: A guide for the clinical investigator. Cancer, 115, 5371–5381. doi: 10.1002/cncr.24628.
Bannigan K., Watson R. (2009). Reliability and validity in a nutshell. Journal of Clinical Nursing, 18, 3237–3243. doi: 10.1111/j.1365-2702.2009.02939.x.
Carpenter J., Gajewski B., Teel C., Aaronson L. S. (2008). Bayesian data analysis: Estimating the efficacy of T’ai Chi as a case study. Nursing Research, 57, 214–219. doi: 10.1097/NNR.0b013e3181e46062.
Cohen J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen P., Cohen J., Aiken L. S., West S. G. (1999). The problem of units and the circumstance for POMP. Multivariate Behavioral Research, 34, 315–346. doi: 10.1207/S15327906MBR3403.
Davis L. L. (1992). Instrument review: Getting the most from your panel of experts. Applied Nursing Research, 5, 194–197. doi: 10.1016/S0897-1897(05)80008-4.
DeVon H. A., Block M. E., Moyle-Wright P., Ernst D. M., Hayden S. J., Lazzara D. J., Kostas-Polston E. (2007). A psychometric toolbox for testing validity and reliability. Journal of Nursing Scholarship, 39, 155–164. doi: 10.1111/j.1547-5069.2007.00161.x.
Gajewski B., Hall M., Dunton N. (2007). Summarizing benchmarks in the National Database of Nursing Quality Indicators using Bootstrap confidence intervals. Research in Nursing & Health, 30, 112–119. doi: 10.1002/nur.20166.
Gajewski B. J., Boyle D. K., Miller P. A., Oberhelman F., Dunton N. (2010). A multilevel confirmatory factor analysis of the Practice Environment Scale: A case study. Nursing Research, 59, 147–153. doi: 10.1097/NNR.0b013e3181d1a71e.
Gajewski B. J., Coffland V., Boyle D. K., Bott M. J., Price L., Leopold J., Dunton N. (in press). Assessing content validity through correlation and relevance tools: A Bayesian randomized equivalency experiment. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences. doi: 10.1027/1614-2241/a000040.
Gajewski B. J., Hart S., Bergquist-Beringer S., Dunton N. (2007). Inter-rater reliability of pressure ulcer staging: Ordinal probit Bayesian hierarchical model that allows for uncertain rater response. Statistics in Medicine, 26, 4602–4618. doi: 10.1002/sim.2877.
Gajewski B. J., Mahnken J. D., Dunton N. (2008). Improving quality indicator report cards through Bayesian modeling. BMC Medical Research Methodology, 8, 77–89. doi: 10.1186/1471-2288-8-77.
Gajewski B. J., Mayo M. S. (2006). Bayesian sample size collections in phase II clinical trials using a mixture of informative priors. Statistics in Medicine, 25, 2554–2566. doi: 10.1002/sim.2450.
Gill S. L., Reifsnider E., Lucke J. F. (2007). Effects of support on the initiation and duration of breastfeeding. Western Journal of Nursing Research, 29, 708–723. doi: 10.1177/0193945906297376.
Gill S. L., Reifsnider E., Lucke J. F., Mann A. R. (2007). Predicting breast-feeding attrition: Adapting the Breast-feeding Attrition Prediction Tool. The Journal of Perinatal & Neonatal Nursing, 21, 216–224. doi: 10.1097/01.JPN.0000285811.21151.37.
Goodman S. N. (2005). Introduction to Bayesian methods I: Measuring the strength of evidence. Clinical Trials, 2, 282–290. doi: 10.1191/1740774505cn098oa.
Jones E. G., Mallinson R. K., Phillips L., Kang Y. (2006). Challenges in language, culture, and modality: Translating English measures into American sign language. Nursing Research, 55, 75–81.
Kleiber C., Sorenson M., Whiteside K., Gronstal B. A., Tannous R. (2002). Topical anesthetics for intravenous insertion in children: A randomized equivalency study. Pediatrics, 110, 758–761. doi: 10.1542/peds.110.4.758.
Lynn M. R. (1986). Determination and quantification of content validity. Nursing Research, 35, 382–385.
Oberhelman, F. (2009, November 3). [Letter to the editor in reference to Carpenter et al. (2008)]. Retrieved from
Ong M. E. H., Chan Y. H., Teo J., Saroja S., Yap S., Ang P. H. Y., Lim S. H. (2008). Hair apposition technique for scalp laceration repair: A randomized control trial comparing physicians and nurses (HAT 2 study). American Journal of Emergency Medicine, 26, 433–438. doi: 10.1016/j.ajem.2007.07.008.
Polit D. F., Beck C. T. (2006). The content validity index: Are you sure you know what’s being reported? Critique and recommendations. Research in Nursing & Health, 29, 489–497. doi: 10.1002/nur.20147.
Rushforth H., Burge D., Mullee M., Jones S., McDonald H., Glasper E. A. (2006). Nurse-led paediatric pre operative assessment: An equivalence study. Paediatric Nursing, 18, 23–29.
Simon M., Yankovskyy E., Klaus S., Gajewski B., Dunton N. (2011). Midnight census revisited: Reliability of patient day measurements in US hospital units. International Journal of Nursing Studies, 48, 56–61. doi: 10.1016/j.ijnurstu.2010.07.002.
Spiegelhalter D. J., Abrams K. R., Myles J. P. (2004). Bayesian approaches to clinical trials and health-care evaluation. Sussex, England: John Wiley & Sons.
Taunton R. L., Bott M. J., Koehn M. L., Miller P., Rindner E., Pace K., Dunton N. (2004). The NDNQI-adapted index of work satisfaction. Journal of Nursing Measurement, 12, 101–122.
Tervo-Heikkinen T., Kiviniemi V., Partanen P., Vehviläinen-Julkunen K. (2009). Nurse staffing levels and nursing outcomes: A Bayesian analysis of Finnish-registered nurse survey data. Journal of Nursing Management, 17, 986–993. doi: 10.1111/j.1365-2834.2009.01020.x.
Turan F. N., Senocak M. (2007). Evaluating “superiority,” “equivalence” and “non-inferiority” in clinical trials. Annals of Saudi Medicine, 27, 284–288. doi: 10.4103/0256-4947.51490.
Weiss R. E. (2010). Bayesian methods for data analysis. American Journal of Ophthalmology, 149, 187.e1–188.e1. doi: 10.1016/j.ajo.2009.11.011.
Williams K. N., Herman R., Gajewski B., Wilson K. (2009). Elderspeak communication: Impact on dementia care. American Journal of Alzheimer’s Disease and Other Dementias, 24, 11–20. doi: 10.1177/1533317508318472.

Bayesian methodology; equivalence; rating scales

© 2012 Lippincott Williams & Wilkins, Inc.