The Nursing Student Self-Efficacy Scale: Development Using Item Response Theory : Nursing Research

Stump, Glenda S.; Husman, Jenefer; Brem, Sarah K.

Nursing Research 61(3):p 149-158, May/June 2012. | DOI: 10.1097/NNR.0b013e318253a750


Beliefs about ability influence the choices people make in many situations. Self-efficacy, or the belief in one’s ability to take actions to manage a future situation (Bandura, 1997), is often considered in an academic context but can include performance of psychomotor tasks as well. This makes the construct of particular interest to practice disciplines such as nursing, in which psychomotor skill acquisition is a critical component of student instruction (American Association of Colleges of Nursing, 2008). Bandura and others (e.g., Pajares & Urdan, 2006; Schunk & Zimmerman, 1997) have shown that when students are faced with a task, they will exert maximal effort and persist despite failure if they believe they are capable. In providing nursing care to patients, however, self-efficacy for the ability to perform patient care is vital for nurses in that it may be required even to initiate performance. This concern is supported by a study with osteopathic physicians showing that healthcare providers who lacked confidence in their abilities did not take needed actions for their patients (Johnson & Kurtz, 2001). Similarly, in the current healthcare climate, where great emphasis has been placed on decreasing patient errors and risk (Institute of Medicine, 2000), nurses who doubt their ability may not initiate tasks in order to avoid making mistakes.

In addition to possessing efficacy for task performance, it is important that students correctly calibrate their self-efficacy or make accurate estimates of their ability (Chen, 2003). In the provision of healthcare, inaccurate calibration of self-efficacy may lead to adverse patient outcomes. A nursing student who incorrectly believes that he or she is capable of performing a skill may harm the patient if he or she independently performs the skill instead of appropriately seeking help. Conversely, a student who experiences low self-efficacy for tasks may delay initiation or avoid them altogether, again leading to possible adverse consequences for the patient.

The potential influence of self-efficacy on nursing performance makes it important for nurse educators to optimize instruction that supports students’ accurate estimates of their ability. An appropriate first step in this process is to develop a measure of nursing self-efficacy that provides ample evidence for reliable and valid interpretation of students’ scores. Many currently published instruments measuring self-efficacy utilize classical true score theory (CTST) as the theoretical basis for measurement (e.g., Arnold et al., 2009; Brown & Chronister, 2009; Cheraghi, Hassani, Yaghmaei, & Alavi-Majed, 2009; Clark, Owen, & Tholcken, 2004). This model presents some limitations when test or survey items are analyzed and data are interpreted, such as generation of a single reliability estimate for an entire scale and sample dependent score interpretation (Embretson & Reise, 2000). Item response theory (IRT) provides an alternative to CTST.

Advantages of IRT

IRT procedures provide distinct advantages over those utilized in CTST with regard to item analysis, score interpretation, and reliability estimates. Within the CTST model, item parameters are sample dependent, meaning that an individual’s score may be higher if the items were easy or lower if the items were difficult, and the difficulty of the items may differ depending on the ability of individuals who completed them. When measuring abilities or attitudes of multiple groups using CTST, it is difficult to make comparisons across groups unless the same instrument is used, because scoring is relative to specific tests and groups of respondents (Embretson & Reise, 2000).

IRT procedures are used to estimate the amount of an individual’s attitude or ability, known as the latent trait level, from his or her pattern of responses to survey or test items. The procedures are also used to estimate an item’s location, referring to the relative ease or difficulty of positively endorsing the item or getting it correct. For multiple category response items such as Likert-type scale data, IRT procedures estimate the locations of item response categories on a continuum where respondents with a particular trait level are likely to respond. In the context of this study, IRT procedures estimated the level of self-efficacy for individuals who chose each response option in a given item. The estimated parameters—trait level and item or response category location—are not sample dependent, which allows for comparison of scores across multiple groups or multiple administrations of the instrument. The information obtained from IRT analysis is at an item level rather than a scale level as in CTST (Embretson & Reise, 2000).

Another advantage of using IRT for scale evaluation is that estimates of reliability coefficients are more precise. When CTST is used, a single reliability estimate is given for an entire test or scale, which implies that the scores are accurate at all ability or attitude levels; in reality, reliability of score interpretation is lower when an individual’s attitude or ability is at the extreme ends of the spectrum. With IRT, estimates of reliability vary with individuals’ trait levels (Embretson & Reise, 2000).

Study Purpose

The current study is unique in that IRT was used for scaling or establishing the relationship between students’ item responses and their level of self-efficacy. The purpose of this study was to use IRT to evaluate a measure of nursing students’ self-efficacy. The context of care for critically ill patients was chosen because a review of the literature showed that many current scales are either narrow in focus, such as self-efficacy for care related to electrocardiograms (Brown & Chronister, 2009), or cover a broad range of nursing practice, such as self-efficacy for care of patients with chronic illness in multiple settings (Clark et al., 2004).

A scale measuring self-efficacy for task-specific skills utilized in one generic setting would be useful for several reasons: It would extend measurement of this important motivational construct to situations in which it has been relatively unexplored, it may provide direction toward instructional strategies that could result in increased self-efficacy of nurses for provision of patient care, and it would provide a basis upon which to evaluate accurate calibration. The significance of well-calibrated self-efficacy in relation to task performance in the healthcare arena necessitates accurate measurement of this construct. Specific research questions addressed by this study were as follows: (a) Does an IRT model fit self-efficacy data obtained from nursing students? and (b) Can evidence be provided to support reliable and valid interpretation of scores from a self-efficacy scale designed specifically for nursing students?


Item Development

As recommended by Bandura (2006), a domain analysis was conducted prior to scale development to determine which aspects of self-efficacy should be measured. Student beliefs were measured for tasks in two of the four core competencies for nursing practice identified as being among those requisite for a sound nursing education (American Association of Colleges of Nursing, 2008)—technical skills, referred to as psychomotor skills, and communication skills. These competencies were selected from those recommended by the American Association of Colleges of Nursing because students’ self-efficacy could be measured and later compared with their performance in patient care scenarios to determine whether they calibrated their self-efficacy accurately. A critical care nursing faculty member assisted in determination of the skills and activities relevant to the care of critically ill patients. This list was not exhaustive but was, instead, targeted at critical care skills required during patient care scenarios typically used for critical care instruction. The tasks identified for each competency were ordered by a hypothesized level of difficulty, from easiest to most difficult (Wilson, 2005). Initial review by three nursing experts, two experienced critical care faculty and one experienced nursing educator, corroborated that the skills used in item development were ones required for care of a critically ill patient.

Following this process, 70 items were constructed for the 25 tasks identified in the domain analysis. Between two and five items were created for each task; they differed only in word choice. After approval from the institutional review board, the items were evaluated via two pilot studies conducted with 10 volunteer nursing students in their final semester of a university nursing program. During both pilot studies, participants were asked to select the most clearly stated item for each task from the differently worded versions. They were also asked to identify items containing unfamiliar terminology and were subsequently interviewed about items they found to be unclear. At the conclusion of the pilot study phase, the instrument contained 26 items representing the originally identified 25 tasks.

Prior to the final scaling study, five different nursing experts reviewed the items and two experts re-reviewed them. All experts were current or former members of the critical care nursing faculty. The experts were in 100% agreement regarding item content and in 92% agreement regarding the way the items were worded.

Nursing Student Self-Efficacy Survey

The final product of the pilot studies and expert reviews was a 26-item scale with six additional multiple-choice demographic items and one open-ended response item for participant comments. The item response options were on a Likert-type scale, with scores ranging from 1 (not at all confident) to 5 (completely confident). There were 18 items related to psychomotor skills, used to assess students’ self-efficacy to perform psychomotor tasks related to the care of critically ill patients and 8 items related to communication skills, used to assess students’ self-efficacy to communicate with patients and other providers about patient care.

Participants and Procedure

The Nursing Student Self-Efficacy Survey (NSSES) was administered to 421 volunteer nursing students enrolled in one of the four semesters of study required to complete their nursing program. The sample contained 272 students from four semesters of a nursing program at a large public university in the southwestern United States and 149 nursing students from the first three semesters of a nursing program offered at a community college in the same region. After removal of incomplete or duplicate surveys, data from 405 students remained for analysis, representing approximately 63% of the students available for survey. The participants’ mean age was 27.6 years, and 12.6% were male students. Twenty-six percent of the participants reported being in the first semester of their nursing program, 23% reported being in the second semester, 31% reported being in the third semester, and 20% reported being in the fourth semester or graduating within the previous 4 weeks. Participants completed the survey in either an online or hard copy format and were given a small financial incentive for their participation.


Preliminary Analysis

Descriptive statistics were calculated for all items. Overall, participants marked most items favorably, with 17 of the 26 items having a mean greater than 2.5. Item means ranged from 1.99 to 4.95 with standard deviations ranging from 0.28 to 1.56; item variances ranged from 0.08 to 2.44. Items on which participants did not use the full range of item response options and items in which response option frequencies were low in particular categories were noted, as these characteristics would affect subsequent IRT parameter estimation (Embretson & Reise, 2000).


The generalized partial credit model (Muraki, 1992) was chosen to fit to the data in this study. The generalized partial credit model is appropriate for use with polytomous data that have ordered item response categories like those of a Likert-type scale, wherein a response of 3 indicates more of the latent trait than a response of 2 and a response of 2 indicates more of the trait than a response of 1. The model is also appropriate for use when respondent guessing is not anticipated and when items with varying degrees of discrimination are used (Embretson & Reise, 2000). The generalized partial credit model is expressed as
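A standard way to write the model, consistent with the symbols defined in the next paragraph, is sketched here. The category indexing (1 through M, with the first step term fixed at zero) and the summation indices v and c are assumptions made for this sketch and may differ from the article's own typesetting.

```latex
P_{jk}(\theta) =
  \frac{\exp\left[\sum_{v=1}^{k} a_j\,(\theta - b_{jv})\right]}
       {\sum_{c=1}^{M} \exp\left[\sum_{v=1}^{c} a_j\,(\theta - b_{jv})\right]},
\qquad k = 1, \dots, M, \qquad a_j\,(\theta - b_{j1}) \equiv 0 .
```

Under this convention, an item with M = 5 response options has four free step difficulty parameters, b_j2 through b_j5.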

In this study, P_jk(θ) represented the probability of giving a particular response given a particular level of self-efficacy; x equaled the observed student response to a Likert-type item, with M being the highest possible response (5 in this study); a_j equaled the item’s ability to discriminate between those with high or low levels of self-efficacy; and b_jk equaled the difficulty in moving from a response in a lower category (k − 1) to the next higher category (k) for item j (Embretson & Reise, 2000).

The difficulty in moving from a lower to the next higher category is known as the step difficulty parameter. A higher value of the step difficulty parameter would indicate that the step, or moving to the next response option, is more difficult, requiring higher levels of self-efficacy relative to lower steps within the item (Embretson & Reise, 2000).
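The role of the step difficulty parameter can be illustrated with a minimal sketch of the category probabilities under a generalized partial credit model. The function name `gpcm_probs` and the slope and step values below are hypothetical and are not estimates from this study; the first category's exponent is fixed at zero by assumption.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities for one item under a generalized partial credit model.

    theta : latent trait level (here, self-efficacy)
    a     : slope (discrimination) parameter of the item
    steps : step difficulties for moving from category k-1 to k, lowest step first
    """
    # Cumulative exponents; the lowest category's exponent is fixed at 0.
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(z - top) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical five-category item (illustrative values only).
a_example = 1.2
steps_example = [-1.5, -0.5, 0.4, 1.3]

# At a trait level equal to the second step difficulty, categories 2 and 3
# are equally likely: the step marks where adjacent curves intersect.
probs = gpcm_probs(-0.5, a_example, steps_example)
print([round(p, 3) for p in probs])
```

A higher step value shifts that intersection to the right, so a respondent needs more self-efficacy before the higher category becomes the more likely of the two.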

Estimation of item parameters with a generalized partial credit model also produces a unique slope parameter for each item. In polytomous models and in the context of this study, the ability of an item to discriminate between students with high or low levels of self-efficacy depends on a combination of the slope parameter and the span of the step difficulty parameters. The slope parameter of an item indicated how much students’ responses differed at various levels of self-efficacy (Muraki, 1992). It also reflected the item’s relationship to the trait being measured, similar to an item–total correlation in CTST. An item’s ability to discriminate determines the amount of information that can be reliably obtained about the latent trait level, or students’ self-efficacy in this study (Embretson & Reise, 2000).

Evaluation of Model Assumptions

Prior to parameter estimation, the IRT assumptions of appropriate dimensionality and local item independence were evaluated. Appropriate dimensionality should be confirmed prior to analysis in order for the parameter estimates to be accurate indicators of participants’ trait level (Embretson & Reise, 2000). In this case, the chosen IRT model assumed unidimensional data, which meant that only one latent trait, self-efficacy, should explain the variance in student responses to the items. Similarly, local item independence means that the items should be correlated only through their relationship with the latent trait, and thus, if the latent trait was removed, the items would no longer be correlated (Embretson & Reise, 2000). If this assumption held true for this study, student responses for one item would rely only on their self-efficacy for the behavior stated in that item and would not be dependent on their response to any other item.

Dimensionality was initially assessed by exploratory factor analysis. A one- through six-factor solution was evaluated for the data using the Mplus program (Muthén & Muthén, 2007). Using a criterion of 0.40 for acceptable factor loading (Tabachnick & Fidell, 2007), a five-factor solution initially provided acceptable model fit to the data (RMSEA = .082, CI [0.076, 0.088], CFI = .974) and explained 75% of the variance in the items. One item, PMS1, was removed from the analysis based on a low factor loading on all possible factors (Reise, Waller, & Comrey, 2000). Items PMS2 and PMS3 were also removed due to their formation of a two-item factor.

Local item independence was assessed by analyzing correlations among variables. The correlations for three pairs of items were greater than .90; the items in each pair were very similar (e.g., PMS7 related to correctly monitoring a blood transfusion, whereas PMS8 related to correctly starting a blood transfusion). One item from each pair was eliminated.
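The screening step described above can be sketched as follows. This is only an illustration: the type of correlation used in the study is not stated, so the Pearson coefficient is an assumption, and the function names and response data are made up.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equally long lists of item responses."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def redundant_pairs(responses, threshold=0.90):
    """Return item pairs whose correlation exceeds the threshold."""
    names = sorted(responses)
    return [(p, q) for p, q in combinations(names, 2)
            if pearson(responses[p], responses[q]) > threshold]

# Made-up responses from six students (not data from the study).
toy = {
    "PMS7": [1, 2, 3, 4, 5, 3],
    "PMS8": [1, 2, 3, 4, 5, 4],
    "COM1": [5, 4, 5, 3, 4, 5],
}
print(redundant_pairs(toy))
```

Pairs flagged this way are candidates for dropping one member, as was done for the three highly correlated pairs.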

The remaining 20 items were reanalyzed, and a two-factor solution then provided the best fit to the data. This solution provided two interpretable factors with factor loadings greater than 0.57, explaining 66% of the item variance. Examination of the items’ residual correlations provided evidence of local independence of the items. The factors identified upon completion of the exploratory analysis related to communication skills and psychomotor skills. The items and respective factor loadings are shown in Table 1. Rather than use advanced IRT methods to analyze a multidimensional scale, these two factors were then analyzed as separate unidimensional subscales: psychomotor skills (PMS) and communication skills (COM).

Table 1. Geomin Rotated Factor Loadings for Reduced Nursing Student Self-Efficacy Survey Subscales

Examination of response option use by participants prior to parameter estimation showed that, of the remaining 20 items in the two subscales, Response Option 1 (not at all confident) was chosen 3% of the time or less for all eight of the COM items. This suggested a ceiling effect for these items; their higher mean score may have indicated that participants considered the items to be very easy, which explained their preference for choosing Category 2 or higher. Because item parameter estimates are affected adversely by low numbers of responses in a particular category (de Ayala, 2009), Categories 1 and 2 were collapsed for items COM2, COM3, COM4, COM5, COM6, and COM8, leaving four response options instead of five. Both Response Options 1 and 2 were underutilized for items COM1 and COM7; thus, Categories 1, 2, and 3 were collapsed for these items (Muraki, 1997).
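Collapsing sparse categories amounts to recoding the low response options into one category and renumbering the rest. A minimal sketch follows; the function names and the sample responses are hypothetical.

```python
from collections import Counter

def category_shares(responses, n_categories=5):
    """Proportion of responses falling in each category 1..n_categories."""
    counts = Counter(responses)
    return {k: counts.get(k, 0) / len(responses)
            for k in range(1, n_categories + 1)}

def collapse_low(responses, merge_up_to):
    """Merge categories 1..merge_up_to into category 1 and renumber the rest."""
    return [max(r - (merge_up_to - 1), 1) for r in responses]

# Made-up responses to one communication item (not data from the study).
com_item = [2, 3, 4, 5, 5, 4, 3, 5, 4, 1]

print(category_shares(com_item))   # shows how rarely option 1 is used
print(collapse_low(com_item, 2))   # five options reduced to four
print(collapse_low(com_item, 3))   # five options reduced to three
```

Merging Categories 1 and 2 leaves four response options, and merging Categories 1, 2, and 3 leaves three, which matches the recoding described for the COM items.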

Parameter Estimation

Parameter estimation was completed with PARSCALE4 (Muraki & Bock, 2003), using marginal maximum likelihood with expectation maximization (Bock & Aitkin, 1981) to estimate item parameters and expected a posteriori scoring (Bock & Mislevy, 1982) to estimate the latent trait parameters. The students’ response patterns to the self-efficacy items were used in the marginal maximum likelihood and expectation maximization estimation procedures to approximate four step difficulty parameters and one discrimination parameter for each item, along with an information function, which provided reliability evidence. Following item parameter estimation, the expected a posteriori scoring procedure estimated one latent trait (self-efficacy) value for each participant.
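The idea behind expected a posteriori scoring can be sketched as a posterior mean computed over a grid of trait values. This is not PARSCALE's implementation: the standard normal prior, the evenly spaced grid, the function names, and the item parameters are all assumptions chosen for illustration.

```python
import math

def gpcm_probs(theta, a, steps):
    """Generalized partial credit category probabilities (lowest exponent fixed at 0)."""
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    top = max(exponents)
    weights = [math.exp(z - top) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def eap_score(responses, items, n_points=61, lo=-4.0, hi=4.0):
    """Posterior mean of the trait given one respondent's answers.

    responses : chosen category (1-based) for each item
    items     : list of (slope, step_difficulties), assumed already estimated
    """
    grid = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]
    numerator = denominator = 0.0
    for theta in grid:
        # Standard normal prior, up to a constant that cancels in the ratio.
        weight = math.exp(-0.5 * theta * theta)
        for x, (a, steps) in zip(responses, items):
            weight *= gpcm_probs(theta, a, steps)[x - 1]
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

# Hypothetical item parameters (not the estimates reported in Table 2).
items = [(1.2, [-1.0, -0.5, 0.5, 1.0]), (0.8, [-1.0, -0.5, 0.5, 1.0])]
print(round(eap_score([5, 5], items), 2))  # confident respondent
print(round(eap_score([1, 2], items), 2))  # less confident respondent
```

Because the prior pulls estimates toward the mean, respondents with extreme response patterns receive finite scores rather than unbounded ones.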

Parameter estimates for all subscale items are presented in Table 2. When slope parameters of greater than 1.0 were considered acceptable as recommended by Hambleton, Swaminathan, and Rogers (1991), 40% of all items discriminated well between participants with high and low levels of self-efficacy. The items that were best at discriminating between levels of students’ self-efficacy were PMS items that addressed care and monitoring of arterial or venous lines, closed suctioning of a ventilated patient, and care of a patient on a ventilator, as well as COM items that addressed documentation of patient assessment, discussion of nursing procedures with patients, and documentation of patient care. The lowest slope parameter was found for item COM7, which addressed communicating to develop a therapeutic nurse–client relationship. Categories 1, 2, and 3 had been collapsed earlier for this item due to the low number of student responses in those categories; the low slope parameter reflected the loss of information provided by the item due to collapsing the category responses (Muraki, 1993).

Table 2. Item Parameter Estimates for Subscales

Following parameter estimation, the category response curves (CRCs) were examined for each item. The CRCs illustrate the relationship between the probability of a student’s response in one particular category and his or her level of self-efficacy (Donoghue, 1994). For example, Figure 1 illustrates how likely it was that students with various levels of self-efficacy chose each of the four available response categories for the COM6 item, “accurately document care given to a patient.” Curve 1 represents the probability of a student choosing Response Option 1 (not at all confident). Students with a lower self-efficacy level of −2.0 had approximately a .55 probability of choosing a response in the first category and a lower probability of responding in the second category (somewhat confident), as represented by Curve 2. The probability of these same students responding in the third category (moderately confident), represented by Curve 3, was very low. Alternatively, if students were estimated to have a higher self-efficacy level of +1, their probability of selecting a response in the fourth category (completely confident) was high at approximately .8. These students had a lower probability of responding in the second or third category and essentially zero probability of responding in Category 1. Also noted was the intersection of each item’s CRCs, illustrating where the step difficulty parameters were located in relation to levels of self-efficacy. Examination of CRCs and step difficulty parameters provides an indication of the overall difficulty of the item. For example, if all of the CRCs of a self-efficacy item were located below 0 on the continuum, this would indicate that the item was fairly easy, because even students with a low estimated self-efficacy level were likely to choose the highest response (completely confident).

Figure 1. Category response curves for one item (COM6) with four response options.

Examination of the CRCs also revealed disordinal step difficulty parameters, or reversals (Dodd & Koch, 1987), for four psychomotor subscale items and one communication subscale item. When step difficulty parameters are out of order, moving between two higher categories is easier for respondents than moving between two lower categories (Muraki, 1997). The reversal common to all of the psychomotor items is illustrated in Figure 2, in which moving from the second category (somewhat confident) to the third category (moderately confident) occurred at a lower level of self-efficacy than moving from the first category (not at all confident) to the second category. There was also a higher probability of responses in Categories 1 and 3 than in Category 2. The presence of reversals in category intersection parameters indicates that at least one response option is not chosen frequently at any particular trait level (Andrich, 1988). This can serve as an indication that fewer response options may be needed for that item. In the data, all except one of the items with reversals were related to skill performance. For these skills, students did not choose Response Option 2 as frequently as the others. Their responses indicated that, in general, they felt either a total lack of confidence or moderate confidence in their ability to perform these skills, with little variation in between.
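A reversal can be detected directly from the estimated step difficulties, and its consequence for the skipped category can be checked numerically. The sketch below uses hypothetical parameter values (not those of PMS5) and hypothetical function names.

```python
import math

def gpcm_probs(theta, a, steps):
    """Generalized partial credit category probabilities (lowest exponent fixed at 0)."""
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    top = max(exponents)
    weights = [math.exp(z - top) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def find_reversals(steps):
    """Return (step, next_step) pairs, numbered from 1, that are out of order."""
    return [(k + 1, k + 2) for k in range(len(steps) - 1)
            if steps[k] > steps[k + 1]]

def ever_most_likely(category, a, steps):
    """Check on a grid from -4 to +4 whether a category is ever the most likely response."""
    for i in range(-40, 41):
        probs = gpcm_probs(i / 10, a, steps)
        if probs[category - 1] == max(probs):
            return True
    return False

# Hypothetical item in which Step 1 is harder than Step 2.
a_item, steps_item = 1.0, [0.3, -0.4, 0.9, 1.6]

print(find_reversals(steps_item))               # steps 1 and 2 are reversed
print(ever_most_likely(2, a_item, steps_item))  # Category 2 is never the modal response
```

When Step 1 exceeds Step 2, there is no trait level at which Category 2 is more probable than both of its neighbors, which is why the option appears underused.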

Figure 2. Category response curves for PMS5 with reversal of Step 1 and Step 2.

Model Fit

The goodness-of-fit of the generalized partial credit model (Muraki, 1992) to the data was examined at the item and subscale levels. Item fit was examined both statistically and graphically. For statistical analysis, the chi-square statistic for each item was evaluated (Muraki, 1997) using an alpha of .001. If the chi-square was significant with p ≤ .001, the item parameters were considered to be significantly different from those specified in the model. All items exhibited acceptable fit when alpha was set at .001 (Table 3).

Table 3. Item–Subscale Fit

The overall fit of the generalized partial credit model to each subscale was evaluated using the chi-square statistic. Model fit to the psychomotor and communication subscales was supported, χ²(390, n = 405) = 446.64, p > .01 and χ²(145, n = 405) = 172.90, p > .01, respectively.

Graphical confirmation of item fit was evaluated by comparing the empirical and model-implied fit plots using MODFIT (Stark, 2008) to look for correspondence between the CRCs of empirical and model-implied data (Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001). The fit plots showed small to moderate discrepancies between the empirical data and the expected probabilities in all response categories, with increased discrepancies at the highest and lowest levels of self-efficacy for multiple items. Reasonable correspondence between the empirical and model-implied curves provided further support for model-data fit (Figure 3).

Figure 3. Model-implied and empirical category response curves for one response option of PMS11 (Category 5, completely confident).

Reliability Evidence

In IRT models, reliability is judged by the amount of information provided by the individual items as well as the entire test or scale. In polytomous IRT models, the information functions from each response category for an item are combined to arrive at the information provided by the item (Dodd, de Ayala, & Koch, 1995). Items that are more discriminating provide more information about the latent trait (Embretson & Reise, 2000). In addition, item information curves illustrate the trait level where the most information can be obtained (Figure 4). Because there are no agreed-upon criteria for judging the adequacy of item information, the graphical representation of item information functions was evaluated, as recommended by Fletcher and Hattie (2004) and Muraki (1993). A review of individual item information functions (curves) for NSSES items showed that most psychomotor items provided moderate to high information and communication items provided a mix of high and low information.

Figure 4. Item information curve showing high information at −1 to +0.5 level of self-efficacy.

A review of test information functions showed the amount of information that each subscale collectively provided at all levels of self-efficacy. The information function was also used to calculate the standard error of measurement for each level of self-efficacy. Smaller standard errors indicated more precise measurement (Embretson & Reise, 2000). Test information functions for the items showed that the psychomotor subscale provided the most precise information with low standard error when students had estimated self-efficacy levels ranging from −1.0 to +2.0 (Figure 5a) and the communication subscale provided the most precise measurement between self-efficacy levels of −2.2 and +0.6 (Figure 6a). As mentioned previously, this is a significant difference between results of IRT and CTST scaling. Whereas IRT procedures revealed a difference in precision of measurement between various levels of self-efficacy, CTST procedures did not; the most commonly used CTST measure of reliability, Cronbach’s alpha, was .94 for the psychomotor subscale and .85 for the communication subscale.
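The link between information and standard error can be sketched numerically. For a generalized partial credit item, information at a given trait level equals the squared slope times the variance of the category score at that level; test information is the sum over items, and the standard error is its inverse square root. The function names and item parameters below are hypothetical, and no scaling constant is applied.

```python
import math

def gpcm_probs(theta, a, steps):
    """Generalized partial credit category probabilities (lowest exponent fixed at 0)."""
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    top = max(exponents)
    weights = [math.exp(z - top) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, a, steps):
    """Squared slope times the variance of the category score at theta."""
    probs = gpcm_probs(theta, a, steps)
    mean = sum(k * p for k, p in enumerate(probs, start=1))
    second_moment = sum(k * k * p for k, p in enumerate(probs, start=1))
    return a * a * (second_moment - mean * mean)

def standard_error(theta, items):
    """Standard error of measurement implied by the summed item information."""
    test_information = sum(item_information(theta, a, s) for a, s in items)
    return 1.0 / math.sqrt(test_information)

# Hypothetical subscale of three items (not the estimates in Table 2).
subscale = [
    (1.5, [-1.0, -0.3, 0.3, 1.0]),
    (1.1, [-0.8, 0.0, 0.6, 1.4]),
    (0.7, [-1.2, -0.4, 0.2, 0.9]),
]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, subscale), 2))
```

The standard error is smallest where the step difficulties cluster and grows toward the extremes, which is the pattern a single coefficient such as Cronbach’s alpha cannot show.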

Figure 5. Test information and respondent trait level for psychomotor subscale: (a) test information for psychomotor subscale, (b) respondent trait level for second semester participants (n = 93), and (c) respondent trait level for third semester participants (n = 124).
Figure 6. Test information and respondent trait level for communication subscale: (a) test information for communication subscale, (b) respondent trait level for first semester participants (n = 106), and (c) respondent trait level for second semester participants (n = 93).

To measure self-efficacy with the greatest precision in most students, the distribution of students’ estimated self-efficacy levels should mirror the test information function. In this study, distributions of students’ self-efficacy were examined when grouped by their semester in the nursing program, as self-efficacy levels were expected to vary depending on students’ progress through the curriculum. When these self-efficacy level distributions were compared with the test information functions, the most accurate measurement by the psychomotor subscale occurred at self-efficacy levels where most second and third semester students were located (Figure 5b and c); the most accurate measurement by the communication subscale occurred at self-efficacy levels where most first and second semester students were located (Figure 6b and c).

Validity Evidence

Items were created in accord with theoretical, professional discipline, and expert recommendations as detailed earlier, thus providing content validity evidence. Structural validity evidence was provided by exploratory factor analysis results, which supported two separate factors measuring self-efficacy for the two competencies selected.

Students’ self-efficacy scores, as estimated by IRT procedures, were used to provide discriminant validity evidence in the same manner that scores from a measure scaled by CTST procedures would be used; differences were examined in estimated self-efficacy levels of students in different semesters of the nursing program. It was reasonable to anticipate differences in self-efficacy by virtue of students’ progress in the nursing curriculum.

Results of a one-way ANOVA showed significant mean differences in self-efficacy for psychomotor skills, F(3, 205.54) = 179.74, p < .01, and communication, F(3, 398) = 13.87, p < .01, between the four groups of students. Post hoc comparisons to evaluate pairwise differences showed significant differences in self-efficacy between all semester levels for psychomotor skills, but significant differences for communication were found only between the first and subsequent semester students (first and second, first and third, first and fourth) and between the second and fourth semester students. Identical results were obtained in a CTST framework using students’ PMS and COM subscale scores, calculated by averaging their responses to scale items, as estimates of their self-efficacy.


The results of this study showed that an IRT model can be fitted to self-efficacy data and that graphical illustrations of item information, test information, CRCs, and empirical versus implied model fit can be helpful in determining item characteristics. In addition, detailed reliability and standard error information can be obtained for individual items as well as the entire test, which is an advantage of IRT over traditional CTST scaling procedures. Because IRT procedures estimate reliability at all levels of the trait, they can pinpoint the population that could be measured most reliably with a particular scale. Data provided by IRT estimation procedures can be used in further analysis (e.g., ANOVA) in the same way that scores from CTST can be used.

Although the sample in this study was heterogeneous with regard to level of self-efficacy and the full range of response options was utilized for most items, self-selected participation may have attracted participants who were, in general, more efficacious than the population of nursing students. Students were assured that their survey responses were anonymous, but it is possible that they were uncomfortable admitting low self-efficacy. In addition, items in the communication subscale showed a possible ceiling effect, which would indicate inadequate measurement of self-efficacy for those with high levels of the trait.

Further refinement of the items is necessary before the subscales can be widely used. Increasing difficulty of the communication items and revision of the psychomotor items that had discrimination values of less than 1 will increase the amount of information that can be obtained from the items. A differential item functioning analysis is also an important next step to examine whether the items function similarly among different groups of students. Additional construct validity evidence is also necessary, particularly with regard to the ability of the NSSES to predict students’ effort and persistence at difficult tasks.

After revision and further study, the NSSES can be used over time in nursing student samples to obtain measures of self-efficacy that are not sample dependent. The NSSES can be used to give formative feedback to students as they progress through the nursing curriculum, providing guidance toward learning experiences that could improve their self-efficacy for performance of particular skills. Scores can be used to direct faculty attention toward areas where educational intervention or remediation may benefit the students. More importantly, accurate measurement of this construct will provide a basis upon which to evaluate accurate calibration, a vital skill for nurses.

In conclusion, results from this study indicate that IRT models can be fitted appropriately to polytomous data generated from self-efficacy items. Although CTST and IRT procedures produce similar results in some respects, the visual depiction of reliability and standard error from item and test information curves, along with the visual impression of item difficulty generated by CRCs, conveys details about item characteristics that are not easily available in the more traditional CTST framework.


American Association of Colleges of Nursing. (2008). Revision of the essentials of baccalaureate education for professional nursing practice. Retrieved from
Andrich D. (1988). A general form of Rasch’s extended logistic model for partial credit scoring. Applied Measurement in Education, 1, 363–378.
Arnold J. J., Johnson L. M., Tucker S. J., Malec J. F., Henrickson S. E., Dunn W. F. (2009). Evaluation tools in simulation learning: Performance and self-efficacy in emergency response. Clinical Simulation in Nursing, 5, e35–e43.
Bandura A. (1997). Self-efficacy: The exercise of control. New York, NY: Freeman.
Bandura A. (2006). Guide for creating self-efficacy scales. In Pajares F., Urdan T. C. (Eds.), Self-efficacy beliefs of adolescents (pp. 307–337). Greenwich, CT: Information Age Publishing.
Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bock R. D., Mislevy R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431–434.
Brown D., Chronister C. (2009). The effect of simulation learning on critical thinking and self-confidence when incorporated into an electrocardiogram nursing course. Clinical Simulation in Nursing, 5, e45–e52.
Chen P. (2003). Exploring the accuracy and predictability of the self-efficacy beliefs of seventh-grade mathematics students. Learning and Individual Differences, 14, 79–92.
Cheraghi F., Hassani P., Yaghmaei F., Alavi-Majed H. (2009). Developing a valid and reliable self-efficacy in clinical performance scale. International Nursing Review, 56, 214–221.
Chernyshenko O. S., Stark S., Chan K. Y., Drasgow F., Williams B. (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523–562.
Clark M. C., Owen S. V., Tholcken M. A. (2004). Measuring student perceptions of clinical competence. Journal of Nursing Education, 43, 548–554.
de Ayala R. J. (2009). The theory and practice of item response theory. New York, NY: The Guilford Press.
Dodd B. G., Koch W. R. (1987). Effects of variations in item step values on item and test information in the partial credit model. Applied Psychological Measurement, 11, 371–384.
Dodd B. G., de Ayala R. J., Koch W. R. (1995). Computerized adaptive testing with polytomous items. Applied Psychological Measurement, 19, 5–22.
Donoghue J. (1994). An empirical examination of the IRT information of polytomously scored reading items under the generalized partial credit model. Journal of Educational Measurement, 31, 295–311.
Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Fletcher R., Hattie J. (2004). An examination of the psychometric properties of the physical self-description questionnaire using a polytomous item response model. Psychology of Sport and Exercise, 5, 423–446.
Hambleton R. K., Swaminathan H., Rogers H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Institute of Medicine. (2000). To err is human: Building a safer health system. Retrieved from
Johnson S. M., Kurtz M. E. (2001). Diminished use of osteopathic manipulative treatment and its impact on the uniqueness of the osteopathic profession. Academic Medicine, 76, 821–828.
Muraki E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Muraki E. (1993). Information functions of the generalized partial credit model. Applied Psychological Measurement, 17, 351–363.
Muraki E. (1997). A generalized partial credit model. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 153–164). New York, NY: Springer.
Muraki E., Bock R. D. (2003). PARSCALE 4. Chicago, IL: Scientific Software International.
Muthén L. K., Muthén B. O. (2007). Mplus (v. 4.1). Los Angeles, CA: Author.
Pajares F., Urdan T. C. (2006). Self-efficacy beliefs of adolescents. Greenwich, CT: Information Age Publishing.
Reise S. P., Waller N. G., Comrey A. L. (2000). Factor analysis and scale revision. Psychological Assessment, 12, 287–297.
Schunk D. H., Zimmerman B. J. (1997). Social origins of self-regulatory competence. Educational Psychologist, 32, 195–208.
Stark S. (2008). MODFIT (Version 3) [Computer program]. Obtained from author.
Tabachnick B. G., Fidell L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Pearson Education.
Wilson M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum Associates.

item response theory; psychometrics; reliability and validity; self-efficacy

© 2012 Lippincott Williams & Wilkins, Inc.