Klassen, Anne F. D.Phil.; Cano, Stefan J. Ph.D.; Scott, Amie M. M.P.H.; Pusic, Andrea L. M.D., M.H.S.
Increased societal acceptance of cosmetic surgery has resulted in an increased number of patients seeking facial rejuvenation. A face lift is one of the most popular procedures used to combat the appearance of aging, with 119,026 face lifts performed in the United States in 2011, 5 percent more than in 2010.1 Face lifts are the fifth most common surgical cosmetic procedure in the United States.1 Satisfaction with facial appearance is undoubtedly the most important outcome for face-lift patients, yet limited research has evaluated the patient perspective.2,3
Measurement of the patient’s view of his or her facial appearance has been hampered by a lack of clinically meaningful and scientifically valid questionnaires. A systematic review published by our team found only one patient-reported outcome instrument developed to evaluate appearance in face-lift patients.4 The Facelift Outcomes Evaluation scale, published over a decade ago, is a six-item scale that measures appearance, functional outcome, and social acceptance.5,6 Given its limited content and the lack of published information regarding its development and scientific properties, it is not surprising that this patient-reported outcome instrument has not been broadly adopted.
To address the lack of available patient-reported outcome instruments for patients undergoing any type of facial cosmetic operation, minimally invasive cosmetic procedure, or facial injectable, our research team developed the FACE-Q. This is a patient-reported outcome instrument made up of independently functioning scales and checklists measuring concepts important to facial aesthetic patients, including quality of life, appearance, and process of care (Table 1). Each scale provides a standalone score ranging from 0 to 100, with higher scores indicating a better outcome. There is also an adverse effects checklist that includes questions about postsurgical symptoms for different facial areas. Depending on the surgical or nonsurgical procedure, only those FACE-Q scales and/or checklists relevant to a particular patient or procedure(s) need be completed. The aim of this article is to describe the development and psychometric evaluation of five appearance scales and the adverse effects checklist for use with face-lift patients.
Development of FACE-Q Scales and Checklists
Ethics review board approval was obtained before the study was started. FACE-Q scales and checklists were developed by our team following an approach that adheres to internationally recommended guidelines for patient-reported outcome instrument development.7,8
Phase 1: Qualitative Research Methods
In the first phase, we developed a conceptual framework of the outcomes important to facial aesthetic patients, composed of four major domains: appearance, quality of life, process of care, and adverse effects. These domains were identified using a mixed methods approach that included a systematic review, qualitative interviews, and expert input, and is reported in detail elsewhere.9
Phase 2: Quantitative Research Methods
Data were collected as part of two separate studies, and compiled for the purpose of analyses. The following FACE-Q appearance scales and checklist were evaluated in this article:
1. Satisfaction with Cheeks (i.e., sides of the face below the cheekbones): measures satisfaction using items that ask, for example, about symmetry, contour, and fullness.
2. Satisfaction with Lower Face and Jawline: measures satisfaction with items that ask, for example, about how sculpted and how prominent the jawline appears.
3. Appraisal of Nasolabial Folds (i.e., the deep lines that run downward from the sides of the nose): asks how bothered a patient is with his or her nasolabial folds with items such as how deep or noticeable the folds are, and how the folds appear during certain facial expressions.
4. Appraisal of Area Under the Chin: asks how bothered a patient is with this area of the face with items that ask, for example, about loose skin and fat, fullness, and contour.
5. Appraisal of the Neck: asks how bothered someone is with his or her neck with items that ask, for example, about hanging skin, wrinkles, and having to cover up the neck.
6. Adverse effects checklist: asks how bothered a patient is with a range of postsurgical symptoms.
Flesch-Kincaid scores for 29 of 30 items in the five appearance scales were lower than grade 6 (range, 0 to 6.7). For the adverse effects checklist, 12 of 15 items were lower than grade 6 (range, 0 to 10.2).10
Study 1: Data Collection
Ten plastic surgery and dermatology practices in the United States and Canada recruited patients between June of 2010 and June of 2012. Eligible participants were patients 18 years of age or older who had undergone or were waiting to undergo any surgical or nonsurgical facial aesthetic procedure. For the purposes of this article, we used the data provided by the subsample of patients in the FACE-Q field test who had undergone or were waiting to undergo a face lift. Patients from six practices were recruited in person, and patients from four practices were recruited through a postal survey, with up to three mailed reminders as necessary.
Study 2: Data Collection
A medical device company used the FACE-Q scales for an international clinical trial. The Mapi Research Trust11 provided translations and linguistic validation of the FACE-Q scales. Participants completed FACE-Q scales before and after surgery.
For the five FACE-Q appearance-related scales, decisions about item inclusion/exclusion were based on their performance against a standardized set of psychometric criteria. The adverse effects checklist was not analyzed in this way, as it is a descriptive tool (i.e., each item is an individual clinically important issue, and a total score is not computed).
Rasch Measurement Theory
We analyzed the FACE-Q scale data using Rasch measurement theory methods12,13 in RUMM2030 software.14 Rasch measurement theory analysis examines the difference (or fit) between the observed scores (patients’ responses to items) and the values predicted by the Rasch model to determine the extent to which the data for a set of items accord with (“fit”) the model. When data fit the Rasch model, the measurement theory (i.e., that a scale measures a specific construct) is supported by the data. Fit is evaluated iteratively using a range of statistical and graphical tests applied to each item in a scale.15,16 This combined evidence is used to make a judgment about the overall quality of the scale. Results for our scales were interpreted with reference to published criteria wherever possible, as follows:
Thresholds for item response options: The use of response categories scored with successive integer scores implies a continuum (e.g., increasing satisfaction with facial appearance). We tested this assumption by examining the ordering of thresholds (or points of crossover between adjacent response categories).
Item fit statistics: The items of a scale must work together (fit) as a conformable set both clinically and statistically. When items do not work together (misfit), it would be inappropriate to sum the individual item responses to reach a total score. We examined the following three indicators of fit: log residuals (item-person interaction), chi-square values (item-trait interaction), and item characteristic curves. Fit statistics are usually interpreted together in the context of their clinical usefulness as an item set, but as a guide, fit residual should fall between −2.5 and +2.5, and chi-square values should be nonsignificant after Bonferroni adjustment.
Item locations: The items of a scale define a continuum, and inspecting where items are located on the continuum shows how well the items map out a construct. Items should be spread evenly over a reasonable range.
Person Separation Index: This reliability statistic is comparable to Cronbach’s alpha17 and quantifies the error associated with the measurements of people in a sample. Higher values indicate greater reliability.
Responsiveness analysis: The ability to detect clinical change was examined at the group level by comparing pretreatment and posttreatment Rasch transformed scores using paired t tests and calculating two standard indicators of change: effect size (Kazis' effect size)18 and standardized response mean.19 The magnitude of the change can be interpreted using Cohen’s arbitrary criteria (small, 0.20; moderate, 0.50; and large, 0.80). Preliminary minimal importance difference values were generated by (1) calculating ½ SD of the pretreatment scores; and (2) extrapolating a change score based on a 0.5 effect size.
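For readers wishing to reproduce these group-level indicators, the arithmetic can be sketched as follows. This is an illustrative Python sketch only; the function and variable names are ours and do not come from the study or its software.

```python
import numpy as np

def responsiveness_stats(pre, post):
    """Group-level change indicators for paired pre-/posttreatment scores.

    Illustrative sketch (not study code). Returns Kazis' effect size,
    the standardized response mean, and a preliminary minimal importance
    difference estimated as one-half the SD of the pretreatment scores.
    """
    pre = np.asarray(pre, float)
    post = np.asarray(post, float)
    change = post - pre
    effect_size = change.mean() / pre.std(ddof=1)   # Kazis' ES: mean change / SD(pre)
    srm = change.mean() / change.std(ddof=1)        # SRM: mean change / SD(change)
    mid_half_sd = 0.5 * pre.std(ddof=1)             # half-SD criterion for MID
    return effect_size, srm, mid_half_sd
```

Kazis' effect size anchors the change to baseline variability, whereas the standardized response mean anchors it to the variability of the change itself; the two can diverge when change scores are tightly clustered.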
Responsiveness at the person level for each scale was computed by determining the significance of the change in each patient’s individual measurement.20 First, we computed a change score for each person (before surgery to after surgery) and the standard error for the change score. Then, we computed the significance of the change for each person by dividing his or her change score by the standard error of the difference. Finally, we categorized the significance of each person’s change score into one of five groups and counted the numbers of people achieving each level of significance of change. The five groups were as follows: significant improvement (change ≥ 1.96), nonsignificant improvement (0 < change < 1.96), no change (change = 0), nonsignificant worsening (−1.96 < change < 0), and significant worsening (change ≤ −1.96).
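The person-level classification described above amounts to a simple rule applied to each patient's standardized change score. A minimal sketch, with a hypothetical function name and the ±1.96 cutoffs (corresponding to p < 0.05, two-tailed):

```python
def classify_change(change, se_diff):
    """Classify one patient's change by statistical significance.

    Illustrative sketch (not study code): the change score is divided
    by the standard error of the difference, and the resulting z value
    is categorized into one of five groups at the +/-1.96 cutoffs.
    """
    z = change / se_diff
    if z >= 1.96:
        return "significant improvement"
    if z > 0:
        return "nonsignificant improvement"
    if z == 0:
        return "no change"
    if z > -1.96:
        return "nonsignificant worsening"
    return "significant worsening"
```

Counting patients in each category then yields the proportions reported in the individual-level responsiveness results.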
Traditional Test Theory Analysis
Traditional psychometric methods are described more fully elsewhere.21 For each FACE-Q scale, we examined the following: data quality (percentage missing data for each item), scaling assumptions (similarity of item means and variances; magnitude and similarity of corrected item-total correlations),22–24 scale-to-sample targeting (score means, standard deviation, floor and ceiling effects), and internal consistency reliability (Cronbach’s alpha17 and homogeneity coefficients).25
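Internal consistency reliability as assessed by Cronbach's alpha17 can be computed directly from an item-response matrix. The sketch below is illustrative only and is not the software used in the study:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an item-response matrix.

    Illustrative sketch (not study code). Rows are respondents,
    columns are items; alpha = k/(k-1) * (1 - sum(item variances)
    / variance of total scores).
    """
    items = np.asarray(items, float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```

When items are perfectly intercorrelated, alpha reaches 1.0; values of 0.94 or higher, as reported for the FACE-Q scales, indicate very high internal consistency.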
Aspects of validity were assessed in two ways. First, we computed intercorrelations between FACE-Q scales to examine the extent to which subscales measured separate but related constructs.26 We predicted that these intercorrelations would range between r = 0.30 and r = 0.70, as the scales of the FACE-Q purport to measure distinct but related clinical variables.27 Second, we examined the ability of the FACE-Q to detect differences between predefined subgroups. Specifically, all patients completed the FACE-Q Patient-Perceived Age Visual Analogue Scale, which asks them to indicate how many years younger or older they think they look compared with their actual age. The scale anchors for the Patient-Perceived Age Visual Analogue Scale are −15 years to +15 years. We categorized patient responses into the following five groups and compared the mean score for each FACE-Q scale using analysis of variance: (1) looks more than 5 years older than actual age, (2) looks 1 to 5 years older than actual age, (3) looks actual age, (4) looks 1 to 5 years younger than actual age, and (5) looks more than 5 years younger than actual age. We hypothesized that FACE-Q Appearance Appraisal Scale scores would be incrementally higher in the younger subgroups compared with the older subgroups.
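Both validity checks are standard computations. The sketch below illustrates them with invented numbers; none of the values are study data, and the group labels are hypothetical stand-ins for the perceived-age subgroups:

```python
import numpy as np
from scipy import stats

# Hypothetical scale scores for three perceived-age subgroups
# (invented for demonstration; not study data)
looks_older = [55, 60, 58, 62]
looks_actual_age = [65, 70, 68, 72]
looks_younger = [78, 82, 80, 85]

# Known-groups comparison: one-way analysis of variance across subgroups
f_stat, p_value = stats.f_oneway(looks_older, looks_actual_age, looks_younger)

# Interscale correlation between two hypothetical FACE-Q scales
scale_a = [60, 65, 70, 75, 80]
scale_b = [55, 68, 66, 78, 79]
r = np.corrcoef(scale_a, scale_b)[0, 1]
```

A significant F statistic with scale means rising monotonically across the subgroups would support the known-groups hypothesis, and intercorrelations in the predicted r = 0.30 to 0.70 range would support the claim that the scales measure distinct but related constructs.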
Phase 1: Qualitative Research Results
Through the qualitative phase of our study, we developed and refined the final set of FACE-Q scales and checklists shown in Table 1. Each of the five appearance appraisal scales for face-lift patients has four response options. Instructions ask the participants to complete the items for each scale based on how they look right now, and to indicate how much in the past week they have been either “bothered by” or “satisfied with” the particular facial area.
Phase 2: Quantitative Research Results
In study 1, 360 patients were recruited face-to-face, 332 of whom responded (response rate, 92 percent); and 283 patients were recruited by mail, of whom 167 responded (response rate, 59 percent). The overall response rate was 78 percent. Table 2 lists characteristics of the 225 face-lift patients included in the following analyses.
Rasch Measurement Theory
Table 3 shows the summary fit statistics to the Rasch model (i.e., how closely the observed data match those expected by the model). A nonsignificant chi-square value supported the fit to the Rasch model for the five scales. Targeting was good, with minimal floor/ceiling effects, and all items in each of the five scales displayed ordered thresholds, indicating that respondents were able to distinguish between the four response options (data available on request). Table 4 shows the individual item fit statistics. The findings provide further support for each of the five scales as reliable and valid measures of their respective constructs. Across the five scales, only one item had a fit residual marginally outside the recommended criteria of −2.5 to +2.5. This item was retained given that all other fit statistics were satisfied. The Person Separation Index values (Table 3) for the five scales were greater than or equal to 0.88, indicating good reliability.
Table 5 shows that patient satisfaction/appraisal with aspects of their facial appearance improved significantly after treatment. These statistically significant change scores were associated with moderate effect sizes. In addition, preliminary minimal importance difference analyses suggested a 10- to 14-point difference in total scores. This difference was exceeded in our analysis (range mean change ± SD, 11 ± 27 to 16 ± 30). For individual-level results, depending on the scale, between 32 and 41 percent of patients who had face-lift procedures reported significant improvement in satisfaction with facial appearance.
Traditional Test Theory Analysis
All scales exceeded criteria for acceptability, reliability, and validity (Table 6). Specifically, Cronbach’s alpha coefficients (≥0.94) and intraclass correlation coefficients (≥0.74) supported scale reliability. Scale validity was supported by the high Cronbach’s alpha coefficients and interscale correlations that ranged between r = 0.30 and r = 0.71, showing that each scale measures a distinct but clinically related variable (Table 7). Our examinations of known-groups validity (Table 8) revealed that our hypotheses relating to the patterns and significance of scores across subgroups were supported (i.e., FACE-Q scores were higher in participants who indicated they appeared younger than their actual age). Overall, our findings indicated that the items in each scale constituted a statistically conformable group, and that these scores were reliable and valid measures. Finally, Table 9 shows the frequency table for the adverse effects checklist.
The FACE-Q was developed using rigorous qualitative research that involved in-depth interviews with a varied sample of patients, extensive expert input, and modern psychometric methods to identify the best indicators of outcome for each scale. Our overriding goal was to address the lack of available patient-reported outcome tools for patients who undergo facial cosmetic surgery, minimally invasive cosmetic procedures, and/or facial injectables. We chose to develop scales and checklists for anatomical areas of the face rather than instruments to evaluate outcomes particular to a given surgical or nonsurgical procedure, as others have done.4
For face-lift patients, the FACE-Q appearance appraisal scales were found to be clinically meaningful, valid, reliable, and responsive to change 6 months after treatment. In addition, the adverse effects checklist was useful for identifying the proportion of patients experiencing postsurgical symptoms. We suggest that this set of FACE-Q scales represents a promising new set of tools that can be used with face-lift patients in both research and clinical practice.
In addition to the scales presented in this article, researchers and clinicians measuring outcomes in face-lift patients might also want to include the FACE-Q 10-item core scale, which measures overall satisfaction with facial appearance. This scale can be used to compare outcomes across any procedure type and/or to measure change before and after any facial aesthetic procedure.28 Our team also developed a seven-item aging appraisal scale, which provides an assessment of a patient’s perception of his or her appearance in the context of facial aging.29
Our current study has some limitations. First, it is rare to find a face-lift patient who has not had other facial aesthetic treatments before undergoing a face lift. In fact, in study 1, only 5.6 percent of our sample had not had any facial aesthetic procedure before their face lift. This finding reflects the nature of facial aesthetic patients and the challenge that exists in measuring the benefit of any particular facial aesthetic treatment. Second, our sample was composed of more women than men. Future research could investigate the use of FACE-Q scales with male patients. Third, although our response rate for face-to-face recruitment was high, our response rate to the mailed survey was lower than we would have liked. Fourth, it is possible there could have been some bias introduced at the individual clinic level by office staff who recruited patients for us. Finally, we acknowledge that the sample size in several countries was small. We therefore recommend further research be carried out to add to the evidence base for the use of the scales and the generalizability of their measurement properties.
Based on the development process and these preliminary validation data, we argue that our scales and checklist are tools that can be used to advance knowledge about the outcomes that matter most to face-lift patients. Our scales are short and easy to complete and have high face validity, making them the type of tools that can easily be incorporated into routine clinical practice. Previous research has shown that integration of patient-reported outcomes into clinical practice improves patient-clinician communication and can enhance patient care and outcomes.30–32
In addition to their use in clinical practice, we envision FACE-Q scales as important new metrics that can be used to define the outcomes of facial aesthetics with broad application in clinical research. For example, incorporation of FACE-Q scales into clinical trials could help to guide future surgical innovation and advance comparative effectiveness research in facial aesthetic treatments. Given an ever-growing range of interventions and products in facial aesthetic surgery, the incorporation of patient-reported outcome instruments into clinical research is essential if we are to understand the profound impact that cosmetic treatments have on the appearance and quality of life of patients. For researchers who plan to use the FACE-Q in future studies, it is important to note that our scales are designed to function independently from each other. This means that researchers can choose to administer only those scales that are most appropriate for their research hypothesis or patient population. For example, in a face-lift study, the scales described in this article might be used, whereas in a blepharoplasty study, only scales related to the eye might be selected. This approach minimizes response burden and improves targeting. We would also stress that study design and timing of FACE-Q administration is entirely at the discretion of individual research teams. As an example, an investigator may elect to use the FACE-Q before and after treatment in a randomized clinical trial, whereas another might select a cross-sectional cohort study design.
This study was funded by grants from the Plastic Surgery Foundation. The authors acknowledge and thank the following clinicians for their invaluable assistance with the recruitment of patients and countless hours spent as expert reviewers: Vancouver, British Columbia, Canada: Drs. Nick Carr, Francis Jang, Nancy VanLaeken, Alistair Carruthers, Jean Carruthers, and Richard Warren; Washington, D.C.: Dr. Stephen Baker; Dallas, Texas: Drs. Jeffery Kenkel and Rod Rohrich; Atlanta, Georgia: Dr. Foad Nahai; St. Louis, Missouri: Dr. Leroy Young; New York, New York: Drs. David Hidalgo, David Rosenberg, Philip Miller, Alexes Hazen, and Haideh Hirmand. The authors thank Jonathan Switzer for his assistance with final manuscript preparation.
2. Honigman RJ, Phillips KA, Castle DJ. A review of psychosocial outcomes for patients seeking cosmetic surgery. Plast Reconstr Surg. 2004;113:1229–1237
3. Shridharani SM, Magarakis M, Manson PN, Rodriguez ED. Psychology of plastic and reconstructive surgery: A systematic clinical review. Plast Reconstr Surg. 2010;126:2243–2251
4. Kosowski TR, McCarthy C, Reavey PL, et al. A systematic review of patient-reported outcome measures after facial cosmetic surgery and/or nonsurgical facial rejuvenation. Plast Reconstr Surg. 2009;123:1819–1827
5. Alsarraf R, Larrabee WF Jr, Anderson S, Murakami CS, Johnson CM Jr. Measuring cosmetic facial plastic surgery outcomes: A pilot study. Arch Facial Plast Surg. 2001;3:198–201
6. Alsarraf R. Outcomes research in facial plastic surgery: A review and new directions. Aesthetic Plast Surg. 2000;24:192–197
8. Scientific Advisory Committee of the Medical Outcomes Trust. Assessing health status and quality of life instruments: Attributes and review criteria. Qual Life Res. 2002;11:193–205
9. Klassen AF, Cano SJ, Scott A, Snell L, Pusic AL. Measuring patient-reported outcomes in facial aesthetic patients: Development of the FACE-Q. Facial Plast Surg. 2010;26:303–309
10. Flesch R. A new readability yardstick. J Appl Psychol. 1948;32:221–233
12. Andrich D. Controversy and the Rasch model: A characteristic of incompatible paradigms? Med Care. 2004;42(Suppl):1–16
13. Wright BD, Masters G. Rating Scale Analysis: Rasch Measurement. Chicago: Mesa Press; 1982
14. Andrich D, Sheridan B, Luo G. RUMM2030. Perth, Australia: RUMM Laboratory; 1997–2013
15. Andrich D. Rasch Models for Measurement. Newbury Park, Calif: Sage; 1988
16. Rasch G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Danish Institute for Education Research; 1960
17. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334
18. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care. 1989;27(Suppl):S178–S189
19. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care. 1990;28:632–642
20. Hobart JC, Cano SJ, Thompson AJ. Effect sizes can be misleading: Is it time to change the way we measure change? J Neurol Neurosurg Psychiatry. 2010;81:1044–1048
21. Hobart JC, Cano SJ. Improving the evaluation of therapeutic intervention in MS: The role of new psychometric methods. Health Technol Assess. 2009;13:1–200
22. McHorney CA, Haley SM, Ware JE Jr. Evaluation of the MOS SF-36 Physical Functioning Scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol. 1997;50:451–461
23. Likert R. A technique for the measurement of attitudes. Arch Psychol. 1932;140:1–55
24. Ware JE, Harris WJ, Gandek B, et al. MAP-R for Windows: Multi-trait/Multi-Item Analysis Program—Revised User’s Guide. Boston: Health Assessment Laboratory; 1997
25. Eisen M, Ware JE Jr, Donald CA, Brook RH. Measuring components of children’s health status. Med Care. 1979;17:902–921
26. McHorney CA, Ware JE Jr, Lu JF, Sherbourne CD. The MOS 36-Item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions and reliability across diverse patient groups. Med Care. 1994;32:40–66
27. Bohrnstedt GW. Measurement. In: Rossi PH, Wright JD, Anderson AB, eds. Handbook of Survey Research. New York: Academic Press; 1983:69–121
28. Pusic AL, Klassen AK, Scott A, Cano SJ. Development and psychometric evaluation of the FACE-Q satisfaction with appearance scale: A new patient-reported outcome instrument for facial aesthetic patients. Clin Plast Surg. 2013;40:249–260
29. Panchapakesan V, Klassen AF, Cano SJ, et al. Development and psychometric evaluation of the FACE-Q Aging Appraisal Scale and Patient-Perceived Age Visual Analog Scale. Aesthet Surg J. 2013;33:1099–1109
30. Marshall S, Haywood K, Fitzpatrick R. Impact of patient-reported outcome measures on routine practice: A structured review. J Eval Clin Pract. 2006;12:559–568
31. Valderas JM, Kotzeva A, Espallargues M, et al. The impact of measuring patient-reported outcomes in clinical practice: A systematic review of the literature. Qual Life Res. 2008;17:179–193
32. Greenhalgh J, Meadows K. The effectiveness of the use of patient-based measures of health in routine practice in improving the process and outcomes of patient care: A literature review. J Eval Clin Pract. 1999;5:401–416