Horizontal or transverse neck lines can occur at any age.1 Neck lines may be associated with the deposition of submental and subplatysmal fat, and they are exacerbated by age-related decreases in elasticity and thickness of the skin of the neck, combined with gravity and the downward pull of the platysma muscle.2–4 Horizontal neck lines may be treated with botulinum toxin Type A in cases where the lines are clearly caused by the activity of the platysma muscles,3–5 although some groups report having little success with this approach.6 Use of injectable filler for the treatment of horizontal neck lines has been reported in one case study1 and in a prospective single-center study in combination with other therapies.7 Other approaches for reducing the appearance of neck lines include rhytidectomy,8 fractional laser treatment,9,10 fractional radiofrequency treatment,11,12 and microfocused ultrasound.13,14
Patients are increasingly seeking treatment for nonfacial rejuvenation, including neck lines, and clinicians need a way to both educate and assess patients regarding treatments. Clinical studies of neck line treatments have assessed outcomes using general numeric wrinkle scales that did not include images and were not validated for the assessment of the neck.9,10,12 This report describes the development and validation of a new photonumeric scale designed to rate horizontal lines of the neck (Allergan Transverse Neck Lines Scale). The scale was created to meet FDA requirements for outcome assessments in clinical trials15 and to provide a practical tool that physicians can use for the assessment of patients. The objectives of this study were to determine the clinically significant difference in scale scores and to establish the interrater and intrarater reliability of the scale for rating severity of horizontal lines of the neck in live subjects.
Figure 1 summarizes key steps in the creation and validation of the Allergan Transverse Neck Lines Scale. A 9-member team comprising 5 external members (3 board-certified dermatologists, 1 board-certified oculoplastic surgeon, and 1 board-certified facial plastic surgeon) and 4 Allergan employees (2 dermatologists, 1 plastic surgeon, and 1 clinical scientist) developed the scale from a pool of subject images collected for scale development by Canfield Scientific, Inc (Canfield, Fairfield, NJ). A total of 396 men and women aged 18 years or older with Fitzpatrick skin Types I through VI and in good general health volunteered for image capture. All subjects provided informed photograph consent before image collection. Subjects were excluded if they had anything that would interfere with visual assessment of the area of interest. Canfield photographers obtained full 2-dimensional (2D) images of the face and neck using a 2D custom suite for face and neck imaging (Nikon D7100 Hi Res SLR). Images were cropped horizontally from 1 cm lateral to the neck/shoulder junction on the left and right sides and vertically from 1 cm above the bony menton down to 2 cm below the neck/shoulder junction to produce images of the area of interest.
Scale descriptors were created for each of the 5 grades of the scale (Table 1). Two members of the Allergan team met individually with each member of the scale development team for preliminary input on each scale grade. After preliminary scale grades were established, all 9 individuals involved in scale creation had a collaborative discussion about the scale grades and descriptors. The wording for each grade was then finalized by the Allergan team.
Canfield created an assessment guide with a line drawing of anatomic markers demarcating the anterior third of the neck between each sternocleidomastoid based on detailed instructions from the Allergan team regarding anatomic markers (Figure 2). Canfield revised the drawing multiple times based on careful review by the Allergan team.
A base image to demonstrate Grade 2 neck lines was selected, and this image was morphed to represent all 5 grades of the scale. A Canfield graphics technician morphed the anatomic area of interest in the base image to match the descriptors provided for Grades 0, 1, 3, and 4. Alignment of the morphed images with the scale descriptors was achieved via an interactive process with the Allergan team.
A forced ranking review was performed to delineate the range of severity between Grades 2 and 3 and to confirm the selection of the best representative image to be used as Grade 2. The 5 external scale developers performed the web-based forced ranking exercise on preselected images that represented the upper and lower boundaries of Grades 2 and 3.
To determine whether there was a clinically significant difference between grades of the scale, the 5 external scale developers were asked to perform an online clinical significance review of image pairs. Multiple image pairs were selected to represent varying degrees of differences in severity (ranging from no difference to a 4-point difference). During the session, the scale developers determined whether there was a clinically significant difference (Yes/No) between images for each pair. After the session, the images from all image pairs were randomly mixed in with other images to be used in the morphed image scale validation (described in the following paragraph) and assigned a score by scale developers so that score differences between the 2 images in each pair could be calculated.
The morphed image scale was validated by having the 5 external scale developers use the scale to rate randomized images representing all scale grades during 2 web-based sessions occurring at least 3 days apart. A total of 299 images were rated (120 images in Session 1 and 179 images in Session 2). The scale had acceptable interrater and intrarater agreement (>0.5), so scale development proceeded using the morphed images.
For both the clinical significance review and the morphed image scale validation review, Canfield provided scale developers uniform hardware to complete the reviews. Before the reviews, the external scale developers completed a web‐based PowerPoint training to familiarize themselves with the hardware, the review platform, and the purpose of the clinical significance and morphed image validation reviews. The scale developers were not allowed to discuss the reviews with one another, and each completed the reviews independently.
After the morphed image scale was created, 2 subject photographs representing each grade of the scale were selected to represent diversity in sex and Fitzpatrick skin type per grade. The final scale contains the scale descriptors for each grade, an assessment guide, the morphed images, and the real subject images (Figure 3).
The interrater and intrarater reliability of the final scale was evaluated in a live-subject rating validation study. Eight physician raters experienced in using aesthetic photonumeric scales who were not involved in scale development participated in two 2-day live validation sessions occurring 3 weeks apart. Before the first live evaluation session, all physician raters were trained on the use of the scale in an interactive group training session using 4 example subjects. Raters were instructed to rate only horizontal neck lines, to disregard vertical lines (e.g., platysmal bands on neck), to select a grade based on the most severe line present (with 1 line being sufficient to determine grade), and to assess effaceable versus noneffaceable lines visually and not through attempts to manually efface lines (Figure 3).
All subjects who qualified for the initial image capture events were invited to attend the live validation sessions. Subjects were instructed to arrive clean shaven, remove makeup and jewelry, wear dark pants or jeans and a provided black T-shirt, not drink alcohol excessively before the sessions, try not to alter their usual routine (e.g., their facial care routine and normal sleep or hydration patterns) between sessions, and not have tanning sessions or extensive sun exposure between sessions. Upon arrival at the study center for the first live validation session, subjects signed informed consent and were assessed for eligibility, age, sex, race (as reported by the subject), and Fitzpatrick skin type (determined by the investigator). Subjects were excluded if they had their photographs included in the scale; anything that would interfere with the visual assessment of the area of interest; any treatment with toxin/fillers, dental procedures, or surgery that would alter the area of interest within 2 weeks of the first validation session or plans to have one of these procedures between the 2 sessions; or diagnosis of pregnancy. Two-dimensional images of each subject were collected using a 2D custom studio suite at the first live validation session. The first 5 subjects rated during the first validation session were considered run-in training subjects and were excluded from the analysis.
During the first and second live scale validation sessions, each physician rater evaluated all subjects on all scales (7 additional scales for other anatomic features were evaluated at the same sessions and are reported separately16–22). Raters had separate evaluation stations with an examination lamp, table, a stool for subject seating, supplies, and the photonumeric scale mounted and displayed for use in subject evaluation. Subjects presented themselves to each rater individually and proceeded from one rating station to the next in the same order until evaluated by all 8 raters. Raters were instructed to not discuss ratings with subjects or other raters. Raters took at least a 10-minute break every hour and at least a 30-minute lunch break to avoid rater fatigue.
To determine the utility of the scale grades for detecting clinically meaningful differences in horizontal neck lines, absolute score differences for the image pairs deemed “clinically different” or “not clinically different” during scale development were summarized (mean, standard deviation, range, 95% confidence interval [CI]). For the live-subject scale validation study, intrarater reliability was assessed by comparing each rater's Session 1 and Session 2 scores using weighted kappa with Fleiss-Cohen weights.23 Kappa scores of 0.0 to 0.20 indicate slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, 0.61 to 0.80 substantial agreement, and 0.81 to 1.00 almost perfect agreement.24 Interrater agreement was measured with the intraclass correlation coefficient (ICC [2,1]) and its 95% CI, calculated using the formula described by Shrout and Fleiss.25 The a priori primary end point for the interrater agreement analysis was ICC (2,1) for the second rating session. SAS version 9.3 (SAS Institute, Cary, NC) was used for all statistical analyses.
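As an illustration of the intrarater statistic, the following is a minimal Python sketch of weighted kappa with Fleiss-Cohen (quadratic) weights for a 5-grade scale, together with the agreement bands quoted above. The study analyses were run in SAS; the function names and the example ratings here are hypothetical, and the "poor" label for negative values comes from Landis and Koch rather than from the ranges listed in the text.

```python
from collections import Counter


def weighted_kappa(session1, session2, n_grades=5):
    """Weighted kappa with Fleiss-Cohen (quadratic) agreement weights.

    session1 and session2 are equal-length lists of integer grades
    (0 .. n_grades - 1) given by the same rater to the same subjects.
    """
    if len(session1) != len(session2) or not session1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(session1)
    k = n_grades
    # Agreement weight: 1 on the diagonal, falling quadratically to 0
    # for the largest possible disagreement (k - 1 grades apart).
    w = [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    observed = Counter(zip(session1, session2))
    margin1, margin2 = Counter(session1), Counter(session2)
    p_obs = sum(w[i][j] * c for (i, j), c in observed.items()) / n
    p_exp = sum(
        w[i][j] * margin1[i] * margin2[j] for i in range(k) for j in range(k)
    ) / n ** 2
    if p_exp == 1:
        raise ValueError("kappa is undefined when all ratings are identical")
    return (p_obs - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Map a kappa value to the agreement bands listed in the text."""
    if kappa < 0:
        return "poor"  # below the ranges quoted in the text
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical grades from one rater at two sessions (not study data).
first = [0, 1, 2, 3, 4, 2, 1, 3]
second = [0, 1, 2, 4, 4, 1, 1, 3]
print(landis_koch(weighted_kappa(first, second)))
```

Quadratic weights penalize a 2-grade disagreement four times as heavily as a 1-grade disagreement, which suits an ordinal severity scale where near misses are less serious than large discrepancies.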
Sample Size Considerations
The sample size for the live-subject validation sessions was calculated using the method described by Bonett.26 With up to 10 raters and an ICC of 0.5, a total of 66 subjects were needed to obtain a 95% CI with a width of 0.2 for interrater reliability. To allow for potential loss of subjects between the 2 sessions, at least 80 subjects were to be enrolled for the scale. Because 297 subjects were eligible for the scale validation analysis, the number of subjects evaluated using this scale was substantially larger than the preplanned sample size of 80, and the overall number of assessments was larger for some grades of the scale than for others. To minimize the imbalance in the number of subjects across scale grades and to meet the sample size requirement, the mean score across the 8 raters for each subject was used to assign an overall grade for each subject, and a subset of 80 subjects with minimal imbalance across the grades (∼16 subjects for each of the 5 scale grades) was randomly selected from the eligible subjects using a prespecified procedure and a preselected randomization seed. This random selection of the subset was performed 20 times. Interrater and intrarater agreements calculated for each of the 20 subsets were combined using the SAS procedure PROC MIANALYZE to obtain the overall interrater and intrarater agreements.
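The figure of 66 subjects can be reproduced with Bonett's large-sample approximation for the number of subjects needed to estimate an ICC with a confidence interval of a desired width. The sketch below assumes that this first-order formula is the one applied; the function name and parameter names are illustrative only.

```python
import math
from statistics import NormalDist


def bonett_sample_size(icc, n_raters, ci_width, confidence=0.95):
    """Approximate number of subjects needed so that the CI for an ICC
    has the requested total width (Bonett's approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    k = n_raters
    n = (8 * z ** 2 * (1 - icc) ** 2 * (1 + (k - 1) * icc) ** 2
         / (k * (k - 1) * ci_width ** 2)) + 1
    return math.ceil(n)


# Planning values stated in the text: 10 raters, ICC 0.5, CI width 0.2.
print(bonett_sample_size(icc=0.5, n_raters=10, ci_width=0.2))  # 66
```

With these planning values the unrounded result is about 65.6, which rounds up to the 66 subjects stated above.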
Clinical Significance Determination by Scale Developers
The mean (95% CI) absolute difference in scores was 1.22 (1.09–1.35) for image pairs identified as clinically different and 0.57 (0.42–0.72) for image pairs identified as not clinically different (Table 2). The 95% CIs for clinically different pairs did not overlap with the 95% CIs for pairs deemed not clinically different, confirming that a 1-point difference in scores is clinically significant.
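The comparison above amounts to summarizing absolute score differences for each group of image pairs and checking whether the two confidence intervals overlap. A minimal Python sketch follows; it uses a normal approximation for the interval, which is an assumption because the text does not state how the CIs were computed, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize_differences(abs_diffs, confidence=0.95):
    """Mean, SD, range, and a normal-approximation CI for a list of
    absolute score differences between paired images."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    m = mean(abs_diffs)
    sd = stdev(abs_diffs)
    half_width = z * sd / math.sqrt(len(abs_diffs))
    return {
        "mean": m,
        "sd": sd,
        "range": (min(abs_diffs), max(abs_diffs)),
        "ci": (m - half_width, m + half_width),
    }


def intervals_overlap(a, b):
    """True if closed intervals a = (low, high) and b = (low, high) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Intervals reported in the text for the two groups of image pairs.
clinically_different = (1.09, 1.35)
not_clinically_different = (0.42, 0.72)
print(intervals_overlap(clinically_different, not_clinically_different))  # False
```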
Live-Subject Scale Validation
Of the 297 subjects eligible for Allergan Transverse Neck Lines Scale validation analysis, 288 subjects were selected in at least 1 of the 20 random subsets. Demographic characteristics of subjects in the final scale validation set are shown in Table 3. Most subjects were female (67%) and Caucasian (79%), and most had Fitzpatrick skin Type III (27%) or IV (33%). Median age was 48 years, and a broad span of ages was represented (18–83 years).
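The repeated selection of grade-balanced subsets described under Sample Size Considerations can be sketched as follows. This is an illustration under stated assumptions, not the prespecified SAS procedure: the rule for turning a mean rater score into an overall grade (rounding half up), the seed value, and all names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_subsets(mean_scores, per_grade=16, n_subsets=20, seed=2016):
    """Draw repeated random subsets with roughly equal numbers per grade.

    mean_scores maps a subject ID to that subject's mean score across raters.
    The seed is a placeholder; the study's actual seed is not reported.
    """
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for subject_id, score in sorted(mean_scores.items()):
        # Assumption: overall grade is the mean score rounded half up,
        # clamped to the 0-4 range of the scale.
        grade = min(4, max(0, int(score + 0.5)))
        by_grade[grade].append(subject_id)
    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for grade in sorted(by_grade):
            pool = by_grade[grade]
            # Take fewer than per_grade only if the grade is under-represented.
            chosen.extend(rng.sample(pool, min(per_grade, len(pool))))
        subsets.append(chosen)
    return subsets


# Hypothetical mean scores for 100 subjects, 20 per grade (not study data).
scores = {i: (i % 5) + 0.1 for i in range(100)}
subsets = balanced_subsets(scores)
print(len(subsets), len(subsets[0]))
```

Because each draw is independent, a subject can appear in several subsets, which is consistent with 288 of the 297 eligible subjects being selected at least once.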
Intrarater agreement between the 2 live-subject rating validation sessions was substantial (mean weighted kappa = 0.78) (Table 4). Interrater agreement for the Allergan Transverse Neck Lines Scale was also substantial, with an ICC (2,1) of 0.72 in Session 1 and 0.73 in Session 2 (Table 4).
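For readers who want to see how the interrater statistic is obtained, the sketch below computes ICC (2,1) from a subjects-by-raters table using the two-way random-effects mean squares of Shrout and Fleiss. It returns the point estimate only and omits the confidence interval; the function name and the example ratings are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores is a list of rows, one per subject, each row holding the grades
    given to that subject by every rater (same rater order in each row).
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 subjects and raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    rater_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_subjects - ss_raters
    bms = ss_subjects / (n - 1)              # between-subjects mean square
    jms = ss_raters / (k - 1)                # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))     # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Hypothetical grades for 5 subjects rated by 3 raters (not study data).
ratings = [
    [0, 0, 1],
    [1, 1, 1],
    [2, 3, 2],
    [3, 3, 4],
    [4, 4, 4],
]
print(round(icc_2_1(ratings), 2))
```

The term involving the between-raters mean square is what distinguishes this form from a consistency-only ICC: systematic differences in how severely raters grade lower the estimate.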
This study demonstrated substantial interrater and intrarater agreement for the Allergan Transverse Neck Lines Scale, indicating that the scale is reliable for multiple assessments of the same subject and across different raters. A 1-point difference in scale ratings was shown to reflect clinically significant differences, indicating that the scale has sufficient sensitivity for detecting clinically significant changes in horizontal lines of the neck.
The scale requires that effaceable versus noneffaceable lines be assessed visually, not manually; most physicians with experience in the treatment of neck lines can generally tell whether the line is effaceable with visual inspection alone. The scale uses morphed images to represent each grade to focus the rater's attention on the change from one grade to the next, with all other features remaining constant across scale grades. Real-world images representing a diverse range of skin types across sexes and races are an important addition to the scale because morphed images may not always translate to the broad array of appearances or physical changes observed in the clinic. Representation of both sexes and multiple ethnic groups in rating scales is important, as growing numbers of men and members of diverse ethnic groups are seeking aesthetic facial treatment.4,27
Patients are increasingly seeking aesthetic treatment for areas other than the face, including the neck. In the experience of the authors, transverse neck lines are often observed in younger patients, even those without extensive photodamage. In some middle-aged patients, the neck is much more severely damaged than the face, making neck lines a chief concern. Restoration of a more normal neck appearance can substantially improve self-esteem and confidence. Clinicians need a way to both educate and assess patients for neck line treatments, and the Allergan Transverse Neck Lines Scale provides standardized ratings that may be uniformly applied in day-to-day clinical practice and potentially in clinical trials, due to its validation in live subjects and use of both morphed and unaltered images.
The Allergan Transverse Neck Lines Scale is not used to rate vertical neck lines. In the experience of the authors, neck treatments such as botulinum toxin Type A are especially useful for improving the appearance of the neck and jaw line rather than just reducing lines; the loss of downward pull and the softening of vertical lines are also important considerations with neck treatments. More generic wrinkle scales may be helpful for assessing vertical neck lines.9,10,12
The scale developers solely determined the clinical significance of scale scores; although a 1-point change on the scale was considered meaningful to the scale developers, it may or may not be meaningful to subjects. Hence, this scale is not intended for patient self-assessment of meaningful improvement. Use of the FACE-Q appearance appraisal scale, a validated patient satisfaction instrument with a subscale for satisfaction with the neck, may be helpful for capturing the perspective of the patient on the appearance before and after treatment.28 Finally, the verbal descriptors for each grade on the Allergan Transverse Neck Lines Scale are subjective. However, the descriptors were developed and refined during extensive collaboration among 9 clinical experts to minimize inherent subjectivity.
Because increasing numbers of patients are seeking aesthetic treatment of the neck, there is a need for a validated scale for the assessment of neck lines. The Allergan Transverse Neck Lines Scale includes user-friendly diagrams, detailed verbal descriptions, and morphed and real subject images representative of both sexes and diverse skin types. The scale demonstrated substantial intrarater and interrater agreement among physicians, and a 1-point score difference was shown to reflect clinically significant differences in horizontal neck lines. The scale meets FDA criteria for validated clinical outcome measures in clinical trials and provides standardized ratings that can be uniformly applied by dermatologists and plastic surgeons who treat patients seeking treatment of horizontal lines of the neck.
The authors thank the following physicians for completing the scale validation study: David E. Bank, MD, FAAD; Sue Ellen Cox, MD; Timothy M. Greco, MD, FACS; Z. Paul Lorenc, MD, FACS; David J. Narins, MD, PC, FACS; William B. Nolan, MD; Robert A. Weiss, MD; and Margaret Weiss, MD. Statistical support was provided by Yijun Sun, PhD, and Shraddha Mehta, PhD of Allergan plc, Irvine, CA.
1. Chao YY, Chiu HH, Howell DJ. A novel injection technique for horizontal neck lines correction using calcium hydroxylapatite. Dermatol Surg 2011;37:1542–5.
2. Brandt FS, Boker A. Botulinum toxin for the treatment of neck lines and neck bands. Dermatol Clin 2004;22:159–66.
3. Raspaldo H, Niforos FR, Gassia V, Dallara JM, et al. Lower-face and neck antiaging treatment and prevention using onabotulinumtoxin A: the 2010 multidisciplinary French consensus–part 2. J Cosmet Dermatol 2011;10:131–49.
4. Carruthers JD, Glogau RG, Blitzer A. Advances in facial rejuvenation: botulinum toxin type A, hyaluronic acid dermal fillers, and combination therapies–consensus recommendations. Plast Reconstr Surg 2008;121(Suppl 5):5S–30S.
5. Ascher B, Talarico S, Cassuto D, Escobar S, et al. International consensus recommendations on the aesthetic usage of botulinum toxin type A (Speywood Unit)—Part II: wrinkles on the middle and lower face, neck and chest. J Eur Acad Dermatol Venereol 2010;24:1285–95.
6. Dayan SH, Maas CS. Botulinum toxins for facial wrinkles: beyond glabellar lines. Facial Plast Surg Clin North Am 2007;15:41–9, vi.
7. Sarnoff DS, Gotkin RH. ACELIFT: a minimally invasive alternative to a facelift. J Drugs Dermatol 2014;13:1038–46.
8. Agarwal A, Dejoseph L, Silver W. Anatomy of the jawline, neck, and perioral area with clinical correlations. Facial Plast Surg 2005;21:3–10.
9. Oram Y, Akkaya AD. Neck rejuvenation with fractional CO2 laser: long-term results. J Clin Aesthet Dermatol 2014;7:23–9.
10. Bencini PL, Tourlaki A, Galimberti M, Pellacani G. Non-ablative fractionated laser skin resurfacing for the treatment of aged neck skin. J Dermatolog Treat 2015;26:252–6.
11. Abraham MT, Ross EV. Current concepts in nonablative radiofrequency rejuvenation of the lower face and neck. Facial Plast Surg 2005;21:65–73.
12. Alexiades M, Berube D. Randomized, blinded, 3-arm clinical trial assessing optimal temperature and duration for treatment with minimally invasive fractional radiofrequency. Dermatol Surg 2015;41:623–32.
13. Fabi SG, Goldman MP. Retrospective evaluation of micro-focused ultrasound for lifting and tightening the face and neck. Dermatol Surg 2014;40:569–75.
14. Woodward JA, Fabi SG, Alster T, Colon-Acevedo B. Safety and efficacy of combining microfocused ultrasound with fractional CO2 laser resurfacing for lifting and tightening the face and neck. Dermatol Surg 2014;40(Suppl 12):S190–3.
15. U.S. Department of Health and Human Services, Food and Drug Administration. Guidance for industry: patient-reported outcome measures: use in medical product development to support labeling claims. 2009. Available from: http://www.fda.gov/downloads/Drugs/Guidances/UCM193282.pdf. Accessed July 21, 2016.
16. Jones D, Donofrio L, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for evaluation of volume deficit of the hand. Dermatol Surg 2016;42(Suppl 10):S195–202.
17. Sykes JM, Carruthers A, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for assessment of chin retrusion. Dermatol Surg 2016;42(Suppl 10):S211–18.
18. Donofrio L, Carruthers A, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for evaluation of facial skin texture. Dermatol Surg 2016;42(Suppl 10):S219–26.
19. Carruthers J, Jones D, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for evaluation of volume deficit of the temple. Dermatol Surg 2016;42(Suppl 10):S203–10.
20. Carruthers J, Donofrio L, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for evaluation of facial fine lines. Dermatol Surg 2016;42(Suppl 10):S227–34.
21. Carruthers A, Donofrio L, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for evaluation of static horizontal forehead lines. Dermatol Surg 2016;42(Suppl 10):S243–50.
22. Donofrio L, Carruthers J, Hardas B, Murphy DK, et al. Development and validation of a photonumeric scale for evaluation of infraorbital hollows. Dermatol Surg 2016;42(Suppl 10):S251–58.
23. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measure of reliability. Educ Psychol Meas 1973;33:613–9.
24. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
25. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–8.
26. Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Stat Med 2002;21:1331–5.
28. Klassen AF, Cano SJ, Scott AM, Pusic AL. Measuring outcomes that matter to face-lift patients: development and validation of FACE-Q appearance appraisal scales and adverse effects checklist for the lower face and neck. Plast Reconstr Surg 2014;133:21–30.