Expert assessment has been put forward as a promising method for occupational exposure assessment in case-control studies. In case-by-case expert assessment, a detailed occupational questionnaire administered by an interviewer is passed to the expert (eg, industrial hygienist, occupational physician, chemist), who evaluates the exposure to a predetermined list of occupational agents. This method, although relatively costly,1 is thought to produce less misclassification than occupation-based exposure assessment (eg, job-exposure matrix) or self-report of exposure.2,3
In multicenter studies, expert assessment depends on local experts who can take into account regional differences in materials, production processes and prevention measures. The success of expert assessment in such studies therefore depends not only on the ability of each expert, but also on the feasibility of standardizing the work of experts who work independently in different settings.
We conducted a multicenter lung cancer case-control study in central and eastern Europe and in Liverpool (UK), using local experts for exposure assessment. This study offered an opportunity to examine in detail the performance of these experts. The main objective of this analysis was to study the reliability of expert teams who were trained together but worked apart, and to examine when and why the teams disagreed. Second, this study provided an indication of the levels of misclassification of the exposure status of study subjects and of the consequences of such misclassification for the risk estimates of lung cancer associated with these agents.
This interteam agreement study was part of a multicenter case-control study of lung cancer, coordinated by the International Agency for Research on Cancer (IARC), and initiated in 1998. The case-control study was conducted in central and eastern Europe, including 15 centers from Russia (Moscow), Poland (Warsaw, Lodz), Czech Republic (Prague, Brno, Olomouc), Slovakia (Banska Bystrica, Bratislava, Nitra), Hungary (Borsod, Heves, Szabolcz, Szolnok, Budapest), and Romania (Bucharest). The target study size was at least 2000 case-control pairs. The study also included a parallel study from the United Kingdom (Liverpool) with a similar protocol. Approval by the ethics review board of IARC and the individual case-control recruiting centers was obtained.
All participating centers applied the same occupational questionnaires and protocol for expert assessment. A questionnaire was administered face-to-face by a trained interviewer with the informed consent of the study participants. The occupational questionnaire consisted of a complete job history, with a description of tasks and materials used in each job. If applicable, a task- or industry-specific questionnaire was administered by the interviewer (specialized questionnaire).
Exposure experts were selected from each participating center. All had long experience in industrial hygiene but no experience in epidemiologic studies. Centralized workshops were organized yearly to train the experts and standardize their work. Experts were required to assess exposure to 70 occupational agents for each job held by the study subjects. For each agent, the experts assessed intensity, frequency and confidence. “Intensity” was the level of exposure (1 = low, 2 = medium, 3 = high); agent-specific quantitative benchmarks were included in the protocol to help calibrate the three intensity levels. “Frequency” was the percentage of work time exposed (1 = 1–5%, 2 = 5–30%, 3 = >30%). Subjects exposed to extremely high levels over short periods of time were defined as having a “peak” exposure. The expert-teams also indicated their “confidence” in the presence of exposure (1 = exposure possible but not probable, 2 = exposure probable, 3 = exposure certain). Confidence can depend on many aspects, including the information available in the job description, the expert team’s knowledge of the specific exposure, and the accessibility of additional information. “Confidence” refers only to the presence of exposure, not to its level or frequency.
We conducted a substudy of the performance of these experts after most of the experts had attended three exposure assessment workshops, had participated in two previous expert substudies, and had experience in the process of exposure assessment.
Eight expert-teams from eight centers participated in this substudy: three centers from Slovakia (each with one expert), and centers from Russia, United Kingdom, Poland, Hungary and Romania (each with two experts).
The sample job descriptions were obtained from study subjects to reflect the quality of questionnaire information that is generally available to the experts. Of the 19 job descriptions, 11 included one specialized questionnaire and 3 included two specialized questionnaires. These jobs were selected to capture a broad diversity of possible exposures and included employment as painter, carpenter, machinist, welder, quarry worker, boiler operator, lathe operator and electronics fitter. The job descriptions were translated into English centrally and blinded for country of origin. Each team individually conducted an exposure assessment for the 19 jobs, following the rules and guidelines in the protocol.
Analysis of the Expert Substudy Results
We analyzed the results by treating each job-agent decision as a separate observation. Because 70 agents could be assessed for each of the 19 jobs, 1330 job-agent decisions were made by each expert team. In the following text, a “positive job-agent decision” means that the expert team assigned exposure to the agent with a confidence of 1, 2 or 3.
Agreement Between Teams
We calculated Cohen’s κ statistic as a measure of between-team agreement, giving a measure of agreement beyond that expected by chance alone. The κ can be affected by prevalence (here, the percentage of positive job-agent decisions),4 unbalanced marginals,5 and the number of categories within the variable. We interpreted the κ statistic using the following arbitrary cut points: <0.4 = poor, 0.4–0.75 = fair to good, >0.75 = excellent agreement. We calculated κ for each category of confidence, frequency, and intensity of exposure, using a dichotomy of the category under study compared with all others. We also calculated the overall agreement for confidence, frequency and intensity unweighted as well as weighted for adjacent categories, using arbitrary weights of 1 for selection of the same category and 0.8 for selection of adjacent categories.
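The unweighted and weighted κ described above can be computed directly from paired ratings. The sketch below is our own illustration, not the study's code; the 0–3 coding and variable names are hypothetical, but the weighting scheme (1 for the same category, 0.8 for adjacent categories) follows the protocol described in the text:

```python
import numpy as np

def cohen_kappa(rater_a, rater_b, categories, weight=None):
    """Cohen's kappa for two raters over the same items.

    weight[i][j] is the credit given when rater A picks category i and
    rater B picks category j; None gives the usual unweighted kappa
    (full credit for exact matches only).
    """
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    obs = np.zeros((k, k))                 # cross-classification of ratings
    for a, b in zip(rater_a, rater_b):
        obs[idx[a], idx[b]] += 1
    obs /= obs.sum()                       # convert counts to proportions
    w = np.eye(k) if weight is None else np.asarray(weight, float)
    # chance-expected table from the marginal distributions of each rater
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    po = (w * obs).sum()                   # (weighted) observed agreement
    pe = (w * exp).sum()                   # (weighted) chance agreement
    return (po - pe) / (1 - pe)

# Weights as in the protocol: 1 for the same category, 0.8 for adjacent ones
# (illustrative coding: 0 = unexposed, 1-3 = confidence levels)
k = 4
w = [[1 if i == j else 0.8 if abs(i - j) == 1 else 0 for j in range(k)]
     for i in range(k)]
```

The per-category κ values reported in the text correspond to collapsing the ratings to a dichotomy (the category under study vs all others) before applying the same function.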
Agreement with a Standard
We compared the assessments of each center with a reference assessment, that of a chemist (J.F.) with years of experience in case-by-case expert assessment for French and international studies. This chemist also trained the experts for this study. The agreement of each expert-team with the reference was calculated using Cohen’s κ, sensitivity and specificity.
To study whether teams had a tendency to overestimate or underestimate exposures or make the ratings for confidence, frequency and intensity systematically higher or lower than the reference, we used the McNemar test of marginal homogeneity.6 For testing marginal homogeneity for all categories simultaneously, we used the Stuart-Maxwell test.7
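Both tests operate on the k × k cross-classification of a team's ratings against the reference. A minimal sketch, assuming the standard textbook forms of the two statistics (p-values would come from a χ² distribution with the stated degrees of freedom):

```python
import numpy as np

def mcnemar(table):
    """McNemar chi-square for a 2x2 cross-classification (team vs reference).

    Only the two discordant cells contribute; under marginal homogeneity
    the statistic follows a chi-square distribution with 1 df.
    """
    b, c = table[0][1], table[1][0]
    return (b - c) ** 2 / (b + c)

def stuart_maxwell(table):
    """Stuart-Maxwell chi-square for marginal homogeneity of a k x k table.

    Returns (statistic, degrees of freedom); compare against a chi-square
    distribution with k - 1 df.
    """
    n = np.asarray(table, float)
    k = n.shape[0]
    rows, cols = n.sum(axis=1), n.sum(axis=0)
    d = (rows - cols)[:-1]            # marginal differences (last one dropped)
    s = -(n + n.T)[:-1, :-1]          # covariance matrix of the differences
    np.fill_diagonal(s, (rows + cols - 2 * np.diag(n))[:-1])
    return float(d @ np.linalg.inv(s) @ d), k - 1
```

A perfectly symmetric table has identical row and column marginals and yields a statistic of zero; systematic over- or under-rating by a team relative to the reference inflates the statistic.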
Effects of Misclassification on the Odds Ratio
We used levels of sensitivity and specificity from the substudy to calculate effects on the observed odds ratio.8 For these calculations we assumed several true odds ratios (2,4,6,10), several prevalences of exposure in the control population (10%, 5%, 1% and 0.5%), and three combinations of sensitivity and specificity (those found for excellent, fair to good, and poor agreement agents). We estimated the minimal sample size needed to detect the attenuated odds ratio, assuming an α of 0.05 and power of 0.8.9
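For nondifferential misclassification, the expected observed odds ratio follows from applying sensitivity and specificity to the true exposure prevalences in cases and controls. A sketch of this standard calculation (our own illustration, not the authors' code):

```python
def observed_or(true_or, p0, sens, spec):
    """Expected odds ratio after nondifferential exposure misclassification.

    p0 is the true exposure prevalence among controls; the same sensitivity
    and specificity are assumed to apply to cases and controls.
    """
    # true prevalence among cases implied by the true odds ratio
    p1 = true_or * p0 / (1 + p0 * (true_or - 1))
    # apparent (misclassified) prevalences in cases and controls
    q1 = sens * p1 + (1 - spec) * (1 - p1)
    q0 = sens * p0 + (1 - spec) * (1 - p0)
    return (q1 / (1 - q1)) / (q0 / (1 - q0))
```

With the substudy's "excellent agreement" values (sensitivity 0.89, specificity 0.97), a true odds ratio of 2 at 10% control prevalence is attenuated to about 1.75, and with the "poor agreement" values (0.39, 0.94) to about 1.33, matching the worked examples given later in the text.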
Agreement Between Teams
Table 1 shows the agreement among the eight teams. The agreement in presence or absence of exposure was fair (κ = 0.55). The agreement between pairs of teams was always in the fair-to-good range, with the lowest κ 0.45 and the highest 0.63 (not shown). The agreement in confidence of exposure was fair (κ = 0.45, weighted κ = 0.55), mainly because of agreement in presence/absence of exposure and in confidence 3 (exposure certain) assessments. The agreement in frequency of exposure was fair (κ = 0.42, weighted κ = 0.54), mainly because of the fair agreement in presence/absence of exposure and in frequency 3 (>30% of work time) assessments. The agreement in intensity of exposure was also fair overall (κ = 0.41, weighted κ = 0.54), but agreement was low for each individual level of intensity. A peak was rarely assessed in this substudy, and the agreement for peaks was no better than expected by chance (κ = 0.00).
Agreement Between Teams by Agent
The agreement in presence or absence of an exposure among teams was calculated for each agent individually (data available with the electronic version of this article at www.epidem.com). For 16 exposures, agreement could not be calculated because they were not assessed for any of the 19 job descriptions. For 9 of the remaining 54 agents, excellent agreement (κ > 0.75) was achieved; for 16 agents, the agreement was fair to good; and for the remaining 29 agents, agreement was poor.
To study the possible causes of the differences in agreement between agents, we considered variables that could influence the difficulty of assessment and therefore the agreement between teams. Agents that were explicitly mentioned in the general occupational questionnaire were more likely to have excellent agreement (mean κ = 0.65) than agents not directly mentioned (κ = 0.30). Likewise, agreement was generally higher for agents that are visible (mean κ = 0.70) than agents not visible (κ = 0.31). Low agreement rates were found for specific agents not easily inferred from the task (eg, chromium dust).
Results of the expert substudy were discussed with the experts. They suggested that some disagreement might be in part caused by true differences in technology and use of materials between countries [eg, engine fuels (diesel or gasoline), insulation materials (asbestos or man-made fibers), or specific pigments (chromium, cadmium, arsenic)]. When we excluded these job-agent decisions from the calculations, the κ increased from 0.55 to 0.60.
Confidence was higher for agents with high agreement (counting only the positive job-agent decisions), although high confidence was also found for agents with low agreement. Confidence therefore does not appear to be a good predictor of the reliability of an assessment.
Agreement with the Reference
The reference rater assessed exposure as present for 170 (13%) of the 1330 job-agent decisions; for the eight teams, this percentage ranged between 8% and 16%. Results of the McNemar test indicated that four centers underreported and one center overreported exposure, whereas three centers assessed about the same number of exposures as the reference (results not shown). The results of the Stuart-Maxwell test (see Fig. 1) show that centers F and A interpreted the confidence scale similarly to the reference.
For intensity (see Fig. 2), most centers had comparable numbers of intensity 3 assessments, although two centers assessed intensity 3 significantly more often. This might reflect a different interpretation of the intensity scale, but might also be caused in part by true differences in exposure levels for these two centers.
Very little marginal homogeneity was found for frequency of exposure, indicating a variable interpretation of frequency among the centers. These results were discussed with the experts, who concluded that a better use of the quantitative cut-points of intensity and frequency would improve marginal homogeneity in future assessments.
Sensitivity, Specificity and κ
Figure 4 shows various measures of agreement between the teams and the reference regarding the presence or absence of exposure, plotted against the percentage of positive job-agent decisions. Of the 170 exposures assessed as present by the reference, 48% to 75% were also assessed by the expert-teams (the “sensitivity”). The specificity was above 0.9 for all centers, and the κ ranged between 0.53 and 0.64.
Sensitivity and specificity depended on the percentage of positive job-agent decisions of each team. The κ was largely unaffected by this and showed little range among teams, indicating that the teams performed comparably, without obvious outliers.
Sensitivity and specificity were calculated for three groups of agents: those with excellent, fair/good, and poor agreement among the eight teams. For the group of agents with excellent agreement, both sensitivity and specificity were high (sensitivity = 0.89, specificity = 0.97). The agents with fair/good and poor agreement were assessed with a lower sensitivity (0.70 and 0.39, respectively), while the specificity remained high for these agents (specificity = 0.94) (first row Table 2).
Effects of Misclassification on the Odds Ratio
We studied how the odds ratio would be affected given the sensitivity and specificity found for agents with “excellent”, “fair to good” and “poor” interteam agreement. Table 2 indicates that the odds ratio will be attenuated towards 1 even when sensitivity and specificity are high. For example, an odds ratio of 2 (with a true prevalence of exposure of 10%) will be attenuated to 1.75 for “excellent agreement” agents, and to 1.33 for “poor agreement” agents. Attenuation will be greater for agents with low prevalence in the study population (eg, <5%). For most agents, the lifetime prevalence will be higher than 1% for men, whereas for women the lifetime prevalence of exposure is often under 1% (data available with the electronic version of the article at www.epidem.com). The effects of exposure misclassification will therefore be larger in subgroups with a lower prevalence of exposure (eg, women in this lung cancer study).
Case-by-case expert assessment is often cited as the most accurate method for retrospective occupational exposure assessment among the methods commonly used in community-based case-control studies.2,10 However, there are few validity studies. Only a few studies have compared the assessment of experts with objective measurements in urine11,12 or the work environment.13–16 In the studies reporting sensitivity and specificity,11,13 the specificity of experts was generally high (>0.9) and sensitivity low (∼0.3), indicating that experts missed exposures but did not generate many false-positive assessments.
Several intermethod reliability studies have compared expert assessments with other retrospective methods such as job-exposure matrices17–21 and self-reported exposure.16,21–24 In these studies expert assessment is often used as the “gold standard”. Thus, they provide no information about the levels of misclassification present in expert assessment itself, or the differences in quality among experts, or the differences in assessment accuracy for various agents.
Studies that determined the agreement among experts (inter-rater agreement studies),13,21,25–32 indicate variable reliability of the experts in different studies and for different agents. This suggests that, although expert assessment can be reliable, it is not necessarily reliable in all situations or with all agents.
As previous studies have shown,10 our data suggest that the reliability of expert assessment depends on the agent assessed. There was high agreement for exposures with detailed information in the questionnaire and for aggregate exposures, and low agreement for very specific exposures and for exposures that are difficult to infer from the task.
To reflect these agent differences in the reliability of expert assessment, we divided the agents into three groups: those for which the experts’ performance was (1) poor, (2) average to good, and (3) excellent, representing worst-, medium- and best-case scenarios for expert assessment in this study. Examples of the agents in each category are given in the electronic version of this article (www.epidem.com). All three groups showed high specificity, indicating that even when an agent is difficult to assess, the result is mainly false-negative rather than false-positive assessments. This is important in studies with low prevalence of exposure, in which false-positive assessments attenuate the odds ratio towards unity to a greater extent than false-negative evaluations.11,33 Table 2 shows that attenuation of the OR will be considerable even for the excellent-agreement agents, and that prevalence of exposure and study size are just as important to take into account when evaluating the feasibility of expert assessment.
When interpreting these hypothetical results it is important to consider that the sensitivity and specificity estimates were based on a comparison with a reference rater, who herself is subject to misclassification. Because two studies comparing expert assessments with objective measurements in the work environment and urine11,13 indicated relatively low sensitivity but high specificity, one can assume that the true sensitivity of the experts might be lower than measured in this study and that attenuation of the OR may therefore be more substantial.
In the absence of objective exposure measurements for these subjects, we do not know whether the assessed exposure levels corresponded to true levels. However, the low agreement for the individual intensity levels indicates that considerable misclassification will occur. In a hypothetical scenario applying the observed misclassification to a dose-response slope, the slope was flattened by at least 50% (depending on the exposure prevalence) but did not change direction.
We also studied whether expert assessment can be sufficiently standardized across independent teams of experts to make its application in a multicenter study possible. One aspect of standardization is ensuring that all teams use the same cut-points for exposure and its indices. There was an encouraging amount of marginal homogeneity among teams for the number of positive job-agent decisions and for the confidence and intensity levels (Fig. 1, Fig. 2, and Fig. 3), especially considering that other studies have sometimes found very large differences in exposure prevalence (eg, 7% vs 51%).29 Graphic representations of marginal homogeneity (such as in Figs. 1–3) and regular workshops may have helped to homogenize the assessment strategies among the teams. Another aspect of standardization is the level of knowledge of each team of experts. All teams showed fair to good agreement with the reference (κ = 0.53–0.64), without evident outliers. Agreement was also influenced by true differences in the use of industrial materials among countries, although overall agreement among teams improved only moderately (from 0.55 to 0.60) when this was taken into account.
This κ compares well with those found in other studies, especially considering that the teams had not worked together. Experts who have worked together for years can develop consistency through shared attitudes and knowledge, but they may also share the same biases; comparisons among such experts are therefore more likely to overestimate the performance of the method than the comparisons presented here.
An evident limitation is the small size of this reliability study. This required that sensitivity and specificity estimates be grouped for agents rather than given individually. Also, many job descriptions of study subjects will not be as complicated to assess as the ones used for this expert substudy, suggesting that misclassification of exposure might be less in the whole study population. On the other hand, the fact that the experts knew that their work would be evaluated might have resulted in a more thorough search for possible exposures than would be done under ordinary conditions.
In conclusion, this study showed that expert teams that were trained together but worked apart performed similarly, making expert assessment a useful tool in multicenter studies. The levels of misclassification by experts differed by agent, which attenuated the observed OR towards unity and flattened a dose-response relationship.
The results of this study may be useful to help decide whether expert assessment would be a sufficiently reliable tool in future studies. This study also shows that to calculate the sample size needed to detect a significantly increased risk, the expected misclassification of exposure should be taken into account.
The authors thank Jack Siemiatycki and Louise Nadon from the Institute Armand-Frappier, Montreal, Canada, for their valuable contribution to the development of the exposure assessment methodology for this study.
1. Siemiatycki J, Dewar R, Richardson L. Costs and statistical power associated with five methods of collecting occupation exposure information for population-based case-control studies. Am J Epidemiol.
2. McGuire V, Nelson LM, Koepsell TD, Checkoway H, Longstreth WT Jr. Assessment of occupational exposures in community-based case-control studies. Annu Rev Public Health.
3. Stewart P. Challenges to retrospective exposure assessment. Scand J Work Environ Health.
4. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol.
5. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol.
6. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika.
7. Maxwell AE. Comparing the classification of subjects by two independent judges. Br J Psychiatry.
8. Armstrong BK, White E, Saracci R. Principles of Exposure Measurement in Epidemiology. Oxford: Oxford University Press, 1992:69–71.
9. Casagrande JT, Pike MC. An improved approximate formula for calculating sample sizes for comparing two binomial distributions. Biometrics.
10. Teschke K, Olshan AF, Daniels JL, et al. Occupational exposure assessment in case-control studies: opportunities for improvement. Occup Environ Med.
11. Tielemans E, Heederik D, Burdorf A, et al. Assessment of occupational exposures in a general population: comparison of different methods. Occup Environ Med.
12. Teschke K, Hertzman C, Dimich-Ward H, Ostry A, Blair J, Hershler R. A comparison of exposure estimates by worker raters and industrial hygienists. Scand J Work Environ Health.
13. Benke G, Sim M, Forbes A, Salzberg M. Retrospective assessment of occupational exposure to chemicals in community-based studies: validity and repeatability of industrial hygiene panel ratings. Int J Epidemiol.
14. Post W, Kromhout H, Heederik D, Noy D, Smit Duijzentkunst R. Semiquantitative estimates of exposure to methylene chloride and styrene: the influence of quantitative exposure data. Appl Occup Environ Hyg.
15. Hawkins NC, Evans JS. Subjective estimation of toluene exposure: a calibration study of industrial hygienists. Appl Ind Hyg.
16. Kromhout H, Oostendorp Y, Heederik D, Boleij JS. Agreement between qualitative exposure estimates and quantitative exposure measurements. Am J Ind Med.
17. Luce D, Gerin M, Berrino F, Pisani P, Leclerc A. Sources of discrepancies between a job exposure matrix and a case by case expert assessment for occupational exposure to formaldehyde and wood-dust. Int J Epidemiol. 1993;22(suppl 2):S113–S120.
18. Stengel B, Pisani P, Limasset JC, Bouyer J, Berrino F, Hemon D. Retrospective evaluation of occupational exposure to organic solvents: questionnaire and job exposure matrix. Int J Epidemiol. 1993;22(suppl 2):S72–S82.
19. Stucker I, Bouyer J, Mandereau L, Hemon D. Retrospective evaluation of the exposure to polycyclic aromatic hydrocarbons: comparative assessments with a job exposure matrix and by experts in industrial hygiene. Int J Epidemiol. 1993;22(suppl 2):S106–S112.
20. McNamee R. Retrospective assessment of occupational exposure to hydrocarbons. Job exposure matrices versus expert evaluation of questionnaires. Occup Hyg.
21. Rybicki BA, Johnson CC, Peterson EL, Kortsha GX, Gorell JM. Comparability of different methods of retrospective exposure assessment of metals in manufacturing industries. Am J Ind Med.
22. Savitz DA, Baird N, Dole N. Agreement among textile industry exposures during pregnancy based on work description, job title, and self-report [abstract]. J Expo Anal Environ Epidemiol.
23. Fritschi L, Siemiatycki J, Richardson L. Self-assessed versus expert-assessed occupational exposures. Am J Epidemiol.
24. Teschke K, Hertzman C, Dimich-Ward H, Ostry A, Blair J, Hershler R. A comparison of exposure estimates by worker raters and industrial hygienists. Scand J Work Environ Health.
25. Goldberg MS, Siemiatycki J, Gerin M. Inter-rater agreement in assessing occupational exposure in a case-control study. Br J Ind Med.
26. Ciccone G, Vineis P. Inter-rater agreement in the assessment of occupational exposure to herbicides. Med Lav.
27. Schmidt S, Lemasters G, Succop P. A comparison of reliability estimates when industrial hygienists determine workplace exposures [abstract]. Am J Epidemiol.
28. Labreche FP, Cherry NM, McDonald JC. Psychiatric disorders and occupational exposure to solvents. Br J Ind Med.
29. Macaluso M, Delzell E, Rose V, Perkins J, Oestenstad K. Inter-rater agreement in the assessment of solvent exposure at a car assembly plant. Am Ind Hyg Assoc J.
30. Siemiatycki J, Fritschi L, Nadon L, Gerin M. Reliability of an expert rating procedure for retrospective assessment of occupational exposures in community-based case-control studies. Am J Ind Med.
31. McGuire V, Longstreth WT Jr, Nelson LM, et al. Occupational exposures and amyotrophic lateral sclerosis: a population-based case-control study. Am J Epidemiol.
32. Stewart PA, Carel R, Schairer C, Blair A. Comparison of industrial hygienists’ exposure evaluations for an epidemiologic study. Scand J Work Environ Health.
33. Stewart WF, Correa-Villasenor A. False positive exposure errors and low exposure prevalence in community-based case-control studies. Appl Occup Environ Hyg.
34. Semple SE, Proud LA, Tannahill SN, Tindall ME, Cherrie JW. A training exercise in subjectively estimating inhalation exposures. Scand J Work Environ Health.