Coeliac disease (CD) is an immune-mediated enteropathy triggered in genetically susceptible individuals by the ingestion of gluten-containing grains (eg, wheat, barley, rye). The disease is associated with human leukocyte antigen (HLA) DQ2 and DQ8 haplotypes and is characterised by intestinal inflammation, villous atrophy, and an association with autoimmunity. Early diagnosis and dietary treatment can prevent severe, sometimes life-threatening complications (1). Current guidelines from European and North American medical societies for the diagnosis of CD in children include serological testing and the requirement for histological confirmation from small-bowel biopsies (2–4).
With improvement of diagnostic antibody tests, a diagnosis of CD without intestinal biopsies has been suggested in defined situations (5). Omitting biopsies would reduce the burden of endoscopy and of general anaesthesia for the affected children, save costs, and avoid potential adverse effects of these procedures.
The intention of this systematic review and meta-analysis was to update the evidence report on serological tests for CD published in 2004 by the Agency for Healthcare Research and Quality (AHRQ) (6). In the present report, publications from January 1, 2004 to September 1, 2009 were analysed if they used serologic tests in children (anti-gliadin [AGA], endomysial [EmA], or anti-transglutaminase-2 [TG2] antibodies) as index tests, compared with duodenal biopsy results used as the reference standard. In addition to the tests covered by the AHRQ report (6), point-of-care (POC) tests and tests based on deamidated gliadin peptide (DGP) antibodies used as index tests were considered if they were validated against histology as reference. The update was intended to search especially for situations with highly elevated titres where biopsy may be omitted in the diagnosis of CD.
PATIENTS AND METHODS
The protocol of this review was approved by the ESPGHAN Working Group on Coeliac Disease Diagnosis and adhered to current methodological guidelines on the conduct of systematic reviews of diagnostic accuracy (7).
We used the same search algorithm as AHRQ (6; search details reported in Appendix B of that review) and included cohort studies that enrolled symptomatic children with true diagnostic uncertainty as well as case-control studies of patients with CD (cases) and children with other diseases or relatives of patients with CD (controls, see Table 1). Tests of interest were the measurement of IgA- and IgG-AGA, EmA, anti-TG2 (also named anti-tTG), and, in addition to AHRQ, anti-DGP antibodies, as well as POC tests for IgA-anti-TG2 not requiring laboratory support.
TABLE 1-a Study feat...
Eligible studies had to report on results of duodenal biopsies for at least 90% of patients classified by Marsh criteria (8) or a corresponding classification allowing assessment of the degree of villous atrophy of duodenal mucosa. Numbers for 2 × 2 tables with a minimum of n > 10 children with CD had to be available to calculate sensitivity, specificity, positive and negative likelihood ratios (“LR+”, “LR−”), and diagnostic odds ratios (DOR), and >80% of patients with CD had to have at least Marsh II lesions. Median age had to be <18 years. All of the children had to be on a gluten-containing diet. Serological test results were not to be part of the CD case definition to distinguish index tests from reference standard. Only full-text publications in English were included.
TABLE 1-b Study feat...
More detailed information on the 8 exclusion criteria (E1–E8) is provided in Figure 1 and the Appendix. Studies were eliminated in a hierarchical manner, so a study excluded for E1 was not checked further for exclusion criteria E2–E8.
Study Identification and Study Quality Assessment
Two reviewers working independently and in duplicate reviewed all of the abstracts and titles and, upon retrieval of potentially eligible studies, the full-text publications for eligibility. Disagreements were resolved by consensus. Studies were eliminated if they were clearly unrelated to the project's topic, judged on review of the title, abstract, and keywords (level 1 screening), or if they dealt with pathology other than CD or had study objectives other than diagnosis, for example, screening for CD, measuring prevalence of CD, or follow-up of patients with CD (level 2 screening).
Study quality was assessed independently by 2 investigators (K.G., N.S.) according to the Quality Assessment of Studies of Diagnostic Accuracy System (QUADAS (9)). Disagreements were resolved by consensus. Whereas QUADAS reports on 14 items qualitatively, we quantified them for sensitivity analysis to obtain 1 additive summary score per study. Confirmed items were counted as “1,” nonconfirmed items were coded as “−1,” and “unknown” items were handled as “0” (10).
Five QUADAS items were always accepted as present (score +1), as they were part of the inclusion criteria:
(3) reference standard (in our case: biopsy) likely to correctly classify disease
(5) whole sample or random sample receive reference standard
(6) same reference standard regardless of index test result
(7) index test not part of reference standard
(12) same clinical data available in study as if in practice
The maximum score could thus be +14 and the minimum score could be −9.
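The additive scoring rule described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name, the rating labels ("yes", "no", "unknown"), and the example ratings are assumptions introduced here.

```python
# QUADAS items that were always counted as present (+1) because they were
# part of the inclusion criteria of this review.
ALWAYS_PRESENT = {3, 5, 6, 7, 12}

# Coding of the qualitative ratings (the labels themselves are hypothetical).
ITEM_SCORE = {"yes": 1, "no": -1, "unknown": 0}


def quadas_summary_score(ratings):
    """Return the additive summary score over QUADAS items 1..14.

    `ratings` maps an item number to "yes", "no", or "unknown".
    Items missing from `ratings` are treated as "unknown".
    """
    total = 0
    for item in range(1, 15):
        if item in ALWAYS_PRESENT:
            total += 1  # fixed by the inclusion criteria
        else:
            total += ITEM_SCORE[ratings.get(item, "unknown")]
    return total


# Hypothetical study: 5 items confirmed, 2 not confirmed, 2 unknown.
example_ratings = {1: "yes", 2: "yes", 4: "no", 8: "unknown", 9: "yes",
                   10: "yes", 11: "no", 13: "yes", 14: "unknown"}
example_score = quadas_summary_score(example_ratings)
```

For this hypothetical study the 5 fixed items contribute +5, the confirmed items +5, and the nonconfirmed items −2.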
Results of Study Identification
After deletion of duplicates (88), 2510 records were retrieved and entered level 1 screening (Fig. 2); 323 articles were checked at level 2 and 316 full texts obtained; 87 articles entered level 3 screening, of which n = 71 were excluded in a hierarchical manner by exclusion criteria “E1–E8” (Figs. 1, 2).
All of the eligible studies were abstracted in a standardised form, reporting the study's goal, design, publication type, study population, median and range of age or mean with standard deviation, proportion female, country, Marsh stage for biopsy (reference test for disease status), checks for presence of IgA deficiency, information on HLA test results in conflicting patients, information to set up 2 × 2 tables for each index test and subgroup, and type and cutoff of tests. All of the studies reported on more than 1 index test, and if multiple cutoffs were reported, this information was collected in a new dataset, leading to up to 12 datasets per study and a total of 80 datasets for the 16 studies (10–26).
We used MetaDiSc 1.4 software (Meta-analysis of Diagnostic and Screening Tests) for pooling data (27). Because sensitivities and specificities are interrelated, we also present single and pooled LR and DOR for each index test as summary measures and give information on statistical heterogeneity and the appropriateness of pooling for each test and test parameter.
For sensitivities and specificities for the cutoff given by the manufacturer, 95% confidence intervals (CIs) were calculated with MetaDiSc using the F distribution, and summary estimates were obtained by weighted averages in which the weight of each study is its sample size.
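A minimal sketch of the sample-size-weighted summary estimate may help here. It is not MetaDiSc's code; the function name and the study counts below are hypothetical. Weighting each study's proportion by its sample size is equivalent to dividing the summed events by the summed sample sizes.

```python
def pooled_proportion(events, totals):
    """Sample-size-weighted average of per-study proportions.

    For sensitivity: `events` are true positives and `totals` are the numbers
    of biopsy-confirmed CD patients per study. For specificity: `events` are
    true negatives and `totals` are the numbers of patients without CD.
    """
    weighted_sum = sum(n * (e / n) for e, n in zip(events, totals))
    return weighted_sum / sum(totals)


# Hypothetical example with three studies.
true_positives = [45, 90, 28]
cd_patients = [50, 100, 30]
pooled_sensitivity = pooled_proportion(true_positives, cd_patients)
```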
Tests with a high LR+ indicate tests that can help rule in CD, and tests with a low LR− can help rule out CD: an LR+ >10 provides “conclusive evidence” for the disease under study, whereas a low negative likelihood ratio (LR− <0.1) is “conclusive evidence” against the diagnosis (28). Ranges of 5 < LR+ ≤ 10 and 0.1 ≤ LR− < 0.2 give “strong diagnostic evidence.”
DORs describe the ratio of the odds of a positive test result in patients with disease compared with patients without disease and can be calculated as the ratio of the LRs for a positive and a negative test. DORs have the advantage of being a single indicator of test performance providing a global measure of agreement between a test and a reference standard.
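The relation between a 2 × 2 table, the LRs, and the DOR can be illustrated as follows. This is a sketch with hypothetical function names and counts; it assumes no empty cells, and the label "weaker" for values outside the quoted ranges is an assumption made here, not a term from the cited classification.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, LR+, LR-, and DOR for one 2 x 2 table.

    Assumes all four cells are non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    dor = lr_pos / lr_neg  # algebraically equal to (tp * tn) / (fp * fn)
    return sensitivity, specificity, lr_pos, lr_neg, dor


def rule_in_evidence(lr_pos):
    """Grade LR+ using the thresholds quoted in the text."""
    if lr_pos > 10:
        return "conclusive"
    if lr_pos > 5:
        return "strong"
    return "weaker"


def rule_out_evidence(lr_neg):
    """Grade LR- using the thresholds quoted in the text."""
    if lr_neg < 0.1:
        return "conclusive"
    if lr_neg < 0.2:
        return "strong"
    return "weaker"


# Hypothetical table: 95 of 100 CD patients test positive,
# 96 of 100 patients without CD test negative.
sens, spec, lr_pos, lr_neg, dor = diagnostic_measures(tp=95, fp=4, fn=5, tn=96)
```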
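LRs and DORs were pooled with MetaDiSc using a random effects model (27,29); 95% CIs were pooled by
$$\mathrm{LR}\cdot e^{\pm z_{\alpha/2}\,\mathrm{SE}(\ln \mathrm{LR})}; \qquad \mathrm{DOR}\cdot e^{\pm z_{\alpha/2}\,\mathrm{SE}(\ln \mathrm{DOR})}$$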
Cells with 0 content were filled with 0.5. We assessed the inconsistency among studies with MetaDiSc using the inconsistency-squared (I²) statistic, which represents the proportion of variability across studies that is not the result of chance. I² values of ≈25%, ≈50%, and ≈75% indicate low, moderate, and high heterogeneity, respectively (30).
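The zero-cell correction, the log-scale confidence interval, and the I² statistic can be sketched together. This is an illustration under stated assumptions, not MetaDiSc's code: all function names are hypothetical, 0.5 is added to every cell of a table that contains a zero (the text does not say whether only the empty cell or all cells were adjusted), and I² is derived from Cochran's Q with inverse-variance weights on the log scale.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% CI


def correct_zero_cells(tp, fp, fn, tn):
    """Add 0.5 to every cell if any cell is zero (assumed convention)."""
    if 0 in (tp, fp, fn, tn):
        return tp + 0.5, fp + 0.5, fn + 0.5, tn + 0.5
    return tp, fp, fn, tn


def log_dor_and_se(tp, fp, fn, tn):
    """Return ln(DOR) and its standard error for one 2 x 2 table."""
    tp, fp, fn, tn = correct_zero_cells(tp, fp, fn, tn)
    log_dor = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return log_dor, se


def dor_with_ci(tp, fp, fn, tn, z=Z_95):
    """Return (DOR, lower, upper) using DOR * exp(+/- z * SE(ln DOR))."""
    log_dor, se = log_dor_and_se(tp, fp, fn, tn)
    return (math.exp(log_dor),
            math.exp(log_dor - z * se),
            math.exp(log_dor + z * se))


def i_squared(log_estimates, standard_errors):
    """Return I^2 in percent: max(0, (Q - df) / Q) * 100."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = (sum(w * e for w, e in zip(weights, log_estimates))
              / sum(weights))
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, log_estimates))
    df = len(log_estimates) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100
```

A call such as `dor_with_ci(90, 5, 10, 95)` returns the point estimate together with its lower and upper 95% limits; passing the per-study `ln(DOR)` values and standard errors to `i_squared` gives the inconsistency in percent.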
In case of elevated heterogeneity (I² > 50%), pooled results for sensitivity or specificity are not shown. Instead, numbers of studies reaching ≥90% and, in some instances, ≥95% of sensitivity or specificity are reported, as are ranges.
For sensitivity analysis, subgroups were classified by biopsy criteria (“Marsh I/II sufficient for CD” vs “Marsh III required for CD”), type of study (cohort vs case-control study), or overall study quality (QUADAS score ≥9 vs <9) for the most frequently applied test: IgA-anti-TG2 enzyme-linked immunosorbent assay (ELISA), performed on 15 study populations. Sensitivities and specificities were pooled for each subgroup and a bivariate χ² was calculated, using the Yates correction (EpiInfo statcalc.exe, available at: www.cdc.gov/epiinfo/downloads.htm).
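For readers without access to EpiInfo, a Yates-corrected χ² comparison of two pooled proportions can be sketched as below. This is not the EpiInfo implementation; the function name and the counts are hypothetical. The P value uses the identity that, for 1 degree of freedom, the upper tail of the χ² distribution equals erfc(√(χ²/2)).

```python
import math


def yates_chi_square(a, b, c, d):
    """Yates-corrected chi-square test for a 2 x 2 table.

    a, b: test-positive and test-negative CD patients in subgroup 1
    c, d: test-positive and test-negative CD patients in subgroup 2
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2, not below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi_square = n * corrected ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical subgroups: 20/30 vs 10/30 patients test positive.
chi2, p = yates_chi_square(20, 10, 10, 20)
```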
Table 1 summarises the baseline characteristics of the 16 eligible studies: A total of 1876 biopsy-verified patients with CD were compared with 1234 patients without CD from 10 publications of 11 cohort studies (2 cohorts in Korponay-Szabó et al (20)) and 6 case-control studies. The numbers of patients with CD per study varied between n = 16 and n = 428 and those of comparison patients between n = 5 and n = 160. If studies presented separate results for children, only data from children were taken, as grouped by the authors, sometimes limited to only a small number of children contributing information: 15 patients with CD younger than 2 years and 5 controls of the respective age were included from Collin et al (18). Llorente et al (22) set the limit at younger than 3 years, yielding 24 patients with CD and 19 comparisons; Raivio et al (25) at 16 years with 81 patients with CD and 59 comparison patients. For the remaining studies, median ages are given or means as well as ranges or standard deviations for age, as reported by the respective studies.
Basso et al (14) had 8/161 patients with duodenal biopsies classified as Marsh I (“infiltrative”), whereas the remaining cases had Marsh II/III (“hyperplastic”/“villous atrophy”). For the populations analysed by Barker et al (13), Bazzigaluppi et al (15), and Prause et al (24), at least Marsh II was required for a CD case, whereas in the remaining studies, at least partial villous atrophy had to be present, corresponding to Marsh IIIa (8). The numbers of patients biopsied per group are reported as well as the reported index tests with information on the test manufacturer and on study quality assessed by the QUADAS summary score.
For each set of tests, pooled results are presented in Table 2 for sensitivity, specificity, LR+, LR−, and DOR with ranges and 95% confidence limits, and information on statistical heterogeneity (P value, inconsistency I²).
Six of the included studies reported on IgA-AGA ELISAs, showing a larger range for sensitivity (60.9%–96.0%) than for specificity (79.4%–93.8%). Pooling was not done for sensitivity because of statistically relevant heterogeneity. For 4 studies, both sensitivity and specificity were <90%; Prause et al (24) presented a low sensitivity (73.9%) and a higher specificity (91.9%), whereas Lagerqvist et al (21) reported both a high sensitivity (96.0%) and specificity (93.8%). Pooled specificity was 89.8% (95% CI 86.9%–92.3%).
In the subgroup of children younger than 18 months, Lagerqvist et al (21) reported a high sensitivity of 97.2% and a lower specificity of 83.2%.
Pooled LR+ was 7.3 (95% CI 4.5–11.8), pooled LR− 0.186 (95% CI 0.095–0.362), and the DOR 40.6, all 3 pooled estimates showing large heterogeneity, indicated by parentheses in Table 2.
Two studies reported on IgG-AGA, reaching sensitivities and specificities of 88.0% and 80.0% (24) and 73.9% and 26.3% (26), respectively (not shown in Table 2).
IgA-anti-TG2: ELISAs and Radiobinding Assays (RBAs)
IgA-anti-TG2 test results were reported from 13 studies plus the 2 cohorts reported by Korponay-Szabó et al (20) on ELISAs (1694 patients with CD and 1138 comparison patients) and from 3 studies on RBAs (255 patients with CD and 146 comparisons).
Pooling of studies for sensitivity and specificity was not done because of elevated heterogeneity, as can be seen from the forest plots (Figs. 3 and 4). Ranking the studies by results for sensitivity, 12 out of 15 study populations reporting on ELISAs had measured sensitivities ≥90%, whereas 9 achieved sensitivities ≥95%. For specificity, 13 studies found results ≥90% and 11 studies ≥95%. Pooled LR+ was 22.4 (95% CI 12.1–41.4) for all 15 studies and LR− 0.06 (95% CI 0.03–0.1); the DOR was 508 (95% CI 247–1042).
Two study populations from India related to the tests based on guinea pig antigen (GP; (23,26)) were reviewed; 1 started the cohort with GP and replaced it later with human antigen (13), whereas the remainder used human (natural or recombinant) antigens only. Sensitivity was discrepant in the 2 Indian studies, whereas 7 of 10 study populations using human recombinant antigen-based tests (all but 1 using the Celikey test from Phadia) showed sensitivities ≥95% and 8 of 10 study populations showed specificities ≥97%. The LR+ for the subset using human recombinant antigen–based tests was 28.9 (95% CI 14.2–58.8), the LR− 0.046 (95% CI 0.024–0.087), and the DOR was 784 (95% CI 362–1702).
RBAs showed a similar quality of diagnostic parameters: 2 of 3 studies showed both sensitivities and specificities ≥95% (15,17). Agardh et al (11) reported a sensitivity of 89.0% and a specificity of 94.0%. Pooled LR+ was 19.2 (95% CI 9.07–40.6), LR− 0.06 (95% CI 0.02–0.17), and DOR was 347 (95% CI 78.6–1529).
IgG-anti-TG2: ELISAs and RBAs
There were 4 studies reporting on IgG-anti-TG2. Tests were either ELISAs (12,24) or radioimmunoassay (11,15). Results were heterogeneous; 2 tests showed <30% sensitivity, whereas specificity was ≥94% in 3 studies. The radioimmunoassay used by Bazzigaluppi et al (15) showed the best diagnostic accuracy among IgG-anti-TG2 tests (99.3% sensitivity and 95.3% specificity).
IgA-EmA tests were reported from 9 studies plus those from the 2 cohorts by Korponay-Szabó et al (20). For 7 of 11 study populations, the reported sensitivities were ≥90% (pooling not done because of too much heterogeneity), and for all 11 studies, specificities were at least 94.7%, yielding a pooled estimate of 98.2%. Pooled LR+ was 31.8, LR− 0.067, and the DOR was 554 (more details shown in Table 2). There were no studies reporting on IgG EmA.
POC Tests: IgA-AGA
POC tests for IgA-AGA were analysed by 2 studies, but results were only shown in combination with anti-TG2. Pooling of POC combination tests for IgA-AGA/anti-TG2 was not done for sensitivity ((16): 95.6% and (19): 99.3%) because of high heterogeneity, whereas pooled specificity was 95.7% (95% CI 91.6%–98.1%) with no substantial heterogeneity (not shown in Table 2).
POC Tests: IgA-anti-TG2
IgA-anti-TG2 POC tests were reported from 4 studies including 2 cohorts by Korponay-Szabó et al (20). For all 5 study populations, sensitivities were ≥94.7% (pooled 96.4%; 4 studies ≥95%); and specificities were all ≥96.6% (pooled 97.7%); pooled LR+ was 40.6, LR− was 0.04, and DOR was 1343 (data are presented in more detail in Table 2).
For IgA-DGP, results were reported from 3 studies using Quanta-lite Gliadin IgA II (12,14,24) and a second test used by Prause et al (GAF-3X). Sensitivities were in the range of 80.7% to 95.1%. Two tests reached ≥90% sensitivity; 1 passed the threshold ≥95%.
Specificity ranged between 86.3% and 93.1%, with 3 of 4 results reaching ≥90%. LR+ was 9.4, LR− 0.12, and DOR was 86.1 (data are presented in more detail in Table 2).
With regard to IgG-DGP, neither sensitivities nor specificities could be pooled from the studies by Basso et al (14), Agardh (12), and Prause et al (24) because of heterogeneity. Sensitivity ranged between 80.1% and 98.6%, the latter 2 reports reaching ≥95%. As for specificity, 3 out of 4 results were ≥90% (all except (12)), whereas the specificity reported by Basso et al (14) surpassed 95%. LR+ was 13.6, LR− was 0.061, and DOR was 234 (data are presented in more detail in Table 2).
The average QUADAS score of the included studies was 8.1 (median 8, range 3–13; a complete item description and individual study results can be found in (10)).
Sensitivity Analyses (for IgA-anti-TG2 Only)
Studies requiring histological Marsh III lesions showed higher test sensitivities for IgA-anti-TG2 (94.6%) than studies that also included patients with Marsh I/II+ lesions (88.4%; P < 0.001), whereas for specificity there was no substantial discrepancy (94.9% for Marsh I/II+ for CD vs 96.6% for Marsh III as minimum requirement; P = 0.25).
The QUADAS score did not have any impact on either the measured sensitivities (P = 0.9) or specificities (P = 0.5), nor did the type of study (cohort vs case-control studies; difference for sensitivity: P = 0.44, for specificity: P = 0.25).
Complete Concordance (100%) of Laboratory Tests With Biopsy Classification
Of 80 test constellations reported by the 16 studies, 7 yielded sensitivities of 100% (“no patient with CD missed by the index test”) and 17 reported specificities of 100% (“no one without CD wrongly classified as patient with CD”). Four test constellations were without any classification error, showing both 100% sensitivity and 100% specificity (2 for IgA EmA, 1 for IgA-anti-TG2 ELISA, and 1 for IgA-anti-TG2 RBA (17,22,25), respectively).
Studies Reporting on Multiple Cutoffs
Agardh (12), Barker et al (13), and Poddar et al (23) reported results for multiple cutoffs: Agardh and Barker et al both used Quanta Lite by Inova Diagnostics (San Diego, CA). Agardh (12) used the recommended cutoff of 20 U/mL as well as an additional study cutoff at 30 U/mL, whereas Barker et al used 20 and 100 U/mL with the intention to search for thresholds of sufficient diagnostic accuracy to avoid biopsy. The Quanta Lite kit in the Barker et al study had been changed from GP based to human TG2 antigen during the study. Poddar et al (23) reported on the test by The Binding Site at a cutoff of 4 U/mL (the recommended cutoff by the manufacturer) and at 10 and 20 U/mL.
Perfect specificity (no false-positives) was reported by Agardh (12) for IgA-TG2 at a cutoff of 30 U/mL and for IgG at 20 U/mL. Sensitivity was 95% for IgA-TG2 at a cutoff of 30 U/mL, but low for IgG: 12.6% at 20 U/mL and 4.6% at 30 U/mL.
Barker et al (13) improved specificity from 80% to 97.7% for IgA-anti-TG2 at 100 U/mL, wrongly classifying 1 of 45 patients without CD as CD positive, but lost 7 true patients with CD (sensitivity decreased from 94.8% to 82.7%) compared with the cutoff suggested by the manufacturer for the Quanta Lite test.
INTERPRETATION AND DISCUSSION
We conducted a systematic review and meta-analysis of studies that enrolled mostly paediatric patients with suspected CD and performed antibody detection index tests for CD with biopsy as the reference standard test. Our review included studies published after 2003; older studies were previously summarised by AHRQ (6).
Sensitivity is crucial for finding new patients, but for evaluating whether seropositivity for coeliac antibodies would correctly detect CD without performing an intestinal biopsy, specificity is the critical parameter. Sensitivity and specificity are not independent parameters, and in general, lowering the threshold cutoff of a particular test kit increases sensitivity with a concomitant loss of specificity.
Tests With Best Performance: EmA and Anti-TG2
The synthesis of the evaluated 80 datasets in 16 studies indicated that IgA-anti-TG2 and EmA were the best laboratory tests predicting CD. For both tests, LRs were in the range of “conclusive evidence” (28) for either confirming or rejecting a diagnosis of CD in case of a positive (LR+) or negative (LR−) test result (Table 2; test statistics).
Sensitivity of EmA tests ranged lower (7/11 studies presented ≥90% sensitivity; 6/11 studies presented sensitivities ≥95%) compared with IgA-anti-TG2 (12/15 studies on ELISAs presented ≥90% sensitivity; 9/15 studies ≥95%). In contrast, specificity was more consistently high for EmA (pooled 98.2%; all except 1 study reached ≥95% specificity, whereas 11/15 studies on IgA-anti-TG2 reached ≥95% specificity).
When only IgA-anti-TG2 ELISAs based on human recombinant antigens were analysed, the performance improved particularly regarding sensitivity. All but 1 study used the same commercial test, however, and it is thus unclear whether all human recombinant TG2-based tests would perform similarly well.
IgA-anti-TG2 could work better as an initial CD detection test, whereas the highly specific EmA test could be used as a confirmatory test to identify test positives as true patients with CD.
Comparing Our Results With AHRQ: EmA
AHRQ (6) evaluated 18 studies on EmA tests based on monkey oesophagus substrate and reported a pooled sensitivity of 96.1% (95% CI 94.5%–97.3%); all but 1 study had shown sensitivities ≥90%; 14 studies yielded sensitivities ≥95%. In our 11 study populations, there were 1529 patients tested with EmA, the majority (1508) tested with monkey oesophagus as substrate. We could not pool sensitivities because of statistical heterogeneity, but the proportion of studies reaching either 90% or 95% sensitivity was lower than in the AHRQ compilation.
This result is not because of a less stringent reference standard in our sample. Only 1 study reporting on IgA EmA also had included patients with CD with < Marsh III lesions at biopsy (15), but the sensitivity for EmA was not different from the remaining 4 studies, including patients with at least Marsh III findings. Instead, the lower sensitivities observed in our study compilation may be related to the difference in the study populations. In the last decade, more patients have come to clinical attention with mild clinical symptoms, and the overall proportion of milder histological lesions, even within the Marsh III category, is relatively higher. It is known that patients with Marsh IIIa grade villous lesions are less frequently EmA positive than those with Marsh IIIb or IIIc (31).
Our finding of a pooled specificity of 98.2% for EmA (95% CI 96.7%–99.1%) was similarly high to that in the compilation by AHRQ: 97.4% (95% CI 96.3%–98.2%). The pooled LR+ of 31.8 in our study compilation makes CD likely in a patient with a positive EmA test result.
Comparing Our Results With AHRQ: Anti-TG2
For IgA-anti-TG2 (ELISAs and RBAs), we could not confirm AHRQ's excellent results for anti-TG2 measurements performed with human recombinant antigen. There were only 3 such articles available in 2004, and all of them had shown sensitivities and specificities ≥95% for IgA-anti-TG2 (32–34), whereas in our subset this target was reached only by 7/10 (for sensitivity) and 8/10 (for specificity) study populations. Our total pool presented 18 test results: 11/18 study populations (all ELISAs and RBAs) reached ≥95% sensitivity and 13/18 reached a specificity level ≥95%; however, the accuracy improved when only ELISAs based on human antigen were considered.
Quality criteria for the reference standard were less stringent, because histological Marsh I/II changes were sufficient for CD diagnosis in 3 of our studies on IgA-anti-TG2. The number of patients was larger in our sample (2115 patients tested with ELISAs based on human recombinant substrate), whereas AHRQ reported on 3 studies with a total of 216 patients based on human substrate. This discrepancy is probably related to the fact that at the time of the AHRQ study most anti-TG2 tests were still GP antigen based and human TG2 antigens had just been introduced and reported in typical case sets. Since then, human-TG2 antigens have become universally available and the use of human antigens predominant. Other sample sources such as whole blood, saliva, and stool were explored as well, but yielded inferior results for saliva and stool (10).
IgA-AGA and IgG-AGA
Six studies reported on IgA-AGA and 2 on IgG-AGA tests, all but 1 showing lower diagnostic accuracy than EmA or IgA-anti-TG2, which is in concordance with statements by the North American Society for Pediatric Gastroenterology, Hepatology, and Nutrition (NASPGHAN) and the National Institute for Health and Clinical Excellence (3,4). Lagerqvist et al (21) emphasised the high sensitivity found in Swedish children younger than 18 months, which seems to contradict the recommendation of NASPGHAN (3) to abandon AGA analysis in routine childhood CD testing. Lagerqvist et al concluded that 17% of the children with CD in the youngest age group would have remained undiagnosed if IgA-anti-TG2 had been used alone. Our suggestion is to be cautious with the interpretation of IgA-AGA tests because follow-up with challenge procedures has not been reported in most of these children.
Tests based on IgG-DGP yielded a pooled LR+ of 13.6, equivalent to “conclusive evidence” for the presence of disease as classified by Deeks (28), but ranking lower than tests for EmA and IgA-anti-TG2. Detection of IgA-DGP antibodies was found to have lower diagnostic accuracy, as defined by the proportion of correctly identified individuals. Because detection of DGP antibodies was not available at the time of the AHRQ report, there are no AHRQ data on DGP antibodies with which to compare our findings.
Future studies will have to show whether the relatively high specificity of IgG-DGP antibodies will be useful, especially in detecting CD in IgA-deficient individuals.
There were results from 5 study populations reporting on POC tests in a total of 869 patients and comparing them with serologic laboratory-based tests: IgA-anti-TG2 (in all 5 study populations) and EmA in 3 study populations (20,25).
For IgA-anti-TG2 ELISA (all 4 studies used Celikey, Phadia), the pooled sensitivity was 98.3% (95% CI 96.7%–99.3%) in the 5 studies reporting on POC tests, whereas the sensitivity of POC tests reached 96.4% (95% CI 94.3%–97.9%; Table 2). It was not possible to pool specificity for IgA-anti-TG2 ELISAs because of heterogeneity in the 5 studies (Blesa-Baviera et al reached 90.3%, whereas all of the others achieved at least 98.3%). The pooled specificity for POC tests was 97.7% (95% CI 95.8%–99.0%; Table 2).
For the 424 patients in this subgroup also tested by EmA, the pooled sensitivity of EmA was 99.1% (95% CI 96.7%–99.9%), and the pooled specificity was 100% (95% CI 98.2%–100%).
For POC tests it can thus be concluded that they had high diagnostic accuracy, but the corresponding IgA-EmA and anti-TG2 done on the same individuals showed even better diagnostic parameters.
In spite of apparently excellent diagnostic accuracy, there is no indication that POC tests are superior to IgA-anti-TG2 or EmA tests. Presumably the studies reporting on POC tests had more homogeneous study populations: all of the patients with CD analysed had to have Marsh III on biopsy and 1 of the 2 cohorts in the study by Korponay-Szabó et al (20) comprised patients previously diagnosed as having CD (Fig. 5), so there is probably some preselection of patients with extended signs of CD.
All test results reported in the meta-analysis were based on study populations with a high prevalence of CD: 60.3% of all patients studied in our pool had biopsy-confirmed CD. The positive predictive value in that situation is 98.5% for POC tests, the negative predictive value is 94.7%. Assuming instead a prevalence of CD in 5% of all symptomatic children—more likely to correspond to the situation in a general practitioner's office—the positive predictive value would drop to 68.6% and the negative predictive value would rise to 99.8%.
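The dependence of predictive values on prevalence follows from Bayes' theorem and can be reproduced with a short sketch. The function name is hypothetical, and the inputs are the rounded pooled POC estimates quoted in this report (sensitivity 96.4%, specificity 97.7%). With these rounded inputs the results at 60.3% prevalence agree with the figures above, whereas the positive predictive value at 5% prevalence comes out slightly different from the reported 68.6% (about 68.8%), presumably because the original calculation used unrounded estimates.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Rounded pooled POC estimates from this report.
POC_SENSITIVITY = 0.964
POC_SPECIFICITY = 0.977

# Prevalence in the pooled study populations vs an assumed 5% prevalence.
ppv_pool, npv_pool = predictive_values(POC_SENSITIVITY, POC_SPECIFICITY, 0.603)
ppv_low, npv_low = predictive_values(POC_SENSITIVITY, POC_SPECIFICITY, 0.05)
```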
Although POC tests had been used and interpreted by paediatricians in the 4 reported studies, they may not work well in the hands of laypeople or medical staff with little experience in the interpretation of such tests. The possibility that POC can be performed by laypeople not considering the circumstances at the time of testing such as gluten intake, absence of interfering drugs, or IgA deficiency may further interfere with the reliability of the test results. In addition, POC tests may have the inherent problem that they are currently nonquantitative. Future studies will have to show whether the highly accurate results obtained by specialists will be reproduced in less selected populations and by individuals less experienced in interpreting POC test results.
Higher Cutoffs for Increasing Specificity
The present evaluation also investigated whether specificity could be increased to 100%, safely predicting small intestinal villous atrophy in children by raising the cutoffs from laboratory tests. A study in adults by Hill et al reported that human-TG2 antigen–based serum anti-TG2 ELISA values always were associated with villous atrophy when they exceeded 10 times the upper limit of normal for a test calculating antibody concentration from a standard curve (5). In such cases, small intestinal biopsy would be unnecessary. There is a further study published by Dahlbom et al after the end of our literature search showing that similar serum IgA-anti-TG2 levels and test result calculations as reported by Hill et al (5), combined with high IgG-anti-TG2 values, also could predict villous atrophy in children (35).
In our selection of 16 studies, 3 investigated the performance of serum antibody tests at various cutoff levels (Agardh (12): 1.5 times the upper limit of normal cutoff for 6 tests; Barker et al (13): 5 times; and Poddar et al (23): 2.5 times and 5 times the upper limit of normal cutoff, all for IgA-anti-TG2), but none of these tests were standard curve based and may thus give different results in different runs. Agardh (12) reported no false-positives for IgG-anti-TG2 already at the cutoff suggested by the manufacturer; therefore, raising the cutoff could not improve specificity.
Among the 80 datasets we analysed from the 16 studies, there were 17 constellations with a specificity of 100%: 9 sets from IgA-anti-TG2, 6 sets from EmA, and 2 sets from combined test kits using IgA-anti-TG2 and IgG-DGP. Only 1 test situation had changed to 100% by raising the cutoff from the limit suggested by the manufacturer: Agardh (12) increased the specificity from 96.6% to 100% for IgA-anti-TG2 by raising the cutoff 1.5 times (30 U/mL; 2 more patients without CD became correctly identified as nondiseased).
Barker et al (13) correctly classified 44 of 45 biopsy-negative patients as not having CD by IgA-anti-TG2 at a cutoff 5 times higher than the upper limit of normal value with the same test as Agardh (12), whereas 9 additional patients were seropositive at the normal cutoff but negative on biopsy. Barker et al suggested that patients with an IgA-anti-TG2 > 5 times the upper limit of normal value should be biopsied only if they do not experience symptomatic improvement with a gluten-free diet.
Neither the anti-TG2 kit used by Agardh (12) (human antigen) nor that by Barker et al (13) (GP or human antigen) is standard curve based, so it is unclear how these cutoff proposals correspond to the cutoffs described by Hill et al and Dahlbom et al (5,35), or to values in other commercial tests. Furthermore, unlike other tests, 100 U/mL in the Barker et al study is not an upper limit of the measurement range and may in reality not represent high values (36).
Dahlbom et al correctly identified all 109 children with histologically confirmed CD by Celikey test, equivalent to 100% sensitivity (35). Among the 130 controls (both children and adults) there was 1 person wrongly positive at the cutoff of 3 U/mL with a finding of <10 U/mL. As communicated to us by 1 of the authors (Korponay-Szabó), this person happened to be a child, so child-specific specificity could be calculated as 99.3%. To avoid even this small proportion of false-positive patients, Dahlbom et al suggested raising the limit to 30 U/mL, 10 times the suggested threshold of 3 U/mL, which is considered superior to the manufacturer's suggested cutoff.
Impact of Study Quality and Reference Standards
There was no appreciable difference in the diagnostic yield between studies with a high or low QUADAS score, so the quality of the studies was related neither to sensitivity nor to specificity. We did not find a difference in test outcomes for case-control versus cohort studies; this may be interpreted as an argument against a selection of more clearly symptomatic patients in the case of a case-control study.
It was a surprising finding that raising the level of the reference standard from Marsh I/II to Marsh III improved the sensitivity of index tests, as fewer patients (only the more severely diseased) should then be picked up by the same laboratory test. The 4 studies including individuals without villous atrophy as patients with CD did not report their results stratified by the Marsh classification, so this issue cannot be evaluated further. The QUADAS ranking was on average lower in those 4 studies (score 4.5) than in the entire study pool (score 8.1), so lower overall study quality may be an explanation for this unexpected finding.
Because our search included only studies where >80% of patients with CD showed a Marsh II or higher grade small intestinal lesion, it cannot be excluded that some of the seropositive subjects regarded as false-positives could in fact be truly gluten-sensitive subjects. In addition to the sometimes slowly evolving nature of the small intestinal lesion, it must be noted that histological lesions similar to those seen in CD also may rarely occur in other diseases.
Limitations and Strengths
Our study was designed to verify intestinal lesions of CD. It is therefore obvious that patients with only extraintestinal forms of CD may have shown positive serology but negative intestinal biopsy, thus mimicking a false-positive test result. Such a condition may be, for example, dermatitis herpetiformis, which was not sufficiently represented in the evaluated material.
Some of the case-control studies were not designed to enroll all consecutive patients, but made a selection from a much larger pool. Possibly these patients were neither representative CD cases nor representative controls. Furthermore, some of the patients probably had undergone serological testing before recruitment, leading to an artificially raised prevalence estimate. The pooled prevalence of CD in the cohort study populations was indeed high (56.2%), making such a bias most likely, although such selections were not reported in the 16 studies.
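Why an artificially raised prevalence matters can be shown with the standard Bayes relation between prevalence and positive predictive value. The sketch below is illustrative only: the sensitivity and specificity of 0.95 and the comparison prevalence of 1% are hypothetical values chosen for the example, not results of this review; only the pooled prevalence of 56.2% is taken from the text.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability of disease given a positive test (Bayes' theorem)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# HYPOTHETICAL test characteristics, for illustration only.
SENSITIVITY = 0.95
SPECIFICITY = 0.95

# Pooled prevalence in the cohort study populations (from the text).
ppv_pooled = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.562)

# HYPOTHETICAL low-prevalence setting, for comparison only.
ppv_low = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.01)

print(f"PPV at 56.2% prevalence: {ppv_pooled:.1%}")
print(f"PPV at 1% prevalence:    {ppv_low:.1%}")
```

With identical test characteristics, the predictive value of a positive result falls markedly as prevalence decreases, which is why accuracy figures from preselected populations may not carry over to less selected ones.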
Histology has been questioned as the predominant diagnostic tool in CD (37–40), because it can be misleading by producing mainly false-negative results. The lesions may be patchy and thus missed if only 1 biopsy is taken from the descending part of the duodenum or by capsule biopsy. The ongoing debate on the validity of histology for CD verification, its exclusive role as reference standard, and the specificity of villous atrophy including intraepithelial lymphocytosis for CD is acknowledged. In spite of the doubts expressed in those studies and potential technical pitfalls in the pathology interpretation, it was decided to retain biopsy histology as the single reference standard, because the inclusion of information from serologic tests as part of the reference standard definition would have led to circular reasoning making validation impossible.
Both EmA and TG2-based ELISA tests in principle detect the same antibodies (40), so variability across different studies may depend more on differences in the exposure of the TG2 antigenic sites in a particular test than on patient characteristics.
CONCLUSIONS AND IMPLICATIONS FOR PRACTICE AND RESEARCH
The AHRQ report from 2004 did not make an explicit statement on the necessity of diagnostic biopsy, but the North American guideline concluded, “It is recommended that confirmation of the diagnosis of CD require an intestinal biopsy in all cases” (3), as did other currently valid guidelines for children (2), and for children and adults (4). Our findings are the basis for the new ESPGHAN guideline published in the January 2011 issue.
The test results for EmA are comparable to the situation as described by AHRQ despite a more complex and more difficult clinical situation than in 2004. For human recombinant IgA-anti-TG2, the new data show variable results and raise some doubt about the excellent diagnostic accuracy presented for these tests in the AHRQ report.
The new POC tests and tests based on DGP antibodies should be followed up in future studies. Although POC tests are performed with high accuracy in expert hands, it is too early to classify them as substitutes for laboratory tests. Tests for IgG-DGP may play a role in the diagnosis of CD in IgA-deficient children in addition to the already established value of IgG-anti-TG2 (41,42).
Neither NASPGHAN (3) nor the National Institute for Clinical Excellence (4) had proposed situations in which biopsy can be omitted in the diagnosis of CD. We found only 1 small study in children approaching this question by introducing variable cutoffs for laboratory tests, and it ended with inconclusive results (13).
The broader evidence background for the present literature analysis did not disclose higher accuracies in the reported studies despite improved technical approaches. This lack of improvement probably reflects more unselected patient populations. It may be argued that the tests have stood the test of time and the consequence of such high accuracies could be the omission of histological analysis in selected cases.
Future studies reporting on new diagnostic tests for CD should be designed as cohort studies enrolling consecutive patients, including IgA, EmA, and probably also human leucocyte antigen testing in the reference standard besides biopsy. These studies should include patient follow-up for the detection of potential misclassification.
Conflict of Interest Statements: The following authors declared potential conflicts of interest: S. Koletzko (research support from Euroimmun, Phadia, instrumentation laboratory for 1 research project), I. Korponay-Szabó (patent application on point of care coeliac test licenced by the University of Tampere, Finland, to AniBiotech, Vantaa, Finland), Carlo Catassi (consultant for Menarini Diagnostics s.r.l., Florence, Italy), C. Ribes-Koninckx (research support from Phadia), and M. Maki (consultancies for Finnish Food Safety Authority Evira, The Finnish Funding Agency for Technology and Innovation Tekes; The Finnish Innovation Funds Sitra; International Life Science Institute; Coeliac Research Fund, Australia; Domm International; Finn Medi; Sinebrychoff; Moilas; Raisio; Phadia; Anibiotech; Kustannus Duodecim, Finland; Vactech; Eurospital; Inova; Nexpep; Alvine Pharmaceuticals; Shire; GlaxoSmithKline; Alba Therapeutics; ChemoCentryx; Zedira).
Appendix A Study selection: inclusion/exclusion criteria
Appendix B CALCULATION OF DIAGNOSTIC PARAMETERS