OBJECTIVE: To identify candidate genes and genetic variants for preeclampsia using a bioinformatic approach to extract and organize genes and variants from the published literature.
METHODS: Semantic data-mining and natural language processing were used to identify articles from the published literature meeting criteria for potential association with preeclampsia. Articles were manually reviewed by trained curators. Cluster analysis was used to aggregate the extracted genes into gene sets associated with preeclampsia or severe preeclampsia, early or late preeclampsia, maternal or fetal tissue sources, and concurrent conditions (ie, fetal growth restriction, gestational hypertension, or hemolysis, elevated liver enzymes, and low platelet count [HELLP]). Gene ontology was used to organize this large group of genes into ontology groups.
RESULTS: From more than 22 million records in PubMed, with 28,000 articles on preeclampsia, our data-mining tool identified 2,300 articles with potential genetic associations with preeclampsia-related phenotypes. After curation, 729 articles were “accepted” that contained “statistically significant” associations with 535 genes. We saw distinct segregation of these genes by severity and timing of preeclampsia, by maternal or fetal source, and with associated conditions (eg, gestational hypertension, fetal growth restriction, or HELLP syndrome).
CONCLUSION: The gene sets and ontology groups identified through our systematic literature curation indicate that preeclampsia represents several distinct phenotypes with distinct and overlapping maternal and fetal genetic contributions.
LEVEL OF EVIDENCE: III
Preeclampsia appears to represent several distinct phenotypes with distinct and overlapping maternal and fetal genetic contributions.
Department of Epidemiology, Brown University School of Public Health, Women and Infants Hospital of Rhode Island, Department of Pediatrics, Brown Alpert Medical School, and the Center for Computational Molecular Biology, Providence, Rhode Island; and the Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven, Connecticut.
Corresponding author: Elizabeth W. Triche, PhD, Assistant Professor of Epidemiology, Brown University School of Public Health, 121 S Main Street, 2nd Floor, Box G-S121-2, Providence, RI 02912; e-mail: Elizabeth_Triche@brown.edu.
Supported by grants from the National Institutes of Health: 1R21HD070177, 5T35HL094308, P20 RR018728, and P20GM103537.
Presented in part at the 2012 Annual Meeting of the American Society of Human Genetics, November 6–10, 2012, San Francisco, California, and the 2013 Annual Meeting of the North American Society of Obstetric Medicine, September 20–21, 2013, Providence, Rhode Island.
Financial Disclosure The authors did not report any potential conflicts of interest.
Preeclampsia is a life-threatening, multisystem hypertensive disorder of pregnancy, which complicates 2–8% of U.S. deliveries.1,2 Preeclampsia is a leading cause of maternal and fetal morbidity and mortality worldwide. The most effective treatment is delivery of the placenta.2 Although considered a perinatal disorder, preeclampsia is associated with long-term outcomes for both mother and neonate. Preeclampsia has been linked to stroke, cardiovascular disease, diabetes, and premature mortality among the affected mothers in later life.3,4 The offspring of preeclamptic pregnancies have higher blood pressure in childhood5 and are at increased risk of stroke in adulthood6 and a variety of diseases.7
Although the familial nature of preeclampsia has been well documented,8 the precise genetic architecture has not been identified. The promises of the genome era have been met with both enthusiasm and skepticism.9–11 The genome-wide association study approach interrogates huge numbers of anonymous single nucleotide polymorphisms or copy number variations in an unbiased, hypothesis-free approach. Unfortunately, this severely limits power and makes it computationally nearly impossible to examine combinatorial gene–gene interactions. New approaches to the genetics of complex diseases might be helpful. The literature on the genetics of preeclampsia is substantial and reflects varied methodologic approaches.12–18
Semantic data-mining and natural language processing are part of a new form of information science that uses computational approaches to extract textual information.19 These methods can be used to efficiently retrieve information based on user-defined queries. We used these tools to systematically search published literature to identify relevant genetic variants associated with preeclampsia. We further segregated the published genetic data by source (maternal, fetal, or both), timing (early or late), severity, and associated conditions (eg, fetal growth restriction) whenever available.
MATERIALS AND METHODS
We systematically retrieved the published literature derived from multiple approaches and assembled the results for use in future genetic investigations. We built a relational database for preeclampsia using bioinformatics tools and conducted manual review of selected articles by trained curators. More details about study methods are included in the methods supplement (see Appendix 1, available online at http://links.lww.com/AOG/A501). We used SciMiner, a semantic text-mining and natural language processing program for biomedical literature, to extract relevant published articles from all years of PubMed. We created a broad set of queries focused on preeclampsia-associated genes and genetic information (see Supplemental Table 1 in Appendix 2, available online at http://links.lww.com/AOG/A502).
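As an illustration only, a relational layout for curated gene associations could be sketched in Python with the standard sqlite3 module. The table names, column names, and helper functions below are hypothetical and are not taken from the study database; they simply show how articles, genes, and phenotype-specific associations might be linked.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative
# assumptions, not the layout of the study database.
SCHEMA = """
CREATE TABLE article (
    pmid     INTEGER PRIMARY KEY,   -- unique PubMed identifier
    accepted INTEGER NOT NULL       -- 1 = accepted by curator, 0 = rejected
);
CREATE TABLE gene (
    hgnc_id TEXT PRIMARY KEY,       -- HUGO Gene Nomenclature Committee ID
    symbol  TEXT NOT NULL
);
CREATE TABLE association (
    pmid      INTEGER NOT NULL REFERENCES article(pmid),
    hgnc_id   TEXT NOT NULL REFERENCES gene(hgnc_id),
    phenotype TEXT NOT NULL,        -- e.g. 'preeclampsia'
    source    TEXT NOT NULL,        -- 'maternal', 'fetal', or 'both'
    PRIMARY KEY (pmid, hgnc_id, phenotype, source)
);
"""


def build_database(path=":memory:"):
    """Create the (hypothetical) tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def genes_for(conn, phenotype, source):
    """Return sorted gene symbols recorded for one phenotype and source."""
    rows = conn.execute(
        "SELECT DISTINCT g.symbol FROM association a "
        "JOIN gene g ON g.hgnc_id = a.hgnc_id "
        "WHERE a.phenotype = ? AND a.source = ? ORDER BY g.symbol",
        (phenotype, source),
    )
    return [row[0] for row in rows]
```

Keying genes on a single nomenclature identifier, as in this sketch, is one way to keep the same gene from being stored twice under different aliases.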
Once potentially relevant articles were identified by SciMiner, a curation team consisting of six medical students formally trained in molecular biology, cell biology, and genetics read and evaluated each article. Study investigators (E.W.T., A.U., A.T.D., J.F.P.) met weekly with the curation team to discuss any articles with unclear findings and share particularly interesting articles.
According to well-defined protocols and documentation (manual available on request), the curators “accepted” or “rejected” each publication and, from each accepted publication, each gene or genetic variant. An article was rejected if no significant associations (P<.05) were found between genes or genetic variants and preeclampsia-related phenotypes or if the article was not an original scientific article (ie, a review paper). “Accepted” articles presenting original scientific research with statistically significant associations between genes or gene variants and preeclampsia or related phenotypes were deposited into the database along with their unique PubMed identifiers. Combinations of phenotypes (eg, mild and severe) were only chosen when distinct analyses showed significant associations for each phenotype separately. We recorded source (maternal, fetal, or both); significance with co-occurring conditions including fetal growth restriction, hemolysis, elevated liver enzymes, and low platelet count (HELLP) syndrome, and gestational hypertension; and early onset (less than 34 weeks of gestation) compared with late onset (36 or more weeks of gestation) of preeclampsia when specifically (separately) examined in the study. To avoid duplication, all genes were listed by their HUGO Gene Nomenclature Committee identification number. We also recorded the species for which the associations were significant (Homo sapiens, Mus musculus, Rattus norvegicus, other); only two papers were accepted for the rodent M musculus; none were accepted for R norvegicus.
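The acceptance rule described above can be summarized in a short Python sketch. This is a simplified illustration, not the curators' protocol: the Finding class and the accepted_findings function are hypothetical names, and the only criteria encoded are the two stated in the text (original research, and P<.05 for the gene or variant).

```python
from dataclasses import dataclass

ALPHA = 0.05  # significance threshold stated in the text (P<.05)


@dataclass
class Finding:
    """One reported gene-phenotype association (hypothetical structure)."""
    gene: str
    phenotype: str
    p_value: float


def accepted_findings(findings, is_original_research):
    """Return the findings that pass the acceptance rule.

    An empty list means the article as a whole would be rejected.
    """
    if not is_original_research:  # e.g. review papers are rejected
        return []
    # Keep only associations reported as statistically significant.
    return [f for f in findings if f.p_value < ALPHA]
```

In this sketch an article is rejected when the returned list is empty, and each surviving finding would be recorded with the article's PubMed identifier.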
We accepted the authors' choice of statistical tests. Because primary data were not available, we did not recalculate statistical associations. Each article's reference list was further reviewed by the curator to identify additional potentially relevant articles that may not have been retrieved by SciMiner. We assessed interrater reliability using κ scores among the six trained curators.20,21 Our κ score of 0.8 represents substantial agreement.
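For readers unfamiliar with agreement statistics for more than two raters, the following Python sketch computes Fleiss' κ, one commonly used multi-rater statistic. It is shown as an assumption for illustration; the text does not state which κ variant was calculated, and the function name is hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for agreement among a fixed number of raters.

    counts[i][j] is the number of raters who assigned item i to
    category j (for example, j = 0 for "accept" and j = 1 for "reject").
    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)
```

With six raters and two categories, each row of counts would sum to six, for example [6, 0] when all curators accept an article and [3, 3] when they split evenly.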
Once all the genes and articles were collected, we “clustered” the genes associated with each phenotype of preeclampsia. We also created clusters of genes associated with the concurrent conditions (eg, fetal growth restriction, HELLP) and the timing of onset of preeclampsia. In addition to the visual cluster analyses, a full listing of genes associated with each preeclampsia phenotype is provided in supplemental tables in Appendix 2 (http://links.lww.com/AOG/A502).
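The grouping of genes into distinct and shared sets can be sketched in Python as follows. This is a minimal illustration under the assumption that each curated record reduces to a (gene, phenotype) pair; it is not the procedure used to draw the cluster figures, and the function name is hypothetical.

```python
from collections import defaultdict


def cluster_by_phenotype(associations):
    """Group genes by the exact combination of phenotypes they map to.

    associations is an iterable of (gene, phenotype) pairs. The result
    maps each frozenset of phenotypes to the set of genes associated
    with exactly that combination, so single-phenotype keys hold the
    distinct genes and multi-phenotype keys hold the shared genes.
    """
    phenotypes_of = defaultdict(set)
    for gene, phenotype in associations:
        phenotypes_of[gene].add(phenotype)

    clusters = defaultdict(set)
    for gene, phenotypes in phenotypes_of.items():
        clusters[frozenset(phenotypes)].add(gene)
    return dict(clusters)
```

The same function could be applied to (gene, source) or (gene, concurrent condition) pairs to obtain the maternal, fetal, and shared groupings.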
To provide a representation of the biological processes encompassed by these gene sets, we used the Gene Ontology Database. The Gene Ontology Database describes genes in terms of their associated biological processes, molecular functions, and cellular components. The genes shown in the clusters were tested for their statistical association with Gene Ontology terms using GOstat.22 This program identifies Gene Ontology terms for which genes in the curated list are overrepresented. For each Gene Ontology term, a P value is calculated indicating the probability that the observed counts of genes could have resulted from randomly distributing the associated Gene Ontology terms between our phenotype-based cluster and the reference group. GOstat corrects for multiple comparisons by using a false discovery rate, P<.05.
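The overrepresentation test and false discovery rate correction described above can be illustrated with a short Python sketch. It makes two assumptions: a one-sided hypergeometric test stands in for the per-term test, and the Benjamini-Hochberg procedure stands in for the false discovery rate correction. GOstat's own calculations may differ, and both function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the reference group
    K: reference genes annotated with the Gene Ontology term
    n: genes in the curated list
    k: curated genes annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favorable / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that adjusted
    # values never increase as the raw P value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under these assumptions, a term would be reported as overrepresented when its adjusted P value falls below .05.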
RESULTS
From more than 22 million records in PubMed, more than 28,000 articles were identified with relevant information to preeclampsia. SciMiner identified 2,300 articles meeting our query criteria for potential genetic associations with preeclampsia-related phenotypes. After curation, 729 articles were “accepted” that contained “statistically significant” associations with 535 genes. The top 15 journals from which we accepted articles during curation are shown in Supplemental Table 2 in Appendix 2 (http://links.lww.com/AOG/A502).
The accepted articles used 11 different definitions of preeclampsia (all definitions in Supplemental Table 2 in Appendix 2, http://links.lww.com/AOG/A502). Some articles also included separate results on preeclampsia-related conditions (eg, HELLP, gestational hypertension). For the purposes of clear visualization and for future interrogations, we aggregated the results by phenotype, source, and conditions for presentation in figures as cluster diagrams. Genes associated with preeclampsia and severe preeclampsia from maternal source alone, fetal source alone, and both are illustrated in Figure 1A–B and also detailed in Supplemental Table 3 in Appendix 2 (http://links.lww.com/AOG/A502). We identified 194 distinct maternal genes associated with preeclampsia and 39 distinct genes associated with severe preeclampsia, one of which is also associated with eclampsia. In addition, there are 94 maternal genes shared with both preeclampsia and severe preeclampsia; 10 of the shared genes are also associated with eclampsia. As seen in Figure 1B, there are 204 distinct fetal genes associated with preeclampsia, 25 associated with severe preeclampsia, and 52 shared genes associated with both preeclampsia and severe preeclampsia; six of the shared genes were also associated with eclampsia. The gray dots in each figure indicate unique maternal or fetal genes associated with each phenotype. We identified 162 genes associated with only maternal source, 130 genes associated with only fetal source, and 144 genes associated with both sources.
A cluster analysis for associated conditions including HELLP, fetal growth restriction, and gestational hypertension is shown in Figure 2. Again, data are shown for maternal and fetal associations in Figure 2A and Figure 2B, respectively. The genes identified in the cluster figures are shown in Supplemental Table 4 in Appendix 2 (http://links.lww.com/AOG/A502). There are a number of shared maternal and fetal genes associated with fetal growth restriction, including FLT1, TNF, IL4, IL6, IL8, SERPINB2, VEGFA, and PGF. However, there are also a number of distinct maternal genes (eg, CRH, LGALS13, and MTHFR) and fetal genes (eg, NOS2A, NOS3, PPARA, PPARG, and HMOX2) associated with fetal growth restriction. More distinct maternal genes than fetal genes were identified for gestational hypertension and HELLP.
Clusters of genes associated with early and late onset of preeclampsia were also identified and are presented in Figure 3 as a network. Supplemental Table 5 in Appendix 2 (http://links.lww.com/AOG/A502) shows the specific genes that are illustrated in the cluster figures and indicates that a number of genes segregate by timing of preeclampsia.
Figure 4 shows the Gene Ontology biological processes overrepresented by fetal genes associated with preeclampsia and severe preeclampsia and for maternal genes associated with preeclampsia and severe preeclampsia. The overlapping domains are also shown. As can be seen, there were eight distinct groups identified with some shared Gene Ontology processes. The Gene Ontology terms for biological processes shown in Figure 4 describing the genes associated with preeclampsia, severe preeclampsia, and both for maternal genes and fetal genes are detailed in Supplemental Table 6 in Appendix 2 (http://links.lww.com/AOG/A502). The shared biological processes shown in Supplemental Table 6 in Appendix 2 (http://links.lww.com/AOG/A502) are shaded. We found it notable that there were no shared Gene Ontology terms for maternal and fetal source for severe preeclampsia. For maternal genes associated with severe preeclampsia, processes related to immune function or immune dysregulation including cell killing, immune effectors, acute inflammatory responses, and regulation of immune response were overrepresented. By contrast, the Gene Ontology terms for the genes associated with severe preeclampsia from fetal sources had much more to do with cellular function and cell–cell interactions. Gene Ontology Database terms overrepresented by the set of fetal genes associated with severe preeclampsia include processes like regulation of cell growth, signal transduction, cell–cell interaction, morphogenesis, and regulation of growth. Inspection of the results shows that the transmembrane receptor protein serine and threonine kinase transmembrane signaling process was the most overrepresented Gene Ontology term for fetal genes for severe preeclampsia. The fetal genes that were represented within this biological process included activin receptor 2B, a zinc finger transcription factor known as ZFYVE9, SMAD7, SMAD2, and the activin receptor 1B. 
These transmembrane signaling elements were also prominently represented in the next five Gene Ontology terms that encompass the processes of enzyme-linked receptor signaling, the activin receptor signaling pathway, and the transforming growth factor-β receptor signaling pathway. These results underscore the importance of cellular signaling from the fetal side in the pathogenesis of severe preeclampsia. For clarity, Figure 4 and Supplemental Table 6 only present the significant gene ontology groups associated with biological processes. A file containing the ontology groups associated with cellular components and molecular functions as well as biological processes is available on request from the authors.
DISCUSSION
This report presents results summarizing an analysis of the published literature on the genetics of preeclampsia. We used semantic text-mining to identify published articles containing putative information on genes and genetic variants associated with preeclampsia. We extracted the genes and gene variants demonstrating statistically significant associations with preeclampsia. We believe this research makes important contributions to the literature. First, it describes the use of bioinformatics tools to make systematic review of large published literature more efficient. Second, it provides clinical researchers with information on the genetics of preeclampsia that can be used in their future investigations of this complex condition. Gene sets segregated by preeclampsia phenotype (preeclampsia or severe preeclampsia), source (maternal or fetal), concurrent conditions (eg, fetal growth restriction), and timing of onset (early or late) suggest unique and shared contributions from maternal and fetal genes for different phenotypes. Gene Ontology Database groups identify potentially important pathways to consider as we implement targeted sequencing and other newly developed methods for genetic studies. Finally, this work forms the basis for the development of a publicly available, searchable database for researchers.
It is known that placental mechanisms play a central role in the pathophysiology of preeclampsia.23 Pregnancy can occur without a fetus but may still result in severe preeclampsia. In addition, delivery of the placenta is curative. Of note, the ontology groups from fetal sources associated with severe preeclampsia include processes like regulation of cell growth, signal transduction, cell–cell interaction, morphogenesis, and regulation of growth. Although ontology groups do not include organ-specific processes, they involve genes that have been shown to have important functional roles in the placenta, eg, TGFβ, IGF, VEGF, FGF, and transporters. By comparison, the biological processes covered by the gene ontology groups based on maternal genes that were associated with severe disease predominantly described processes related to immune function or immune dysregulation including cell killing, immune effectors, acute inflammatory responses, and regulation of immune response.
It is widely accepted that preeclampsia is a complex, multifactorial disease of pregnancy. A variety of mechanisms likely contribute to the pathogenesis of preeclampsia (eg, hypoxia, genetics), and the contribution of genetics to the pathogenesis of preeclampsia is complex. Our findings that unique genes are associated with either preeclampsia or severe preeclampsia and that unique maternal and fetal genetic contributions can also be identified are consistent with that concept. Although a significant role for genetics in the development of preeclampsia is supported by family-based and twin studies, there are many challenges in identifying specific genetic variants associated with the condition. Candidate gene studies have more power to detect associations with targeted variants, but they are limited to known pathophysiologic mechanisms and are rarely replicated in follow-up studies. Although genome-wide association studies have the potential to identify novel variants not previously known to be associated with preeclampsia, their statistical power is limited because of the large numbers of variants tested. The gene sets we present here are useful for analyzing available data on preeclampsia. We are using these gene sets to make comparisons between cases and controls from large data sets like expression arrays or genome-wide association studies. The smaller, more manageable set of genes allows greater statistical power for these comparisons.
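The effect of a smaller gene set on the multiple-testing burden can be made concrete with a brief Python sketch. It assumes a Bonferroni correction purely for arithmetic illustration (the text does not specify a correction method), and the function names are hypothetical.

```python
from math import comb


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def pairwise_tests(n_genes):
    """Number of distinct gene-gene pairs among n_genes."""
    return comb(n_genes, 2)
```

For the 535 curated genes, pairwise_tests(535) is 142,845 gene-gene pairs. By comparison, a hypothetical genome-wide panel of 1,000,000 variants (a number assumed here only for contrast) would require a per-test threshold of 0.05/1,000,000 for single-variant tests alone, which is far stricter than 0.05/535.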
Many clinicians believe there are differences between mild and severe or between early and late preeclampsia.24 Our results showed specific gene sets that segregated with these phenotypic characteristics and their biological source (maternal, fetal, or both). We also observed segregation of the genetic associations with conditions like intrauterine growth restriction or HELLP syndrome.
Published literature using candidate gene approaches relies on genes selected by investigators. There is a possibility of publication bias, which we were unable to quantify. It is also well known that popular genes and popular concepts are often tested by multiple groups of investigators. We saw this for several genes that are known to be important in preeclampsia like VEGF and the sFlt1 gene. Conversely, there may be other genes not yet tested that could be missing from this database because they have not yet been explored. Despite important strengths, there are other potential limitations to our approach. Because primary data were not available, we relied on the definitions as well as statistical methods chosen by the authors. We found a notable lack of consistency in the definition of preeclampsia in the literature. Specific, well-defined phenotypes may be critical to understanding the genetic architecture of preeclampsia. We dealt with this limitation by carefully annotating each gene with the specific definition used. This distinction will be helpful in the use of gene sets for increasing statistical power when looking at other published data sets.
There are alternative strategies and data-mining tools that could have been used besides SciMiner (eg, ALI BABA and PolySearch).25 However, we found these alternatives to have important limitations. For example, some use only abstracts as their data source or they do not allow users to edit the data-mining results. SciMiner provided a comprehensive and, we believe, thorough extraction of relevant literature based on full-text searches. Like PubMed, SciMiner extracts MEDLINE abstracts, keywords, and medical subject heading terms. However, SciMiner also has intelligent algorithms to resolve ambiguous acronyms commonly seen in the literature. In addition, SciMiner systematically extracts information from the full text of the articles, including genes, gene names, and gene symbols for later evaluation. In these ways, the retrievals are more robust than manual PubMed searches.
The major aim of this research was to create a valuable resource rather than a complete analysis of the genetic architecture of preeclampsia. This resource could and should be used to direct future studies. A publicly available, searchable form of the relational database for preeclampsia, similar to that developed for preterm birth,26 is under development. Although it is beyond the scope of the present analysis to completely dissect the clusters and ontology groups and the individual and shared genes among each, these ontologies and gene sets will provide valuable information to researchers in the field. We believe in particular that the individual gene sets provide useful filters for examining larger genetic databases. These more parsimonious sets of genes will allow comparison of gene–gene interactions or higher combinatorial effects with a greater degree of statistical power. Another valuable contribution of these results is the clustering of genes associated with preeclampsia by phenotype and by source. Our findings are consistent with preeclampsia representing several distinct phenotypes with distinct and overlapping maternal and fetal genetic contributions. We believe these data will be of major value to our own and others' investigations into the etiology of this important and complex phenotype.
A detailed documentation manual outlining the definitions of preeclampsia and the curators' processes is available on request. The PERL, PHP, and HTML scripts are also available on request.
1. Ananth CV, Keyes KM, Wapner RJ. Pre-eclampsia rates in the United States, 1980-2010: age-period-cohort analysis. BMJ 2013;347:f6564.
2. Ilekis JV, Reddy UM, Roberts JM. Preeclampsia—a pressing problem: an executive summary of a National Institute of Child Health and Human Development workshop. Reprod Sci 2007;14:508–23.
3. Bellamy L, Casas JP, Hingorani AD, Williams DJ. Pre-eclampsia and risk of cardiovascular disease and cancer in later life: systematic review and meta-analysis. BMJ 2007;335:974.
4. McDonald SD, Malinowski A, Zhou Q, Yusuf S, Devereaux PJ. Cardiovascular sequelae of preeclampsia/eclampsia: a systematic review and meta-analyses. Am Heart J 2008;156:918–30.
5. Tenhola S, Rahiala E, Halonen P, Vanninen E, Voutilainen R. Maternal preeclampsia predicts elevated blood pressure in 12-year-old children: evaluation by ambulatory blood pressure monitoring. Pediatr Res 2006;59:320–4.
6. Kajantie E, Eriksson JG, Osmond C, Thornburg K, Barker DJ. Pre-eclampsia is associated with increased risk of stroke in the adult offspring: the Helsinki birth cohort study. Stroke 2009;40:1176–80.
7. Wu CS, Nohr EA, Bech BH, Vestergaard M, Catov JM, Olsen J. Health of children born to mothers who had preeclampsia: a population-based cohort study. Am J Obstet Gynecol 2009;201:269.e1–e10.
8. Chesley LC, Annitto JE, Cosgrove RA. The familial factor in toxemia of pregnancy. Obstet Gynecol 1968;32:303–11.
9. Goldstein DB. Common genetic variation and human traits. N Engl J Med 2009;360:1696–8.
10. Hirschhorn JN. Genomewide association studies—illuminating biologic pathways. N Engl J Med 2009;360:1699–701.
11. Kraft P, Hunter DJ. Genetic risk prediction—are we there yet? N Engl J Med 2009;360:1701–3.
12. Cnattingius S, Reilly M, Pawitan Y, Lichtenstein P. Maternal and fetal genetic factors account for most of familial aggregation of preeclampsia: a population-based Swedish cohort study. Am J Med Genet A 2004;130A:365–71.
13. Esplin MS, Fausett MB, Fraser A, Kerber R, Mineau G, Carrillo J, et al. Paternal and maternal components of the predisposition to preeclampsia. N Engl J Med 2001;344:867–72.
14. Hiby SE, Walker JJ, O'Shaughnessy KM, Redman CW, Carrington M, Trowsdale J, et al. Combinations of maternal KIR and fetal HLA-C genes influence the risk of preeclampsia and reproductive success. J Exp Med 2004;200:957–65.
15. Lie RT, Rasmussen S, Brunborg H, Gjessing HK, Lie-Nielsen E, Irgens LM. Fetal and maternal contributions to risk of pre-eclampsia: population based study. BMJ 1998;316:1343–7.
16. Morgan L, Crawshaw S, Baker PN, Brookfield JF, Broughton PF, Kalsheker N. Distortion of maternal-fetal angiotensin II type 1 receptor allele transmission in preeclampsia. J Med Genet 1998;35:632–6.
17. Mütze S, Rudnik-Schöneborn S, Zerres K, Rath W. Genes and the preeclampsia syndrome. J Perinat Med 2008;36:38–58.
18. Skjaerven R, Vatten LJ, Wilcox AJ, Rønning T, Irgens LM, Lie RT. Recurrence of pre-eclampsia across generations: exploring fetal and maternal genetic components in a population based cohort. BMJ 2005;331:877.
19. Shatkay H, Feldman R. Mining the biomedical literature in the genomic era: an overview. J Comput Biol 2003;10:821–55.
20. Donner A, Klar N. The statistical analysis of kappa statistics in multiple samples. J Clin Epidemiol 1996;49:1053–8.
21. Reed JF III. Homogeneity of kappa statistics in multiple samples. Comput Methods Programs Biomed 2000;63:43–6.
22. Beissbarth T, Speed TP. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004;20:1464–5.
23. Redman CW, Sargent IL. Pre-eclampsia, the placenta and the maternal systemic inflammatory response—a review. Placenta 2003;24(suppl A):S21–7.
24. Raymond D, Peterson E. A critical review of early-onset and late-onset preeclampsia. Obstet Gynecol Surv 2011;66:497–506.
25. Hur J, Schuyler AD, States DJ, Feldman EL. SciMiner: web-based literature mining tool for target identification and functional enrichment analysis. Bioinformatics 2009;25:838–40.
26. Uzun A, Laliberte A, Parker J, Andrew C, Winterrowd E, Sharma S, et al. dbPTB: a database for preterm birth. Database (Oxford) 2012;2012:bar069.