OBJECTIVE: To identify candidate genes and genetic variants for preeclampsia using a bioinformatic approach to extract and organize genes and variants from the published literature.
METHODS: Semantic data mining and natural language processing were used to identify articles from the published literature meeting criteria for potential association with preeclampsia. Articles were manually reviewed by trained curators. Cluster analysis was used to aggregate the extracted genes into gene sets associated with preeclampsia or severe preeclampsia, early or late preeclampsia, maternal or fetal tissue sources, and concurrent conditions (ie, fetal growth restriction, gestational hypertension, or hemolysis, elevated liver enzymes, and low platelet count [HELLP]). Gene ontology was used to organize this large group of genes into ontology groups.
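As an illustration only, the aggregation of curated genes into phenotype-specific gene sets might be sketched as follows. This is not the authors' pipeline: the record layout, the function names, and the use of Jaccard overlap as a similarity measure are all assumptions, and the gene identifiers are placeholders rather than study data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical curated records: (gene, phenotype label).
# Placeholder identifiers only; these are NOT results from the study.
RECORDS = [
    ("GENE_A", "severe"),
    ("GENE_A", "early"),
    ("GENE_B", "severe"),
    ("GENE_B", "HELLP"),
    ("GENE_C", "early"),
    ("GENE_C", "severe"),
    ("GENE_D", "late"),
]


def build_gene_sets(records):
    """Group genes into one set per phenotype label."""
    gene_sets = defaultdict(set)
    for gene, phenotype in records:
        gene_sets[phenotype].add(gene)
    return dict(gene_sets)


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(gene_sets):
    """Jaccard overlap for every pair of phenotype gene sets."""
    return {
        (p, q): jaccard(gene_sets[p], gene_sets[q])
        for p, q in combinations(sorted(gene_sets), 2)
    }


GENE_SETS = build_gene_sets(RECORDS)
OVERLAP = pairwise_overlap(GENE_SETS)

if __name__ == "__main__":
    for pair, score in OVERLAP.items():
        print(pair, round(score, 2))
```

A pairwise overlap table of this kind could serve as input to a clustering step. Sets with low overlap would correspond to distinct genetic contributions, and sets with high overlap to shared ones.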
RESULTS: From more than 22 million records in PubMed, including 28,000 articles on preeclampsia, our data-mining tool identified 2,300 articles with potential genetic associations with preeclampsia-related phenotypes. After curation, 729 articles containing “statistically significant” associations with 535 genes were “accepted.” We observed distinct segregation of these genes by severity and timing of preeclampsia, by maternal or fetal source, and by associated conditions (eg, gestational hypertension, fetal growth restriction, or HELLP syndrome).
CONCLUSION: The gene sets and ontology groups identified through our systematic literature curation indicate that preeclampsia represents several distinct phenotypes with distinct and overlapping maternal and fetal genetic contributions.
LEVEL OF EVIDENCE: III