TY - JOUR
T1 - Improving dictionary-based named entity recognition with deep learning
AU - Nastou, Katerina
AU - Koutrouli, Mikaela
AU - Pyysalo, Sampo
AU - Jensen, Lars Juhl
N1 - Funding Information:
This work was supported by Novo Nordisk Foundation [grant number NNF14CC0001] and from the Academy of Finland [grant number 332844]. K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie [grant number 101023676]. M.K. has received funding from Novo Nordisk Foundation [grant number NNF20SA0035590]. This paper was published as part of a supplement financially supported by ECCB2024.
Publisher Copyright:
© 2024 The Author(s). Published by Oxford University Press.
PY - 2024
Y1 - 2024
N2 - Motivation: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. Results: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%).
AB - Motivation: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. Results: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%).
U2 - 10.1093/bioinformatics/btae402
DO - 10.1093/bioinformatics/btae402
M3 - Journal article
C2 - 39230709
AN - SCOPUS:85203200703
VL - 40
SP - ii45-ii52
JO - Bioinformatics (Online)
JF - Bioinformatics (Online)
SN - 1367-4811
ER -