CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes

Katerina Nastou*, Mikaela Koutrouli, Sampo Pyysalo, Lars Juhl Jensen

*Corresponding author for this work

Research output: Contribution to journal, Journal article, Research, peer-review

Abstract

Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.

Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and a dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% for the transformer-based tagger and 61.2% for the dictionary-based tagger. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature.

Original language: English
Article number: vbae116
Journal: Bioinformatics Advances
Volume: 4
Issue number: 1
Number of pages: 7
Publication status: Published - 2024

Bibliographical note

Publisher Copyright:
© 2024 The Author(s).
