TY - JOUR
T1 - GraphPart: homology partitioning for biological sequence analysis
AU - Teufel, Felix
AU - Gíslason, Magnús Halldór
AU - Almagro Armenteros, José Juan
AU - Johansen, Alexander Rosenberg
AU - Winther, Ole
AU - Nielsen, Henrik
N1 - Publisher Copyright: © 2023 The Author(s).
PY - 2023
Y1 - 2023
N2 - When splitting biological sequence data for the development and testing of predictive models, it is necessary to avoid too-closely related pairs of sequences ending up in different partitions. If this is ignored, performance of prediction methods will tend to be overestimated. Several algorithms have been proposed for homology reduction, where sequences are removed until no too-closely related pairs remain. We present GraphPart, an algorithm for homology partitioning that divides the data such that closely related sequences always end up in the same partition, while keeping as many sequences as possible in the dataset. Evaluation of GraphPart on Protein, DNA and RNA datasets shows that it is capable of retaining a larger number of sequences per dataset, while providing homology separation on a par with reduction approaches.
U2 - 10.1093/nargab/lqad088
DO - 10.1093/nargab/lqad088
M3 - Journal article
C2 - 37850036
AN - SCOPUS:85175205895
VL - 5
JO - NAR Genomics and Bioinformatics
JF - NAR Genomics and Bioinformatics
SN - 2631-9268
IS - 4
M1 - lqad088
ER -