TY - JOUR
T1 - Inter- and Intra-Observer Agreement When Using a Diagnostic Labeling Scheme for Annotating Findings on Chest X-rays—An Early Step in the Development of a Deep Learning-Based Decision Support System
AU - Li, Dana
AU - Pehrson, Lea Marie
AU - Tøttrup, Lea
AU - Fraccaro, Marco
AU - Bonnevie, Rasmus
AU - Thrane, Jakob
AU - Sørensen, Peter Jagd
AU - Rykkje, Alexander
AU - Andersen, Tobias Thostrup
AU - Steglich-Arnholm, Henrik
AU - Stærk, Dorte Marianne Rohde
AU - Borgwardt, Lotte
AU - Hansen, Kristoffer Lindskov
AU - Darkner, Sune
AU - Carlsen, Jonathan Frederik
AU - Nielsen, Michael Bachmann
N1 - Publisher Copyright:
© 2022 by the authors.
PY - 2022
Y1 - 2022
AB - Consistent annotation of data is a prerequisite for the successful training and testing of artificial intelligence-based decision support systems in radiology. This can be obtained by standardizing terminology when annotating diagnostic images. The purpose of this study was to evaluate the annotation consistency among radiologists when using a novel diagnostic labeling scheme for chest X-rays. Six radiologists, with experience ranging from one to sixteen years, annotated a set of 100 fully anonymized chest X-rays. The blinded radiologists annotated the images on two separate occasions. Statistical analyses were performed using Randolph’s kappa and PABAK, and the proportions of specific agreements were calculated. Fair-to-excellent agreement was found for all labels among the annotators (Randolph’s kappa, 0.40–0.99). The PABAK ranged from 0.12 to 1 for the two-reader inter-rater agreement and from 0.26 to 1 for the intra-rater agreement. Descriptive and broad labels achieved the highest proportion of positive agreement in both the inter- and intra-reader analyses. Annotating findings with specific, interpretive labels was found to be difficult for less experienced radiologists. Annotating images with descriptive labels may increase agreement between radiologists with different experience levels compared to annotation with interpretive labels.
KW - artificial intelligence
KW - chest X-ray
KW - diagnostic scheme
KW - image annotation
KW - inter-rater
KW - intra-rater
KW - ontology
KW - radiologists
UR - http://www.scopus.com/inward/record.url?scp=85144620440&partnerID=8YFLogxK
U2 - 10.3390/diagnostics12123112
DO - 10.3390/diagnostics12123112
M3 - Journal article
C2 - 36553118
AN - SCOPUS:85144620440
VL - 12
JO - Diagnostics
JF - Diagnostics
SN - 2075-4418
IS - 12
M1 - 3112
ER -