Compositional Generalization in Image Captioning

Mitja Nikolaus, Mostafa Abdou, Matthew Lamm, Rahul Aralikatte, Desmond Elliott

Publication: Contribution to book/anthology/report > Conference contribution in proceedings > Research > peer-reviewed

23 Citations (Scopus)
160 Downloads (Pure)

Abstract

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this, we propose a multi-task model that combines caption generation and image-sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts than state-of-the-art captioning models.
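
The re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the image and each candidate caption have already been embedded in a shared vector space, and it uses cosine similarity as the score. The names `cosine` and `rerank` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(image_vec, candidates):
    """Sort (caption, caption_vec) pairs by similarity to the image, best first.

    `candidates` stands in for the output of a caption generator, for example
    the hypotheses kept by beam search (an assumption of this sketch).
    """
    return sorted(candidates, key=lambda c: cosine(image_vec, c[1]), reverse=True)


# Toy, made-up embeddings; real ones would come from trained encoders.
image_vec = [1.0, 0.0]
candidates = [
    ("a dog on a bench", [0.0, 1.0]),
    ("a white dog on a bench", [1.0, 0.1]),
    ("a bench", [0.5, 0.5]),
]

ranked = [caption for caption, _ in rerank(image_vec, candidates)]
print(ranked[0])
```

In this toy setup the caption whose vector points closest to the image vector is moved to the front, regardless of the order in which the generator produced the candidates.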
Original language: English
Title: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Number of pages: 12
Publisher: Association for Computational Linguistics
Publication date: 1 Nov 2019
Pages: 87-98
DOI
Status: Published - 1 Nov 2019
Event: 23rd Conference on Computational Natural Language Learning - Hong Kong, China
Duration: 3 Nov 2019 to 4 Nov 2019

Conference

Conference: 23rd Conference on Computational Natural Language Learning
Country/Territory: China
City: Hong Kong
Period: 03/11/2019 to 04/11/2019
