Abstract
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this, we propose a multi-task model that combines caption generation with image–sentence ranking, and uses a decoding mechanism that re-ranks the generated captions according to their similarity to the image. This model generalizes to unseen combinations of concepts substantially better than state-of-the-art captioning models.
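The re-ranking decoder described in the abstract can be illustrated with a short sketch: generate candidate captions (e.g., the top beam-search hypotheses), embed the image and each candidate in the ranking model's joint space, and reorder the candidates by similarity to the image. The `captioner` and `ranker` interfaces below (`beam_search`, `embed_image`, `embed_caption`) are hypothetical stand-ins, not the paper's actual API.

```python
import torch
import torch.nn.functional as F


def rerank_captions(image, captioner, ranker, beam_size=5):
    """Re-rank beam-search caption candidates by image-sentence similarity.

    Minimal sketch of the decoding idea; `captioner` and `ranker` are
    assumed interfaces, not the published implementation.
    """
    # Generate candidate captions, e.g. the top-k beam search hypotheses.
    candidates = captioner.beam_search(image, beam_size=beam_size)

    # Embed the image and each candidate into the ranker's joint space.
    image_emb = ranker.embed_image(image)                     # shape: (d,)
    caption_embs = torch.stack(
        [ranker.embed_caption(c) for c in candidates]         # shape: (k, d)
    )

    # Score each candidate by cosine similarity to the image embedding.
    scores = F.cosine_similarity(caption_embs, image_emb.unsqueeze(0), dim=-1)

    # Return the candidates ordered by similarity, best first.
    order = torch.argsort(scores, descending=True)
    return [candidates[i] for i in order]
```

In this sketch, the ranking component supplies the similarity scores, so captions that mention concept combinations the generator assigns low probability can still win if they match the image better.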
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) |
| Number of pages | 12 |
| Publisher | Association for Computational Linguistics |
| Publication date | 1 Nov 2019 |
| Pages | 87-98 |
| Publication status | Published - 1 Nov 2019 |
| Event | 23rd Conference on Computational Natural Language Learning, Hong Kong, China, 3 Nov 2019 → 4 Nov 2019 |