TY - JOUR
T1 - Multimodal pretraining unmasked
T2 - A meta-analysis and a unified framework of vision-and-language BERTs
AU - Bugliarello, Emanuele
AU - Cotterell, Ryan
AU - Okazaki, Naoaki
AU - Elliott, Desmond
N1 - Publisher Copyright: © 2021, MIT Press Journals. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
AB - Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorized into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five vision and language BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
U2 - 10.1162/tacl_a_00408
DO - 10.1162/tacl_a_00408
M3 - Journal article
AN - SCOPUS:85119145818
VL - 9
SP - 978
EP - 994
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
SN - 2307-387X
ER -