TY - JOUR
T1 - Occluded Video Instance Segmentation: A Benchmark
AU - Qi, Jiyang
AU - Gao, Yan
AU - Hu, Yao
AU - Wang, Xinggang
AU - Liu, Xiaoyu
AU - Bai, Xiang
AU - Belongie, Serge
AU - Yuille, Alan
AU - Torr, Philip H. S.
AU - Bai, Song
N1 - Publisher Copyright:
© 2022, The Author(s).
PY - 2022
Y1 - 2022
AB - Can our video understanding systems perceive objects when heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand occluded instances through contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in real-world scenarios. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.
KW - Benchmark
KW - Dataset
KW - Occlusion reasoning
KW - Video instance segmentation
KW - Video understanding
UR - http://www.scopus.com/inward/record.url?scp=85132288284&partnerID=8YFLogxK
DO - 10.1007/s11263-022-01629-1
M3 - Journal article
AN - SCOPUS:85132288284
VL - 130
SP - 2022
EP - 2039
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
SN - 0920-5691
ER -