Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection

Published in the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

We analyze whether the attributes in captions used to supervise vision-language (VL) models lead to representations that are useful for fine-grained tasks. We find that these models have limited fine-grained utility, so we propose a negative sampling strategy for contrastive learning to improve attribute sensitivity.
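
To make the idea of attribute-aware negative sampling concrete, here is a minimal sketch. It is not the paper's implementation: this summary does not specify the sampling strategy, so the sketch assumes that hard negatives are built by swapping a single attribute word in a caption. The helper names (`swap_attribute`, `contrastive_loss`), the temperature value, and the toy embeddings are all hypothetical, and an InfoNCE-style loss stands in for whatever objective the paper actually uses.

```python
import math


def swap_attribute(caption, attribute, replacement):
    """Build a hard-negative caption by swapping one attribute word (assumed strategy)."""
    return " ".join(replacement if w == attribute else w for w in caption.split())


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_emb, positive_emb, negative_embs, temperature=0.07):
    """InfoNCE-style loss for one image: the positive caption against negative captions."""
    logits = [cosine(image_emb, positive_emb) / temperature]
    logits += [cosine(image_emb, n) / temperature for n in negative_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Hypothetical caption and hand-picked toy embeddings (no real encoder involved).
negative_caption = swap_attribute("a red car on the road", "red", "blue")

image = [1.0, 0.0]
positive = [0.9, 0.1]    # caption with the correct attribute
hard_neg = [0.8, 0.3]    # attribute-swapped caption, still close to the image
easy_neg = [-1.0, 0.0]   # unrelated caption, far from the image

loss_hard = contrastive_loss(image, positive, [hard_neg])
loss_easy = contrastive_loss(image, positive, [easy_neg])
```

In this toy setup the attribute-swapped negative yields a larger loss than the unrelated one, which illustrates why such negatives could push a model to attend to attributes rather than only to object categories.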