A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling (IJCNLP-AACL 2025)
Kyle Buettner
Jacob T. Emmerson
Adriana Kovashka
University of Pittsburgh, Pittsburgh, PA, USA
[Paper]
[Code]

Abstract

When captioning an image, people describe objects in diverse ways, using different terms or including details that are perceptually noteworthy to them. Descriptions are especially distinct across languages and cultures. Machine translation, which has enhanced multilingual capabilities in vision-language models, often relies on text written by English speakers, leading to a perceptual bias. In this work, we outline a framework to address this bias. In particular, we use a small amount of cross-language native speaker data, nearest-neighbor example guidance, and multimodal LLM reasoning to produce targeted caption changes across languages. When adding the rewrites to multilingual CLIP finetuning, we improve performance on German and Japanese text-image retrieval case studies (up to +3.5 mean recall, +4.4 on native vs. translation errors). We also propose a mechanism to build understanding of object description variation across languages, and offer insights into cross-dataset and cross-language generalization.
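The sketch below illustrates the general shape of such a recaptioning loop: retrieve a few native-speaker captions as nearest-neighbor examples, then prompt a multimodal LLM to rewrite a machine-translated caption so its object terms match native usage. This is a minimal, hypothetical illustration, not the released implementation; the example captions, the embedding model name, and the `query_mllm` call are all assumed placeholders.

```python
# Minimal sketch of a nearest-neighbor-guided recaptioning loop
# (hypothetical; see the released code for the actual pipeline).
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Small pool of native-speaker captions (illustrative placeholders).
native_captions = [
    "Ein Radfahrer fährt bei Regen über eine Brücke.",
    "Eine Straßenbahn hält an einer belebten Haltestelle.",
]

# Machine-translated captions that may carry English-speaker perceptual bias.
translated_captions = [
    "A person rides a bike across a bridge in the rain.",
]

# Any multilingual sentence encoder works here; this model name is an assumption.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
native_emb = encoder.encode(native_captions, normalize_embeddings=True)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(native_emb)

def build_rewrite_prompt(translated, neighbors):
    """Assemble an instruction for a multimodal LLM: rewrite the translated
    caption so its object terms and level of detail match the native examples."""
    examples = "\n".join(f"- {c}" for c in neighbors)
    return (
        "Native-speaker caption examples:\n" + examples + "\n\n"
        "Rewrite the following machine-translated caption so that the way "
        "objects are named and described matches the native examples, keeping "
        "the image content unchanged:\n" + translated
    )

for caption in translated_captions:
    emb = encoder.encode([caption], normalize_embeddings=True)
    _, idx = index.kneighbors(emb)
    neighbors = [native_captions[i] for i in idx[0]]
    prompt = build_rewrite_prompt(caption, neighbors)
    # `query_mllm` stands in for whichever multimodal LLM is used; the rewritten
    # caption, paired with its image, would then be added to the multilingual
    # CLIP finetuning set alongside the original data.
    # rewritten = query_mllm(image, prompt)
    print(prompt)
```

In this sketch, the retrieval step supplies in-context examples of how native speakers describe similar scenes, and the prompt constrains the rewrite to description style rather than image content; the rewritten captions then augment the finetuning data rather than replace it.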

Acknowledgements

This work was supported by National Science Foundation Grant Nos. 2006885 and 2329992.