Teaching VLMs to Localize Specific Objects from In-context Examples

Sivan Doveh1,2 Nimrod Shabtay1,3 Wei Lin4 Eli Schwartz1 Hilde Kuehne1,5 Raja Giryes3 Rogerio Feris6 Leonid Karlinsky6 James Glass7 Assaf Arbelle1 Shimon Ullman2 M. Jehanzeb Mirza7
1IBM Research, 2Weizmann Institute of Science, 3Tel Aviv University, 4JKU Linz, 5Tuebingen AI Center, 6MIT-IBM, 7MIT CSAIL
Method Overview

In-context personalized localization involves localizing, in a scene (the query image), object instances of the same kind as the object presented as an in-context example. In this setting, the input to the model is a 'category name', an 'in-context image', its 'bounding box coordinates' (not shown in this figure), and a 'query image'. The model is tasked with localizing the category of interest (presented as the in-context example) in the 'query image'. Here, we visualize a few inputs and outputs from various VLMs, highlighting that our fine-tuned model better captures the information in the in-context image.
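Below is a minimal sketch of how such an input could be assembled for a generic chat-style VLM. The message schema, image placeholders, and normalized [x1, y1, x2, y2] coordinate format are illustrative assumptions, not the exact format used by any particular model in our experiments.

```python
# Sketch of an in-context personalized localization query for a chat-style VLM.
# Field names and coordinate conventions are assumptions for illustration.

def build_incontext_prompt(category, incontext_image, incontext_box, query_image):
    """Assemble one query: an in-context example (image + box) plus a query image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": incontext_image},
                {"type": "text",
                 "text": f"This is a {category}. Its bounding box is {incontext_box}."},
                {"type": "image", "image": query_image},
                {"type": "text",
                 "text": f"Locate the {category} shown above in this image and "
                         f"return its bounding box as [x1, y1, x2, y2]."},
            ],
        }
    ]

if __name__ == "__main__":
    messages = build_incontext_prompt(
        category="corgi",
        incontext_image="incontext.jpg",
        incontext_box=[0.12, 0.30, 0.58, 0.85],
        query_image="query.jpg",
    )
    print(messages)
```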

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. To instill personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context rather than prior knowledge. Our method significantly enhances few-shot localization performance without sacrificing generalization, as demonstrated on several benchmarks tailored to personalized localization. This work is the first to explore and benchmark personalized few-shot localization for VLMs, laying a foundation for future research in context-driven vision-language applications.

Method

Overview of data creation and conversation format. To instill few-shot personalized localization abilities in VLMs, our IPLoc method creates multi-modal conversations by harnessing data from multiple video object tracking datasets. For semantic coherence, personalization, and stronger contextual awareness, we build each conversation by sampling frames from the same video that track a particular object of interest, and we augment the training data by extending the conversations with versions in which the true category name is replaced by a pseudo-name. These conversations are later employed to induce contextual awareness in VLMs. A sketch of this conversation-creation step is given below.
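The following is a minimal sketch of turning one object track into an instruction-tuning dialogue, assuming a tracking record of the form {"category": str, "frames": [...], "boxes": [...]}. The pseudo-name list, sampling strategy, and message schema are illustrative assumptions; the exact training format used by IPLoc may differ.

```python
# Sketch: build one instruction-tuning conversation from a video object track.
# The first n_incontext sampled frames act as in-context examples (image + box);
# the last sampled frame is the query whose box becomes the target answer.
import random

PSEUDO_NAMES = ["blicket", "dax", "wug", "toma"]  # placeholder tokens (assumed)

def track_to_conversation(track, n_incontext=2, use_pseudo_name=True, seed=0):
    """Turn one object track into a multi-modal dialogue for fine-tuning."""
    rng = random.Random(seed)
    # Pseudo-name regularization: optionally hide the true category name so the
    # model must rely on the visual in-context examples, not prior knowledge.
    name = rng.choice(PSEUDO_NAMES) if use_pseudo_name else track["category"]

    # Sample frames from the same video so every example shows the same instance.
    idxs = sorted(rng.sample(range(len(track["frames"])), n_incontext + 1))
    context, query = idxs[:-1], idxs[-1]

    messages = []
    for i in context:
        messages.append({
            "role": "user",
            "content": [{"type": "image", "image": track["frames"][i]},
                        {"type": "text",
                         "text": f"The {name} is at {track['boxes'][i]}."}],
        })
    messages.append({
        "role": "user",
        "content": [{"type": "image", "image": track["frames"][query]},
                    {"type": "text", "text": f"Locate the {name} in this image."}],
    })
    messages.append({
        "role": "assistant",
        "content": [{"type": "text", "text": str(track["boxes"][query])}],
    })
    return messages
```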

Data Creation Overview

BibTeX

@article{doveh2024teaching,
  title={Teaching VLMs to Localize Specific Objects from In-context Examples},
  author={Sivan Doveh and Nimrod Shabtay and Wei Lin and Eli Schwartz and Hilde Kuehne and Raja Giryes and Rogerio Feris and Leonid Karlinsky and James Glass and Assaf Arbelle and Shimon Ullman and M. Jehanzeb Mirza},
  journal={arXiv preprint arXiv:2411.13317},
  year={2024},
}