Sivan Doveh

I am a Research Scientist at Google (Mountain View, CA) in the Creative Camera group.

Previously, I was a Postdoctoral Researcher at Stanford University. I hold a PhD in Computer Science from the Weizmann Institute of Science, supervised by Prof. Shimon Ullman. I studied how vision-language models function, exploring their core mechanisms, strengths, and limitations, mainly by developing new data and training approaches.

I earned my Master’s degree in Electrical Engineering from Tel Aviv University and my Bachelor’s degree in Electrical Engineering from Ben-Gurion University of the Negev (BGU). In parallel with my academic journey, I have also worked at Google, Applied Materials and IBM Research.

I am actively looking for student collaborators in the area of multi-modal learning.

Contact: sdoveh [at] gmail [dot] com

LinkedIn | Google Scholar | GitHub

Recent News

05/26: 2 papers accepted to CVPR, 2026 (Work done at Stanford)

05/26: "3rd - What's Next in Multi-Modal Foundation Models" got accepted to CVPR, 2026 (Work done at Stanford)

02/26: "PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies" accepted to ICLR, 2026 (Work done at Stanford)

07/25: "Teaching VLMs to Localize Specific Objects from In-context Examples (IPLoc)" accepted to ICCV, 2025.

05/25: Invited talk at BIU CS Multi-Modal Day.

05/25: Invited talk at New Tech Event.

04/25: Our workshop "Long Multi-Scene Video Foundations: Generation, Understanding and Evaluation" got accepted at ICCV 2025.

04/25: Invited talk at 14th Israel Machine Vision Conference (IMVC) 2025.

02/25: 2 papers accepted to CVPR, 2025 (workshops).

01/25: "LiveXiv" accepted to ICLR, 2025.

12/24: Our workshop "What's Next in Multi-Modal Foundation Models" got accepted to CVPR 2025.

09/24: Invited talk at TU Graz.

04/24: Invited talk at 13th Israel Machine Vision Conference (IMVC) 2024.

12/23: Our workshop "What's Next in Multi-Modal Foundation Models" got accepted to CVPR 2024.

Selected Publications

Teaching VLMs to Localize Specific Objects from In-context Examples (IPLoc)

Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle, Shimon Ullman, M. Jehanzeb Mirza

ICCV 2025

[ Paper | Code ]

Towards Multimodal In-Context Learning for Vision-Language Models

Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky

ECCVW 2024

[ Paper | Code ]

Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Shimon Ullman, Leonid Karlinsky

NeurIPS 2023 Spotlight

[ Paper | Code ]

Teaching Structured Vision-Language Concepts to Vision-Language Models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, Leonid Karlinsky

CVPR 2023

[ Paper | Code ]

(Equal Contrib.) Detector-Free Weakly Supervised Grounding by Separation

Assaf Arbelle*, Sivan Doveh*, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

ICCV 2021 Oral

[ Paper ]

LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Hiromi Wakaki, Yuki Mitsufuji, Assaf Arbelle, Leonid Karlinsky, Raja Giryes

ICLR 2025

[ Paper | Code ]

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass

arXiv 2024

[ Paper | Code ]

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky

NeurIPS 2024

[ Paper | Dataset | Code ]

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

ECCV 2024

[ Paper | Project Page | Code ]

Going Beyond Nouns With Vision & Language Models Using Synthetic Data

Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gul Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky

ICCV 2023

[ Paper ]