I am a student researcher at Google and a PhD student in Computer Science at the Weizmann Institute of Science, supervised by Prof. Shimon Ullman. I study how vision-language models function, exploring their core mechanisms, strengths, and limitations, mainly by developing new data and training approaches.
I earned my Master's degree in Electrical Engineering from Tel Aviv University and my Bachelor's degree in Electrical Engineering from Ben-Gurion University of the Negev (BGU). In parallel with my academic journey, I have also worked at Applied Materials and IBM Research.
I am actively looking for student collaborators in the area of multi-modal learning.
Contact: sivan.doveh [at] weizmann.ac.il
Teaching VLMs to Localize Specific Objects from In-context Examples (IPLoc)
Sivan Doveh, Nimrod Shabtay, Wei Lin, Eli Schwartz, Hilde Kuehne, Raja Giryes, Rogerio Feris, Leonid Karlinsky, James Glass, Assaf Arbelle, Shimon Ullman, M. Jehanzeb Mirza
arXiv 2024
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Hiromi Wakaki, Yuki Mitsufuji, Assaf Arbelle, Leonid Karlinsky, Raja Giryes
ICLR 2025
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass
arXiv 2024
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
*Irene Huang, *Wei Lin, *M. Jehanzeb Mirza, Jacob Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, Chuang Gan, Aude Oliva, Rogerio Feris, Leonid Karlinsky
NeurIPS 2024
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger
ECCV 2024
[ Paper | Project Page | Code ]
Towards Multimodal In-Context Learning for Vision-Language Models
Sivan Doveh, Shaked Perek, M. Jehanzeb Mirza, Amit Alfassy, Assaf Arbelle, Shimon Ullman, Leonid Karlinsky
ECCVW 2024
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Shimon Ullman, Leonid Karlinsky
NeurIPS 2023 Spotlight
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gul Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, Leonid Karlinsky
ICCV 2023
[ Paper ]
Teaching Structured Vision-Language Concepts to Vision-Language Models
Sivan Doveh, Assaf Arbelle, Sivan Harary, Eli Schwartz, Roei Herzig, Raja Giryes, Rogerio Feris, Rameswar Panda, Shimon Ullman, Leonid Karlinsky
CVPR 2023
Detector-Free Weakly Supervised Grounding by Separation
*Assaf Arbelle, *Sivan Doveh, *Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky
ICCV 2021 Oral
[ Paper ]