Previously, I received my Ph.D. from Tel Aviv University, where I was advised by Prof. Amir Globerson; I have an Erdős number of 3.
I graduated magna cum laude with an M.Sc. in Computer Science, a B.Sc. in Computer Science, and a B.Sc. in Physics.
I received the 2023 IAAI Best PhD Thesis Award for the outstanding thesis in the field of Artificial Intelligence in Israel.
I'm looking for motivated Master's and senior undergraduate students to collaborate and publish in top-tier conferences
on Vision & Language, Vision & Robotics, and Video Understanding.
We introduce Robotic Steering, a finetuning approach grounded in mechanistic interpretability
that leverages few-shot demonstrations to identify and selectively finetune task-specific attention heads aligned with
the physical, visual, and linguistic requirements of robotic tasks.
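To make the head-selection idea concrete, here is a minimal sketch, assuming a PyTorch policy whose attention parameters contain "attn" in their names; the scoring rule (gradient magnitude on the demonstrations) is an illustrative stand-in, not the paper's exact criterion.
```python
# Minimal sketch of selective attention-head finetuning (illustrative, not
# the paper's exact recipe): score heads on few-shot demos, freeze the rest.
def head_relevance(model, demos, loss_fn):
    """Score each attention parameter by mean gradient magnitude on the demos."""
    model.zero_grad()
    for obs, action in demos:
        loss_fn(model(obs), action).backward()
    return {name: p.grad.abs().mean().item()
            for name, p in model.named_parameters()
            if "attn" in name and p.grad is not None}

def freeze_all_but_top_k(model, scores, k=8):
    """Finetune only the k highest-scoring attention parameters."""
    keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
    for name, p in model.named_parameters():
        p.requires_grad_(name in keep)
```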
IVA is a unified framework for Vision-Language-Action models that detects false-premise instructions,
clarifies them in language, and acts safely—improving robustness while preserving task performance.
We propose ARM4R, an Autoregressive Robotic Model that leverages
low-level 4D representations learned from human video data to yield a better pre-trained robotic model.
We propose LLARVA, a model trained with a novel instruction tuning method that
leverages structured prompts to unify a range of robotic configurations
and introduces the concept of visual traces to further align the vision and action spaces.
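As a flavor of what such a structured prompt might look like (the field names below are my illustration, not the paper's exact template):
```python
# Illustrative structured prompt in the spirit of LLARVA's instruction tuning;
# the fields shown here are assumptions, not the released template.
PROMPT = (
    "Robot: {robot}. Control mode: {control}. "
    "Task: {instruction}. Proprioception: {state}. "
    "Predict the next {n} actions and the 2-D visual trace of the end effector."
)

example = PROMPT.format(
    robot="Franka Panda",
    control="delta end-effector pose",
    instruction="pick up the red block",
    state="[0.31, -0.02, 0.45, ...]",
    n=8,
)
```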
Selected Publications: Vision-and-Language
Selected recent work in vision and language, with a particular focus on multimodal foundation models.
SAVs is a finetuning-free method that leverages sparse attention head activations (fewer than 5% of heads)
in LMMs as powerful feature representations for vision-language classification, outperforming both few-shot and finetuned baselines.
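A minimal sketch of the idea, assuming per-head activations have already been extracted as NumPy arrays; the dot-product nearest-centroid scoring below is a simplification of the paper's selection rule.
```python
# Hedged sketch: pick the most discriminative heads on a small support set,
# then classify queries by majority vote over per-head nearest centroids.
import numpy as np

def select_heads(acts, labels, n_keep):
    """acts: [n_examples, n_heads, dim]. Keep heads whose per-head
    nearest-centroid accuracy on the support set is highest."""
    classes = np.unique(labels)
    scores = []
    for h in range(acts.shape[1]):
        centroids = np.stack([acts[labels == c, h].mean(0) for c in classes])
        pred = classes[np.argmax(acts[:, h] @ centroids.T, axis=1)]
        scores.append((pred == labels).mean())
    return np.argsort(scores)[-n_keep:]

def classify(query_acts, support_acts, support_labels, heads):
    """query_acts: [n_heads, dim]. Majority vote over the selected heads."""
    classes = np.unique(support_labels)
    votes = []
    for h in heads:
        centroids = np.stack([support_acts[support_labels == c, h].mean(0)
                              for c in classes])
        votes.append(classes[np.argmax(query_acts[h] @ centroids.T)])
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]
```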
We introduce Granite Vision, a lightweight large language model with vision capabilities,
specifically designed to excel in enterprise use cases, particularly in visual document understanding.
We demonstrate the existence of multimodal task vectors, compact implicit representations of many-shot in-context examples
compressed in the model's attention heads, and leverage them for many-shot in-context learning in LMMs.
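Roughly, the recipe is: average a head's activation over many-shot prompts, then patch that vector back in at inference. A hedged sketch, assuming a model that exposes `model.layers[i].attn` with outputs shaped [batch, seq, heads, dim]; that interface is an assumption for illustration, not a real API.
```python
# Sketch of extracting and patching a task vector at one (layer, head).
import torch

def extract_task_vector(model, many_shot_inputs, layer, head):
    acts = []
    def hook(_, __, out):
        # out: [batch, seq, n_heads, head_dim]; keep the last-token activation
        acts.append(out[:, -1, head].detach())
    h = model.layers[layer].attn.register_forward_hook(hook)
    with torch.no_grad():
        for x in many_shot_inputs:
            model(x)
    h.remove()
    return torch.cat(acts).mean(0)  # the "task vector" for this head

def patch_task_vector(model, layer, head, vec):
    def hook(_, __, out):
        out[:, -1, head] = vec  # replace the activation at inference time
        return out
    return model.layers[layer].attn.register_forward_hook(hook)
```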
We present TraveLER, a modular multi-LMM agent framework for video question-answering
that does not require task-specific fine-tuning or annotations.
Through interactive question-asking using several agents with different roles,
our framework aims to answer the question by collecting relevant information from keyframes.
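Schematically, the loop looks like this; the agent interfaces are placeholders for illustration, not the released code's API.
```python
# Schematic of the TraveLER plan/retrieve/extract/evaluate loop.
def traveler(video, question, lmm, max_rounds=5):
    memory = []                                      # accumulated frame-level evidence
    for _ in range(max_rounds):
        plan = lmm.plan(question, memory)            # Planner: decide what to look for
        for frame in video.sample(plan.timestamps):  # Retriever: pick keyframes
            memory.append(lmm.describe(frame, plan.queries))  # Extractor: per-frame Q&A
        answer, confident = lmm.evaluate(question, memory)    # Evaluator: enough info?
        if confident:
            return answer
    return answer
```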
We propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting
method that utilizes scene graph representations to extract compositional knowledge from an LMM.
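The two-stage prompt can be paraphrased as follows; the exact wording here is my approximation, not the paper's verbatim prompt.
```python
# CCoT in two LMM calls: first elicit a question-relevant scene graph,
# then answer conditioned on that graph.
SG_PROMPT = (
    "For the provided image and its question, generate a scene graph in JSON "
    "that includes: (1) objects relevant to answering the question, "
    "(2) their attributes, and (3) the relations between them.\nQuestion: {q}"
)
ANSWER_PROMPT = "Scene graph: {sg}\nUse the scene graph to answer: {q}"

def ccot(lmm, image, question):
    scene_graph = lmm.generate(image, SG_PROMPT.format(q=question))
    return lmm.generate(image, ANSWER_PROMPT.format(sg=scene_graph, q=question))
```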
We propose to improve pretrained VLMs, which are usually trained on large-scale image-text pairs,
with a specialized model architecture and a new training method that utilizes
a small set of scene graph annotations from the Visual Genome dataset;
these annotations are richer than image-text pairs and reflect structured visual and textual information.
We propose a fine-tuning approach that automatically treats two factors limiting VL models' compositional reasoning performance:
(i) caption quality, i.e., how well the text is aligned with the image;
and (ii) caption density, i.e., how many of the image's details the caption mentions.
We present SViT (Structured Video Tokens), a model that exploits the structure
of a small set of images, drawn from within or outside the domain of interest
and available only during training, to improve a downstream video task.
We introduce the formalism of Action Graphs,
a natural and convenient structure representing the dynamics of actions between objects over time.
We show that we can synthesize goal-oriented videos on the CATER and Something-Something datasets
and generate novel compositions of unseen actions.
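One natural way to encode such a graph is as a timed, directed multigraph over object nodes; the schema below is my illustration, not the paper's exact formalism.
```python
# A possible action-graph encoding: objects as nodes, timed actions as edges.
from dataclasses import dataclass, field

@dataclass
class ActionEdge:
    subject: int   # object node id performing / initiating the action
    target: int    # object node id acted upon
    action: str    # e.g. "push", "rotate"
    start: int     # start frame
    end: int       # end frame

@dataclass
class ActionGraph:
    objects: list[str] = field(default_factory=list)  # node id -> category
    edges: list[ActionEdge] = field(default_factory=list)

    def active_at(self, t):
        """Actions in progress at frame t, driving generation for that frame."""
        return [e for e in self.edges if e.start <= t <= e.end]
```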
We present a novel model that inherently learns canonical graph representations and shows better
robustness to graph size, adversarial attacks, and semantically equivalent graphs,
thus generating superior images of complex visual scenes.
We propose a novel compositional action recognition task where the training combinations of verbs and nouns do not overlap with the test set.
We show the effectiveness of our approach on the proposed compositional task and
on a few-shot compositional setting that requires the model to generalize across both object appearance and action category.
We propose a latent inter-object graph representation for activity recognition
that explores the visual interaction between the objects in a self-supervised manner.
We collect SKU-110K, a new dataset that takes detection challenges into unexplored territory,
and propose a novel mechanism to learn deep overlap rates for each detection.
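The overlap idea can be sketched as a small regression head that predicts each detection's overlap with its ground-truth box; this is a simplified take for illustration, not the paper's exact layer.
```python
# Sketch of an overlap-rate head for dense detection: a per-detection
# regressor trained against the true IoU of each predicted box, so that
# low-overlap duplicates in packed scenes can be suppressed at test time.
import torch.nn as nn

class OverlapHead(nn.Module):
    """Predicts an overlap rate in [0, 1] from per-detection RoI features."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, roi_feats):               # [n_detections, in_dim]
        return self.fc(roi_feats).squeeze(-1)   # [n_detections]
```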
We propose a novel permutation-invariant graph network for mapping images to scene graphs,
which achieves new state-of-the-art results on the Visual Genome dataset.
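The core building block can be sketched as a permutation-equivariant layer over per-entity features; this is a simplified version of the architecture, not the paper's exact network.
```python
# Minimal permutation-equivariant layer: each entity is updated from its own
# features plus a pooled summary of the whole set, so permuting the entities
# permutes the output identically (and pooling at the end gives invariance).
import torch
import torch.nn as nn

class EquivariantLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Linear(dim, dim)    # acts on each entity alone
        self.context = nn.Linear(dim, dim)  # acts on the pooled set summary

    def forward(self, x):                   # x: [batch, n_entities, dim]
        pooled = x.mean(dim=1, keepdim=True)
        return torch.relu(self.local(x) + self.context(pooled))
```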
Students and Collaborators
If you are a student interested in collaborating on research projects, please reach out.