My research interests are in computer vision, machine learning and computer graphics.
In particular, I am interested in generative models, neural-based signal representations, and inverse graphics.
I am also interested in how disentangled and generative models can be used to better understand the visual world.
I am always seeking excellent students and postdocs. If you are interested, please contact me!
We propose Through-The-Mask, a two-stage framework for Image-to-Video generation that uses mask-based motion trajectories to enhance object-specific motion accuracy and consistency, achieving state-of-the-art results, particularly in multi-object scenarios.
We introduce 'PuTT', a novel method for learning a coarse-to-fine tensor train representation of visual data, effective for 2D/3D fitting and novel view synthesis, even with noisy or missing data.
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by a pretrained text-to-video generation model.
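As a rough illustration of the adaptor idea, the sketch below maps a pooled audio embedding into a sequence of conditioning tokens for a frozen text-to-video generator; the module name, dimensions, and the two-layer MLP design are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class AudioToTextAdaptor(nn.Module):
    """Lightweight adaptor: audio embedding -> conditioning tokens (illustrative sketch)."""
    def __init__(self, audio_dim=768, cond_dim=1024, n_tokens=77, hidden_dim=2048):
        super().__init__()
        self.n_tokens = n_tokens
        self.cond_dim = cond_dim
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, n_tokens * cond_dim),
        )

    def forward(self, audio_emb):                     # audio_emb: (B, audio_dim)
        tokens = self.mlp(audio_emb)                  # (B, n_tokens * cond_dim)
        return tokens.view(-1, self.n_tokens, self.cond_dim)  # (B, n_tokens, cond_dim)
```

Only the adaptor's parameters would be trained; the text-to-video model stays frozen and simply receives the adaptor's output in place of its usual text-encoder conditioning.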
We propose a framework for disentangling a 3D scene into foreground and background volumetric representations, and show a variety of downstream applications involving 3D manipulation.
We propose to adapt RL-based methods similar to Reinforcement Learning from Human Feedback (RLHF) to the task of unsupervised object discovery, i.e., learning to detect objects from LiDAR points without any training labels.
We propose a non-invasive fine-tuning technique for text-to-image diffusion models that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier, which guides the generation. This is done by iteratively modifying the embedding of a single input token of the text-to-image diffusion model, using the classifier to steer generated images toward a given target class.
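A simplified sketch of this token-optimization loop is given below; `generate_images` stands in for a (differentiable or approximated) sampling pass of the frozen diffusion model, and all names and hyperparameters are illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def tune_token(token_emb, generate_images, classifier, target_class, steps=100, lr=1e-2):
    # token_emb: embedding of the single placeholder token being optimized (all other weights frozen)
    token_emb = token_emb.clone().requires_grad_(True)
    optim = torch.optim.Adam([token_emb], lr=lr)
    for _ in range(steps):
        images = generate_images(token_emb)            # images conditioned on the current token
        logits = classifier(images)                    # frozen pretrained classifier
        target = torch.full((images.shape[0],), target_class, device=logits.device)
        loss = F.cross_entropy(logits, target)         # discriminative signal toward the target class
        optim.zero_grad()
        loss.backward()                                # gradient steers the token embedding
        optim.step()
    return token_emb.detach()
```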
We propose a new class of neural fields called basis-encoded polynomial neural fields (PNFs). The key advantage of a PNF is that it can represent a signal as a composition of a number of manipulable and interpretable components, without losing the merits of neural field representations.
A new style (essence) transfer method that incorporates higher-level abstractions than textures and colors. TargetCLIP introduces a blending operator that combines the powerful StyleGAN2 generator with the semantic network CLIP to achieve a more natural blending than either model achieves separately.
Text2Mesh produces color and geometric details over a variety of source meshes, driven by a target text prompt. Our stylization results coherently blend unique and ostensibly unrelated combinations of text, capturing both global semantics and part-aware attributes.
We introduce JOKR - a JOint Keypoint Representation that captures the motion common to both the source and target videos, without requiring any object prior or data collection.
This geometry-driven representation allows for unsupervised motion retargeting in a variety of challenging situations, as well as for further intuitive control, such as temporal coherence and manual editing.
FewGAN is a generative model for generating novel, high-quality and diverse images whose patch distribution lies in the joint patch distribution of a small number N of training samples.
A new image transformer architecture that first applies local attention over patches and their local shifts, resulting in virtually located patches that are not bound to a single, specific location. Subsequently, these virtually located patches are used in a global attention layer.
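A minimal sketch of this two-stage attention, under the assumption of one token per patch and a fixed set of S local shifts per patch, is shown below; the module names and the use of `nn.MultiheadAttention` are simplifications, not the paper's exact blocks.

```python
import torch
import torch.nn as nn

class VirtualPatchAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches, shifted):
        # patches: (B, P, D) one token per patch; shifted: (B, P, S, D) tokens for S local shifts
        B, P, D = patches.shape
        q = patches.reshape(B * P, 1, D)
        kv = shifted.reshape(B * P, -1, D)
        virtual, _ = self.local_attn(q, kv, kv)            # each patch attends to its own shifts
        virtual = virtual.reshape(B, P, D)                 # "virtually located" patch tokens
        out, _ = self.global_attn(virtual, virtual, virtual)  # global attention over virtual patches
        return out
```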
We consider the setting of few-shot anomaly detection in images, where only a few images are given at training. We devise a hierarchical generative model that captures the multi-scale patch distribution of each training image. We further enhance the representation of our model by using image transformations and optimize scale-specific patch-discriminators to distinguish between real and fake patches of the image, as well as between different transformations applied to those patches.
A simple architectural change that forces the network to reduce its bias toward global image statistics. Using AdaIN, we stochastically swap the global statistics of samples within a batch with some probability p. This results in significant improvements in multiple settings, including domain adaptation, domain generalization, robustness, and image classification.
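The statistics swap itself is simple to write down; the following is a minimal sketch, where the permutation-based pairing and parameter names are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def swap_global_stats(x, p=0.5, eps=1e-5):
    # x: (B, C, H, W). With probability p per sample, replace its per-channel mean/std with those
    # of another sample in the batch (AdaIN-style), while keeping its own normalized content.
    B = x.shape[0]
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    perm = torch.randperm(B)
    swapped = (x - mean) / std * std[perm] + mean[perm]
    mask = (torch.rand(B, 1, 1, 1, device=x.device) < p).float()
    return mask * swapped + (1 - mask) * x
```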
StyleGAN can be used to upscale a low-resolution thumbnail image of a person to a higher-resolution image. However, it often changes the person's identity or produces biased solutions, such as Caucasian faces. We present an upscaling method that preserves the person's identity and other attributes.
We develop theoretical foundations for the success of unsupervised cross-domain mapping algorithms in mapping between two domains that share common characteristics, with particular emphasis on the ambiguity inherent in such mappings.
Two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID).
We explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B. We seek to generate images that are structurally aligned: that is, to generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A. Our method can be used for: guided image synthesis, style and texture transfer, text translation as well as video translation.
We consider the task of generating diverse and novel videos from a single video sample. We introduce a novel patch-based variational autoencoder (VAE) which allows for a much greater diversity in generation. Using this tool, a new hierarchical video generation scheme is constructed resulting in diverse and high quality videos.
We train a network called SpeedNet to automatically predict the "speediness" of moving objects in videos, i.e., whether they move faster than, at, or slower than their "natural" speed. SpeedNet is trained in a self-supervised manner and can be used to generate time-varying, adaptive video speedups, as well as to boost the performance of self-supervised action recognition and video retrieval.
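One natural way to create such a training signal without labels is to contrast clips sampled at their original frame rate with clips sampled at a higher stride; the sketch below illustrates this label construction under that assumption (the actual sampling strategy and speed factors used in the paper may differ).

```python
import torch

def make_speediness_batch(video, clip_len=16):
    # video: (T, C, H, W). Build one "natural speed" clip and one artificially sped-up clip
    # (frames sampled with stride 2); the binary label records whether the clip was sped up.
    T = video.shape[0]
    assert T > 2 * clip_len, "video too short for this sketch"
    start = torch.randint(0, T - 2 * clip_len, (1,)).item()
    normal = video[start : start + clip_len]              # stride 1 -> natural speed, label 0
    fast = video[start : start + 2 * clip_len : 2]        # stride 2 -> sped up, label 1
    clips = torch.stack([normal, fast])                   # (2, clip_len, C, H, W)
    labels = torch.tensor([0, 1])
    return clips, labels
```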
We consider the problem of translating, in an unsupervised manner, between two domains where one contains some additional information compared to the other. To do so, we disentangle the common and separate parts of these domains and, through the generation of a mask, focus the attention of the underlying network on the desired augmentation, without wastefully reconstructing the entire target.
We present a method for recovering the shared content between two visual domains, as well as the content that is unique to each domain. This allows us to remove content specific to the first domain and add content specific to the second domain. We can also generate samples from the intersection of the two domains and from their union, despite having no such samples during training.
We study the problem of semi-supervised singing voice separation, in which the training data contains a set of samples of mixed music (singing and instrumental) and an unmatched set of instrumental music. Our results indicate that we are on a par with or better than fully supervised methods, which are also provided with training samples of unmixed singing voices, and are better than other recent semi-supervised methods.
We study the problem of learning to map, in an unsupervised way, between domains A and B, such that samples b in B contain all the information that exists in samples a in A, as well as some additional information.
We study a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that, given two nearby samples, predicts which sample has the higher value of the unknown function v.
While in supervised learning the validation error is an unbiased estimator of the generalization (test) error and complexity-based generalization bounds are abundant, no such bounds exist for learning a mapping in an unsupervised way. We propose a novel bound for predicting the success of unsupervised cross-domain mapping methods.
We discuss the feasibility of the unsupervised cross-domain generation problem. In the typical setting this problem is ill-posed: it seems possible to build infinitely many alternative mappings from every target mapping. We identify the abstract notion of aligning two domains and show that only a minimal architecture and a standard GAN loss are required to learn such mappings, without the need for a cycle loss.
We consider the problem of mapping, in an unsupervised manner, between two visual domains in a one-sided fashion. This is done by learning an equivariant mapping that maintains the distance between pairs of samples.
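As a minimal illustration of this distance constraint (ignoring the normalization a full method would likely apply), the loss below asks the mapping G to keep the distance between two source samples equal to the distance between their mapped versions.

```python
import torch

def distance_preservation_loss(a1, a2, G):
    # a1, a2: (B, D) batches of source samples; G: the learned mapping to the target domain.
    # Penalize any change in pairwise distance induced by the mapping.
    d_src = torch.norm(a1 - a2, dim=1)
    d_map = torch.norm(G(a1) - G(a2), dim=1)
    return torch.mean(torch.abs(d_src - d_map))
```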
Towards a Controllable Generation of the 3D World, Hebrew University of Jerusalem, Tel Aviv University, Weizmann Institute of Science, Bar Ilan University, Technion Institute, Haifa University, Meta AI Research Tel Aviv, 2022-2023.