We introduce MolmoWeb, an open visual web agent built on Molmo 2 that navigates websites using only screenshots. Alongside the model, we release MolmoWebMix, a large dataset of browser task demonstrations and GUI perception data. MolmoWeb achieves state-of-the-art results among open-weight web agents, outperforming even agents built on much larger proprietary models.
We introduce OmniView, a unified diffusion framework for 3D and 4D view synthesis that generalizes across novel view synthesis, camera-controlled video generation, and keyframe interpolation. By separately representing space, time, and view conditions, OmniView is competitive with task-specific models across diverse benchmarks.
We introduce DiScoFormer, a "train-once, infer-anywhere" equivariant Transformer that maps i.i.d. samples to both density values and score vectors, generalizing across distributions and sample sizes. We prove that self-attention can recover normalized KDE, and show the model outperforms classical methods for density estimation, Fisher information computation, and Fokker-Planck-type PDEs.
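As a rough illustration of the link between attention and kernel density estimation mentioned above, the sketch below shows one elementary identity in 1-D: with logits given by negative squared distances, the softmax normalizer is proportional to a Gaussian KDE, and the attention output over the values (x_i - x) / h^2 equals the gradient of the log-KDE. This is not the construction or proof from the paper; the function names, the bandwidth `h`, and the 1-D Gaussian kernel are all assumptions made for the example.

```python
import math

def gaussian_kde(x, samples, h):
    """Classical 1-D Gaussian KDE: (1/n) * sum_i N(x; x_i, h^2)."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)

def attention_density_and_score(x, samples, h):
    """Softmax attention with query x, keys x_i, logits -(x - x_i)^2 / (2 h^2).

    Hypothetical helper: returns the KDE value recovered from the softmax
    normalizer and the score (d/dx log KDE) recovered from the attention output.
    """
    logits = [-0.5 * ((x - xi) / h) ** 2 for xi in samples]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax attention weights

    # The (unshifted) softmax normalizer, rescaled, is the Gaussian KDE.
    density = z * math.exp(shift) / (len(samples) * h * math.sqrt(2.0 * math.pi))
    # Attention-weighted average of (x_i - x) / h^2 is the log-density gradient.
    score = sum(w * (xi - x) / h ** 2 for w, xi in zip(weights, samples))
    return density, score
```

Calling `attention_density_and_score(0.3, [-1.0, 0.0, 0.5, 2.0], 0.7)` should return a density that agrees with `gaussian_kde` on the same inputs, together with the matching score.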
We introduce MultiRef, a benchmark and dataset for controllable image generation with multiple visual references. MultiRef-bench offers 1,990 evaluation samples, and the MultiRef dataset provides 38k high-quality images via our RefBlend engine. Experiments show that even state-of-the-art models struggle with multi-reference conditioning, underscoring challenges and opportunities for more flexible creative tools.
We propose a deterministic sampling method that learns time-varying scores on-the-fly to sample from unnormalized densities. Our approach produces smooth trajectories with monotone convergence, achieving the same optimal rates as exact gradient flow while being more sample efficient than stochastic methods.
We introduce REALEDIT, a large-scale image editing dataset with authentic user requests and human-made edits from Reddit, enabling models to better address real-world needs.
Our model, finetuned on the REALEDIT dataset, achieves state-of-the-art performance and generates high-quality edits.
Leveraged neural networks and statistical methods to optimize prediction intervals.
Implemented Jackknife resampling to construct robust intervals based on empirical error distributions.
Designed a dual-network architecture to predict upper and lower confidence bounds, employing custom asymmetric loss functions.
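A minimal sketch of the two ideas above, under stated assumptions: a leave-one-out (jackknife) interval built from the empirical distribution of absolute residuals, and a pinball (quantile) loss as one common example of an asymmetric loss for training separate upper- and lower-bound predictors. The 1-D least-squares model, the function names, and the choice of pinball loss are illustrative assumptions, not the project's actual implementation.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def jackknife_interval(xs, ys, x_new, alpha=0.1):
    """Prediction interval from leave-one-out absolute residuals."""
    residuals = []
    for i in range(len(xs)):
        # Refit without point i, then measure the error on the held-out point.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residuals.append(abs(ys[i] - (slope * xs[i] + intercept)))
    residuals.sort()
    n = len(residuals)
    # Conservative (1 - alpha) empirical quantile, clipped to the largest residual.
    k = min(n, math.ceil((1 - alpha) * (n + 1)))
    q = residuals[k - 1]
    slope, intercept = fit_line(xs, ys)
    center = slope * x_new + intercept
    return center - q, center + q

def pinball_loss(y_true, y_pred, tau):
    """Asymmetric loss: under-prediction costs tau, over-prediction costs 1 - tau."""
    diff = y_true - y_pred
    return max(tau * diff, (tau - 1) * diff)
```

In a dual-network setup of this kind, one network might be trained with a high `tau` (for example 0.95) to produce the upper bound and the other with a low `tau` (for example 0.05) to produce the lower bound.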
Nonogram-inspired game deployed on Reddit using Devvit. Playable in a Reddit post.
Rendering is done in TypeScript, user data is collected and stored via the Redis API, and backend puzzle generation is implemented in Python.
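To give a sense of what backend puzzle generation for a Nonogram-style game can involve, here is a minimal sketch that derives row and column clues from a filled grid. The function names, the random-fill strategy, and the convention of using `[0]` for an empty line are assumptions for illustration; the game's actual generation logic is not described here.

```python
import random
from itertools import groupby

def line_clues(line):
    """Lengths of consecutive filled runs in one row or column."""
    runs = [sum(1 for _ in group) for filled, group in groupby(line) if filled]
    return runs or [0]  # assumed convention: an empty line gets the clue [0]

def puzzle_clues(grid):
    """Row clues and column clues for a rectangular grid of 0/1 cells."""
    rows = [line_clues(row) for row in grid]
    cols = [line_clues(col) for col in zip(*grid)]  # zip(*grid) transposes
    return rows, cols

def random_grid(size, fill=0.5, seed=None):
    """Hypothetical generator: fill each cell independently with probability `fill`."""
    rng = random.Random(seed)
    return [[1 if rng.random() < fill else 0 for _ in range(size)]
            for _ in range(size)]
```

A random grid does not guarantee a unique solution, so a production generator would likely add a solver-based uniqueness check before publishing a puzzle.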
Sloop, designed by Tim and Akash, is a browser-based game. It is built with Node.js, bootstrapped with Create Next App, and deployed on Vercel.
I contributed additional features and hidden Easter eggs to enhance gameplay and user experience.