Teaching AI to see the world and explain what it understands
About
I started my research making sense of what happens in videos: detecting actions, localizing moments, understanding temporal structure. That work took me through collaborations with CMU, KAUST, and teams at DARPA/IARPA, with publications at top-tier venues in computer vision and AI.
Over time I moved from "what is happening" to "does the model actually understand why?", working on multimodal fusion, cross-modal alignment, and learnable attention masks. My most recent work, CRYSTAL, shows that even the best multimodal models can't maintain coherent reasoning for more than a few steps.
I also care about making things work in practice. I built vLLM-MLX to run LLMs and vision-language models efficiently on Apple Silicon, and I've shipped production ML systems for enterprise and government through work with Adobe Research, Samsung Research, Northeastern University, Mount Sinai, Lunenfeld Institute, Universidad del Norte, EAFIT, Universidad CES, and Universidad de Antioquia. I founded Wiqonn to bring serious AI research to Latin America.
CRYSTAL is a 6,372-instance benchmark that evaluates multimodal reasoning by checking every intermediate step, not just the final answer. Testing 20 models revealed that none maintains step accuracy above 60% when intermediate steps are evaluated in their correct order. We propose Causal Process Reward, a training approach that improves step-level consistency by 32% without manual annotations.
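CRYSTAL's exact scoring rule isn't spelled out here, but the idea of order-sensitive step accuracy can be sketched. The function below is my own illustrative assumption, not the benchmark's implementation: it scores the fraction of gold reasoning steps that a model's output reproduces in the correct relative order, via longest common subsequence.

```python
def ordered_step_accuracy(predicted, gold):
    """Fraction of gold reasoning steps reproduced in the correct
    relative order, computed as LCS(predicted, gold) / len(gold).

    Illustrative sketch only; CRYSTAL's actual metric may differ.
    """
    m, n = len(predicted), len(gold)
    # Standard LCS dynamic program over the two step sequences.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if predicted[i] == gold[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n if n else 0.0
```

Under this scoring, a model that produces every step but swaps two of them is penalized, which is the failure mode order-sensitive evaluation is meant to expose.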
OpenAI-compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support on M1/M2/M3/M4 chips.
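Because the server speaks the OpenAI API, existing OpenAI clients can target it by overriding the base URL. The port, model id, and endpoint path below are assumptions for illustration, not documented defaults of vLLM-MLX; a minimal sketch of the standard chat-completions payload such a server accepts:

```python
import json

# Assumed local endpoint; adjust host/port to your vLLM-MLX setup.
BASE_URL = "http://localhost:8000/v1"

# An OpenAI-compatible server accepts the standard chat-completions
# request body, so OpenAI SDKs work by pointing base_url at it, e.g.:
#   from openai import OpenAI
#   client = OpenAI(base_url=BASE_URL, api_key="not-needed")
payload = {
    "model": "qwen2-vl",  # hypothetical model id; query /v1/models for real ones
    "messages": [
        {"role": "user", "content": "Describe this image."},
    ],
    "stream": False,
}

# Serialized body as it would be POSTed to {BASE_URL}/chat/completions.
body = json.dumps(payload)
```

The same request shape works for text-only and vision-language models; multimodal inputs are passed through the `content` field per the OpenAI message format.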