<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Hao AI Lab @ UCSD</title><link>https://haoailab.com/</link><description>Recent content on Hao AI Lab @ UCSD</description><image><title>Hao AI Lab @ UCSD</title><url>https://haoailab.com/img/HAOAILAB_trimmed.jpg</url><link>https://haoailab.com/img/HAOAILAB_trimmed.jpg</link></image><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Sun, 15 Mar 2026 12:00:00 -0800</lastBuildDate><atom:link href="https://haoailab.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Into the Dreamverse: Vibe Directing in FastVideo</title><link>https://haoailab.com/blogs/dreamverse/</link><pubDate>Sun, 15 Mar 2026 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/dreamverse/</guid><description>TL;DR: Our new real-time inference stack in FastVideo enables Dreamverse, a prototype for a new interface where users can vibe direct their own “multiverse” of videos.
AI video generation is already good enough to make a convincing clip. But real creative work is not about getting a clip in one shot. It’s about iteration. An idea appears, you test it: keep the subject, change the camera angle, continue the scene, and try again.</description></item><item><title>Create a 5s 1080p Video in 4.5s with FastVideo on a Single GPU</title><link>https://haoailab.com/blogs/fastvideo_realtime_1080p/</link><pubDate>Wed, 11 Mar 2026 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/fastvideo_realtime_1080p/</guid><description>TL;DR: If you work in media generation, you know the frustration: an idea pops into your mind, you type a prompt, maybe provide a reference image, and want to see the result immediately – while the idea is still alive. But existing video generation APIs break the loop. You wait minutes for clips, and each costs enough to make your wallet wince. FastVideo&amp;rsquo;s real-time inference stack fixes this problem.</description></item><item><title>From Physical Commonsense to Scientific Reasoning: Why World Modeling in Video Matters</title><link>https://haoailab.com/blogs/videoscience/</link><pubDate>Thu, 12 Feb 2026 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/videoscience/</guid><description>TL;DR: The golden age of AI video has mastered the &amp;ldquo;look&amp;rdquo; of reality, but it has yet to learn the laws of reality. Without adhering to rigorous scientific principles, even the most photorealistic model remains a high-fidelity hallucination engine rather than a reliable world simulator. 
To bridge this gap, we introduce VideoScience-Bench: the first benchmark specifically designed to move beyond &amp;ldquo;physical commonsense&amp;rdquo; and evaluate undergraduate-level scientific reasoning in video models.</description></item><item><title>DistCA</title><link>https://haoailab.com/summary/distca/</link><pubDate>Sun, 21 Dec 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/distca/</guid><description>Core Attention Disaggregation for Efficient Long-context Language Model Training</description></item><item><title>CAD: Disaggregating Core Attention for Efficient Long-context Language Model Training</title><link>https://haoailab.com/blogs/distca/</link><pubDate>Wed, 17 Dec 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/distca/</guid><description>TL;DR: Workload imbalance is one of the major problems in training long-context LLMs. Imbalance among data parallel (DP) and pipeline parallel (PP) workers introduces stragglers or bubbles that cause severe slowdown, and the problem becomes more severe as we scale to longer context lengths or more GPUs.
We believe that one of the major reasons for this slowdown is that the core attention, i.e., the $\text{softmax}(QK^T)V$ kernel, colocates with the other linear parts.</description></item><item><title>Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</title><link>https://haoailab.com/blogs/jacobi-forcing/</link><pubDate>Tue, 16 Dec 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/jacobi-forcing/</guid><description>TL;DR: Today’s best LLMs mostly decode autoregressively from left to right, which gives great quality but is terribly slow. Diffusion LLMs can decode many tokens in parallel thanks to their non-causal, any-order generation, but they must be trained from scratch, or expensively adapted from autoregressive (AR) checkpoints with a mismatched, non-causal diffusion objective; we find this mismatch often hurts quality and breaks many effective KV-cache related serving optimizations. This blog introduces Jacobi Forcing, a new training technique that converts LLMs into native causal parallel decoders.</description></item><item><title>JacobiForcing</title><link>https://haoailab.com/summary/jacobi-forcing/</link><pubDate>Tue, 16 Dec 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/jacobi-forcing/</guid><description>Fast and Accurate Causal Parallel Decoding using Jacobi Forcing</description></item><item><title>AUP: when Accuracy Meets Parallelism in Diffusion Language Models</title><link>https://haoailab.com/blogs/text-diffusion/</link><pubDate>Wed, 10 Dec 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/text-diffusion/</guid><description>TL;DR: Diffusion large language models (dLLMs) promise things that autoregressive LLMs cannot: parallel decoding, error correction, and random-order generation. Over the past year, a wave of papers has pushed this vision, and closed-source systems like Gemini Diffusion and Mercury report impressive throughput numbers. 
In this blog, we take a step back and ask a simple question: if we look at both speed and accuracy together, are diffusion LLMs actually better decoders than strong autoregressive (AR) models?</description></item><item><title>d3LLM</title><link>https://haoailab.com/summary/d3llm/</link><pubDate>Wed, 10 Dec 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/d3llm/</guid><description>Ultra-Fast Diffusion LLM 🚀</description></item><item><title>CausalWan-MoE Preview: Applying Self-Forcing Distillation To Wan2.2</title><link>https://haoailab.com/blogs/fastvideo_causalwan_preview/</link><pubDate>Tue, 18 Nov 2025 11:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/fastvideo_causalwan_preview/</guid><description>TL;DR: The FastVideo Team is excited to share some of our progress on distilling the Wan2.2-A14B model into an autoregressive architecture, alongside the release of our preview checkpoint for CausalWan2.2-I2V-A14B. In this blog, we’ll first discuss the new MoE architecture behind the open-source SOTA performance of Wan2.2, the differences between bidirectional and autoregressive video models, and then share some of the challenges we encountered when applying Self-Forcing distillation to this architecture.</description></item><item><title>Disaggregated Inference: 18 Months Later</title><link>https://haoailab.com/blogs/distserve-retro/</link><pubDate>Mon, 03 Nov 2025 00:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/distserve-retro/</guid><description>Eighteen months ago, our lab introduced DistServe with a simple bet: split LLM inference into prefill and decode, and scale them independently on separate compute pools. Today, almost every production-grade LLM serving framework – NVIDIA Dynamo, llm-d, Ray Serve LLM, SGLang, vLLM, LMCache, MoonCake – runs on disaggregation and demonstrates its power in large-scale, real-world LLM serving workloads, with many more continuing to push its boundaries. 
Concepts like TTFT (time-to-first-token) and TPOT (time-per-output-token), now standard latency metrics in nearly every serving benchmark, were also popularized through the lens of disaggregation.</description></item><item><title>Scaling Speculative Decoding with Lookahead Reasoning</title><link>https://haoailab.com/blogs/lookaheadreasoning/</link><pubDate>Mon, 22 Sep 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/lookaheadreasoning/</guid><description>TL;DR: We propose Lookahead Reasoning (LR), a technique that significantly accelerates large reasoning models (LRMs) and complements existing speculative decoding methods. Traditional token-level speculative decoding suffers from limited gains because the probability of correctly guessing a long sequence decreases exponentially with length. In contrast, LR operates at the step level, proposing future reasoning steps instead of individual tokens. This is much more effective since a proposed step only needs to be semantically correct, rather than matching exactly word for word.</description></item><item><title>Can RL-based LLM post-training on games generalize to other tasks? (GRL)</title><link>https://haoailab.com/blogs/grl/</link><pubDate>Wed, 27 Aug 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/grl/</guid><description/></item><item><title>A Practical Guideline to Using Lmgame-Bench</title><link>https://haoailab.com/blogs/lmgame-bench-use/</link><pubDate>Thu, 21 Aug 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/lmgame-bench-use/</guid><description/></item><item><title>FastWan: Generating a 5-Second Video in 5 Seconds via Sparse Distillation</title><link>https://haoailab.com/blogs/fastvideo_post_training/</link><pubDate>Mon, 04 Aug 2025 11:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/fastvideo_post_training/</guid><description>TL;DR: We introduce FastWan, a family of video generation models trained via a new recipe we term “sparse distillation”. 
Powered by FastVideo, FastWan2.1-1.3B generates a 5-second 480P video end-to-end in 5 seconds (denoising time 1 second) on a single H200 and in 21 seconds (denoising time 2.8 seconds) on a single RTX 4090. FastWan2.2-5B generates a 5-second 720P video in 16 seconds on a single H200. All resources — model weights, training recipe, and dataset — are released under the Apache-2.</description></item><item><title>From Pokémon Red to Standardized Game-as-an-Eval</title><link>https://haoailab.com/blogs/lmgame-bench/</link><pubDate>Fri, 20 Jun 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/lmgame-bench/</guid><description/></item><item><title>FastVideo V1: A Unified Framework for Accelerated Video Generation</title><link>https://haoailab.com/blogs/fastvideo/</link><pubDate>Thu, 24 Apr 2025 11:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/fastvideo/</guid><description>TL;DR: We are announcing FastVideo V1, a unified framework that accelerates video generation. This new version features a clean, consistent API that works across popular video models, making it easier for developers to author new models and incorporate system- or kernel-level optimizations. For example, FastVideo V1 provides a 3x inference speedup while maintaining quality by seamlessly integrating SageAttention and TeaCache.
What&amp;rsquo;s New: Modern open-source video generation models such as HunyuanVideo and Wan2.</description></item><item><title>ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Format Restriction, and Column Exploration</title><link>https://haoailab.com/blogs/reforce/</link><pubDate>Thu, 10 Apr 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/reforce/</guid><description>TL;DR: We present ReFoRCE, a Text-to-SQL agent that leads the Spider 2.0 leaderboard—the most challenging Text-to-SQL benchmark, where even advanced models like GPT-4o score around 10%. ReFoRCE tackles real-world deployment issues such as massive schemas, SQL dialect diversity, and complex queries. It uses table compression to handle long contexts, format restriction for accurate SQL generation, and iterative column exploration for better schema understanding. A self-refinement pipeline with self-consistency and parallel voting further boosts performance, achieving state-of-the-art scores of 31.</description></item><item><title>Fast Video Generation with Sliding Tile Attention</title><link>https://haoailab.com/blogs/sta/</link><pubDate>Tue, 18 Feb 2025 11:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/sta/</guid><description>TL;DR: Video generation with DiTs is painfully slow &amp;ndash; HunyuanVideo takes 16 minutes to generate just a 5-second video on an H100 with FlashAttention-3. Our sliding tile attention (STA) slashes this to 5 minutes with zero quality loss and no extra training required. Specifically, STA accelerates attention alone by 2.8–17x over FlashAttention-2 and 1.6–10x over FlashAttention-3. 
With STA and other optimizations, our solution boosts end-to-end generation speed by 2.98× compared to the FA3 full attention baseline, without quality loss or the need for training.</description></item><item><title>FastVideo</title><link>https://haoailab.com/summary/sta/</link><pubDate>Tue, 18 Feb 2025 11:00:00 -0800</pubDate><guid>https://haoailab.com/summary/sta/</guid><description>Make Video Generation Faster</description></item><item><title>Dynasor</title><link>https://haoailab.com/summary/dynasor-cot/</link><pubDate>Sun, 16 Feb 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/dynasor-cot/</guid><description>Making Reasoning Models More Token-Efficient</description></item><item><title>Dynasor: More Efficient Chain-of-Thought Through Certainty Probing</title><link>https://haoailab.com/blogs/dynasor-cot/</link><pubDate>Sun, 16 Feb 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/dynasor-cot/</guid><description>TL;DR: We observe that reasoning models often exhibit poor token efficiency: they waste many tokens second-guessing themselves. We develop Dynasor-CoT, a certainty-based approach for dynamically allocating inference compute for reasoning models. The intuition is that by probing reasoning models at intermediate steps, we can identify problems where the model maintains consistently high certainty in its answer and terminate them early. 
The method is plug-and-play, requiring no model modifications or training, yet it matches baseline accuracy on benchmarks like AMC23, AIME24, and MATH500 while reducing token consumption by 29% dataset-wide and up to 81% for single problems.</description></item><item><title>GameArena: Evaluating LLM Reasoning through Live Computer Games</title><link>https://haoailab.com/blogs/gamearena/</link><pubDate>Mon, 10 Feb 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/gamearena/</guid><description/></item><item><title>LMGame Bench</title><link>https://haoailab.com/summary/gamearena/</link><pubDate>Mon, 10 Feb 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/gamearena/</guid><description>Evaluating LLM Reasoning through Live Computer Games</description></item><item><title>Efficient LLM Scheduling by Learning to Rank</title><link>https://haoailab.com/blogs/vllm-ltr/</link><pubDate>Mon, 13 Jan 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/vllm-ltr/</guid><description>TL;DR: Traditional Large Language Model (LLM) serving systems rely on first-come-first-serve (FCFS) scheduling. When longer requests block shorter ones in the queue, this creates a cascade of delays that severely impacts overall system latency. LLM inference jobs are particularly challenging to schedule due to their highly unpredictable workloads and variable output lengths. 
We developed a novel learning-to-rank approach that predicts the relative ranking of output lengths, enabling a more efficient Shortest Job First-like scheduling policy.</description></item><item><title>vLLM-LTR</title><link>https://haoailab.com/summary/vllm-ltr/</link><pubDate>Mon, 13 Jan 2025 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/vllm-ltr/</guid><description>Efficient LLM Scheduling by Learning to Rank</description></item><item><title>MuxServe</title><link>https://haoailab.com/summary/muxserve/</link><pubDate>Mon, 20 May 2024 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/muxserve/</guid><description>Serving Multiple LLMs with Flexible Spatial-Temporal Multiplexing</description></item><item><title>MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving</title><link>https://haoailab.com/blogs/muxserve/</link><pubDate>Mon, 20 May 2024 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/muxserve/</guid><description>TL;DR: Efficiently serving multiple LLMs has emerged as a crucial and time-sensitive demand within the community, especially for LLM endpoint providers. In this blog, we show that the dynamic popularity of LLMs and the unbalanced resource utilization of LLM inference can be leveraged to achieve high GPU utilization and reduce serving cost. We introduce MuxServe, a novel serving system that efficiently serves multiple LLMs with flexible spatial-temporal multiplexing. 
MuxServe outperforms the spatial partitioning and temporal multiplexing baselines by up to $1.</description></item><item><title>CLLM</title><link>https://haoailab.com/summary/cllm/</link><pubDate>Mon, 06 May 2024 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/cllm/</guid><description>Consistency Large Language Models: A Family of Efficient Parallel Decoders</description></item><item><title>Consistency Large Language Models: A Family of Efficient Parallel Decoders</title><link>https://haoailab.com/blogs/cllm/</link><pubDate>Mon, 06 May 2024 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/cllm/</guid><description>TL;DR: LLMs have traditionally been regarded as sequential decoders, decoding one token after another. In this blog, we show that pretrained LLMs can be easily taught to operate as efficient parallel decoders. We introduce Consistency Large Language Models (CLLMs), a new family of parallel decoders capable of reducing inference latency by efficiently decoding an $n$-token sequence per inference step. Our research shows this process &amp;ndash; mimicking the human cognitive process of forming complete sentences in mind before articulating them word by word &amp;ndash; can be effectively learned by simply finetuning pretrained LLMs.</description></item><item><title>DistServe</title><link>https://haoailab.com/summary/distserve/</link><pubDate>Sun, 17 Mar 2024 12:00:00 -0800</pubDate><guid>https://haoailab.com/summary/distserve/</guid><description>Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation</description></item><item><title>Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation</title><link>https://haoailab.com/blogs/distserve/</link><pubDate>Sun, 17 Mar 2024 12:00:00 -0800</pubDate><guid>https://haoailab.com/blogs/distserve/</guid><description>TL;DR: LLM apps today have diverse latency requirements. 
For example, a chatbot may require a fast initial response (e.g., under 0.2 seconds) but only moderate decoding speed, which needs only to match human reading speed, whereas code completion requires a fast end-to-end generation time for real-time code suggestions.
In this blog post, we show that existing serving systems optimizing for throughput are not optimal under latency criteria. We advocate using goodput, the number of completed requests per second adhering to the Service Level Objectives (SLOs), as an improved measure of LLM serving performance to account for both cost and user satisfaction.</description></item><item><title/><link>https://haoailab.com/contact/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haoailab.com/contact/</guid><description>contact</description></item><item><title/><link>https://haoailab.com/home/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haoailab.com/home/</guid><description>home page for Hao Lab @ UCSD</description></item><item><title/><link>https://haoailab.com/math-examples/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haoailab.com/math-examples/</guid><description>This is an inline \(a^*=x-b^*\) equation.
This is an inline $a^*=x-b^*$ equation.
These are block equations:
\[a^*=x-b^*\] \[ a^*=x-b^* \] \[ a^*=x-b^* \] These are block equations using alternate delimiters:
$$a^*=x-b^*$$ $$ a^*=x-b^* $$ $$ a^*=x-b^* $$</description></item><item><title/><link>https://haoailab.com/people/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haoailab.com/people/</guid><description>people</description></item><item><title/><link>https://haoailab.com/publications/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haoailab.com/publications/</guid><description>publications</description></item><item><title>Dynasor-CoT Demo</title><link>https://haoailab.com/demo/dynasor-cot/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://haoailab.com/demo/dynasor-cot/</guid><description/></item></channel></rss>