Simon Mo (@simon_mo

Simon Mo

263 posts

Simon Mo

@simon_mo_

building @inferact for @vllm_project

Joined July 2018

Pinned
Simon Mo
@simon_mo_
Jan 22
vLLM has grown to 2000+ contributors scale with a diverse community of model, hardwares, and applications. I see @vllm_project on the path of becoming the world's inference engine and @inferact to accelerate AI progress. We cannot be more excited about the road ahead.
Woosuk Kwon
@woosuk_k
Jan 22
Today, we're proud to announce @inferact, a startup founded by creators and core maintainers of @vllm_project, the most popular open-source LLM inference engine. Our mission is to grow vLLM as the world's AI inference engine and accelerate AI progress by making inference cheaper
16K
Simon Mo
@simon_mo_
Mar 1, 2024
vLLM v0.3.3 is released with Starcoder2 @BigCodeProject and Inferentia @awscloud support. I'm also excited about the addition of guided decoding* (JSON, regex) in server leveraging @OutlinesOSS. *experimental, the schema take some time to compile but will be cached.
15K
Simon Mo
@simon_mo_
Apr 19, 2025
Some personal lessons learned on latest benchmark in @vllm_project: 🧵
vLLM
@vllm_project
Apr 19, 2025
After feedback about our v0.8.4 benchmark for @deepseek_ai R1, we rerun it with suggested changes: vLLM no EP, SGLang updated v0.4.5 -> post1 and EP -> DP, TensorRT-LLM uses overlap scheduler and tuned parameters. We are seeing good results! So why was there a difference? 🧵
13K
Simon Mo
@simon_mo_
Feb 21, 2024
vLLM v0.3.2 is released with support for OLMo and Gemma! github.com/vllm-project/v…
4.5K
Simon Mo
@simon_mo_
Apr 11, 2025
Replying to @jxmnop
In early 2024, @KaichaoYou spent almost 3 months debugging an impossible memory problem in vLLM that turns out bad interaction of nccl and cuda graph, which immediately benefited way beyond just @vllm_project users. Set NCCL_CUMEM_ENABLE=0 people.
[Core][Optimization] remove vllm-nccl by youkaichao · Pull Request #5091 · vllm-project/vllm
From github.com
8.1K
Simon Mo
@simon_mo_
Jan 15, 2024
We are hosting The Second vLLM Meetup in downtown SF on Jan 31st (Wed). Come to chat with vLLM maintainers about LLMs in production and inference optimizations! Thanks @IBM for hosting us.
The Second vLLM Meetup @ The AI Alliance · Luma
From luma.com
9.2K
Simon Mo
@simon_mo_
Sep 5, 2024
One thing I'm continuously impressed and surprised by the the power of the vLLM community. After adding so many amazing features (pipeline parallel, state of the art quantization, VLM, etc), the developer community get together for a performance sprint!
vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction
From vllm.ai
2.3K
Simon Mo
@simon_mo_
Mar 15, 2024
The vLLM Team is excited to announce our Third vLLM Meetup in San Carlos on April 2nd (Tuesday). We will be discussing feature updates and hear from you! We thank @Roblox for hosting the event! robloxandvllmmeetup2024.splashthat.com
8.5K
Simon Mo
@simon_mo_
Aug 9, 2024
Thank you @nvidia! The future of LLM inference is open!
vLLM
@vllm_project
Aug 8, 2024
🙏 Thank you @nvidia for sponsoring vLLM development. The DGX H200 machine is marvelous! We plan to use the machine for benchmarking and performance enhancement 🏎️.
1.6K
Simon Mo
@simon_mo_
Dec 12, 2023
They are literally only charging for the electricity. $0.0006/1K tokens for 100 tok/s gives $0.00006/s of compute. This is $0.216/hr. This is basically the cost electricity (700W H100 + 300W CPU)*$0.25/kWh = $0.25/hr Cheapest H100x1 prices is about $2/hr. A100x1 is at $1.5/hr
Susan Zhang
@suchenzang
Dec 12, 2023
grateful for the VCs who subsidize this entire market and lower all barriers of access 🙌
3.7K
Simon Mo
@simon_mo_
Jan 28, 2025
Our biggest milestone yet! I'm particularly excited how the vLLM contributor community organized from many organization to deliver a high quality V1 engine core. We are just getting started 🚀
vLLM
@vllm_project
Jan 27, 2025
🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more.
1.1K
Simon Mo
@simon_mo_
Jul 25, 2024
"In the recent Meta Llama 3.1 announcement, 8 out of 10 official partners for real time inference run vLLM as the serving engine for the Llama 3.1 models." Guess the remaining two 😉
vLLM
@vllm_project
Jul 25, 2024
Two exciting updates! * vLLM is already widely adopted, and we want to ensure it has open governance and longevity. We are starting to join @LFAIDataFdn! * We are doubling down in performance. Please checkout our roadmap. blog.vllm.ai/2024/07/25/lfa…
3.3K
Simon Mo
@simon_mo_
Jul 23, 2024
Super excited about Llama 3.1! The license now allows synthetic data generation and distillation, which will unlocks incredible innovations in the open source community.
vLLM
@vllm_project
Jul 23, 2024
🚀 Exciting news! In partnership with @AIatMeta, vLLM officially supports Llama 3.1! 🦙✨ For Llama 3.1 405B, vLLM supports FP8 quantization on single machine and pipeline parallelism for multi-node serving. Learn more in our latest blog post: blog.vllm.ai/2024/07/23/lla…
1.3K
Simon Mo
@simon_mo_
May 21, 2024
I had a lot of fun. So many great questions about @vllm_project from all the experts and hackers! Congrats to everyone who participated!
Runpod
@runpod
May 21, 2024
Replying to @runpod
709