Modular partners with AMD, and much more!

Watch the June 10th, 2025 announcement below

GPU portability for the world's most demanding AI

Our AI infrastructure powers the most advanced AI workloads with unparalleled performance, hosted in your cloud environment or in ours.

Start with the free Modular Community Edition

  • 500+ GenAI models

  • Customizable

  • Open source implementation

  • Portable across NVIDIA and AMD

  • Tiny containers

  • Multi-hardware support

  • Multi-cloud deployment

  • Full hardware control

  • Great documentation

  • State-of-the-art performance

  • Hardware agnostic

  • Multi-hardware

  • Write once, deploy anywhere

  • Backwards compatible

# Query a locally served, OpenAI-compatible endpoint (such as MAX at localhost:8000) with the standard OpenAI client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[
        {
          "role": "user",
          "content": "Who won the world series in 2020?"
        },
    ],
)

print(completion.choices[0].message.content)



# Total file size: 500MB. This Docker container deploys
# across GPUs and CPUs, all without vendor libraries like CUDA.

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
    --env "HF_TOKEN=" \
    -p 8000:8000 \
    docker.modular.com/modular/max-nvidia-base:nightly \
    --model-path meta-llama/Llama-3.3-70B-Instruct
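
Once the container is up, you can sanity-check the endpoint before sending real traffic. A minimal sketch, assuming the server exposes the standard OpenAI-compatible /v1/models route used by the chat example above:

from openai import OpenAI

# Point the standard OpenAI client at the container started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models the server reports; with the command above this should include
# meta-llama/Llama-3.3-70B-Instruct once the weights have finished loading.
for model in client.models.list():
    print(model.id)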

# Mojo tiled matmul kernel (excerpt): compile-time matrix dimensions from the C and A shapes.
alias M = C.shape[0]()
alias N = C.shape[1]()
alias K = A.shape[1]()

# Map each warp to a WM x WN sub-tile of this block's BM x BN output tile.
var warp_id = thread_idx.x // WARP_SIZE

warp_y = warp_id // (BN // WN)
warp_x = warp_id % (BN // WN)

C_warp_tile = C.tile[BM, BN](block_idx.y, block_idx.x).tile[WM, WN](
    warp_y, warp_x
)

# Tensor core primitive for MMA_M x MMA_N x MMA_K matrix-multiply-accumulate tiles.
mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

# Stage BM x BK and BK x BN tiles of A and B in shared memory for reuse by all warps in the block.
A_sram_tile = tb[A.dtype]().row_major[BM, BK]().shared().alloc()
B_sram_tile = tb[B.dtype]().row_major[BK, BN]().shared().alloc()

# Zero-initialized accumulator registers held in each thread's local (register) memory.
c_reg = (
    tb[C.dtype]()
    .row_major[WM // MMA_M, (WN * 4) // MMA_N]()
    .local()
    .alloc()
    .fill(0)
)

# March along the K dimension in BK-sized steps.
for k_i in range(K // BK):
    barrier()

    A_dram_tile = A.tile[BM, BK](block_idx.y, k_i)
    B_dram_tile = B.tile[BK, BN](k_i, block_idx.x)

    # Asynchronously copy the current A and B tiles from global memory (DRAM) into shared memory.
    copy_dram_to_sram_async[thread_layout = Layout.row_major(4, 8)](
        A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]()
    )
    copy_dram_to_sram_async[thread_layout = Layout.row_major(4, 8)](
        B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]()
    )

    async_copy_wait_all()
    barrier()

    # Slice out this warp's views of the shared-memory tiles.
    A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
    B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

    # These @parameter loops are unrolled at compile time; each iteration issues one tensor core MMA.
    @parameter
    for mma_k in range(BK // MMA_K):

        @parameter
        for mma_m in range(WM // MMA_M):

            @parameter
            for mma_n in range(WN // MMA_N):
                c_reg_m_n = c_reg.tile[1, 4](mma_m, mma_n)

                A_mma_tile = A_warp_tile.tile[MMA_M, MMA_K](mma_m, mma_k)
                B_mma_tile = B_warp_tile.tile[MMA_K, MMA_N](mma_k, mma_n)

                a_reg = mma_op.load_a(A_mma_tile)
                b_reg = mma_op.load_b(B_mma_tile)

                var d_reg_m_n = mma_op.mma_op(
                    a_reg,
                    b_reg,
                    c_reg_m_n,
                )

                c_reg_m_n.copy_from(d_reg_m_n)

# Write the accumulated register tiles back to this warp's portion of C in global memory.
@parameter
for mma_m in range(WM // MMA_M):

    @parameter
    for mma_n in range(WN // MMA_N):
        var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
        var c_reg_m_n = c_reg.tile[1, 4](mma_m, mma_n)
        mma_op.store_d(C_mma_tile, c_reg_m_n)
        
        
# Compile-time hardware dispatch: @parameter if resolves the branch at compile
# time, so only the code path for the target hardware is compiled.
fn kernel(x: SIMD) -> SIMD:
    @parameter
    if is_nvidia_gpu():
        return ptx_intrinsic["..."](x)
    elif is_amd_gpu():
        return llvm_intrinsic["..."](x)
    else:
        return fallback["cpu"](x)


~70% faster compared to vanilla vLLM

"Our collaboration with Modular is a glimpse into the future of accessible AI infrastructure. Our API now returns the first 2 seconds of synthesized audio on average ~70% faster compared to vanilla vLLM based implementation, at just 200ms for 2 second chunks. This allowed us to serve more QPS with lower latency and eventually offer the API at a ~60% lower price than would have been possible without using Modular’s stack."


Igor Poletaev

Chief Science Officer, Inworld

Slashed our inference costs by 80%

"Modular’s team is world class. Their stack slashed our inference costs by 80%, letting our customer dramatically scale up. They’re fast, reliable, and real engineers who take things seriously. We’re excited to partner with them to bring down prices for everyone, to let AI bring about wide prosperity."


Evan Conrad

CEO, San Francisco Compute

Confidently deploy our solution across NVIDIA and AMD

"Modular allows Qwerky to write our optimized code and confidently deploy our solution across NVIDIA and AMD solutions without the massive overhead of re-writing native code for each system."


Evan Owen

CTO, Qwerky AI

MAX Platform supercharges this mission

"At AWS we are focused on powering the future of AI by providing the largest enterprises and fastest-growing startups with services that lower their costs and enable them to move faster. The MAX Platform supercharges this mission for our millions of AWS customers, helping them bring the newest GenAI innovations and traditional AI use cases to market faster."


Bratin Saha

VP of Machine Learning & AI Services, AWS

Supercharging and scaling

"Developers everywhere are helping their companies adopt and implement generative AI applications that are customized with the knowledge and needs of their business. Adding full-stack NVIDIA accelerated computing support to the MAX platform brings the world’s leading AI infrastructure to Modular’s broad developer ecosystem, supercharging and scaling the work that is fundamental to companies’ business transformation."


Dave Salvator

Director, AI and Cloud, NVIDIA

Build, optimize, and scale AI systems on AMD

"We're truly in a golden age of AI, and at AMD we're proud to deliver world-class compute for the next generation of large-scale inference and training workloads… We also know that great hardware alone is not enough. We've invested deeply in open software with ROCm, empowering developers and researchers with the tools they need to build, optimize, and scale AI systems on AMD. This is why we are excited to partner with Modular… and we’re thrilled that we can empower developers and researchers to build the future of AI."


Vamsi Boppana

SVP of AI, AMD

Inworld · San Francisco Compute · Qwerky AI · AWS · NVIDIA · AMD

SOTA Performance

Speed of light performance at every level

We own the entire stack to deliver unprecedented performance - the latest models, running at incredible speeds, on the most advanced hardware.


Case Study - 2.5x cost savings with the fastest text-to-speech model ever

2.5x cost savings and 3.3x latency improvements

  • Performance out-of-the-box

    We enable performance that surpasses CUDA limitations with unprecedented simplicity, out of the box. See how we do it in our latest blog post.

  • Drive Total Cost Savings

    When we deliver speed at scale, we've been proven to bring enterprise costs down by up to 60%. Read a recent case study to see how we did it with text-to-speech.

  • Scale from 1 GPU to unlimited

    A Kubernetes-native control plane, router, and substrate specially designed for large-scale distributed AI serving. Learn more about what powers our scalability.

  • Benchmark our performance

    Everyone says they’re fast ... but we walk the walk. We empower you to test it for yourself. Find an optimized model and then follow our quickstart guide, or start with the timing sketch below.
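
A minimal timing sketch you can run against a locally served model, assuming an OpenAI-compatible endpoint (such as a local MAX server) is already running at localhost:8000 as in the quickstart above; the model name and prompt are placeholders:

import time

from openai import OpenAI

# Assumes an OpenAI-compatible server (for example, MAX serving) at this URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def time_to_first_token(prompt: str, model: str = "google/gemma-3-27b-it") -> float:
    """Return seconds elapsed until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start


latencies = sorted(time_to_first_token("Summarize the rules of chess.") for _ in range(5))
print(f"median time to first token: {latencies[len(latencies) // 2]:.3f} s")

Time to first token is only one number; for throughput, batch several requests and compare tokens per second across stacks.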

Hardware Portability

Achieve true AI hardware portability

  • Write once, deploy everywhere

    Our breakthrough compiler technology automatically generates optimized kernels for any hardware target, eliminating the need for platform-specific code.

  • Infrastructure Resilience

    Break free from GPU vendor lock-in. Modular delivers peak performance across NVIDIA and AMD - resilient infrastructure that adapts to any compute type.

BEST-IN-CLASS

The most advanced teams use Modular

Modular supports the entire AI lifecycle in one platform: Mojo for development, MAX for serving, and Mammoth for scale. Research to production has never been smoother.


Video - We unlocked 60% lower costs and 70% faster time-to-first-audio.

By unlocking optionality across the infrastructure, the application layer, and the hardware, Modular helped power the best inference experience for Inworld's users through reduced costs and breakthrough latency.

  • Start anywhere, use what you need

    Start with our free Community Edition. Scale with our Batch or Dedicated Endpoints, or work with us to customize for your enterprise.

  • Customize down to the silicon

    Customize everything, from model architectures to hardware-agnostic kernels. We helped Qwerky AI unlock 50% faster GPU performance.

  • Vertically integrated

    Mojo's MLIR foundation delivers raw kernel performance. MAX adds kernel fusion and batching. Mammoth orchestrates everything across thousands of nodes.

  • Open source

    We're democratizing blazing-fast AI by open-sourcing our entire stack. Join the mission. Explore our open kernels.

DETAILS MATTER

Smaller, faster & easier to deploy

90% smaller containers. Sub-second cold starts. Fully open source.
We eliminated deployment friction so you don't have to.

  • Faster build, pull, and deploy times

    Minimal dependencies. Faster installs. Smaller packages. Smoother deployments. Less complexity, more speed. And more time for your engineers to deliver value.

  • Improved security

    A leaner container reduces exposure to vulnerabilities, lowering compliance risks and potential downtime costs. A stack you have full access to customize yourself.

  • Lower resource consumption

    MAX serving: under 700MB. 90% smaller than vLLM. Reduce infrastructure cost: use less disk space, memory, and bandwidth. Get started with faster deployments now.

  • Simpler cross-hardware stack management

    No vendor lock-in. One stack for NVIDIA, AMD, CPU, or any other accelerator. Simpler deployment, easier debugging, unlimited possibilities.

Developer Approved


actually flies on the GPU

@ Sanika

"after wrestling with CUDA drivers for years, it felt surprisingly… smooth. No, really: for once I wasn’t battling obscure libstdc++ errors at midnight or re-compiling kernels to coax out speed. Instead, I got a peek at writing almost-Pythonic code that compiles down to something that actually flies on the GPU."


pure iteration power

@ Jayesh

"This is about unlocking freedom for devs like me, no more vendor traps or rewrites, just pure iteration power. As someone working on challenging ML problems, this is a big thing."


impressed

@ justin_76273

“The more I benchmark, the more impressed I am with the MAX Engine.”


performance is insane

@ drdude81

“I tried MAX builds last night, impressive indeed. I couldn't believe what I was seeing... performance is insane.”


easy to optimize

@ dorjeduck

“It’s fast which is awesome. And it’s easy. It’s not CUDA programming...easy to optimize.”


potential to take over

@ svpino

“A few weeks ago, I started learning Mojo 🔥 and MAX. Mojo has the potential to take over AI development. It's Python++. Simple to learn, and extremely fast.”


was a breeze!

@ NL

“Max installation on Mac M2 and running llama3 in (q6_k and q4_k) was a breeze! Thank you Modular team!”


high performance code

@ jeremyphoward

"Mojo is Python++. It will be, when complete, a strict superset of the Python language. But it also has additional functionality so we can write high performance code that takes advantage of modern accelerators."


one language all the way

@ fnands

“Tired of the two language problem. I have one foot in the ML world and one foot in the geospatial world, and both struggle with the 'two-language' problem. Having Mojo - as one language all the way through would be awesome.”


works across the stack

@ scrumtuous

“Mojo can replace the C programs too. It works across the stack. It’s not glue code. It’s the whole ecosystem.”


completely different ballgame

@ scrumtuous

“What @modular is doing with Mojo and the MaxPlatform is a completely different ballgame.”


AI for the next generation

@ mytechnotalent

“I am focusing my time to help advance @Modular. I may be starting from scratch but I feel it’s what I need to do to contribute to #AI for the next generation.”


surest bet for longterm

@ pagilgukey

“Mojo and the MAX Graph API are the surest bet for longterm multi-arch future-substrate NN compilation”


12x faster without even trying

@ svpino

“Mojo destroys Python in speed. 12x faster without even trying. The future is bright!”


feeling of superpowers

@ Aydyn

"Mojo gives me the feeling of superpowers. I did not expect it to outperform a well-known solution like llama.cpp."


very excited

@ strangemonad

“I'm very excited to see this coming together and what it represents, not just for MAX, but my hope for what it could also mean for the broader ecosystem that mojo could interact with.”


impressive speed

@ Adalseno

"It worked like a charm, with impressive speed. Now my version is about twice as fast as Julia's (7 ms vs. 12 ms for a 10 million vector; 7 ms on the playground. I guess on my computer, it might be even faster). Amazing."


amazing achievements

@ Eprahim

“I'm excited, you're excited, everyone is excited to see what's new in Mojo and MAX and the amazing achievements of the team at Modular.”


Community is incredible

@ benny.n

“The Community is incredible and so supportive. It’s awesome to be part of.”


huge increase in performance

@ Aydyn

"C is known for being as fast as assembly, but when we implemented the same logic on Mojo and used some of the out-of-the-box features, it showed a huge increase in performance... It was amazing."


Latest Blog Posts

MAX on GPU waiting list

Be the first to get lightning-fast inference on your GPUs. Be the envy of all your competitors and lower your compute spend.