Z-Image: Efficient Image Generation

6B parameter Single-Stream Diffusion Transformer • 8 NFE sub-second inference • Bilingual text rendering • Fully open source

Inspiration Gallery

Copy Prompts & One-Click Design Replication.

Latest Updates

Introducing Z-Image: Efficient Image Generation with Single-Stream Diffusion Transformer

We're proud to release Z-Image, a 6B-parameter model built on a Single-Stream Diffusion Transformer architecture. With only 8 NFEs (Number of Function Evaluations), Z-Image-Turbo achieves sub-second inference on enterprise GPUs and runs smoothly on consumer devices with 16GB VRAM. Experience photorealistic quality that rivals models 10x larger.

Dec 15, 2024

Z-Image-Base: Unlocking Community Innovation

The release of Z-Image-Base marks a pivotal moment for the open-source diffusion community. This non-distilled foundation model enables full fine-tuning, LoRA training, and custom development. More than just a model release, we're sharing our groundbreaking training methodology to empower researchers and developers worldwide.

Dec 12, 2024

Bilingual Text Rendering: A Technical Breakthrough

Z-Image achieves industry-leading bilingual text rendering for English and Chinese with unprecedented accuracy. Our Single-Stream DiT architecture natively understands typography across languages, solving a challenge that has plagued image generation models for years. See how we achieve pixel-perfect text in photorealistic scenes.

Dec 10, 2024

Z-Image-Edit: Natural Language Image Editing

Introducing Z-Image-Edit, our specialized variant for image-to-image transformations. Using natural language instructions, you can precisely edit images with creative style transfer, object manipulation, and targeted modifications. Built on Z-Image's efficient architecture for fast, high-quality edits.

Dec 8, 2024

The Architecture Behind Z-Image: Single-Stream Diffusion Transformer

Deep dive into Z-Image's revolutionary Single-Stream Diffusion Transformer architecture. Learn how we achieved an order of magnitude better parameter efficiency than competing models, enabling state-of-the-art quality with just 6B parameters and 8 inference steps. Technical insights into our distillation methodology.

Dec 5, 2024

Community Spotlight: Z-Image as a Universal Upscaler

The community has discovered an innovative use case: using Z-Image as a second-pass upscaler for any model. ComfyUI workflows show that Z-Image can enhance images from FLUX, SDXL, SD 1.5, and more with results comparable to commercial services. Fast enough to integrate directly into your workflow.

Dec 3, 2024

ControlNet Integration: Precision Meets Speed

Z-Image now supports ControlNet, including Union 2.0, for precise control over generation. Combine Z-Image's speed with pose control, depth guidance, edge detection, and more. Perfect for professional workflows requiring both accuracy and efficiency. See examples of object removal and creative editing.

Dec 1, 2024

Running Z-Image on Consumer Hardware: A Complete Guide

Z-Image is designed to be accessible. This comprehensive guide covers running Z-Image-Turbo on consumer GPUs with 16GB VRAM, optimization techniques for 4GB VRAM devices, and integration with ComfyUI, Diffusers, and DiffSynth-Studio. Includes performance benchmarks and best practices.

Nov 28, 2024

What is Z-Image?

Z-Image is a 6B-parameter image generation model built on a Single-Stream Diffusion Transformer architecture. Developed by Tongyi-MAI (Alibaba), it achieves photorealistic quality that rivals models 10x larger while running on consumer hardware with only 16GB VRAM. Z-Image-Turbo needs only 8 NFEs (Number of Function Evaluations), giving sub-second generation on enterprise GPUs and fast generation on consumer devices. The model excels at bilingual text rendering (English & Chinese), photorealistic generation, and instruction following, all while being fully open source. What sets Z-Image apart is the release of not just the model but also the training methodology, so the whole open-source community can build on this efficient foundation.

Sub-second inference (8 NFEs)

6B params = 60B+ quality

Bilingual text (EN/ZH)

Fully open source

Single-Stream Diffusion Transformer Architecture

Z-Image employs a revolutionary Single-Stream DiT architecture that processes information more efficiently than traditional multi-stream approaches.

Text Encoder

input

Bilingual encoder for English & Chinese

Shared embedding space

Latent Representation

processing

Compressed latent space representation

Efficient dimensionality reduction

Single-Stream DiT Blocks

core

Unified processing stream for efficiency

6B parameters total

Self-attention mechanisms

Cross-attention with text

Feed-forward networks

Layer normalization

Denoising Process

processing

Iterative refinement with only 8 NFEs

Distilled for speed

Image Decoder

output

High-fidelity image reconstruction

Supports up to 2048x2048
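
To make the "compressed latent space" and the resolution limit above more concrete, the sketch below estimates how many image tokens the DiT blocks would have to process at a given resolution. The 8x VAE downsampling factor and the 2x2 patch size are assumptions borrowed from typical latent diffusion transformers, not figures stated for Z-Image, and `image_token_count` is a hypothetical helper.

```python
def image_token_count(height: int, width: int,
                      vae_factor: int = 8, patch: int = 2) -> int:
    """Estimate the number of image tokens for one generated image.

    vae_factor and patch are ASSUMED values for illustration only;
    they are not confirmed Z-Image settings.
    """
    stride = vae_factor * patch  # pixels covered by one token along each axis
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return (height // stride) * (width // stride)


# Token counts at a few square resolutions, up to the stated 2048x2048 limit.
for side in (512, 1024, 2048):
    print(f"{side}x{side}: {image_token_count(side, side)} tokens")
```

Under these assumptions, doubling the side length quadruples the token count, so the largest supported resolution costs far more attention compute than 1024x1024.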

Key Innovations

Single-Stream Processing

Unlike multi-stream models, Z-Image uses a unified stream that reduces parameter redundancy while maintaining quality

Advanced Distillation

Turbo variant distills the base model down to 8 NFEs without quality loss using novel training techniques

Parameter Efficiency

Achieves results comparable to 60B+ parameter models with only 6B parameters through architectural optimizations
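
The parameter-redundancy argument for single-stream processing can be illustrated with a rough weight count. This is only a sketch under simplifying assumptions: each block is a standard transformer block (attention plus a 4x MLP), a two-stream design keeps a full separate set of block weights per modality, and the width and depth used are hypothetical numbers rather than Z-Image's actual configuration.

```python
def block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weights in one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up-projection and down-projection
    return attention + mlp


def stack_params(d_model: int, depth: int, streams: int) -> int:
    """Weights in a stack where every stream owns its own copy of each block."""
    return streams * depth * block_params(d_model)


# Hypothetical width and depth, chosen only to give billions-scale numbers.
D_MODEL, DEPTH = 3072, 30

single = stack_params(D_MODEL, DEPTH, streams=1)
dual = stack_params(D_MODEL, DEPTH, streams=2)
print(f"single-stream: {single / 1e9:.2f}B weights")
print(f"two-stream:    {dual / 1e9:.2f}B weights")
```

In this idealized model the two-stream stack carries twice the block weights at the same width and depth; real multi-stream models share some components, so the actual gap depends on the design.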

Performance Benchmarks

Z-Image delivers exceptional performance across all key metrics, outperforming models significantly larger in size.

⚡

8 NFEs

3-6x Faster

vs 20-50 steps for competitors

🎯

6B params

Order of Magnitude Efficient

Quality of 60B+ models

💻

16GB VRAM

Consumer Friendly

Runs on standard GPUs

🌐

EN + ZH

Bilingual Support

Native text rendering

Metric (Z-Image compared with FLUX and SDXL)

Inference Speed (NFEs, lower is better): Z-Image 8 steps 👑

Model Size (lower is better): Z-Image 6B params 👑; 6.6B params listed for a comparison model

VRAM Requirements (lower is better): Z-Image 16 GB 👑

Generation Time (lower is better): Z-Image 1 second 👑

Text Rendering Quality (higher is better): Z-Image score 95 👑

Detail Preservation (higher is better): Z-Image score 92 👑

How to Start with Z-Image

Sign Up

Create your free account and get instant credits to start generating

Enter Prompt

Describe the image you want to generate in natural language

Generate

Click generate and watch AI create your image in seconds

Download & Use

Save and use your generated images for any project

Key Features

6B Parameters

Powerful 6 billion parameter model delivering exceptional image quality and detail that rivals models 10x larger, while maintaining efficiency and speed.

Single-Stream DiT

Revolutionary Diffusion Transformer architecture with single-stream processing that achieves unprecedented parameter efficiency and coherent generation.

Bilingual Text Rendering

Industry-leading bilingual text rendering with native support for English and Chinese, producing accurate typography where other models fail.

Sub-Second Generation

Blazing-fast 8-NFE turbo mode achieves sub-second inference on H800 GPUs and rapid generation on consumer hardware, 3-6x faster than competitors.

Natural Language Editing

Z-Image-Edit variant enables precise image-to-image transformations using natural language instructions for creative editing and style transfer.

Fully Open Source

Complete open-source release including all three variants (Turbo, Base, Edit), weights, code, and groundbreaking training methodology for community innovation.

Consumer-Friendly VRAM

Runs comfortably on consumer devices with 16GB VRAM, and can be optimized for GPUs with as little as 4GB VRAM, making it accessible to all.

Photorealistic Quality

Excels at generating highly detailed, photorealistic images with superior detail preservation in distant elements and backgrounds compared to similar-sized models.

ControlNet Integration

Full ControlNet support including Union 2.0 for precise control over pose, depth, edges, and more. Perfect for professional workflows requiring accuracy.

Fine-Tuning Ready

Z-Image-Base enables full fine-tuning, LoRA training, and distillation. Tools like DiffSynth-Studio provide comprehensive training support with low-VRAM optimization.

Universal Upscaler

Use Z-Image as a second-pass upscaler/enhancer for any model (FLUX, SDXL, SD 1.5). Results rival commercial services with fast, integrated workflow processing.

Rich Ecosystem

Extensive community support with ComfyUI nodes, Diffusers pipelines, Replicate API, AI Runner integration, and comprehensive documentation across platforms.

What the Community Says About Z-Image

"Z-Image has completely transformed our creative workflow. The quality and speed are unmatched in the industry."

Sarah Chen, Creative Director at PixelCraft

"The bilingual text support is a game-changer for our international campaigns. Finally, AI that understands multiple languages!"

David Martinez, Marketing Lead at GlobalBrand

"As an artist, I appreciate how Z-Image enhances rather than replaces creativity. The instruction editing is pure magic."

Emily Thompson, Digital Artist

"The open source nature and community support make Z-Image the best choice for developers building AI applications."

Michael Zhang, AI Engineer at TechStart

"Incredible speed without sacrificing quality. Z-Image Turbo has cut our production time by 80%."

Jessica Lee, Product Designer at DesignHub

"The 6B parameter model delivers consistently stunning results. It's become an essential tool in our design process."

Alex Rodriguez, Art Director at CreativeStudio

Frequently Asked Questions

Z-Image is a powerful and highly efficient image generation model with 6B parameters, using a Single-Stream Diffusion Transformer architecture. It delivers state-of-the-art photorealistic image generation that matches or exceeds models an order of magnitude larger, while running on consumer hardware with only 16GB VRAM.

Z-Image comes in three variants: (1) Z-Image-Turbo - a distilled version optimized for speed with only 8 NFEs and sub-second inference; (2) Z-Image-Base - the non-distilled foundation model for community fine-tuning and custom development; (3) Z-Image-Edit - a specialized variant for image editing tasks with natural language instruction following.

Z-Image is smaller (6B parameters), faster (8 NFEs vs 20-50), and more efficient than both FLUX and SDXL while delivering comparable or superior quality. It offers better detail preservation than SDXL and native bilingual text support. The community considers it a potential SDXL replacement for many use cases.

Beyond releasing a powerful model, Z-Image's team shared their advanced distillation training methodology with the open-source community. This enables researchers and developers to apply similar efficient training techniques to their own models, raising the bar for the entire diffusion model ecosystem.

Yes! Z-Image-Turbo fits comfortably within 16GB VRAM on consumer devices. With optimization techniques, it can even run on GPUs with only 4GB VRAM. It achieves sub-second generation on enterprise H800 GPUs and fast generation on consumer hardware.
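
A back-of-the-envelope estimate helps show why a 6B-parameter model fits the 16GB figure. The sketch below counts only the memory taken by the transformer weights at different numeric precisions; it ignores the text encoder, the image decoder, and activations. Which precisions Z-Image actually ships in, and which optimizations reach 4GB, are not stated here, so the dtype table is an assumption for illustration.

```python
# Bytes needed to store one parameter at each precision (standard sizes).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# 6B parameters, as stated for Z-Image; the precisions are hypothetical choices.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(6e9, dtype):.1f} GiB")
```

On this estimate, 16-bit weights for 6B parameters take roughly 11 GiB, which leaves headroom within 16GB for the other components. Getting close to 4GB would need something like lower-precision weights or offloading parts of the model to system memory.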

Yes, Z-Image has excellent bilingual text rendering capabilities, supporting both English and Chinese with high accuracy. This is a significant advantage over many competing models that struggle with text generation.

Absolutely! The community has created workflows using Z-Image as a second-pass upscaler/enhancer for images from other models like FLUX, SD 1.5, SDXL, and others. Results are comparable to commercial upscaling services, and it's fast enough to integrate directly into workflows.

Z-Image-Base is the non-distilled foundation model released to unlock the full potential for community-driven fine-tuning and custom development. This release is significant because it allows the community to create specialized variants, LoRAs, and custom models based on Z-Image's powerful architecture.

Yes! Z-Image supports ControlNet for precise control over generation. There's even a ControlNet Union 2.0 version available. The community has showcased impressive results combining Z-Image-Turbo with ControlNet for tasks like object removal and pose control.

Z-Image supports multiple integration methods: ComfyUI (with community nodes and utilities), Diffusers (official pipelines), DiffSynth-Studio (for LoRA training), AI Runner, and the Replicate API. Check the official GitHub repository for detailed setup instructions.

NFE (Number of Function Evaluations) counts how many times the denoising network is run to generate an image; for a simple sampler this equals the number of inference steps. Fewer NFEs mean faster generation. Z-Image-Turbo achieves excellent quality with only 8 NFEs, while competitors often require 20-50 steps, making it significantly faster.
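
The relationship between steps and NFEs can be sketched with a toy sampler. The code below is not Z-Image's sampler: it is a plain fixed-step Euler loop with a stand-in `toy_denoiser` function in place of the network (both names are hypothetical), and it simply counts how many times the "network" is called.

```python
from typing import Callable, Tuple


def euler_sample(denoiser: Callable[[float, float], float],
                 x: float, num_steps: int) -> Tuple[float, int]:
    """Step from t=1 (noise) toward t=0 (image), counting network calls."""
    nfe = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt       # current time, always strictly positive
        v = denoiser(x, t)     # one function evaluation per step
        nfe += 1
        x = x - dt * v         # first-order (Euler) update
    return x, nfe


def toy_denoiser(x: float, t: float) -> float:
    """Stand-in for the network: velocity that carries x toward a target of 0.0."""
    return x / t


for steps in (8, 20, 50):
    _, nfe = euler_sample(toy_denoiser, x=1.0, num_steps=steps)
    print(f"{steps} steps -> {nfe} evaluations, {nfe / 8:.2f}x the cost of 8")
```

If every evaluation costs the same, going from 20-50 steps down to 8 cuts the number of network calls by a factor of 2.5 to 6.25, roughly in line with the "3-6x faster" figure quoted elsewhere on this page. Samplers that evaluate the network more than once per step (for example with classifier-free guidance) have a higher NFE than their step count.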

Yes! With the release of Z-Image-Base, you can perform full fine-tuning, LoRA training, and distillation training. Tools like DiffSynth-Studio provide comprehensive support for training workflows with low-VRAM optimization.

Z-Image uses a Single-Stream Diffusion Transformer (DiT) architecture that processes information more efficiently than traditional multi-stream approaches. This contributes to the model's exceptional parameter efficiency and speed while maintaining high quality output.

Yes! Z-Image is fully open source, including the model weights, codebase, and training methodology. The team released all three variants (Turbo, Base, and Edit) to empower the community to build, fine-tune, and innovate freely.

Z-Image stands out through its unprecedented efficiency (6B params matching 60B+ models), blazing speed (8 NFEs vs 20-50), bilingual text rendering, and the groundbreaking release of both the model and training methodology to the open-source community. It's accessible on consumer hardware while delivering professional-grade results.

Research Paper

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Tongyi-MAI (Alibaba)

Abstract

Z-Image is a powerful and highly efficient image generation model with 6B parameters that employs a revolutionary Single-Stream Diffusion Transformer architecture. The model achieves state-of-the-art photorealistic image generation that matches or exceeds models an order of magnitude larger, while running on consumer hardware with only 16GB VRAM. Through advanced distillation techniques, Z-Image-Turbo generates high-quality images with only 8 NFEs, achieving sub-second inference latency.

Key Contributions

Single-Stream Diffusion Transformer Architecture

A novel architecture that processes information through a unified stream rather than multiple parallel streams, achieving an order of magnitude better parameter efficiency.

Significance

Enables a 6B-parameter model to match the quality of 60B+ parameter models

Advanced Distillation Methodology

Groundbreaking distillation techniques that reduce inference steps from 50 to 8 without quality degradation, shared openly with the research community.

Significance

3-6x faster generation than competitors, with the methodology contributed to the open-source community

Native Bilingual Text Rendering

Architecture-level support for bilingual text rendering in English and Chinese with unprecedented accuracy.

Significance

Solves a major challenge in image generation and delivers industry-leading text quality

Comprehensive Model Family

Three specialized variants (Turbo, Base, Edit) optimized for different use cases, all built on the same efficient foundation.

Significance

Enables diverse applications from real-time generation to research and creative editing