CVPR 2026 Highlight

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

~1s parallel, quantization-free artistic mesh generation.

18x faster than AR-style mesh generation
Parallel: flow transformer denoising
No quantization: continuous vertex coordinates
[Figure: denoising process and generated mesh]

Abstract

Fast artistic mesh generation, without token-by-token decoding.

We present MeshFlow, a method for generating artist-like 3D meshes with continuous geometry and explicit connectivity. Instead of autoregressively predicting discrete face tokens, MeshFlow learns a compact continuous latent space with MeshVAE and generates the latent mesh representation in parallel using a flow-based diffusion transformer. This avoids coordinate quantization, scales linearly with mesh size, and produces high-quality meshes for unconditional, point-cloud-conditioned, and image-conditioned generation.

Continuous mesh latent

Positions, normals, and edge embeddings are compressed without discretizing vertex coordinates.

Parallel generation

A flow-based transformer denoises all latent tokens together instead of decoding one token at a time.

Artist-like outputs

The decoded meshes keep explicit vertices and edges, making them suitable for downstream 3D workflows.

Gallery (VAE)

MeshVAE compresses discrete meshes into compact latents without quantization.

Method

MeshVAE makes meshes compact; flow matching makes generation fast.

MeshFlow compresses a continuous vertex-based mesh representation into compact MeshVAE latents, then uses a flow-based transformer to generate all latent tokens in parallel before decoding explicit geometry and connectivity.

Motivation

Vertices are more compact than faces.

A mesh with n_v vertices is represented by n_v continuous vectors. Because meshes usually have two to three times more faces than vertices, this representation is shorter than face-oriented tokenizers and avoids coordinate quantization.
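The length argument above can be made concrete with a small calculation. This is only an illustrative sketch: it assumes a naive face tokenizer that emits one discrete token per quantized coordinate (9 tokens per triangle), which is a hypothetical baseline. Real face-oriented tokenizers often compress below that, so the exact ratio will differ.

```python
def vertex_sequence_length(num_vertices: int) -> int:
    """Vertex-based representation: one continuous vector per vertex."""
    return num_vertices


def face_token_sequence_length(
    num_faces: int, verts_per_face: int = 3, coords_per_vert: int = 3
) -> int:
    """Assumed naive face tokenizer: one token per quantized coordinate."""
    return num_faces * verts_per_face * coords_per_vert


n_v = 1000
n_f = 2 * n_v  # lower end of the "two to three times" ratio quoted above

print(vertex_sequence_length(n_v))      # 1000
print(face_token_sequence_length(n_f))  # 18000
```

Under these assumptions the vertex sequence is 18 times shorter, before any further compression by the VAE.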

Method overview

Parallel flow generation from continuous latents.

MeshFlow encodes positions, normals, and edge features into MeshVAE latents. A flow-based diffusion transformer denoises these latents together, then the decoder recovers vertices, normals, and mesh connectivity.
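To illustrate what "denoises these latents together" means, here is a minimal sketch of Euler integration of a flow model over a whole latent set. It is not the MeshFlow sampler: the function names, the time convention (noise at t = 0, data at t = 1), the step count, and the toy velocity field standing in for the transformer are all assumptions made for illustration.

```python
import random
from typing import Callable, List

Latent = List[List[float]]  # [num_tokens][channels]


def euler_flow_sample(
    velocity_fn: Callable[[Latent, float], Latent],
    noise: Latent,
    num_steps: int = 10,
) -> Latent:
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    x = [row[:] for row in noise]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)  # a single call yields velocities for every token
        x = [[xi + dt * vi for xi, vi in zip(xr, vr)] for xr, vr in zip(x, v)]
    return x


def make_toy_velocity(target: Latent) -> Callable[[Latent, float], Latent]:
    """Hypothetical stand-in for the transformer: points straight at `target`."""
    def velocity_fn(x: Latent, t: float) -> Latent:
        return [[(ti - xi) / (1.0 - t) for xi, ti in zip(xr, tr)]
                for xr, tr in zip(x, target)]
    return velocity_fn


rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
target = [[float(i + j) for j in range(4)] for i in range(8)]
sample = euler_flow_sample(make_toy_velocity(target), noise, num_steps=10)
```

The number of model calls equals the number of integration steps and does not grow with the number of tokens, in contrast to token-by-token decoding.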

Detailed VAE

TokenMerge forms a more compact mesh latent.

Inspired by pixel shuffle, TokenMerge downsamples vertex tokens into fewer latent tokens. TokenSplit reverses this process, while attention blocks refine geometry, edge embeddings, and the validity mask.
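The pixel-shuffle analogy can be sketched as a pure reshaping step: groups of consecutive tokens are folded into the channel dimension, and the inverse unfolds them. This is a simplified illustration only. Grouping consecutive tokens, the `ratio` parameter, and the absence of learned projections or attention are assumptions; the actual TokenMerge and TokenSplit layers are described in the paper.

```python
from typing import List

Tokens = List[List[float]]  # [num_tokens][channels]


def token_merge(tokens: Tokens, ratio: int) -> Tokens:
    """Fold every `ratio` consecutive tokens into one token with ratio*C channels."""
    if len(tokens) % ratio != 0:
        raise ValueError("token count must be divisible by ratio (pad first)")
    return [
        [c for tok in tokens[i:i + ratio] for c in tok]
        for i in range(0, len(tokens), ratio)
    ]


def token_split(merged: Tokens, ratio: int) -> Tokens:
    """Inverse of token_merge: unfold channels back into `ratio` tokens each."""
    out: Tokens = []
    for tok in merged:
        if len(tok) % ratio != 0:
            raise ValueError("channel count must be divisible by ratio")
        width = len(tok) // ratio
        out.extend(tok[j * width:(j + 1) * width] for j in range(ratio))
    return out


vertex_tokens = [[float(i), float(i) + 0.5] for i in range(8)]  # 8 tokens, 2 channels
latent_tokens = token_merge(vertex_tokens, ratio=4)             # 2 tokens, 8 channels
restored = token_split(latent_tokens, ratio=4)                  # 8 tokens, 2 channels
```

In this sketch the round trip is lossless because no channels are dropped; any compression in a real VAE would come from the learned layers around the reshaping.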

VAE comparisons

Continuous codes avoid quantization loss.

MeshVAE reconstructs meshes in continuous space with only 512 latent codes. Compared with quantized tokenizers, it preserves fine geometry and topology with far fewer variables.

DiT

Choosing how to inject shape information.

We study two shape-conditioning schemes: cross-attention on Shape2VecSet-style tokens, and 3D RoPE on vertex XYZ. Cross-attention suits inference but trains slowly; RoPE converges faster, with 32³ voxelization bridging the train–inference gap.
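For readers unfamiliar with rotary embeddings over 3D coordinates, the following is a minimal sketch of one common construction: the channels are split into three equal chunks, and each chunk is rotated by angles proportional to the x, y, or z coordinate. The channel split, the frequency base of 10000, and the function names are assumptions and may not match the paper's design.

```python
import math
from typing import List, Sequence


def rope_1d(vec: Sequence[float], pos: float, base: float = 10000.0) -> List[float]:
    """Rotate consecutive channel pairs by pos * base**(-i/d), i = even index."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def rope_3d(vec: Sequence[float], xyz: Sequence[float], base: float = 10000.0) -> List[float]:
    """Apply 1D RoPE per axis on three equal channel chunks (assumed layout)."""
    d = len(vec)
    if d % 6 != 0:
        raise ValueError("channel count must be divisible by 6")
    chunk = d // 3
    out: List[float] = []
    for axis in range(3):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], xyz[axis], base)
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
k = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
score = dot(rope_3d(q, (1.0, 2.0, 3.0)), rope_3d(k, (4.0, 0.0, -1.0)))
# Shift both positions by the same offset (0.5, -2.0, 7.0).
shifted = dot(rope_3d(q, (1.5, 0.0, 10.0)), rope_3d(k, (4.5, -2.0, 6.0)))
print(abs(score - shifted) < 1e-9)  # True
```

The check at the end shows the property that makes RoPE attractive here: the attention score between two tokens depends only on their relative 3D offset.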

What matters

Compact latents for fast mesh generation.

We introduce a continuous mesh representation that stores vertex positions, outward normals, and connectivity per vertex. Inspired by SpaceMesh, adjacency is encoded with contrastively learned edge embeddings instead of discrete face tokens.
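As a rough intuition for how connectivity can be read from continuous per-vertex vectors, the sketch below connects two vertices whenever their embeddings are closer than a threshold. This is a hypothetical decoding rule chosen for simplicity: the Euclidean distance, the threshold, and the function name are assumptions, and the similarity measure actually used by MeshFlow is not specified on this page.

```python
import math
from typing import List, Sequence, Set, Tuple


def decode_edges(
    embeddings: Sequence[Sequence[float]], threshold: float
) -> Set[Tuple[int, int]]:
    """Assumed rule: vertices i < j share an edge if their embeddings are close."""
    edges: Set[Tuple[int, int]] = set()
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(embeddings[i], embeddings[j]) < threshold:
                edges.add((i, j))
    return edges


# Toy embeddings: vertices 0 and 1 are close, as are vertices 2 and 3.
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
print(sorted(decode_edges(toy_embeddings, threshold=0.5)))  # [(0, 1), (2, 3)]
```

A contrastive objective would be what pulls the embeddings of connected vertices together and pushes unconnected ones apart, so that a rule of this kind becomes meaningful.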

We design MeshVAE to map this representation into compact continuous latents. TokenMerge downsamples along the vertex dimension, so MeshVAE reconstructs accurately with fewer variables while avoiding coordinate quantization.

A flow-based transformer generates the latent mesh in parallel rather than autoregressively decoding tokens. Inference scales linearly with mesh size and runs 18x faster than AR-style mesh generation.

Q&A

Common questions about MeshFlow.

What are the main contributions of MeshFlow?

MeshFlow makes artist-like mesh generation fast and quantization-free through three key ideas. First, we introduce a continuous mesh representation that keeps vertex positions, normals, and contrastively learned edge embeddings instead of discrete face tokens. Second, we design MeshVAE to compress this representation into compact continuous latents with TokenMerge and TokenSplit, avoiding coordinate quantization while preserving geometry and topology. Third, a flow-based diffusion transformer denoises all latent tokens in parallel, enabling ~1s inference that scales linearly with mesh size.

Why do you use voxelization + 3D RoPE?

In our initial design, we uniformly sample 32,768 points for shape conditioning and feed them into a pre-trained shape encoder with an architecture analogous to Shape2VecSet, yielding 2,048 shape tokens. We then use cross-attention (CA) to condition the DiT on these tokens and add a vertex-count signal to the timestep embedding. This paradigm supports straightforward inference but requires longer training. As an alternative, we embed ground-truth vertex XYZ with 3D RoPE, which converges much faster but creates a domain gap at inference, because uniformly sampled point clouds differ from the training-time vertex distribution. To bridge this gap, we coarsely voxelize ground-truth vertices during training and inject the resulting shape priors into RoPE; a resolution of 32³ offers a practical trade-off between artist-like topology and shape detail. See Sec. 3.3 of our paper for the full discussion.
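The coarse voxelization step can be pictured as snapping each coordinate to a 32³ grid, so that ground-truth vertices and sampled surface points falling in the same cell yield the same position. The sketch below is an assumption-laden illustration: the [-1, 1] normalization range, the function name, and the use of voxel centers are not taken from the paper, and how the result enters RoPE is left to Sec. 3.3.

```python
from typing import List, Sequence, Tuple

Index = Tuple[int, ...]
Point = Tuple[float, ...]


def snap_to_voxel_grid(
    points: Sequence[Sequence[float]],
    resolution: int = 32,
    lo: float = -1.0,   # assumed lower bound of the normalized shape
    hi: float = 1.0,    # assumed upper bound of the normalized shape
) -> Tuple[List[Index], List[Point]]:
    """Return the voxel index and voxel-center coordinate for every point."""
    cell = (hi - lo) / resolution
    indices: List[Index] = []
    centers: List[Point] = []
    for p in points:
        # Clamp so that points exactly on the upper bound land in the last cell.
        idx = tuple(min(resolution - 1, max(0, int((c - lo) // cell))) for c in p)
        indices.append(idx)
        centers.append(tuple(lo + (i + 0.5) * cell for i in idx))
    return indices, centers


pts = [(0.0, 0.0, 0.0), (0.01, 0.02, 0.03), (1.0, 1.0, 1.0)]
voxel_indices, voxel_centers = snap_to_voxel_grid(pts)
print(voxel_indices[0], voxel_indices[1])  # (16, 16, 16) (16, 16, 16)
```

The first two points differ slightly but share a voxel, which is the effect that narrows the gap between training-time vertices and inference-time point samples.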

Can inference run without a point cloud?

Not with the released model, which expects a surface point cloud as input to provide coarse shape guidance. We are actively working toward single-image conditioning for artistic mesh generation.
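If you only have a triangle mesh and need a surface point cloud, one standard approach is area-weighted triangle sampling with uniform barycentric coordinates. The sketch below is generic and not part of the MeshFlow release: the function names, the seed handling, and the output format are assumptions, and the input format the released model expects should be checked against its documentation.

```python
import math
import random
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def triangle_area(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Half the norm of the cross product of two edge vectors."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface_points(
    vertices: Sequence[Sequence[float]],
    faces: Sequence[Tuple[int, int, int]],
    num_points: int,
    seed: int = 0,
) -> List[Vec3]:
    """Pick triangles proportionally to area, then sample uniformly inside each."""
    rng = random.Random(seed)
    areas = [triangle_area(*(vertices[i] for i in f)) for f in faces]
    chosen = rng.choices(list(faces), weights=areas, k=num_points)
    points: List[Vec3] = []
    for f in chosen:
        a, b, c = (vertices[i] for i in f)
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)  # square root makes the sample uniform over the triangle
        w = (1.0 - s, s * (1.0 - r2), s * r2)
        points.append(tuple(w[0] * a[k] + w[1] * b[k] + w[2] * c[k] for k in range(3)))
    return points


# Unit square in the z = 0 plane, split into two triangles.
square_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
square_faces = [(0, 1, 2), (0, 2, 3)]
cloud = sample_surface_points(square_vertices, square_faces, num_points=1000)
```

A mesh whose triangles all have zero area would make `random.choices` raise a `ValueError`, so degenerate inputs should be filtered beforehand.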

Acknowledgements

We are deeply grateful to Minghao Chen, Jianyuan Wang, Zihang Lai, and Thu Nguyen-Phuoc for their discussions and support. We also thank the authors of related mesh generation works, including MeshGPT, PolyDiff, SpaceMesh, MeshCraft, FastMesh, LATTICE, and LATO. This project page is also greatly inspired by the VGGT-Omega project page; we sincerely thank them for their excellent work.

BibTeX

@inproceedings{li2026meshflow,
  author    = {Weiyu Li and Antoine Toisoul and Tom Monnier and Roman Shapovalov and Rakesh Ranjan and Ping Tan and Andrea Vedaldi},
  title     = {{MeshFlow}: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer},
  booktitle = {Proceedings of the {IEEE/CVF} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year      = {2026},
  note      = {Highlight},
}