
Conversation

@kashif (Contributor) commented Nov 23, 2025

What does this PR do?

  • Removes the redundant txt_seq_lens plumbing from all QwenImage pipelines and modular steps; the transformer now infers text length from encoder inputs/masks and validates optional overrides.
  • Builds a lightweight broadcastable attention mask from encoder_hidden_states_mask inside the double-stream attention, avoiding full seq_len² masks while keeping padding tokens masked (see the sketch after this list).
  • Adjusts QwenImage Transformer/ControlNet RoPE to take a single text length and documents the fallback behavior.
  • Adds regression tests to ensure short txt_seq_lens values and encoder masks are handled safely.
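
As a rough illustration of the masking approach described above, here is a minimal sketch; the function name, token ordering, and shapes are assumptions for illustration, not the PR's exact implementation:

```python
import torch

def joint_attention_mask(encoder_hidden_states_mask: torch.Tensor, image_seq_len: int) -> torch.Tensor:
    # encoder_hidden_states_mask: (batch, txt_seq_len), truthy for real text tokens.
    # Image tokens are never padded, so their part of the mask is all ones.
    batch_size = encoder_hidden_states_mask.shape[0]
    image_mask = encoder_hidden_states_mask.new_ones(batch_size, image_seq_len, dtype=torch.bool)
    joint = torch.cat([encoder_hidden_states_mask.bool(), image_mask], dim=1)
    # Shape (batch, 1, 1, txt_seq_len + image_seq_len): broadcasts over heads and
    # query positions, so no dense (seq_len, seq_len) matrix is materialized.
    return joint[:, None, None, :]
```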

Fixes #12344


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif requested a review from @sayakpaul on November 23, 2025 at 18:03
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul requested a review from @yiyixuxu on November 24, 2025 at 01:57
@dxqb commented Nov 29, 2025

just a few comments, not a full review:

  • there is some overlap with Fix qwen encoder hidden states mask #12655
  • this code has the same issue mentioned in Fix qwen encoder hidden states mask #12655: it expects boolean semantics from a FloatTensor, but float attention masks are interpreted as additive biases (e.g. by SDPA), not as keep/drop flags (see the illustration after this list)
  • Could you clarify what the purpose of this PR is?
    If the purpose is to remove the txt_seq_lens parameters, and infer the sequence lengths from the attention mask: why is it still a parameter of the transformer model?
    If the purpose is towards passing sequence lengths to the attention dispatch (see Qwen Image: txt_seq_lens is redundant and not used #12344 (comment)), the sequence lengths for each batch sample must be inferred from the mask and passed to the transformer blocks, not only the max sequence length across all batch samples for RoPE
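
A minimal illustration of the boolean-vs-float mask point above, using PyTorch SDPA semantics (a boolean mask selects positions, while a float mask is added to the attention logits); this is illustrative code, not code from either PR:

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 1, 4, 8)

# Boolean mask: True = attend, False = masked out.
bool_mask = torch.tensor([True, True, False, False]).view(1, 1, 1, 4)

# Casting it to float (1.0 / 0.0) does NOT give an equivalent mask: SDPA adds a
# float mask to the logits, so 0.0 masks nothing and 1.0 only biases the scores.
wrong_float_mask = bool_mask.float()

# The correct float equivalent is an additive mask with -inf at masked positions.
additive_mask = torch.zeros(1, 1, 1, 4).masked_fill(~bool_mask, float("-inf"))

out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
out_wrong = F.scaled_dot_product_attention(q, k, v, attn_mask=wrong_float_mask)
out_additive = F.scaled_dot_product_attention(q, k, v, attn_mask=additive_mask)
print(torch.allclose(out_bool, out_additive))  # True
print(torch.allclose(out_bool, out_wrong))     # False: padding tokens still attended
```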

```python
        raise ValueError(f"`txt_seq_lens` must have length {batch_size}, but got {len(txt_seq_lens)} instead.")
    text_seq_len = max(text_seq_len, max(txt_seq_lens))
elif encoder_hidden_states_mask is not None:
    text_seq_len = max(text_seq_len, int(encoder_hidden_states_mask.sum(dim=1).max().item()))
```

This only works if the attention mask is in the form of [True, True, True, ..., False, False, False]. While this is the case in the most common use case of text attention masks, it doesn't have to be the case.

If the mask is [True, False, True, False, True, False], self.pos_embed receives an incorrect sequence length
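
A small illustration of that edge case, using hypothetical masks rather than tensors from the pipeline: sum() counts valid tokens, which only matches the positional extent RoPE needs when padding sits contiguously at the end.

```python
import torch

contiguous  = torch.tensor([[True, True, True, False, False, False]])
interleaved = torch.tensor([[True, False, True, False, True, False]])

print(int(contiguous.sum(dim=1).max()))   # 3 -> last valid index is 2, so length 3 is correct
print(int(interleaved.sum(dim=1).max()))  # 3 -> but the last valid token sits at index 4,
                                          #      so positions up to length 5 are needed
```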

@kashif (Contributor, Author) commented Nov 29, 2025

Thanks @dxqb, the idea was to remove txt_seq_lens altogether and work with any mask pattern.

@yiyixuxu (Collaborator) left a comment:

thanks a ton for the PR! @kashif
I left one question, let me know!

- Remove seq_lens parameter from dispatch_attention_fn
- Update varlen backends to extract seqlens from masks
- Update QwenImage to pass 2D joint_attention_mask
- Fix native backend to handle 2D boolean masks
- Fix sage_varlen seqlens_q to match seqlens_k for self-attention

Note: sage_varlen still producing black images, needs further investigation
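
For context, a minimal sketch of how a varlen backend can derive per-sample and cumulative sequence lengths from a 2D boolean mask instead of a seq_lens argument; the helper name and shapes are assumptions, not the PR's exact code:

```python
import torch
import torch.nn.functional as F

def seqlens_from_mask(mask: torch.Tensor):
    # mask: (batch, seq_len) boolean, True for valid tokens.
    seqlens = mask.sum(dim=1, dtype=torch.int32)                       # per-sample lengths
    cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))   # prefix sums expected by varlen kernels
    return seqlens, cu_seqlens, int(seqlens.max())
```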
@sayakpaul (Member) commented:

#12655 provides some benchmarks on the speed, as well. Possible to provide them here too? @kashif

@kashif (Contributor, Author) commented Dec 8, 2025

Some benchmarks with various backends:

Code: benchmark_backends_qwen.py (attached script)

[Attached image: backend_benchmark_complete]
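
For readers without the attachment, a rough sketch of how such a backend comparison can be scripted; it assumes the set_attention_backend helper available on recent diffusers models and illustrative backend names, and it is not the attached benchmark script:

```python
import time
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")

for backend in ["native", "flash", "sage"]:  # assumed backend names
    pipe.transformer.set_attention_backend(backend)
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = pipe("a cat", num_inference_steps=20).images[0]
    torch.cuda.synchronize()
    print(f"{backend}: {time.perf_counter() - start:.2f}s")
```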

@yiyixuxu (Collaborator) left a comment:

Thanks so much for helping us on this issue, @kashif.
The changes look good to me.

I asked the Qwen team to review it too, so we will wait for their feedback now :)

Enhances documentation with comprehensive performance insights for QwenImage pipeline:
@cdutr (Contributor) commented Dec 11, 2025

Hey @kashif! I've prepared a documentation update with a new Performance section covering:

  • Attention backend benchmarks (from your tests)
  • torch.compile speedup (~2.4x)
  • Variable-length prompt handling with CFG

I also mention the gist with the scripts I used.

Can you double-check?

Is there anything missing, or something else I can help with?

@naykun (Contributor) commented Dec 12, 2025

Thank you so much for this excellent PR! It’s clean, well-structured, and addresses several long-standing issues. I’ve left a few questions; we can discuss them further.

@sayakpaul (Member) left a comment:

Left some more comments. LMK if anything is unclear.

```python
image = pipe("a cat", num_inference_steps=50).images[0]
```

### Batched Inference with Variable-Length Prompts
Member:

Instead of specifying performance numbers on torch.compile and other attention backends, maybe we could highlight this point and include with and without torch.compile numbers? @cdutr WDYT?

Contributor:

Good point! I've simplified the Performance section to focus on torch.compile with the before/after numbers, and removed the attention backend tables, since the differences between backends are minimal compared to the torch.compile gains.
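
For reference, a minimal sketch of the with/without torch.compile comparison being discussed; the model id and compile settings are assumptions, and the speedup numbers in the docs come from the authors' own benchmarks, not from this snippet:

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")

# Baseline run, then compile the transformer and run again; the first compiled
# call pays the compilation cost, so time a later call for the actual speedup.
image = pipe("a cat", num_inference_steps=50).images[0]
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)
image = pipe("a cat", num_inference_steps=50).images[0]
```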

```python
    return out


_maybe_download_kernel_for_backend(_AttentionBackendRegistry._active_backend)
```
Member:

Why do we need this?

Member:

Cc: @kashif


```python
# Validate batch inference with variable-sized images
# In Layer3DRope, the outer list represents batch, inner list/tuple represents layers
if isinstance(video_fhw, list) and len(video_fhw) > 1:
```
Member:

Cc: @naykun good for you?

```python
    def prepare_dummy_input(self, height, width):
        return QwenImageTransformerTests().prepare_dummy_input(height=height, width=width)

    @pytest.mark.xfail(condition=True, reason="RoPE needs to be revisited.", strict=True)
```
Member:

Does it pass now? 👀

Contributor (Author):

yes it passes

Member:

Damn! Wow!

Member:

@kashif which versions are you using? I am getting:

```
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::QwenImageTransformerCompileTests::test_compile_works_with_aot - torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0, 1) (unhinted: E...
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::QwenImageTransformerCompileTests::test_torch_compile_recompilation_and_graph_break - torch._dynamo.exc.Unsupported: Dynamic slicing with Tensor arguments
FAILED tests/models/transformers/test_models_transformer_qwenimage.py::QwenImageTransformerCompileTests::test_torch_compile_repeated_blocks - torch._dynamo.exc.Unsupported: Data-dependent branching
```

With:

```
RUN_SLOW=1 RUN_COMPILE=yes pytest tests/models/transformers/test_models_transformer_qwenimage.py::QwenImageTransformerCompileTests
```

Would it be possible to fix the compile compatibility? With main, it runs as expected:

```
==================================== 3 passed, 1 skipped, 1 xfailed, 5 warnings in 109.93s (0:01:49) =====================================
```

Member:

Compilation issues still need to be resolved I believe.

Contributor (Author):

```
RUN_SLOW=1 RUN_COMPILE=yes pytest tests/models/transformers/test_models_transformer_qwenimage.py::QwenImageTransformerCompileTests
...
============================================================ 4 passed, 1 skipped, 12 warnings in 45.10s =============================================================
```

Member:

Yes passing for me as well. What was the major change?

Contributor (Author):

I had to fix device placements and data-dependent ifs.
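
As a hypothetical illustration of the kind of data-dependent branch that breaks torch.compile, and how it can be removed; this is not the PR's actual code:

```python
import torch

def masked_scores_branching(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Python-level branch on tensor *values*: Dynamo reports
    # "Data-dependent branching" when compiling through this.
    if bool((~mask).any()):
        scores = scores.masked_fill(~mask, float("-inf"))
    return scores

def masked_scores_compile_friendly(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Apply the mask unconditionally; an all-True mask is a no-op,
    # so no branch on tensor data is needed.
    return scores.masked_fill(~mask, float("-inf"))
```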

kashif and others added 10 commits December 17, 2025 15:01
Removes detailed attention backend benchmarks and simplifies torch.compile performance description

Focuses on key performance improvement with torch.compile, highlighting the specific speedup from 4.70s to 1.93s on an A100 GPU

Streamlines the documentation to provide more concise and actionable performance insights
Extends deprecation timeline for txt_seq_lens from version 0.37.0 to 0.39.0 across multiple Qwen image-related models

Adds a new unit test to verify the deprecation warning behavior for the txt_seq_lens parameter
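
A rough sketch of what the deprecation path could look like using diffusers' deprecate utility; the removal version comes from the commit note above, but the call site and message wording are assumptions:

```python
from diffusers.utils import deprecate

# Hypothetical call site; the real check lives inside the transformer's forward.
def _maybe_warn_txt_seq_lens(txt_seq_lens):
    if txt_seq_lens is not None:
        deprecate(
            "txt_seq_lens",
            "0.39.0",
            "Passing `txt_seq_lens` is deprecated; the text length is now inferred "
            "from `encoder_hidden_states` and `encoder_hidden_states_mask`.",
        )
```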
Comment on lines 206 to 212
"patch_size": 2,
"in_channels": 16,
"out_channels": 16,
"num_layers": 2,
"attention_head_dim": 128,
"num_attention_heads": 4,
"joint_attention_dim": 16,
@sayakpaul (Member) commented Dec 18, 2025:

Can we use smaller values please? @cdutr

I also think we should set use_additional_t_cond=True here?

Contributor:

@kashif already addressed it
