Skip to content

Conversation

@yuantailing
Copy link
Member

GroupNorm v2 outperforms the original GroupNorm extension by utilizing coalesced memory access, tested on B200 with a set of shapes.

@alpha0422
Copy link
Contributor

Hi @crcrpar, could you also help review this PR?

Comment on lines +184 to +191
constexpr auto params = compute_gn_params<T, false, false, HW, G, CPG, LB_N, RUNTIME_CUDA_ARCH, LB_SM_COUNT, EFFECTIVE_CUDA_ARCH, SM_MARGIN>();
constexpr int BLOCK_DIM_X = std::get<0>(params);
constexpr int C_PER_BLOCK = std::get<1>(params);
constexpr int ROWS_PER_BLOCK = std::get<2>(params);
constexpr int VEC_ELEMS = std::get<3>(params);
constexpr bool LOAD_TWICE = std::get<4>(params);
constexpr int BLOCKS_PER_SM = std::get<5>(params);
constexpr bool HARDWARE_CLUSTER = std::get<6>(params);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ah, structured binding cannot be used with constexpr...)

Copy link
Collaborator

@crcrpar crcrpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some test cases, e.g.

  • check numeric and functionality of v2
  • error is expectedly thrown for invalid inputs
    ?

if bare_metal_version >= Version("12.8"):
arch_flags = ["-gencode=arch=compute_100,code=sm_100"]
else:
arch_flags = ["-gencode=arch=compute_90,code=compute_90"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: Looking at this conditions, it is a Blackwell exclusive. Is it supposed to work on Hopper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works on Hopper if we add it to arch_flags and DISPATCH_CUDA_ARCH_AND_LOWER_BOUND_SM_COUNT. However, I think we need a smarter way to determine template args to support more GPUs, instead of exhausting SM_COUNT.

@yuantailing yuantailing marked this pull request as ready for review March 21, 2025 02:15
@nWEIdia
Copy link
Collaborator

nWEIdia commented Mar 26, 2025

Has "check numeric and functionality of v2" tests been performed?

@yuantailing
Copy link
Member Author

Has "check numeric and functionality of v2" tests been performed?

Yes, tests were added and passed.

@crcrpar
Copy link
Collaborator

crcrpar commented Mar 26, 2025

@nWEIdia yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants