Why can the mma instruction only reach 50% peak computing throughput?

1457689744 · February 5, 2026, 10:25am

I am using the ‘tcgen05.mma.cta_group::2.kind::mxf4nvf4.block_scale.scale_vec::4X’ instruction to write an NVFP4 GEMM kernel. With a cluster dimension of (2,1,1), the mma instruction reaches 95% of peak compute throughput. But with a cluster dimension of (4,1,1) or (2,2,2), it reaches only 50% of peak compute throughput no matter what I try. Is there any way to solve or explain this problem?

AastaLLL · February 6, 2026, 12:10pm

Hi,

Could you share the sample code you are running so that we can check it internally?

Thanks.

1457689744 · February 9, 2026, 2:41am

It seems that the number of active clusters affects the compute throughput.
When I use __cluster_dims__(2, 1, 1), ncu shows 10 active clusters, which fill all 20 SMs.
But when I use __cluster_dims__(4, 1, 1), only 3 or 4 clusters are active, so not all SMs are used.
I also ran the CUTLASS example 72a_blackwell_nvfp4_bf16_gemm https://github.com/NVIDIA/cutlass/blob/main/examples/72_blackwell_narrow_precision_gemm/72a_blackwell_nvfp4_bf16_gemm.cu and the example examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100 https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/blackwell/04_mma_tma_2sm_sm100.cu, and ncu showed the same result. I am wondering whether the GPC layout of Thor means the cluster dimension can only be set to (2, 1, 1). Can you confirm this internally? Thank you.
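To make the observation above concrete, here is a minimal sketch of the SM-coverage arithmetic implied by the ncu figures in this post. It assumes one CTA per SM and takes the 20-SM total from the (2,1,1) run; the helper name `sm_coverage` is hypothetical and is not part of CUDA, CUTLASS, or ncu.

```python
# Sketch only: figures come from the ncu observations quoted in this post.
# Assumption: each CTA of a cluster occupies exactly one SM.

def sm_coverage(ctas_per_cluster: int, active_clusters: int, total_sms: int) -> float:
    """Return the fraction of SMs that host a CTA (hypothetical helper)."""
    busy_sms = min(ctas_per_cluster * active_clusters, total_sms)
    return busy_sms / total_sms

TOTAL_SMS = 20  # 10 active clusters x 2 CTAs filled every SM in the (2,1,1) run

for ctas, clusters in [(2, 10), (4, 4), (4, 3)]:
    frac = sm_coverage(ctas, clusters, TOTAL_SMS)
    print(f"{ctas} CTAs/cluster, {clusters} active clusters: {frac:.0%} of SMs busy")
```

Under these assumptions, idle SMs alone would cap throughput somewhere between roughly 60% and 80% of peak for (4,1,1), so the coverage gap may be only part of the explanation for the 50% figure.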

1457689744 · February 9, 2026, 2:42am

Here is my ncu result.

AastaLLL · February 9, 2026, 3:36am

Hi,

Sure, we will check this with our internal team.
Just to double-check: are both experiments running in the same nvpmodel mode?

Some nvpmodel modes (e.g., 90W) turn off some of the TPCs, which will affect the results.

Thanks.

1457689744 · February 9, 2026, 3:57am

Yes, and the command nvpmodel -q shows “NV Power Mode: MAXN”.

AastaLLL · February 9, 2026, 4:43am

Hi,

Could you try to set it to 90W?

Thanks.

1457689744 · February 9, 2026, 6:16am

I found that there are only mode 0 and mode 1 in /etc/nvpmodel.conf. How can I use nvpmodel -m 2?

1457689744 · February 9, 2026, 11:21am

May I ask what the purpose of the 90W test is? And if it is necessary, what should I add to nvpmodel.conf? I ask because I noticed that 90W can only use 6 TPCs.

1457689744 · February 11, 2026, 2:43am

Hello, may I ask if you have any results yet?

AastaLLL · February 12, 2026, 6:09am

Hi,

Thanks for your patience.
We are still checking this issue internally.

The 90W test is to see whether changing the number of TPCs makes any difference.
We do see the 90W configuration in our nvpmodel.conf:

...
< POWER_MODEL ID=2 NAME=90W >
...

Do you use JetPack 7.1?
Thanks.

1457689744 · February 25, 2026, 1:48am

I’m using JetPack 7.0, and I can only find the MAXN and 120W configurations when I run cat /etc/nvpmodel.conf.

1457689744 · February 26, 2026, 7:51am

Hello, are there any results yet?
