About performance profile of RTX A4000

Shaquille · July 31, 2023, 6:57am

Hello NVIDIA experts,
I have a question about the RTX A4000. I tested its performance with cuBLAS and found that I cannot reach peak performance. The result is as follows:

It never reaches the peak value. My first guess was that the bandwidth of the A4000's global memory cannot sustain peak performance.
So I profiled my program with Nsight Compute (NCU).
The Pipe Utilization section shows the following:
pipe_utilization

The FMA utilization is about 50%, which I think matches my test result.
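
For reference, the theoretical FP32 peak behind that comparison can be estimated from the SM count and clock reported further down in this post. This is only a back-of-envelope sketch: the figure of 128 FP32 lanes per SM is an assumption based on the GA104 (compute capability 8.6) layout, not something measured here, and treating 50% pipe utilization as 50% of peak throughput is a simplification.

```python
# Back-of-envelope FP32 peak for the RTX A4000.
SM_COUNT = 48             # "SM: 48" from the test output in this post
CLOCK_HZ = 1560e6         # "standard clock frequency: 1560 MHz" from this post
FP32_LANES_PER_SM = 128   # ASSUMPTION: GA104 / compute capability 8.6 layout
FLOP_PER_FMA = 2          # one fused multiply-add counts as two FLOPs


def peak_fp32_tflops(sm_count, clock_hz, lanes_per_sm=FP32_LANES_PER_SM):
    """Theoretical FP32 throughput in TFLOP/s if every lane issues an FMA each cycle."""
    return sm_count * lanes_per_sm * FLOP_PER_FMA * clock_hz / 1e12


peak = peak_fp32_tflops(SM_COUNT, CLOCK_HZ)
half = 0.5 * peak  # what ~50% FMA utilization would mean under this simple model

print(f"theoretical FP32 peak:        {peak:.2f} TFLOP/s")
print(f"at 50% FMA pipe utilization:  {half:.2f} TFLOP/s")
```

If the measured cuBLAS throughput is near the second number, the pipe utilization and the benchmark result are at least consistent with each other.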
NCU also reports the following:

[Warning] LSU is the highest-utilized pipeline (87.9%). It executes load/store memory operations. The pipeline is over-utilized and likely a performance bottleneck.

I don't understand why LSU is the bottleneck. My guess was that the latency of shared memory is very poor, so I wrote a shared memory latency test. The result is as follows:

shared memory accessed: 2097152 byte
duration: 18766 cycles
shared memory bandwidth per SM (measured): 111.752747 byte/cycle
shared memory bandwidth per SM (theoretical): 128 byte/cycle
standard clock frequency: 1560 MHz
SM: 48
whole chip shared memory bandwidth (theoretical): 9584.639648 GB/s
shared memory latency 23 cycles

The measured shared memory bandwidth is close to the theoretical value, although the shared memory latency is 23 cycles.
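
The numbers in the output above can be re-derived with a few lines of arithmetic. The sketch below recomputes the per-SM bandwidth and the whole-chip figure from the printed values, and adds a Little's law estimate of how much data must be in flight to cover a 23-cycle latency at full bandwidth. The in-flight estimate is an illustration under the assumption of 32-bit loads per thread, not part of the original test.

```python
# Values copied from the test output in this post.
BYTES_ACCESSED = 2_097_152
DURATION_CYCLES = 18_766
THEORETICAL_BYTES_PER_CYCLE = 128   # per SM
CLOCK_HZ = 1560e6
SM_COUNT = 48
LATENCY_CYCLES = 23
WARP_SIZE = 32

# Measured per-SM bandwidth and how close it is to the theoretical value.
measured = BYTES_ACCESSED / DURATION_CYCLES
efficiency = measured / THEORETICAL_BYTES_PER_CYCLE

# Whole-chip theoretical shared memory bandwidth in GB/s.
chip_gb_per_s = THEORETICAL_BYTES_PER_CYCLE * SM_COUNT * CLOCK_HZ / 1e9

# Little's law: bytes in flight = bandwidth * latency.
bytes_in_flight = THEORETICAL_BYTES_PER_CYCLE * LATENCY_CYCLES
# ASSUMPTION: each warp-wide load moves 4 bytes per thread (32-bit LDS).
warp_loads_in_flight = bytes_in_flight / (WARP_SIZE * 4)

print(f"measured bandwidth per SM: {measured:.3f} byte/cycle ({efficiency:.1%} of theoretical)")
print(f"whole-chip theoretical:    {chip_gb_per_s:.2f} GB/s")
print(f"bytes in flight per SM to cover latency: {bytes_in_flight}")
print(f"equivalent 32-bit warp loads in flight:  {warp_loads_in_flight:.0f}")
```

Under these assumptions, latency by itself does not cap bandwidth as long as enough independent loads are outstanding per SM, which is consistent with the test reaching about 87% of the theoretical value despite the 23-cycle latency.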

I'm very confused. How should I interpret NCU's warning about LSU?

Next, I checked the “Memory Workload” section in NCU:
memory_work_load

The “% Peak” is 53.5 for “Shared Load” and 2.11 for “Shared Store”. These values are clearly low. More importantly, the table above shows bank conflicts. Are the low “% Peak” values caused by these bank conflicts?

Then I checked the “Scheduler Statistics”:
scheduler_statistic

“Issued Warp Per Scheduler” is 0.71, which I think is lower than expected.

Then I checked the “Warp State Statistics”:
warp_state_statistic

The top stall reason is “Stall Not Selected” rather than “Stall Long Scoreboard”, so I don't think global memory bandwidth is the main reason for the poor performance. Is that right?

Finally, I checked the “Instruction Statistics”:
instruction_statistic

The most frequent instructions are FFMA and LDS: there are 2,417,483,648 FFMA and 269,156,352 LDS instructions, so FFMA/LDS < 10. I therefore think the achievable shared memory bandwidth cannot support higher performance. I also think FFMA/LDS would need to be > 20 for the RTX A4000 to make full use of its FFMA capability, because the shared memory latency is 23 cycles. Is that right?
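
One rough way to sanity-check the ratio argument is a throughput estimate rather than a latency one. The sketch below computes the observed FFMA/LDS ratio from the counts above and compares it with the minimum ratio at which the FMA pipes, rather than shared memory bandwidth, would be the limit. It rests on several assumptions that the profile does not confirm: the counts are warp-level instructions, an SM has 128 FP32 lanes (GA104 layout), shared memory delivers 128 byte/cycle per SM with no bank conflicts, and the per-thread load width (`bytes_per_thread`, a hypothetical parameter) is one of 4, 8, or 16 bytes.

```python
# Instruction counts copied from the "Instruction Statistics" in this post.
FFMA_COUNT = 2_417_483_648
LDS_COUNT = 269_156_352

FP32_LANES_PER_SM = 128      # ASSUMPTION: GA104 / compute capability 8.6 layout
SMEM_BYTES_PER_CYCLE = 128   # per SM, theoretical value quoted in this post
WARP_SIZE = 32

ratio = FFMA_COUNT / LDS_COUNT


def min_ffma_per_lds(bytes_per_thread):
    """Minimum FFMA:LDS ratio for the FMA pipes to be the limit (conflict-free model)."""
    # Warp-wide FFMA instructions an SM can retire per cycle.
    ffma_per_cycle = FP32_LANES_PER_SM / WARP_SIZE
    # Cycles of shared memory bandwidth consumed by one warp-wide load.
    cycles_per_lds = WARP_SIZE * bytes_per_thread / SMEM_BYTES_PER_CYCLE
    return ffma_per_cycle * cycles_per_lds


print(f"observed FFMA/LDS ratio: {ratio:.2f}")
for width in (4, 8, 16):  # hypothetical load widths: 32-, 64-, 128-bit per thread
    needed = min_ffma_per_lds(width)
    print(f"{width:2d}-byte loads: need >= {needed:.0f} FFMA per LDS, "
          f"observed/needed = {ratio / needed:.2f}")
```

In this simplified model the required ratio depends on the load width and on bank conflicts, not directly on the 23-cycle latency, so which row applies depends on what kind of LDS instructions the kernel actually issues.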

Conclusion:
I think the main cause of the poor performance is shared memory rather than global memory. I do not have any other CUDA GPUs, so I don't know what a reasonable shared memory latency is.

I don't understand why the main cause would be shared memory rather than global memory. Would anyone be willing to help me check the profile?

jmarusarz · August 3, 2023, 8:31pm

I see that you got a reply on this thread: “Why the performance of tf32 tensor_core is poor?” (#7 by Shaquille). I think that's the best place to continue the discussion, since there are more details there already.
