About the performance profile of the RTX A4000

Hello, NVIDIA experts,
I have a question about the RTX A4000. I tested its performance with cuBLAS and found that I cannot reach the peak performance. The results are as follows:

It never reaches the peak value, and my guess was that the bandwidth of the A4000's global memory cannot sustain the peak performance.
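As a rough sanity check of that bandwidth guess, the sketch below estimates the FP32 peak and the compute-to-bandwidth balance point. The SM count and clock are the values printed later in this post; the 128 FP32 lanes per SM and the 448 GB/s DRAM bandwidth are assumptions taken from the public GA10x and A4000 specifications, not measurements. The `sgemm_intensity` helper is hypothetical and models only the best case, where each matrix crosses DRAM exactly once.

```python
# Back-of-the-envelope roofline numbers for the RTX A4000.
SM_COUNT = 48             # from the shared-memory test output in this post
CLOCK_HZ = 1560e6         # from the shared-memory test output in this post
FP32_LANES_PER_SM = 128   # ASSUMPTION: GA10x SM layout
DRAM_GB_PER_S = 448.0     # ASSUMPTION: A4000 datasheet value

# One FFMA counts as 2 FLOP (multiply + add).
peak_gflops = SM_COUNT * FP32_LANES_PER_SM * 2 * CLOCK_HZ / 1e9

# FLOP that must be done per DRAM byte to be compute-bound rather than DRAM-bound.
machine_balance = peak_gflops / DRAM_GB_PER_S


def sgemm_intensity(n: int) -> float:
    """Best-case FLOP per DRAM byte for an n x n x n FP32 GEMM.

    Assumes A, B and C each cross DRAM exactly once; a real kernel
    re-reads tiles, so its actual intensity is lower than this.
    """
    flops = 2.0 * n ** 3
    bytes_moved = 3 * n * n * 4  # three n x n matrices of 4-byte floats
    return flops / bytes_moved


if __name__ == "__main__":
    print(f"peak FP32:       {peak_gflops:.2f} GFLOP/s")
    print(f"machine balance: {machine_balance:.2f} FLOP/byte")
    for n in (256, 1024, 4096):
        print(f"n={n}: best-case intensity {sgemm_intensity(n):.1f} FLOP/byte")
```

Under these assumptions a large square SGEMM has a best-case intensity far above the balance point, so DRAM bandwidth alone would not explain the gap. The model says nothing about shared memory or cache behaviour.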
So I profiled my program with Nsight Compute (NCU).
The Pipe Utilization section shows the following:
The FMA utilization is about 50%, which I think matches my test result.
NCU also reports the following:

[Warning] LSU is the highest-utilized pipeline (87.9%). It executes load/store memory operations. The pipeline is over-utilized and likely a performance bottleneck.

I don't know why the LSU is the bottleneck. My guess was that the shared memory latency is very poor, so I wrote a test of shared memory bandwidth and latency. The results are as follows:

shared memory accessed: 2097152 byte
duration: 18766 cycles
shared memory bandwidth per SM (measured): 111.752747 byte/cycle
shared memory bandwidth per SM (theoretical): 128 byte/cycle
standard clock frequency: 1560 MHz
SM: 48
whole chip shared memory bandwidth (theoretical): 9584.639648 GB/s
shared memory latency 23 cycles

The measured shared memory bandwidth is close to the theoretical value, even though the shared memory latency is 23 cycles.
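The figures in that output can be reproduced with a few lines of arithmetic. This is only a sketch that recomputes the printed values from the printed inputs. It assumes the byte count and cycle count refer to a single SM, which is what the "per SM" label suggests.

```python
# Recompute the shared-memory numbers from the test output above.
BYTES_ACCESSED = 2_097_152      # "shared memory accessed"
DURATION_CYCLES = 18_766        # "duration"
THEORETICAL_B_PER_CYCLE = 128   # "bandwidth per SM (theoretical)"
CLOCK_HZ = 1560e6               # "standard clock frequency"
SM_COUNT = 48                   # "SM"

# ASSUMPTION: the byte and cycle counts describe one SM.
measured_b_per_cycle = BYTES_ACCESSED / DURATION_CYCLES

# Fraction of the theoretical per-SM bandwidth that the test achieved.
efficiency = measured_b_per_cycle / THEORETICAL_B_PER_CYCLE

# Whole-chip theoretical bandwidth in GB/s (1 GB = 1e9 bytes).
chip_gb_per_s = THEORETICAL_B_PER_CYCLE * SM_COUNT * CLOCK_HZ / 1e9

if __name__ == "__main__":
    print(f"measured:   {measured_b_per_cycle:.6f} byte/cycle per SM")
    print(f"efficiency: {efficiency:.1%} of theoretical")
    print(f"whole chip: {chip_gb_per_s:.2f} GB/s (theoretical)")
```

The recomputed values agree with the test output: about 111.75 byte/cycle, roughly 87% of the theoretical 128 byte/cycle, and 9584.64 GB/s for the whole chip. The trailing digits in the posted 9584.639648 are presumably single-precision rounding.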

I'm very confused. How should I interpret NCU's warning about the LSU?

Next, I checked the “Memory Workload Analysis” section in NCU:

The “% Peak” is 53.5 for “Shared Load” and 2.11 for “Shared Store”. These values are clearly low. More importantly, the table reports bank conflicts. Are the low “% Peak” values caused by these bank conflicts?

Then I checked the “Scheduler Statistics” section:

“Issued Warp Per Scheduler” is 0.71, which I think is lower than expected.

Then I checked the “Warp State Statistics” section:

The top entry is “Stall Not Selected” rather than “Stall Long Scoreboard”, so I don't think global memory bandwidth is the main reason for the poor performance. Is that right?

Finally, I checked the “Instruction Statistics” section:

The top entries are FFMA and LDS: there are 2,417,483,648 FFMA instructions and 269,156,352 LDS instructions, so FFMA/LDS < 10. I therefore think the real shared memory bandwidth cannot sustain higher performance. Since the shared memory latency is 23 cycles, I think FFMA/LDS would need to be > 20 for the RTX A4000 to make full use of its FFMA capability. Is that right?
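For reference, the instruction ratio can be computed directly from the two counts quoted above. The sketch only does the division and compares the result with the two thresholds mentioned in this post (10 and 20). It does not model how many bytes each LDS loads or how latency is hidden, so it cannot confirm or refute the "> 20" estimate by itself.

```python
# Instruction counts copied from the NCU "Instruction Statistics" section.
FFMA_COUNT = 2_417_483_648
LDS_COUNT = 269_156_352

# FFMA instructions executed per shared-memory load instruction.
ffma_per_lds = FFMA_COUNT / LDS_COUNT

# Share of LDS among the two dominant instruction types.
lds_share = LDS_COUNT / (FFMA_COUNT + LDS_COUNT)

# Thresholds taken from the post: observed "< 10", estimated need "> 20".
OBSERVED_UPPER_BOUND = 10
ESTIMATED_REQUIRED = 20

if __name__ == "__main__":
    print(f"FFMA per LDS: {ffma_per_lds:.2f}")
    print(f"LDS share of FFMA+LDS: {lds_share:.1%}")
    print(f"below {OBSERVED_UPPER_BOUND}: {ffma_per_lds < OBSERVED_UPPER_BOUND}")
    print(f"factor short of {ESTIMATED_REQUIRED}: "
          f"{ESTIMATED_REQUIRED / ffma_per_lds:.2f}x")
```

The ratio comes out at roughly 8.98, so about one in ten of these instructions is an LDS.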

I think the main cause of the poor performance is shared memory rather than global memory. I don't have any other CUDA GPUs, so I don't know what a typical shared memory latency is.

I don't understand why the main cause would be shared memory rather than global memory. Would anyone be willing to help me check the profile?

I see that you got a reply on this thread: Why the performance of tf32 tensor_core is poor? - #7 by Shaquille. I think that's the best place to continue the discussion, since there are more details there already.