Bank Conflicts on Ampere

A colleague of mine who works with an A100 couldn’t get rid of some shared memory bank conflicts. He then profiled the 6_Performance/transpose CUDA sample for comparison and saw conflicts in the transposeNoBankConflicts kernel. I found that this kernel produces conflicts on architectures from Pascal to Ampere unless one fixes TILE_DIM and BLOCK_ROWS to 32 in transpose.cu. With that fix the conflicts are gone on Pascal (measured with nvprof) and Turing (measured with ncu), but on Ampere (both A100 and RTX 3090) a small number of conflicts remains. Are these real, or some kind of bug/artifact? Did I miss some change between Turing and Ampere?
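For context, the conflict-avoidance trick in that kernel is the extra padding column on the shared memory tile. A stripped-down sketch of the idea (not the exact sample source), with TILE_DIM and BLOCK_ROWS both fixed to 32 as described above:

```cuda
#define TILE_DIM   32
#define BLOCK_ROWS 32

// Sketch of the transposeNoBankConflicts idea, not the exact sample code.
__global__ void transposeNoBankConflicts(float *odata, const float *idata,
                                         int width, int height)
{
    // The +1 padding column shifts each row of the tile by one bank, so a
    // column of the tile maps to 32 different banks instead of a single one.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[(y + i) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // transposed block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[(y + i) * height + x] = tile[threadIdx.x][threadIdx.y + i];
}
```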

I used

ncu -k transposeNoBankConflicts --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.avg,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.avg transpose -dimX=8192 -dimY=8192

for profiling and CUDA 11.8 with gcc 11.3 and ncu 2022.3 on the A100 and CUDA 11.2.2 with gcc 9.4 and ncu 2022.4 on the RTX 3090. The code was compiled with the given Makefile using make SMS=80 and make SMS=86 respectively.

You may want to check the bank conflicts against the L1 Wavefronts Shared Excessive instruction-level metric on the Source page, as mentioned in this post. If this metric has any non-zero values on the Source page, that is where the bank conflicts are originating from.

So if I understand it correctly, “L1 Wavefronts Shared Excessive” in the Source View is a better metric for “solvable” shared memory conflicts. For the transpose on A100 I see zero there, which is what I would expect. I think this solves the most important part of my question.
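For anyone reproducing this: that the padded tile should be conflict-free follows from simple bank arithmetic, which can be checked on the host. A minimal sketch (plain host code, assuming 4-byte words and 32 banks):

```cuda
#include <assert.h>
#include <stdio.h>

// With 4-byte words and 32 banks, element tile[row][col] of a tile with the
// given row pitch (in words) lands in bank (row * pitch + col) % 32.
int bank(int row, int col, int pitch) { return (row * pitch + col) % 32; }

int main(void)
{
    int hits_unpadded[32] = {0}, hits_padded[32] = {0};

    // A warp reading a tile column: thread t reads tile[t][0].
    for (int t = 0; t < 32; ++t) {
        ++hits_unpadded[bank(t, 0, 32)];  // pitch 32: unpadded tile
        ++hits_padded[bank(t, 0, 33)];    // pitch 33: tile[32][32 + 1]
    }

    assert(hits_unpadded[0] == 32);  // unpadded: 32-way conflict on one bank
    for (int b = 0; b < 32; ++b)
        assert(hits_padded[b] == 1); // padded: each bank hit exactly once

    printf("padded column access is conflict-free\n");
    return 0;
}
```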

I would still be interested in an explanation of why the metrics in my original question are nonzero on Ampere for the given example (and zero on other architectures). I guess it comes from L1 caching then?

L1 caching certainly has an influence on the number of conflicts reported. Using __ldcg and __stcg for global loads and stores seems to increase the number of store conflicts but decrease the number of load conflicts, while __ldcs and __stcs decrease both numbers.
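For reference, the experiment above is my own modification of the sample, not part of it: the plain global accesses in the copy loops are swapped for cache-operator intrinsics. Roughly (index variables sketched, not the exact sample code):

```cuda
// __ldcg/__stcg cache global loads/stores at L2 only, bypassing L1 (.cg);
// __ldcs/__stcs use the streaming, evict-first policy (.cs) instead.
for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    tile[threadIdx.y + i][threadIdx.x] = __ldcg(&idata[(y + i) * width + x]);

__syncthreads();

for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    __stcg(&odata[(y + i) * height + x], tile[threadIdx.x][threadIdx.y + i]);
```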

So if I understand it correctly, “L1 Wavefronts Shared Excessive” in the Source View is a better metric for “solvable” shared memory conflicts

That’s the correct takeaway, yes.