Bank Conflicts on Ampere

A colleague of mine who works with an A100 couldn’t get rid of some shared memory bank conflicts. He then profiled the 6_Performance/transpose CUDA sample for comparison and saw conflicts in the transposeNoBankConflicts kernel. I found that this kernel produces conflicts on architectures from Pascal to Ampere unless one fixes TILE_DIM and BLOCK_ROWS to 32 in transpose.cu. With that fix the conflicts are gone on Pascal (measured with nvprof) and Turing (measured with ncu), but on Ampere (both A100 and RTX 3090) a small number of conflicts remains. Are these real, or some kind of bug/artifact? Did I miss some change between Turing and Ampere?
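For context, the conflict-avoidance trick in that kernel is the extra padding column on the shared memory tile. A stripped-down sketch of the idea (not the exact sample source), with TILE_DIM and BLOCK_ROWS both fixed to 32 as described above:

```cuda
#define TILE_DIM   32
#define BLOCK_ROWS 32

// Sketch of the transposeNoBankConflicts idea, not the exact sample code.
__global__ void transposeNoBankConflicts(float *odata, const float *idata,
                                         int width, int height)
{
    // The +1 padding column shifts each row of the tile by one bank, so a
    // column of the tile maps to 32 different banks instead of a single one.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[(y + i) * width + x];

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;  // transposed block offset
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[(y + i) * height + x] = tile[threadIdx.x][threadIdx.y + i];
}
```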

I used

ncu -k transposeNoBankConflicts --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.avg,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.avg transpose -dimX=8192 -dimY=8192

for profiling and CUDA 11.8 with gcc 11.3 and ncu 2022.3 on the A100 and CUDA 11.2.2 with gcc 9.4 and ncu 2022.4 on the RTX 3090. The code was compiled with the given Makefile using make SMS=80 and make SMS=86 respectively.

You may want to check the bank conflicts against the L1 Wavefronts Shared Excessive instruction-level metric on the Source page, as mentioned in this post. If this metric has any non-zero values on the Source page, that is where the bank conflicts are originating from.

So if I understand it correctly, “L1 Wavefronts Shared Excessive” in the Source View is a better metric for “solvable” shared memory conflicts. For the transpose on A100 I see zero there, which is what I would expect. I think this solves the most important part of my question.
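For anyone reproducing this: that the padded tile should be conflict-free follows from simple bank arithmetic, which can be checked on the host. A minimal sketch (plain host code, assuming 4-byte words and 32 banks):

```cuda
#include <assert.h>
#include <stdio.h>

// With 4-byte words and 32 banks, element tile[row][col] of a tile with the
// given row pitch (in words) lands in bank (row * pitch + col) % 32.
int bank(int row, int col, int pitch) { return (row * pitch + col) % 32; }

int main(void)
{
    int hits_unpadded[32] = {0}, hits_padded[32] = {0};

    // A warp reading a tile column: thread t reads tile[t][0].
    for (int t = 0; t < 32; ++t) {
        ++hits_unpadded[bank(t, 0, 32)];  // pitch 32: unpadded tile
        ++hits_padded[bank(t, 0, 33)];    // pitch 33: tile[32][32 + 1]
    }

    assert(hits_unpadded[0] == 32);  // unpadded: 32-way conflict on one bank
    for (int b = 0; b < 32; ++b)
        assert(hits_padded[b] == 1); // padded: each bank hit exactly once

    printf("padded column access is conflict-free\n");
    return 0;
}
```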

I would still be interested in an explanation of why the metrics in my original question are nonzero on Ampere for the given example (and zero on other architectures). I guess it comes from L1 caching then?

L1 caching certainly has an influence on the number of conflicts reported. Using __ldcg and __stcg for global loads and stores seems to increase the number of store conflicts but decrease the number of load conflicts, while __ldcs and __stcs decrease both numbers.
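For reference, the experiment above is my own modification of the sample, not part of it: the plain global accesses in the copy loops are swapped for cache-operator intrinsics. Roughly (index variables sketched, not the exact sample code):

```cuda
// __ldcg/__stcg cache global loads/stores at L2 only, bypassing L1 (.cg);
// __ldcs/__stcs use the streaming, evict-first policy (.cs) instead.
for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    tile[threadIdx.y + i][threadIdx.x] = __ldcg(&idata[(y + i) * width + x]);

__syncthreads();

for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
    __stcg(&odata[(y + i) * height + x], tile[threadIdx.x][threadIdx.y + i]);
```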

So if I understand it correctly, “L1 Wavefronts Shared Excessive” in the Source View is a better metric for “solvable” shared memory conflicts

That’s the correct takeaway, yes.