Shared memory bank conflicts and Nsight metric

Hi,
Using Nsight's l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum metric I can see a significant number of shared memory bank conflicts. Is it possible to locate which lines of code these bank conflicts occur on? Also, on Compute Capability > 7 devices, does this metric only report the conflicts that occur in the shared memory portion of L1, or bank conflicts in L1 as a whole? Thanks

-Muaaz

In Nsight Compute you first want to determine if the bank conflicts are a performance limiter. This can be observed in two different ways:

  1. In the GPU Speed of Light section, determine if L1/TEX Cache [%] (l1tex__throughput.avg.pct_of_peak_sustained_active) is one of the highest values. If it is, then look at the SOL Memory Breakdown for the SOL L1: * entries. Bank conflicts require additional data bank reads. If SOL L1: Data Bank Reads [%] (l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed) or SOL L1: Data Bank Writes [%] (l1tex__data_bank_writes.avg.pct_of_peak_sustained_elapsed) is high, then reducing bank conflicts could help.

  2. The other method is to see whether inefficient shared memory accesses are stalling warps by looking at the Warp State Statistics section. Shared memory accesses can have two impacts on warp state. The additional cycles to process shared memory bank conflicts can cause warps to stall when issuing instructions to the Load Store Unit in the MIO (Memory Input/Output) partition. In this case the warp will report Stall MIO Throttle (smsp__average_warps_issue_stalled_mio_throttle_per_issue_active.ratio). The additional cycles to process shared memory bank conflicts also increase the access latency. In this case the warp will report Stall Short Scoreboard (smsp__average_warps_issue_stalled_short_scoreboard_per_issue_active.ratio) on the instruction waiting for the shared memory data. If either of these reasons is high, then it may be worth fixing the shared memory accesses.

The Source Page can help identify bank conflicts. Open the Source Page (top left drop-down). In the Source Page, change the navigation column from Instructions Executed to Memory L1 Transactions Shared. This value includes bank conflicts. You can then navigate through the highest values using the buttons to the right of the selection. What you are looking for are the rows with the highest Memory L1 Transactions Shared and the largest difference between Memory L1 Transactions Shared and Memory Ideal L1 Transactions Shared.
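As a concrete illustration, here is a minimal sketch of a hypothetical kernel (not the code discussed in this thread): the column-wise load is the kind of line that shows Memory L1 Transactions Shared above Memory Ideal L1 Transactions Shared on the Source Page, and it is also the kind of access that produces the MIO Throttle / Short Scoreboard stalls described above. The commented-out padded declaration is the usual fix.

    // Hypothetical kernel, for illustration only (launch as <<<1, dim3(32, 32)>>>
    // with in/out holding at least 1024 floats).
    __global__ void column_read(const float* in, float* out)
    {
        __shared__ float tile[32][32];     // row stride of 32 floats = 32 banks
        // __shared__ float tile[32][33];  // padding each row by one float
                                           // staggers the banks and removes
                                           // the conflict

        int tx = threadIdx.x;              // assumes blockDim = (32, 32)
        int ty = threadIdx.y;

        tile[ty][tx] = in[ty * 32 + tx];   // row-wise store: conflict free
        __syncthreads();

        // Column-wise load: the 32 threads of a warp read addresses 128 bytes
        // apart, which all map to the same 4-byte bank (32-way conflict).
        out[tx * 32 + ty] = tile[tx][ty];
    }

With the padded declaration, the actual and ideal transaction counts on the load line should match again.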


Sorry to reopen an old thread but it seems most relevant to my question.

I wrote a kernel that should be bank conflict free, but Nsight Compute is telling me there are bank conflicts under the Shared Memory report. And when I manually add l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum as a metric to be collected, its value almost matches the bank conflict count; I assume the discrepancy is due to sampling.

However, in the source counters, Memory L1 Transactions Shared and Memory Ideal L1 Transactions Shared match on every line.

So a few questions:

  1. Why would that be?
  2. How can I get source-line annotation of that counter? I tried adding my own custom section but was unable to make it work, so if someone could provide a section file with that counter that would be great.
  3. Even when there aren’t conflicts I sometimes see a mismatch between Instructions and Wavefronts. Is that a sampling issue or does it indicate something else?

Thank you!

Harris:

  1. Why would that be?

Can you provide a minimal reproducible example? The l1tex__data_bank_conflicts_pipe_lsu_mem_shared* counters can all be collected in one pass, so I don’t think this is a work distribution or sampling issue.

  2. How can I get source-line annotation of that counter? I tried adding my own custom section but was unable to make it work, so if someone could provide a section file with that counter that would be great.

l1tex__data_bank_conflict* are hardware metrics and cannot be collected per instruction. These metrics are collected per L1TEX instance.

The Memory L1 Transactions Shared and Memory Ideal L1 Transactions Shared counters are collected by patching each user instruction. The profiler does not patch some syscalls.

  3. Even when there aren’t conflicts I sometimes see a mismatch between Instructions and Wavefronts. Is that a sampling issue or does it indicate something else?

Most of these counters are collected in a single pass, so the counter values should be very deterministic. The easiest way to debug these issues is if you can post a reproducible example.

From the first post:

on Compute Capability > 7 devices, does this metric only report the conflicts that occur in the shared memory portion of L1, or bank conflicts in L1 as a whole?

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum only counts shared memory. There are specific shared memory atomic instructions that will result in a non-deterministic number of store wavefronts.
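For illustration only, here is a hedged sketch of what a shared-memory atomic looks like (a hypothetical histogram kernel, not taken from this thread). The thread does not say exactly which atomic instructions fall into the non-deterministic category, so treat this only as an example of the general shape of such an access.

    // Hypothetical kernel: per-block histogram built with shared-memory atomics.
    __global__ void shared_histogram(const unsigned char* data,
                                     unsigned int* out, int n)
    {
        __shared__ unsigned int bins[256];

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            bins[i] = 0;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&bins[data[i]], 1u);   // shared-memory atomic; per the note
                                             // above, certain shared atomics can
                                             // yield a non-deterministic store
                                             // wavefront count
        __syncthreads();

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&out[i], bins[i]);     // combine per-block bins in global memory
    }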

Thank you Greg. Could you expand a little on a few points? I will try to make a minimal reproducer and I know that will help debug a lot but it will take a little bit, and it’d be nice to understand things more deeply.

Why wouldn’t the difference between ideal and actual transaction counts show up if the conflict counter is firing? You mention not patching syscalls, but I’m not sure what that means in this context. If you mean calls within the kernel to e.g. printf or a CUB library, I don’t have any of that. The kernel is all my own code within one file, with the only external bits being some header intrinsics like __ldg or __half2float, nothing that touches shared memory.

The other oddity is that the target # of blocks in launch bounds affects things. For a specific # of threads, if I target 1 block I get no conflicts, whereas if I target 2 I get conflicts. I don’t see how that’s possible under the programming model unless the compiler is doing something really weird. The target # of blocks is only used in the launch bounds.

One observation: my code requires that the amount of shared memory is proportional to the number of threads, and I do only see conflicts when the combination of target blocks and # threads requires the 64k shared memory configuration (this is on TU102). But there aren’t many configurations I can make that only need 32k so that might be a red herring.

It actually wasn’t too bad to make a reproducer. It started from some very complicated code but the result is pretty simple. It does seem to require getting the shared memory configuration into 64k mode. At least, dropping buffer to 2048 (which would fit in 32k w/ 4 blocks) makes the problem go away.

Also the odd_warp if statement seems required, for some reason.

Anyway, I’m thinking either I’m doing something really basic wrong, or something is funny with that counter, and also with whatever counter(s) back the Shared Memory Bank Conflicts value in the Memory Workload Analysis section, if those aren’t from the same family.

This was with nvcc from CUDA 10.1 and the profiler from there as well, using the 440.82 drivers on a Quadro RTX 6000.

reproducer.cpp (966 Bytes)

Actually, as Instructions and Wavefronts are independent counters and they differ by roughly the # of conflicts, I guess this is probably some strange but real phenomenon being accurately captured by the counters. But I don’t see why this would happen. Even if something weird happened when crossing the 32k to 64k boundary, each warp’s access is on one side of the boundary. Unless there is some TLB-like thing for the two possible partitions of shared memory and it’s being thrashed. When I make buffer 4096, so each block’s whole allocation is on one side or the other of the boundary, not just the access, it doesn’t help any. Seeing conflicts is very sensitive to # of threads, size of buffer, etc., but I didn’t pull out any additional pattern.

Thank you for the reproducer. I will work with the Nsight Compute and Perfworks teams to isolate the issue. I have run several of the same tests as you and executed them on Volta and Turing chips, and I can confirm that on Turing the hardware counter reports additional unexpected bank conflicts.

Thank you Greg! Looking forward to whatever you uncover.

Was there ever an answer to this question? I’m seeing a similar issue in nsight compute where I get bank conflicts reported only when the shared memory allocated is large (20k per thread block).

I have not heard back yet.

The development team is investigating the issue and can reproduce the issue.

On the Details page:

  • SOL L1 (l1tex__throughput.avg.pct_of_peak_sustained_active) is correct.
  • SOL L1: Data Bank Reads/Writes [%] (l1tex__data_bank_{reads,writes}.avg.pct_of_peak_sustained_elapsed) is correct.
  • The Shared Memory Bank Conflicts (l1tex__data_bank_conflicts_{reads,writes}.avg.pct_of_peak_sustained_elapsed) hardware performance counter is showing a value higher than expected; it also counts certain types of stalled cycles.

On the Source page:

  • The Memory L1 Transactions Shared and Memory Ideal L1 Transactions Shared counters are correct.

If SOL L1: Data Bank Reads/Writes is high, then please go to the Source page and determine if there are lines where Memory L1 Transactions Shared > Memory Ideal L1 Transactions Shared.

Thank you Greg!
Maybe a typo on the third Shared Memory Bank Conflicts counter name, as you list the same counter there as in the prior point on Data Bank Reads/Writes?

I’ll focus on the Transactions vs Ideal Transactions, as you suggest.

@Greg So currently, is there any way to count the bank conflicts exactly? It seems that for both my RTX 3090 and Tesla V100 GPUs the problem still exists. Using l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum results in an incorrect number. The compute capabilities of these GPUs are 8.6 and 7.0 respectively. The CUDA version is 11.1.

There is not currently a hardware counter that only counts bank conflicts; other arbitration conflicts that result in a replayed wavefront are included. Summing L1 Wavefronts Shared Excessive on the Source View page is the best method to count only bank conflicts. The L1 Wavefronts Shared Ideal column accounts for the additional wavefronts required by wider data types (e.g. 64-bit or 128-bit accesses).
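To illustrate the wider-data-type point (a hedged sketch, not code from this thread): a warp of 32 threads accessing 128-bit values touches 512 bytes of shared memory, more than one 32-bank x 4-byte (128-byte) wavefront can deliver, so several wavefronts per request are expected and are already counted in the Ideal column; only wavefronts beyond Ideal (the Excessive column) indicate conflicts.

    // Hypothetical kernel: conflict-free 128-bit shared-memory accesses.
    __global__ void wide_load(const float4* in, float4* out)
    {
        __shared__ float4 buf[32];

        buf[threadIdx.x] = in[threadIdx.x];    // assumes blockDim.x == 32
        __syncthreads();

        // Each warp moves 32 * 16 = 512 bytes, so at least four wavefronts per
        // request are issued even though no bank conflicts occur. These extra
        // wavefronts show up in Ideal, not in Excessive.
        out[threadIdx.x] = buf[threadIdx.x];
    }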

Sorry to reopen the old issue, but I’m encountering a similar case with the l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st metric. It contradicts the L1 Wavefronts Shared Excessive metric reported by Nsight Compute.

@Greg from your earlier comment

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum only counts shared memory. There are specific shared memory atomic instructions that will result in a non-deterministic number of store wavefronts.

In my case I’m not using any atomic or shuffle instructions; the only operation on shared memory is a store from a global read. I’m confused as to why there’s a discrepancy.