128-bit access bank conflict

Hi, I am loading 32-bit float data from global to shared like this

The shared memory is defined as a 16 x 8 row-major matrix. Each thread in a warp loads 4 floats from global and stores them to shared. Here the shared memory banks repeat every 4 rows, i.e. T0-T7 are accessing different banks.

Basically I am getting a lot of bank conflicts from this store pattern.


According to page 66 of these slides https://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf , this store pattern should be bank-conflict free, right?

But according to the documentation mentioned here: float4 Shared memory doesn't yield bank conflict according to nvprof when it should - #4 by rs277, it states:

**128-Bit Accesses:** The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two threads in a quarter-warp access different addresses belonging to the same bank. Therefore, to determine the ways of bank conflicts, one must add 1 to the maximum number of threads in a quarter-warp that access different addresses belonging to the same bank.

I am struggling a bit to understand what this means. Does this mean that in this case T0-T7 are still causing a 2-way bank conflict? How do I resolve the bank conflicts in this case? Thanks!

I am using a V100 GPU.


The thing you have excerpted only applies to cc2.0 devices, which you can discover by carefully reading the thread you linked. The idea that 128-bit per thread accesses automatically give rise to bank conflicts is not applicable anymore, AFAIK.

A V100 is a cc7.0 device, not a cc2.0 device.

The diagram you have provided does not help me to deduce your access pattern. As near as I can tell, the picture on the left is identical to the picture on the right, other than the “global” and “shared” titles.

If T0-T7 are accessing different banks, for 128 bit load per thread, you should not witness a bank conflict. A simple test case should not be difficult to construct and provide.
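
For instance, a minimal sketch of such a test case, assuming the 16 x 8 row-major float tile and one float4 store per thread described above (all names and sizes here are illustrative assumptions), might look like the following. One could run it under Nsight Compute and inspect the shared-memory bank-conflict counters for the store.

```cpp
// Illustrative sketch only: one warp stores one float4 per thread into a
// 16 x 8 row-major shared tile. With this layout, each group of 8 threads
// (one phase) touches 32 distinct banks, so the store should be conflict-free.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void float4_store_test(const float4 *gin, float *gout)
{
    __shared__ __align__(16) float tile[16][8];   // 16 x 8 row-major tile

    int t   = threadIdx.x;        // 0..31, one warp
    int row = t / 2;              // two float4 (8 floats) per row
    int col = (t % 2) * 4;

    float4 v = gin[t];            // 128-bit global load
    // 128-bit shared store; the hardware splits the warp-wide request into 4 phases
    *reinterpret_cast<float4 *>(&tile[row][col]) = v;
    __syncthreads();

    gout[t] = tile[row][col];     // keep the store from being optimized away
}

int main()
{
    float4 *gin;
    float  *gout;
    cudaMalloc(&gin,  32 * sizeof(float4));
    cudaMalloc(&gout, 32 * sizeof(float));
    float4_store_test<<<1, 32>>>(gin, gout);
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```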

Deducing bank conflicts from the profiler can be a non-trivial matter. There are various forum posts covering this idea; a Google search will turn some of them up for you.

Edited: Post deleted due to incorrect understanding.

On a load involving 128 bits per thread, the load request will be split up by the GPU into 4 phases, what used to be called transactions. Those 4 phases are executed independently from the standpoint of shared memory as it pertains to bank conflicts, and bank conflicts are only relevant/applicable per transaction or per phase.

Thanks, my understanding was that these phases were conflicts. I’ll amend my post.

So if the load is done in 4 phases, the performance impact will presumably be the same as for conflicts?

The load is done in 4 phases because shared memory has a bandwidth of 32 bits per bank per cycle (ignoring certain kepler variants/situations).

With a 128-bit load per thread, considered warp-wide, that is a total of 512 bytes being requested. Since shared memory can serve up only 128 bytes per cycle maximum (deducible from the previous statement), then the load request has to be split into 4 transactions. Viewed from a bandwidth perspective, that load is running at the full bandwidth of shared memory, whether you look at an individual transaction, or the request as a whole (i.e. all 4 transactions). Therefore I would not refer to it as bank-conflicted. There are no bank conflicts, and the performance is not the same as if it were bank-conflicted.
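
Spelled out, assuming the usual 32 banks of 4 bytes each:

32 threads × 16 bytes/thread = 512 bytes requested per warp
32 banks × 4 bytes/bank/cycle = 128 bytes/cycle of shared memory bandwidth
512 bytes ÷ 128 bytes/cycle = 4 phases (wavefronts)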


Thanks a lot @Robert_Crovella @rs277 for the clarification.

Can I ask one more thing? It looks like this kernel has a bottleneck at the shared store, with 95% long scoreboard stalls. Looking at the profiling guide, it states:

Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory.

This is a tiled matrix multiplication kernel, and I think at this line it is loading a single tile from global to shared. I am assuming that at this point the data has been loaded from global memory into registers, and somehow it is taking a lot of time for the data in the registers to be stored into shared memory? I just don't understand why this is the case. Why are the warps stalling at the shared store but not at the global load?

Can I ask how one would go about optimizing for this? Is there anywhere I can read more about this? Thanks!

That isn’t the right way to think about it. A clue for proper analysis is given in your excerpted text:

In C++ source code, a shared store from a global load might look something like:

smem[threadIdx.x] = gdata[idx];

When we look at the corresponding SASS, there could be an instruction sequence like:

LDG R0, [R4.64]
STS [R5], R0

So R4 contains the 64-bit global address, R5 contains the shared address, and the R0 register is what the global data gets loaded into, before it will be stored in shared memory.

A global load such as the first SASS instruction above will never stall, by itself. Assuming the address in R4 has been calculated, the instruction can and will be issued. The fact that the data does not immediately appear in R0 is not relevant at this point.

However it is relevant for the next instruction. Global loads do not occur in 1 clock, so the next instruction cannot be immediately issued, because it has a dependency on the contents of R0. This dependency, because it involves a global load, is tracked via a mechanism in the GPU referred to as “long scoreboard”.

So the instruction that stalled in your case is the STS instruction. But the reason for the stall is that the previous instruction, although already issued, has not completed its work yet.

Things to think about:

  • a global load is generally going to produce such dependencies and stalls. A certain amount of that is unavoidable.
  • if the profiler is indicating that this is a large issue (i.e. the source page shows that the line of code in question has a preponderance of all sampled stalls) then you might want to consider if it can be improved
  • improving the situation generally involves the usual suggestions for improving global loads: make sure you are using global memory efficiently, see if there is other work you can do while the data is being loaded (e.g. double-buffered loading), and try to expose additional parallelism so the GPU has other work to do (even if it is launching more global loads) while it is waiting for those global loads to complete. A sketch of the double-buffering idea follows below.
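
As a rough illustration of that double-buffering / prefetch idea, a tiled matrix multiply might stage the next tile in registers while computing on the tile already in shared memory. This is only a sketch; the tile size, the names A, B, C, N, and the assumption that N is a multiple of TILE are illustrative, not taken from the kernel being discussed:

```cpp
// Illustrative sketch: prefetch the next tile into registers, do the FMAs on
// the tile already in shared memory, then commit the registers to shared.
// TILE, A, B, C, N are assumed names; N is assumed to be a multiple of TILE,
// and the grid is assumed to be (N/TILE, N/TILE) blocks of (TILE, TILE) threads.
#define TILE 16

__global__ void matmul_prefetch(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;

    // Prologue: global loads (LDG) for the first tile.
    float a = A[row * N + tx];
    float b = B[ty * N + col];

    for (int t = 0; t < N / TILE; ++t) {
        // Commit the prefetched tile to shared memory (STS). This is where a
        // long-scoreboard stall shows up if the LDG has not completed yet.
        As[ty][tx] = a;
        Bs[ty][tx] = b;
        __syncthreads();

        // Issue the global loads for the NEXT tile now; the values are not
        // needed until the next iteration.
        if (t + 1 < N / TILE) {
            a = A[row * N + (t + 1) * TILE + tx];
            b = B[((t + 1) * TILE + ty) * N + col];
        }

        // Independent work while those loads are in flight.
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    C[row * N + col] = acc;
}
```

The point of the arrangement is that the STS at the top of each iteration depends on the LDG issued in the previous iteration, and the inner FMA loop in between gives that load time to complete, which shrinks the long-scoreboard stall at the shared store.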

Thanks so much @Robert_Crovella for the explanation. It really cleared up my confusion.

I have a question about this. Does the “phases” mean “wavefronts”?

In the older profiler terminology it used to be called a transaction. In the new profiler terminology it is not called a transaction; I believe the correct term is wavefront.