128-bit access bank conflict

Hi, I am loading 32-bit float data from global to shared like this

The shared memory is defined as a 16 x 8 row-major matrix. Each thread in a warp loads 4 floats from global and stores them to shared. Here the shared memory banks repeat every 4 rows, i.e. T0-T7 are accessing different banks.

Basically I am getting a lot of bank conflicts from this store pattern.


According to page 66 of these slides https://on-demand.gputechconf.com/gtc/2018/presentation/s81006-volta-architecture-and-performance-optimization.pdf , this store pattern should be bank-conflict free, right?

But according to the documentation mentioned here: float4 Shared memory doesn't yield bank conflict according to nvprof when it should - #4 by rs277, it states:

**128-Bit Accesses:** The majority of 128-bit accesses will cause 2-way bank conflicts, even if no two threads in a quarter-warp access different addresses belonging to the same bank. Therefore, to determine the ways of bank conflicts, one must add 1 to the maximum number of threads in a quarter-warp that access different addresses belonging to the same bank.

I am struggling a bit to understand what this means. Does this mean that in this case T0-T7 are still causing a 2-way bank conflict? How do I resolve the bank conflicts in this case? Thanks!

I am using a V100 GPU.


The thing you have excerpted only applies to cc2.0 devices, which you can discover by carefully reading the thread you linked. The idea that 128-bit per thread accesses automatically give rise to bank conflicts is not applicable anymore, AFAIK.

A V100 is a cc7.0 device, not a cc2.0 device.

The diagram you have provided does not help me to deduce your access pattern. As near as I can tell, the picture on the left is identical to the picture on the right, other than the “global” and “shared” titles.

If T0-T7 are accessing different banks, for 128 bit load per thread, you should not witness a bank conflict. A simple test case should not be difficult to construct and provide.
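
For instance, a minimal sketch of such a test case, assuming the 16 x 8 row-major float tile and one float4 store per thread described above (all names and sizes here are illustrative assumptions), might look like the following. One could run it under Nsight Compute and inspect the shared-memory bank-conflict counters for the store.

```cpp
// Illustrative sketch only: one warp stores one float4 per thread into a
// 16 x 8 row-major shared tile. With this layout, each group of 8 threads
// (one phase) touches 32 distinct banks, so the store should be conflict-free.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void float4_store_test(const float4 *gin, float *gout)
{
    __shared__ __align__(16) float tile[16][8];   // 16 x 8 row-major tile

    int t   = threadIdx.x;        // 0..31, one warp
    int row = t / 2;              // two float4 (8 floats) per row
    int col = (t % 2) * 4;

    float4 v = gin[t];            // 128-bit global load
    // 128-bit shared store; the hardware splits the warp-wide request into 4 phases
    *reinterpret_cast<float4 *>(&tile[row][col]) = v;
    __syncthreads();

    gout[t] = tile[row][col];     // keep the store from being optimized away
}

int main()
{
    float4 *gin;
    float  *gout;
    cudaMalloc(&gin,  32 * sizeof(float4));
    cudaMalloc(&gout, 32 * sizeof(float));
    float4_store_test<<<1, 32>>>(gin, gout);
    cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```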

Deducing bank conflicts from the profiler can be a non-trivial matter. There are various forum posts covering this idea; a Google search will turn some of them up for you.

Edited: Post deleted due to incorrect understanding.

On a load involving 128 bits per thread, the load request will be split up by the GPU into 4 phases, what used to be called transactions. Those 4 phases are executed independently from the standpoint of shared memory as it pertains to bank conflicts, and bank conflicts are only relevant/applicable per transaction or per phase.

Thanks, my understanding was that these phases were conflicts. I’ll amend my post.

So if the load is done in 4 phases, the performance impact will presumably be the same as for conflicts?

The load is done in 4 phases because shared memory has a bandwidth of 32 bits per bank per cycle (ignoring certain kepler variants/situations).

With a 128-bit load per thread, considered warp-wide, that is a total of 512 bytes being requested. Since shared memory can serve up only 128 bytes per cycle maximum (deducible from the previous statement), then the load request has to be split into 4 transactions. Viewed from a bandwidth perspective, that load is running at the full bandwidth of shared memory, whether you look at an individual transaction, or the request as a whole (i.e. all 4 transactions). Therefore I would not refer to it as bank-conflicted. There are no bank conflicts, and the performance is not the same as if it were bank-conflicted.
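
Spelled out, assuming the usual 32 banks of 4 bytes each:

32 threads × 16 bytes/thread = 512 bytes requested per warp
32 banks × 4 bytes/bank/cycle = 128 bytes/cycle of shared memory bandwidth
512 bytes ÷ 128 bytes/cycle = 4 phases (wavefronts)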


Thanks a lot @Robert_Crovella @rs277 for the clarification.

Can I ask one more thing? It looks like this kernel has a bottleneck at the shared store, with 95% long scoreboard stalls. Looking at the profiling guide, it states:

Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory.

This is a tiled matrix multiplication kernel, and I think at this line it is loading a single tile from global to shared. I am assuming that at this point the data has been loaded from global memory into registers, and somehow it is taking a lot of time for the data in the registers to be stored into shared memory? I just don't understand why this is the case. Why are the warps stalling at the shared store but not at the global load?

Can I ask how one would go about optimizing for this? Is there anywhere I can read more about this? Thanks!

That isn’t the right way to think about it. A clue for proper analysis is given in your excerpted text:

In C++ source code, a shared store from a global load might look something like:

smem[threadIdx.x] = gdata[idx];

When we look at the corresponding SASS, there could be an instruction sequence like:

LDG R0, [R4.64]
STS [R5], R0

So R4 contains the 64-bit global address, R5 contains the shared address, and the R0 register is what the global data gets loaded into, before it will be stored in shared memory.

A global load such as the first SASS instruction above will never stall, by itself. Assuming the address in R4 has been calculated, the instruction can and will be issued. The fact that the data does not immediately appear in R0 is not relevant at this point.

However it is relevant for the next instruction. Global loads do not occur in 1 clock, so the next instruction cannot be immediately issued, because it has a dependency on the contents of R0. This dependency, because it involves a global load, is tracked via a mechanism in the GPU referred to as “long scoreboard”.

So the instruction that stalled in your case is the STS instruction. But the reason for the stall is that the previous instruction, although already issued, has not completed its work yet.

Things to think about:

  • a global load is generally going to produce such dependencies and stalls. A certain amount of that is unavoidable.
  • if the profiler is indicating that this is a large issue (i.e. the source page shows that the line of code in question has a preponderance of all sampled stalls) then you might want to consider if it can be improved
  • improving the situation generally involves the usual suggestions for improving global loads: make sure you are using global memory efficiently, see if there is other work you can do while the data is being loaded (e.g. double-buffered loading), and try to expose additional parallelism so the GPU has other work to do (even if it is launching more global loads) while it is waiting for those global loads to complete. A sketch of the double-buffering idea follows below.
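
As a rough illustration of that double-buffering / prefetch idea, a tiled matrix multiply might stage the next tile in registers while computing on the tile already in shared memory. This is only a sketch; the tile size, the names A, B, C, N, and the assumption that N is a multiple of TILE are illustrative, not taken from the kernel being discussed:

```cpp
// Illustrative sketch: prefetch the next tile into registers, do the FMAs on
// the tile already in shared memory, then commit the registers to shared.
// TILE, A, B, C, N are assumed names; N is assumed to be a multiple of TILE,
// and the grid is assumed to be (N/TILE, N/TILE) blocks of (TILE, TILE) threads.
#define TILE 16

__global__ void matmul_prefetch(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;
    float acc = 0.0f;

    // Prologue: global loads (LDG) for the first tile.
    float a = A[row * N + tx];
    float b = B[ty * N + col];

    for (int t = 0; t < N / TILE; ++t) {
        // Commit the prefetched tile to shared memory (STS). This is where a
        // long-scoreboard stall shows up if the LDG has not completed yet.
        As[ty][tx] = a;
        Bs[ty][tx] = b;
        __syncthreads();

        // Issue the global loads for the NEXT tile now; the values are not
        // needed until the next iteration.
        if (t + 1 < N / TILE) {
            a = A[row * N + (t + 1) * TILE + tx];
            b = B[((t + 1) * TILE + ty) * N + col];
        }

        // Independent work while those loads are in flight.
        for (int k = 0; k < TILE; ++k)
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    C[row * N + col] = acc;
}
```

The point of the arrangement is that the STS at the top of each iteration depends on the LDG issued in the previous iteration, and the inner FMA loop in between gives that load time to complete, which shrinks the long-scoreboard stall at the shared store.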

Thanks so much @Robert_Crovella for the explanation. It really cleared up my confusion.

I have a question about this. Does the “phases” mean “wavefronts”?

In the older profiler terminology it used to be called a transaction. In the new profiler terminology it is not called a transaction; I believe the correct term is wavefront.