CUDA9 and memory bank conflicts

TL,DR: I need to know whether CUDA9 will do so much thread re-arrangement that things I am trying to do to avoid shared memory bank conflicts between threads in the same warp will be meaningless.

Read on…

So, I’ve got a pile of C++ code that neatly arranges a bunch of calculations for a particle simulation code, essentially compiling a list of things for the GPU to do. For example, there are “dihedral angle” terms that apply to groups of four particles–the identities of the particles don’t change over the course of the simulation, so the term can be applied by simply getting the coordinates of each particle and then applying stiffness and phase angle parameters which are also known from the outset of the simulation. My C++ code arranges things so that all terms involving some or all of the same atoms can be grouped into “work units.” The kernel that executes a work unit will read coordinates for the atoms it needs, then crunch through a list of tasks, each of which is a set of 32 dihedral angles or other terms applying to the atoms that were just imported. Accumulation of forces on each particle occurs in shared via atomicAdd() ops. (Obviously, some tasks are not completely filled, but most work units end up importing 128 or fewer atoms to do exactly 256 dihedral angles, then write back forces on those atoms–better than importing 256*4 = 1024 atoms, then writing back 1024 force contributions individually to global memory.)

With that background, I am trying to improve the performance further, this time by arranging each task to avoid memory bank conflicts as much as possible. Looking at the work units I’ve got, each task in the example (a warp computed 32 dihedral angles) suffers 80 to 100 conflicts as it reads/writes coords/forces for particles A, B, C, D of the 32 dihedrals handled by its 32 threads. The best situation would be zero conflicts (every thread accesses a different particle stored on a different bank as it gets coords or writes forces for particles A, B, C, and D), and the worst I could possibly do is about 120 conflicts (all 32 threads looking at one of two choices for particle A, two choices for particle B, two choices for C, and two choices for D–30 conflicts in accessing each particle, 120 overall). The corner case of ALL threads computing dihedrals for the same four particles would actually be efficient thanks to there being intrinsics for broadcasting one piece of shared data to an entire warp provided that the entire warp needs it. But, I’m still seeing numbers of conflicts closer to “worst situation possible” than “smooth sailing.”

I think I can get rid of a lot of them by re-arranging the dihedrals (or other terms) computed by each task. If each work unit has eight tasks that apply to 128 atoms overall, they can surely swap terms between them so that, when each task gets executed on the GPU, it’s drawing on as many separate memory banks as possible. (Again, the tasks are all set up on the CPU at the outset of the simulation–the GPU is just following a script that says “import atoms 132, 763, 209, 1093, … ==> apply dihedral between imported atoms 0 5 6 7, dihedral between 29, 45, 1, 105, dihedral between …”)

My QUESTION is, then, is this optimization futile in the face of CUDA9? Will CUDA9 just randomly regroup threads so that anything I try to set up to avoid shared bank conflicts among threads in the same warp will break down?


I think adjustments to the mental model of GPU thread execution may be in order. The CUDA documentation spells it out in sufficient detail, IMHO. There is no “random regrouping” of threads in the same warp with currently shipping hardware (no idea about Volta). All currently active threads in the warp execute in lockstep, and if any of them try to access the same shared memory bank, there is serialization of access with accompanying stall.

The way to deal with shared memory bank conflicts is to make sure the threads within the same warp all access different banks. One canonical strategy is to pad row (or column) length of a 2D data block to a (relatively) prime number of elements, e.g. by using a 17x16 matrix instead of a 16x16 matrix.

Does the CUDA profiler indicate that shared memory conflicts are in fact a bottleneck in your code?

OK that’s very helpful. I have given a lot of thought to padding and I do use it in various situations, but it won’t be as effective in this particular case. As I outlined, each work unit will read atoms from global memory, like so:

Read 5, 7, 89, 209, 210, 211, 215, 217, 218, 219, … , 1009, 1010, 1011, 1023
(I am trying to make that somewhat realistic–there are stretches of consecutive atoms but also breaks as the C++ code ‘compiles’ a list of terms that it can cram into a single work unit with space for 128 atoms.)

Then, the work unit looks at its list of imported atoms: 0 (5 in global memory), 1 (7 in global memory), 2 (that’s 89), etc. It has a list of dihedrals running on constants Kstiff, Phase, and Nperiod, all unique and therefore stored in the order they will be executed in separate global memory arrays (fully coalesced access to those parameters), but the coordinates for each dihedral could map to IMPORTED atoms like this:

[ 0 1 5 6 ]
[ 0 1 7 8 ]
[ 0 1 7 8 ]
[ 0 1 5 9 ]
[ 0 1 5 9 ]
[ 0 1 6 10 ]

[ 60 61 62 65 ]
[ 60 61 63 66 ]

There can be multiple terms that apply to exactly the same four atoms (with different constants), terms that apply to the same first two atoms but different third and fourth atoms, etc. So what I’m considering doing is making sure that warps don’t get saddled with terms that always request the same atoms (because if two threads request the same atom, it’s a guaranteed bank conflict). After that I can change the order in which atoms are imported to minimize the conflicts when two distinct atoms are nonetheless stored by the same bank.

As for memory padding, there is another part of the code where I re-arrange coordinates and properties from four separate arrays (all X coord for each atom, all Y coord, all Z coord, all of one property for each atom) to a unified array ordered as [X Y Z property X Y Z property … ]. That makes use of shared memory padding to transition between each format and the bank utilization is perfect.

Not having much success with the CUDA profiling toolkit, but I’ll give it another shot–the compilation of this code, which predates me, hasn’t made it very easy to use some of the tools. Fortunately in this little project the bulk of the coding happens in C++ which I can debug with gdb and valgrind.

I would strongly encourage you to get CUDA profiling set up. When optimizing by making assumptions about code behavior, i.e. flying blind, it is easy to optimize the wrong things. You don’t necessarily need the visual profiler, the non-visual command-line profiler nvprof should be sufficient to tell you whether there is a significant impact from shared-memory bank conflicts. My best guess is: there is not.

I can’t claim to understand how the code really work based on the description above, but I think your general strategy is sound: As you pull data items from global memory you re-arrange them in shared memory such as too minimize the number of bank conflicts in subsequent processing. If the number of potential conflicts is limited, and provided you have enough shared memory left (a big ‘if’, obviously, as that space it often at a premium), you could even think about duplicating the shared memory storage so as to avoid conflicts further (trading off time for space). And the two shared memory copies could even use different arrangements of the same data, if necessary.

Yep, thought about duplicating–it definitely helped in some early CUDA programming I did in the Spring. Space is at a premium and the amount of shared memory I allocate for this kernel is just enough to let me fit 8*128 threads inside the 48k SM limit.

I will make a new effort to get myself situated with the profiler, rather than “flying blind.”

Well, I’ve got the code taking the number of bank conflicts that each warp experiences from 80-100 down to 10 or 15. Furthermore I’ve optimized to eliminate many bank conflicts in one particular thread. With the lockstep performance, do I presume the following correctly?

Case A: eight triplets of threads in the warp experience bank conflicts such that they each argue with the other members of their triplet when accessing the memory bank they need (eight memory banks each have three threads pulling on them). Eight 3-way bank conflicts.

Case B: only one triplet of threads in the warp causes a bank conflict like that in Case A (only one memory bank is getting triple-teamed). One 3-way bank conflict.

Case C: ten threads in the same warp all pull on the same memory bank to get the information they need at the same time. One 10-way bank conflict.

I presume that Case A and Case B are equally bad, while Case C is worse than either A or B.

Excellent progress. How much of a positive performance impact are you seeing from the bank conflict reductions?

Per cycle, each shared memory bank can only produce one data item. If multiple threads in the warp try to grab the same item, they take turns until they are all served. So to the best of my understanding:

A: 3 stall cycles
B: 3 stall cycles
C: 10 stall cycles

In other words, the multiplicity of the shared memory conflict is the main thing to watch out for.

Not sure of the speedup just yet, have to do a bit more work. But, I’ve got the work units in their first implementation passing 150 simulation test cases, and this is just re-arranging the order of objects they contain, so it won’t be long. It did confirm for me that the multiplicities I had were coming in 10s and 12s, now they’re all single conflicts, maybe a case of three threads converging somewhere in the pile. I’m also working to add more things to the work units, that is specific non-bonded electrostatic interactions, which are individually very cheap but could really get fouled up if we had 5-6 fold conflicts hitting each of them. The general non-bonded interactions of course get done in a giant loop, but there are special cases of them, touch-ups if you will, that need to be done separately between known pairs of atoms.

This is taking the code down to its foundations and rebuilding it :-)