TL;DR: I need to know whether CUDA 9 will do so much thread rearrangement that the things I am doing to avoid shared memory bank conflicts between threads in the same warp will be meaningless.
So, I’ve got a pile of C++ code that neatly arranges a bunch of calculations for a particle simulation, essentially compiling a list of things for the GPU to do. For example, there are “dihedral angle” terms that apply to groups of four particles. The identities of the particles don’t change over the course of the simulation, so each term can be applied by simply getting the coordinates of its four particles and then applying stiffness and phase angle parameters that are also known from the outset of the simulation. My C++ code arranges things so that all terms involving some or all of the same atoms are grouped into “work units.” The kernel that executes a work unit reads coordinates for the atoms it needs, then crunches through a list of tasks, each of which is a set of 32 dihedral angles or other terms applying to the atoms that were just imported. Forces on each particle are accumulated in shared memory via atomicAdd() ops. (Obviously, some tasks are not completely filled, but most work units end up importing 128 or fewer atoms to do exactly 256 dihedral angles, then write back forces on those atoms; that beats importing 256 * 4 = 1024 atoms and writing back 1024 force contributions individually to global memory.)
With that background, I am trying to improve performance further, this time by arranging each task to avoid shared memory bank conflicts as much as possible. Looking at the work units I’ve got, each task (a warp computing 32 dihedral angles) suffers 80 to 100 conflicts as it reads coordinates and writes forces for particles A, B, C, and D of the 32 dihedrals handled by its 32 threads. The best case is zero conflicts: every thread accesses a different particle stored on a different bank as it gets coordinates or writes forces for particles A, B, C, and D. The worst I could possibly do is about 120 conflicts: all 32 threads looking at one of two choices for particle A, two choices for particle B, two for C, and two for D, which is 30 conflicts in accessing each particle, 120 overall. The corner case of ALL threads computing dihedrals for the same four particles would actually be efficient, since there are intrinsics for broadcasting one piece of shared data to an entire warp provided that the entire warp needs it. But I’m still seeing conflict counts closer to “worst situation possible” than “smooth sailing.”
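For what it’s worth, the counting model behind those numbers can be written out as a small C++ helper. This is a hypothetical sketch, not my actual profiling code; it assumes 32 four-byte-wide banks (particle slot i lives on bank i % 32) and treats a full-warp read of a single particle as a free broadcast.

```cpp
#include <set>

// Conflicts for one warp-wide shared-memory access under the counting model
// from the text: 32 threads each touch one particle slot; slot i sits on
// bank i % 32. A whole-warp read of ONE slot is a free broadcast; otherwise
// the penalty is the number of threads beyond the number of distinct banks
// the warp manages to spread across.
int warp_conflicts(const int idx[32]) {
  std::set<int> particles(idx, idx + 32);
  if (particles.size() == 1) return 0;   // full-warp broadcast, no penalty
  std::set<int> banks;
  for (int p : particles) banks.insert(p % 32);
  return 32 - static_cast<int>(banks.size()); // 0 when 32 distinct banks hit
}
```

Under this model, 32 threads split between two choices for a particle score 30 conflicts per access, so four such accesses (A, B, C, D) give the 120-conflict worst case, while 32 distinct banks give zero.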
I think I can get rid of a lot of them by rearranging the dihedrals (or other terms) computed by each task. If each work unit has eight tasks that apply to 128 atoms overall, they can surely swap terms between them so that, when each task executes on the GPU, it draws on as many separate memory banks as possible. (Again, the tasks are all set up on the CPU at the outset of the simulation; the GPU is just following a script that says “import atoms 132, 763, 209, 1093, … ==> apply dihedral between imported atoms 0, 5, 6, 7, dihedral between 29, 45, 1, 105, dihedral between …”)
My QUESTION, then, is: is this optimization futile in the face of CUDA 9? Will CUDA 9 just randomly regroup threads so that anything I set up to avoid shared memory bank conflicts among threads in the same warp breaks down?