Warps, Random Shared Memory Access, Bank Conflicts

Can a developer expect that warps are constructed of consecutive threads in groups whose base thread index is some multiple of 32?

I am trying to implement a “Multi-population Differential Evolution” algorithm on the 8800 GTX GPU. This DE algorithm requires that, within each thread block, the code be able to access shared memory (containing a subpopulation of test value sets, all floats) in some random fashion (selection without replacement). The challenge is to do so without bank conflicts.

From the CUDA programming guide (sec. 5.1.2.4), I can see that with the commonly used linear addressing scheme, no two threads in a block that access the same bank are closer together in their thread indexes than 16.

In fact, all “same-bank” accesses occur periodically every 16 threads, so that no consecutive group of 16 threads (regardless of starting thread index) accesses the same shared memory bank at the same time. I could use this same periodicity and achieve conflict-free shared memory access by repeating the same random permutation of thread addressing every 16 threads (each like figure 5-1 of the CUDA guide). But it would be better if I could assume that the base thread index of any warp is always some multiple of 32. I would then not have to use the same permutation repeatedly: I would know which threads are grouped with each other and which will never appear together in the same warp, and I could access the shared memory in a more truly random fashion without bank conflicts.
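To make the idea concrete, a rough sketch of the “same permutation every 16 threads” approach might look like the following (the kernel name, the perm_in array, and the rows x 16 row-major layout are just assumptions for illustration, not my actual code). On compute 1.x parts like the 8800 GTX shared memory has 16 banks, so giving each lane of a half-warp its own permuted bank keeps each access conflict-free:

```cpp
// Hypothetical sketch: each half-warp of 16 threads applies the same
// permutation of the 16 banks, so all 16 threads hit distinct banks.
__global__ void permutedAccess(const float *pop_in, const int *perm_in,
                               float *out, int rows)
{
    extern __shared__ float pop[];   // rows * 16 floats, row-major
    __shared__ int perm[16];         // one random permutation of 0..15

    if (threadIdx.x < 16)
        perm[threadIdx.x] = perm_in[threadIdx.x];

    // Load the subpopulation: consecutive threads touch consecutive
    // addresses, which is the conflict-free linear pattern from the guide.
    for (int i = threadIdx.x; i < rows * 16; i += blockDim.x)
        pop[i] = pop_in[blockIdx.x * rows * 16 + i];
    __syncthreads();

    int lane = threadIdx.x % 16;     // position within the half-warp
    int bank = perm[lane];           // permuted bank, unique per lane

    // With a stride of 16 floats per row, column 'bank' always lives in
    // bank 'bank', so the half-warp never collides even though each
    // thread's column looks random.
    float acc = 0.0f;
    for (int r = 0; r < rows; ++r)
        acc += pop[r * 16 + bank];

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```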

If anyone would like to suggest an alternate scheme for “bank conflict free” random selection of shared memory value sets, that would also be appreciated.

Section 2.2.1 of the programming guide:
“Each thread is identified by its thread ID, which is the thread number within the block. To help with complex addressing based on the thread ID, an application can also specify a block as a two- or three-dimensional array of arbitrary size and identify each thread using a 2- or 3-component index instead. For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx) and for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).”

So you are safe assuming that your warps start on thread IDs that are multiples of 32, as long as the block size is a multiple of 32.
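As a small illustration of that formula (the helper names below are made up for the example), the linear thread ID from the guide maps onto warps like this:

```cpp
// Sketch: linear thread ID and warp index for a 2D block of size (Dx, Dy).
// If the block size is a multiple of 32, every warp starts at a linear
// thread ID that is a multiple of 32.
__device__ int linearThreadId()
{
    return threadIdx.x + threadIdx.y * blockDim.x;   // x + y*Dx, per the guide
}

__device__ int warpIndex()   // warp 0 covers IDs 0..31, warp 1 covers 32..63, ...
{
    return linearThreadId() / 32;
}

__device__ int laneIndex()   // position of the thread within its warp
{
    return linearThreadId() % 32;
}
```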

My advice is to try your algorithm in the simplest way first, with the bank conflicts. Depending on the structure of your algorithm, they may not affect the overall performance as much as you think. Global memory access patterns are much more important.

You’re probably right about first writing the code without regard to bank conflicts. But it may be desirable to avoid these conflicts in a final version. It depends on how important the “selection without replacement” component is to the algorithm, and how efficiently a random number generator could be altered to select banks without repetition plus a memory location within each bank. Otherwise, it may be better to simply generate the shared memory index without concern for either bank conflicts or repetition.
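For comparison, that simpler alternative could be as little as a per-thread generator that draws an index directly; the LCG below is just a stand-in for illustration, not the RNG from my code:

```cpp
// Stand-in sketch: each thread draws a random member index with its own LCG
// state, accepting whatever bank conflicts and repeated selections result.
__device__ unsigned int lcgNext(unsigned int &state)
{
    state = state * 1664525u + 1013904223u;   // common LCG constants
    return state;
}

__device__ int randomMember(unsigned int &state, int popSize)
{
    return lcgNext(state) % popSize;          // may repeat, may collide on a bank
}
```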

It could also be that the price you pay for bank conflicts fits entirely into the global memory “latency hiding” window. You can do a LOT of calculations in between global memory accesses and still be limited by global memory bandwidth. It has been my experience that global memory access patterns should be the first thing to optimize.

I probably need to look more at global memory access issues. It’s my understanding that the GPU swaps out warps (or blocks) that are waiting for global memory data. I’m thinking that PERHAPS (a big if) each warp that requests global memory data doesn’t have to wait for other warps’ global memory requests to return before its own request is issued. Thus it’s possible that, after a latency period, the global data for multiple warps’ requests could arrive in rapid succession. Currently my kernel code accesses global memory before any calculations are done. All iterations of the algorithm (between DE “migrations”) use shared memory. So it seems that bank conflicts would simply add to the compute time. Perhaps I need to restructure my code.
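The overall shape of the kernel is roughly like the skeleton below; the names and the rows x 16 layout are placeholders for illustration, not my actual code:

```cpp
// Rough skeleton: one coalesced global read per block, many DE iterations
// entirely in shared memory, one coalesced global write at migration time.
__global__ void deGenerations(float *population, int rows, int iterations)
{
    extern __shared__ float pop[];   // rows * 16 floats for this block's subpopulation

    // One coalesced global read: consecutive threads load consecutive floats.
    for (int i = threadIdx.x; i < rows * 16; i += blockDim.x)
        pop[i] = population[blockIdx.x * rows * 16 + i];
    __syncthreads();

    // Long compute phase: only shared memory and registers are touched here,
    // so global latency is paid once per migration, not once per iteration.
    for (int it = 0; it < iterations; ++it) {
        // ... mutation / crossover / selection on pop[] ...
        __syncthreads();
    }

    // One coalesced global write back.
    for (int i = threadIdx.x; i < rows * 16; i += blockDim.x)
        population[blockIdx.x * rows * 16 + i] = pop[i];
}
```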

Actually, your code design sounds perfect for a CUDA application. If you can really do one global memory read and then a loooonng computation entirely in shared memory, do it. All the multiprocs will be busy processing (after the initial wait for the global memory read) and will keep the device fully occupied.

In this case, I would agree with your assessment. Bank conflicts will probably increase your running time. I still suggest, however, that you try versions of the code with and without bank conflicts. My experience in CUDA is that there is no substitute for playing around to find the optimal implementation.