Can a developer expect that warps are constructed of consecutive threads in groups whose base thread index is some multiple of 32?
I am trying to implement a “Multi-population Differential Evolution” algorithm on the 8800 GTX GPU. This DE algorithm requires that, within each thread block, the code be able to access shared memory (containing a subpopulation of test value sets, all floats) in some random fashion (selection without replacement). The challenge is to do so without bank conflicts.
From the CUDA Programming Guide (the section on shared memory bank conflicts), I can see from the commonly used linear addressing scheme that no two threads in a block that access the same bank are closer together in thread index than 16.
In fact, all same-bank accesses occur periodically every 16 threads, so any consecutive group of 16 threads (regardless of starting thread index) never accesses the same shared memory bank at the same time. I could exploit this periodicity and achieve conflict-free shared memory access by repeating the same random permutation of thread addressing every 16 threads (each like Figure 5.1 of the CUDA guide). But it would be better if I could assume that the base thread index of any warp is always a multiple of 32. I would then not have to reuse the same permutation: I would know which threads are grouped with each other and which will never appear together in the same warp, and I could access shared memory in a more truly random fashion without bank conflicts.
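For concreteness, here is a minimal sketch of the repeated-permutation fallback described above, not the actual DE kernel. The names `permutedRead`, `d_perm`, `NP`, and `DIM` are hypothetical, and the subpopulation layout (individual-major, NP a multiple of 16) is an assumption made for illustration:

```cuda
// Minimal sketch: every group of 16 threads reuses the same random
// permutation of 0..15 to pick a "partner" individual, so each half-warp
// touches all 16 banks exactly once and avoids conflicts.

#define DIM 16   // assumed length of each test value set (one float per bank)
#define NP  64   // assumed subpopulation size = threads per block

__constant__ int d_perm[16];   // one random permutation of 0..15, filled by the host

__global__ void permutedRead(const float *pop, float *out)
{
    // Individual-major layout: element d of individual i lives at
    // s_pop[d * NP + i], so consecutive thread indexes hit consecutive banks.
    __shared__ float s_pop[DIM * NP];
    __shared__ int   s_perm[16];

    int tid = threadIdx.x;

    if (tid < 16)
        s_perm[tid] = d_perm[tid];        // stage the permutation once

    for (int d = 0; d < DIM; ++d)         // stage the subpopulation (linear, conflict-free)
        s_pop[d * NP + tid] = pop[d * NP + tid];
    __syncthreads();

    // Thread tid selects individual s_perm[tid % 16] within its own group of 16.
    // The bank of s_pop[d*NP + partner] is partner % 16 (NP is a multiple of 16),
    // and those values form a permutation of 0..15 inside every half-warp.
    int group   = tid & ~15;              // base index of this group of 16 threads
    int partner = group + s_perm[tid & 15];

    float acc = 0.0f;
    for (int d = 0; d < DIM; ++d)
        acc += s_pop[d * NP + partner];   // stand-in for using the partner's vector in DE

    out[tid] = acc;
}
```

In this sketch the host would presumably refill `d_perm` with a fresh permutation each generation (e.g. via cudaMemcpyToSymbol), but the same permutation is still shared by every group of 16 threads, which is exactly the restriction I would like to drop if warps are guaranteed to start at thread indexes that are multiples of 32.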
If anyone would like to suggest an alternate scheme for “bank conflict free” random selection of shared memory value sets, that would also be appreciated.