Some Performance Consideration Questions warp divergence, coalescing and shared mem then and now

Hello. I had a certain computation model for CUDA in my mind, mainly built on books (CUDA By Example, Programming Massively Parallel Processors: A Hands-on Approach) and the CUDA C Programming Guides, plus some things from the Best Practices. With the evolution of the hardware, the model changed and I thought I was keeping up with the changes, but every now and then I come across a paper or whitepaper challenging this model.

For example, in Mark Harris’ (famous) “S05: High Performance Computing with CUDA, Optimizing CUDA” , although quite old now, says something about warp divergence:

__global__ void reduce1(int *g_idata, int *g_odata)


   extern __shared__ int sdata[];

// each thread loads one element from global to shared mem

   unsigned int tid = threadIdx.x;

   unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

   sdata[tid] = g_idata[i];


for (unsigned int s=1; s < blockDim.x; s *= 2)


      if (tid % (2*s) == 0) 


         sdata[tid] += sdata[tid + s];




// write result for this block to global mem

   if (tid == 0) g_odata[blockIdx.x] = sdata[0];


In the above, the warp divergence is said to be happening inside the for loop at the if conditional, where not all threads of the warp will evaluate to the same boolean value. What I don’t get is, why would this result in a warp divergence? If a thread evaluates the expression to false, then there is no else branch to execute and will simply have to wait for the threads in the same warp to finish executing (if they made it into the if). The problem here isn’t warp divergence but the simple fact that we are not taking advantage of all/most threads in the warp that we could possibly take. And that’s Harris’ next improvement, that is to assign some job to as many threads of a warp as possible. So my question is, why was this labeled warp divergence? In my mind warp divergence happens when threads in a warp have to execute difference branches, but not when there is only a single branch to do, is it possible that what is meant here by warp divergence is that the body of the if has to be executed separately by every single thread that gets a true evaluation? Has this changed through the compute capability upgrades?

This is important because I often have if statements without else branches, sometimes with substantial amount of work in the if clause, but where most threads of a warp would evaluate the boolean expression of the if to same value. According to the model I had in mind this is not considered as a warp divergence, in any architecture to date.

I believe with the 2.x capability devices, something called predication was introduced, which I think executes both branches of a conditional and somehow drops the results of the branch of the threads which were not supposed to execute a certain branch according to the evaluated boolean, but this happens only for relatively small bodies of the conditional. This of course avoids warp divergence, but isn’t this almost the same as executing the true branch then executing the false branch? I mean the total number of instructions executed is the same and both would take as much time.

I am certainly missing some crucial detail which I couldn’t find in the programming guide, or just missed it. I hope I was clear enough with the problem I am facing so that someone finds the exact explanation.

Now for the second part (much shorter) of the topic, coalesced memory accesses. The simple question is, do we still need to worry about coalesced accesses? Of course I mean when using a compute capability 2.x device, which I would also like to know your opinion: Can we consider the 2.x devices as the norm when developing CUDA apps today? I don’t want to expand on this much right now, perhaps another topic.

With early 1.x machines non-coalesced accesses meant almost linearazing the global memory read, but with newer devices, especially the 2.x cards, this isn’t the case, but is it still worth taking it into account? Keep in mind a few things: many applications are based on (some type of) stencil computation for which GPU data parallelism seems ideal, however stencil computation means requiring neighborhood information which unavoidably leads to unaligned memory reads (am I wrong in saying this? is it actually possible to avoid this? My experience tells not, but I could be wrong, and very interested if so), next dealing with misalignment involves using precious and rare shared memory. 2.x compute capable devices take care of misaligned accesses with L2 and L1 caches which are not programmable. Reformulating the question: Is it worth sacrificing some shared memory to ensure coalesced memory accesses, regardless of the perhaps increased kernel complexity and hence poorer maintenance and performance, while it’s being handled automatically anyway by the caching system of the GPU? The answer would be obviously no unless there is a large user-base who are still using old generation cards.

I want to generalize the question: With 2.x cards and perhaps the future generation cards, would it still make sense to use the shared memory to bring data from global? Can we move on to find new and better uses for the shared memory and trust the caches to keep our data on chip, ready to fast automatic access?

I have experimented many times on a GTX 460 (sm_21) and usually saw no improvement when shared memory when used to bring in global memory, L1 and L2 had things sorted out under the rug. I ask for your opinion/experience in the matter.

Third and last part here (for the moment): shared memory. More specifically bank conflicts. I know some there are some topics out there discussing some of the issues here but I want to group things.

Are there coalescing issues here like there are with global memory reads? Can having thread 0 read an element from the beginning of the shared array and thread 1 reading an element from the end of the shared array cause problems other than bank conflicts?

An interesting question occurred to me, though it is implausible. I hope this doesnt sound stupid: Can the GPU operate on shared memory data directly? Without going through registers?

If a single thread does 2 reads from the same bank is it any different from having a single thread read from 2 different banks? For example, if in the below code shared_int is a shared array of ints, and tid is just threadIdx.x, how would the GPU do this computation?

shared_int[tid] = shared_int[tid] + shared_int[tid+32];

The data shared_int[tid] and shared_int[tid+32] are held in the same bank. Is shared_int[tid] loaded, then shared_int[tid+32] loaded (or vice versa), or is there an advantage of having a thread read from the same bank? I believe this has a lot to do with the actual instruction set which I don’t know much about.

Shared memory was last but I just recalled another issue which effects kernel design. When a warp writes back to global as its the last action, is the warp considered to have completed its work or does it have to wait for the write to actually happen (long latency)? What if the write to global is in the middle of the kernel but there are no dependencies on the written data in the running kernel? It doesn’t seem fair that threads have to wait for the write, but I don’t know what is happening on the hardware.

If you read this after having read the entire post, well done and thank you! Eagerly waiting for your reply. :))

Remember that the architecture is SIMT. Each warp executes the same instruction at one point in time. If parts of a warp have different instructions, then we encounter divergence. The stalled threads have to wait until their brother-threads are finished executing the if() statement. Even though there isn’t an else statement, I’d imagine that there would be less divergence since there is only one control path of execution. Think about it this way, you still lose a great number of threads per warp.

DMA handles this. Warp is context switch’d out when writing to global memory since there is no ‘compute’ portion left.

In case 1, the two reads are serialized. In case 2, memory is read in parallel

Application-dependent. Shared memory will always be magnitudes faster than global memory will be. You just have to make sure that you have a high compute to memory load ratio. If this is low, then… you get low performance.

Only worry about bank conflicts in shared memory. Shared memory is on-chip… thus, coalescing is not an issue. Register file will always be faster than shared memory.

Please look at the assembly code your code generates. You’ll answer more questions that way with regards to registers and at least gain an intuitive knowledge of how CUDA does things.

I suggest going back to Chapter 5/6 on Programming Massively Parallel Processors. Your questions are essentially the topics discussed there.