Problem on writing to global memory with warp serialize? writing to global memory cost much time

Hello, everybody! I find in my CUDA code that writing to global memory cost more than a half of timing. So, I run the CUDA Visual Profiler. I find more divergent branches and warp serialize. And global mem write throughput is only 0.415838 GB/s. I post the code sample, this part cause the divergent branches and warp serialize.

udword tx = threadIdx.x;
udword ty = threadIdx.y;
udword bw = blockDim.x;
udword bh = blockDim.y;
udword tid = __umul24(ty, bw) + tx;
udword bid = __umul24(__umul24(blockIdx.x, bw), bh);
udword th_id = bid + tid;
if (th_id < totalThreadNum)
{
unsigned int offset = 1;
while(i<asig_num)
{
float t1 = …;
float e1 = …;

if(fabs(t1)<=e1)
{
float t2 = …;
float e2 = …;
if(fabs(t2)<=e2)
{
results[th_id+offset*totalThreadNum] = *index; // **
offset++;
}
}
index++;
i++;
}
}
If I comment the line marked **, the warp serialize will disappear. I think branches cause the threads in warp going different paths and produce writing to global mem serialize. Is there anyone has any idea to solve this problem? Thanks a lot!

Be careful with ** removal. If this is the only output you have compiler may detect your whole (or major part) of your algorithm redundant and simply cut it out. As a result something else that might cause warp serialisation is removed as well.
Assuming however that this line is a cause of the problem, may I ask how pointer index is computed?

Another thing. As long as all threads in a half-warp have same offset, the store instruction will be coalesced. However, the more different values of offset among threads of the same half-warp, the more costly that operation will be. In worst case scenario, if all 16 threads of the half-warp will have different offset value, that will be completly uncoalesced, resulting in 16 memory transactions.

To overcome this problem you may consider the following:

  • store always a value, either a valid one or some special “empty”, like -1 or FLT_MAX or something… (not sure what is the type of results)
  • once you have the results computed, launch another kernel and remove incorrect values from your array using compaction algorithms. There are fast compaction algorithms available. If you are interested check this thread http://forums.nvidia.com/index.php?showtopic=152868

The above idea will work if the result array is not too sparse. If it is, that array may consume a lot of memory and storing useless values introduce to much overhead. But I think that if at least 25% of the data would be meaningful it would be worth trying.

Thanks a lot! Results array is unsigned int type used to store the index of objects. The pointer index: unsigned int * index = & objectID[0]; objectID is an array stored each object’s index.

Your suggestion is constructive. But, I need to compute a flag to indicate whether this element is needed to store or remove. So, if launch another kernel, how to store the flag, it also need to write to global memory and need use branches if and else. I have no idea. Can you give me some advice?

I was thinking about something like this:

while(i<asig_num) {

  float t1 = ...;

  float e1 = ...;

unsigned int store=0xffffffff;

  if(fabs(t1)<=e1) {

	float t2 = ...;

	float e2 = ...;

	if(fabs(t2)<=e2)

	  store=*index;

  }

  results[th_id+offset*totalThreadNum]=store;

  offset++;

  index++;

  i++;

}

Obviously it does not help you get rid of branches and fetching *index may still be costly if it is different for different threads. But storing values to results will be perfectly coalesced because offset will be the same for all threads.

This assumes that *index is never equal to 0xffffffff so that this case can be immediately identified and removed by subsequent compaction algorithm.

Thanks very much! I tried your suggestion, I find although the the global mem write throughput be increased 30 times, but the warp serilalize is not reduced. It still cost much time. I’m puzzled about how the warp serialize came from.