Warp divergence issue?

I am working with a fairly entry-level card, a GeForce 9400 GT. Everything works correctly except for the following performance issue.


cudakernel1 (GOOD performance, as if “do something complicated” didn’t exist)
for (int i = 0; i < no; i++) {
    if (i < 4) {
        // do something complicated
    }
    // do another complicated thing
}

cudakernel2 (BAD performance, as if “do something complicated” were always there)
if (threadIdx.x < 32) {
    // do something complicated (for the first warp only)
}
// do another complicated thing

The Best Practices Guide says that a branch such as if(threadIdx.x<32){…} should carry a negligible performance penalty, since all threads within any given warp follow the same path.
However, this is not what I see in my experiments.
I noticed that my card only supports CUDA compute capability 1.1. Is this the reason, or what am I doing wrong?
Your help is greatly appreciated. THANKS!

On pre-Fermi devices new blocks are only started once all warps of all previously started blocks have finished. So if you have one warp that takes significantly longer, the GPU will mostly idle until that one warp has finished.
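
One way to check whether this is what you are seeing is to time the kernel with and without the extra branch. The following is only a rough sketch: heavyWork, unevenKernel and timeUnevenKernel are made-up names, and the arithmetic loop is a placeholder for your real "complicated" work, not your actual code.

```cuda
#include <cuda_runtime.h>

// Placeholder for "do something complicated": a chain of dependent
// floating-point operations (hypothetical stand-in for the real work).
__device__ float heavyWork(float x, int iters)
{
    for (int i = 0; i < iters; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// Same shape as cudakernel2: the first warp of each block does extra work.
// The condition is uniform within a warp, so no warp diverges.
__global__ void unevenKernel(float *out, int baseIters, int extraIters)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)gid;
    if (threadIdx.x < 32)
        v = heavyWork(v, extraIters);   // extra work, first warp only
    v = heavyWork(v, baseIters);        // work every warp does
    out[gid] = v;
}

// Times one launch with CUDA events.
// Returns elapsed milliseconds, or -1.0f if a CUDA call failed.
float timeUnevenKernel(int blocks, int threads, int baseIters, int extraIters)
{
    float *dOut = 0;
    if (cudaMalloc((void **)&dOut, sizeof(float) * blocks * threads) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    unevenKernel<<<blocks, threads>>>(dOut, baseIters, extraIters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return ms;
}
```

Comparing timeUnevenKernel(16, 128, 100000, 0) with timeUnevenKernel(16, 128, 100000, 100000), and then repeating both with a smaller block size, should give a rough idea of how much of the slowdown comes from warps waiting on the slow one. The iteration counts are arbitrary.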

If you have multiple blocks running per multiprocessor, running your own block scheduler can help as it allows a new block to start as soon as one block finishes.
Within the block, if you can’t balance the work better, it might help to reduce the block size, so that fewer warps sit idle (and drag down effective occupancy) while the slow warp finishes.
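
By "your own block scheduler" I mean something along these lines: launch only as many blocks as fit on the device at once and let each block pull work items ("tiles") from a global counter until none are left. This is a minimal sketch under my own assumptions. The names nextTile, persistentKernel and runPersistent are made up, the per-tile work is a placeholder, and error checking is left out. Global atomicAdd needs compute capability 1.1, so on a 9400 GT it would have to be compiled with -arch=sm_11.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Index of the next unclaimed tile (hypothetical name).
__device__ unsigned int nextTile;

__global__ void persistentKernel(float *out, unsigned int numTiles)
{
    __shared__ unsigned int tile;
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(&nextTile, 1u);   // claim one tile for this block
        __syncthreads();
        unsigned int t = tile;                 // every thread reads the same tile index
        __syncthreads();                       // all reads done before thread 0 overwrites it
        if (t >= numTiles)
            break;                             // uniform across the block: no work left

        unsigned int idx = t * blockDim.x + threadIdx.x;
        float v = (float)idx;
        if (t == 0 && threadIdx.x < 32)
            v += 1.0f;                         // placeholder for the one warp's extra work
        out[idx] = 2.0f * v;                   // placeholder for the common work
    }
}

// Runs numTiles tiles of tileSize threads each on residentBlocks blocks
// and returns the results copied back to the host.
std::vector<float> runPersistent(unsigned int numTiles,
                                 unsigned int tileSize,
                                 unsigned int residentBlocks)
{
    std::vector<float> host(numTiles * tileSize, 0.0f);
    float *dOut = 0;
    unsigned int zero = 0;

    cudaMalloc((void **)&dOut, host.size() * sizeof(float));
    cudaMemcpyToSymbol(nextTile, &zero, sizeof(zero));   // reset the work counter

    persistentKernel<<<residentBlocks, tileSize>>>(dOut, numTiles);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&host[0], dOut, host.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return host;
}
```

For example, runPersistent(16, 128, 4) processes 16 tiles of 128 threads with 4 blocks in flight. A block that draws a cheap tile goes straight on to the next one instead of waiting for the block that is stuck with the expensive tile. The right value for residentBlocks depends on the card and on the kernel's resource usage.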

First, thanks for sharing the block scheduler knowledge (something I have never seen before).

Yes. I cannot balance the work. I have around 64 warps (organized in around 16 blocks) to finish. Only one of the warps needs extra work.

(I tried to use one special kernel to handle the extra work. It performs worse!)

Anyway, I guess I am still confused about why a single warp doing extra work slows down the whole thing.

Do you have anything more official (such as the CUDA Best Practices Guide) or more detailed to suggest for reading?

If you really cannot parallelize the extra work, the only thing I can offer for reading is Amdahl’s law. ;)
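
For reference, the usual statement of Amdahl’s law, where p and N are my own symbols: p is the fraction of the runtime that can be parallelized and N is the number of parallel units working on it.

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

So if the extra work of that one warp is effectively serial and accounts for a fraction 1 - p of the total time, no amount of additional parallelism pushes the speedup beyond 1/(1 - p).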

As for block scheduler details, they are completely undocumented, and the only source I know of is a thread in the forum.