I am trying to work with a fairly entry-level card, a GeForce 9400 GT. Everything works perfectly except for the following performance issue.
cudakernel1 (GOOD performance, as if "do something complicated" didn't exist):

for (int i = 0; i < no; i++) {
    if (i < 4) {
        do something complicated
    }
    do another thing complicated
}
cudakernel2 (BAD performance, as if "do something complicated" were always executed):

if (threadIdx.x < 32) {
    do something complicated   // for the first warp only
}
do another thing complicated
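To make the second pattern concrete, here is a minimal compilable sketch of it. The kernel name, the busy-work loop in `heavy_work`, and all sizes are my own placeholders for illustration, not the asker's actual code:

```cuda
#include <cstdio>

// Stand-in for "do something complicated": an arbitrary busy loop.
__device__ void heavy_work(float *out, int iters) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += sinf(acc + i);      // arbitrary non-trivial math
    *out = acc;                    // write result so the loop isn't optimized away
}

// The pattern from the question: only the first warp
// (threads 0..31 of each block) takes the expensive branch.
__global__ void cudakernel2(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = 0.0f;
    if (threadIdx.x < 32)          // first warp of each block only
        heavy_work(&v, 100000);    // "do something complicated"
    v += 1.0f;                     // "do another thing complicated"
    out[tid] = v;
}

int main() {
    const int blocks = 4, threads = 256;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));
    cudakernel2<<<blocks, threads>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Timing this against a variant without the `if` branch reproduces the observation: the whole kernel runs roughly as slowly as if every warp executed `heavy_work`.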
The Best Practices Guide told me that using something like if (threadIdx.x < 32) {…} should incur only a negligible performance penalty, since within any given warp all threads follow the same branch.
However, this is not what I observe in my experiments.
I noticed that my card only supports CUDA compute capability 1.1. Is this the reason, or am I doing something wrong?
Your help is greatly appreciated. THANKS!
On pre-Fermi devices, new blocks are only launched once all warps of all previously launched blocks have finished. So if one warp takes significantly longer than the others, the GPU will mostly sit idle until that warp has finished.
If you have multiple blocks running per multiprocessor, running your own block scheduler can help, as it allows the next piece of work to start as soon as any block finishes.
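One common way to implement such a scheduler is the persistent-threads pattern: launch only about as many blocks as can be resident at once, and let each block pull work items from a global counter with `atomicAdd` (global-memory integer atomics are available from compute capability 1.1). A hedged sketch, where `process_item`, the item count, and the launch configuration are all illustrative assumptions:

```cuda
#include <cstdio>

__device__ int next_item;              // head of the global work queue

// Placeholder for one "block's worth" of work on a single item.
// Here only thread 0 records a result; a real kernel would split
// the item's work across the whole block.
__device__ void process_item(int item, float *out) {
    if (threadIdx.x == 0)
        out[item] = item * 0.5f;
}

__global__ void persistent_kernel(float *out, int num_items) {
    __shared__ int item;               // the block's current work item
    while (true) {
        // Thread 0 claims the next unprocessed item for the whole block.
        if (threadIdx.x == 0)
            item = atomicAdd(&next_item, 1);
        __syncthreads();
        if (item >= num_items)         // queue drained: all threads of the
            return;                    // block exit together, freeing the SM
        process_item(item, out);
        __syncthreads();               // all done before `item` is reused
    }
}

int main() {
    const int num_items = 64;          // what would have been 64 blocks
    const int resident_blocks = 8;     // roughly what fits concurrently
    float *d_out;
    cudaMalloc(&d_out, num_items * sizeof(float));
    int zero = 0;
    cudaMemcpyToSymbol(next_item, &zero, sizeof(int));
    persistent_kernel<<<resident_blocks, 128>>>(d_out, num_items);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Because each resident block grabs a fresh item the moment it finishes its previous one, a single long-running item no longer stalls the launch of everything behind it; only the final items can leave the machine partly idle.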
Within a block, if you can't balance the work more evenly, it may also help to reduce the block size, so that fewer warps sit idle waiting on the long-running warp and drag down effective occupancy.