Efficient thread warp size? How small can a warp get?

Are there any guidelines as to how small a warp of threads can be and still efficiently utilize the G80 hardware?

At present I am using 256 threads per block, but for the “last” block I have to add conditional code like this to the kernel to prevent reading past the end of the data.

if (thread_id < max_thread) {
    // run kernel body
}
I know this is inefficient because it stalls the ALUs. My problem is that I never know ahead of time how big the video frames in the input stream are (QCIF? CIF? D1? 720p?). The “natural” number of threads per warp can work out as low as 8, which I suspect would be rather inefficient.


Warps whose active thread count is not a multiple of 16 will leave some processors idle.



I am under the impression that the G80 GPU is organized in blocks of 8 ALUs. Why is 16 threads, rather than 8, the natural multiple per warp?


Hope you don’t mind my jumping in here. I am assuming you use shared memory to store your data, which is the most efficient memory, by the way. Since each access to the data takes 2 cycles, one half of the warp can execute while the other half is accessing the data. By the same token, if you store the data in device/local/global memory, you will need many more threads in the block to keep the ALUs busy. Hope this is right.

Each multiprocessor is composed of 8 ALUs, yes, but these ALUs run at twice the clock frequency of the instruction unit, so threads are executed in groups of 16. These groups are paired off (one warp = two groups of 16 threads) and each warp is issued over two cycles.

Thanks to everyone for the explanation. In my case, my data is not going to fit very well into shared memory; the video frames are just too big. I guess I will have to use more threads and fewer blocks and live with the “last” block stalling a bit.

The idea is that you would divide the frame into chunks, one per thread block, that fit in shared memory. Then a thread block can process that chunk and output its results, independent of all other thread blocks.
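A sketch of that chunking pattern, assuming a simple per-pixel operator; the kernel name, the TILE size, and the placeholder inversion are all illustrative, not from this thread:

```cuda
#define TILE 256  // threads per block = pixels per chunk (assumption)

// Hypothetical kernel: each block stages its chunk of the frame in
// shared memory, processes it, and writes the result back out,
// independent of all other blocks.
__global__ void process_chunk(const unsigned char *in,
                              unsigned char *out,
                              int num_pixels)
{
    __shared__ unsigned char tile[TILE];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < num_pixels)            // guard for the partial last block
        tile[threadIdx.x] = in[idx];
    __syncthreads();

    if (idx < num_pixels) {
        // Placeholder per-pixel work; a real filter would also read
        // neighbouring tile[] entries here.
        out[idx] = 255 - tile[threadIdx.x];
    }
}
```

The frame never needs to fit in shared memory as a whole; each block only ever holds its own TILE-sized slice.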

If you are applying global operators things are a bit trickier, but you should still be able to parallelize it so that it uses shared memory to amplify arithmetic intensity, and use multiblock reductions to combine the global data.
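For the global-operator case, the standard pattern is a two-stage reduction: each block reduces its chunk in shared memory to one partial result, then a second pass combines the partials. A rough sketch, assuming a sum and a block size of 256 (names are illustrative):

```cuda
// Stage 1: each block reduces its chunk to a single partial sum.
__global__ void partial_sum(const float *in, float *partials, int n)
{
    __shared__ float s[256];                 // blockDim.x assumed to be 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    s[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partials[blockIdx.x] = s[0];         // one value per block

    // Stage 2 (not shown): run this kernel again over partials,
    // or add the handful of partials up on the host.
}
```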