Scheduling thread blocks adaptively: how to distribute thread blocks to arbitrary dimensions

Hi Guys,

I am relatively new to CUDA and I am interested to know how you guys would tackle a problem I am having at the moment.

Maybe it's a stupid question, but please bear with me…

I am writing a ray tracing application. Naturally, you would want your end user to be able to resize the canvas to any desired width and height.

This brings me to my problem. If I am not mistaken, it's best to keep each dimension of a thread block a power of two. Please correct me if I am wrong.

Say we have a canvas of 513 x 513 pixels and thread blocks of 16 x 16 threads; one would need at least 33 x 33 thread blocks in order to fully cover the canvas.

However, in the 65 thread blocks along the right and bottom edges, only a small fraction of the threads would actually be rendering. I would like to find a way to ensure that all the threads in my thread blocks are in fact doing the work they are supposed to do.

I hope you guys understand my question. If you have any questions, please let me know.

Thanks in advance,

T Kroes

The Netherlands

Block dimensions should preferably be chosen so that the number of threads per block is a multiple of 32 (the warp size).

So your question is how to fit a grid of thread blocks whose sizes are multiples of 32 to data whose size is not?

I think the most common way to do it is to make a grid that is bigger than your dataset and then have a conditional check in the kernel to make sure you're not going out of bounds. What happens is that the warps in your boundary blocks diverge at that check, so there might be some slight loss of performance, but probably no big deal.


// DIM_X, DIM_Y: block size; cols, rows: data size. Integer division rounded up.
dim3 dimGrid( (cols + DIM_X - 1)/DIM_X , (rows + DIM_Y - 1)/DIM_Y );
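
To make the bounds check concrete, here is a minimal sketch of how the oversized grid and the in-kernel guard could fit together. All names here (BLOCK_X, BLOCK_Y, blocksFor, shadePixels, renderCanvas) are made up for illustration, the per-pixel "shading" is just a placeholder for the real ray tracing work, and d_canvas is assumed to be a device buffer of cols * rows pixels allocated elsewhere.

```cuda
#include <cuda_runtime.h>

// Hypothetical block size, matching the 16 x 16 blocks from the question.
constexpr int BLOCK_X = 16;
constexpr int BLOCK_Y = 16;

// Number of blocks of size `block` needed to cover `n` elements (rounds up).
constexpr int blocksFor(int n, int block) {
    return (n + block - 1) / block;
}

// One thread per pixel. The colour written here is a placeholder only.
__global__ void shadePixels(uchar4* canvas, int cols, int rows) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads of the boundary blocks that fall outside the canvas exit here.
    if (x >= cols || y >= rows) return;

    canvas[y * cols + x] = make_uchar4(x & 0xFF, y & 0xFF, 0, 255);
}

// Launches a grid that covers the whole canvas, rounding up in both dimensions.
// Assumes d_canvas points to cols * rows pixels of device memory.
cudaError_t renderCanvas(uchar4* d_canvas, int cols, int rows) {
    dim3 dimBlock(BLOCK_X, BLOCK_Y);
    dim3 dimGrid(blocksFor(cols, BLOCK_X), blocksFor(rows, BLOCK_Y));
    shadePixels<<<dimGrid, dimBlock>>>(d_canvas, cols, rows);
    return cudaGetLastError();
}
```

For the 513 x 513 canvas this launches 33 x 33 blocks, i.e. 528 x 528 = 278,784 threads for 263,169 pixels, so only about 5.6% of the launched threads exit at the guard without doing any work.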

Hope I understood your question.