Determining Thread vs Block

I have written a kernel that needs to spawn 1024 threads. Each thread operates on a separate data set and does quite heavy work. I have a Tesla C1060 card, which has 240 cores divided into 30 MPs.
In this scenario, what would be the best (ideal) way of invoking the kernel?
Is it 2 blocks of 512 threads each, or 32 blocks of 32 threads each?
I would appreciate your response.


I don’t think you want either.

Firstly, you want at least enough blocks to fill your 30 MPs, so that’s at least 30 blocks to start with. Then you want to look at your register and shared memory allocation: it may be that you can’t get 1024 threads on an MP. This is not a problem, since you get pretty much optimal performance with only 512.
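
To make the block-count arithmetic concrete, here is a minimal sketch in plain host-side C++ (no CUDA calls). The helper names blocks_needed and busy_multiprocessors are hypothetical, and the idea that each MP receives one block before any MP receives a second is a simplifying assumption, not documented scheduler behaviour.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover total_threads
// when each block holds threads_per_block threads (ceiling division).
int blocks_needed(int total_threads, int threads_per_block) {
    assert(total_threads > 0 && threads_per_block > 0);
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: how many multiprocessors get at least one block.
// ASSUMPTION: blocks are spread one per MP before any MP gets a second.
int busy_multiprocessors(int num_blocks, int num_mps) {
    assert(num_blocks >= 0 && num_mps > 0);
    return num_blocks < num_mps ? num_blocks : num_mps;
}
```

For the 1024 threads in the question, blocks_needed(1024, 512) gives 2 blocks, so at most 2 of the 30 MPs have any work, while blocks_needed(1024, 32) gives 32 blocks, which is enough to touch all 30.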

There are quite a few other things to take into account. In my application, more threads per block is more efficient because it reduces global memory loads; however, I have to sync my blocks quite regularly, so smaller blocks are better to reduce waiting. I compromise at 128 threads per block.

Also, you can only have 8 blocks per MP, which can be an issue if you want 32-thread blocks. Given that resources are allocated in lumps of 64, 32-thread blocks are wasteful in that respect too (as are 96-thread blocks and other sizes that are not multiples of 64).
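
A rough way to see the effect of these limits is the sketch below, again plain host-side C++. It uses only the numbers mentioned in this thread (8 blocks per MP, 1024 threads per MP, lumps of 64) and deliberately ignores register and shared memory limits, so treat it as a simplified model rather than a real occupancy calculation. The function and constant names are made up for illustration.

```cpp
#include <cassert>

// Limits as quoted in this thread; ASSUMED to apply to the card in question.
const int kMaxBlocksPerMp   = 8;     // resident blocks per MP
const int kMaxThreadsPerMp  = 1024;  // resident threads per MP
const int kAllocGranularity = 64;    // resources allocated in lumps of 64

// Hypothetical helper: block size rounded up to the allocation granularity.
int allocated_threads(int threads_per_block) {
    assert(threads_per_block > 0);
    int lumps = (threads_per_block + kAllocGranularity - 1) / kAllocGranularity;
    return lumps * kAllocGranularity;
}

// Hypothetical helper: threads that can be resident on one MP, capped by the
// block limit and the thread limit. Registers and shared memory are ignored.
int resident_threads_per_mp(int threads_per_block) {
    assert(threads_per_block > 0);
    int blocks_by_threads = kMaxThreadsPerMp / threads_per_block;
    int blocks = blocks_by_threads < kMaxBlocksPerMp ? blocks_by_threads
                                                     : kMaxBlocksPerMp;
    return blocks * threads_per_block;
}
```

Under this model a 32-thread block is charged as 64 threads and the 8-block cap limits an MP to 256 resident threads, whereas 128-thread blocks waste nothing on rounding and reach the full 1024.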

I think the general method is to play around and see what works best.