Is fewer threads better?

Hi,
I am working on an image processing algorithm, and to make my life easier I have implemented the kernels with 1 thread/pixel, but now I am wondering will I ever been able to max out the number of threads? I’ve calculated that the maximum no. of threads I can create is 6553565535512=2198956147200 (that is a grid of 65535x65535, and block size of 512). It seems like this is more than enough to accommodate my images, which are in the ~10 megapixel range. But is there an overhead to creating such a large amount of threads? I could tile my images, and use 1 thread / NxN pixels, but it would be a lot of work to change the code, without a justification that it would improve performance.
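For reference, a minimal 1-thread-per-pixel kernel sketch (the kernel name, the invert operation, and an 8-bit grayscale layout are all just placeholder assumptions, not from the original post):

```cuda
// Hypothetical 1-thread-per-pixel kernel: one thread handles exactly one pixel.
__global__ void invertPixels(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)              // guard threads past the image edge
        img[y * width + x] = 255 - img[y * width + x];
}

// Host side: round the grid up so every pixel gets a thread.
dim3 block(16, 16);                           // 256 threads = 8 full warps
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
invertPixels<<<grid, block>>>(d_img, width, height);
```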
Any advice is welcome.

I think the answer is: generally, fewer threads is worse!

Having fewer threads would also make it harder to have coalesced (contiguous) reads and writes to/from global arrays, and you definitely want those for performance.
It is also desirable to have 1 or more complete warps per block, i.e. 32, 64, 96, … threads per block.

NB if your image is say 2800×3500 and you decide on 16×16 threads per block, that is a grid of 175×219 (3500/16 = 218.75, rounded up), and the last 175 blocks will be slightly underutilised, but that doesn't matter.
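The rounding up in that example is the usual ceiling division; a quick sketch of the arithmetic, using the numbers from the example:

```cuda
// Grid size for a 2800 x 3500 image with 16 x 16 thread blocks.
dim3 block(16, 16);
dim3 grid((2800 + 15) / 16,    // = 175 (exact fit)
          (3500 + 15) / 16);   // = 219 (218.75 rounded up)
// The 175 blocks in the last grid row cover only 12 of their 16 pixel rows,
// which is why the kernel needs a bounds check such as
// if (x < width && y < height).
```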

Sorry for saying some stuff you already knew.
kbam

Sorry, the internet glitched and my reply was sent twice.

In CUDA, active threads have no switching overhead (unlike threads on a CPU) so, in general, more is better because they can hide the latency of global memory. I do recall seeing some posts that measured a small launch overhead proportional to the number of blocks, but I can’t locate them now.

Note that the number of threads can have other subtle timing effects. You should, when possible, design your kernels to work with any block size and benchmark all reasonable multiples of 32.
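One common way to write a kernel that works with any block size, so it can be benchmarked at every multiple of 32, is a grid-stride loop. A sketch, with the kernel name and the scale operation as hypothetical placeholders:

```cuda
__global__ void scalePixels(float *img, int n, float s)
{
    // Each thread strides over the image, so any <<<grid, block>>> launch
    // configuration covers all n pixels regardless of block size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        img[i] *= s;
}

// Benchmark loop: try every block size that is a multiple of the 32-thread warp.
for (int threads = 32; threads <= 512; threads += 32) {
    int blocks = (n + threads - 1) / threads;
    scalePixels<<<blocks, threads>>>(d_img, n, 2.0f);
    // time each configuration here, e.g. with cudaEvent timers
}
```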
