CUDA Cores - running threads, blocks, kernels etc.

Hello there,
I know that a block can run 512 threads, and that you can have 65,535 blocks. I also know that threads within a block can communicate with each other, which is very useful. But my main point is: if you have 65,535 blocks running with 1 thread each on a GPU with 300 CUDA cores, will 300 blocks run at once, i.e. 300 simultaneous actions? Or do you have to have 1 block with 512 threads to take advantage of the 300 cores? I know there's something about a warp and 32 threads running at once… Could someone sum it all up for me? Would 10 blocks each running 32 threads be optimal?

Stuart P.


The CUDA cores you are referring to are actually just vector engines. They execute in groups of 32 (the warp size) over the threads in a given block. There are limits to the number of blocks you can put onto a given SM. A GPU consists of several SMs, each of which contains N CUDA cores.

If you tried to schedule 300 blocks of 1 thread each you'd get very bad performance, because the lowest scheduling unit is a warp of 32 threads. You need LOTS of threads on each SM, not lots of blocks. Typically you see only between six and eight blocks resident on an SM at any one time.

A good rule of thumb is to aim for either 192 or 256 threads per block and you'll be fine, as both of these values produce pretty good loading of the hardware.

Thanks for that, that rule of thumb is probably the easiest part to understand.

Am I right in thinking:

  • SM stands for Streaming multiprocessor.

  • GTX 480 has 480 SMs

  • 6 to 8 blocks are on a given SM

  • 192 threads minimum per block means minimum 1152 threads per SM

  • 1152 threads per SM * 480 SMs means for insane computation I want about 552,960 threads scheduled across 2,880 blocks

  • An SM only handles N blocks at a time, whilst the others are loading in order to hide memory loading latencies, hence 1 thread per block would result in each SM running just N threads instead of 32*N?


The GTX 480 has 15 SMs, each of which has 32 scalar cores that execute like a wide SIMD or vector machine.

Ah, OK, so it's more like 17,280 threads across 90 blocks, for 6 blocks per SM and 192 threads per block.


Those numbers don’t look bad, but the reasoning behind them mixes some unrelated numbers.

On compute capability 1.x devices you should have at least 192 threads or 6 warps active per SM (not per block). On a 2.0 device, you should have at least 576 active threads per SM.