Why does GK110 have 192 cores per SMX but schedule only 4 warps?


I’ve got a question that has me stumped…

The new GK110 offers 192 cores per SMX, but it schedules just 4 warps (i.e. 128 threads) at a time, so:

How is it possible to keep the GPU fully busy if only 128 threads can run at the same time?

The previous generation had only 32 cores per SM, and I thought that was what limited concurrent multi-warp execution, but with this update I no longer understand how it works!

Thank you very much people!


I am reading at http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf that each warp scheduler can also dispatch 2 independent instructions per warp, so I guess the multiprocessor can be fully occupied when at least 2 of the 4 scheduled warps have a second independent instruction ready to issue…

You need to abandon the concept of one thread per core.

It makes more sense to think in terms of one thread block per SMX for Nvidia GPUs.

For concurrency, think in terms of (max number of warps executed concurrently by an SMX) times (number of SMXs).
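To put rough numbers on that product, here is a minimal sketch. The figures are assumptions taken from the Kepler whitepaper (up to 64 resident warps per SMX, 15 SMXs on a full GK110 die); shipping boards may enable fewer SMXs, and `resident_threads` is just an illustrative helper name.

```python
WARP_SIZE = 32  # threads per warp

def resident_threads(max_warps_per_smx, num_smx, warp_size=WARP_SIZE):
    """Upper bound on the number of threads resident on the GPU at once."""
    return max_warps_per_smx * num_smx * warp_size

# Assumed values for a full GK110: 64 warps per SMX, 15 SMXs.
print(resident_threads(64, 15))  # 30720
```

Resident warps are not all issuing in the same cycle; the schedulers pick among them to hide latency.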


warp size remains 32 :thumbsdown:

Thank you YMC,

That’s true… that was why I thought it was impossible to reach full power, but you’ve got the key: one thread block per SM is the way.

So, is there any recommendation to, for example, make the block size a multiple of 192? I mean… if you have 224 threads/block, the SMX will need to schedule 6 warps plus one more (on its own, maybe?)

Any suggestion?

If I understand correctly, the number of threads per block should ideally be a multiple of the warp size, which is 32.
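A small sketch of why that matters: a block is split into whole warps, so any remainder leaves idle lanes in the last warp. The helper names here are made up for illustration.

```python
def warps_per_block(threads_per_block, warp_size=32):
    """Number of warps a block occupies (rounded up to whole warps)."""
    return -(-threads_per_block // warp_size)  # ceiling division

def idle_lanes(threads_per_block, warp_size=32):
    """Lanes in the last warp that have no thread assigned."""
    return warps_per_block(threads_per_block, warp_size) * warp_size - threads_per_block

for n in (192, 200, 224):
    print(n, warps_per_block(n), idle_lanes(n))
```

For the 224-thread example above this gives 7 full warps with no idle lanes, while 200 threads would also take 7 warps but leave 24 lanes unused.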

There’s another factor of two:

"Kepler’s quad warp scheduler selects four warps, and [u]two independent instructions per warp can be dispatched each cycle[/u]"
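The arithmetic behind that factor of two can be sketched as follows. This is a simplified model that only counts the 192 CUDA cores and ignores load/store units, special function units, and any issue restrictions; `issue_summary` is a hypothetical helper, not anything from the whitepaper.

```python
WARP_SIZE = 32

def issue_summary(cores=192, schedulers=4, dispatch_per_scheduler=2):
    """Compare issue capacity per cycle with what the cores can absorb."""
    # Warp-wide instructions needed per cycle to occupy every core.
    needed = cores // WARP_SIZE
    # Maximum warp-wide instructions the schedulers can dispatch per cycle.
    max_issued = schedulers * dispatch_per_scheduler
    # Warps that must supply a second independent instruction to fill the cores.
    dual_issue_warps = max(0, needed - schedulers)
    return {"needed": needed, "max_issued": max_issued,
            "dual_issue_warps": dual_issue_warps}

print(issue_summary())
```

Under these assumptions, 6 warp-wide instructions per cycle cover the 192 cores, the schedulers can dispatch up to 8, and at least 2 of the 4 selected warps need a second independent instruction ready.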

You want the block size to be an integer multiple of four warps on Kepler (128 threads), to feed the four independent warp schedulers. Previous generations wanted at least a multiple of two warps, either to feed two independent schedulers (Fermi) or for register allocation/banking reasons (Tesla). I assume running multiple blocks per SM/SMX can mitigate some of this, but not the register allocation quantization.
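As a quick illustration of that rule of thumb, the sketch below rounds a desired block size up to the next multiple of four warps. The function name and defaults are assumptions for the example, and the result still has to respect the device's per-block thread limit.

```python
def round_up_block_size(threads, schedulers=4, warp_size=32):
    """Round a block size up to a multiple of schedulers * warp_size."""
    granule = schedulers * warp_size  # 128 threads on Kepler
    return ((threads + granule - 1) // granule) * granule

for n in (128, 192, 224):
    print(n, "->", round_up_block_size(n))
```

Note that 192 is not a multiple of 128, so under this heuristic both 192 and 224 round up to 256.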

Thanks for your authoritative reply. :thumbup:

Thank you a lot!