The new GK110 offers 192 cores per SMX, but it only schedules 4 warps (i.e. 128 threads) at a time, so
how is it possible to fill the GPU if only 128 threads can run at the same time?
The previous generation had only 32 cores/SM, and I thought that was what limited concurrent multi-warp execution, but with this update I've lost track of how it works!
Thank you very much people!
EDIT:
I am reading at Page Not Found | NVIDIA that each warp scheduler can also dispatch 2 independent instructions per warp, so I guess the multiprocessor could be fully occupied when 2 of the 4 scheduled warps have independent instructions ready to be processed…
That’s true… that was why I thought it was impossible to reach full throughput, but you’ve got the key: one thread block/SM is the way.
So, is there any recommendation to, for example, make the block dimension a multiple of 192? I mean… if you have 224 threads/block, the SMX will need to schedule 6 warps plus another one (on its own, maybe?)
You want the block size to be an integer multiple of four warps on Kepler (128 threads), to feed the four independent warp schedulers. Previous generations wanted at least a multiple of two warps, either to feed two independent schedulers (Fermi) or for register allocation/banking reasons (Tesla). I assume running multiple blocks per SM/SMX can mitigate some of this, but not the register allocation quantization.