Why does GK110 have 192 cores but only 4 warps?

whyon · June 4, 2012, 8:28pm

Hi,

I’ve got a question that is breaking my head…

New GK110 offers 192 cores per SMX, but it only schedules 4 warps (i.e. 128 threads) at the same time, so

How is it possible to keep the GPU fully busy if only 128 threads can run at the same time?

The previous generation had only 32 cores/SM, and I thought that was what limited concurrent multi-warp execution, but with this update I have lost track of how it works!

Thank you very much people!

EDIT:

I am reading on NVIDIA’s site that each warp scheduler can also dispatch 2 independent instructions per warp, so I guess that the multiprocessor could be fully occupied when 2 of the 4 scheduled warps have an independent instruction ready to be processed…

ymc · June 4, 2012, 11:56pm

You need to abandon the concept of one thread per core.

It makes more sense to think in terms of one thread block per SMX for Nvidia GPUs.

For concurrency, think in terms of (max number of warps executed concurrently by an SMX) times (number of SMXs).

ymc · June 5, 2012, 2:17am

Warp size remains 32.

whyon · June 5, 2012, 7:27am

Thank you YMC,

That’s true… that was why I thought it was impossible to reach the full power, but you’ve got the key: one thread block/SM is the way.

So, is there any recommendation to, for example, make the block dimension a multiple of 192? I mean… if you have 224 threads/block, the SMX will need to schedule 6 warps and another one (alone maybe?)

Any suggestion?

ymc · June 5, 2012, 10:04am

If I understand correctly, threads per block ideally should be multiples of warp size which is 32.
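To make the warp arithmetic concrete, here is a minimal sketch in plain Python (no GPU needed). The helper names `warps_per_block` and `idle_lanes` are made up for illustration; the only fact taken from the thread is the warp size of 32. It shows that 224 threads/block already splits into exactly 7 full warps, while a size like 200 leaves part of the last warp unused.

```python
WARP_SIZE = 32  # warp size, as stated in this thread

def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to hold one block (ceiling division by the warp size)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

def idle_lanes(threads_per_block: int) -> int:
    """Thread slots in the last warp that carry no thread (hypothetical helper)."""
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block

print(warps_per_block(224), idle_lanes(224))  # 7 0
print(warps_per_block(200), idle_lanes(200))  # 7 24
```

So the "6 warps and another one" case from the question is simply 7 full warps, with no partially filled warp.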

d.rossetti · June 5, 2012, 5:10pm

There’s another factor of two:

"Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle"
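A rough back-of-the-envelope check of what that factor of two means for the 192 cores, written as a small Python sketch. It assumes a deliberately simplified model in which every dispatched instruction occupies 32 cores for one cycle and ignores the other execution units on the SMX; the function name `lanes_issued` is hypothetical, and real issue behaviour is more complicated than this.

```python
CORES_PER_SMX = 192     # from the thread title
WARP_SCHEDULERS = 4     # "quad warp scheduler"
WARP_SIZE = 32

def lanes_issued(dual_issuing_schedulers: int) -> int:
    """Core lanes fed per cycle in the simplified model.

    Every scheduler issues one instruction for its warp; each scheduler that
    also finds a second independent instruction adds another 32 lanes.
    """
    if not 0 <= dual_issuing_schedulers <= WARP_SCHEDULERS:
        raise ValueError("between 0 and 4 schedulers can dual-issue")
    return (WARP_SCHEDULERS + dual_issuing_schedulers) * WARP_SIZE

for k in range(WARP_SCHEDULERS + 1):
    print(k, lanes_issued(k), lanes_issued(k) >= CORES_PER_SMX)
```

Under this model, single issue covers only 128 of the 192 cores, and two of the four schedulers dual-issuing is enough to cover all 192, which matches the guess in the original post's EDIT.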

tera · June 5, 2012, 5:40pm

You want the block size an integer multiple of four warps on Kepler (128 threads), to feed the four independent warp schedulers. Previous generations wanted at least a multiple of two warps to either feed two independent schedulers (Fermi) or for register allocation/banking reasons (Tesla). I assume running multiple blocks per SM/SMX can mitigate some of this, but not the register allocation quantization.
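As an illustration of the rule of thumb above, a small Python sketch that rounds a desired block size up to the next multiple of four warps (128 threads). The function name `round_up_block_size` and the `granularity` parameter are assumptions for this example, not part of any CUDA API; the extra threads would still need a bounds check inside the kernel.

```python
WARP_SIZE = 32
KEPLER_GRANULARITY = 4 * WARP_SIZE  # four warps = 128 threads, per the post above

def round_up_block_size(threads: int, granularity: int = KEPLER_GRANULARITY) -> int:
    """Smallest multiple of `granularity` that is >= `threads`."""
    if threads <= 0:
        raise ValueError("block size must be positive")
    return ((threads + granularity - 1) // granularity) * granularity

for wanted in (128, 192, 224, 300):
    print(wanted, "->", round_up_block_size(wanted))
```

Note that 192, the number of cores, is not itself a multiple of 128, so under this rule it would be rounded up to 256.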

ymc · June 5, 2012, 10:54pm

Thanks for your authoritative reply.

whyon · June 6, 2012, 9:55am

Thank you a lot!
