Kernel scheduling with Fermi: can independent blocks be placed in new streams?

With CUDA 2.x and 200-series cards, running a single kernel with 90 independent blocks takes less time than running 90 streams with one block each (disregarding concurrent host<->device memory transfers), because kernels in different streams are executed serially.
Will running the same configuration on Fermi allow full utilization of the GPU? Can different kernel calls run concurrently if they are launched at different times?
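
For concreteness, here is a minimal sketch of the two launch configurations I'm comparing (the kernel body is just a placeholder):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; the actual workload doesn't matter for the comparison.
__global__ void dummyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    const int nBlocks = 90;
    const int nThreads = 256;
    float *d_data;
    cudaMalloc(&d_data, nBlocks * nThreads * sizeof(float));

    // Configuration A: one launch covering all 90 independent blocks.
    dummyKernel<<<nBlocks, nThreads>>>(d_data);

    // Configuration B: 90 launches of 1 block each, one per stream.
    // On GT200 these execute serially; the question is whether Fermi
    // can overlap them.
    cudaStream_t streams[nBlocks];
    for (int i = 0; i < nBlocks; ++i) {
        cudaStreamCreate(&streams[i]);
        dummyKernel<<<1, nThreads, 0, streams[i]>>>(d_data + i * nThreads);
    }
    cudaThreadSynchronize();

    for (int i = 0; i < nBlocks; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
```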

Thanks!

I don’t think anybody outside of NVIDIA has the answer to those questions, and I doubt the NVIDIA guys are in a position to answer that right now.

I rather doubt it will be as fast. There will be some parallel kernel execution possible on Fermi, but from what I understood from the white papers and so forth, it won't let you run 90 different programs, and as far as I'm concerned it makes no sense to support such a thing. I would rather NVIDIA spent their time and effort on other matters; it's enough if it can run 4-8 different programs. And since the GF100 seems to contain what are almost 4 separate GPUs, it seems it will be able to do that.

But I guess we will have to wait a bit more to see :)

Thanks

From what I understand (which is limited)… the number of parallel kernel calls may equal the number of SMs on the chip… I guess…

I remember reading that with Fermi you can run 2 kernels in parallel.
Actually, I'm not seeing much of an improvement in being able to run dozens of kernels at the same time. Why not just run one kernel with more threads, which will complete faster, and then start another one?

It depends on the algorithm you use and whether it offers enough parallelism to occupy all execution units. If not, you can execute multiple kernels simultaneously and get much better utilization of your hardware.

At this time it's not certain how many kernels you can execute in parallel. One thing is sure, however: it will be more than one.
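
As a sketch of what I mean (kernelA/kernelB are made-up examples, and this assumes the hardware actually supports concurrent kernels): launch each small kernel into its own stream, so the scheduler is at least allowed to overlap them.

```cpp
#include <cuda_runtime.h>

// Two small kernels that individually leave most SMs idle.
__global__ void kernelA(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] += 1.0f;
}

__global__ void kernelB(float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    b[i] *= 2.0f;
}

int main()
{
    const int n = 4 * 128;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each launch uses only 4 blocks, far too few to fill the chip.
    // On GT200 they run back to back; hardware with concurrent kernel
    // support could overlap them and improve utilization.
    kernelA<<<4, 128, 0, s1>>>(d_a);
    kernelB<<<4, 128, 0, s2>>>(d_b);
    cudaThreadSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```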

I seem to recall the limit being mentioned in the keynote at the GTC conference, but am not sure. We speculated afterward that the limit was one kernel per SM. Unfortunately, the Fermi white paper doesn’t seem to offer any details.

Yes it does. CUDA Fermi whitepaper V1.1, Page 11, summary table:

Concurrent Kernels: Up to 16

Which, not surprisingly, is the same as the number of SMs.

Yep, there it is. I mistakenly assumed that this would be mentioned in the section that discusses concurrent kernels… silly me.

Hehe, you should know by now that you can find information in lots of different places in NVIDIA docs ;)

Seems like quite a feat that each SM can run its own kernel, and not e.g. each GPC. The architecture really got a lot more parallel with Fermi.

It gets even more complicated and confusing when you look at the graphics architecture of Fermi. There are 4 separate raster engines… so Fermi itself looks a little like 4 GPUs, each with 4 SMs. However, for CUDA it may be that the 4x4 distinction is ignored and the scheduling subunits look more like 1x16.

So my hunch was right :) … we can run up to 16 kernels… very interesting! I'm wondering how much overhead there will be to switch kernels during execution. Will it be the same as the current 10 microseconds?

I even read on one of the pages mentioned in this thread that each of the GPCs can have a different number of SMs (so you can have two with 4 and two with 3) and still the load will get distributed evenly over the SMs (for graphics), so there should be per-SM load distribution somewhere.

10 microseconds? If you launch kernels back to back, there is a 3 microsecond overhead. I suspect that here there will be none, as it will be the same context switching that occurs today, unless Fermi supports kernel context switching, meaning loading a new instruction set onto an SM, but I highly doubt that.
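
If anyone wants to check once the hardware is out, here is a rough sketch of measuring launch overhead with CUDA events by timing back-to-back launches of an empty kernel (the launch count is arbitrary, and the result will depend on driver and hardware):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// An empty kernel, so the measured time is dominated by launch overhead.
__global__ void emptyKernel() {}

int main()
{
    const int nLaunches = 1000;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();  // warm-up launch

    cudaEventRecord(start, 0);
    for (int i = 0; i < nLaunches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average per-launch time: %.2f us\n", ms * 1000.0f / nLaunches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```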