Kernel scheduling with Fermi: can independent blocks be placed in new streams?

With CUDA 2.x and 200-series cards, running a single kernel with 90 independent blocks takes less time than running 90 streams with one block each (disregarding concurrent host<->device memory transfers), because kernels in different streams are executed serially.
Will running the same configuration on Fermi allow full utilization of the GPU? Can different kernel calls run concurrently if they are launched at different times?
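
For concreteness, here is a minimal sketch of the two launch configurations I'm comparing (the kernel body is just a placeholder):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel; the actual workload doesn't matter for the comparison.
__global__ void dummyKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    const int nBlocks = 90;
    const int nThreads = 256;
    float *d_data;
    cudaMalloc(&d_data, nBlocks * nThreads * sizeof(float));

    // Configuration A: one launch covering all 90 independent blocks.
    dummyKernel<<<nBlocks, nThreads>>>(d_data);

    // Configuration B: 90 launches of 1 block each, one per stream.
    // On GT200 these execute serially; the question is whether Fermi
    // can overlap them.
    cudaStream_t streams[nBlocks];
    for (int i = 0; i < nBlocks; ++i) {
        cudaStreamCreate(&streams[i]);
        dummyKernel<<<1, nThreads, 0, streams[i]>>>(d_data + i * nThreads);
    }
    cudaThreadSynchronize();

    for (int i = 0; i < nBlocks; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
```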

Thanks!

I don’t think anybody outside of NVIDIA has the answer to those questions, and I doubt the NVIDIA guys are in a position to answer that right now.

I rather doubt it will be as fast. There will be some parallel kernel execution possible on Fermi, but from what I understood from the white papers and so forth, it won't let you run 90 different programs, and as far as I'm concerned it makes no sense to support such a thing. I would rather NVIDIA spent their time and effort on other matters; it's enough if it can run 4-8 different programs. And since the GF100 seems to contain what are almost 4 separate GPUs, it seems it will be able to do that.

But I guess we will have to wait a bit more to see :)

Thanks

From what I understand (which is limited)… the number of parallel kernel calls may equal the number of SMs on the chip… I guess…

I remember reading that with Fermi you can run 2 kernels in parallel.
Actually, I'm not seeing much of an improvement in being able to run dozens of kernels at the same time. Why not just run one kernel with more threads, which will complete faster, and then start another one?

It depends on the algorithm you use and whether it offers enough parallelism to occupy all execution units. If not, you can execute multiple kernels simultaneously and get much better utilization of your hardware.

At this time it's not certain how many kernels you can execute in parallel. One thing is sure, however: it will be more than one.
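
As a sketch of what I mean (kernelA/kernelB are made-up examples, and this assumes the hardware actually supports concurrent kernels): launch each small kernel into its own stream, so the scheduler is at least allowed to overlap them.

```cpp
#include <cuda_runtime.h>

// Two small kernels that individually leave most SMs idle.
__global__ void kernelA(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] += 1.0f;
}

__global__ void kernelB(float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    b[i] *= 2.0f;
}

int main()
{
    const int n = 4 * 128;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each launch uses only 4 blocks, far too few to fill the chip.
    // On GT200 they run back to back; hardware with concurrent kernel
    // support could overlap them and improve utilization.
    kernelA<<<4, 128, 0, s1>>>(d_a);
    kernelB<<<4, 128, 0, s2>>>(d_b);
    cudaThreadSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```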

I seem to recall the limit being mentioned in the keynote at the GTC conference, but am not sure. We speculated afterward that the limit was one kernel per SM. Unfortunately, the Fermi white paper doesn’t seem to offer any details.

Yes it does. CUDA Fermi whitepaper V1.1, Page 11, summary table:

Concurrent Kernels: Up to 16

Which, not surprisingly, is the same as the number of SMs.

Yep, there it is. I mistakenly assumed that this would be mentioned in the section that discusses concurrent kernels… silly me.

Hehe, you should know by now that you can find information in lots of different places in NVIDIA docs ;)

Seems like quite a feat that each SM can run its own kernel, and not e.g. each GPC. The architecture really got a lot more parallel with Fermi.

It gets even more complicated and confusing when you look at the graphics architecture of Fermi. There are 4 separate raster engines… so Fermi itself looks a little like 4 GPUs, each with 4 SMs. However, for CUDA it may be that the 4x4 distinction is ignored and the scheduling subunits look more like 1x16.

So my hunch was right :) … we can run up to 16 kernels… very interesting! I'm wondering how much overhead there will be to switch kernels during execution. Will it be the same as the current 10 microseconds?

I even read on one of the pages mentioned in this thread that each of the GPCs can have a different number of SMs (so you can have two with 4 and two with 3) and still the load will get distributed evenly over the SMs (for graphics), so there should be per-SM load distribution somewhere.

10 microseconds? If you launch kernels back to back, there is a 3 microsecond overhead. I suspect that here there will be none, as it will be the same context switching that occurs today, unless Fermi supports kernel context switching, meaning loading a new instruction set onto an SM, but I highly doubt that.
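
If anyone wants to check once the hardware is out, here is a rough sketch of measuring launch overhead with CUDA events by timing back-to-back launches of an empty kernel (the launch count is arbitrary, and the result will depend on driver and hardware):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// An empty kernel, so the measured time is dominated by launch overhead.
__global__ void emptyKernel() {}

int main()
{
    const int nLaunches = 1000;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();  // warm-up launch

    cudaEventRecord(start, 0);
    for (int i = 0; i < nLaunches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average per-launch time: %.2f us\n", ms * 1000.0f / nLaunches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```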