How does concurrent kernel execution work on Fermi?

Is Fermi a true MIMD architecture, where each multiprocessor can execute a different kernel? Or is it maybe one kernel per GPC?

Why does CUDA Toolkit 3.0 allow us to run only four kernels concurrently?

Or maybe these four kernels are not executed on different multiprocessors and are only interleaved?

PS. Sorry for my imperfect English.

According to the white papers, each multiprocessor can execute a different kernel, although CUDA 3.0 currently limits this to 4 concurrent kernels. I don’t know what that has to do with MIMD, though.

I believe tmurray said here in the forums that even one MP can run multiple kernels… it’s surprisingly flexible.

But yes there’s a limit of at most 4.

Any idea why there’s that limit?

I’ll make a guess…

Section 3.2.6.3 of the CUDA Programming Guide Version 3.0 says that some 2.0 devices support this, and adds: “The maximum number of kernel launches that a device can execute concurrently is four.”

The documentation is vague on whether the set of SMs is partitioned and assigned to kernels, or whether there is true intra-SM scheduling of kernels.

If there actually is concurrent kernel scheduling within each SM, then I’ll guess two possible reasons for a limit of 4 kernels:

    the warp scheduling and code caches become a little more complicated (at the least).

    it’s probably not practical to run more than four Compute Capability 1.1 kernels at once on 2.0 hardware. If 1.1 kernels are written to work with 8K registers and 16KB of shared memory per SM, then 3 or 4 could fit into the 32K registers and 48KB of shared memory of a Fermi SM. Furthermore, I assume new kernels targeting 2.0 will “expand to fit” the new resource capabilities and reduce the opportunity for running concurrent kernels, assuming you have a problem large enough to keep Fermi busy!
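
To make the "3 or 4" in the second guess concrete, here is a minimal C++ sketch of that back-of-the-envelope arithmetic. It assumes, as the guess does, that each kernel is sized to fill one Compute Capability 1.1 SM (8K registers, 16KB shared memory) and that a Fermi SM offers 32K registers and 48KB of shared memory. The struct, constants, and function names are hypothetical and exist only for this illustration; nothing here reflects how the hardware actually schedules kernels.

```cpp
#include <algorithm>

// Per-SM resource budget. Hypothetical type, introduced only for this sketch.
struct SmResources {
    int registers;         // number of 32-bit registers
    int shared_mem_bytes;  // shared memory in bytes
};

// Figures quoted in the post above (assumptions of the guess, not measurements).
constexpr SmResources kCc11Footprint{8 * 1024, 16 * 1024};  // kernel sized for one 1.1 SM
constexpr SmResources kFermiSm{32 * 1024, 48 * 1024};       // one Fermi (2.0) SM

// Number of kernels with the given footprint that fit side by side in one SM.
// The scarcer of the two resources sets the limit.
constexpr int kernels_that_fit(const SmResources& sm, const SmResources& footprint) {
    return std::min(sm.registers / footprint.registers,
                    sm.shared_mem_bytes / footprint.shared_mem_bytes);
}

// Registers alone would allow 4 kernels, shared memory alone only 3.
static_assert(kFermiSm.registers / kCc11Footprint.registers == 4,
              "register budget allows four kernels");
static_assert(kFermiSm.shared_mem_bytes / kCc11Footprint.shared_mem_bytes == 3,
              "shared memory budget allows three kernels");
static_assert(kernels_that_fit(kFermiSm, kCc11Footprint) == 3,
              "shared memory is the limiting resource");
```

Under these assumptions, registers would allow four such kernels and shared memory only three, which is where the "3 or 4" comes from. A kernel that uses less than a full 1.1 SM's worth of resources would of course change the numbers.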

It sure seems like it would be easier to just assign dedicated SMs to each concurrent kernel and keep things simple. But there are fewer SMs in Fermi than there were in Compute Capability <1.3 devices (which probably wouldn’t make a difference anyway).

Perhaps NVIDIA can answer this question? :whistling:

Spoiler alert: it’s up to 16 in 3.1. Sorry to ruin all of your speculation!

D’oh! :D