The concurrent-kernel-execution section of the CUDA Programming Guide Version 3.0 (the section number got mangled by the forum software) says some Compute Capability 2.0 devices support this, and that "The maximum number of kernel launches that a device can execute concurrently is four".
The documentation is vague about whether the set of SMs is partitioned and assigned to kernels, or whether there is true intra-SM scheduling of kernels.
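Either way, concurrency has to be requested explicitly by launching into distinct non-default streams; launches into the same stream (or the default stream) always serialize. A minimal sketch of what that looks like on a 2.0 device (the kernel here is a made-up placeholder, just to have something to launch):

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element kernel, purely for illustration.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int nStreams = 4;   // the documented concurrency limit
    const int n = 1 << 20;
    float *d[nStreams];
    cudaStream_t streams[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d[s], n * sizeof(float));
        // Launches in different streams *may* overlap on CC 2.0 hardware,
        // subject to whatever scheduling the device actually does.
        dummyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d[s], n);
    }
    cudaThreadSynchronize();  // CUDA 3.0-era name for device-wide sync

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```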
If there actually is concurrent kernel scheduling within each SM, then I'll guess at two possible reasons for a limit of four kernels:
1. The warp scheduling and code caches become at least a little more complicated.
2. It's probably not practical to run more than four Compute Capability 1.1 kernels at once on 2.0 hardware. If 1.1 kernels were written to work within 8K registers and 16 KB of shared memory per SM, then 3 or 4 of them could fit into the 32K registers and 48 KB of shared memory of a Fermi SM. Furthermore, I assume new kernels targeting 2.0 will "expand to fit" the new resource capacities and reduce the opportunity for running concurrent kernels, assuming you have a problem large enough to keep Fermi busy!
It sure seems like it would be easier to just assign dedicated SMs to each concurrent kernel and keep things simple. But Fermi has fewer SMs than the Compute Capability 1.3 devices did (which probably wouldn't make a difference anyway).