In the CUDA Programming Guide 3.0, section 3.2.6.3, it states that some devices with compute capability 2.0 can run up to four kernels concurrently. So let's say I launch two matrix multiplications concurrently (C1 = A1B1 and C2 = A2B2).

This means that, all else being equal, the wall time for obtaining C1 and C2 would be nearly half of what it used to be (assuming the overhead from concurrent launches is negligible), correct?

Would there be an issue (e.g. not possible, or a time delay) if some of the input data were shared by both kernel launches (e.g. A1 = A2)?

Is it also possible to do asynchronous (concurrent) CPU-GPU operations, and to overlap host-device data transfers with concurrent GPU operations?

Concurrent kernel launches give no sudden doubling of compute power… it's just a scheduling convenience that allows different streams of the same context to run simultaneously. In some cases it's possible to save a little efficiency from this if you already have inefficient, low-block-count kernels, mostly by reducing the number of idle SMs while kernels are finishing up. In your example of matrix multiplication, if your matrices are big enough that you'd use, say, 60+ blocks to compute them, there's no big benefit to concurrent kernels. If your matrices were very small and you didn't need many blocks (say 5), then yes, you could get (up to) a 2X savings, but that's mostly because of the inefficiency of small problems, not some real bonus compute power. In all cases, bigger problems are more efficient (fewer idle SMs), so the concurrency is just there to make the inefficient small problems less painful.
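To make the mechanics concrete, here's a minimal sketch of issuing the two multiplications into separate streams of the same context. The `matmul` kernel is a hypothetical naive placeholder, not anyone's actual code; the point is only that the two launches in different streams are *eligible* to overlap on compute 2.0 hardware:

```cuda
// Sketch: two independent matrix multiplies issued into two streams.
// "matmul" is a hypothetical naive kernel; any two independent kernels
// in different streams of the same context behave the same way.
__global__ void matmul(const float* A, const float* B, float* C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

void launch_pair(const float* A1, const float* B1, float* C1,
                 const float* A2, const float* B2, float* C2, int n)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);

    // Kernels in different streams MAY run concurrently on compute 2.0;
    // whether they actually overlap depends on free SM resources.
    matmul<<<grid, block, 0, s1>>>(A1, B1, C1, n);
    matmul<<<grid, block, 0, s2>>>(A2, B2, C2, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

Note that if `n` is large enough for each launch to fill the GPU on its own, the second kernel's blocks will simply queue behind the first's, which is exactly the "no big benefit for big problems" point above.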

We're also not sure exactly how well Fermi will react to different concurrent-kernel situations… if you have two kernels running and one finishes, will the other kernel then GROW its active SM count and take over the now-idle SMs? (Probably…) What about the opposite: will a running kernel start to leave SMs idle in order to free them up for a second kernel? (Probably not…)

For your question #2: it would probably make no big difference whether some of the input data is shared or not. It MIGHT be that if one input matrix were the same, Fermi's L2 cache would be more efficient, but how big an effect that would be is an open question (and it would only matter if you were memory-bandwidth limited anyway). You may well get that same L2 efficiency from sequential executions too.

For #3, you can do concurrent kernel execution and memory transfers even on G200 GPUs… it's a feature of Compute 1.2 and later. Fermi's Compute 2.0 adds concurrent bidirectional memory transfers.
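As a sketch of what that overlap looks like in practice (`some_kernel`, `grid`, `block`, and `bytes` are hypothetical placeholders; the host buffer must be page-locked via `cudaMallocHost` for `cudaMemcpyAsync` to actually be asynchronous):

```cuda
// Sketch: overlap a host-to-device copy with a kernel in another stream.
float *h_in, *d_in, *d_work;
cudaMallocHost(&h_in, bytes);   // pinned host memory, required for async copies
cudaMalloc(&d_in, bytes);
cudaMalloc(&d_work, bytes);

cudaStream_t copy_stream, compute_stream;
cudaStreamCreate(&copy_stream);
cudaStreamCreate(&compute_stream);

// Upload the next batch while the current batch is being computed.
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copy_stream);
some_kernel<<<grid, block, 0, compute_stream>>>(d_work);  // hypothetical kernel

cudaStreamSynchronize(copy_stream);
cudaStreamSynchronize(compute_stream);
```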

There's no discrete partitioning of SMs for concurrent kernels. You launch blocks from one kernel until there are no more blocks in that kernel. If there's another pending kernel in a different stream (i.e., the two can run simultaneously), the second kernel's blocks will be launched wherever there are free resources. As resources from the first kernel's blocks are freed, blocks from the second kernel take their place.

That's interesting… it wasn't clear that scheduling was per-SM and not for some partitioned subgroup, especially because the "up to 4" limit in the docs seems to match up with the Fermi die's 4-way symmetry. (That's a big reach, of course, but it would explain the "up to 4" documentation if follow-on lower-SM parts had 2 or 1 "quadrants" of the Fermi die, with each quadrant having its own SM job allocator. It would also be logical if subgroups of SMs shared an instruction cache, which is undocumented behavior even in current parts.)

My answer to BlahCuda still stands, though… you will see efficiency boosts when you have small-block kernels that are inefficient now. Larger block count kernels are already efficient even on current GPUs.

When Tim starts the inevitable “What do you want to see in CUDA 4.0?” thread, I’ll post with “please let us reserve some SMs from scheduling to allow concurrent kernels from DIFFERENT contexts, especially for graphics.” I’d love to tell CUDA to use only 15 out of 16 of my Feynman GPU’s SMs and use the remaining 1 SM just for the display.

I didn’t read it that way at first, but maybe you’re right.

But if a single SM can mix blocks from multiple kernels, there are opportunities for more efficient processing, but also possibilities of LESS efficient processing.

Imagine your two kernels are independent, in the same context, and one uses (say) 1/3 of an SM’s registers and one uses 1/2 of an SM’s registers.

If one block of each type gets loaded into one SM, then two SMs might run one of each kernel… when it would have been more efficient for one SM to run 2 blocks from the “1/2 registers” kernel and the other SM to run THREE blocks from the “1/3 registers” kernel.

Plus, mixing kernels will thin out the coherence of the instruction cache (which is undocumented anyway, but still…)

Conversely, there are clearly cases where an SM has extra resources sitting idle and "cheap" kernel blocks might fit into those gaps, giving you performance that would otherwise be wasted… so there could obviously be cases of big wins too.

So, Tim… can one SM run multiple kernels simultaneously? As E.D says, that would be groovy!

Thanks for the info on the scheduler, Tim. NVIDIA is usually so secretive about that…

This info has some pretty far-reaching implications, not just for the concurrent kernel launches everyone is obsessed with, but for single kernel launches that have uneven workload distributions across blocks. Fermi should be much more efficient than Tesla at these kernels if blocks are truly now scheduled dynamically based on the available resources.

Well, none of this is difficult at all to figure out when you’ve got a GF100, and I’d rather set expectations now than have everyone be disappointed later.