Up to 16 kernels can run simultaneously on a Fermi card (ver. 3.1). Does this also hold for launches of the same kernel in different streams? The statement in the guide hasn’t changed: “Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness”.
Is resource distribution for concurrent kernels the same as for concurrent thread blocks?
Simultaneous kernel execution only happens with kernels in different streams. Kernels in a single stream never execute simultaneously. Streams are the method you use to define serial data dependencies between kernel calls, letting the high-level GPU scheduler know which kernels/copies can run simultaneously.
But that’s the entire purpose of streams, to explicitly specify GPU copies and executions which are serially dependent. When you have a set of kernels that are not interdependent, then you assign them different streams to specify that fact and allow the GPU scheduler to optimize their execution.
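To make that concrete, here is a minimal sketch of independent launches placed in separate streams. The kernel `scale`, the helper `runInStreams`, and the `CHECK` macro are hypothetical names made up for illustration. The same kernel function is launched in every stream here; different kernel functions would be launched the same way. The grids are kept small on purpose, since kernels can only overlap when the device has resources to spare, and, as the guide says, overlap is allowed but never guaranteed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiplies every element of one buffer by factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches `scale` once per stream, each launch on its own buffer, so the
// launches have no data dependencies on one another.
// Returns true if every buffer holds the expected result.
bool runInStreams(int numStreams, int n)
{
    std::vector<cudaStream_t> streams(numStreams);
    std::vector<float *> dev(numStreams, nullptr);
    std::vector<float> host(n, 1.0f);
    const size_t bytes = n * sizeof(float);

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc((void **)&dev[s], bytes));
        CHECK(cudaMemcpy(dev[s], host.data(), bytes, cudaMemcpyHostToDevice));
    }

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < numStreams; ++s) {
        // The fourth launch parameter selects the stream. Launches in
        // different streams MAY overlap; launches in one stream never do.
        scale<<<blocks, threads, 0, streams[s]>>>(dev[s], n, 2.0f);
        CHECK(cudaGetLastError());
    }

    // Wait for all streams before reading the results back.
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaMemcpy(host.data(), dev[s], bytes, cudaMemcpyDeviceToHost));
        for (int i = 0; i < n; ++i) ok = ok && (host[i] == 2.0f);
        CHECK(cudaFree(dev[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return ok;
}

int main()
{
    // 16 streams to match the Fermi limit mentioned above.
    bool ok = runInStreams(16, 1 << 10);
    std::printf("results %s\n", ok ? "correct" : "wrong");
    return ok ? 0 : 1;
}
```

The result is correct whether or not the kernels actually overlap, which is the point of the guide's warning: streams only declare independence, they don't promise concurrency. A profiler timeline is the way to see whether overlap really happened on a given card.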
My question was probably not clear enough. Nevertheless, it seems that I got the answer.
I’ll rephrase: if a kernel is a ‘kernel function’, does running 16 kernels concurrently mean running 16 different kernel functions, rather than the same function in different streams?
It’s a silly question, I know, but I wanted to be sure.