Kernels from different non-blocking streams are not executed concurrently

Here are some of my understanding of the cuda stream and hardware kernel scheduler:

  1. The hardware will dispatch the kernels sequentially. If the current kernel still has some blocks that are not scheduled to SMs, the scheduler will not try to dispatch another kernel.
  2. The hardware will dispatch the blocks in a kernel into SMs when there are enough compute and memory resources.

And I have made an experiment for these two assumptions on GV100.

Code:

__global__ void sleep_for(long arg) {
    for (int i = 0; i < arg; i++)
        __nanosleep(1000);
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    sleep_for<<<240, 32, 10 * 1024, s1>>>(1000);
    sleep_for<<<40, 32, 0, s2>>>(1000);
    cudaDeviceSynchronize();
}

As expected, these two kernels execute concurrently.
Screenshot from 2021-03-26 11-00-55
However, if I exchanged the order of these two kernels like this:

    sleep_for<<<40, 32, 0, s2>>>(1000);
    sleep_for<<<240, 32, 10 * 1024, s1>>>(1000);

Instead, they execute sequentially.
Screenshot from 2021-03-26 11-01-35

So are there any other requirements for the concurrent kernel execution?

It’s an interesting observation. I also witness it, and what I observe is that in the case where things are not running concurrently, if you remove the dynamic shared memory allocation from the 2nd kernel launch, then things run concurrently. Even a 1 byte allocation seems to make the difference. And of course in the concurrent case the first kernel launch is making the same dynamic shared memory allocations. So I’m not suggesting I can explain it, it’s just further observation. I don’t know of anything that would explain it.

If this case is important to you, my suggestion would be to file a bug using the instructions linked to a sticky post at the top of this sub-forum. As a general statement, its usually good practice to test such things on the latest available CUDA version. I happened to test on V100, CUDA 11.1