Here is my understanding of CUDA streams and the hardware kernel scheduler:
- The hardware dispatches kernels sequentially: as long as the current kernel still has blocks that have not been scheduled onto SMs, the scheduler will not start dispatching the next kernel.
- The hardware dispatches a kernel's blocks onto SMs whenever enough compute and memory resources (e.g. registers, shared memory, thread slots) are free. (A quick device-capability check is sketched right after this list.)
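Side note, not part of the experiment below: a minimal sanity check that the device reports support for concurrent kernels at all, using cudaDeviceGetAttribute from the runtime API.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, concurrent = 0, sms = 0;
    // Whether the device can run kernels from different streams concurrently.
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentKernels, dev);
    // Number of SMs, which bounds how many blocks can be resident at once.
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    printf("concurrentKernels = %d, SM count = %d\n", concurrent, sms);
    return 0;
}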
I ran an experiment to test these two assumptions on a GV100.
Code:
__global__ void sleep_for(long arg) {
    // Keep each block resident for roughly arg microseconds.
    for (long i = 0; i < arg; i++)
        __nanosleep(1000);
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // 240 blocks of 32 threads, 10 KiB of dynamic shared memory per block, on s1.
    sleep_for<<<240, 32, 10 * 1024, s1>>>(1000);
    // 40 blocks of 32 threads, no dynamic shared memory, on s2.
    sleep_for<<<40, 32, 0, s2>>>(1000);
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
As expected, these two kernels execute concurrently.
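To reproduce this without a profiler, a rough check is to time the pair of launches with events (this timing harness is my addition, and it assumes the legacy default stream, i.e. not compiled with --default-stream per-thread): if the kernels overlap, the measured span should be close to one kernel's runtime rather than the sum of both.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sleep_for(long arg) {
    for (long i = 0; i < arg; i++)
        __nanosleep(1000);
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Events are recorded on the legacy default stream: the launches on s1/s2
    // wait for `start`, and `stop` waits for both kernels to finish,
    // so the measured interval spans the combined execution.
    cudaEventRecord(start);
    sleep_for<<<240, 32, 10 * 1024, s1>>>(1000);
    sleep_for<<<40, 32, 0, s2>>>(1000);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.2f ms\n", ms);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}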
However, if I swap the order of the two launches:
sleep_for<<<40, 32, 0, s2>>>(1000);
sleep_for<<<240, 32, 10 * 1024, s1>>>(1000);
the two kernels execute sequentially.
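In case per-SM resources are relevant here, the occupancy API can report how many blocks of each launch configuration fit on one SM. This is a sketch I added for reference; I have not shown its output.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void sleep_for(long arg) {
    for (long i = 0; i < arg; i++)
        __nanosleep(1000);
}

int main() {
    int withSmem = 0, noSmem = 0;
    // Theoretical max resident blocks per SM for the 10 KiB-smem launch...
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&withSmem, sleep_for, 32, 10 * 1024);
    // ...and for the launch with no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&noSmem, sleep_for, 32, 0);
    printf("blocks/SM: %d (10 KiB smem), %d (no smem)\n", withSmem, noSmem);
    return 0;
}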
So, are there any other requirements for concurrent kernel execution?