Block scheduler - Question about the priority of scheduling kernel blocks on concurrent streams

Hi, I'm trying to understand the rules of the block scheduler for concurrent streams. I have two concurrent streams (stream0, stream1) and two kernels (kernel0, kernel1); kernel0 runs on stream0 while kernel1 runs on stream1. After calling the kernels in different orders, the priority with which blocks are scheduled for each kernel seems to differ.

Observing the SM id (asm("mov.u32 %0, %%smid;" : "=r"(smid));), when calling in the following order, the block scheduler satisfies kernel1 first, looping over the even-numbered SMs and then the odd-numbered ones.
kernel0<<<1Dimension, XX, 0, stream0>>>();
kernel1<<<1Dimension, XX, 0, stream1>>>();
However, if the calling order is reversed, the block scheduler satisfies kernel0 first. By the way, my environment is Linux with a Tesla V100 (compute capability 7.0) and CUDA 10.2.
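For reference, a minimal sketch of the experiment described above (the block/thread counts and the busy-wait length are placeholder assumptions, not from the original post). Each block records the SM it lands on via the %smid special register, which must be written as %%smid inside an asm() string since % is an escape character there:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Records which SM each block was scheduled on.
__global__ void whoami(int *smids, int tag)
{
    unsigned smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        printf("kernel%d block %d -> SM %u\n", tag, blockIdx.x, smid);
    // Busy-wait so the kernel stays resident long enough for the
    // two launches to actually overlap on the device.
    for (volatile int i = 0; i < 1000000; ++i) { }
    if (threadIdx.x == 0)
        smids[blockIdx.x] = (int)smid;
}

int main()
{
    const int blocks = 8, threads = 128;   // placeholder launch configuration
    int *s0, *s1;
    cudaMalloc(&s0, blocks * sizeof(int));
    cudaMalloc(&s1, blocks * sizeof(int));

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    // Swap these two lines to test the reversed calling order.
    whoami<<<blocks, threads, 0, stream0>>>(s0, 0);
    whoami<<<blocks, threads, 0, stream1>>>(s1, 1);
    cudaDeviceSynchronize();

    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
    cudaFree(s0);
    cudaFree(s1);
    return 0;
}
```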

Are there any known rules for the scheduling priority of kernels in concurrent streams? Or is it out of our control? And if it can be controlled, for example to make kernel1 satisfied first, how should I code it?

There is no priority established by CUDA itself. You should not depend on block scheduling order.

If you want priority, you can investigate the stream priority mechanism.

However, you pretty much need "long-running" kernels that can execute concurrently. If the kernels do not execute concurrently, perhaps because one kernel fully occupies the GPU, launching a second kernel after the first may not change much. There are various questions on these forums from people who have experimented with stream priority and were questioning the results.
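As a sketch of the stream priority mechanism: query the device's valid priority range first, then create the streams with cudaStreamCreateWithPriority (the stream names and commented-out kernel launches below are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

int main()
{
    // Query the valid priority range for this device.
    // Numerically lower values mean higher priority.
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    cudaStream_t lowStream, highStream;
    cudaStreamCreateWithPriority(&lowStream,  cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&highStream, cudaStreamNonBlocking, greatestPrio);

    // Pending blocks from kernels on highStream are preferentially
    // scheduled ahead of pending blocks on lowStream, provided the
    // kernels actually execute concurrently.
    // kernel0<<<grid, block, 0, lowStream>>>();
    // kernel1<<<grid, block, 0, highStream>>>();
    cudaDeviceSynchronize();

    cudaStreamDestroy(lowStream);
    cudaStreamDestroy(highStream);
    return 0;
}
```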

Thanks. cudaStreamCreateWithPriority fixes my issue.