Scheduling Warps of different kernels in the same cycle on the same SM

Hi all,

Assume I have 2 kernels A and B. I’m using CUDA streams to execute both kernels concurrently. Further assume that I launch both kernels with 64 threads per thread block, i.e. 2 warps, and I’m running on an H100 with 4 warp schedulers per SM.

When an SM is running a thread block of each kernel concurrently, is it possible for it to schedule warps of both kernels in the same cycle across the 4 subpartitions of the SM? I’m of course assuming that all warps are eligible. For example could it be possible that in the same cycle on the same SM:

  • SMSPs 0 and 1 schedule the 2 warps from kernel A
  • SMSPs 2 and 3 schedule the 2 warps from kernel B

Or is it the case that the warps of different kernels will be scheduled in different cycles, even if that means that some resources will be left unused?

Thanks a lot for your help in advance!

AFAIK yes, that should be possible (as long as you do not use MIG to separate the kernels, and as long as both kernels can be resident concurrently, e.g. a single kernel is not using up all the shared memory of the SM).

There is no strong protection between different kernels. There are forum posts reporting that it was possible to access the shared memory of the other kernel with a pointer reaching beyond the boundaries (which can happen by accident and should not be relied upon). That further shows that the kernels are not separated at the SM level and is a hint that cycles would not be exclusive to one of the kernels.

CUDA mostly works at the warp level, with the exception of some synchronization functions and some bookkeeping about whether the kernel is still running. It does not matter much from which kernel the warps originate.

If that did not work, you could alternatively combine both kernels into one and let half of the blocks run function A and the other half run function B.
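As a rough illustration of that fallback, here is a minimal sketch; functionA/functionB and fusedKernel are hypothetical names standing in for the real kernel bodies, not anything from the original question:

```cuda
// Hedged sketch: one fused kernel in which the first blocksForA blocks do
// kernel A's work and the remaining blocks do kernel B's work.
__device__ void functionA(float *a, int i) { a[i] *= 2.0f; }   // placeholder for A's work
__device__ void functionB(float *b, int i) { b[i] += 1.0f; }   // placeholder for B's work

__global__ void fusedKernel(float *a, float *b, int blocksForA)
{
    // compute a block index local to the half of the grid this block belongs to
    int localBlock = (blockIdx.x < blocksForA) ? blockIdx.x : blockIdx.x - blocksForA;
    int idx = localBlock * blockDim.x + threadIdx.x;

    if (blockIdx.x < blocksForA)
        functionA(a, idx);   // first part of the grid runs A
    else
        functionB(b, idx);   // remaining blocks run B
}

// launch (error checking omitted), e.g. with 64 threads per block as above:
//   fusedKernel<<<blocksForA + blocksForB, 64>>>(dA, dB, blocksForA);
```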

When a thread block is launched on an SM, the thread block is rasterized into warps and the warps are assigned to an SM sub-partition warp scheduler. On each cycle a warp scheduler can pick from any of its eligible warps. The SM supports a maximum number of resident thread blocks, and each thread block can be from a different grid launch.

CUDA does not provide fine-grained scheduling control to guarantee concurrent kernel execution. This is left to the developer through optimal launch configuration and careful use of resources such as registers/thread, warps/block, shared memory/block, and the shared memory configuration.

If you are prototyping, you can read the PTX special registers %smid and %warpid along with either clock64() or %globaltimer to generate a warp trace and determine whether two kernels have warps executing concurrently on the same SM.
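A minimal sketch of what such a trace kernel might look like (WarpTrace, tracedKernel, and the dummy loop are my own illustrative names, not part of the suggestion above); note that clock64() counters are per-SM, so the timestamps are only directly comparable between warps that report the same %smid, while %globaltimer would give a globally comparable timestamp:

```cuda
#include <cstdio>

// One record per warp: which SM it ran on, which hardware warp slot, and
// start/stop timestamps, so traces from two kernels can be compared for overlap.
struct WarpTrace { unsigned smid, warpid; long long start, stop; };

__device__ unsigned read_smid()   { unsigned r; asm("mov.u32 %0, %%smid;"   : "=r"(r)); return r; }
__device__ unsigned read_warpid() { unsigned r; asm("mov.u32 %0, %%warpid;" : "=r"(r)); return r; }

__global__ void tracedKernel(WarpTrace *trace, float *data, int iters)
{
    long long start = clock64();

    // stand-in for the real work of kernel A or B
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[gid];
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;
    data[gid] = v;

    long long stop = clock64();

    if (threadIdx.x % 32 == 0) {           // one record per warp
        int slot = gid / 32;               // global warp index into the trace array
        trace[slot] = { read_smid(), read_warpid(), start, stop };
    }
}
```

Launching this kernel (or two variants of it) on two streams and then checking on the host for records with the same smid and overlapping [start, stop] intervals would show concurrent residence on the same SM, although not same-cycle issue.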

Thanks a lot for both of your answers! From your answers it seems clear to me that it is entirely possible for warps of different kernels to be dispatched in the same cycle on the same SM.

I have also actually written a small prototype as you suggested, and it clearly shows that the kernels are executing concurrently on the same SM. But I cannot think of a way to test whether the kernels were also scheduled in the same cycle on the same SM. I assume there is not really a way of doing this even at the PTX level, right?

In my opinion, that would be quite difficult to demonstrate with a directed test or other piece of code. Even without the consideration of two separate kernels, it would be fairly difficult to show that two warps, on two separate SMSPs, were scheduled in the same cycle.

AFAIK the GPU does not expose that sort of instrumentation to the developer, or even to tools like Nsight Compute, and otherwise being able to infer it from a directed test would be genius-level work, IMO.

PTX or other developer mechanisms do not allow you to direct GPU activity at this level (i.e. influence which warp will be selected in which cycle for a particular SMSP).

I guess if I were going to try to write a directed test, I would like it to be as simple as possible. Ideally it would use a GPU with a single SM, although those generally don’t exist any more (and the ones I have access to are e.g. of the Kepler generation, which does not support the latest tools). In any event, I would try to write code that deposited only one warp from kernel A, and only one warp from kernel B, on a particular SM. Then I would try to infer from the total cycles elapsed, or total cycles issued, reported in Nsight Compute, that the only possible way to reach that level is that in one or more cycles both warps were issued in the same cycle. With a long enough running code, I’m assuming this would essentially be obvious. You would want to minimize or eliminate dependencies so that each warp is nearly always issuable, which would be difficult but might be possible with careful instruction choice. If that turned out to be impossible even with careful instruction choice, you might be forced to go back, add more warps, and factor the average number of issuable warps into your inference. It would be difficult.
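Purely as an illustration of that idea, the two kernels might look something like the sketch below; the names, constants, and the use of a few independent FMA accumulators are my own, there is no guarantee the two single-warp blocks land on the same SM (the %smid check from the earlier trace sketch would be needed for that), and the hard inference step from Nsight Compute cycle counters is not addressed here:

```cuda
// Hedged sketch of the directed test described above: each kernel is a single
// warp doing a long run of mostly independent FMAs, launched on its own
// stream, so each warp is issuable on most cycles.
__global__ void singleWarpA(float *out, int iters)
{
    // a few independent accumulators to reduce dependency stalls
    float a0 = threadIdx.x, a1 = a0 + 1.f, a2 = a0 + 2.f, a3 = a0 + 3.f;
    for (int i = 0; i < iters; ++i) {
        a0 = fmaf(a0, 1.000001f, 0.5f);
        a1 = fmaf(a1, 1.000001f, 0.5f);
        a2 = fmaf(a2, 1.000001f, 0.5f);
        a3 = fmaf(a3, 1.000001f, 0.5f);
    }
    out[threadIdx.x] = a0 + a1 + a2 + a3;
}

__global__ void singleWarpB(float *out, int iters)
{
    // same structure with different constants, standing in for kernel B
    float b0 = threadIdx.x * 2.f, b1 = b0 + 1.f, b2 = b0 + 2.f, b3 = b0 + 3.f;
    for (int i = 0; i < iters; ++i) {
        b0 = fmaf(b0, 0.999999f, 0.25f);
        b1 = fmaf(b1, 0.999999f, 0.25f);
        b2 = fmaf(b2, 0.999999f, 0.25f);
        b3 = fmaf(b3, 0.999999f, 0.25f);
    }
    out[threadIdx.x] = b0 + b1 + b2 + b3;
}

// host side (error checking omitted):
//   cudaStream_t s0, s1;  cudaStreamCreate(&s0);  cudaStreamCreate(&s1);
//   singleWarpA<<<1, 32, 0, s0>>>(dA, iters);
//   singleWarpB<<<1, 32, 0, s1>>>(dB, iters);
```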

Thanks a lot for your detailed response and also for the suggestion of how one might be able to measure this. I see your idea, but I agree that it would not be trivial to do.

From Greg’s response, I get that each warp scheduler simply chooses from all eligible warps without differentiating which kernel they originally came from. This is more than enough to understand the behavior I was observing that led me to this post in the first place. I was expecting that there is no straightforward way of measuring this, but it’s nice to have that confirmed :)
