How to Achieve Tighter Kernel Scheduling Across Multiple CUDA Streams?

I’ve implemented parallel scheduling logic in my custom AI compiler that identifies parallelizable IR sections and assigns them to different CUDA streams for concurrent execution. While I do observe some kernel overlap, the nsys timeline shows that kernels are not tightly packed: there are visible idle gaps between them, and some short-duration kernels that should run in parallel actually execute serially.
(Screenshot: Kernel Execution Status Excerpt)

(Screenshot: Kernel Execution Status Within a Certain Parallel Group)
My implementation uses multiple streams for parallel execution groups, and the kernels typically have sub-millisecond execution times. I’ve already implemented a stream pool and a handle pool to avoid repeatedly creating and destroying streams and library handles during execution, but the scheduling gaps persist.
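
For reference, the stream pool follows the usual pattern of creating non-blocking streams once up front and handing them out round-robin; a simplified sketch (error handling and the handle pool omitted):

```cpp
// Simplified sketch of a stream pool: streams are created once with the
// non-blocking flag (so they do not implicitly synchronize with the default
// stream) and handed out round-robin, never created/destroyed per kernel.
#include <cuda_runtime.h>
#include <vector>

class StreamPool {
public:
    explicit StreamPool(size_t n) : streams_(n) {
        for (auto &s : streams_)
            cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    }
    ~StreamPool() {
        for (auto s : streams_) cudaStreamDestroy(s);
    }
    cudaStream_t acquire() {
        cudaStream_t s = streams_[next_];
        next_ = (next_ + 1) % streams_.size();
        return s;
    }
private:
    std::vector<cudaStream_t> streams_;
    size_t next_ = 0;
};
```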

Concerning performance: in my testing, my custom compiler with parallel scheduling is still less efficient end-to-end than PyTorch running everything on a single default stream, which suggests a fundamental issue with my scheduling approach. (My individual kernels are admittedly slower than PyTorch’s, by roughly 1-2x on average, but I expect parallel execution to offset that gap and even come out ahead.)

I have three main questions:

First, how can I make kernels within the same parallel group, or across different parallel groups, execute more tightly packed? I’m looking for techniques to minimize the gaps between kernel executions and achieve more seamless transitions.

Second, since I’m calling cuDNN and cuBLAS library functions for some operators, these highly optimized libraries launch many short-duration kernels, as observed in nsys. Is there any way to perform kernel fusion across these predefined cuDNN/cuBLAS API calls, or to achieve operator fusion at this level?

Third, is there any approach to optimizing cuDNN and cuBLAS execution efficiency without relying on runtime profiling? I understand that cuDNN’s convolution APIs use algorithm auto-tuning that requires actual execution to determine the optimal algorithm at runtime, and this runtime selection introduces overhead that interferes with parallel scheduling. Are there ways to pre-determine or cache optimal algorithms to avoid this overhead? One direction I’m considering is sketched below.
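
The idea would be to replace the benchmarking path with cuDNN’s heuristic query (which ranks algorithms without executing them) and cache the choice per shape. A rough, unvalidated sketch; the shapeKey construction and descriptor setup are placeholders on my side, and error checking is omitted:

```cpp
// Sketch: pick a convolution algorithm via cuDNN's heuristic query (no trial
// runs) and cache it per tensor/filter shape.
#include <cudnn.h>
#include <string>
#include <unordered_map>

static std::unordered_map<std::string, cudnnConvolutionFwdAlgo_t> g_algoCache;

cudnnConvolutionFwdAlgo_t pickConvAlgo(cudnnHandle_t handle,
                                       cudnnTensorDescriptor_t xDesc,
                                       cudnnFilterDescriptor_t wDesc,
                                       cudnnConvolutionDescriptor_t convDesc,
                                       cudnnTensorDescriptor_t yDesc,
                                       const std::string &shapeKey) {
    auto it = g_algoCache.find(shapeKey);
    if (it != g_algoCache.end()) return it->second;   // reuse the cached choice

    cudnnConvolutionFwdAlgoPerf_t perf[1];
    int returned = 0;
    // Heuristic-based query: ranks algorithms without running them, unlike
    // cudnnFindConvolutionForwardAlgorithm, which benchmarks at runtime.
    cudnnGetConvolutionForwardAlgorithm_v7(handle, xDesc, wDesc, convDesc, yDesc,
                                           1, &returned, perf);
    g_algoCache[shapeKey] = perf[0].algo;
    return perf[0].algo;
}
```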

Of course, if there are other possible issues that could cause the situation described above, I welcome your suggestions as well.

I would greatly appreciate any insights and suggestions from the community to help solve these challenges.

If it were my code, I would probably start by trying to understand resource utilization of each kernel. If you are actually looking for kernel overlap, that will be a necessary consideration. There are lots of internet posts discussing resource utilization as it relates to concurrent (i.e. overlapping) kernel execution.
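
As a concrete starting point, something like the following shows whether a single kernel already saturates the SMs, in which case there is nothing left for a second kernel to overlap with (a sketch; substitute your own kernel and launch configuration):

```cpp
// Sketch: inspect a kernel's resource footprint and theoretical occupancy.
// If one kernel's grid already fills every SM, a concurrent kernel has no
// free resources to overlap with.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *out) { /* placeholder kernel */ }

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, (const void *)myKernel);
    printf("regs/thread: %d  static shared mem/block: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);

    int blockSize = 256;   // the block size you actually launch with
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  (const void *)myKernel,
                                                  blockSize, /*dynSmem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int residentBlocks = maxBlocksPerSM * prop.multiProcessorCount;
    printf("device holds %d resident blocks of this kernel; a grid of that size "
           "or larger leaves no room for another kernel to overlap\n",
           residentBlocks);
    return 0;
}
```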

Beyond that, CUDA graphs were designed with exactly these ideas in mind: tighter scheduling, reduced launch overhead, and so on.
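
A rough illustration of the stream-capture route (a sketch, not tuned; the kernels and launch sizes are placeholders, and the cudaGraphInstantiate call assumes a CUDA 12.x toolkit):

```cpp
// Sketch: capture a two-stream parallel group into a CUDA graph once, then
// replay it with a single cudaGraphLaunch per iteration to cut launch overhead.
#include <cuda_runtime.h>

__global__ void kernelA(float *p) { /* placeholder */ }
__global__ void kernelB(float *p) { /* placeholder */ }

void buildAndRun(float *bufA, float *bufB, int iters) {
    cudaStream_t s0, s1;
    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaEvent_t fork, join;
    cudaEventCreate(&fork);
    cudaEventCreate(&join);

    // Capture phase: record the whole parallel group as one graph.
    cudaStreamBeginCapture(s0, cudaStreamCaptureModeGlobal);
    cudaEventRecord(fork, s0);
    cudaStreamWaitEvent(s1, fork, 0);      // s1 joins the capture
    kernelA<<<64, 256, 0, s0>>>(bufA);     // branch on stream 0
    kernelB<<<64, 256, 0, s1>>>(bufB);     // branch on stream 1, may overlap
    cudaEventRecord(join, s1);
    cudaStreamWaitEvent(s0, join, 0);      // rejoin before ending capture
    cudaGraph_t graph;
    cudaStreamEndCapture(s0, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0); // CUDA 12.x signature

    // Replay phase: one launch per iteration, minimal CPU-side overhead.
    for (int i = 0; i < iters; ++i)
        cudaGraphLaunch(exec, s0);
    cudaStreamSynchronize(s0);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaEventDestroy(fork);
    cudaEventDestroy(join);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```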

These libraries are closed source and generally “opaque”, meaning they are host-code routines that issue some set of device-level work in an unspecified fashion, with no external control over the sequence. So this is not likely to be a productive avenue, IMO. If you find CUBLAS or CUDNN to be problematic, you should first make sure that what you are asking for or expecting is reasonable, then file a bug.

In the meantime, CUTLASS is open source, giving you more-or-less complete control. I’m not suggesting there is a one-to-one replacement for every CUBLAS/CUDNN function in CUTLASS. They are quite different, but they cover similar areas.
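
For example, a minimal CUTLASS 2.x device-level GEMM looks roughly like the sketch below (default template parameters; not a drop-in replacement for any particular CUBLAS call, and real use would pick tile shapes and epilogues to match the workload):

```cpp
// Sketch: a basic single-precision CUTLASS GEMM launched on a caller-supplied
// stream, so it can be slotted into an existing multi-stream schedule.
#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C/D

cutlass::Status runGemm(int M, int N, int K,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float *C, int ldc,
                        cudaStream_t stream) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {A, lda}, {B, ldb},
                         {C, ldc}, {C, ldc},
                         {1.0f, 0.0f});        // alpha, beta
    return gemm_op(args, /*workspace=*/nullptr, stream);
}
```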

I would suggest asking detailed CUDNN questions on the CUDNN forum, and specific CUBLAS questions on the CUBLAS (math library) forum.