If it were my code, I would probably start by trying to understand the resource utilization (registers, shared memory, threads, SM occupancy) of each kernel. If you are actually looking for kernel overlap, that will be a necessary consideration: two kernels can only run concurrently if the first one leaves enough resources idle for the second. There are lots of internet posts discussing resource utilization as it relates to concurrent (i.e. overlapping) kernel execution.
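As a minimal sketch of that idea (hypothetical kernels, assuming a CUDA toolchain): query the theoretical occupancy of a kernel, then launch two small kernels into separate non-default streams, which is the precondition for overlap. Whether they actually overlap depends on resource availability, and you verify it with a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two deliberately small kernels; they can only overlap if neither one
// saturates the GPU's resources (SMs, registers, shared memory).
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;   // small grid, so SMs are left idle for overlap
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // How much of one SM does a single kernel occupy at this block size?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernelA, 256, 0);
    printf("kernelA: %d resident blocks per SM at 256 threads/block\n", blocksPerSM);

    // Launches in distinct non-default streams *may* overlap.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();   // confirm actual overlap in Nsight Systems

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

The stream setup only makes overlap *possible*; the profiler timeline is what tells you whether the hardware actually scheduled the kernels concurrently.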
Beyond that, CUDA graphs are notionally designed with these ideas in mind: tighter scheduling, reduced launch overhead, etc.
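A sketch of the launch-overhead angle (hypothetical kernel, standard runtime API): capture a repeated sequence of launches into a graph once, then replay the instantiated graph, paying a single launch cost per replay instead of one per kernel.

```cuda
#include <cuda_runtime.h>

__global__ void step(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture a 10-kernel sequence into a graph via stream capture.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 10; ++i)
        step<<<(n + 255) / 256, 256, 0, s>>>(x, n);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // Each replay launches the whole 10-kernel sequence at once.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(graphExec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(x);
    return 0;
}
```

The win is largest when the kernels are short, so that per-launch CPU overhead is a meaningful fraction of the total time.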
These libraries (CUBLAS and CUDNN) are closed source and generally "opaque", meaning they are host code routines that call a set of device-level work in some unspecified fashion, with no external control over the sequence. So this is not likely to be a productive avenue, IMO. If you find CUBLAS or CUDNN to be problematic, you should first make sure what you are asking for or expecting is reasonable, then file a bug.
In the meantime, CUTLASS is open source, giving you more-or-less complete control. I’m not suggesting there is a one-to-one replacement for every CUBLAS/CUDNN function in CUTLASS. They are quite different, but they cover similar areas.
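For a flavor of what "control" means here, a minimal single-precision GEMM using CUTLASS's device-level API (a sketch assuming the CUTLASS 2.x `device::Gemm` template; the sizes and allocations are hypothetical). Every level below this call, from threadblock tiling down to the inner loop, is visible source you can specialize or modify:

```cuda
#include <cuda_runtime.h>
#include "cutlass/gemm/device/gemm.h"

int main() {
    // Column-major SGEMM with default tile shapes for the target arch.
    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::ColumnMajor,   // A
        float, cutlass::layout::ColumnMajor,   // B
        float, cutlass::layout::ColumnMajor>;  // C

    int M = 1024, N = 1024, K = 1024;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * M * K);
    cudaMalloc(&B, sizeof(float) * K * N);
    cudaMalloc(&C, sizeof(float) * M * N);

    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},      // problem size
                         {A, M},         // A pointer, leading dimension
                         {B, K},
                         {C, M},         // source C
                         {C, M},         // destination D
                         {1.0f, 0.0f});  // alpha, beta
    cutlass::Status status = gemm_op(args);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return status == cutlass::Status::kSuccess ? 0 : 1;
}
```

Unlike a CUBLAS call, this is an ordinary kernel launch you own: you can pick the tile sizes, fuse an epilogue, or schedule it against other work in your own streams.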
I would suggest asking detailed CUDNN questions on the CUDNN forum, and specific CUBLAS questions on the CUBLAS (math library) forum.