Can I use a CUDA graph for a while loop?

Hello, I have a couple of kernels, let's say A, B and C. They are always executed in this order, with no memory allocations between launches, so capturing them in a graph is no problem. However, I execute them in a loop, like


and the loop can potentially have hundreds of iterations. Currently, to avoid issuing many separate kernel launches, I use cooperative groups and grid.sync(): I fused the A, B and C kernels and run the while loop inside one big kernel. However, this leads to high register usage and reduces occupancy. It also makes the code harder to debug …
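To make the fused approach concrete, here is a minimal sketch of the pattern, not the actual code. The names `fusedABC` and `launchFused`, the arithmetic in each stage, and the fixed `maxIters` bound standing in for the real loop condition are all placeholders. Error checking is omitted for brevity.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical fused kernel: stages A, B and C run inside one loop,
// separated by grid-wide barriers instead of separate kernel launches.
__global__ void fusedABC(float* data, int n, int maxIters) {
    cg::grid_group grid = cg::this_grid();
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Placeholder condition; the real loop would test something data-dependent.
    for (int it = 0; it < maxIters; ++it) {
        for (int i = tid; i < n; i += stride) data[i] += 1.0f;   // stage A
        grid.sync();
        for (int i = tid; i < n; i += stride) data[i] *= 0.5f;   // stage B
        grid.sync();
        for (int i = tid; i < n; i += stride) data[i] -= 0.25f;  // stage C
        grid.sync();
    }
}

// A cooperative launch requires all blocks to be resident at once,
// so the grid is sized from the occupancy query.
cudaError_t launchFused(float* dData, int n, int maxIters) {
    const int threads = 256;
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, fusedABC, threads, 0);

    dim3 grid(numSMs * blocksPerSM), block(threads);
    void* args[] = { &dData, &n, &maxIters };
    return cudaLaunchCooperativeKernel((void*)fusedABC, grid, block, args, 0, 0);
}
```

Because the grid must fit on the device all at once, the register footprint of the largest stage caps the number of resident blocks for every stage, which is where the occupancy loss comes from.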

Is it possible to use CUDA graphs for while loops in any way?
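One variant that seems workable is to capture A, B and C into a graph once and relaunch that graph from a host-side while loop, reading back a device flag after each launch. The sketch below assumes this approach. The kernels, the `dState` counter/flag pair, and the `maxIters` stopping rule are hypothetical placeholders, and error checking is omitted. `cudaGraphInstantiateWithFlags` is assumed to be available (CUDA 11.4 or newer).

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for A, B and C.
__global__ void kernelA(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void kernelB(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}
// C also updates the loop state: state[0] = iteration count, state[1] = done flag.
__global__ void kernelC(float* d, int n, int* state, int maxIters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] -= 0.25f;
    if (i == 0) {
        state[0] += 1;
        state[1] = (state[0] >= maxIters) ? 1 : 0;  // placeholder stop rule
    }
}

// Captures A -> B -> C once, then relaunches the graph until the flag is set.
// Returns the number of graph launches (at least one, like a do-while loop).
int runGraphLoop(float* dData, int n, int maxIters) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int* dState = nullptr;
    cudaMalloc(&dState, 2 * sizeof(int));
    cudaMemsetAsync(dState, 0, 2 * sizeof(int), stream);

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<blocks, threads, 0, stream>>>(dData, n);
    kernelB<<<blocks, threads, 0, stream>>>(dData, n);
    kernelC<<<blocks, threads, 0, stream>>>(dData, n, dState, maxIters);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&exec, graph, 0);

    int done = 0, launches = 0;
    while (!done) {
        cudaGraphLaunch(exec, stream);  // one launch covers A, B and C
        cudaMemcpyAsync(&done, dState + 1, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // host needs the flag before deciding
        ++launches;
    }

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(dState);
    cudaStreamDestroy(stream);
    return launches;
}
```

This keeps A, B and C as separate kernels with their own register budgets, at the cost of one graph launch and one host synchronization per iteration. Whether that is cheaper than the fused kernel would need measuring.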

Or maybe stream events? The kernels need to be executed sequentially, so I do not strictly need multiple streams, but I am wondering whether I could launch each kernel on a separate stream and have it wait on events from the other kernels until a finish event occurs. I am not sure whether this is possible or a good idea.

I also posted this question on Stack Overflow: Using loop in CUDA Graph