NVIDIA CUDA Streams

We are trying to solve a recursive problem whose depth is more than 30, whereas the hardware spec says the limit is 24.
Using streams, can we launch the kernels and make sure they are synchronized? We want to minimize kernel launch overhead and avoid explicit synchronization calls, assuming that the kernels execute in order.

Hi there @svramana1989 and welcome to the NVIDIA developer forums.

I took the liberty of moving this to the CUDA programming category where there are more CUDA experts.

But you should add more detail on your system configuration for them to be able to help.

Thanks!

The nesting depth limit of 24 comes from the CUDA dynamic parallelism (CDP) docs/specs, specifically CDP1. CDP1 is not a recommended way to avoid kernel launch overhead: a CDP launch has approximately the same overhead as an ordinary host kernel launch.

I’m not aware of a method using streams that would work around the nesting depth limit. You could certainly keep track of the nesting depth, and if it exceeded some level, then take an alternate path, but such a methodology seems roughly equivalent to converting a recursive method to a non-recursive method.
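To make the "convert a recursive method to a non-recursive method" idea concrete, here is a minimal host-side C++ sketch. It is not code from this thread: the `Task` and `Result` structs, the `sum_iterative` function, and the uneven range split are all hypothetical, chosen only so that the equivalent recursion would be as deep as the input is long. The pending work sits in an explicit stack, and each item carries the depth at which the recursive version would have been called, so the depth can be tracked (or capped) without any nested calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical work item: a half-open index range [lo, hi) plus the depth
// at which a recursive version would have been invoked on that range.
struct Task {
    int lo;
    int hi;
    int depth;
};

// Hypothetical result: the computed sum and the deepest level visited.
struct Result {
    long long sum;
    int max_depth;
};

// Iterative replacement for a recursive "split the range and recurse" sum.
// The split is deliberately uneven (one element, then the rest), so the
// logical recursion depth equals the number of elements.
Result sum_iterative(const std::vector<int>& data) {
    Result r{0, 0};
    std::vector<Task> stack;  // explicit stack replaces the call stack
    if (!data.empty()) {
        stack.push_back({0, static_cast<int>(data.size()), 1});
    }
    while (!stack.empty()) {
        Task t = stack.back();
        stack.pop_back();
        assert(t.lo < t.hi);  // every queued range is non-empty
        if (t.depth > r.max_depth) {
            r.max_depth = t.depth;
        }
        if (t.hi - t.lo == 1) {
            r.sum += data[t.lo];  // base case: a single element
            continue;
        }
        // "Recursive" step: queue the two sub-ranges instead of calling.
        stack.push_back({t.lo, t.lo + 1, t.depth + 1});
        stack.push_back({t.lo + 1, t.hi, t.depth + 1});
    }
    return r;
}
```

With 40 input elements the loop reaches a logical depth of 40, well past 24, yet nothing is ever nested. In a CUDA setting, each pass of the loop could stand in for an ordinary host-side launch into a single stream; that mapping is an assumption of this sketch rather than something stated in the thread.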

You could possibly also explore the new launch methods in CDP2 (fire-and-forget and tail launch), but neither of these conforms exactly to a recursion paradigm, in my view. You may also want to explore CUDA graphs, but again, that is not an exact recursive paradigm.

Thanks for the reply.