Hi!
In my C++ app (running on Orin AGX/CUDA 12.6) I have a CUDA graph that is captured and created on a dedicated stream/thread, and it works absolutely fine. I want to make the capture process bulletproof, so I created a stress test that spawns other, unrelated CUDA work (kernel launches and memory operations) on separate threads and streams.
Here’s where the problems start: if there’s a kernel launch (even of an empty kernel) on Stream B/Thread B, it invalidates my graph capture on Stream A/Thread A, even though the two seem completely unrelated. The error I’m getting is “operation not permitted when stream is capturing”. Async memory operations on B seem to work just fine without invalidating A.
- Why is a kernel launch on a separate stream B affecting the capture on stream A? I cannot find the answer in the documentation. Section “3.2.8.7.3.2. Prohibited and Unhandled Operations” talks about the default stream and synchronous APIs, but I’m not using either. I also tried the different capture modes, with the same result.
- Is there a way to ensure an “isolated” capture in a complex multithreaded CUDA environment, where a graph capture happens rarely but can happen at any time? I would like to avoid synchronizing every CUDA call for the duration of the capture. Some of the libraries I use contain CUDA code that I don’t fully control.
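For reference, here is a minimal sketch of the kind of setup described above. It is not the actual application code: `emptyKernel`, the iteration count, and using the main thread as Thread A are assumptions made for illustration, and the result may differ between platforms.

```cuda
#include <cuda_runtime.h>

#include <atomic>
#include <cstdio>
#include <thread>

// Hypothetical empty kernel standing in for the "unrelated" work.
__global__ void emptyKernel() {}

int main() {
    cudaStream_t streamA, streamB;
    cudaStreamCreateWithFlags(&streamA, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&streamB, cudaStreamNonBlocking);

    // Warm-up launch so that first-use initialization happens before any capture.
    emptyKernel<<<1, 1, 0, streamB>>>();
    cudaDeviceSynchronize();

    std::atomic<bool> stop{false};

    // Thread B: keeps launching kernels on its own non-blocking stream.
    std::thread threadB([&] {
        while (!stop.load()) {
            emptyKernel<<<1, 1, 0, streamB>>>();
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess) {
                std::printf("B launch: %s\n", cudaGetErrorString(err));
            }
        }
        cudaStreamSynchronize(streamB);
    });

    // Thread A (the main thread here): repeatedly captures a tiny graph.
    const int kIterations = 100;  // arbitrary value for the sketch
    int failures = 0;
    for (int i = 0; i < kIterations; ++i) {
        cudaGraph_t graph = nullptr;
        cudaError_t err =
            cudaStreamBeginCapture(streamA, cudaStreamCaptureModeThreadLocal);
        if (err == cudaSuccess) {
            emptyKernel<<<1, 1, 0, streamA>>>();
            err = cudaStreamEndCapture(streamA, &graph);
        }
        if (err != cudaSuccess) {
            ++failures;
            std::printf("capture %d: %s\n", i, cudaGetErrorString(err));
            cudaGetLastError();  // reset the error state before the next try
        }
        if (graph != nullptr) {
            cudaGraphDestroy(graph);
        }
    }

    stop = true;
    threadB.join();
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);

    std::printf("failed captures: %d of %d\n", failures, kIterations);
    return 0;
}
```

Built with something like `nvcc -std=c++17 repro.cu -o repro`, this prints how many of the captures on stream A failed while Thread B was launching kernels.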
Update:
The same code base on CUDA 12.6/RTX4070/Ubuntu22.04 does not seem to have this problem at all.
It looks Jetson-specific, with the “cudaStreamCaptureModeThreadLocal” flag not working there.
Any hints or suggestions would be much appreciated!
Thanks