In my C++ app (running on Orin AGX/CUDA 12.6) I have a CUDA graph that is captured and created on a dedicated stream/thread, and it works absolutely fine. I want to make the capture process bulletproof, so I created a stress test that spawns other, unrelated CUDA work (kernel launches and memory operations), also on separate threads and streams.
Here’s where the problems start: if there’s a kernel launch (even an empty one) on Stream B/Thread B, it invalidates my graph capture on Stream A/Thread A, even though the two seem completely unrelated. The error I’m getting is “operation not permitted when stream is capturing”. Async memory operations on B seem to work just fine without invalidating A.
Why is a kernel launch on a separate stream B affecting the capture on stream A? I cannot find the answer in the documentation. Section “3.2.8.7.3.2. Prohibited and Unhandled Operations” talks about the default stream and synchronous APIs, but I’m not using those. I also tried the different capture modes, with the same result.
Is there a way, in a complex multithreaded CUDA environment where a graph capture happens rarely but can happen at any time, to ensure an “isolated” capture? I would like to avoid synchronizing every CUDA call for the duration of the capture. Some of the libraries involved contain CUDA code that I don’t fully control.
Update:
The same code base on CUDA 12.6/RTX4070/Ubuntu22.04 does not seem to have this problem at all.
It looks Jetson-specific, as if the “cudaStreamCaptureModeThreadLocal” flag were not working there.
Any hints or suggestions would be much appreciated!
Thanks
Thanks for the reply, Robert! I think I’ve read all the graph-related posts here.
My graph doesn’t have any dependencies on other streams. It’s captured in a separate thread with the “cudaStreamCaptureModeThreadLocal” flag. The goal is to let other threads do their independent CUDA work without interfering with the capture that is happening at the same time in the thread/stream dedicated to it.
I seem to have found a workaround (on Orin) for this problem: calling the following on every other thread:
cudaStreamCaptureMode mode = cudaStreamCaptureModeThreadLocal;
cudaThreadExchangeStreamCaptureMode(&mode);
Not fully tested, but it looks like it does the trick on Orin (JetPack 6).
Somehow it’s not required on my laptop with Ubuntu/RTX4070, with the same CUDA version and code base…
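For anyone who wants to try the same workaround, here is a minimal sketch of how it could be wrapped so that each non-capturing thread switches its mode on entry and restores the previous one on exit. The class name `ThreadLocalCaptureModeGuard`, the `emptyKernel` kernel and the `workerThread` function are hypothetical names made up for illustration; only `cudaThreadExchangeStreamCaptureMode` and the other runtime calls come from CUDA itself. It is untested on Orin and is not a confirmed fix.

```cuda
#include <cuda_runtime.h>
#include <thread>

// Hypothetical helper (not part of CUDA): puts the calling thread into
// thread-local capture interaction mode and restores the old mode on scope exit.
class ThreadLocalCaptureModeGuard {
public:
    ThreadLocalCaptureModeGuard() {
        cudaStreamCaptureMode mode = cudaStreamCaptureModeThreadLocal;
        // The call sets the thread's mode to *mode and writes the previous
        // mode back into `mode`.
        ok_ = (cudaThreadExchangeStreamCaptureMode(&mode) == cudaSuccess);
        previous_ = mode;
    }
    ~ThreadLocalCaptureModeGuard() {
        if (ok_) {
            cudaStreamCaptureMode mode = previous_;
            cudaThreadExchangeStreamCaptureMode(&mode);
        }
    }
    ThreadLocalCaptureModeGuard(const ThreadLocalCaptureModeGuard&) = delete;
    ThreadLocalCaptureModeGuard& operator=(const ThreadLocalCaptureModeGuard&) = delete;

private:
    cudaStreamCaptureMode previous_ = cudaStreamCaptureModeGlobal;
    bool ok_ = false;
};

// Stand-in for the unrelated work done by the stress-test threads.
__global__ void emptyKernel() {}

// Stand-in for "every other thread", i.e. any thread that is not capturing.
void workerThread() {
    ThreadLocalCaptureModeGuard guard;  // applies to this thread only

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    emptyKernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

int main() {
    std::thread worker(workerThread);
    worker.join();
    return 0;
}
```

The mode is a per-thread setting, so the guard has to be created inside each worker thread, not once in `main`. For threads owned by third-party libraries this is only possible if you control the thread entry point.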
There is the thread that is doing the capture, and there is “every other thread”.
I think what is meant is all threads except the one doing the stream capture.
The important bit, which is mentioned in the documentation but is not obvious:
cudaStreamCaptureModeGlobal: This is the default mode. …, or if any other thread has a concurrent capture sequence initiated with cudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.
Therefore, if one thread captures on stream A with cudaStreamCaptureModeGlobal and another thread working on stream B attempts an unsafe operation (cudaMalloc, cudaStreamSynchronize), that second thread will get the error above.
Unfortunately, the docs don’t say much about which API calls count as unsafe or what the downsides of cudaStreamCaptureModeThreadLocal are. It also feels like cudaStreamCaptureModeGlobal is too restrictive to be useful.
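To make the capture side concrete, here is a rough sketch of a capture that passes cudaStreamCaptureModeThreadLocal to `cudaStreamBeginCapture` in place of the default global mode. The kernel `scale`, the `CHECK` macro and the buffer size are assumptions made up for the example, and the three-argument `cudaGraphInstantiate` is the CUDA 12 signature. Judging by the Orin report earlier in this thread, this alone may not isolate the capture on every platform.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical error-check macro for this sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for the real captured work.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float* d_data = nullptr;

    // Allocate before the capture starts; cudaMalloc is the kind of call
    // the documentation describes as potentially unsafe during capture.
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    // Thread-local mode: the restrictions on unsafe calls are meant to apply
    // to this thread only, not to other threads.
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal));
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 2.0f, n);
    CHECK(cudaStreamEndCapture(stream, &graph));

    // CUDA 12 signature: (exec, graph, flags).
    CHECK(cudaGraphInstantiate(&graphExec, graph, 0));
    CHECK(cudaGraphLaunch(graphExec, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaGraphExecDestroy(graphExec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    return 0;
}
```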