CUDA Graph capture - work on separate streams invalidates graph capture

Hi!

In my C++ app (running on Orin AGX/CUDA 12.6) I have a CUDA graph that is captured and created on a dedicated stream/thread, and it works absolutely fine. I want to make the capture process bulletproof, so I created a stress test that spawns other, unrelated CUDA work (kernel launches and memory operations), also on separate threads and streams.
Here’s where the problems start: if there’s a kernel launch (even of an empty kernel) on Stream B/Thread B, it invalidates my graph capture on Stream A/Thread A, even though the two seem completely unrelated. The error I’m getting is “operation not permitted when stream is capturing”. Async memory operations on B seem to work just fine without invalidating A.

  • Why does a kernel launch on a separate stream B affect the capture on stream A? I cannot find the answer in the documentation. “3.2.8.7.3.2. Prohibited and Unhandled Operations” discusses the default stream and synchronous APIs, but I’m not using either. I also tried different capture modes, with the same result.
  • In a complex multithreaded CUDA environment where a graph capture happens rarely but can happen at any time, is there a way to ensure an “isolated” capture? I would like to avoid synchronizing every CUDA call for the duration of the capture. Some of the libraries involved contain CUDA code that I don’t fully control.

Update:
The same code base on CUDA 12.6/RTX4070/Ubuntu22.04 does not seem to have this problem at all.
This looks Jetson-specific, with the “cudaStreamCaptureModeThreadLocal” flag not working.

Any hints or suggestions would be much appreciated!
Thanks

this may be of interest

Thanks for the reply, Robert! I think I’ve read all the graph-related posts here.

My graph doesn’t have any dependencies on other streams. It’s captured in a separate thread with the “cudaStreamCaptureModeThreadLocal” flag. The goal is to let other threads do their independent CUDA work without interfering with the capture that is running at the same time in the thread/stream I dedicate to this purpose.
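For anyone who wants to try this, the setup described above can be sketched roughly as follows. This is only a minimal sketch, not my actual application code: `emptyKernel`, `StressResult` and `runStressOnce` are hypothetical names made up for illustration, the 50 ms capture window is arbitrary, and the warm-up launch is an assumption added to keep first-launch initialization out of the picture. Whether the capture actually gets invalidated depends on the platform.

```cuda
#include <cuda_runtime.h>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for "unrelated CUDA work".
__global__ void emptyKernel() {}

struct StressResult {
    cudaError_t captureStatus;  // status of begin/end capture on thread A
    cudaError_t workerStatus;   // first launch error seen on thread B, if any
};

// Thread A captures on stream A while thread B keeps launching on stream B.
StressResult runStressOnce(cudaStreamCaptureMode mode) {
    StressResult result{cudaSuccess, cudaSuccess};
    cudaStream_t streamA = nullptr, streamB = nullptr;

    // Create both streams up front so no resource creation overlaps the capture.
    result.captureStatus = cudaStreamCreateWithFlags(&streamA, cudaStreamNonBlocking);
    if (result.captureStatus != cudaSuccess) return result;
    result.workerStatus = cudaStreamCreateWithFlags(&streamB, cudaStreamNonBlocking);
    if (result.workerStatus != cudaSuccess) {
        cudaStreamDestroy(streamA);
        return result;
    }

    // Assumption: warm up the kernel once before any capture starts.
    emptyKernel<<<1, 1, 0, streamB>>>();
    cudaStreamSynchronize(streamB);

    std::atomic<bool> stop{false};

    // Thread B: unrelated launches on its own stream, remembering the first error.
    std::thread threadB([&] {
        while (!stop.load()) {
            emptyKernel<<<1, 1, 0, streamB>>>();
            cudaError_t e = cudaGetLastError();
            if (e != cudaSuccess && result.workerStatus == cudaSuccess) {
                result.workerStatus = e;
            }
        }
    });

    // Thread A: capture a trivial graph and keep the capture window open briefly.
    std::thread threadA([&] {
        cudaError_t e = cudaStreamBeginCapture(streamA, mode);
        if (e == cudaSuccess) {
            emptyKernel<<<1, 1, 0, streamA>>>();
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            cudaGraph_t graph = nullptr;
            e = cudaStreamEndCapture(streamA, &graph);  // reports an invalidated capture
            if (graph != nullptr) cudaGraphDestroy(graph);
        }
        result.captureStatus = e;
    });

    threadA.join();
    stop.store(true);
    threadB.join();

    cudaStreamSynchronize(streamB);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    return result;
}
```

Calling `runStressOnce(cudaStreamCaptureModeThreadLocal)` in a loop and printing both status codes with `cudaGetErrorString` should show whether a given platform is affected.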

I seem to have found a workaround for this problem on Orin: call the following on every other thread:
cudaStreamCaptureMode mode = cudaStreamCaptureModeThreadLocal;
cudaThreadExchangeStreamCaptureMode(&mode);

It’s not fully tested, but it looks like it does the trick on Orin (JetPack 6).
Somehow it’s not required on my laptop (Ubuntu/RTX4070) with the same CUDA version and code base…
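To avoid repeating those two lines in every worker thread, they could be wrapped in a small RAII guard. This is only a sketch under assumptions: `ScopedCaptureMode` is a hypothetical helper name, and restoring the previous mode on scope exit is an addition of mine (the two lines above change the mode for the rest of the thread's lifetime). The only CUDA call used is `cudaThreadExchangeStreamCaptureMode`, which sets the calling thread's mode and writes the previous mode back through the pointer.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: switches the calling thread's stream capture
// interaction mode and restores the previous mode when it goes out of scope.
class ScopedCaptureMode {
public:
    explicit ScopedCaptureMode(cudaStreamCaptureMode mode) : previous_(mode) {
        // On success, previous_ is overwritten with the mode the thread had before.
        status_ = cudaThreadExchangeStreamCaptureMode(&previous_);
    }

    ~ScopedCaptureMode() {
        // Only restore if the initial exchange succeeded.
        if (status_ == cudaSuccess) {
            cudaThreadExchangeStreamCaptureMode(&previous_);
        }
    }

    ScopedCaptureMode(const ScopedCaptureMode&) = delete;
    ScopedCaptureMode& operator=(const ScopedCaptureMode&) = delete;

    cudaError_t status() const { return status_; }              // result of the exchange
    cudaStreamCaptureMode previous() const { return previous_; } // mode before the switch

private:
    cudaStreamCaptureMode previous_;
    cudaError_t status_;
};
```

Each non-capturing thread would then start with `ScopedCaptureMode guard(cudaStreamCaptureModeThreadLocal);` at the top of its thread function. For third-party libraries that spawn their own threads this does not help directly, since the mode is per thread and has to be set from inside each of them.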

Why on every other thread? Are those the only threads involved in the stream capture?

Does the procedure keep working on your laptop, when including those lines?

There is the thread that is doing the capture, and there is “every other thread”.
I think what is meant is all threads except the one doing the stream capture.