Deadlock when using cuStreamWaitValue32/cuStreamWriteValue32 for async cross-stream ordering

Hi,

I’m trying to implement a pattern where a function:

  • submits a set of GPU jobs asynchronously to an internal stream, and
  • guarantees that any work enqueued by the caller onto a user-provided stream after the function returns will not begin executing until all of the internal work has completed.

The essential requirement is the ability to establish this ordering asynchronously, without blocking the function’s caller, and without requiring all submissions to have completed before returning.

Because CUDA events cannot be meaningfully used in a “wait-before-record” configuration, I tried using cuStreamWaitValue32 / cuStreamWriteValue32 to create a cross-stream fence:

  • The user stream waits on a device flag,
  • The internal stream writes the flag after all its work is enqueued.

However, I am encountering deadlocks (at least on my system - Windows 11 Pro, NVIDIA GeForce RTX 2070, CUDA 13.0, nvcc: Cuda compilation tools, release 13.0, V13.0.88 Build cuda_13.0.r13.0/compiler.36424714_0).

The symptoms match the warning in the documentation for cuStreamWaitValue32:

“Improper use of this API may deadlock the application. Synchronization ordering established through this API is not visible to CUDA.”

To better isolate this issue, I created a minimal standalone reproducer that contains only:

  • two streams,
  • a driver-allocated device flag,
  • a cuStreamWaitValue32 on the user stream,
  • a worker thread that enqueues trivial kernels on the internal stream,
  • a cuStreamWriteValue32 at the tail of the internal stream,
  • and a final cudaStreamSynchronize on the user stream.

On certain systems, the program hangs reliably.


Minimal Reproducer

#include <iostream>
#include <thread>
#include <cassert>

#include <cuda_runtime.h>
#include <cuda.h>

__global__ void trivial_kernel(int *data, int value) {
  // Just touch one int so we have *some* kernel work
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    data[0] = value;
  }
}

int main() {
  cuInit(0);
  
  // Create streams
  cudaStream_t user_stream;
  cudaStream_t internal_stream;
  cudaStreamCreateWithFlags(&user_stream, cudaStreamNonBlocking);
  cudaStreamCreateWithFlags(&internal_stream, cudaStreamNonBlocking);

  // Simple device buffe for the kernel to touch
  int *d_data = nullptr;
  cudaMalloc(&d_data, sizeof(int));

  // Allocate the fence flag using the driver API
  CUdeviceptr d_flag = 0;
  cuMemAlloc(&d_flag, sizeof(uint32_t));

  // Initialise flag to 0
  cuMemsetD32(d_flag, 0u, 1);

  constexpr uint32_t done_value = 1u;

  // Add a wait to the user stream: it will stall until flag == done_value
  {
    CUstream cu_user = reinterpret_cast<CUstream>(user_stream);
    cuStreamWaitValue32(cu_user, d_flag, done_value, CU_STREAM_WAIT_VALUE_EQ);
  }

  std::cout << "Enqueued WaitValue on user stream" << std::endl;

  // Start a worker thread that enqueues work onto the internal stream
  std::thread worker{[internal_stream, d_data, d_flag, done_value]() {
    CUstream cu_internal = reinterpret_cast<CUstream>(internal_stream);

    std::cout << "[worker] Launching trivial kernels on internal stream...\n";

    constexpr int NUM_JOBS = 5;

    for (int i = 0; i < NUM_JOBS; ++i) {
      trivial_kernel<<<1, 1, 0, internal_stream>>>(d_data, i);
      cudaError_t err = cudaGetLastError();
      if (err != cudaSuccess) {
        std::cerr << "[worker] Kernel launch failed: "
                  << cudaGetErrorString(err) << '\n';
        return;
      }
    }

    std::cout << "[worker] Enqueuing WriteValue on internal_stream\n";

    cuStreamWriteValue32(cu_internal, d_flag, done_value, 0);

    std::cout << "[worker] Done enqueuing work on internal_stream\n";
  }};

  // Enqueue some work on the user stream *after* the wait.
  // This work should only run once the flag is set to done_value;
  trivial_kernel<<<1, 1, 0, user_stream>>>(d_data, 42);
  std::cout << "Enqueued trivial kernel on user_stream after WaitValue\n";

  // This is where things may hang:
  std::cout << "Synchronising on user_stream (may hang)...\n";
  cudaError_t sync_err = cudaStreamSynchronize(user_stream);
  std::cout << "cudaStreamSynchronize(user_stream) returned: "
            << cudaGetErrorString(sync_err) << '\n';

  worker.join();

  // Clean up
  cudaFree(d_data);
  cuMemFree(d_flag);
  cudaStreamDestroy(user_stream);
  cudaStreamDestroy(internal_stream);

  return 0;
}

Build (Windows/MSVC):

nvcc -std=c++17 wait_write_deadlock_repro.cu -o repro.exe -link cuda.lib

On certain system configurations, the program never reaches the final print and the worker thread appears to block during the first kernel launch.


Questions

  1. Is this deadlock expected behaviour when using cuStreamWaitValue32 / cuStreamWriteValue32 for cross-stream ordering?
  2. If so, why does it occur?
    Is the scheduler free to serialise streams in a way that makes the wait condition unsatisfiable because the dependency is not visible to CUDA?
  3. Is there any recommended way to express the following requirement using CUDA primitives?
    A function submits internal GPU work asynchronously, returns immediately, and must ensure that any GPU operations enqueued later on a user stream will not execute until all of the internal work has completed (but without blocking the caller or requiring host-side waits for submissions to finish).

Any clarification on whether this pattern is intended to work with wait/write-value fences would be very helpful.

Thanks in advance,
Chris

When I run your code on linux (CUDA 13, L4 GPU), I immediately get a hang earlier than what you indicate:

# ./t435
Enqueued WaitValue on user stream
[worker] Launching trivial kernels on internal stream...
^C

This looks to me like a lazy loading issue. Lazy loading requires synchronization, and (I suspect) the early issuance of the cuStreamWaitValue32 is interfering with that, at least in my case.

If I address that by adding a kernel invocation like so:

  ...
  constexpr uint32_t done_value = 1u;
  trivial_kernel<<<1, 1, 0, internal_stream>>>(d_data, 0);  // add
  cudaDeviceSynchronize();  // add

  // Add a wait to the user stream: it will stall until flag == done_value
  ...

then things seem to run to completion with no hangs:

# ./t435
Enqueued WaitValue on user stream
Enqueued trivial kernel on user_stream after WaitValue
Synchronising on user_stream (may hang)...
[worker] Launching trivial kernels on internal stream...
[worker] Enqueuing WriteValue on internal_stream
[worker] Done enqueuing work on internal_stream
cudaStreamSynchronize(user_stream) returned: no error
#

I don’t know how much of this maps to your case on windows. Could you try adding that kernel call to get the pipe primed?

Adding to Robert_Crovella’s answer.

I also found it not working on my Linux machine, and fixed it by disabling lazy loading via environment variable CUDA_MODULE_LOADING=EAGER ./repro .

Very interesting, thanks.

Adding just the kernel launch before the wait, without calling cudaDeviceSynchronize, removes the deadlock. Why would that be?

It seems to be a lazy loading artifact. It’s documented in the programming guide here. You can also find various discussions on it here on this forum.

In a nutshell, lazy loading impacts each specific kernel, the first time you launch that specific kernel. It requires a synchronization (cessation of all device activity) in order to load the kernel code, the first time you invoke each specific kernel. “persistent” device activity prevents the synchronization from taking place. I am guessing that cuStreamWaitValue32 is providing that kind of “persistent” device activity.

I don’t think that is how you set an env var on windows. So its not surprising to me that it doesn’t work. You can research on the web how to set env vars on windows.

Yes, setting CUDA_MODULE_LOADING=EAGER worked for me, too.

Makes sense, thanks. I suppose in a complex application where trivial_kernel is actually a bunch of kernels I can’t easily launch, setting CUDA_MODULE_LOADING=EAGER (this is what worked for me on Windows) might be the way to go.

If you don’t want the requirement to have an env var set, there are other programmatic remediation methods covered in the programming guide.

Excellent, thank you both.