Hi,
I’m trying to implement a pattern where a function:
- submits a set of GPU jobs asynchronously to an internal stream, and
- guarantees that any work enqueued by the caller onto a user-provided stream after the function returns will not begin executing until all of the internal work has completed.
The essential requirement is the ability to establish this ordering asynchronously, without blocking the function’s caller, and without requiring all submissions to have completed before returning.
Because CUDA events cannot be meaningfully used in a “wait-before-record” configuration, I tried using cuStreamWaitValue32 / cuStreamWriteValue32 to create a cross-stream fence:
- The user stream waits on a device flag,
- The internal stream writes the flag after all its work is enqueued.
However, I am encountering deadlocks (at least on my system - Windows 11 Pro, NVIDIA GeForce RTX 2070, CUDA 13.0, nvcc: Cuda compilation tools, release 13.0, V13.0.88 Build cuda_13.0.r13.0/compiler.36424714_0).
The symptoms match the warning in the documentation for cuStreamWaitValue32:
“Improper use of this API may deadlock the application. Synchronization ordering established through this API is not visible to CUDA.”
To better isolate this issue, I created a minimal standalone reproducer that contains only:
- two streams,
- a driver-allocated device flag,
- a
cuStreamWaitValue32on the user stream, - a worker thread that enqueues trivial kernels on the internal stream,
- a
cuStreamWriteValue32at the tail of the internal stream, - and a final
cudaStreamSynchronizeon the user stream.
On certain systems, the program hangs reliably.
Minimal Reproducer
#include <iostream>
#include <thread>
#include <cassert>
#include <cuda_runtime.h>
#include <cuda.h>
__global__ void trivial_kernel(int *data, int value) {
// Just touch one int so we have *some* kernel work
if (threadIdx.x == 0 && blockIdx.x == 0) {
data[0] = value;
}
}
int main() {
cuInit(0);
// Create streams
cudaStream_t user_stream;
cudaStream_t internal_stream;
cudaStreamCreateWithFlags(&user_stream, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&internal_stream, cudaStreamNonBlocking);
// Simple device buffe for the kernel to touch
int *d_data = nullptr;
cudaMalloc(&d_data, sizeof(int));
// Allocate the fence flag using the driver API
CUdeviceptr d_flag = 0;
cuMemAlloc(&d_flag, sizeof(uint32_t));
// Initialise flag to 0
cuMemsetD32(d_flag, 0u, 1);
constexpr uint32_t done_value = 1u;
// Add a wait to the user stream: it will stall until flag == done_value
{
CUstream cu_user = reinterpret_cast<CUstream>(user_stream);
cuStreamWaitValue32(cu_user, d_flag, done_value, CU_STREAM_WAIT_VALUE_EQ);
}
std::cout << "Enqueued WaitValue on user stream" << std::endl;
// Start a worker thread that enqueues work onto the internal stream
std::thread worker{[internal_stream, d_data, d_flag, done_value]() {
CUstream cu_internal = reinterpret_cast<CUstream>(internal_stream);
std::cout << "[worker] Launching trivial kernels on internal stream...\n";
constexpr int NUM_JOBS = 5;
for (int i = 0; i < NUM_JOBS; ++i) {
trivial_kernel<<<1, 1, 0, internal_stream>>>(d_data, i);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
std::cerr << "[worker] Kernel launch failed: "
<< cudaGetErrorString(err) << '\n';
return;
}
}
std::cout << "[worker] Enqueuing WriteValue on internal_stream\n";
cuStreamWriteValue32(cu_internal, d_flag, done_value, 0);
std::cout << "[worker] Done enqueuing work on internal_stream\n";
}};
// Enqueue some work on the user stream *after* the wait.
// This work should only run once the flag is set to done_value;
trivial_kernel<<<1, 1, 0, user_stream>>>(d_data, 42);
std::cout << "Enqueued trivial kernel on user_stream after WaitValue\n";
// This is where things may hang:
std::cout << "Synchronising on user_stream (may hang)...\n";
cudaError_t sync_err = cudaStreamSynchronize(user_stream);
std::cout << "cudaStreamSynchronize(user_stream) returned: "
<< cudaGetErrorString(sync_err) << '\n';
worker.join();
// Clean up
cudaFree(d_data);
cuMemFree(d_flag);
cudaStreamDestroy(user_stream);
cudaStreamDestroy(internal_stream);
return 0;
}
Build (Windows/MSVC):
nvcc -std=c++17 wait_write_deadlock_repro.cu -o repro.exe -link cuda.lib
On certain system configurations, the program never reaches the final print and the worker thread appears to block during the first kernel launch.
Questions
- Is this deadlock expected behaviour when using
cuStreamWaitValue32/cuStreamWriteValue32for cross-stream ordering? - If so, why does it occur?
Is the scheduler free to serialise streams in a way that makes the wait condition unsatisfiable because the dependency is not visible to CUDA? - Is there any recommended way to express the following requirement using CUDA primitives?
A function submits internal GPU work asynchronously, returns immediately, and must ensure that any GPU operations enqueued later on a user stream will not execute until all of the internal work has completed (but without blocking the caller or requiring host-side waits for submissions to finish).
Any clarification on whether this pattern is intended to work with wait/write-value fences would be very helpful.
Thanks in advance,
Chris