system 1:
  GPU: RTX A4500 x 4
  driver: 525.60.13
  cudart API: 11.4
  kernel: Linux 6.1
  container: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
system 2:
  GPU: RTX 4070
  driver: 550.78
  cudart API: 11.4
  kernel: Linux 6.6.32
  container: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
In our tests, we use CUDA_VISIBLE_DEVICES to expose only a single device. We do not use MPS.
The error appears to be an "Xid 31" (GPU memory page fault).
Our workload is a small network of TensorRT ExecutionContexts [0]. We've captured this network to a graph via the stream capture API. Pinned memory (via cudaMallocHost) is used for the IO buffers of each ExecutionContext. The host does not read or write to these buffers from the point where we enqueue the ExecutionContexts to the point where we've synchronized on the main stream.
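To make the buffer setup concrete, the allocation and binding looks roughly like the following. The tensor names and sizes here are hypothetical placeholders, not the real ones:

// Pinned host memory for the IO buffers of one ExecutionContext (here: a).
// "input"/"output" and the sizes are hypothetical placeholders.
void* inputBuf = nullptr;
void* outputBuf = nullptr;
cudaMallocHost(&inputBuf, inputSizeBytes);
cudaMallocHost(&outputBuf, outputSizeBytes);
a.setTensorAddress("input", inputBuf);
a.setTensorAddress("output", outputBuf);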
We have a minimal program which loops over this operation (writes random inputs, enqueues, synchronizes, reads outputs). If we run two or more instances of this program in parallel on the same GPU, we can consistently reproduce this issue after a seemingly random number of loop iterations. We have not managed to reproduce this issue when only a single instance is running. We have also not been able to reproduce the issue when we run a single instance per GPU (so up to four processes). Once the error occurs, subsequent CUDA API calls, like cudaFree, don’t return successfully.
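For reference, the capture-and-replay loop of the minimal program looks roughly like this. Error checking is omitted; enqueueNetwork(), fillInputsWithRandomData(), and readOutputs() are placeholders, the first standing for the enqueue/record/wait sequence shown further down:

cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Capture the whole enqueue/event sequence into a graph once, then replay it.
cudaStreamBeginCapture(stream0, cudaStreamCaptureModeGlobal);
enqueueNetwork();  // all side streams fork from and join back into stream0
cudaStreamEndCapture(stream0, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 11.x signature

for (;;) {
    fillInputsWithRandomData();     // host writes the pinned input buffers
    cudaGraphLaunch(graphExec, stream0);
    cudaStreamSynchronize(stream0); // the 700 error surfaces here after some iterations
    readOutputs();                  // host reads the pinned output buffers
}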
compute-sanitizer does not report any errors for memcheck, initcheck, synccheck, or racecheck. With CUDA_LAUNCH_BLOCKING=1, the error still only appears at synchronization, so we have no additional information to go on. What's interesting is that the 700 error (cudaErrorIllegalAddress) is not encountered if we don't capture our network into a graph.
[0]: What we mean by "network" is that we enqueue a number of ExecutionContexts across a few streams and declare dependency relations with cudaEventRecord and cudaStreamWaitEvent.
Concretely, to run a network we will execute something like the following:
// a runs first on the main stream (stream0)
a.enqueueV3(stream0);

// fork: stream1 (for g) and stream2 (for c) wait until a has completed
cudaEventRecord(event0, stream0);
cudaStreamWaitEvent(stream1, event0, cudaEventWaitDefault);
cudaEventRecord(event1, stream0);
cudaStreamWaitEvent(stream2, event1, cudaEventWaitDefault);

// b and c run concurrently
b.enqueueV3(stream0);
c.enqueueV3(stream2);

// join c back into stream0, then fork stream3 (for e)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream0, event2, cudaEventWaitDefault);
cudaEventRecord(event3, stream0);
cudaStreamWaitEvent(stream3, event3, cudaEventWaitDefault);

// d and e run concurrently
d.enqueueV3(stream0);
e.enqueueV3(stream3);

// join e back into stream0 before f
cudaEventRecord(event4, stream3);
cudaStreamWaitEvent(stream0, event4, cudaEventWaitDefault);

// f runs on stream0; g only depends on a, even though it is enqueued last
f.enqueueV3(stream0);
g.enqueueV3(stream1);

// join g back into stream0 (event2 is reused here)
cudaEventRecord(event2, stream1);
cudaStreamWaitEvent(stream0, event2, cudaEventWaitDefault);
which maps to the following flow:
│
a
│
├─────────┐
│ │
├────┐ │
│ │ │
b c │
│ │ │
│◄───┘ │
│ │
├────┐ g
│ │ │
d e │
│ │ │
│◄───┘ │
│ │
f │
│ │
│◄────────┘
│
▼
ExecutionContexts serialized later may use the output buffers of earlier ExecutionContexts as their input buffers. For example, given the above network, the following dependency relations are allowed:
f: abcde
d: abc
e: abc
b: a
c: a
g: a
so f may use any output buffer from a, b, c, d, or e as an input buffer.
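For instance, f reusing a's output amounts to binding the same pointer twice (tensor names here are hypothetical):

// a's output buffer doubles as one of f's input buffers (hypothetical names).
a.setTensorAddress("a_output", sharedBuf);
f.setTensorAddress("f_input", sharedBuf);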
Similarly, ExecutionContexts serialized later will reuse the workspace memory of those from before. This means that the total workspace memory allocated for the above network is:
max(a, max(b+c, d+e, f) + g)
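As a hypothetical example, if the per-context workspace sizes were a = 64 MiB, b = c = 32 MiB, d = 16 MiB, e = 48 MiB, f = 40 MiB, and g = 24 MiB, the peak would be max(64, max(32+32, 16+48, 40) + 24) = 88 MiB.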