cudaErrorIllegalAddress (700) from cudaStreamSynchronize only when multiple processes use the same GPU

system 1:
GPU: RTX A4500 x 4
driver: 525.60.13
cudart API: 11.4
kernel: Linux 6.1
container: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

system 2:
GPU: RTX 4070
driver: 550.78
cudart API: 11.4
kernel: Linux 6.6.32
container: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04

In our tests, we use CUDA_VISIBLE_DEVICES to expose only a single device. We do not use MPS.

The error is reported as an “Xid 31” in the kernel log.

Our workload is a small network of TensorRT ExecutionContexts [0]. We’ve captured this network to a graph via the stream capture API. Pinned memory (via cudaMallocHost) is used for the IO buffers of each ExecutionContext. The host does not read from or write to these buffers between the point where we enqueue the ExecutionContexts and the point where we have synchronized on the main stream.
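
For reference, the capture step looks roughly like the following. This is only a sketch, not our actual code: graph, graphExec, ioBuffer, and ioBufferBytes are placeholder names, streams are assumed to have been created earlier with cudaStreamCreate, and error checking is omitted.

// Minimal sketch of the capture step (identifiers are illustrative).
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Pinned IO buffers for each ExecutionContext (size is a placeholder).
void* ioBuffer = nullptr;
cudaMallocHost(&ioBuffer, ioBufferBytes);

// Capture the whole network into one graph, then instantiate it once.
cudaStreamBeginCapture(stream0, cudaStreamCaptureModeGlobal);
// ... the enqueueV3() calls and cudaEventRecord/cudaStreamWaitEvent wiring
//     shown further below ...
cudaStreamEndCapture(stream0, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);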

We have a minimal program which loops over this operation (writes random inputs, enqueues, synchronizes, reads outputs). If we run two or more instances of this program in parallel on the same GPU, we can consistently reproduce this issue after a seemingly random number of loop iterations. We have not managed to reproduce this issue when only a single instance is running. We have also not been able to reproduce the issue when we run a single instance per GPU (so up to four processes). Once the error occurs, subsequent CUDA API calls, like cudaFree, don’t return successfully.
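
The inner loop of that minimal program looks roughly like this (again only a sketch; fillRandomInputs and readOutputs stand in for our own helpers, and in the non-graphed variant the enqueueV3/event sequence is issued directly instead of cudaGraphLaunch):

for (int i = 0; ; ++i) {
    fillRandomInputs(inputBuffers);        // host writes the pinned input buffers
    cudaGraphLaunch(graphExec, stream0);   // launch the captured network
    cudaError_t err = cudaStreamSynchronize(stream0);
    if (err != cudaSuccess) {              // this is where error 700 shows up
        fprintf(stderr, "iteration %d: %s\n", i, cudaGetErrorString(err));
        break;
    }
    readOutputs(outputBuffers);            // host reads the pinned output buffers
}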

compute-sanitizer does not report any errors for memcheck, initcheck, synccheck, or racecheck. With CUDA_LAUNCH_BLOCKING=1 the error still appears at the synchronization point, so that gives us no additional information to go on. What is interesting is that the 700 error is not encountered if we do not capture our network into a graph.

[0]: What we mean by “network” is that we enqueue a number of ExecutionContexts across a few streams and declare dependency relations with cudaEventRecord and cudaStreamWaitEvent.

Concretely, to run a network we will execute something like the following:

a.enqueueV3(stream0);
cudaEventRecord(event0, stream0);                            // fork stream1 off stream0 (for g)
cudaStreamWaitEvent(stream1, event0, cudaEventWaitDefault);
cudaEventRecord(event1, stream0);                            // fork stream2 off stream0 (for c)
cudaStreamWaitEvent(stream2, event1, cudaEventWaitDefault);
b.enqueueV3(stream0);                                        // b and c run concurrently
c.enqueueV3(stream2);
cudaEventRecord(event2, stream2);                            // join stream2 back into stream0
cudaStreamWaitEvent(stream0, event2, cudaEventWaitDefault);
cudaEventRecord(event3, stream0);                            // fork stream3 off stream0 (for e)
cudaStreamWaitEvent(stream3, event3, cudaEventWaitDefault);
d.enqueueV3(stream0);                                        // d and e run concurrently
e.enqueueV3(stream3);
cudaEventRecord(event4, stream3);                            // join stream3 back into stream0
cudaStreamWaitEvent(stream0, event4, cudaEventWaitDefault);
f.enqueueV3(stream0);
g.enqueueV3(stream1);                                        // g runs concurrently with b..f
cudaEventRecord(event2, stream1);                            // join stream1 back into stream0 (event2 is re-recorded here)
cudaStreamWaitEvent(stream0, event2, cudaEventWaitDefault);

which maps to the following flow:

│          
a          
│          
├─────────┐
│         │
├────┐    │
│    │    │
b    c    │
│    │    │
│◄───┘    │
│         │
├────┐    g
│    │    │
d    e    │
│    │    │
│◄───┘    │
│         │
f         │
│         │
│◄────────┘
│          
▼          

ExecutionContexts serialized later may use earlier ExecutionContexts’ output buffers as their input buffers. For example, given the above network, the following dependency relations are allowed:

f: abcde
d: abc
e: abc
b: a
c: a
g: a

so f may use any output buffer from a, b, c, d, or e as an input buffer.

Similarly, ExecutionContexts serialized later reuse the workspace memory of those serialized before them. This means that the total workspace memory allocated for the above network is:

max(a, max(b+c, d+e, f) + g)
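
As a purely hypothetical illustration (these sizes are made up, not our real numbers): with workspace requirements a = 64 MiB, b = c = 16 MiB, d = 32 MiB, e = 8 MiB, f = 48 MiB, and g = 24 MiB, the pool size would be

max(64, max(16+16, 32+8, 48) + 24) = max(64, 48 + 24) = 72 MiB

since g can run concurrently with any of b through f, while a always runs alone.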

More things I’ve tried:

Graphing each ExecutionContext individually instead of capturing them all into one graph results in the same error. If I synchronize after launching each ExecutionContext graph, the error does not appear; it only appears if we launch all of the ExecutionContext graphs and then synchronize.
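
For clarity, the failing and passing variants of that experiment look roughly like this (a sketch; graphExecB, graphExecC, ... stand for the individually captured per-ExecutionContext graphs):

// Fails with 700 after some number of iterations: launch everything, then synchronize once.
cudaGraphLaunch(graphExecB, stream0);
cudaGraphLaunch(graphExecC, stream2);
// ... the remaining launches and the same event wiring as before ...
cudaStreamSynchronize(stream0);

// Does not fail: synchronize after each per-ExecutionContext graph launch.
cudaGraphLaunch(graphExecB, stream0);
cudaStreamSynchronize(stream0);
cudaGraphLaunch(graphExecC, stream2);
cudaStreamSynchronize(stream2);
// ...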

I’ve reduced my network to the following and see the same issue.

│         
├────┐    
│    │    
b    c   
│    │    
│◄───┘    
│         
├────┐    
│    │    
d    e    
│    │    
│◄───┘    
│          
▼
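
In terms of API calls, the reduced network is wired the same way as the full one, just with fewer contexts. A sketch, using the same placeholder names as above:

cudaEventRecord(event0, stream0);                            // fork stream1 off stream0
cudaStreamWaitEvent(stream1, event0, cudaEventWaitDefault);
b.enqueueV3(stream0);
c.enqueueV3(stream1);                                        // b and c run concurrently
cudaEventRecord(event1, stream1);                            // join stream1 back into stream0
cudaStreamWaitEvent(stream0, event1, cudaEventWaitDefault);
cudaEventRecord(event2, stream0);                            // fork stream1 off again
cudaStreamWaitEvent(stream1, event2, cudaEventWaitDefault);
d.enqueueV3(stream0);
e.enqueueV3(stream1);                                        // d and e run concurrently
cudaEventRecord(event3, stream1);                            // join stream1 back into stream0
cudaStreamWaitEvent(stream0, event3, cudaEventWaitDefault);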

However, when I rearrange it a bit to the following:

│         
├────┐    
│    │    
b    c   
│    │    
│◄───┘    
│    
d
│
e
│          
▼

I cannot reproduce the issue. Both d and e use output buffers from b and c as their input buffers. Is it invalid to read pinned memory on multiple concurrent streams? Keep in mind that while d and e would be reading from these buffers, nothing is writing to them.

Edit: I’ve gone ahead and swapped the pinned memory for device memory and still see the same issue, so this is unrelated to the allocation type.

Edit: It should be valid to concurrently read “any form of memory”, so this line of inquiry is very likely a dead end.

Edit: TL;DR: this is a CUDA bug; see details in the link below.

So I’ve reduced this issue to the point where nearly all of the context provided above is unnecessary. I would like to close this topic, but it doesn’t appear that I have permission, so I’m just going to open another one based on the minimized repro.

You can “close” an issue/topic by selecting one of the posts in the thread and marking it as the “solution”. That is, among the icons below a thread entry, select the one that looks like a checkmark in a box. After two weeks the thread becomes locked and can no longer accept further posts. It is perhaps a bit silly to call this a “solution”, but what you are doing is plainly evident from what you have posted in the thread.

