Xid 31 error when two CUDA graph-captured ExecutionContexts are executed concurrently

Description

I have two TensorRT plans compiled from ONNX using the standard TensorRT builder and ONNX parser.

I can successfully capture the ExecutionContexts created from these plans into CUDA graphs and launch those graphs on streams (with the expected outputs).
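For context, the capture step looks roughly like the following sketch. This is not the exact repro code: captureToGraph is an illustrative helper, and it assumes the context's I/O tensor addresses have already been bound with setTensorAddress.

#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Illustrative helper (assumed names): records one enqueueV3 call into a
// CUDA graph and returns an instantiated executable graph.
cudaGraphExec_t captureToGraph(nvinfer1::IExecutionContext* context, cudaStream_t stream)
{
    // Warm-up enqueue outside capture so TensorRT finishes any lazy
    // resource allocation before the graph is recorded.
    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    context->enqueueV3(stream);
    cudaStreamEndCapture(stream, &graph);

    // cudaGraphInstantiateWithFlags is available on both CUDA 11.8 and 12.x.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiateWithFlags(&graphExec, graph, 0);
    cudaGraphDestroy(graph);
    return graphExec;
}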

However, when these graphs are launched repeatedly in a loop, and certain conditions are met, we eventually hit an Xid 31 error after an arbitrarily large number of iterations. In the program, the error surfaces as CUDA error 700 (illegal memory access) when synchronizing the first stream.
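The launch loop is structured roughly as follows. This is again a simplified sketch with illustrative names (graphExecA/graphExecB and streamA/streamB stand for the two captured ExecutionContexts and the two streams); the actual code is in the repository linked under Steps To Reproduce.

#include <cuda_runtime_api.h>

// Illustrative launch loop: both graphs are launched concurrently on their
// own streams, then each stream is synchronized.
void launchLoop(cudaGraphExec_t graphExecA, cudaGraphExec_t graphExecB,
                cudaStream_t streamA, cudaStream_t streamB)
{
    for (;;) {
        cudaGraphLaunch(graphExecA, streamA);
        cudaGraphLaunch(graphExecB, streamB);

        // This synchronize on the first stream is where CUDA error 700
        // (cudaErrorIllegalAddress) eventually shows up once the Xid 31 occurs.
        cudaError_t err = cudaStreamSynchronize(streamA);
        if (err != cudaSuccess) {
            break;  // error 700 observed here
        }
        cudaStreamSynchronize(streamB);
    }
}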

The following conditions must all be true to trigger the error:

  • The ExecutionContexts must be captured into CUDA graphs.
  • The two ExecutionContexts must be executing in parallel (on two streams).
  • There must be other compute processes on the same GPU.

compute-sanitizer (all tools) and cuda-memcheck (all tools) report no problems. The issue does not seem to reproduce under cuda-gdb. With CUDA_LAUNCH_BLOCKING=1 set, the error is still reported when synchronizing.

Environment

TensorRT Version: 8.6.1.6
GPU Type: tested with RTX 4070 and RTX A4500
Nvidia Driver Version: 550.78 (RTX 4070) or 525.60.13 (RTX A4500)
CUDA Version: tested with 11.8 and 12.3.2
CUDNN Version: 8.9.7
Operating System + Version: tested with Linux 6.6 and Linux 6.1
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): tested on nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 and nvcr.io/nvidia/tensorrt:24.01-py3

Relevant Files

Steps To Reproduce

git clone git@github.com:soooch/weird-trt-thing.git
cd weird-trt-thing
docker run --gpus all -it --rm -v .:/workspace nvcr.io/nvidia/tensorrt:24.01-py3

Once inside the container:

apt update
apt-get install -y parallel

make

# need at least 2, but will fail faster if more (hence 16)
parallel -j0 --delay 0.3 ./fuzzer ::: {1..16}
# wait up to ~10 minutes (usually much faster)

I've also submitted this as an issue on the TensorRT GitHub.

It appears this is a known, tracked CUDA bug.
