Issue using CUPTI Activity API with CUDA Graph enabled

Description

I have a test of the CUPTI Activity API, based on the official TensorRT sample trtexec.

CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));

Two callbacks are registered via cuptiActivityRegisterCallbacks: bufferRequested allocates an activity buffer whenever CUPTI needs one, and bufferCompleted processes the activity records once CUPTI hands a buffer back.
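For reference, the two callbacks in my test look roughly like this (a simplified sketch: the buffer size, the CUPTI_CALL macro, and the logging lines are placeholders, not the exact code from the attached sample):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cupti.h>

#define CUPTI_CALL(call)                                        \
  do {                                                          \
    CUptiResult _status = (call);                               \
    if (_status != CUPTI_SUCCESS) {                             \
      const char *errstr;                                       \
      cuptiGetResultString(_status, &errstr);                   \
      fprintf(stderr, "CUPTI error: %s\n", errstr);             \
      exit(EXIT_FAILURE);                                       \
    }                                                           \
  } while (0)

static const size_t kBufSize = 8 * 1024 * 1024;  // placeholder size

// Invoked by CUPTI whenever it needs a new activity buffer.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
  printf("bufferRequested\n");  // logging added for the test
  *buffer = static_cast<uint8_t *>(malloc(kBufSize));
  *size = kBufSize;
  *maxNumRecords = 0;  // 0 = let CUPTI pack as many records as fit
}

// Invoked by CUPTI when a buffer fills up or is flushed explicitly.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize) {
  printf("bufferCompleted\n");  // logging added for the test
  CUpti_Activity *record = nullptr;
  // Walk every record delivered in this buffer.
  while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
         CUPTI_SUCCESS) {
    // ... process record->kind here ...
  }
  free(buffer);
}

// Registration, as in the snippet above:
// CUPTI_CALL(cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted));
```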

I encountered an issue when using CUDA Graphs and repeating model inference many times. In the first few iterations the CUDA activities trigger both bufferRequested and bufferCompleted as expected. In later iterations, however, only bufferRequested fires, so the activity records are never processed.
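One detail that may be relevant: CUPTI only invokes bufferCompleted when an activity buffer fills up or when delivery is forced explicitly. As a sanity check, a flush after each batch of iterations should force CUPTI to return any partially filled buffers (a sketch, reusing the same CUPTI_CALL error-checking macro as above):

```cpp
// Force CUPTI to deliver all outstanding activity buffers now.
// If bufferCompleted fires only here, records were being produced
// but never delivered on their own.
CUPTI_CALL(cuptiActivityFlushAll(0));
```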

By the way, this issue cannot be reproduced on the x86 platform.

Environment

TensorRT Version: v8502
GPU Type: Jetson Orin
Nvidia Driver Version: nvidia-jetpack 5.1.1-b56
CUDA Version: 11.4
CUDNN Version: 8.6
Operating System + Version: Ubuntu 20.04.6 LTS
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): N/A

Relevant Files

trt_samples.tar.gz (838.9 KB)

Steps To Reproduce

  • tar xvzf trt_samples.tar.gz
  • cd trt_samples/trtexec
  • ./complie.sh
  • make -j8
  • sudo ../../bin/trtexec --loadEngine=./resnet.engine --useCudaGraph

I added logging statements to both the bufferRequested and bufferCompleted functions. When running the test, you will initially see output from both functions; after that, only the bufferRequested output remains.