TensorRT concurrent or parallel inference on one GPU on the Jetson platform

Description

TensorRT C/C++ problem: on a Jetson Orin device, I start multiple threads, each loading its own .trt engine file and running AI inference in a loop (allocate memory -> inference -> release memory). Inference is launched with context->enqueueV3(), and the input/output buffers are allocated and released with cudaMallocManaged() and cudaFree(). After the program has been running for a while, the memory used by both threads grows continuously. Even if the input and output buffers were never released, they are only KB (input) and bytes (output) in size, far too small to explain the growth. Is this a memory leak?
Both Process I and Process II below result in this "memory leak", but memory grows faster with Process II than with Process I.
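For concreteness, the per-iteration buffer handling in each thread looks roughly like the sketch below. INPUT_BYTES and OUTPUT_BYTES are placeholder names for the real tensor sizes, and error checking is omitted:

// Per-iteration buffer cycle: allocate -> inference -> release.
// INPUT_BYTES / OUTPUT_BYTES are placeholders for the real tensor sizes.
void *inputPtr = nullptr, *outputPtr = nullptr;
cudaMallocManaged(&inputPtr, INPUT_BYTES);   // "apply memory"
cudaMallocManaged(&outputPtr, OUTPUT_BYTES);
// ... enqueueV3() inference as in Process I/II below ...
cudaFree(inputPtr);                          // "release memory"
cudaFree(outputPtr);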

Process I:
nvinfer1::IRuntime *runtime = …;
nvinfer1::ICudaEngine *engine = …;
while (1) { // do inference in an infinite loop
    // a new execution context and a new stream are created every iteration
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->setTensorAddress(INPUT_Name, (void *)inputPtr);
    context->setTensorAddress(OUTPUT_Name, (void *)outputPtr);
    context->enqueueV3(stream);

    context->destroy();
}
engine->destroy();
runtime->destroy();

Process II:
nvinfer1::IRuntime *runtime = …;
nvinfer1::ICudaEngine *engine = …;
// the execution context is created once and reused
nvinfer1::IExecutionContext *context = engine->createExecutionContext();
while (1) { // do inference in an infinite loop
    // a new stream is still created every iteration
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->setTensorAddress(INPUT_Name, (void *)inputPtr);
    context->setTensorAddress(OUTPUT_Name, (void *)outputPtr);
    context->enqueueV3(stream);
}
context->destroy();
engine->destroy();
runtime->destroy();
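One observation about the snippets above (not a confirmed root cause): both loops call cudaStreamCreate() on every iteration but never cudaStreamDestroy(), and enqueueV3() is asynchronous, so work is queued without ever being synchronized. A minimal sketch that creates the context and stream once, synchronizes each iteration, and destroys the stream afterwards, assuming inputPtr/outputPtr stay valid across iterations:

nvinfer1::IExecutionContext *context = engine->createExecutionContext();
cudaStream_t stream;
cudaStreamCreate(&stream);                         // create once, reuse
context->setTensorAddress(INPUT_Name, (void *)inputPtr);
context->setTensorAddress(OUTPUT_Name, (void *)outputPtr);
while (1) { // do inference in an infinite loop
    context->enqueueV3(stream);                    // asynchronous launch
    cudaStreamSynchronize(stream);                 // wait before buffers are reused or freed
}
cudaStreamDestroy(stream);                         // pair every create with a destroy
context->destroy();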

Environment

JetPack Version: 5.1-b147
TensorRT Version: 8.5.2-1
GPU Type: Jetson Orin NX 16GB
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.6
Operating System + Version: Linux orinnx 5.10.104-tegra

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

The link below might be useful for you.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.
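If you stay with plain TensorRT, a minimal per-thread sketch consistent with TensorRT's thread-safety rules (one ICudaEngine can be shared across threads, but each thread needs its own IExecutionContext) might look like the following; the tensor names and the stop flag are placeholders:

#include <NvInfer.h>
#include <cuda_runtime.h>
#include <atomic>

static const char *INPUT_Name  = "input";   // placeholder tensor names
static const char *OUTPUT_Name = "output";
static std::atomic<bool> running{true};     // placeholder stop flag

// The engine is shared read-only; each thread owns its context and stream.
void worker(nvinfer1::ICudaEngine *engine, void *inputPtr, void *outputPtr) {
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);                        // once per thread
    context->setTensorAddress(INPUT_Name, inputPtr);
    context->setTensorAddress(OUTPUT_Name, outputPtr);
    while (running) {
        context->enqueueV3(stream);
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
    context->destroy();
}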

For more details, we recommend raising the query in the DeepStream forum,

or

raising the query in the Triton Inference Server GitHub issues section.

Thanks!