TensorRT concurrent or parallel inference on one GPU on the Jetson platform

Description

TensorRT C/C++ problem: on a Jetson Orin device, I start multiple threads, each loading its own .trt engine file and running AI inference in a loop (allocate memory -> inference -> release memory). Inference is launched with context->enqueueV3(), and the input/output buffers are allocated and released with cudaMallocManaged() and cudaFree(). After the program has been running for a while, the memory used by both threads grows continuously. Even if the input and output buffers were never released, they are only KB (input) and bytes (output) in size, far too small to explain the growth. Is this a memory leak?
Both Process I and Process II below result in this "memory leak", but memory grows faster with Process II than with Process I.
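For concreteness, the per-iteration buffer handling in each thread looks roughly like the sketch below. INPUT_BYTES and OUTPUT_BYTES are placeholder names for the real tensor sizes, and error checking is omitted:

// Per-iteration buffer cycle: allocate -> inference -> release.
// INPUT_BYTES / OUTPUT_BYTES are placeholders for the real tensor sizes.
void *inputPtr = nullptr, *outputPtr = nullptr;
cudaMallocManaged(&inputPtr, INPUT_BYTES);   // "apply memory"
cudaMallocManaged(&outputPtr, OUTPUT_BYTES);
// ... enqueueV3() inference as in Process I/II below ...
cudaFree(inputPtr);                          // "release memory"
cudaFree(outputPtr);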

Process I:
nvinfer1::IRuntime *runtime = …;
nvinfer1::ICudaEngine *engine = …;
while (1) { // do inference in an infinite loop
    // a new execution context and a new stream are created every iteration
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->setTensorAddress(INPUT_Name, (void *)inputPtr);
    context->setTensorAddress(OUTPUT_Name, (void *)outputPtr);
    context->enqueueV3(stream);

    context->destroy();
}
engine->destroy();
runtime->destroy();

Process II:
nvinfer1::IRuntime *runtime = …;
nvinfer1::ICudaEngine *engine = …;
// the execution context is created once and reused
nvinfer1::IExecutionContext *context = engine->createExecutionContext();
while (1) { // do inference in an infinite loop
    // a new stream is still created every iteration
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->setTensorAddress(INPUT_Name, (void *)inputPtr);
    context->setTensorAddress(OUTPUT_Name, (void *)outputPtr);
    context->enqueueV3(stream);
}
context->destroy();
engine->destroy();
runtime->destroy();
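One observation about the snippets above (not a confirmed root cause): both loops call cudaStreamCreate() on every iteration but never cudaStreamDestroy(), and enqueueV3() is asynchronous, so work is queued without ever being synchronized. A minimal sketch that creates the context and stream once, synchronizes each iteration, and destroys the stream afterwards, assuming inputPtr/outputPtr stay valid across iterations:

nvinfer1::IExecutionContext *context = engine->createExecutionContext();
cudaStream_t stream;
cudaStreamCreate(&stream);                         // create once, reuse
context->setTensorAddress(INPUT_Name, (void *)inputPtr);
context->setTensorAddress(OUTPUT_Name, (void *)outputPtr);
while (1) { // do inference in an infinite loop
    context->enqueueV3(stream);                    // asynchronous launch
    cudaStreamSynchronize(stream);                 // wait before buffers are reused or freed
}
cudaStreamDestroy(stream);                         // pair every create with a destroy
context->destroy();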

Environment

JetPack Version: 5.1-b147
TensorRT Version: 8.5.2-1
GPU Type: Jetson Orin NX 16GB
Nvidia Driver Version:
CUDA Version: 11.4
CUDNN Version: 8.6
Operating System + Version: Linux orinnx 5.10.104-tegra

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

The link below might be useful for you.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.
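If you stay with plain TensorRT, a minimal per-thread sketch consistent with TensorRT's thread-safety rules (one ICudaEngine can be shared across threads, but each thread needs its own IExecutionContext) might look like the following; the tensor names and the stop flag are placeholders:

#include <NvInfer.h>
#include <cuda_runtime.h>
#include <atomic>

static const char *INPUT_Name  = "input";   // placeholder tensor names
static const char *OUTPUT_Name = "output";
static std::atomic<bool> running{true};     // placeholder stop flag

// The engine is shared read-only; each thread owns its context and stream.
void worker(nvinfer1::ICudaEngine *engine, void *inputPtr, void *outputPtr) {
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);                        // once per thread
    context->setTensorAddress(INPUT_Name, inputPtr);
    context->setTensorAddress(OUTPUT_Name, outputPtr);
    while (running) {
        context->enqueueV3(stream);
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
    context->destroy();
}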

For more details, we recommend raising the query in the DeepStream forum,

or

raising the query in the Triton Inference Server GitHub issues section.

Thanks!