TensorRT 3.0.2 with multi-streaming

viacheslav.natashyn · May 7, 2018, 8:27pm

I am profiling a TensorRT C++ application with the nvprof tool. The application creates 4 independent parallel streams and enqueues work on each of them from one common TRT execution context:

for (int stream_idx = 0; stream_idx < batchSize; stream_idx++) {
    trtcontext.enqueue(1, buffers[stream_idx], stream[stream_idx], nullptr);
...

I expect this code to produce 4 or more equivalent CUDA streams whose TRT kernels execute concurrently and overlap in time.

However, when I open the nvprof log in the NVIDIA Visual Profiler, there are only two main streams: the Default stream, which handles the HtoD memory transfers, and one other stream, which runs the TRT kernels. All of the kernels are scheduled on that single stream and do not overlap with the HtoD memory transfers at all.

This raises the following questions:

Is this really a limitation of TensorRT, i.e. is it unable to make use of CUDA streams?
Could it just be a problem with the nvprof tool, i.e. that it cannot show concurrent CUDA kernel execution?
Does it actually make sense to profile a TRT application with nvprof?
How can we verify that TensorRT is really using multiple streams?

NOTE: I am running and profiling my app on NVIDIA Drive PX with TensorRT 3.0.2
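
On the last question, one way to check for concurrency without relying on the timeline view is to take the per-kernel start and end timestamps (for example from the output of nvprof --print-gpu-trace) and test whether kernels on different streams overlap. Below is a minimal sketch of that check. The KernelInterval struct and the hasCrossStreamOverlap function are hypothetical names introduced for illustration, and parsing the profiler output is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one kernel execution taken from a profiler trace.
struct KernelInterval {
    int streamId;          // stream the kernel ran on
    std::int64_t startNs;  // start timestamp in nanoseconds
    std::int64_t endNs;    // end timestamp in nanoseconds
};

// Returns true if at least two kernels on different streams overlap in time.
// Kernels that merely touch (one ends exactly when the next starts) do not count.
bool hasCrossStreamOverlap(std::vector<KernelInterval> kernels) {
    std::sort(kernels.begin(), kernels.end(),
              [](const KernelInterval& a, const KernelInterval& b) {
                  return a.startNs < b.startNs;
              });
    for (std::size_t i = 0; i < kernels.size(); ++i) {
        assert(kernels[i].endNs >= kernels[i].startNs);
        // After sorting, every later kernel that starts before kernel i ends
        // overlaps kernel i.
        for (std::size_t j = i + 1;
             j < kernels.size() && kernels[j].startNs < kernels[i].endNs; ++j) {
            if (kernels[j].streamId != kernels[i].streamId) {
                return true;
            }
        }
    }
    return false;
}
```

If this returns false for a trace that covers all 4 enqueue calls, the kernels really were serialized, and the profiler view is not the problem.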

Thanks,

mvillmow · June 7, 2018, 6:15pm

There isn’t enough information here to know exactly what is going wrong, but I can confirm that parallel stream execution does work. In my GTC 2018 talk I discussed optimizing OpenNMT, and on pages 33 and 34 you can see multi-stream execution and parallel kernel execution.

There could be a few reasons for not seeing this.

The kernels are launch bound: the next kernel has not finished launching by the time the current kernel is done.
The kernels are resource bound: there are no free resources left to execute the kernels in parallel.
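
To make the launch-bound case concrete, here is a toy model, not a measurement. It assumes a single host thread that issues kernels round-robin over the streams, a fixed host-side cost per launch, a fixed kernel duration, and a GPU with enough free resources to run everything that has been launched. The function name maxConcurrentKernels and all parameters are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model (an assumption, not real hardware behavior): each launch costs
// launchUs of host time, each kernel runs for kernelUs, and kernels in the
// same stream run in order. Returns the peak number of kernels running at once.
int maxConcurrentKernels(int numStreams, int kernelsPerStream,
                         std::int64_t launchUs, std::int64_t kernelUs) {
    assert(numStreams > 0 && kernelsPerStream > 0 && launchUs > 0 && kernelUs > 0);
    std::vector<std::int64_t> streamFreeAt(numStreams, std::int64_t{0});
    std::vector<std::pair<std::int64_t, int>> events;  // (time, +1 start / -1 end)
    std::int64_t hostTime = 0;
    for (int k = 0; k < kernelsPerStream; ++k) {
        for (int s = 0; s < numStreams; ++s) {
            hostTime += launchUs;  // the host finishes issuing this launch
            const std::int64_t start = std::max(hostTime, streamFreeAt[s]);
            const std::int64_t end = start + kernelUs;
            streamFreeAt[s] = end;
            events.push_back({start, +1});
            events.push_back({end, -1});
        }
    }
    // Pairs sort by time first, so an end (-1) at time t is processed before a
    // start (+1) at the same t and back-to-back kernels are not counted as overlapping.
    std::sort(events.begin(), events.end());
    int running = 0;
    int peak = 0;
    for (const auto& e : events) {
        running += e.second;
        peak = std::max(peak, running);
    }
    return peak;
}
```

In this model, with 4 streams and a 5 us launch cost, a 3 us kernel never overlaps with another one (peak of 1), a 10 us kernel gives a peak of 2, and a 20 us kernel is needed before all 4 streams are busy at once. Real launch costs and kernel durations have to come from the profiler.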

boris.lesner · June 8, 2018, 3:26pm

Your engine may allocate buffers for its intermediate results, so concurrent execution will just result in a data race and you’ll end up with wrong output.

If you want parallelism, you need a dedicated (CudaEngine, ExecutionContext) pair for each stream. Then, if your kernels are not resource bound, you should get parallel execution.
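
The hazard described here can be illustrated with a small plain C++ model that does not use TensorRT or CUDA at all. ToyContext is a hypothetical stand-in for something that keeps an intermediate result in its own scratch storage, and the interleaving is written out by hand so the result is deterministic. Whether a real execution context behaves this way is the assumption made in this post, not something the sketch proves.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a context that stores an intermediate result
// in scratch memory between two stages of a computation.
struct ToyContext {
    float scratch = 0.0f;
    bool stage1Done = false;

    void stage1(float input) {  // writes the intermediate result
        scratch = input * 2.0f;
        stage1Done = true;
    }
    float stage2() const {      // reads the intermediate result
        assert(stage1Done);
        return scratch + 1.0f;
    }
};

// Two requests interleaved on ONE shared context: the second stage1 call
// overwrites the intermediate result of the first request.
std::vector<float> runShared(float a, float b) {
    ToyContext shared;
    shared.stage1(a);
    shared.stage1(b);
    const float outA = shared.stage2();  // computed from b's intermediate, not a's
    const float outB = shared.stage2();
    return {outA, outB};
}

// The same interleaving with one context per request: each keeps its own scratch.
std::vector<float> runDedicated(float a, float b) {
    ToyContext ctxA;
    ToyContext ctxB;
    ctxA.stage1(a);
    ctxB.stage1(b);
    return {ctxA.stage2(), ctxB.stage2()};
}
```

With inputs 1 and 2, the dedicated version produces 3 and 5, while the shared version produces 5 for both requests because the first intermediate result was overwritten.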

Hope it helps. I spent a lot of time working this out, as it is not really well documented.

myih · September 10, 2018, 9:58pm

I’m having a similar issue with TensorRT 4, using separate engines, contexts, streams and buffers:

IExecutionContext* context = engine->createExecutionContext();
IExecutionContext* context2 = engine2->createExecutionContext();
cudaStream_t stream, stream2;
CHECK(cudaStreamCreate(&stream));
CHECK(cudaStreamCreate(&stream2));

CHECK(cudaMemset(buffers[inputIdx], 0, inputSize));
CHECK(cudaMemset(buffers2[inputIdx], 0, inputSize));

for (int i = 0; i < iteration; i++) // iteration is 1000
{
    context->enqueue(batchSize, buffers, stream, nullptr);
    context2->enqueue(batchSize, buffers2, stream2, nullptr);
}

But the two contexts still execute in an interleaved fashion rather than concurrently.
However, if I use MPS and launch the program multiple times, the instances execute concurrently and the overall runtime is faster. This can be verified with the Visual Profiler.
