TensorRT 3.0.2 with multi-streaming

I am profiling a TensorRT (TRT) C++ application with the nvprof tool. The application creates 4 independent CUDA streams and enqueues work on each of them through a single, shared TRT execution context:

for (int stream_idx = 0; stream_idx < batchSize; stream_idx++) {
    // One enqueue per stream (batch size 1), all through the same execution context.
    trtcontext.enqueue(1, buffers[stream_idx], stream[stream_idx], nullptr);
}

I expect this code to produce 4 or more equivalent CUDA streams, with the TRT kernels executing concurrently and overlapping in time.

However, when I open the nvprof log in the NVIDIA Visual Profiler, there are only two main streams. The Default stream is dedicated to HtoD memory transfers, and the other one runs the TRT kernels. All kernels are scheduled in that one stream, and they do not overlap with the HtoD memory transfers at all.

This raises the following questions:

  1. Is this really a TensorRT issue, i.e. is TensorRT unable to make use of CUDA streams?
  2. Could it just be a problem with nvprof, i.e. the tool is unable to show concurrent CUDA kernel execution?
  3. Does it actually make sense to profile a TRT application with nvprof?
  4. How can we make sure that TensorRT is really using multiple streams?

NOTE: I am running and profiling my app on an NVIDIA Drive PX with TensorRT 3.0.2.


There isn’t enough information here to know exactly what might be going wrong, but I can confirm that parallel stream execution does work. In my GTC 2018 talk I discussed optimizing OpenNMT, and on pages 33 and 34 you can see multi-stream execution and parallel kernel execution.

There could be a few reasons for not seeing this.

  1. The kernels are launch bound: the current kernel is already done before the next kernel has finished launching, so there is nothing left to overlap.
  2. The kernels are resource bound: there are no free resources on the GPU to execute further kernels in parallel.
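
To make the launch-bound case concrete, here is a toy timeline model in plain C++. It is only a sketch under stated assumptions, not a measurement: the function name maxConcurrentKernels and all timing values are made up for illustration, a single host thread issues the launches round-robin over the streams, and the GPU is assumed to have unlimited free resources (so reason 2 never applies).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (assumption, not a profiler result): one host thread issues
// `launches` kernels round-robin over `numStreams` streams. Each launch costs
// `launchUs` on the host, each kernel runs for `kernelUs` on the device.
// Returns the largest number of kernels running at the same instant.
int maxConcurrentKernels(double launchUs, double kernelUs,
                         int numStreams, int launches) {
    assert(numStreams >= 1 && launches >= 0);
    std::vector<double> start(launches), end(launches);
    for (int i = 0; i < launches; ++i) {
        // Time at which the host has finished launching kernel i.
        double issued = (i + 1) * launchUs;
        // A kernel must also wait for the previous kernel on its own stream.
        double streamFree = (i >= numStreams) ? end[i - numStreams] : 0.0;
        start[i] = std::max(issued, streamFree);
        end[i] = start[i] + kernelUs;
    }
    int best = 0;
    for (int i = 0; i < launches; ++i) {
        // Count the kernels that are running at the moment kernel i starts.
        int running = 0;
        for (int j = 0; j < launches; ++j) {
            if (start[j] <= start[i] && start[i] < end[j]) ++running;
        }
        best = std::max(best, running);
    }
    return best;
}
```

In this model, maxConcurrentKernels(10.0, 5.0, 4, 8) returns 1: with a 10 us launch cost and 5 us kernels, every kernel is finished before the next one arrives, so 4 streams give no overlap. With maxConcurrentKernels(2.0, 10.0, 4, 8) the result is 4, because the launches are cheap relative to the kernel runtime.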

Your engine may allocate buffers for its intermediate results. Concurrent execution through one shared execution context would then just produce a data race, and you would end up with wrong output.

If you want parallelism, you need a dedicated (CudaEngine, ExecutionContext) pair for each stream. Then, as long as your kernels are not resource bound, you should see parallel execution.
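
A rough sketch of that layout follows. The helper name enqueueOnePerStream is hypothetical, and it assumes the contexts, device buffers and streams were already created elsewhere (for example one createExecutionContext() call and one cudaStreamCreate() call per stream). Context and Stream are template parameters standing in for nvinfer1::IExecutionContext and cudaStream_t so that the snippet compiles without the TensorRT and CUDA headers; the enqueue call uses the same argument order as in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: Context is expected to expose
//   bool enqueue(int batchSize, void** bindings, Stream stream, <event pointer>)
// like the TensorRT execution context used in this thread.
template <typename Context, typename Stream>
bool enqueueOnePerStream(const std::vector<Context*>& contexts,
                         const std::vector<void**>& bindings,
                         const std::vector<Stream>& streams,
                         int batchSize) {
    // Every stream gets its own context and its own set of device buffers.
    assert(contexts.size() == streams.size());
    assert(bindings.size() == streams.size());
    for (std::size_t i = 0; i < streams.size(); ++i) {
        // Stream i is only ever used together with context i and bindings i,
        // so no two streams share intermediate buffers.
        if (!contexts[i]->enqueue(batchSize, bindings[i], streams[i], nullptr)) {
            return false;  // stop at the first failed enqueue
        }
    }
    return true;
}
```

The caller still has to synchronize every stream (for example with cudaStreamSynchronize) before reading the outputs, since enqueue only queues the work.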

Hope this helps. I spent a lot of time working this out, as it is not really well documented.

I’m having a similar issue with TensorRT 4, even with separate engines, contexts, streams, and buffers:

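    IExecutionContext* context = engine->createExecutionContext();
    IExecutionContext* context2 = engine2->createExecutionContext();

    // Streams must be created before they are passed to enqueue().
    cudaStream_t stream, stream2;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaStreamCreate(&stream2));

    CHECK(cudaMemset(buffers[inputIdx], 0, inputSize));
    CHECK(cudaMemset(buffers2[inputIdx], 0, inputSize));

    for (int i = 0; i < iteration; i++) { // iteration = 1000
        // Braces added so that both contexts are enqueued on every iteration.
        context->enqueue(batchSize, buffers, stream, nullptr);
        context2->enqueue(batchSize, buffers2, stream2, nullptr);
    }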

But the two contexts still execute interleaved rather than concurrently.
However, if I use MPS and launch the program multiple times, the processes do execute concurrently and the overall runtime is shorter, which can be verified with the Visual Profiler.