TensorRT on Multiple CUDA-Streams


I’m running an inference application with TensorRT 2.1 on multiple CUDA-streams, but the application shows low CUDA-stream concurrency. According to my profiling with Visual Profiler, the trtwell_scudnn_128x32_relu_interior_nn kernels launched on the different CUDA-streams do not run in parallel. (It seems that only one trtwell_scudnn_128x32_relu_interior_nn kernel can run at a time.) Is there some kind of mutual exclusion between them?

Because TensorRT seems to have many CPU-GPU interactions, I created one POSIX thread per CUDA-stream so that the CPU-side routines inside TensorRT can run in parallel. Each POSIX worker thread repeats the following steps:

  1. sem_wait for a batch input
  2. cudaMemcpyAsync (Host to Device)
  3. nvinfer1::IExecutionContext::enqueue
  4. cudaMemcpyAsync (Device to Host)
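
To make the structure of that loop concrete, here is a minimal sketch of the per-stream worker in C++. It is not my actual application code: the three GPU steps are replaced by hypothetical hooks (GpuOps), and Worker and run_workers are names made up for illustration, so that the threading part compiles without CUDA or TensorRT. It assumes Linux, where unnamed POSIX semaphores (sem_init) are available.

```cpp
#include <semaphore.h>   // POSIX semaphores: sem_init, sem_wait, sem_post
#include <atomic>
#include <cerrno>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical hooks standing in for the three GPU calls of one iteration.
// In the real application each hook would issue, on the worker's own stream:
//   copy_in  -> cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream)
//   infer    -> nvinfer1::IExecutionContext::enqueue(...)
//   copy_out -> cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream)
struct GpuOps {
    std::function<void(int)> copy_in;
    std::function<void(int)> infer;
    std::function<void(int)> copy_out;
};

// Per-stream state shared between the producer and one worker thread.
struct Worker {
    sem_t batch_ready;              // posted once per batch, once more to stop
    std::atomic<int> pending{0};    // batches posted but not yet processed
    std::atomic<bool> stop{false};  // set before the final post
    std::atomic<int> done{0};       // batches fully processed
};

void worker_loop(int stream_id, Worker& w, const GpuOps& ops) {
    for (;;) {
        // 1. sem_wait for a batch input (retry if interrupted by a signal)
        while (sem_wait(&w.batch_ready) == -1 && errno == EINTR) {}
        if (w.pending.load() == 0) {
            if (w.stop.load()) break;  // woken by the stop post, queue drained
            continue;
        }
        --w.pending;
        ops.copy_in(stream_id);   // 2. cudaMemcpyAsync (Host to Device)
        ops.infer(stream_id);     // 3. IExecutionContext::enqueue
        ops.copy_out(stream_id);  // 4. cudaMemcpyAsync (Device to Host)
        ++w.done;
    }
}

// Starts one worker per stream, feeds each the same number of batches,
// then stops the workers and returns how many batches each one processed.
std::vector<int> run_workers(int n_streams, int batches_per_stream,
                             const GpuOps& ops) {
    std::vector<Worker> workers(n_streams);
    for (auto& w : workers) sem_init(&w.batch_ready, 0, 0);

    std::vector<std::thread> threads;
    for (int i = 0; i < n_streams; ++i)
        threads.emplace_back(worker_loop, i, std::ref(workers[i]),
                             std::cref(ops));

    // Producer side: announce each batch to every worker.
    for (int b = 0; b < batches_per_stream; ++b)
        for (auto& w : workers) { ++w.pending; sem_post(&w.batch_ready); }

    // Ask every worker to exit once its pending batches are processed.
    for (auto& w : workers) { w.stop = true; sem_post(&w.batch_ready); }
    for (auto& t : threads) t.join();

    std::vector<int> done;
    for (auto& w : workers) {
        done.push_back(w.done.load());
        sem_destroy(&w.batch_ready);
    }
    return done;
}
```

Calling run_workers(3, 4, ops) with hooks that only count their invocations should report four processed batches for each of the three workers. In the real application every worker additionally owns its own cudaStream_t and its own IExecutionContext, and passes that stream to all three GPU calls.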

I’m seeing the same behavior on both Quadro GP100 and Jetson TX2.


I’m getting even worse results with TensorRT 3.2: it does not produce any overlapping kernel executions, and it issues a lot of HtoD memory transfers on the default stream.