I’m running an inference application with TensorRT 2.1 on multiple CUDA streams, but the application shows low concurrency across streams. According to my profiling with Visual Profiler, the trtwell_scudnn_128x32_relu_interior_nn kernels launched on different CUDA streams do not run in parallel. (It seems that only one trtwell_scudnn_128x32_relu_interior_nn kernel can run at a time.) Is there some kind of mutual exclusion between them?
Because TensorRT seems to involve a lot of CPU-GPU interaction, I created one POSIX thread per CUDA stream so that the CPU-side routines inside TensorRT can run in parallel. Each POSIX worker thread repeats the following steps:
- sem_wait for a batch input
- cudaMemcpyAsync (Host to Device)
- cudaMemcpyAsync (Device to Host)
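The host-side structure of this worker loop can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual application code: `Worker`, `worker_main`, `run_workers`, and the two placeholder functions are hypothetical names, and the CUDA calls appear only as comments so that the sketch compiles without the CUDA toolkit.

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <cassert>
#include <vector>

// Hypothetical per-worker state. In a real application each worker would also
// own its cudaStream_t, its host buffers, and its device buffers.
struct Worker {
    sem_t batch_ready;       // posted once per batch by the producer
    int batches_to_run = 0;  // how many batches this worker should process
    int batches_done = 0;    // written by the worker, read after pthread_join
};

// Placeholders for the GPU steps in the list above (assumed, not real code).
static void copy_input_to_device(Worker&) {
    // e.g. cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
}
static void copy_output_to_host(Worker&) {
    // e.g. cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
}

static void* worker_main(void* arg) {
    Worker* w = static_cast<Worker*>(arg);
    for (int i = 0; i < w->batches_to_run; ++i) {
        sem_wait(&w->batch_ready);  // block until a batch input is available
        copy_input_to_device(*w);
        copy_output_to_host(*w);
        ++w->batches_done;
    }
    return nullptr;
}

// Starts one worker thread per stream, feeds every worker the given number of
// batches, and returns the total number of batches processed.
int run_workers(int num_workers, int batches_per_worker) {
    std::vector<Worker> workers(num_workers);  // sized once, addresses stay valid
    std::vector<pthread_t> threads(num_workers);
    for (int i = 0; i < num_workers; ++i) {
        workers[i].batches_to_run = batches_per_worker;
        int rc = sem_init(&workers[i].batch_ready, 0, 0);
        assert(rc == 0);
        rc = pthread_create(&threads[i], nullptr, worker_main, &workers[i]);
        assert(rc == 0);
        (void)rc;  // silence unused-variable warnings when NDEBUG is defined
    }
    // Producer side: signal each worker once per available batch.
    for (int b = 0; b < batches_per_worker; ++b) {
        for (int i = 0; i < num_workers; ++i) {
            sem_post(&workers[i].batch_ready);
        }
    }
    int total = 0;
    for (int i = 0; i < num_workers; ++i) {
        pthread_join(threads[i], nullptr);
        total += workers[i].batches_done;
        sem_destroy(&workers[i].batch_ready);
    }
    return total;
}
```

With this sketch, `run_workers(2, 3)` returns 6, since each of the two workers processes three batches. On Linux it should build with `-pthread`.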
I’m seeing the same behavior on both Quadro GP100 and Jetson TX2.