For the same TensorRT application source compiled for both a TX2 (JetPack 4.2.1) and an x86 host (CUDA 10.0, GeForce RTX 2060), multiple calls to IExecutionContext::enqueue issued on different streams run sequentially on the TX2 but are parallelized on the host.
I would expect the enqueue operations to execute in parallel on the TX2 as well, since that is the whole point of using multiple streams.
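For context, this is the pattern in use: one execution context per stream, with the enqueue calls issued back to back. This is an illustrative sketch, not the actual application code; names such as kNumStreams, buffers, and batchSize are placeholders (the enqueue signature shown matches the TensorRT 5.x API shipped with JetPack 4.2.1).

```cpp
// Sketch: one execution context and one CUDA stream per inference lane.
// kNumStreams, buffers[], and batchSize are assumed placeholders.
constexpr int kNumStreams = 2;

cudaStream_t streams[kNumStreams];
nvinfer1::IExecutionContext* contexts[kNumStreams];

for (int i = 0; i < kNumStreams; ++i) {
    cudaStreamCreate(&streams[i]);
    contexts[i] = engine->createExecutionContext();  // one context per stream
}

// Issue the inferences back to back; with independent streams these
// launches should be free to overlap on the GPU.
for (int i = 0; i < kNumStreams; ++i) {
    contexts[i]->enqueue(batchSize, buffers[i], streams[i], nullptr);
}

for (int i = 0; i < kNumStreams; ++i) {
    cudaStreamSynchronize(streams[i]);
}
```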
See the attached results captured with nvprof. The observation is also consistent with measurements from our internal event-based CUDA profiler.
Can anyone explain this suboptimal behavior of TensorRT on the TX2?