I tried to run batch inference executions in parallel, as recommended in the TensorRT guide below (in the "2.3 Streaming" part).
“In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers will be able to fully utilize the computation capabilities of the hardware. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.” – TensorRT Best practices Guide
However, most of the time the kernels do not run concurrently, so I gain nothing.
I also read the following post:
In theory, running batch inference in parallel could save a lot of time.
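For context, here is a pseudocode-level sketch of what I mean by running inference in parallel: the same engine enqueued on several CUDA streams, one execution context per stream. The engine path and the bindings are placeholders; the TensorRT calls (`create_execution_context`, `execute_async_v2`) are from the TensorRT Python API, but this needs a GPU, a built engine, and allocated device buffers to actually run.

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context

N_STREAMS = 4

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:  # placeholder engine file
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

streams = [cuda.Stream() for _ in range(N_STREAMS)]
# One execution context (and one set of device buffers) per stream,
# so the enqueued work does not serialize on shared state.
contexts = [engine.create_execution_context() for _ in range(N_STREAMS)]

for ctx, stream in zip(contexts, streams):
    bindings = [...]  # placeholder: device pointers for this context's I/O buffers
    ctx.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

for stream in streams:
    stream.synchronize()
```

With independent streams like this, the hardware is free to overlap kernels from different contexts whenever SMs are idle, which is exactly the behavior I was hoping to observe.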
So, I want to verify that the kernels cannot run in parallel because the GPU resources are already fully used.
To profile the batch inference kernel executions, I used the Nsight Compute CLI.
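The kind of invocation I have in mind looks like this (`./trt_app` is a placeholder for the inference binary; the metric name is from recent Nsight Compute versions and may differ on older ones):

```shell
# Profile each kernel, collecting the fraction of elapsed cycles
# during which the SMs were active.
ncu --metrics sm__cycles_active.avg.pct_of_peak_sustained_elapsed ./trt_app
```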
I can't find a way to tell how fully occupied the GPU is, because most of the metrics are reported "per SM". For example:

`sm_efficiency` — the percentage of time at least one warp is active on a specific multiprocessor.
I would like a metric that tells me how many SMs are active, on average, during the kernel execution.
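To make the metric I am asking for concrete, here is a small worked example of what it would compute, starting from hypothetical per-SM "active cycles" counters (the kind of data behind metrics like `sm__cycles_active`). All numbers are made up for illustration:

```python
# Hypothetical 8-SM GPU: cycles the kernel was resident on the device.
elapsed_cycles = 10_000

# Active cycles recorded by each SM over that window (made-up values:
# three SMs nearly saturated, one half-busy, four idle).
active_cycles_per_sm = [10_000, 10_000, 9_000, 5_000, 0, 0, 0, 0]

# Each SM contributes the fraction of the window it was active;
# the sum is the average number of SMs busy at any instant.
avg_busy_sms = sum(c / elapsed_cycles for c in active_cycles_per_sm)

print(f"{avg_busy_sms:.2f} of {len(active_cycles_per_sm)} SMs busy on average")
```

A per-SM metric like `sm_efficiency` would report ~85% for the four busy SMs, hiding the four idle ones, whereas the aggregate above (3.40 of 8) is what I actually want to know.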