TensorRT batch inference - How can I be sure one kernel uses all the GPU resources?


I tried to run batch inference executions in parallel, as recommended in the following TensorRT guide (in the “2.3 Streaming” part).
“In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers will be able to fully utilize the computation capabilities of the hardware. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.” – TensorRT Best practices Guide
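As a concrete sketch of the multi-stream approach the guide describes, `trtexec` can enqueue inference over several CUDA streams with its `--streams` option (here `model.onnx` is a placeholder for your network, not a file from this thread):

```shell
# Run timed inference with 4 CUDA streams so that independent
# enqueues can overlap on the GPU when resources are available.
trtexec --onnx=model.onnx --streams=4 --iterations=100
```

This is only a quick way to check whether multiple streams help at all before writing custom multi-stream code.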


In practice, most of the time the kernels do not run concurrently, so I gain nothing.
I also read the following post:


In theory, running batch inference in parallel could save a lot of time.
So, I want to verify that the kernels can’t be parallelized because the GPU resources are already fully used.
In order to profile batch inference kernels executions, I used Nsight Compute CLI.
I can’t find a way to measure how fully occupied the whole GPU is, because most of the metrics are reported “per SM”. For example:
sm_efficiency – the percentage of time at least one warp is active on a specific multiprocessor
I would like a metric that tells how many SMs are used on average during the kernel execution.
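One candidate (my own suggestion, not something confirmed in this thread) is the Nsight Compute metric `sm__cycles_active.avg.pct_of_peak_sustained_elapsed`, which averages active cycles across all SMs on the device rather than reporting a single multiprocessor; `./my_inference_app` below is a placeholder:

```shell
# For each kernel, report the percentage of elapsed cycles during which
# SMs were active, averaged over all SMs (a proxy for device-wide SM use).
ncu --metrics sm__cycles_active.avg.pct_of_peak_sustained_elapsed ./my_inference_app
```

A kernel showing a low value here leaves SMs idle that a concurrent kernel in another stream could, in principle, occupy.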

Can you try running your model with the trtexec command, and share the “--verbose” log in case the issue persists?
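For reference, a typical invocation of that suggestion (with `model.onnx` again as a placeholder) looks like:

```shell
# Build the engine from an ONNX model and time inference,
# printing verbose parser, layer, and tactic information.
trtexec --onnx=model.onnx --verbose
```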

You can refer to the link below for the full list of supported operators. If any operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script if you haven’t already, so that we can help you better.



Thanks for your reply.
Unfortunately, I’m afraid I can’t share my model.
But do you know of a metric that tells how many SMs are used on average during the kernel execution?

Hi @juliefraysse,

Sorry for the delayed response. We recommend you post your query on the Nsight Systems - NVIDIA Developer Forums to get better help.

Thank you.