I tried to run batch inference executions in parallel, as recommended in the TensorRT guide below (in the "2.3 Streaming" part).
“In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers will be able to fully utilize the computation capabilities of the hardware. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.” – TensorRT Best practices Guide
However, most of the time the kernels do not run concurrently, so I gain nothing.
I also read the following post:
In theory, running batch inference in parallel could save a lot of time.
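For context, here is a pseudocode-level sketch of what I mean by running inference in parallel: the same engine enqueued on several CUDA streams, one execution context per stream. The engine path and the bindings are placeholders; the TensorRT calls (`create_execution_context`, `execute_async_v2`) are from the TensorRT Python API, but this needs a GPU, a built engine, and allocated device buffers to actually run.

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context

N_STREAMS = 4

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:  # placeholder engine file
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

streams = [cuda.Stream() for _ in range(N_STREAMS)]
# One execution context (and one set of device buffers) per stream,
# so the enqueued work does not serialize on shared state.
contexts = [engine.create_execution_context() for _ in range(N_STREAMS)]

for ctx, stream in zip(contexts, streams):
    bindings = [...]  # placeholder: device pointers for this context's I/O buffers
    ctx.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

for stream in streams:
    stream.synchronize()
```

With independent streams like this, the hardware is free to overlap kernels from different contexts whenever SMs are idle, which is exactly the behavior I was hoping to observe.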
So, I want to verify that the kernels cannot run in parallel because the GPU resources are already fully used.
To profile the batch inference kernel executions, I used the Nsight Compute CLI.
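The kind of invocation I have in mind looks like this (`./trt_app` is a placeholder for the inference binary; the metric name is from recent Nsight Compute versions and may differ on older ones):

```shell
# Profile each kernel, collecting the fraction of elapsed cycles
# during which the SMs were active.
ncu --metrics sm__cycles_active.avg.pct_of_peak_sustained_elapsed ./trt_app
```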
I can't find a way to tell how fully occupied the GPU is, because most of the metrics are reported "per SM". For example:

`sm_efficiency` — the percentage of time at least one warp is active on a specific multiprocessor.
I would like a metric that tells me how many SMs are active, on average, during the kernel execution.
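To make the metric I am asking for concrete, here is a small worked example of what it would compute, starting from hypothetical per-SM "active cycles" counters (the kind of data behind metrics like `sm__cycles_active`). All numbers are made up for illustration:

```python
# Hypothetical 8-SM GPU: cycles the kernel was resident on the device.
elapsed_cycles = 10_000

# Active cycles recorded by each SM over that window (made-up values:
# three SMs nearly saturated, one half-busy, four idle).
active_cycles_per_sm = [10_000, 10_000, 9_000, 5_000, 0, 0, 0, 0]

# Each SM contributes the fraction of the window it was active;
# the sum is the average number of SMs busy at any instant.
avg_busy_sms = sum(c / elapsed_cycles for c in active_cycles_per_sm)

print(f"{avg_busy_sms:.2f} of {len(active_cycles_per_sm)} SMs busy on average")
```

A per-SM metric like `sm_efficiency` would report ~85% for the four busy SMs, hiding the four idle ones, whereas the aggregate above (3.40 of 8) is what I actually want to know.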