How to optimize the TensorRT engine for Tensor Cores?

I used the trtexec tool to evaluate my model on Jetson AGX Orin.

The result shows poor execution efficiency.
My model is almost entirely 3x3 convolutions, and the execution efficiency is only about 7%.
(My model has dense parameters, so I think it cannot benefit from sparsity.)

So I want to try optimizing my model.
If possible, I would like to increase the execution efficiency to 20%.
Are there any programming tools or libraries for Tensor Cores?

Regards,
hiro

Hi,

How do you measure the efficiency?
Do you get it from the GPU utilization of tegrastats?

To run a model on Tensor Cores, please run inference in fp16 or int8 mode.
Thanks.

Dear @AastaLLL,

How do you measure the efficiency?
Do you get it from the GPU utilization of tegrastats?

I calculate the efficiency as follows.

Execution efficiency = (total computational complexity of the convolutions) / (processing time × processing speed) × 100
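
In Python, the same definition looks like this (a minimal sketch; the function and argument names are my own, chosen only to make the units explicit):

def execution_efficiency(total_tflop, time_s, peak_tflops):
    """Achieved fraction of peak Tensor Core throughput, in percent.

    total_tflop: total floating-point work of the model, in TFLOP
    time_s:      measured GPU compute time, in seconds
    peak_tflops: peak Tensor Core throughput at current clocks, in TFLOPS
    """
    return total_tflop / (time_s * peak_tflops) * 100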

I attached a model and a script file to reproduce the issue.
test_model0.py contains the following model, which has 3 layers of 3x3 convolution:

import torch
import torch.nn as nn

class TestModel(nn.Module):
    def __init__(self, n_feats, kernel_size):
        super(TestModel, self).__init__()
        # Three identical convolutions with "same" padding, so the channel
        # count and spatial size stay constant through the network.
        self.conv1 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
        self.conv2 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
        self.conv3 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        return x

And I create the ONNX model as follows:

size_x = 1920
size_y = 1080
kernel_size = 3
n_feats = 16

model = TestModel(n_feats, kernel_size).eval()
ar = torch.randn(1, n_feats, size_x, size_y)  # dummy NCHW input

torch.onnx.export(model, ar, "test_model0.onnx", verbose=False)
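
As an optional sanity check (this assumes onnxruntime is installed; it is not part of the attached script), the exported model can be run once to confirm the input/output shapes:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("test_model0.onnx")
inp = sess.get_inputs()[0]
x = np.random.randn(1, 16, 1920, 1080).astype(np.float32)
(y,) = sess.run(None, {inp.name: x})
print(x.shape, "->", y.shape)  # "same" padding, so the shapes should match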

The total computational complexity of the convolutions is 0.0287 TFLOP (FP16), computed as follows:

computational complexity = 1920 (width) × 1080 (height) × 3 × 3 (3x3 kernel) × 2 (multiply-add) × 16 (in_channels) × 16 (out_channels) × 3 (layers) / 10^12 ≈ 0.0287 TFLOP (FP16)
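
The same count in Python, using the numbers above:

width, height = 1920, 1080
k, cin, cout, layers = 3, 16, 16, 3
flop = width * height * k * k * 2 * cin * cout * layers  # multiply-add = 2 ops
print(flop / 1e12)  # ≈ 0.0287 TFLOP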

When I tested on the Jetson AGX Orin Developer Kit in Jetson AGX Orin 32GB emulation mode, the peak processing speed of the Tensor Cores is 47.3 FP16 TFLOPS, because I set the NVP model to 40W:

Processing speed on Tensor Core = 54 FP16 TFLOPS × 816 MHz / 930 MHz = 47.3 FP16 TFLOPS

When I run the TRT engine as follows, the GPU Compute Time is around 9 ms:

# creating test_model0.trt
trtexec --buildOnly --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --precisionConstraints=obey --layerPrecisions=*:fp16 --layerOutputTypes=*:fp16 --sparsity=disable --onnx=test_model0.onnx --saveEngine=test_model0.trt --verbose

# Run test_model0.trt
trtexec --loadEngine=test_model0.trt --verbose

[05/25/2023-18:31:49] [I] GPU Compute Time: min = 8.94312 ms, max = 16.5773 ms, mean = 9.02886 ms, median = 8.97861 ms, percentile(90%) = 8.98486 ms, percentile(95%) = 8.98621 ms, percentile(99%) = 10.574 ms
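
If you want to reproduce this timing from Python instead of trtexec, here is a minimal sketch (it assumes the TensorRT 8.x Python bindings and pycuda available on JetPack 5; the warm-up and iteration counts are my own choices, and the wall-clock mean includes kernel-launch overhead, so it will read slightly higher than trtexec's GPU Compute Time):

import time
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # initializes a CUDA context on import
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)
with open("test_model0.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One device buffer per binding, sized from the static engine shapes.
bindings = []
for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.random.randn(*shape).astype(dtype)
    dev = cuda.mem_alloc(host.nbytes)
    cuda.memcpy_htod(dev, host)
    bindings.append(int(dev))

for _ in range(10):  # warm up
    context.execute_v2(bindings)

n = 100
t0 = time.time()
for _ in range(n):  # execute_v2 runs synchronously
    context.execute_v2(bindings)
print("mean latency: %.3f ms" % ((time.time() - t0) / n * 1e3))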

So I calculate the efficiency (about 6.7%) as follows.

Execution Efficiency = 0.0287 (TFLOP) / (47.3 (TFLOPS) × 0.009 (s)) × 100 ≈ 6.7 (%)
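
Putting the numbers together in Python, with the mean GPU Compute Time from the log above:

tflop = 0.0287            # total convolution work, TFLOP
peak = 54.0 * 816 / 930   # derated Tensor Core peak (≈ 47.3 TFLOPS as above)
time_s = 9.02886e-3       # mean GPU Compute Time from trtexec, seconds
print(tflop / (peak * time_s) * 100)  # ≈ 6.7 %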

Please use test_model0_exec.sh to reproduce the result.

And please give me any advice on how to optimize this model.

Regards,
hiro

test_model0.py (826 Bytes)
test_model0_exec.sh (404 Bytes)