Dear @AastaLLL,

How do you measure the efficiency?

Do you get it from the GPU utilization of tegrastats?

I calculate the efficiency as follows.

Execution efficiency = (The total computational complexity of Convolution) / (processing time * processing speed) * 100

I have attached a model and a script file so you can reproduce the issue.

test_model0.py contains the following model, which has three layers of 3x3 convolution:

```
import torch
import torch.nn as nn

class TestModel(nn.Module):
    def __init__(self, n_feats, kernel_size):
        super(TestModel, self).__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
        self.conv2 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
        self.conv3 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        return x
```

And I create the ONNX model as follows:

```
size_x = 1920
size_y = 1080
kernel_size = 3
n_feats = 16
model = TestModel(n_feats, kernel_size).eval()
ar = torch.randn(1, n_feats, size_x, size_y)
torch.onnx.export(model, ar, "test_model0.onnx", verbose=False)
```

The total computational complexity of the convolutions is **0.0287 FP16 tera floating-point operations**, calculated as follows:

computational complexity = 1920 (width) × 1080 (height) × 3 × 3 (3x3 kernel) × 2 (multiply-add) × 16 (in_channels) × 16 (out_channels) × 3 (layers) / 10^12 ≈ 0.0287 FP16 TFLOPs
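The same count can be reproduced with a short Python sketch; the numbers simply mirror the model definition above:

```
# Recompute the convolution FLOP count from the model parameters.
size_x, size_y = 1920, 1080   # input spatial size
kernel = 3                    # 3x3 convolution
in_ch = out_ch = 16           # n_feats
layers = 3

# One multiply-add counted as 2 operations, per output pixel,
# per kernel tap, per input/output channel pair, per layer.
flops = size_x * size_y * kernel * kernel * 2 * in_ch * out_ch * layers
print(f"{flops / 1e12:.4f} FP16 TFLOPs")  # → 0.0287 FP16 TFLOPs
```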

When I tested on a Jetson AGX Orin Developer Kit in Jetson AGX Orin 32GB emulation mode, the processing speed of the Tensor Cores is **47.3 FP16 TFLOPS**, because I set the NVP model clock to 40W:

Processing speed on Tensor Cores = 54 FP16 TFLOPS × 816 MHz / 930 MHz ≈ 47.3 FP16 TFLOPS
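As a sanity check, the clock scaling can be written out in a few lines; the 54 TFLOPS peak and the 930 MHz maximum clock are the figures quoted above:

```
peak_fp16_tflops = 54.0   # FP16 Tensor Core peak at the 930 MHz maximum GPU clock
clock_40w_mhz = 816       # GPU clock under the 40W NVP model
clock_max_mhz = 930

effective_tflops = peak_fp16_tflops * clock_40w_mhz / clock_max_mhz
print(f"{effective_tflops:.1f} FP16 TFLOPS")  # ≈ 47.4 (truncated to 47.3 above)
```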

When I run the TensorRT engine as follows, the GPU compute time is around **9 ms**.

```
# creating test_model0.trt
trtexec --buildOnly --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --precisionConstraints=obey --layerPrecisions=*:fp16 --layerOutputTypes=*:fp16 --sparsity=disable --onnx=test_model0.onnx --saveEngine=test_model0.trt --verbose
# Run test_model0.trt
trtexec --loadEngine=test_model0.trt --verbose
```

[05/25/2023-18:31:49] [I] GPU Compute Time: **min = 8.94312** ms, max = 16.5773 ms, **mean = 9.02886 ms**, median = 8.97861 ms, percentile(90%) = 8.98486 ms, percentile(95%) = 8.98621 ms, percentile(99%) = 10.574 ms
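In case it helps automate the measurement, the mean compute time can be pulled out of the trtexec log with a small regex; the log string below is copied from the run above:

```
import re

log_line = ("[05/25/2023-18:31:49] [I] GPU Compute Time: min = 8.94312 ms, "
            "max = 16.5773 ms, mean = 9.02886 ms, median = 8.97861 ms")

# Extract the mean GPU compute time in milliseconds from the log line.
mean_ms = float(re.search(r"mean = ([\d.]+) ms", log_line).group(1))
print(mean_ms)  # → 9.02886
```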

So I calculate the efficiency (**about 6.74%**) as follows:

Execution efficiency = 0.0287 / (47.3 (TFLOPS) × 0.009 (s)) × 100 ≈ **6.74 (%)**
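Putting the three numbers together (the FLOP count, the effective throughput, and the measured compute time from above):

```
total_tflop = 0.0287        # total FP16 TFLOPs of the three convolutions
effective_tflops = 47.3     # Tensor Core throughput under the 40W NVP model
compute_time_s = 0.009      # measured GPU compute time

efficiency_pct = total_tflop / (effective_tflops * compute_time_s) * 100
print(f"{efficiency_pct:.2f}%")  # → 6.74%
```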

Please use test_model0_exec.sh to reproduce the results.

And please give me any advice on how to optimize this model.

Regards,

hiro

test_model0.py (826 Bytes)

test_model0_exec.sh (404 Bytes)