Dear @AastaLLL,

How do you measure the efficiency?

Do you get it from the GPU utilization of tegrastats?

I calculate the efficiency as follows.

Execution efficiency = (The total computational complexity of Convolution) / (processing time * processing speed) * 100

I have attached a model and a script file so you can reproduce the issue.

test_model0.py contains the following model, which has three layers of 3x3 convolution:

```
import torch
import torch.nn as nn

class TestModel(nn.Module):
    def __init__(self, n_feats, kernel_size):
        super(TestModel, self).__init__()
        self.conv1 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
        self.conv2 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)
        self.conv3 = nn.Conv2d(n_feats, n_feats, kernel_size, padding=kernel_size//2, bias=True)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        return x
```

And I create the ONNX model as follows:

```
size_x = 1920
size_y = 1080
kernel_size = 3
n_feats = 16
model = TestModel(n_feats, kernel_size).eval()
ar = torch.randn(1, n_feats, size_x, size_y)
torch.onnx.export(model, ar, "test_model0.onnx", verbose=False)
```

The total computational complexity of the convolutions is **0.0287 FP16 tera floating-point operations**, calculated as follows:

computational complexity = 1920 (width) × 1080 (height) × 3 × 3 (3x3 kernel) × 2 (multiply-add) × 16 (in_channels) × 16 (out_channels) × 3 (layers) / 10^12 ≈ 0.0287 FP16 TFLOPs
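The same count can be reproduced with a short Python sketch; the numbers simply mirror the model definition above:

```
# Recompute the convolution FLOP count from the model parameters.
size_x, size_y = 1920, 1080   # input spatial size
kernel = 3                    # 3x3 convolution
in_ch = out_ch = 16           # n_feats
layers = 3

# One multiply-add counted as 2 operations, per output pixel,
# per kernel tap, per input/output channel pair, per layer.
flops = size_x * size_y * kernel * kernel * 2 * in_ch * out_ch * layers
print(f"{flops / 1e12:.4f} FP16 TFLOPs")  # → 0.0287 FP16 TFLOPs
```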

When I tested on a Jetson AGX Orin Developer Kit in Jetson AGX Orin 32GB emulation mode, the processing speed of the Tensor Cores is **47.3 FP16 TFLOPS**, because I set the NVP model clock to 40W:

Processing speed on Tensor Cores = 54 FP16 TFLOPS × 816 MHz / 930 MHz ≈ 47.3 FP16 TFLOPS
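As a sanity check, the clock scaling can be written out in a few lines; the 54 TFLOPS peak and the 930 MHz maximum clock are the figures quoted above:

```
peak_fp16_tflops = 54.0   # FP16 Tensor Core peak at the 930 MHz maximum GPU clock
clock_40w_mhz = 816       # GPU clock under the 40W NVP model
clock_max_mhz = 930

effective_tflops = peak_fp16_tflops * clock_40w_mhz / clock_max_mhz
print(f"{effective_tflops:.1f} FP16 TFLOPS")  # ≈ 47.4 (truncated to 47.3 above)
```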

When I run the TensorRT engine as follows, the GPU compute time is around **9 ms**.

```
# creating test_model0.trt
trtexec --buildOnly --fp16 --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --precisionConstraints=obey --layerPrecisions=*:fp16 --layerOutputTypes=*:fp16 --sparsity=disable --onnx=test_model0.onnx --saveEngine=test_model0.trt --verbose
# Run test_model0.trt
trtexec --loadEngine=test_model0.trt --verbose
```

[05/25/2023-18:31:49] [I] GPU Compute Time: **min = 8.94312** ms, max = 16.5773 ms, **mean = 9.02886 ms**, median = 8.97861 ms, percentile(90%) = 8.98486 ms, percentile(95%) = 8.98621 ms, percentile(99%) = 10.574 ms
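In case it helps automate the measurement, the mean compute time can be pulled out of the trtexec log with a small regex; the log string below is copied from the run above:

```
import re

log_line = ("[05/25/2023-18:31:49] [I] GPU Compute Time: min = 8.94312 ms, "
            "max = 16.5773 ms, mean = 9.02886 ms, median = 8.97861 ms")

# Extract the mean GPU compute time in milliseconds from the log line.
mean_ms = float(re.search(r"mean = ([\d.]+) ms", log_line).group(1))
print(mean_ms)  # → 9.02886
```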

So I calculate the efficiency (**about 6.74%**) as follows:

Execution efficiency = 0.0287 / (47.3 (TFLOPS) × 0.009 (s)) × 100 ≈ **6.74 (%)**
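Putting the three numbers together (the FLOP count, the effective throughput, and the measured compute time from above):

```
total_tflop = 0.0287        # total FP16 TFLOPs of the three convolutions
effective_tflops = 47.3     # Tensor Core throughput under the 40W NVP model
compute_time_s = 0.009      # measured GPU compute time

efficiency_pct = total_tflop / (effective_tflops * compute_time_s) * 100
print(f"{efficiency_pct:.2f}%")  # → 6.74%
```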

Please use test_model0_exec.sh to reproduce the results.

And please give me any advice on how to optimize this model.

Regards,

hiro

test_model0.py (826 Bytes)

test_model0_exec.sh (404 Bytes)