Since then, the topic has been closed for lack of activity (because of the New Year holidays). Since a new JetPack version has been released, I was wondering whether there is any news regarding this issue (even if only to say that it isn’t one). I don’t see anything about it in the release notes. I’m asking to find out whether it’s worth running my tests again or whether no changes are to be expected.
The latest suggestion in that topic was to increase the batch size, which I did up to the point where there was no further performance gain. Here is the last message I wrote about this in the other post:
I can’t handle those batch sizes for a ResNet50 or 101. Furthermore, my use case is strictly limited to batch size 1 for real-time inference. However, just out of curiosity about this behavior, I increased the batch size to 32 and observed exactly the same results as before: no performance gain going from FP16 to INT8/best.
I did some analysis with trex and I see that most of my latency comes from the INT8 layers (models generated with --best, --exportProfile, --exportLayerInfo).
Half of my model gets converted to INT8, and INT8 represents 60% of the total latency:
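For anyone who wants to reproduce this kind of breakdown without trex, here is a minimal sketch. It assumes the per-layer profile is a list of entries with `name` and `averageMs` fields (this appears to match the JSON written by --exportProfile, but check your own file), and that a `layer_precision` mapping from layer name to precision has already been built from the layer info. That mapping, the function name, and all numbers below are hypothetical and only there for illustration.

```python
from collections import defaultdict


def latency_share_by_precision(profile, layer_precision):
    """Return {precision: fraction of the summed average latency}."""
    totals = defaultdict(float)
    for entry in profile:
        # Skip header-style entries that carry no layer name (assumed format).
        if "name" not in entry:
            continue
        # Layers missing from the mapping are grouped under "unknown".
        precision = layer_precision.get(entry["name"], "unknown")
        totals[precision] += entry["averageMs"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {p: t / grand_total for p, t in totals.items()}


# Made-up profile data, not measurements from a real engine.
profile = [
    {"count": 100},
    {"name": "conv1", "averageMs": 0.6},
    {"name": "conv2", "averageMs": 0.3},
    {"name": "softmax", "averageMs": 0.1},
]
# Hypothetical mapping; in practice it would come from the layer info file.
layer_precision = {"conv1": "INT8", "conv2": "FP16", "softmax": "FP16"}

shares = latency_share_by_precision(profile, layer_precision)
for precision, share in sorted(shares.items()):
    print(f"{precision}: {share:.0%} of total latency")
```

The shares are computed over the sum of per-layer averages, so they describe where GPU time goes inside the engine rather than end-to-end latency.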
Regarding the suggestion about softmax running in FP32: I checked this point, but in the case of ResNet50/101 it doesn’t seem to explain the behavior I’m seeing. I am pretty sure ResNet does not have internal softmax layers, and the final softmax should be extremely cheap compared to the convolution layers. Even if it stays in FP32, it shouldn’t account for the fact that INT8 layers represent ~60% of total latency and are consistently slower than their FP16 counterparts. Furthermore, from what I’ve seen, none of the Softmax layers run in FP32: they all run in FP16 or get quantized immediately afterwards (unless I missed something). Convolutions (as expected) and casts are the main sources of latency in my model. I attached the trace so you can check on your side as well.
More importantly, as shown in my graph above, even the INT8 GEMM/MatMul layers themselves are not faster than their FP16 versions in my profiling, and they sometimes execute slower by a factor of almost 2. I find this odd, since GEMM is the main operation that should benefit from INT8 tensor cores. Maybe the conversion layers scattered throughout the model are the main reason for that?
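To make the per-layer comparison concrete, here is a rough sketch of how two profiles (one from the FP16 engine, one from the INT8/best engine) could be compared. It assumes both profiles are lists of entries with `name` and `averageMs` fields, and that a layer keeps the same name in both engines. That second assumption often does not hold, because fusion and inserted casts change layer names, so only layers found in both profiles are compared. The function name and the numbers are made up for illustration.

```python
def slowdown_by_layer(fp16_profile, int8_profile):
    """Return {layer name: int8_ms / fp16_ms} for layers present in both profiles."""
    fp16 = {e["name"]: e["averageMs"] for e in fp16_profile if "name" in e}
    int8 = {e["name"]: e["averageMs"] for e in int8_profile if "name" in e}
    ratios = {}
    # Only layers that exist under the same name in both engines are comparable.
    for name in fp16.keys() & int8.keys():
        if fp16[name] > 0:
            ratios[name] = int8[name] / fp16[name]
    return ratios


# Made-up numbers, not measurements from a real engine.
fp16_profile = [
    {"name": "matmul_1", "averageMs": 0.10},
    {"name": "conv_1", "averageMs": 0.20},
]
int8_profile = [
    {"name": "matmul_1", "averageMs": 0.19},
    {"name": "conv_1", "averageMs": 0.15},
    {"name": "cast_1", "averageMs": 0.05},  # exists only in the INT8 engine
]

ratios = slowdown_by_layer(fp16_profile, int8_profile)
# A ratio above 1.0 means the layer got slower after conversion to INT8.
slower = sorted(name for name, ratio in ratios.items() if ratio > 1.0)
print("Layers slower in INT8:", slower)
```

Layers that appear in only one engine (such as the extra casts) are left out of the ratios, so their cost has to be accounted for separately when comparing total latency.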
Tests have been made on the latest JetPack version with CUDA 13.1 + TensorRT 10.15.1 GA. No changes, same issue. I’ll dig deeper when I have time, but right now Jetson Thor seems like a dead end.
Have you at least been able to reproduce the issue on your side, or is it only on my end? My use case seems pretty straightforward:
default weights on an RTDETR
INT8 quantization (implicit or explicit; it doesn’t matter)
no improvement, or even worse performance, compared to the FP16 counterpart with --best or --int8
on analysis, the converted layers are sometimes twice as slow
I find the result of my graph above quite concerning. But maybe I’m doing something wrong or misunderstanding something. Please let me know.