Jetson Thor - INT8 quantization show no performance gain over FP16 (2)

Hello,

I opened a post in December of last year about the lack of performance increase between FP32/FP16 and INT8/FP4 on a Thor (Jetson Thor - INT8 quantization show no performance gain over FP16).

Since then, the topic has been closed for lack of activity (because of the New Year holidays). Since a new JetPack version has been released, I was wondering whether there is any news regarding this issue (even if only to say that it isn’t one). I don’t see anything about it in the release notes. I’m asking to find out whether it is worth running my tests again or whether no changes are to be expected.

The latest suggestion in that topic was to increase the batch size, which I did until it no longer brought any performance gain. Here is the last message I wrote about this in the other post:

I can’t handle those batch sizes for a ResNet50 or 101. Furthermore, my use case is strictly limited to batch size 1 for real time. However, just out of curiosity about this behavior, I increased the batch size to 32 and observed the exact same results as before: no improvement between FP16 and INT8/best.

I did some analysis with trex and I see that most of my latency comes from the INT8 layers (models generated with --best, --exportProfile, --exportLayerInfo).

Half of my model gets converted to INT8, and the INT8 layers represent 60% of the total latency:

[Image: green and orange pie chart]
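For readers who want to reproduce this kind of breakdown without trex, here is a minimal sketch of the aggregation. It assumes the per-layer profile is a list of records with `name` and `averageMs` fields (which is how I understand the `--exportProfile` JSON to look, possibly with a leading `{"count": N}` record), and it takes the layer-to-precision mapping as a plain dictionary. The function name, the mapping, and the numbers are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def latency_share_by_precision(profile, precision_of):
    """Return {precision: fraction of the summed average latency}.

    profile      -- list of dicts; entries with "name" and "averageMs" are layers
                    (assumed layout, check it against your own exported file)
    precision_of -- hypothetical mapping from layer name to a precision label
    """
    totals = defaultdict(float)
    for entry in profile:
        # Skip records that carry no per-layer timing, e.g. {"count": N}.
        if "name" not in entry or "averageMs" not in entry:
            continue
        precision = precision_of.get(entry["name"], "unknown")
        totals[precision] += entry["averageMs"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {precision: ms / grand_total for precision, ms in totals.items()}


# Made-up numbers, for illustration only.
example_profile = [
    {"count": 100},
    {"name": "conv1", "averageMs": 0.30},
    {"name": "conv2", "averageMs": 0.30},
    {"name": "fc", "averageMs": 0.40},
]
example_precision = {"conv1": "int8", "conv2": "int8", "fc": "fp16"}

shares = latency_share_by_precision(example_profile, example_precision)
for precision, share in sorted(shares.items()):
    print(f"{precision}: {share:.0%} of total latency")
```

With a real engine, the mapping would have to be derived from the exported layer information, and how precision is reported there may differ between TensorRT versions.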

And the converted layers are slower than their FP16 counterparts:

[Image: graph with different colored bars]

I’m curious as to what is happening, or whether I am doing something wrong, even though my use case seems pretty standard and straightforward.

Cordially,

Hi,

Could you check the comment below to see whether it is a similar use case?

In their sample, softmax is done in FP32 so the speedup is limited.
You can check this by adding the arguments below:

$ ... --dumpLayerInfo --dumpProfile --profilingVerbosity=detailed --separateProfileRun --useCudaGraph --noDataTransfers

Thanks.

Hello,

Regarding the suggestion about softmax running in FP32: I checked this point, but in the case of ResNet50/101 it doesn’t seem to explain the behavior I’m seeing. I am pretty sure ResNet does not have internal softmax layers, and the final softmax should be extremely cheap compared to the convolution layers. Even if it stays in FP32, it shouldn’t account for the fact that INT8 layers represent ~60% of total latency and are consistently slower than their FP16 counterparts. Furthermore, from what I’ve seen, none of the softmax layers run in FP32: they all run in FP16 or get quantized immediately afterwards (if I didn’t miss anything). Convolutions (as expected) and casts are the main sources of latency in my model. I attached the trace so you can check on your side as well.

More importantly, as shown in my graph above, even the INT8 GEMM/MatMul layers themselves are not faster than their FP16 versions in my profiling, and some of them execute slower by a factor of almost 2. I find this odd, since GEMM is the main operation that should benefit from INT8 tensor cores. Maybe the conversion layers scattered throughout the network are the main reason for that?
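To make the per-layer comparison concrete, here is a minimal sketch of how the slowdown factor could be computed from two sets of per-layer timings. It assumes the timings have already been loaded into dictionaries keyed by layer name and that layer names match between the FP16 and INT8 builds, which is not guaranteed since fusion can differ between engines. The function name and all numbers are hypothetical.

```python
def slowdown_ratios(fp16_ms, int8_ms):
    """Return [(layer, int8_time / fp16_time)] sorted from worst to best.

    Only layers present in both builds are compared; a ratio above 1.0
    means the INT8 version of that layer is slower than the FP16 one.
    """
    common = fp16_ms.keys() & int8_ms.keys()
    ratios = {
        name: int8_ms[name] / fp16_ms[name]
        for name in common
        if fp16_ms[name] > 0  # avoid dividing by a zero timing
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up per-layer average latencies in milliseconds, for illustration only.
example_fp16 = {"gemm_a": 0.10, "conv_b": 0.20, "only_in_fp16": 0.05}
example_int8 = {"gemm_a": 0.20, "conv_b": 0.10, "cast_x": 0.03}

ranked = slowdown_ratios(example_fp16, example_int8)
for layer, ratio in ranked:
    status = "slower" if ratio > 1.0 else "faster or equal"
    print(f"{layer}: INT8/FP16 = {ratio:.2f} ({status})")
```

Layers that exist in only one build (such as inserted cast or reformat layers) are left out of the ratio on purpose; their latency would need to be accounted for separately.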

trace.txt (281.9 KB)

Cordially,

Hi,

Could you try the model with CUDA 13.1 + TensorRT 10.15.1 GA?

CUDA 13.1: CUDA Toolkit 13.1 Update 1 Downloads | NVIDIA Developer

TensorRT 10.15.x: TensorRT 10.x Download | NVIDIA Developer

Thanks.

OK, I will check and keep you informed.

Cordially,

Hello,

I ran the tests on the latest JetPack version with CUDA 13.1 + TensorRT 10.15.1 GA. No change: same issue. I’ll dig deeper when I have time, but right now Jetson Thor seems like a dead end.

Have you at least been able to reproduce the issue on your side, or is it only on my end? My use case seems pretty straightforward:

  • default weights on an RTDETR
  • INT8 quantization (implicit or explicit; it doesn’t matter)
  • no improvement, or even worse performance, compared to the FP16 counterpart with --best or --int8
  • on analysis, converted layers are sometimes twice as slow

I find the result in my graph above quite concerning. But maybe I’m doing something wrong or misunderstanding something. Please let me know.

Cordially,