Jetson AGX Xavier INT8 Performance

Hi, I’m running inference with a CV image-detection network on Xavier in INT8 at batch size 1. I’m converting an ONNX model to TensorRT using the sample function provided. When I ran inference through nvprof, I saw roughly the same performance for the FP16 and INT8 versions, and I also noticed an incredibly high number of memcpy calls in the INT8 version (though the total times were the same). INT8 is supported on Xavier, so why don’t I see any speedup? Using TensorRT, CUDA 10.
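(For reference, a comparison like this can also be run with TensorRT's bundled trtexec tool instead of a custom harness. The file name `model.onnx` below is a placeholder for your own model; the path is where JetPack installs trtexec.)

```shell
# Build and time an FP16 engine from the ONNX model:
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16

# Build and time an INT8 engine. Without a calibration cache trtexec
# uses placeholder scales, so the output is only valid for timing:
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --int8
```

Comparing the reported latencies from these two runs isolates the precision change from any differences in your own inference harness.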

Clarification appreciated, thank you.


Have you maximized the CPU/GPU clocks first?

sudo ./
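(On JetPack the usual sequence is sketched below; the exact script name and location vary slightly between releases, so treat this as an assumption to verify against your install.)

```shell
# Select the maximum power profile (mode 0 = MAXN on AGX Xavier)
sudo nvpmodel -m 0

# Pin CPU/GPU/EMC clocks to their maximum
sudo jetson_clocks          # older releases ship this as jetson_clocks.sh
```

Without this, DVFS can keep the clocks low at batch size 1 and mask any precision speedup.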

Could you share with us which model you are testing?
Here is our benchmark result for the Jetson Xavier:

You can first check whether your results are similar to ours.

I have maximized clocks.

The benchmark results for ResNet-50 using Caffe models and bs=1 look fine: I get ~2 ms for INT8 and ~3 ms for FP16. It says there is no ONNX model support for INT8.

I cannot share the model right now, but I went and checked the GPU trace, and it seems I’m getting a lot of tensor conversions (cuInt8::nchwToNcqhw4 and vice versa), about four times as many in INT8 as in FP16. Is there a function that forces the conversion for INT8? Each conversion takes about as long as the computation itself.
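(For context on what that kernel is doing: NCQHW4 is a channel-vectorized layout in which channels are packed in groups of four, so INT8 kernels can load four channel values per 32-bit word. Below is a rough NumPy illustration of such a repack; the padding and axis order are an assumption for illustration, not TensorRT's actual implementation.)

```python
import numpy as np

def nchw_to_ncqhw4(x):
    """Repack an NCHW tensor into NC/4HW4: channels grouped in fours,
    zero-padded to a multiple of 4, with the group of 4 innermost."""
    n, c, h, w = x.shape
    cq = (c + 3) // 4                       # number of 4-channel groups (Q)
    padded = np.zeros((n, cq * 4, h, w), dtype=x.dtype)
    padded[:, :c] = x                       # zero-pad channels up to 4*Q
    # (N, Q*4, H, W) -> (N, Q, 4, H, W) -> (N, Q, H, W, 4)
    return padded.reshape(n, cq, 4, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 3 * 2 * 2, dtype=np.int8).reshape(2, 3, 2, 2)
y = nchw_to_ncqhw4(x)
print(y.shape)  # (2, 1, 2, 2, 4): 3 channels padded to one group of 4
```

Every time a layer runs in a precision/format that its neighbor does not share, TensorRT has to insert a reformat like this in both directions, which is why a plugin or unsupported layer in the middle of an INT8 network can multiply the conversion count.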

I see the conversion between some cuDNN calls (trt_volta_int8x4_icudnn_int8x4) and this (ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fIPKfPfNS5_14no_stencil_tagEZN21FancyActivationPlugin9doEnqueueIfEEiiPKPKvPPvSH_P11CUstream_stEUlfE_NS5_21always_true_predicateEEElEESN_lEEvT0_T1), which looks like an input enqueue? Could I solve the problem by restructuring my data inputs?

Thanks for the prompt response.


For the Caffe framework, you can feed the caffemodel into TensorRT directly.
It’s not required to convert it into an ONNX model.

Please try building the TensorRT engine from the caffemodel first. It may give you better performance.
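(A quick way to try that with trtexec; the file names `deploy.prototxt` / `model.caffemodel` and the output blob name `prob` below are placeholders for your own network:)

```shell
/usr/src/tensorrt/bin/trtexec \
    --deploy=deploy.prototxt \
    --model=model.caffemodel \
    --output=prob \
    --int8
```

If the Caffe path runs INT8 without the reformat kernels you saw in the trace, that points at the ONNX import path (or a plugin it inserts) as the source of the extra conversions.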