Jetson AGX Xavier INT8 Performance

Hi, I’m running inference with a CV image-detection network on Xavier in INT8 at batch size 1. I’m converting an ONNX model to TensorRT using the sample function provided. When I ran inference through nvprof, I saw roughly the same performance for the FP16 and INT8 versions, and I also noticed an incredibly high number of memcpy calls in the INT8 version (though the total times were the same). INT8 is supported on Xavier, so why don’t I see any speedup? Using TensorRT, CUDA 10.
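(For reference, a comparison like this can also be run with TensorRT's bundled trtexec tool instead of a custom harness. The file name `model.onnx` below is a placeholder for your own model; the path is where JetPack installs trtexec.)

```shell
# Build and time an FP16 engine from the ONNX model:
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16

# Build and time an INT8 engine. Without a calibration cache trtexec
# uses placeholder scales, so the output is only valid for timing:
/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --int8
```

Comparing the reported latencies from these two runs isolates the precision change from any differences in your own inference harness.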

Clarification appreciated, thank you.


Have you maximized the CPU/GPU clocks first?

sudo ./
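(On JetPack the usual sequence is sketched below; the exact script name and location vary slightly between releases, so treat this as an assumption to verify against your install.)

```shell
# Select the maximum power profile (mode 0 = MAXN on AGX Xavier)
sudo nvpmodel -m 0

# Pin CPU/GPU/EMC clocks to their maximum
sudo jetson_clocks          # older releases ship this as jetson_clocks.sh
```

Without this, DVFS can keep the clocks low at batch size 1 and mask any precision speedup.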

Could you share with us which model you are testing?
Here is our benchmark result for the Jetson Xavier:

You can first check whether your results are similar to ours.

I have maximized clocks.

The benchmark results for ResNet-50 using Caffe models and bs=1 look fine: I get ~2 ms for INT8 and ~3 ms for FP16. It says there is no ONNX model support for INT8.

I cannot share the model right now, but I went and checked the GPU trace, and it seems I’m getting a lot of tensor conversions (cuInt8::nchwToNcqhw4 and vice versa), about four times as many in INT8 as in FP16. Is there a function that forces the conversion for INT8? Each conversion takes about as long as the computation itself.
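(For context on what that kernel is doing: NCQHW4 is a channel-vectorized layout in which channels are packed in groups of four, so INT8 kernels can load four channel values per 32-bit word. Below is a rough NumPy illustration of such a repack; the padding and axis order are an assumption for illustration, not TensorRT's actual implementation.)

```python
import numpy as np

def nchw_to_ncqhw4(x):
    """Repack an NCHW tensor into NC/4HW4: channels grouped in fours,
    zero-padded to a multiple of 4, with the group of 4 innermost."""
    n, c, h, w = x.shape
    cq = (c + 3) // 4                       # number of 4-channel groups (Q)
    padded = np.zeros((n, cq * 4, h, w), dtype=x.dtype)
    padded[:, :c] = x                       # zero-pad channels up to 4*Q
    # (N, Q*4, H, W) -> (N, Q, 4, H, W) -> (N, Q, H, W, 4)
    return padded.reshape(n, cq, 4, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 3 * 2 * 2, dtype=np.int8).reshape(2, 3, 2, 2)
y = nchw_to_ncqhw4(x)
print(y.shape)  # (2, 1, 2, 2, 4): 3 channels padded to one group of 4
```

Every time a layer runs in a precision/format that its neighbor does not share, TensorRT has to insert a reformat like this in both directions, which is why a plugin or unsupported layer in the middle of an INT8 network can multiply the conversion count.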

I see the conversion between some cuDNN calls (trt_volta_int8x4_icudnn_int8x4) and this (ZN6thrust8cuda_cub4core13_kernel_agentINS0_14__parallel_for16ParallelForAgentINS0_11__transform17unary_transform_fIPKfPfNS5_14no_stencil_tagEZN21FancyActivationPlugin9doEnqueueIfEEiiPKPKvPPvSH_P11CUstream_stEUlfE_NS5_21always_true_predicateEEElEESN_lEEvT0_T1), which looks like an input enqueue? Could I solve the problem by restructuring my data inputs?

Thanks for the prompt response.


For the Caffe framework, you can feed the caffemodel into TensorRT directly.
It’s not required to convert it into an ONNX model.

Please try building the TensorRT engine from the caffemodel first. It may give you better performance.
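(A quick way to try that with trtexec; the file names `deploy.prototxt` / `model.caffemodel` and the output blob name `prob` below are placeholders for your own network:)

```shell
/usr/src/tensorrt/bin/trtexec \
    --deploy=deploy.prototxt \
    --model=model.caffemodel \
    --output=prob \
    --int8
```

If the Caffe path runs INT8 without the reformat kernels you saw in the trace, that points at the ONNX import path (or a plugin it inserts) as the source of the extra conversions.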