We have experimented with running inference on our ResNet-like TensorFlow model (exported to UFF) using TensorRT at fp32, fp16, and int8 precision. Benchmarking inference latency at each precision, we see about a 60% decrease going from fp32 to fp16, which is great and makes sense, since theoretical FLOPS roughly double. Going from fp16 to int8, however, we only observed a 10-15% decrease in inference latency. Comparing theoretical peak throughput, fp16 should do about 11 TFLOPS and int8 about 22 TOPS (https://developer.nvidia.com/embedded/faq#xavier-performance), so we expected a similarly large improvement in inference time for int8.
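To make the expectation concrete, here is the back-of-envelope arithmetic we are working from. The fp16 and int8 peak numbers are from the NVIDIA FAQ linked above; the fp32 figure is an assumption (half the fp16 rate, per the "about 2x" reasoning), and of course real kernels rarely hit theoretical peak:

```python
# Back-of-envelope check: if inference were purely compute-bound,
# latency should scale inversely with peak math throughput.
fp32_tflops = 5.5   # assumption: fp16 rate / 2, for illustration only
fp16_tflops = 11.0  # per NVIDIA Xavier FAQ
int8_tops = 22.0    # per NVIDIA Xavier FAQ

# Expected fractional latency reduction at each precision step
fp32_to_fp16 = 1 - fp32_tflops / fp16_tflops
fp16_to_int8 = 1 - fp16_tflops / int8_tops

print(f"expected fp32->fp16 latency reduction: {fp32_to_fp16:.0%}")  # 50%
print(f"expected fp16->int8 latency reduction: {fp16_to_int8:.0%}")  # 50%
```

We observed ~60% for fp32→fp16 (roughly in line with, even better than, this estimate) but only 10-15% for fp16→int8, which suggests the int8 path is limited by something other than peak math throughput (memory bandwidth, layers falling back to fp16/fp32, etc.).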
We will continue to benchmark and profile our inference app, but I wanted to ask whether NVIDIA (or anyone else) can share benchmarks comparing fp16 vs. int8 performance using TensorRT on Xavier, and whether NVIDIA can offer any insight or advice on the lack of speedup going from fp16 to int8.