Performance of TensorRT conversion of ResNet50 on Quadro P6000

I am running your image_classification example from your docker image nvcr.io/nvidia/tensorflow:19.11-tf2-py3 as follows:

export CUDA_VISIBLE_DEVICES="0"
python image_classification.py \
    --data_dir /mytf/imagenet \
    --input_saved_model_dir /mytf/1 \
    --output_saved_model_dir /mytf/temp \
    --mode validation \
    --num_warmup_iterations 50 \
    --use_trt \
    --optimize_offline \
    --precision INT8 \
    --max_workspace_size $((2**32)) \
    --batch_size 128 \
    --target_duration 10 \
    --calib_data_dir /mytf/imagenet \
    --num_calib_inputs 128

The TensorRT conversion completes successfully, but I see no speedup relative to FP32. Upon closer examination of the generated model, the graph nodes retain FP32 types, so the result is not surprising. Given that this is running on a compute capability 6.1 GPU (Quadro P6000), why did the converted model not use INT8 as requested above? How do I demonstrate the INT8 performance on this model that is described in your documentation?

Hi,

Specifying the precision for a network defines the minimum acceptable precision for the application. Higher-precision kernels may still be chosen if they are faster for a particular set of kernel parameters, or if no lower-precision kernel exists.
You can set the builder configuration flag BuilderFlag::kSTRICT_TYPES to force the network or layer precision, although this may not give optimal performance. Use of this flag is recommended only for debugging purposes.
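For reference, here is a minimal sketch of how these flags can be set through the TensorRT C++ builder API; the function name configureBuilder is illustrative, and the workspace size simply mirrors the --max_workspace_size value from your command line. An INT8 calibrator still has to be provided separately.

#include "NvInfer.h"

// Sketch: request INT8 kernels and forbid silent fallback to higher precision.
void configureBuilder(nvinfer1::IBuilderConfig* config,
                      nvinfer1::IInt8Calibrator* calibrator)
{
    config->setFlag(nvinfer1::BuilderFlag::kINT8);          // allow INT8 kernels
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);  // enforce the requested precision
    config->setMaxWorkspaceSize(1ULL << 32);                // 4 GiB, matching --max_workspace_size
    config->setInt8Calibrator(calibrator);                  // calibration is still required for INT8
}

Note that with kSTRICT_TYPES the builder will fail or pick slower kernels where no INT8 implementation exists, which is why it is intended for debugging rather than deployment.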

Please refer to the link below for more details:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-601/tensorrt-developer-guide/index.html#enable_fp16_c

Thanks