Hi,
I have successfully run several models with TensorRT using the FP16 data type, but I've found that the inference output differs across GPU architectures (a T4 GPU and an A2000 laptop GPU). This was somewhat expected from what I have read in other threads.
So my question is: what is the best strategy to reduce these differences across GPU architectures as much as possible? Any suggestions on best practices for achieving consistent inference results across GPUs would be greatly appreciated.
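For reference, this is roughly how I build the FP16 engine on each machine (a minimal sketch; the ONNX path, plan path, and otherwise default builder config are placeholders for my actual setup, and the engine is rebuilt separately on each GPU since plans are not portable across architectures):

```python
# Minimal sketch of the engine build; "model.onnx" and "model_fp16.plan"
# are placeholders for my actual files.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

# Builder config: FP16 enabled, everything else left at defaults
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

serialized_engine = builder.build_serialized_network(network, config)
with open("model_fp16.plan", "wb") as f:
    f.write(serialized_engine)
```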
Environment
TensorRT Version: 8.6.1
NVIDIA Driver Version: 525
CUDA Version: 11.7
CUDNN Version: 8.5.0
Operating System + Version: Ubuntu 20.04
Container: nvcr.io/nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04