TRT inference fp32 vs fp16

Description

Could you please help me?

I ran the sample programs (sampleMNIST and sampleOnnxMNIST) to compare memory consumption in FP32 and FP16 mode. The result is that both modes take up almost the same amount of memory.

I simply compiled the project and ran "./bin/sample_mnist" and "./bin/sample_mnist --fp16", while running "watch -n 0.1 -d nvidia-smi" in another terminal to record the GPU memory usage.

Is this normal, or did I miss some important detail?
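For reference, the measurement described above can also be scripted rather than read off the screen. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`parse_memory_used`, `query_memory_used`, `peak_per_gpu`) are made up for illustration. Note that the figure reported by nvidia-smi covers everything the process holds on the GPU, not only the network's weights and activations.

```python
import subprocess

# Query flags of the nvidia-smi CLI; with "nounits" the values are plain MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_memory_used(csv_text):
    """Turn nvidia-smi CSV output (one line per GPU) into a list of MiB values."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_memory_used():
    """Run nvidia-smi once and return the used memory in MiB for each GPU."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_memory_used(result.stdout)


def peak_per_gpu(samples):
    """Given several readings (each a per-GPU list), return the per-GPU maximum."""
    return [max(values) for values in zip(*samples)]


# Hypothetical usage while ./bin/sample_mnist runs in another terminal:
#   import time
#   samples = []
#   for _ in range(50):
#       samples.append(query_memory_used())
#       time.sleep(0.1)
#   print(peak_per_gpu(samples))
```

Comparing the peak from an FP32 run against the peak from an FP16 run, each minus an idle baseline reading, gives a cleaner number than watching the display.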

Environment

TensorRT Version: 6.0.1.5
GPU Type: Tesla V100-SXM2
NVIDIA Driver Version: 410.79
CUDA Version: 10.0
cuDNN Version: 7.6.3
Operating System + Version: Ubuntu 16.04

Memory usage depends on the device and on the kernels TensorRT selects when optimizing the model, which in turn depend on precision and other factors.
To determine how much memory a model will use, please see the question "How do I determine how much device memory will be required by my network?" in the FAQ linked below.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-710-ea/developer-guide/index.html#faq

In this case, it is possible that a more optimized kernel is being used to improve the model's performance.

Thanks

Thanks for your reply. However, your suggestions do not seem to settle my problem. Is there an intuitive example that shows the difference in memory usage between FP32 and FP16 inference?

When allowed to use FP16, TRT will sometimes use FP32 anyway if it is faster, which can be the case for small networks.
If you really want TRT to use FP16 regardless, you should set type constraints on the layers and tell the builder to use strict type constraints.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-700/tensorrt-developer-guide/index.html#set_layer_mp_c
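As a rough illustration of that advice, here is a minimal sketch using the TensorRT Python API as it looks in the 6.x/7.x releases discussed in this thread. The helper name `force_fp16` is hypothetical, and the `trt` module is passed in as an argument (an illustrative choice, so the function can be exercised without a GPU). `BuilderFlag.STRICT_TYPES` is assumed to be available; newer TensorRT releases deprecate it, so check the documentation for your version. Some layers may not support FP16, in which case the build can fail or fall back with a warning.

```python
def force_fp16(network, config, trt):
    """Request FP16 for every layer of `network` and ask the builder to obey.

    network: a tensorrt.INetworkDefinition that has already been populated
    config:  a tensorrt.IBuilderConfig
    trt:     the imported tensorrt module (passed in explicitly; an
             illustrative choice, not something the samples do)
    Returns the number of layers that were constrained.
    """
    # Allow FP16 kernels at all, then make per-layer precisions binding
    # instead of mere hints (assumed flag name for TensorRT 6.x/7.x).
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.STRICT_TYPES)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        # Computation precision of the layer.
        layer.precision = trt.float16
        # Data type of each output tensor the layer produces.
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float16)
    return network.num_layers


# Hypothetical usage, after parsing the model into `network`:
#   import tensorrt as trt
#   config = builder.create_builder_config()
#   force_fp16(network, config, trt)
#   engine = builder.build_engine(network, config)
```

Constraining every layer is the bluntest option; in practice one might constrain only the heavy convolution and fully connected layers and leave the rest to the builder.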

For large networks you should see significant memory reductions and, usually more importantly, significantly increased performance.
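A back-of-the-envelope calculation suggests why the difference is hard to see for MNIST. The sketch below counts weights only (activations and workspace are ignored) and assumes the classic Caffe LeNet layout for the MNIST sample: two 5x5 convolutions with 20 and 50 filters, then fully connected layers of 500 and 10 units. The 25-million-parameter "large" model is a made-up round number for comparison, not a specific network.

```python
MIB = 2 ** 20

# (weights, biases) per layer, assuming the classic Caffe LeNet shapes.
LENET_LAYERS = {
    "conv1": (20 * 1 * 5 * 5, 20),
    "conv2": (50 * 20 * 5 * 5, 50),
    "ip1": (50 * 4 * 4 * 500, 500),
    "ip2": (500 * 10, 10),
}


def count_params(layers):
    """Total number of parameters (weights plus biases) over all layers."""
    return sum(weights + biases for weights, biases in layers.values())


def weight_mib(num_params, bytes_per_param):
    """Memory needed to store the parameters, in MiB."""
    return num_params * bytes_per_param / MIB


lenet = count_params(LENET_LAYERS)
large = 25_000_000  # hypothetical large network, for scale only

for name, n in (("LeNet (assumed layout)", lenet), ("hypothetical large net", large)):
    fp32 = weight_mib(n, 4)  # FP32: 4 bytes per parameter
    fp16 = weight_mib(n, 2)  # FP16: 2 bytes per parameter
    print(f"{name}: {n} params, FP32 {fp32:.2f} MiB, FP16 {fp16:.2f} MiB, "
          f"saved {fp32 - fp16:.2f} MiB")
```

Under these assumptions the MNIST network saves less than 1 MiB of weight storage in FP16, which is easy to lose in the total reported by nvidia-smi, while the larger model saves tens of MiB.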

Thanks

Thanks for your patience. I will follow your instructions.