Subnormal FP16 values detected


Environment

TensorRT Version: 8.4.1.5
GPU Type: discrete
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.2
Operating System + Version: Ubuntu 20.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): nvidia-tensorrt==8.4.1.5

When converting to TensorRT FP16, I see this:

Weights [name=Conv_0 + Relu_1.weight] had the following issues when converted to FP16:
[07/07/2022-18:30:26] [TRT] [W]  - Subnormal FP16 values detected. 
[07/07/2022-18:30:26] [TRT] [W]  - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value. 
[07/07/2022-18:30:26] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/07/2022-18:30:26] [TRT] [W] Weights [name=Conv_0 + Relu_1.bias] had the following issues when converted to FP16:
[07/07/2022-18:30:26] [TRT] [W]  - Subnormal FP16 values detected. 
[07/07/2022-18:30:26] [TRT] [W]  - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value. 
[07/07/2022-18:30:26] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/07/2022-18:30:26] [TRT] [V] Conv_0 + Relu_1 Set Tactic Name: trt_turing_cutlass_image_network_first_layer_hmma_fprop_f16f16f32_nhwc_nhwc_k64r7s7c4_stride2x2 Tactic: 0xe2222883a6602489
[07/07/2022-18:30:26] [TRT] [W] Weights [name=Conv_3 + Relu_4.weight] had the following issues when converted to FP16:
[07/07/2022-18:30:26] [TRT] [W]  - Subnormal FP16 values detected. 
[07/07/2022-18:30:26] [TRT] [W]  - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value. 
[07/07/2022-18:30:26] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
[07/07/2022-18:30:26] [TRT] [V] Conv_3 + Relu_4 Set Tactic Name: turing_h1688cudnn_256x64_ldg8_relu_exp_small_nhwc_tn_v1 Tactic: 0x1dcf9babce3d9b3b
[07/07/2022-18:30:26] [TRT] [W] Weights [name=Conv_5 + Add_6 + Relu_7.weight] had the following issues when converted to FP16:
[07/07/2022-18:30:26] [TRT] [W]  - Subnormal FP16 values detected. 

Are the weights out of range?
I think this causes FP16 to underperform. Is there any way to resolve this, either by retraining or by modifying the weights?
Complete logs attached, in case that helps.
logs.txt (2.4 MB)

Thanks
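To see which FP32 weights fall into the FP16 subnormal range the warning is about, something like the following can help. This is a standalone numpy sketch, not a TensorRT API: the thresholds are the IEEE 754 half-precision limits, and `count_fp16_subnormals` is a hypothetical helper name.

```python
import numpy as np

# IEEE 754 half-precision limits:
# smallest positive normal    = 2**-14 (~6.10e-5)
# smallest positive subnormal = 2**-24 (~5.96e-8)
FP16_MIN_NORMAL = 2.0 ** -14
FP16_MIN_SUBNORMAL = 2.0 ** -24

def count_fp16_subnormals(weights):
    """Return (subnormal_count, underflow_count) for an FP32 weight array.

    subnormal_count: values that become FP16 subnormals (trigger the first warning)
    underflow_count: values below even the smallest FP16 subnormal (second warning)
    """
    a = np.abs(np.asarray(weights, dtype=np.float32))
    nonzero = a[a > 0]
    subnormal = int(np.sum((nonzero < FP16_MIN_NORMAL) & (nonzero >= FP16_MIN_SUBNORMAL)))
    underflow = int(np.sum(nonzero < FP16_MIN_SUBNORMAL))
    return subnormal, underflow

w = np.array([0.5, 1e-6, 1e-9, -3e-5], dtype=np.float32)
print(count_fp16_subnormals(w))  # -> (2, 1)
```

Running this per layer over an exported state dict (or the ONNX initializers) should point at the same tensors the TensorRT log names.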


Hi,

If accuracy is not affected, you can ignore this warning.
As the logs say, Weights [name=xxx] contain some FP32 values that fall in the subnormal range of FP16.
We are working on improving these log messages; this may be fixed in future releases.

Thank you.
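The point that this warning is often harmless can be sanity-checked numerically: round-tripping FP32 values through FP16 shows that the absolute error introduced in the subnormal range is tiny, which is why accuracy frequently does not change. A small sketch with hypothetical values (not taken from the model above):

```python
import numpy as np

# Hypothetical FP32 weights; 1e-6 lands in the FP16 subnormal range
w32 = np.array([0.25, 1e-6, -0.5], dtype=np.float32)
w16 = w32.astype(np.float16)

# Absolute error introduced by the FP16 round-trip; bounded by the
# subnormal spacing (2**-24), so at most ~6e-8 for subnormal values
err = np.abs(w32 - w16.astype(np.float32))
print(err.max())
```

The risk is relative error: a weight that underflows to zero loses 100% of its value, which only matters if the network is sensitive to those particular tiny weights.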


Is there any way to calibrate or “quantize” the model to make it more FP16 friendly?

I am getting this same issue, and my accuracy is affected. How should I go about handling this? I have another computer with a similar setup (slightly different versions of CUDA, cuDNN, etc.) that does not raise this warning, and I get different results when I run the torch_tensorrt models on the two machines.

Also, when training my model, I do apply L2 regularization to all of my weights.
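For context, the "retrain with regularization" suggestion in the warning refers to weight decay shrinking weight magnitudes during training. A minimal numpy sketch of an SGD step with L2 weight decay (function name and constants are illustrative, not from my training code; in PyTorch this is usually the optimizer's weight_decay argument):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=1e-4):
    """One SGD step with L2 weight decay: w <- w - lr * (grad + wd * w)."""
    return w - lr * (grad + weight_decay * w)

w = np.array([1.0, -2.0])
g = np.zeros_like(w)
for _ in range(100):
    w = sgd_step_with_weight_decay(w, g)
print(w)  # magnitudes shrink slightly even with zero gradient
```

Note the tension: weight decay pushes weights toward zero, which if anything produces more FP16 subnormals, so the warning's advice reads oddly for this failure mode.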

Here is a sample of my warnings:

WARNING: [Torch-TensorRT TorchScript Conversion Context] - Weights [name=%1030 : Tensor = aten::_convolution(%result.41, %self.res_18.conv2.weight, %self.conv.conv1.bias, %5, %5, %5, %1028, %1029, %7, %1028, %1028, %1028, %1028) + %out.13 : Tensor = aten::batch_norm(%1030, %self.res_18.bn2.weight, %self.res_18.bn2.bias, %self.res_18.bn2.running_mean, %self.res_18.bn2.running_var, %self.conv.bn1.training, %549, %550, %551) # /home/user/venv/lib/python3.8/site-packages/torch/nn/functional.py:2282:11 + %892 : Tensor = aten::add(%out.13, %s.77, %7) # /home/user/Net.py:89:8 + %s.81 : Tensor = aten::relu(%892) # /home/user/venv/lib/python3.8/site-packages/torch/nn/functional.py:1299:17.weight] had the following issues when converted to FP16:
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - Subnormal FP16 values detected.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - Values less than smallest positive FP16 Subnormal value detected. Converting to FP16 minimum subnormalized value.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Weights [name=%1036 : Tensor = aten::_convolution(%s.81, %self.outblock.policy_conv1.weight, %self.conv.conv1.bias, %5, %4, %5, %1034, %1035, %7, %1034, %1034, %1034, %1034) + %902 : Tensor = aten::batch_norm(%1036, %self.outblock.policy_bn1.weight, %self.outblock.policy_bn1.bias, %self.outblock.policy_bn1.running_mean, %self.outblock.policy_bn1.running_var, %self.conv.bn1.training, %549, %550, %551) # /home/user/venv/lib/python3.8/site-packages/torch/nn/functional.py:2282:11 + %result.3 : Tensor = aten::relu(%902) # /home/user/venv/lib/python3.8/site-packages/torch/nn/functional.py:1299:17 || %1033 : Tensor = aten::_convolution(%s.81, %self.outblock.value_conv.weight, %self.conv.conv1.bias, %5, %4, %5, %1031, %1032, %7, %1031, %1031, %1031, %1031) + %895 : Tensor = aten::batch_norm(%1033, %self.outblock.value_bn.weight, %self.outblock.value_bn.bias, %self.outblock.value_bn.running_mean, %self.outblock.value_bn.running_var, %self.conv.bn1.training, %549, %550, %551) # /home/user/venv/lib/python3.8/site-packages/torch/nn/functional.py:2282:11 + %result.4 : Tensor = aten::relu(%895) # /home/user/venv/lib/python3.8/site-packages/torch/nn/functional.py:1299:17.weight] had the following issues when converted to FP16:
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - Subnormal FP16 values detected.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Weights [name=%1039 : Tensor = aten::_convolution(%result.3, %self.outblock.policy_conv2.weight, %self.outblock.policy_conv2.bias, %5, %4, %5, %1037, %1038, %7, %1037, %1037, %1037, %1037).weight] had the following issues when converted to FP16:
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - Subnormal FP16 values detected.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Weights [name=%1039 : Tensor = aten::_convolution(%result.3, %self.outblock.policy_conv2.weight, %self.outblock.policy_conv2.bias, %5, %4, %5, %1037, %1038, %7, %1037, %1037, %1037, %1037).bias] had the following issues when converted to FP16:
WARNING: [Torch-TensorRT TorchScript Conversion Context] - - Subnormal FP16 values detected.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - If this is not the desired behavior, please modify the weights or retrain with regularization to reduce the magnitude of the weights.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
WARNING: [Torch-TensorRT] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
WARNING: [Torch-TensorRT] - The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.

Update:
I found that the two machines had different versions of CUDA/cuDNN/TensorRT.

When I reverted to the older machine's setup (CUDA 11.4, with cuDNN 8.2 and TensorRT 8.2), I got the same results on both machines.

So it seems that something between CUDA 11.4 and 11.7 is causing this warning, which in turn is giving different results in my model.

Hi, I’m facing the same issue, and it is also decreasing my accuracy.
My setup: CUDA 11.3 / torch 1.10 / cudnn 8.4.1.50 / Tensorrt 8.4.1.5 (but also tested in 8.4.2.4)
I didn’t encounter this problem with my old setup (CUDA 11.1 / torch 1.9.1 / Tensorrt 8.2.3.0).

@rrrr Can you please share the detailed setup that works for you? Thanks a lot.

I was able to avoid the significant FP16 accuracy loss by reverting to CUDA 11.6 + TensorRT 8.4.0 EA + onnx-tensorrt 9f82b2b6072be6c01f65306388e5c07621d3308f. Since this moves from the latest packages back to a pre-release package, fixes included in the latest version are missing from this setup.

I am not sure if this information is helpful, but I will share an environment I created in which I confirmed that FP16 accuracy degradation does not occur for some models. Since the Dockerfile is of my own making, it contains many steps that are essentially irrelevant to this issue, so you can ignore most of it.
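Besides pinning older package versions, the warning's own suggestion to "modify the weights" can be tried directly: flush FP32 values that would be subnormal in FP16 to zero before exporting. This is a workaround sketch, not an official TensorRT API; the helper name and threshold constant are assumptions, and flushing weights changes the model, so accuracy must be re-validated afterwards.

```python
import numpy as np

FP16_MIN_NORMAL = 2.0 ** -14  # smallest positive normal FP16 value (~6.10e-5)

def flush_tiny_weights(w, threshold=FP16_MIN_NORMAL):
    """Zero out FP32 weights whose magnitude would be subnormal (or underflow) in FP16."""
    w = np.asarray(w, dtype=np.float32).copy()
    w[np.abs(w) < threshold] = 0.0
    return w

w = np.array([0.5, 1e-6, -1e-9, 3e-5], dtype=np.float32)
print(flush_tiny_weights(w))  # -> [0.5, 0.0, 0.0, 0.0]
```

Applied to each state-dict tensor (or ONNX initializer) before conversion, this removes the subnormal warnings at the cost of discarding the tiniest weights.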
