Use FP16 regardless of whether it is slower or not

Description

My goal is to run every layer of a two-layer MLP in FP16. When I build the engine with the FP16 and STRICT_TYPES builder flags, I get the following:

[05/10/2022-10:45:57] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 221 MiB, GPU 2798 MiB
[05/10/2022-10:45:57] [I] [TRT] ---------- Layers Running on DLA ----------
[05/10/2022-10:45:57] [I] [TRT] ---------- Layers Running on GPU ----------
[05/10/2022-10:45:57] [I] [TRT] [GpuLayer] shuffle_between_input_tensor_and_(Unnamed Layer* 0) [Fully Connected]
[05/10/2022-10:45:57] [I] [TRT] [GpuLayer] 2-layer MLP: (Unnamed Layer* 0) [Fully Connected] + (Unnamed Layer* 1) [Activation] -> (Unnamed Layer* 3) [Activation]
[05/10/2022-10:45:58] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +176, now: CPU 379, GPU 2975 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +241, GPU +224, now: CPU 620, GPU 3199 (MiB)
[05/10/2022-10:46:00] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[05/10/2022-10:46:00] [W] [TRT] No implementation of layer shuffle_between_input_tensor_and_(Unnamed Layer* 0) [Fully Connected] obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[05/10/2022-10:46:00] [W] [TRT] No implementation obeys reformatting-free rules, at least 1 reformatting nodes are needed, now picking the fastest path instead.
[05/10/2022-10:46:00] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/10/2022-10:46:00] [I] [TRT] Total Host Persistent Memory: 0
[05/10/2022-10:46:00] [I] [TRT] Total Device Persistent Memory: 0
[05/10/2022-10:46:00] [I] [TRT] Total Scratch Memory: 128
[05/10/2022-10:46:00] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 620 MiB, GPU 3204 MiB

The interesting part is:

[05/10/2022-10:46:00] [W] [TRT] No implementation of layer shuffle_between_input_tensor_and_(Unnamed Layer* 0) [Fully Connected] obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[05/10/2022-10:46:00] [W] [TRT] No implementation obeys reformatting-free rules, at least 1 reformatting nodes are needed, now picking the fastest path instead.

My question: is the operation of the first fully connected layer executed with 16-bit floats, or are 32-bit floats used?

Environment

JetPack 4.6
TensorRT Version: 8.0.2
GPU Type: Jetson Nano GPU (Maxwell)
CUDA Version: 10.2
cuDNN Version: 8.2.1
Operating System + Version: Ubuntu 18.04

Hi,

Could you please share the model, script, profiler, and performance output (if not already shared) so that we can help you better?

Alternatively, you can try running your model with the trtexec command.
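For example, an invocation along these lines builds and times the network in FP16 (model.onnx and the output path are placeholders):

trtexec --onnx=model.onnx --fp16 --verbose --saveEngine=model.plan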

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the below links for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
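As a rough illustration of measuring only the inference part, one can time just the enqueue call with CUDA events. This is a minimal sketch, not your exact setup: context (an IExecutionContext*), bindings (device pointers), and stream are assumed to exist from your engine setup.

#include <cuda_runtime.h>

// Times only the network execution, excluding pre- and post-processing.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
context->enqueueV2(bindings, stream, nullptr);   // inference only
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // latency in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);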

Thanks!

The defined model:

// Network input: FP32 tensor of shape 1 x 1 x input_size x 1
auto input = network->addInput("input_tensor", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, 1, input_size, 1));

// First fully connected layer + ReLU, both requested in FP16
auto fc1 = network->addFullyConnected(*input, fc1_output_size, kernel1, biasm1);
fc1->setPrecision(nvinfer1::DataType::kHALF);
auto relu1 = network->addActivation(*fc1->getOutput(0), nvinfer1::ActivationType::kRELU);
relu1->setPrecision(nvinfer1::DataType::kHALF);

// Second fully connected layer + ReLU, both requested in FP16
auto fc2 = network->addFullyConnected(*relu1->getOutput(0), fc2_output_size, kernel2, biasm2);
fc2->setPrecision(nvinfer1::DataType::kHALF);
auto relu2 = network->addActivation(*fc2->getOutput(0), nvinfer1::ActivationType::kRELU);
relu2->setPrecision(nvinfer1::DataType::kHALF);

relu2->getOutput(0)->setName("output_tensor");
network->markOutput(*relu2->getOutput(0));

The weights and biases are drawn uniformly from the (-1, 1) interval.
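For reference, kernel1/biasm1 (and kernel2/biasm2 analogously) are constructed roughly as follows. This is a sketch of the idea rather than the exact code; the backing buffers must stay alive until the engine is built.

#include <random>
#include <vector>

// Fill the first layer's weights/bias with uniform random values in (-1, 1)
// and wrap them as nvinfer1::Weights.
std::vector<float> kernel1_data(static_cast<size_t>(fc1_output_size) * input_size);
std::vector<float> bias1_data(fc1_output_size);

std::mt19937 rng(42);
std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
for (auto& w : kernel1_data) w = dist(rng);
for (auto& b : bias1_data)   b = dist(rng);

nvinfer1::Weights kernel1{nvinfer1::DataType::kFLOAT, kernel1_data.data(),
                          static_cast<int64_t>(kernel1_data.size())};
nvinfer1::Weights biasm1{nvinfer1::DataType::kFLOAT, bias1_data.data(),
                         static_cast<int64_t>(bias1_data.size())};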

The flags set on the builder config:

config->setFlag(BuilderFlag::kREFIT);
config->setFlag(BuilderFlag::kFP16);
config->setFlag(BuilderFlag::kSTRICT_TYPES);

However, my question is about the logs, so I think it can be answered without additional information such as the model.
What I would like to know is whether the logs in my post indicate that the operation (matrix multiplication) in the first layer is executed with 16-bit floats or with 32-bit floats.

Hi,

The warning means that TensorRT does not have an FP16 tactic for that layer that satisfies the requested constraints, so it falls back to FP32 (the fastest available implementation). The Nano itself does support FP16, however.
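In general, requesting FP16 end to end means pinning each layer's output type in addition to its computation precision. Below is a minimal sketch using the layer handles from your code; this makes the request explicit, but it still does not guarantee that a conforming FP16 tactic exists for every layer.

// Sketch: pin both computation precision and output type per layer.
// fc1/relu1/fc2/relu2 are the layer handles from the network definition above.
fc1->setPrecision(nvinfer1::DataType::kHALF);
fc1->setOutputType(0, nvinfer1::DataType::kHALF);

relu1->setPrecision(nvinfer1::DataType::kHALF);
relu1->setOutputType(0, nvinfer1::DataType::kHALF);

fc2->setPrecision(nvinfer1::DataType::kHALF);
fc2->setOutputType(0, nvinfer1::DataType::kHALF);

relu2->setPrecision(nvinfer1::DataType::kHALF);
relu2->setOutputType(0, nvinfer1::DataType::kHALF);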

Could you please share the ONNX model with us so we can try it on our end for better debugging?

Thank you.