Description
My goal is to run every layer of a two-layer MLP in fp16. When I build the engine with the FP16 and STRICT_TYPES builder flags (a sketch of my setup is shown after the question below), I get the following build log:
[05/10/2022-10:45:57] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 221 MiB, GPU 2798 MiB
[05/10/2022-10:45:57] [I] [TRT] ---------- Layers Running on DLA ----------
[05/10/2022-10:45:57] [I] [TRT] ---------- Layers Running on GPU ----------
[05/10/2022-10:45:57] [I] [TRT] [GpuLayer] shuffle_between_input_tensor_and_(Unnamed Layer* 0) [Fully Connected]
[05/10/2022-10:45:57] [I] [TRT] [GpuLayer] 2-layer MLP: (Unnamed Layer* 0) [Fully Connected] + (Unnamed Layer* 1) [Activation] -> (Unnamed Layer* 3) [Activation]
[05/10/2022-10:45:58] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +176, now: CPU 379, GPU 2975 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +241, GPU +224, now: CPU 620, GPU 3199 (MiB)
[05/10/2022-10:46:00] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[05/10/2022-10:46:00] [W] [TRT] No implementation of layer shuffle_between_input_tensor_and_(Unnamed Layer* 0) [Fully Connected] obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[05/10/2022-10:46:00] [W] [TRT] No implementation obeys reformatting-free rules, at least 1 reformatting nodes are needed, now picking the fastest path instead.
[05/10/2022-10:46:00] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[05/10/2022-10:46:00] [I] [TRT] Total Host Persistent Memory: 0
[05/10/2022-10:46:00] [I] [TRT] Total Device Persistent Memory: 0
[05/10/2022-10:46:00] [I] [TRT] Total Scratch Memory: 128
[05/10/2022-10:46:00] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 620, GPU 3204 (MiB)
[05/10/2022-10:46:00] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 620 MiB, GPU 3204 MiB
The interesting part is these two warnings:
[05/10/2022-10:46:00] [W] [TRT] No implementation of layer shuffle_between_input_tensor_and_(Unnamed Layer* 0) [Fully Connected] obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[05/10/2022-10:46:00] [W] [TRT] No implementation obeys reformatting-free rules, at least 1 reformatting nodes are needed, now picking the fastest path instead.
My question: given these warnings, is the first fully connected layer actually executed in 16-bit floats, or does it fall back to 32-bit floats?
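For reference, this is roughly how I define the network and request FP16 (a sketch only, not my exact script; the layer names, dimensions and zero-filled weights below are placeholders):

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Two-layer MLP: FC -> ReLU -> FC -> ReLU (placeholder dimensions/weights).
inp = network.add_input("input", trt.float32, (1, 16, 1, 1))
fc1 = network.add_fully_connected(inp, 32,
                                  np.zeros((32, 16), np.float32),
                                  np.zeros(32, np.float32))
act1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)
fc2 = network.add_fully_connected(act1.get_output(0), 8,
                                  np.zeros((8, 32), np.float32),
                                  np.zeros(8, np.float32))
act2 = network.add_activation(fc2.get_output(0), trt.ActivationType.RELU)
network.mark_output(act2.get_output(0))

# Enable FP16 and make the per-layer precision request strict.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.STRICT_TYPES)
for i in range(network.num_layers):
    layer = network.get_layer(i)
    layer.precision = trt.float16
    for j in range(layer.num_outputs):
        layer.set_output_type(j, trt.float16)

engine = builder.build_engine(network, config)
```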
Environment
JetPack Version: 4.6
TensorRT Version: 8.0.2
GPU Type: Jetson Nano GPU (Maxwell)
CUDA Version: 10.2
CUDNN Version: 8.2.1
Operating System + Version: Ubuntu 18.04