I’m running into an issue where unless I use the LayerNorm ONNX op and the corresponding TRT plugin, I see FP16 overflows in my network. I’ve tried forcing FP32 precision on problematic layers but TRT seems to ignore this and fuse almost the entire network into a ForeignNode.
I can get this to work on my desktop with TRT 8.5.6, but I think this version is not supported on the Xavier NX (JP 5.1.5). Is there a way to install TRT 8.5.6, or at least port the plugin to 8.5.2.2?
You will need JetPack 6 to get TensorRT 8.6, but JetPack 6 doesn’t support Xavier.
Do you deploy on the DLA? Could you try to run it on the GPU first?
There is an option to force a layer with predefined precision in trtexec.
Could you give it a try?
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec
...
--layerPrecisions=spec Control per-layer precision constraints. Effective only when precisionConstraints is set to
"obey" or "prefer". (default = none)
The specs are read left-to-right, and later ones override earlier ones. "*" can be used as a
layerName to specify the default precision for all the unspecified layers.
Per-layer precision spec ::= layerPrecision[","spec]
layerPrecision ::= layerName":"precision
precision ::= "fp32"|"fp16"|"int32"|"int8"
...
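A concrete invocation might look like the sketch below. This is hedged: `model.onnx` is a placeholder path, and the two layer names are examples taken from the discussion here — use the exact names TensorRT reports for your graph (e.g. from `trtexec --verbose`).

```shell
# Hedged sketch: model path and layer names are placeholders.
# --precisionConstraints=obey makes TensorRT fail the build rather than
# silently drop the per-layer constraints below.
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --fp16 \
  --precisionConstraints=obey \
  --layerPrecisions="/backbone/blocks.0/attn/MatMul":fp32,"/backbone/blocks.0/attn/Softmax":fp32
```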
Running on the GPU. Forcing precision doesn’t work because of Myelin fusion. I’ve tried modifying the ONNX graph to break the fusion, but then I get a slower network, which defeats the purpose of FP16 inference.
Hello, what does WAR stand for?
I’m trying to convert DINOv3-based models. Here is an ONNX file:
Not sure if this will affect the fusion, but I also plan to expose some of the intermediate layers as outputs. So far TensorRT still likes to create one big ForeignNode with multiple outputs.
So I theorize that there are two components that cause the overflow:

1. The first attention block: /backbone/blocks.0/attn/MatMul followed by /backbone/blocks.0/attn/Softmax should operate in FP32 so the values don’t get large. The other attention blocks don’t have large activations.
2. Almost every layernorm operation: the /backbone/blocks.X/norm1/Pow operation is the problem, where X is 1 to 12. I believe we have three options:
2.1. Clip the Pow output to 65500 to keep it below the FP16 maximum. This seems to work, but there may be an accuracy drop, and I’m also afraid it blocks the fusion of the attention layers and produces a model that is no faster than full FP32.
2.2. Put the entire layernorm (anything with "norm" in the layer name) in FP32. I couldn’t force this because TensorRT kept fusing everything and ignored my FP32 constraint.
2.3. Use ONNX opset 17 and the LayerNorm TRT plugin, which doesn’t seem to suffer from this issue.
If option 2.3 is a possibility on the XavierNX, that would be best. Or maybe you would have another idea.
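The arithmetic behind the Pow overflow and the clip workaround can be checked with a few lines of stdlib Python, using the `struct` module's half-precision format as a stand-in for the FP16 kernel. The activation magnitude of 300 is a hypothetical stand-in for the real pre-norm values.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE fp16 (struct 'e' format).
    Out-of-range values become inf, as they would in an FP16 kernel."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(float("inf"), x)

x = 300.0                       # hypothetical pre-norm activation
squared = to_fp16(x * x)        # the /norm1/Pow output: 90000 > 65504
clipped = to_fp16(min(x * x, 65500.0))

print(squared)                  # inf -> poisons the variance downstream
print(math.isfinite(clipped))   # True: the clip keeps the value finite
```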
I was able to follow this blog post (after translating it to English) and write a custom layernorm plugin. This allowed me to run the model in FP16 mode, but only after setting the first MatMul/Softmax pair in attn0 to FP32. Nothing else needs to be set to FP32, but the custom plugin is necessary to keep the layernorms from overflowing.
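The effect the plugin has can be modeled in a few lines: a toy layernorm collapses when every intermediate is rounded to fp16 (the squared deviations overflow), but is fine when the arithmetic runs in full precision and only the result is stored as fp16. The activation values are hypothetical; this is a sketch of the rounding behavior, not the plugin itself.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE fp16; out-of-range becomes inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(float("inf"), x)

def layernorm_naive_fp16(xs, eps=1e-5):
    # Every intermediate is rounded to fp16, like a fused fp16 kernel:
    # the squared deviation (the /norm1/Pow node) overflows first.
    mean = to_fp16(sum(to_fp16(v) for v in xs) / len(xs))
    var = to_fp16(sum(to_fp16(to_fp16(v - mean) ** 2) for v in xs) / len(xs))
    return [to_fp16((v - mean) / math.sqrt(var + eps)) for v in xs]

def layernorm_fp32(xs, eps=1e-5):
    # Arithmetic in full precision; only the result is stored as fp16 --
    # roughly what an FP32-pinned layernorm (or the plugin) does.
    mean = sum(xs) / len(xs)
    var = sum((v - mean) ** 2 for v in xs) / len(xs)
    return [to_fp16((v - mean) / math.sqrt(var + eps)) for v in xs]

xs = [300.0, -280.0, 310.0, -305.0]  # hypothetical pre-norm activations
print(layernorm_naive_fp16(xs))  # all zeros: the fp16 variance overflowed to inf
print(layernorm_fp32(xs))        # finite values around +/-1, as expected
```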