Segmentation fault: Softmax+Split+Concat+TopK+Gather

Description

I have a model that was successfully converted with trtexec under TensorRT 7.1.3.
After updating to TensorRT 8.5.2, the conversion fails with a segmentation fault.
After two days of debugging, I finally managed to create a minimal example that triggers the crash.
If any one of the layers is removed, the model converts successfully.

Code to obtain the minimal ONNX model:

import os

import torch
import torch.nn.functional as F


class ModelSegFault(torch.nn.Module):
    def forward(self, x):
        x = F.softmax(x, dim=2)
        # Split into size-1 slices along dim 2 and concatenate them back:
        # numerically a no-op, but it adds Split/Concat nodes to the graph.
        x = torch.cat(x.split(1, 2), 2)
        x = x.reshape(6400)
        _, topk_inds = x.topk(1000, 0)
        x = x[topk_inds].reshape(1, 1000)
        return x


model = ModelSegFault()
inp = torch.rand((1, 800, 8), dtype=torch.float32)
path = os.path.expanduser('~/bug.onnx')
torch.onnx.export(
    model, inp, path, input_names=['input'], output_names=['output']
)
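For reference, the operator chain the exported graph contains can be sketched in NumPy (an assumption standing in for the actual ONNX graph dump; the `softmax` helper and variable names below are illustrative only). It shows that the Split+Concat pair is numerically an identity, and that TopK followed by Gather simply returns the 1000 largest probabilities:

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax, analogous to F.softmax(x, dim=axis).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = np.random.rand(1, 800, 8).astype(np.float32)

# Softmax over the last axis (dim 2 in the PyTorch code).
s = softmax(x, axis=2)

# Split into 8 size-1 slices along axis 2, then Concat along axis 2:
# numerically a no-op, but it inserts Split and Concat nodes into the graph.
slices = np.split(s, s.shape[2], axis=2)
c = np.concatenate(slices, axis=2)
assert np.array_equal(c, s)

# Reshape to 6400 elements, TopK (k=1000), then Gather by the TopK indices.
flat = c.reshape(6400)
topk_inds = np.argsort(-flat)[:1000]
out = flat[topk_inds].reshape(1, 1000)
```

So the crash is not about what the graph computes (the values are unremarkable), but about how TensorRT 8.5.2 fuses this particular node sequence.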

We found that the segfault can be avoided by adding a small epsilon to the output of softmax and immediately subtracting it again:

class ModelSegFault(torch.nn.Module):
    def forward(self, x):
        # Workaround: add and subtract a small epsilon after softmax.
        x = F.softmax(x, dim=2) + 1e-5
        x -= 1e-5
        x = torch.cat(x.split(1, 2), 2)
        x = x.reshape(6400)
        _, topk_inds = x.topk(1000, 0)
        x = x[topk_inds].reshape(1, 1000)
        return x
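Why this is numerically safe can be sketched with a NumPy stand-in (an assumption about the mechanism, not a verified analysis of the TensorRT bug): adding and then subtracting 1e-5 perturbs each float32 value by at most a rounding step, while presumably inserting extra Add/Sub nodes into the graph that break up the problematic fusion.

```python
import numpy as np

np.random.seed(0)
p = np.random.rand(6400).astype(np.float32)
p /= p.sum()  # stand-in for the flattened softmax output

# The workaround: add then subtract epsilon.
q = (p + np.float32(1e-5)) - np.float32(1e-5)

# Values agree to within float32 rounding, so the TopK result is
# unaffected in practice.
assert np.allclose(p, q, atol=1e-7)
assert set(np.argsort(-p)[:1000]) == set(np.argsort(-q)[:1000])
```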

Environment

We have tested it on a few computers, including Jetson Orin.

TensorRT Version: 8.5.2
GPU Type: Nvidia GeForce Titan X (Maxwell) / 2080Ti / Jetson Orin NX
Nvidia Driver Version: 510.85.02-0ubuntu0.20.04.1 / ? / ?
CUDA Version: 10.2.89-1 / 11.5 / 11.4.19
CUDNN Version: 8.3.1.22-1+cuda10.2 / ? / 8.6.0
Operating System + Version: Ubuntu 20.04 / Ubuntu 22.04 / NVIDIA Jetson Linux 35.4.1
Python Version (if applicable): 3.8 / - / -
PyTorch Version (if applicable): 1.10.2 / - / -
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

model.onnx (558 Bytes)

Hi,

We are unable to reproduce the error and are able to build the TensorRT engine successfully:

[09/27/2023-14:58:06] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model.onnx

We recommend trying the latest TensorRT version, 8.6.1.

Thank you.

Yes, you are right, there is no issue with TensorRT 8.6+, but as far as I know TensorRT 8.5.2 is currently the latest version that can be installed on Jetson. Is there any way to update it?

Actually, a workaround has already been found; this topic is just a bug report.

We are moving this post to the Jetson forum, where you can get better help with the above.

Thank you,

OK, thank you for your assistance!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.