TensorRT freezes while building a TensorRT engine from an ONNX model

I have a very odd problem that I cannot solve on my own so I need your help.

I want to serve a model I have with Triton. For best performance I am trying to use the TensorRT backend.
I have an ONNX model of the network (I have tested and verified that the model is valid; it was exported from PyTorch using opset 11). I am using trtexec to convert this ONNX file into a TensorRT engine, but during the conversion trtexec gets stuck and the process runs forever. So far I have not found a way to kill the process, and I have to reboot the system to regain control of the environment.
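
A minimal sketch of such a validity check (assuming the standard onnx Python package is installed) is a one-liner along these lines:

python -c "import onnx; onnx.checker.check_model(onnx.load('model.onnx'))"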

My setup:
Ubuntu 20.04 running in WSL2 on Windows 11
Docker for Windows v4.16.1 with WSL2 support
Container: nvcr.io/nvidia/tensorrt:22.10-py3
Container environment:
TRT_VERSION="8.5.0.12"
CUBLAS_VERSION="11.11.3.6"
CUDA_CACHE_DISABLE="1"
CUDA_DRIVER_VERSION="520.61.05"
CUDA_VERSION="11.8.0.065"
CUDNN_VERSION="8.6.0.163"
CUFFT_VERSION="10.9.0.58"
CURAND_VERSION="10.3.0.86"
CUSOLVER_VERSION="11.4.1.48"
CUSPARSE_VERSION="11.7.5.86"
CUTENSOR_VERSION="1.6.1.5"

NVIDIA driver installed on the Windows host: 526.47
GPU: RTX 2080 Ti

To convert the model I use trtexec from inside the Docker container.
The command I run is:

trtexec --onnx=./model.onnx --verbose --workspace=16000 --minShapes=normalized_image:1x3x1792x3168 --optShapes=normalized_image:2x3x1792x3168 --maxShapes=normalized_image:2x3x1792x3168 --saveEngine=model2.plan
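
(Aside: as far as I know, --workspace is deprecated in TensorRT 8.5 in favor of --memPoolSize, so the equivalent invocation with the newer flag should look roughly like the one below, with the value still in MiB.)

trtexec --onnx=./model.onnx --verbose --memPoolSize=workspace:16000 --minShapes=normalized_image:1x3x1792x3168 --optShapes=normalized_image:2x3x1792x3168 --maxShapes=normalized_image:2x3x1792x3168 --saveEngine=model2.plan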

The build runs for about 1-2 minutes and then stops producing messages.

The last messages I see are:

[01/18/2023-16:02:13] [V] [TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: 0x946eca69f99ddcb4
[01/18/2023-16:02:13] [V] [TRT] *************** Autotuning format combination: Float(1419264,1:4,6336,16) -> Float(8515584,1:4,38016,96) ***************
[01/18/2023-16:02:13] [V] [TRT] --------------- Timing Runner: Conv_8 + Relu_9 (CudaDepthwiseConvolution)
[01/18/2023-16:02:13] [V] [TRT] CudaDepthwiseConvolution has no valid tactics for this config, skipping
[01/18/2023-16:02:13] [V] [TRT] --------------- Timing Runner: Conv_8 + Relu_9 (CublasConvolution)
[01/18/2023-16:02:13] [V] [TRT] CublasConvolution has no valid tactics for this config, skipping
[01/18/2023-16:02:13] [V] [TRT] --------------- Timing Runner: Conv_8 + Relu_9 (CaskGemmConvolution)
[01/18/2023-16:02:13] [V] [TRT] CaskGemmConvolution has no valid tactics for this config, skipping
[01/18/2023-16:02:13] [V] [TRT] =============== Computing costs for
[01/18/2023-16:02:13] [V] [TRT] *************** Autotuning format combination: Float(34062336,88704,396,1) -> Float(8515584,22176,198,1) ***************
[01/18/2023-16:02:13] [V] [TRT] --------------- Timing Runner: Conv_10 + Relu_11 (CudaDepthwiseConvolution)
[01/18/2023-16:02:13] [V] [TRT] Tactic: 0xffffffffffffffff Time: 0.781518
[01/18/2023-16:02:13] [V] [TRT] Fastest Tactic: 0xffffffffffffffff Time: 0.781518
[01/18/2023-16:02:13] [V] [TRT] --------------- Timing Runner: Conv_10 + Relu_11 (CudnnConvolution)
[01/18/2023-16:02:13] [V] [TRT] Tactic: 0x0000000000000000 Time: 1.14542
[01/18/2023-16:02:13] [V] [TRT] Tactic: 0x0000000000000001 Time: 1.1456
[01/18/2023-16:02:13] [V] [TRT] Tactic: 0x0000000000000002 Time: 1.14571

Those lines were printed more than an hour ago, and trtexec is still running.

I ran ps -aux in another console and I see:

root       102 99.7 14.9 52012052 2445980 pts/0 Rl+ 16:01  70:01 trtexec --onnx=./model.onnx --verbose --workspace=16000 --minShapes=normalized_image:1x3x1792x3168 --optShapes=normalized_image:2x3x1792x316

So ps reports trtexec using 99.7% CPU, but nothing visible is happening.
If I try to exit the process with CTRL+C, nothing happens. If I try to kill the process, it won't die. If I try to quit the Docker container, it won't quit. If I force-kill the Docker container, the next time I try to launch it, it gets stuck and won't start. So far only a reboot brings the system back up.
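
Concretely, the attempts go roughly like this (the PID is the one from the ps output above; the container name is a placeholder):

kill -9 102                  # from inside the container: the process does not die
docker kill <container_name> # from the WSL2 shell: the container then refuses to start again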

I tried to convert the same ONNX file with the same command line directly on Windows with TensorRT-8.2.1.8, and in about two and a half minutes it successfully produced an engine file.

I am baffled as to what is going on and how I can address it. What is stranger is that I tried to convert another model (a YOLO model) and it converted successfully (in the same WSL instance and the same Docker container), but only on the third try. However, this model is bigger than the YOLO one, and so far I have been unable to convert it. I even left it running overnight yesterday, and in the morning it was still hanging.

Hi,

Please try the latest TensorRT NGC container, nvcr.io/nvidia/tensorrt:22.12-py3.
If you are still facing the issue, could you please share the ONNX model with us so we can try it on our end for better debugging?
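
For example, something along these lines should pull the newer container and rerun the same conversion inside it (the mount path is only an illustration; adjust it to wherever your model lives):

docker run --gpus all -it --rm -v ${PWD}:/models nvcr.io/nvidia/tensorrt:22.12-py3
# then, inside the container:
trtexec --onnx=/models/model.onnx --verbose --workspace=16000 --minShapes=normalized_image:1x3x1792x3168 --optShapes=normalized_image:2x3x1792x3168 --maxShapes=normalized_image:2x3x1792x3168 --saveEngine=/models/model2.plan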

Thank you.