TensorRT does not see all GPU memory

Description

I have a pipeline that converts an ONNX model into a TensorRT engine. It was working fine until I moved to another machine.

Now TensorRT simply crashes silently without outputting any error message. The last output I can see is something like:

[11/18/2022-15:27:13] [V] [TRT] *************** Autotuning format combination: Float(10240000,256,32,1), Int32(2,1), Int32(1), Float(57600,64,8,8,2,1), Float(28800,32,4,4,1) -> Float(230400,256,1) ***************
[11/18/2022-15:27:13] [V] [TRT] =============== Computing costs for 
[11/18/2022-15:27:13] [V] [TRT] *************** Autotuning format combination: Float(256,256,1), Float(256,256,1), Float(256,256,1), Float(256,256,1), Float(256,256,1), Float(230400,256,1), Float(256,256,1), Float(2700,3,1), Float(2700,3,1), Float(2700,3,1), Float(2700,3,1), Float(2700,3,1), Float(2700,3,1) -> Float(9000,9000,10,1), Float(9000,9000,10,1) ***************
[11/18/2022-15:27:13] [V] [TRT] --------------- Timing Runner: {ForeignNode[onnx::MatMul_6277...Concat_6145]} (Myelin)

Interestingly, the conversion runs in a Docker container. I use the same Docker image under the same OS (Ubuntu 22.04) and the same input model, so I struggle to understand why it doesn't work.

The only difference I noticed between the runs is the amount of GPU memory allocated by CUDA. Normally, a large portion of the free memory is allocated initially:

[11/18/2022-14:33:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +329, GPU +0, now: CPU 339, GPU 26979 (MiB)
[11/18/2022-14:33:43] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +327, GPU +104, now: CPU 685, GPU 27083 (MiB)

However, in the ‘crash case’ only 443 MiB are used:

[11/18/2022-15:32:58] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 336, GPU 443 (MiB)
[11/18/2022-15:32:58] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +328, GPU +104, now: CPU 683, GPU 547 (MiB)
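To compare the two runs quickly, the running GPU total can be pulled out of the [MemUsageChange] lines with a small script (a sketch, assuming the log format shown in the excerpts):

```python
import re

# The running GPU total appears as "now: CPU <n>, GPU <n> (MiB)" in
# TensorRT's [MemUsageChange] log lines.
GPU_TOTAL = re.compile(r"now: CPU \d+, GPU (\d+) \(MiB\)")

def gpu_totals(log_lines):
    """Return the GPU memory totals (MiB) reported by each matching line."""
    return [int(m.group(1)) for m in map(GPU_TOTAL.search, log_lines) if m]
```

For example, the two init lines from the failing run yield [443, 547], versus [26979, 27083] in the normal run.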

I checked the memory pool limit with self.config.get_memory_pool_limit(trt.MemoryPoolType.WORKSPACE), and it tells me the limit is 25 GB, the whole amount of GPU memory on board.
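For reference, get_memory_pool_limit reports the limit in bytes, so converting to MiB makes it directly comparable with the figures in the TensorRT log (a sketch; the actual query lines are commented out since they need a live builder config and the tensorrt package):

```python
def bytes_to_mib(num_bytes: int) -> float:
    """Convert a byte count, as returned by get_memory_pool_limit(), to MiB."""
    return num_bytes / (1 << 20)

# With a live builder config:
#   import tensorrt as trt
#   limit = config.get_memory_pool_limit(trt.MemoryPoolType.WORKSPACE)
#   print(f"workspace limit: {bytes_to_mib(limit):.0f} MiB")
```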

I am wondering: does CUDA/TensorRT see only a small portion of the GPU memory? Could that be the reason for the silent crash?
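One way to check what CUDA itself sees, independently of TensorRT, is to call cudaMemGetInfo from the CUDA runtime directly (a sketch using ctypes; the library names tried here are assumptions and may differ between CUDA installs, and the function returns None when no usable CUDA runtime is found):

```python
import ctypes

def cuda_mem_info():
    """Return (free, total) device memory in bytes as CUDA sees it,
    or None if the CUDA runtime cannot be loaded or the call fails."""
    for name in ("libcudart.so", "libcudart.so.11.0", "libcudart.so.12"):
        try:
            cudart = ctypes.CDLL(name)
            break
        except OSError:
            continue
    else:
        return None
    free, total = ctypes.c_size_t(), ctypes.c_size_t()
    # cudaMemGetInfo(size_t* free, size_t* total) returns 0 (cudaSuccess) on success.
    if cudart.cudaMemGetInfo(ctypes.byref(free), ctypes.byref(total)) != 0:
        return None
    return free.value, total.value
```

If the total reported here is far below the card's ~24 GiB, the problem sits below TensorRT (driver or container runtime) rather than in the builder itself.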

I also provide some verbose trtexec logging output below.

Environment

TensorRT Version: 8.4.1.5
GPU Type: GeForce RTX 3090
Nvidia Driver Version: 520.61.05
CUDA Version: 11.7 Update 1 Preview
CUDNN Version:
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/pytorch:22.07

Logging excerpt:

&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # trtexec --verbose --buildOnly --plugins=libms_deformable_attn.so --plugins=libgrid_sampler.so --onnx=model.onnx --saveEngine=engine.fp32.trt
[11/18/2022-15:32:57] [I] === Model Options ===
[11/18/2022-15:32:57] [I] Format: ONNX
[11/18/2022-15:32:57] [I] Model: model.onnx
[11/18/2022-15:32:57] [I] Output:
[11/18/2022-15:32:57] [I] === Build Options ===
[11/18/2022-15:32:57] [I] Max batch: explicit batch
[11/18/2022-15:32:57] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/18/2022-15:32:57] [I] minTiming: 1
[11/18/2022-15:32:57] [I] avgTiming: 8
[11/18/2022-15:32:57] [I] Precision: FP32
[11/18/2022-15:32:57] [I] LayerPrecisions: 
[11/18/2022-15:32:57] [I] Calibration: 
[11/18/2022-15:32:57] [I] Refit: Disabled
[11/18/2022-15:32:57] [I] Sparsity: Disabled
[11/18/2022-15:32:57] [I] Safe mode: Disabled
[11/18/2022-15:32:57] [I] DirectIO mode: Disabled
[11/18/2022-15:32:57] [I] Restricted mode: Disabled
[11/18/2022-15:32:57] [I] Build only: Enabled
[11/18/2022-15:32:57] [I] Save engine: engine.fp32.trt
[11/18/2022-15:32:57] [I] Load engine: 
[11/18/2022-15:32:57] [I] Profiling verbosity: 0
[11/18/2022-15:32:57] [I] Tactic sources: Using default tactic sources
[11/18/2022-15:32:57] [I] timingCacheMode: local
[11/18/2022-15:32:57] [I] timingCacheFile: 
[11/18/2022-15:32:57] [I] Input(s)s format: fp32:CHW
[11/18/2022-15:32:57] [I] Output(s)s format: fp32:CHW
[11/18/2022-15:32:57] [I] Input build shapes: model
[11/18/2022-15:32:57] [I] Input calibration shapes: model
[11/18/2022-15:32:57] [I] === System Options ===
[11/18/2022-15:32:57] [I] Device: 0
[11/18/2022-15:32:57] [I] DLACore: 
[11/18/2022-15:32:57] [I] Plugins: libgrid_sampler.so libms_deformable_attn.so
[11/18/2022-15:32:57] [I] === Inference Options ===
[11/18/2022-15:32:57] [I] Batch: Explicit
[11/18/2022-15:32:57] [I] Input inference shapes: model
[11/18/2022-15:32:57] [I] Iterations: 10
[11/18/2022-15:32:57] [I] Duration: 3s (+ 200ms warm up)
[11/18/2022-15:32:57] [I] Sleep time: 0ms
[11/18/2022-15:32:57] [I] Idle time: 0ms
[11/18/2022-15:32:57] [I] Streams: 1
[11/18/2022-15:32:57] [I] ExposeDMA: Disabled
[11/18/2022-15:32:57] [I] Data transfers: Enabled
[11/18/2022-15:32:57] [I] Spin-wait: Disabled
[11/18/2022-15:32:57] [I] Multithreading: Disabled
[11/18/2022-15:32:57] [I] CUDA Graph: Disabled
[11/18/2022-15:32:57] [I] Separate profiling: Disabled
[11/18/2022-15:32:57] [I] Time Deserialize: Disabled
[11/18/2022-15:32:57] [I] Time Refit: Disabled
[11/18/2022-15:32:57] [I] Inputs:
[11/18/2022-15:32:57] [I] === Reporting Options ===
[11/18/2022-15:32:57] [I] Verbose: Enabled
[11/18/2022-15:32:57] [I] Averages: 10 inferences
[11/18/2022-15:32:57] [I] Percentile: 99
[11/18/2022-15:32:57] [I] Dump refittable layers:Disabled
[11/18/2022-15:32:57] [I] Dump output: Disabled
[11/18/2022-15:32:57] [I] Profile: Disabled
[11/18/2022-15:32:57] [I] Export timing to JSON file: 
[11/18/2022-15:32:57] [I] Export output to JSON file: 
[11/18/2022-15:32:57] [I] Export profile to JSON file: 
[11/18/2022-15:32:57] [I] 
[11/18/2022-15:32:57] [I] === Device Information ===
[11/18/2022-15:32:57] [I] Selected Device: NVIDIA GeForce RTX 3090
[11/18/2022-15:32:57] [I] Compute Capability: 8.6
[11/18/2022-15:32:57] [I] SMs: 82
[11/18/2022-15:32:57] [I] Compute Clock Rate: 1.695 GHz
[11/18/2022-15:32:57] [I] Device Global Memory: 24265 MiB
[11/18/2022-15:32:57] [I] Shared Memory per SM: 100 KiB
[11/18/2022-15:32:57] [I] Memory Bus Width: 384 bits (ECC disabled)
[11/18/2022-15:32:57] [I] Memory Clock Rate: 9.751 GHz
[11/18/2022-15:32:57] [I] 
[11/18/2022-15:32:57] [I] TensorRT version: 8.4.1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::Proposal version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::Split version 1
[11/18/2022-15:32:57] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[11/18/2022-15:32:57] [I] Loading supplied plugin library: libgrid_sampler.so
[11/18/2022-15:32:57] [I] Loading supplied plugin library: libms_deformable_attn.so
[11/18/2022-15:32:58] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 336, GPU 443 (MiB)
[11/18/2022-15:32:58] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +328, GPU +104, now: CPU 683, GPU 547 (MiB)
[11/18/2022-15:32:58] [I] Start parsing network model
[11/18/2022-15:32:58] [I] [TRT] ----------------------------------------------------------------
[11/18/2022-15:32:58] [I] [TRT] Input filename:   model.onnx
[11/18/2022-15:32:58] [I] [TRT] ONNX IR version:  0.0.8
[11/18/2022-15:32:58] [I] [TRT] Opset version:    16
[11/18/2022-15:32:58] [I] [TRT] Producer name:    pytorch
[11/18/2022-15:32:58] [I] [TRT] Producer version: 1.13.0
[11/18/2022-15:32:58] [I] [TRT] Domain:           
[11/18/2022-15:32:58] [I] [TRT] Model version:    0
[11/18/2022-15:32:58] [I] [TRT] Doc string:       
[11/18/2022-15:32:58] [I] [TRT] ----------------------------------------------------------------

Hi,
Request you to share the ONNX model and the script if not shared already so that we can assist you better.
Alongside, you can try a few things:

  1. Validate your model with the snippet below.

check_model.py

import onnx

# Load the model and run ONNX's structural checker; it raises an
# exception if the model is malformed.
filename = "model.onnx"  # path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)

  2. Try running your model with the trtexec command.

In case you are still facing the issue, request you to share the trtexec --verbose log for further debugging.
Thanks!