On my Jetson AGX Xavier 32GB (JetPack 4.6.1 Docker, TensorRT 8.2.1), I was able to build your model by reducing the maxShapes of keypoints and scores.1 from 8000 to 3000.
Command:
trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes="keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64" --optShapes="keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64" --maxShapes="keypoints:1x3000x2,scores.1:1x3000,score_map:1x1x320x256,dense_feat_map:1x128x190x173" --workspace=30000
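In case it is easier to experiment with the profile values from code, here is a minimal sketch of the same build done through the TensorRT Python API; the file names, workspace size, and min/opt/max shapes are copied from the command above, and everything else (e.g. writing the plan straight to disk) is just an assumption about your setup:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the constant-folded ONNX model
with open("model2_folded.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 30000 << 20  # 30000 MiB, same as --workspace=30000

# One optimization profile with the same min/opt/max shapes as the trtexec command
profile = builder.create_optimization_profile()
profile.set_shape("keypoints",      (1, 1, 2),        (1, 2565, 2),     (1, 3000, 2))
profile.set_shape("scores.1",       (1, 1),           (1, 2565),        (1, 3000))
profile.set_shape("score_map",      (1, 1, 320, 256), (1, 1, 320, 256), (1, 1, 320, 256))
profile.set_shape("dense_feat_map", (1, 128, 80, 64), (1, 128, 80, 64), (1, 128, 190, 173))
config.add_optimization_profile(profile)

plan = builder.build_serialized_network(network, config)
with open("model2_folded.engine", "wb") as f:
    f.write(plan)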
Log:
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x3000x2,scores.1:1x3000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000
[08/01/2022-10:27:47] [I] === Model Options ===
[08/01/2022-10:27:47] [I] Format: ONNX
[08/01/2022-10:27:47] [I] Model: model2_folded.onnx
[08/01/2022-10:27:47] [I] Output:
[08/01/2022-10:27:47] [I] === Build Options ===
[08/01/2022-10:27:47] [I] Max batch: explicit batch
[08/01/2022-10:27:47] [I] Workspace: 30000 MiB
[08/01/2022-10:27:47] [I] minTiming: 1
[08/01/2022-10:27:47] [I] avgTiming: 8
[08/01/2022-10:27:47] [I] Precision: FP32
[08/01/2022-10:27:47] [I] Calibration:
[08/01/2022-10:27:47] [I] Refit: Disabled
[08/01/2022-10:27:47] [I] Sparsity: Disabled
[08/01/2022-10:27:47] [I] Safe mode: Disabled
[08/01/2022-10:27:47] [I] DirectIO mode: Disabled
[08/01/2022-10:27:47] [I] Restricted mode: Disabled
[08/01/2022-10:27:47] [I] Save engine: model2_folded.engine
[08/01/2022-10:27:47] [I] Load engine:
[08/01/2022-10:27:47] [I] Profiling verbosity: 0
[08/01/2022-10:27:47] [I] Tactic sources: Using default tactic sources
[08/01/2022-10:27:47] [I] timingCacheMode: local
[08/01/2022-10:27:47] [I] timingCacheFile:
[08/01/2022-10:27:47] [I] Input(s)s format: fp32:CHW
[08/01/2022-10:27:47] [I] Output(s)s format: fp32:CHW
[08/01/2022-10:27:47] [I] Input build shape: score_map=1x1x320x256+1x1x320x256+1x1x320x256
[08/01/2022-10:27:47] [I] Input build shape: dense_feat_map=1x128x80x64+1x128x80x64+1x128x190x173
[08/01/2022-10:27:47] [I] Input build shape: keypoints=1x1x2+1x2565x2+1x3000x2
[08/01/2022-10:27:47] [I] Input build shape: scores.1=1x1+1x2565+1x3000
[08/01/2022-10:27:47] [I] Input calibration shapes: model
[08/01/2022-10:27:47] [I] === System Options ===
[08/01/2022-10:27:47] [I] Device: 0
[08/01/2022-10:27:47] [I] DLACore:
[08/01/2022-10:27:47] [I] Plugins:
[08/01/2022-10:27:47] [I] === Inference Options ===
[08/01/2022-10:27:47] [I] Batch: Explicit
[08/01/2022-10:27:47] [I] Input inference shape: score_map=1x1x320x256
[08/01/2022-10:27:47] [I] Input inference shape: scores.1=1x2565
[08/01/2022-10:27:47] [I] Input inference shape: keypoints=1x2565x2
[08/01/2022-10:27:47] [I] Input inference shape: dense_feat_map=1x128x80x64
[08/01/2022-10:27:47] [I] Iterations: 10
[08/01/2022-10:27:47] [I] Duration: 3s (+ 200ms warm up)
[08/01/2022-10:27:47] [I] Sleep time: 0ms
[08/01/2022-10:27:47] [I] Idle time: 0ms
[08/01/2022-10:27:47] [I] Streams: 1
[08/01/2022-10:27:47] [I] ExposeDMA: Disabled
[08/01/2022-10:27:47] [I] Data transfers: Enabled
[08/01/2022-10:27:47] [I] Spin-wait: Disabled
[08/01/2022-10:27:47] [I] Multithreading: Disabled
[08/01/2022-10:27:47] [I] CUDA Graph: Disabled
[08/01/2022-10:27:47] [I] Separate profiling: Disabled
[08/01/2022-10:27:47] [I] Time Deserialize: Disabled
[08/01/2022-10:27:47] [I] Time Refit: Disabled
[08/01/2022-10:27:47] [I] Skip inference: Disabled
[08/01/2022-10:27:47] [I] Inputs:
[08/01/2022-10:27:47] [I] === Reporting Options ===
[08/01/2022-10:27:47] [I] Verbose: Disabled
[08/01/2022-10:27:47] [I] Averages: 10 inferences
[08/01/2022-10:27:47] [I] Percentile: 99
[08/01/2022-10:27:47] [I] Dump refittable layers:Disabled
[08/01/2022-10:27:47] [I] Dump output: Disabled
[08/01/2022-10:27:47] [I] Profile: Disabled
[08/01/2022-10:27:47] [I] Export timing to JSON file:
[08/01/2022-10:27:47] [I] Export output to JSON file:
[08/01/2022-10:27:47] [I] Export profile to JSON file:
[08/01/2022-10:27:47] [I]
[08/01/2022-10:27:47] [I] === Device Information ===
[08/01/2022-10:27:47] [I] Selected Device: Xavier
[08/01/2022-10:27:47] [I] Compute Capability: 7.2
[08/01/2022-10:27:47] [I] SMs: 8
[08/01/2022-10:27:47] [I] Compute Clock Rate: 1.377 GHz
[08/01/2022-10:27:47] [I] Device Global Memory: 31928 MiB
[08/01/2022-10:27:47] [I] Shared Memory per SM: 96 KiB
[08/01/2022-10:27:47] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/01/2022-10:27:47] [I] Memory Clock Rate: 1.377 GHz
[08/01/2022-10:27:47] [I]
[08/01/2022-10:27:47] [I] TensorRT version: 8.2.1
[08/01/2022-10:27:48] [I] [TRT] [MemUsageChange] Init CUDA: CPU +362, GPU +0, now: CPU 381, GPU 3064 (MiB)
[08/01/2022-10:27:48] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 3064 MiB
[08/01/2022-10:27:48] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 3171 MiB
[08/01/2022-10:27:48] [I] Start parsing network model
[08/01/2022-10:27:48] [I] [TRT] ----------------------------------------------------------------
[08/01/2022-10:27:48] [I] [TRT] Input filename: model2_folded.onnx
[08/01/2022-10:27:48] [I] [TRT] ONNX IR version: 0.0.7
[08/01/2022-10:27:48] [I] [TRT] Opset version: 13
[08/01/2022-10:27:48] [I] [TRT] Producer name:
[08/01/2022-10:27:48] [I] [TRT] Producer version:
[08/01/2022-10:27:48] [I] [TRT] Domain:
[08/01/2022-10:27:48] [I] [TRT] Model version: 0
[08/01/2022-10:27:48] [I] [TRT] Doc string:
[08/01/2022-10:27:48] [I] [TRT] ----------------------------------------------------------------
[08/01/2022-10:27:48] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [I] Finish parsing network model
[08/01/2022-10:27:49] [W] [TRT] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[08/01/2022-10:27:49] [I] [TRT] ---------- Layers Running on DLA ----------
[08/01/2022-10:27:49] [I] [TRT] ---------- Layers Running on GPU ----------
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_10
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_8
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_6
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_4
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_2
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] {ForeignNode[19 + (Unnamed Layer* 29) [Shuffle]...Concat_45]}
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Transpose_242 + Flatten_243
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Transpose_91
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] [HostToDeviceCopy]
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] PWN(Clip_46)
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] {ForeignNode[55...Div_261]}
[08/01/2022-10:27:50] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +381, now: CPU 713, GPU 3554 (MiB)
[08/01/2022-10:27:51] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +308, GPU +510, now: CPU 1021, GPU 4064 (MiB)
[08/01/2022-10:27:51] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/01/2022-10:28:18] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[08/01/2022-10:28:18] [W] [TRT] (# 3 (SHAPE dense_feat_map))
[08/01/2022-10:28:18] [W] [TRT] (# 1 (SHAPE keypoints))
[08/01/2022-10:28:18] [W] [TRT] (# 2 (SHAPE dense_feat_map))
[08/01/2022-10:28:35] [I] [TRT] Detected 4 inputs and 3 output network tensors.
[08/01/2022-10:28:35] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[08/01/2022-10:28:35] [W] [TRT] (# 3 (SHAPE dense_feat_map))
[08/01/2022-10:28:35] [W] [TRT] (# 1 (SHAPE keypoints))
[08/01/2022-10:28:35] [W] [TRT] (# 2 (SHAPE dense_feat_map))
[08/01/2022-10:28:35] [I] [TRT] Total Host Persistent Memory: 11440
[08/01/2022-10:28:35] [I] [TRT] Total Device Persistent Memory: 0
[08/01/2022-10:28:35] [I] [TRT] Total Scratch Memory: 18433596128
[08/01/2022-10:28:35] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 17667 MiB
[08/01/2022-10:28:35] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 1.21227ms to assign 12 blocks to 19 nodes requiring 18452722176 bytes.
[08/01/2022-10:28:35] [I] [TRT] Total Activation Memory: 18452722176
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1481, GPU 9003 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1481, GPU 9003 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1480, GPU 8983 (MiB)
[08/01/2022-10:28:35] [I] [TRT] Loaded engine size: 6 MiB
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1487, GPU 8983 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1487, GPU 8983 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[08/01/2022-10:28:35] [I] Engine built in 48.4199 sec.
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1375, GPU 8986 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1375, GPU 8986 (MiB)
[08/01/2022-10:28:38] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +17597, now: CPU 0, GPU 17597 (MiB)
[08/01/2022-10:28:38] [I] Using random values for input keypoints
[08/01/2022-10:28:38] [I] Created input binding for keypoints with dimensions 1x2565x2
[08/01/2022-10:28:38] [I] Using random values for input scores.1
[08/01/2022-10:28:38] [I] Created input binding for scores.1 with dimensions 1x2565
[08/01/2022-10:28:38] [I] Using random values for input score_map
[08/01/2022-10:28:38] [I] Created input binding for score_map with dimensions 1x1x320x256
[08/01/2022-10:28:38] [I] Using random values for input dense_feat_map
[08/01/2022-10:28:38] [I] Created input binding for dense_feat_map with dimensions 1x128x80x64
[08/01/2022-10:28:38] [I] Using random values for output scores
[08/01/2022-10:28:38] [I] Created output binding for scores with dimensions 2565x1
[08/01/2022-10:28:38] [I] Using random values for output descs
[08/01/2022-10:28:38] [I] Created output binding for descs with dimensions 1x2565x128
[08/01/2022-10:28:38] [I] Using random values for output kpts
[08/01/2022-10:28:38] [I] Created output binding for kpts with dimensions 1x2565x2
[08/01/2022-10:28:38] [I] Starting inference
[08/01/2022-10:28:42] [I] Warmup completed 1 queries over 200 ms
[08/01/2022-10:28:42] [I] Timing trace has 17 queries over 3.52191 s
[08/01/2022-10:28:42] [I]
[08/01/2022-10:28:42] [I] === Trace details ===
[08/01/2022-10:28:42] [I] Trace averages of 10 runs:
[08/01/2022-10:28:42] [I] Average on 10 runs - GPU latency: 207.042 ms - Host latency: 207.221 ms (end to end 207.23 ms, enqueue 0.532111 ms)
[08/01/2022-10:28:42] [I]
[08/01/2022-10:28:42] [I] === Performance summary ===
[08/01/2022-10:28:42] [I] Throughput: 4.82692 qps
[08/01/2022-10:28:42] [I] Latency: min = 205.671 ms, max = 208.697 ms, mean = 207.162 ms, median = 207.218 ms, percentile(99%) = 208.697 ms
[08/01/2022-10:28:42] [I] End-to-End Host Latency: min = 205.679 ms, max = 208.708 ms, mean = 207.171 ms, median = 207.223 ms, percentile(99%) = 208.708 ms
[08/01/2022-10:28:42] [I] Enqueue Time: min = 0.421387 ms, max = 0.639282 ms, mean = 0.496025 ms, median = 0.485596 ms, percentile(99%) = 0.639282 ms
[08/01/2022-10:28:42] [I] H2D Latency: min = 0.120239 ms, max = 0.126587 ms, mean = 0.122923 ms, median = 0.122711 ms, percentile(99%) = 0.126587 ms
[08/01/2022-10:28:42] [I] GPU Compute Time: min = 205.492 ms, max = 208.514 ms, mean = 206.984 ms, median = 207.036 ms, percentile(99%) = 208.514 ms
[08/01/2022-10:28:42] [I] D2H Latency: min = 0.0415039 ms, max = 0.0603027 ms, mean = 0.0559046 ms, median = 0.0562744 ms, percentile(99%) = 0.0603027 ms
[08/01/2022-10:28:42] [I] Total Host Walltime: 3.52191 s
[08/01/2022-10:28:42] [I] Total GPU Compute Time: 3.51872 s
[08/01/2022-10:28:42] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/01/2022-10:28:42] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x3000x2,scores.1:1x3000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000
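Note that with these dynamic shapes the actual keypoint count has to be set on the execution context before every inference; trtexec does this internally with the opt shape (1x2565x2 above). A minimal sketch of that step, assuming a pycuda-based runner and skipping the host-to-device copies of real input data:

import numpy as np
import pycuda.autoinit  # noqa: F401 (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
with open("model2_folded.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

n_kpts = 2565  # keypoints detected for this frame, must be <= the 3000 max
context.set_binding_shape(engine.get_binding_index("keypoints"), (1, n_kpts, 2))
context.set_binding_shape(engine.get_binding_index("scores.1"), (1, n_kpts))
context.set_binding_shape(engine.get_binding_index("score_map"), (1, 1, 320, 256))
context.set_binding_shape(engine.get_binding_index("dense_feat_map"), (1, 128, 80, 64))

# Allocate device buffers from the now-resolved binding shapes (real input data
# would be copied into the input buffers here before execution)
buffers = []
for i in range(engine.num_bindings):
    size = trt.volume(context.get_binding_shape(i))
    dtype = np.dtype(trt.nptype(engine.get_binding_dtype(i)))
    buffers.append(cuda.mem_alloc(size * dtype.itemsize))

context.execute_v2([int(b) for b in buffers])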
Can you try TensorRT 8.2.1 on your PC?
There are several fixes between 8.2.1 and 8.4.0, so if this is simply a version issue, it may also work with the JetPack 5.0.1 release you are targeting.
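If it helps to confirm which TensorRT version is actually picked up on your side, a quick check via the Python bindings (assuming they are installed) is:

import tensorrt as trt
print(trt.__version__)  # reports 8.2.1.x inside the JetPack 4.6.1 container used above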