Thanks for your reply. I found the trtexec tool and used it to build an engine, but the prediction is still wrong when using FP16. The repo and the script are below.
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # ./trtexec --onnx=/home/trtexec_test/nnUNet_model_best.onnx --explicitBatch --saveEngine=/home/trtexec_test/nnUNet_model_best.trt --fp16
[06/20/2022-11:15:49] [W] --explicitBatch flag has been deprecated and has no effect!
[06/20/2022-11:15:49] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[06/20/2022-11:15:49] [I] === Model Options ===
[06/20/2022-11:15:49] [I] Format: ONNX
[06/20/2022-11:15:49] [I] Model: /home/trtexec_test/nnUNet_model_best.onnx
[06/20/2022-11:15:49] [I] Output:
[06/20/2022-11:15:49] [I] === Build Options ===
[06/20/2022-11:15:49] [I] Max batch: explicit batch
[06/20/2022-11:15:49] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/20/2022-11:15:49] [I] minTiming: 1
[06/20/2022-11:15:49] [I] avgTiming: 8
[06/20/2022-11:15:49] [I] Precision: FP32+FP16
[06/20/2022-11:15:49] [I] LayerPrecisions:
[06/20/2022-11:15:49] [I] Calibration:
[06/20/2022-11:15:49] [I] Refit: Disabled
[06/20/2022-11:15:49] [I] Sparsity: Disabled
[06/20/2022-11:15:49] [I] Safe mode: Disabled
[06/20/2022-11:15:49] [I] DirectIO mode: Disabled
[06/20/2022-11:15:49] [I] Restricted mode: Disabled
[06/20/2022-11:15:49] [I] Save engine: /home/hurwa/trtexec_test/nnUNet_model_best.trt
[06/20/2022-11:15:49] [I] Load engine:
[06/20/2022-11:15:49] [I] Profiling verbosity: 0
[06/20/2022-11:15:49] [I] Tactic sources: Using default tactic sources
[06/20/2022-11:15:49] [I] timingCacheMode: local
[06/20/2022-11:15:49] [I] timingCacheFile:
[06/20/2022-11:15:49] [I] Input(s)s format: fp32:CHW
[06/20/2022-11:15:49] [I] Output(s)s format: fp32:CHW
[06/20/2022-11:15:49] [I] Input build shapes: model
[06/20/2022-11:15:49] [I] Input calibration shapes: model
[06/20/2022-11:15:49] [I] === System Options ===
[06/20/2022-11:15:49] [I] Device: 0
[06/20/2022-11:15:49] [I] DLACore:
[06/20/2022-11:15:49] [I] Plugins:
[06/20/2022-11:15:49] [I] === Inference Options ===
[06/20/2022-11:15:49] [I] Batch: Explicit
[06/20/2022-11:15:49] [I] Input inference shapes: model
[06/20/2022-11:15:49] [I] Iterations: 10
[06/20/2022-11:15:49] [I] Duration: 3s (+ 200ms warm up)
[06/20/2022-11:15:49] [I] Sleep time: 0ms
[06/20/2022-11:15:49] [I] Idle time: 0ms
[06/20/2022-11:15:49] [I] Streams: 1
[06/20/2022-11:15:49] [I] ExposeDMA: Disabled
[06/20/2022-11:15:49] [I] Data transfers: Enabled
[06/20/2022-11:15:49] [I] Spin-wait: Disabled
[06/20/2022-11:15:49] [I] Multithreading: Disabled
[06/20/2022-11:15:49] [I] CUDA Graph: Disabled
[06/20/2022-11:15:49] [I] Separate profiling: Disabled
[06/20/2022-11:15:49] [I] Time Deserialize: Disabled
[06/20/2022-11:15:49] [I] Time Refit: Disabled
[06/20/2022-11:15:49] [I] Skip inference: Disabled
[06/20/2022-11:15:49] [I] Inputs:
[06/20/2022-11:15:49] [I] === Reporting Options ===
[06/20/2022-11:15:49] [I] Verbose: Disabled
[06/20/2022-11:15:49] [I] Averages: 10 inferences
[06/20/2022-11:15:49] [I] Percentile: 99
[06/20/2022-11:15:49] [I] Dump refittable layers:Disabled
[06/20/2022-11:15:49] [I] Dump output: Disabled
[06/20/2022-11:15:49] [I] Profile: Disabled
[06/20/2022-11:15:49] [I] Export timing to JSON file:
[06/20/2022-11:15:49] [I] Export output to JSON file:
[06/20/2022-11:15:49] [I] Export profile to JSON file:
[06/20/2022-11:15:49] [I]
[06/20/2022-11:15:49] [I] === Device Information ===
[06/20/2022-11:15:49] [I] Selected Device: NVIDIA GeForce RTX 3090
[06/20/2022-11:15:49] [I] Compute Capability: 8.6
[06/20/2022-11:15:49] [I] SMs: 82
[06/20/2022-11:15:49] [I] Compute Clock Rate: 1.695 GHz
[06/20/2022-11:15:49] [I] Device Global Memory: 24259 MiB
[06/20/2022-11:15:49] [I] Shared Memory per SM: 100 KiB
[06/20/2022-11:15:49] [I] Memory Bus Width: 384 bits (ECC disabled)
[06/20/2022-11:15:49] [I] Memory Clock Rate: 9.751 GHz
[06/20/2022-11:15:49] [I]
[06/20/2022-11:15:49] [I] TensorRT version: 8.4.0
[06/20/2022-11:15:50] [I] [TRT] [MemUsageChange] Init CUDA: CPU +357, GPU +0, now: CPU 365, GPU 860 (MiB)
[06/20/2022-11:15:50] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 384 MiB, GPU 860 MiB
[06/20/2022-11:15:50] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 759 MiB, GPU 982 MiB
[06/20/2022-11:15:50] [I] Start parsing network model
[06/20/2022-11:15:50] [I] [TRT] ----------------------------------------------------------------
[06/20/2022-11:15:50] [I] [TRT] Input filename: /home/trtexec_test/nnUNet_model_best.onnx
[06/20/2022-11:15:50] [I] [TRT] ONNX IR version: 0.0.5
[06/20/2022-11:15:50] [I] [TRT] Opset version: 10
[06/20/2022-11:15:50] [I] [TRT] Producer name: pytorch
[06/20/2022-11:15:50] [I] [TRT] Producer version: 1.11.0
[06/20/2022-11:15:50] [I] [TRT] Domain:
[06/20/2022-11:15:50] [I] [TRT] Model version: 0
[06/20/2022-11:15:50] [I] [TRT] Doc string:
[06/20/2022-11:15:50] [I] [TRT] ----------------------------------------------------------------
[06/20/2022-11:15:52] [I] Finish parsing network model
[06/20/2022-11:15:52] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:15:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1830, GPU 1863 (MiB)
[06/20/2022-11:15:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +13, now: CPU 1830, GPU 1876 (MiB)
[06/20/2022-11:15:52] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/20/2022-11:18:55] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[06/20/2022-11:18:56] [I] [TRT] Total Host Persistent Memory: 102496
[06/20/2022-11:18:56] [I] [TRT] Total Device Persistent Memory: 10820608
[06/20/2022-11:18:56] [I] [TRT] Total Scratch Memory: 2397440
[06/20/2022-11:18:56] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 85 MiB, GPU 7319 MiB
[06/20/2022-11:18:56] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 10.6542ms to assign 9 blocks to 125 nodes requiring 654170112 bytes.
[06/20/2022-11:18:56] [I] [TRT] Total Activation Memory: 654170112
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3140, GPU 2826 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 2834 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +85, GPU +96, now: CPU 85, GPU 96 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3217, GPU 2286 (MiB)
[06/20/2022-11:18:56] [I] [TRT] Loaded engine size: 86 MiB
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3225, GPU 2594 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3225, GPU 2604 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +95, now: CPU 0, GPU 95 (MiB)
[06/20/2022-11:18:56] [I] Engine built in 187.03 sec.
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2502, GPU 2280 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2502, GPU 2288 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +634, now: CPU 0, GPU 729 (MiB)
[06/20/2022-11:18:56] [I] Using random values for input input
[06/20/2022-11:18:56] [I] Created input binding for input with dimensions 1x1x96x256x96
[06/20/2022-11:18:56] [I] Using random values for output output6
[06/20/2022-11:18:56] [I] Created output binding for output6 with dimensions 1x3x6x8x6
[06/20/2022-11:18:56] [I] Using random values for output output5
[06/20/2022-11:18:56] [I] Created output binding for output5 with dimensions 1x3x6x16x6
[06/20/2022-11:18:56] [I] Using random values for output output4
[06/20/2022-11:18:56] [I] Created output binding for output4 with dimensions 1x3x12x32x12
[06/20/2022-11:18:56] [I] Using random values for output output3
[06/20/2022-11:18:56] [I] Created output binding for output3 with dimensions 1x3x24x64x24
[06/20/2022-11:18:56] [I] Using random values for output output2
[06/20/2022-11:18:56] [I] Created output binding for output2 with dimensions 1x3x48x128x48
[06/20/2022-11:18:56] [I] Using random values for output output1
[06/20/2022-11:18:56] [I] Created output binding for output1 with dimensions 1x3x96x256x96
[06/20/2022-11:18:56] [I] Starting inference
[06/20/2022-11:18:59] [I] Warmup completed 10 queries over 200 ms
[06/20/2022-11:18:59] [I] Timing trace has 137 queries over 3.07714 s
[06/20/2022-11:18:59] [I]
[06/20/2022-11:18:59] [I] === Trace details ===
[06/20/2022-11:18:59] [I] Trace averages of 10 runs:
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.6876 ms - Host latency: 24.8693 ms (end to end 44.7577 ms, enqueue 1.72647 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.0524 ms - Host latency: 25.3096 ms (end to end 46.1373 ms, enqueue 1.93383 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.14 ms - Host latency: 23.3075 ms (end to end 41.8735 ms, enqueue 1.66331 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.833 ms - Host latency: 24.0656 ms (end to end 43.392 ms, enqueue 1.86876 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.1491 ms - Host latency: 25.4065 ms (end to end 45.819 ms, enqueue 1.83251 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.6759 ms - Host latency: 24.9184 ms (end to end 45.1935 ms, enqueue 1.77281 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.3631 ms - Host latency: 23.5985 ms (end to end 42.2557 ms, enqueue 1.77977 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.7853 ms - Host latency: 26.0267 ms (end to end 47.2052 ms, enqueue 1.77408 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.4162 ms - Host latency: 24.6669 ms (end to end 44.816 ms, enqueue 1.80068 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.3482 ms - Host latency: 23.5172 ms (end to end 42.3275 ms, enqueue 1.89629 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.2345 ms - Host latency: 23.4628 ms (end to end 42.1894 ms, enqueue 1.91763 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.0117 ms - Host latency: 24.2404 ms (end to end 43.5459 ms, enqueue 1.95603 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.8791 ms - Host latency: 25.1237 ms (end to end 45.6101 ms, enqueue 1.76414 ms)
[06/20/2022-11:18:59] [I]
[06/20/2022-11:18:59] [I] === Performance summary ===
[06/20/2022-11:18:59] [I] Throughput: 44.5219 qps
[06/20/2022-11:18:59] [I] Latency: min = 23.0624 ms, max = 27.8524 ms, mean = 24.5346 ms, median = 23.9123 ms, percentile(99%) = 27.2732 ms
[06/20/2022-11:18:59] [I] End-to-End Host Latency: min = 41.616 ms, max = 48.0995 ms, mean = 44.2912 ms, median = 43.2527 ms, percentile(99%) = 48.0986 ms
[06/20/2022-11:18:59] [I] Enqueue Time: min = 0.677887 ms, max = 2.58347 ms, mean = 1.81925 ms, median = 1.81586 ms, percentile(99%) = 2.40967 ms
[06/20/2022-11:18:59] [I] H2D Latency: min = 0.386871 ms, max = 0.479736 ms, mean = 0.437658 ms, median = 0.438965 ms, percentile(99%) = 0.464966 ms
[06/20/2022-11:18:59] [I] GPU Compute Time: min = 20.8108 ms, max = 25.9625 ms, mean = 22.3062 ms, median = 21.6658 ms, percentile(99%) = 24.9928 ms
[06/20/2022-11:18:59] [I] D2H Latency: min = 1.44308 ms, max = 1.88623 ms, mean = 1.7908 ms, median = 1.81104 ms, percentile(99%) = 1.88232 ms
[06/20/2022-11:18:59] [I] Total Host Walltime: 3.07714 s
[06/20/2022-11:18:59] [I] Total GPU Compute Time: 3.05595 s
[06/20/2022-11:18:59] [W] * GPU compute time is unstable, with coefficient of variance = 5.60913%.
[06/20/2022-11:18:59] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/20/2022-11:18:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/20/2022-11:18:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # ./trtexec --onnx=/home/trtexec_test/nnUNet_model_best.onnx --explicitBatch --saveEngine=/home/trtexec_test/nnUNet_model_best.trt --fp16
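
Note that the run above only measures timing: trtexec fed the engine random values ("Using random values for input input"), so PASSED says nothing about accuracy. To narrow down where FP16 goes wrong, it may help to run the FP32 and FP16 engines on the same real input and compare the outputs numerically. Below is a minimal sketch of such a comparison in plain Python. The function name `compare_outputs`, the tolerances, and the sample values are all hypothetical; it assumes the outputs have already been flattened into lists of floats (for example after exporting them from each engine).

```python
import math


def compare_outputs(reference, candidate, atol=1e-2, rtol=1e-2):
    """Summarise how far `candidate` (e.g. FP16) drifts from `reference` (e.g. FP32).

    Both arguments are flat sequences of floats of equal length.
    The tolerances are illustrative defaults, not recommended values.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")

    non_finite = 0   # Inf/NaN in the candidate, a typical sign of FP16 overflow
    mismatched = 0   # elements outside atol + rtol * |reference|
    max_abs_diff = 0.0

    for ref, cand in zip(reference, candidate):
        if not math.isfinite(cand):
            non_finite += 1
            mismatched += 1
            continue
        diff = abs(ref - cand)
        max_abs_diff = max(max_abs_diff, diff)
        if diff > atol + rtol * abs(ref):
            mismatched += 1

    return {
        "non_finite": non_finite,
        "mismatched": mismatched,
        "max_abs_diff": max_abs_diff,
    }


# Made-up values: the last element mimics an activation that exceeds the
# FP16 range (largest finite value 65504) and becomes Inf.
fp32_out = [0.10, 2.50, -1.00, 70000.0]
fp16_out = [0.10, 2.49, -1.00, float("inf")]

report = compare_outputs(fp32_out, fp16_out)
print(report)
```

If `non_finite` is non-zero, overflow in some intermediate layer is a likely suspect; if the values are finite but `mismatched` is large, accumulated rounding error is more probable. Either way, this is only a starting point for locating the problem layer, not a diagnosis.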