Why do different input sizes cause different performance?

Description

I convert a .pth model (PyTorch) to an ONNX model, and then to an engine file for TensorRT inference in C++ on Windows 10. We find that the input size baked into the ONNX model greatly affects the quality of the TensorRT results. For example, the network was trained in PyTorch on image sequences of size 150x150x40. When we export the ONNX model with an input size of 200x200x40 or 100x100x300, the results are good. However, when the ONNX input size is 150x150x40 (the export code is shown below), the results are much worse than before: the originally dark background becomes light. We also tested different input sequence sizes in PyTorch directly, and the results are all good there. The image sequence is 16-bit TIFF data (min: 26520, max: 49546, size: 512x512). Why could this happen, and what should I do? Looking forward to your reply. Thank you.
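
One way to localize the problem is to compare the PyTorch output with ONNX Runtime on the exact 150x150x40 input: if the two agree, the degradation is introduced during the TensorRT engine build rather than by the ONNX export. A minimal sketch, assuming onnxruntime is installed and reusing the model variable and ONNX file name from the export script below:

    import numpy as np
    import onnxruntime as ort
    import torch

    # reuse `model` (weights loaded, eval mode, on GPU) from the export script
    x = torch.randn(1, 1, 40, 150, 150)          # the problematic shape
    with torch.no_grad():
        ref = model(x.cuda()).cpu().numpy()      # PyTorch reference output

    sess = ort.InferenceSession('NP02_150_40_1.onnx')
    out = sess.run(['output'], {'input': x.numpy()})[0]

    # a large difference implicates the ONNX export; a near-zero difference
    # means the problem appears during the TensorRT engine build
    print('max abs diff:', np.abs(ref - out).max())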

Environment

TensorRT Version: 7.2.1
GPU Type: GeForce RTX 3090
CUDA Version: 11.0
CUDNN Version: 8.0.5
Operating System + Version: windows 10
Python Version (if applicable): 3.6
PyTorch Version (if applicable): 1.7.1

Relevant Files

convert pth to onnx (python):

    import torch
    import torch.nn as nn

    if isinstance(denoise_generator, nn.DataParallel):
        # DataParallel wraps the model; load into the underlying module
        denoise_generator.module.load_state_dict(torch.load(model_name))
        model = denoise_generator.module
    else:
        denoise_generator.load_state_dict(torch.load(model_name))
        model = denoise_generator
    model = model.eval().cuda()

    input_name = ['input']
    output_name = ['output']
    # export with the fixed shape 1x1x40x150x150 (NCDHW)
    dummy_input = torch.randn(1, 1, 40, 150, 150).cuda()
    torch.onnx.export(model, dummy_input, 'NP02_150_40_1.onnx',
                      export_params=True, opset_version=11,
                      do_constant_folding=True, input_names=input_name,
                      output_names=output_name, verbose=True)
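
Since the goal is to test several input sizes, another option (a sketch, not what I did above) is to export a single ONNX file with dynamic depth/height/width via the dynamic_axes argument of torch.onnx.export, reusing model and dummy_input from the script above; the file name NP02_dynamic.onnx is only a placeholder:

    # axis indices follow the 1x1x40x150x150 (NCDHW) layout used above
    torch.onnx.export(model, dummy_input, 'NP02_dynamic.onnx',
                      export_params=True, opset_version=11,
                      do_constant_folding=True,
                      input_names=['input'], output_names=['output'],
                      dynamic_axes={'input': {2: 'depth', 3: 'height', 4: 'width'},
                                    'output': {2: 'depth', 3: 'height', 4: 'width'}})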

convert onnx to engine file (command):

TensorRT-7.2.1.6.Windows10.x86_64.cuda-11.0.cudnn8.0\TensorRT-7.2.1.6\bin\trtexec.exe --onnx=NP02_150_40_1.onnx --explicitBatch --saveEngine=NP02_150_40_1.engine --workspace=2000 --fp16
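
If the dynamic-axes export sketched above is used instead, trtexec needs explicit shape ranges when building the engine, roughly along these lines (the ranges are chosen to cover the sizes mentioned in the question):

trtexec.exe --onnx=NP02_dynamic.onnx --explicitBatch --saveEngine=NP02_dynamic.engine --workspace=2000 --minShapes=input:1x1x40x100x100 --optShapes=input:1x1x40x150x150 --maxShapes=input:1x1x300x200x200 --fp16

Building once without --fp16 is also a quick way to check whether half precision contributes to the brightened background, since the raw 16-bit intensities are large.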

Inference process (c++):

#include <cassert>
#include <fstream>
#include <cuda_runtime_api.h>
#include "NvInfer.h"

using namespace nvinfer1;

// gLogger, model_name, INPUT_BLOB_NAME, OUTPUT_BLOB_NAME, data and output
// are defined elsewhere in the application

// read the serialized engine from disk
std::ifstream file(model_name, std::ios::binary);
file.seekg(0, file.end);
int length = file.tellg();
file.seekg(0, file.beg);
char* trtModelStream = new char[length];
file.read(trtModelStream, length);
file.close();

// deserialize the engine and create an execution context
IRuntime* runtime = createInferRuntime(gLogger);
assert(runtime != nullptr);
ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, length, nullptr);
assert(engine != nullptr);
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);
delete[] trtModelStream;

assert(engine->getNbBindings() == 2);
void* buffers[2];
cudaSetDevice(0);
const int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);
const int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);
const int BATCH_SIZE = 1, INPUT_S = 40, INPUT_H = 150, INPUT_W = 150;
const size_t volume = (size_t)BATCH_SIZE * INPUT_S * INPUT_H * INPUT_W;

// create GPU buffers (the denoiser's output has the same shape as its input)
cudaMalloc(&buffers[inputIndex], volume * sizeof(float));
cudaMalloc(&buffers[outputIndex], volume * sizeof(float));

// create a CUDA stream for the execution of this inference
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(buffers[inputIndex], data, volume * sizeof(float), cudaMemcpyHostToDevice, stream);
context->enqueueV2(buffers, stream, nullptr);
cudaMemcpyAsync(output, buffers[outputIndex], volume * sizeof(float), cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);

// release the stream and device buffers
cudaStreamDestroy(stream);
cudaFree(buffers[inputIndex]);
cudaFree(buffers[outputIndex]);

Hi,
Request you to share the ONNX model and the script, if not shared already, so that we can assist you better.
Alongside, you can try a few things:

  1. Validate your model with the below snippet.

check_model.py

import onnx

filename = 'yourONNXmodel'  # replace with the path to your ONNX file
model = onnx.load(filename)
onnx.checker.check_model(model)
  2. Try running your model with the trtexec command:
     https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

In case you are still facing the issue, request you to share the trtexec "--verbose" log for further debugging.
Thanks!

Running trtexec.exe --onnx=NP02_150_40.onnx produces the following log:
&&&& RUNNING TensorRT.trtexec # E:\01-LYX\TensorRT-7.2.1.6.Windows10.x86_64.cuda-11.0.cudnn8.0\TensorRT-7.2.1.6\bin\trtexec.exe --onnx=NP02_150_40.onnx
[07/02/2021-21:14:33] [I] === Model Options ===
[07/02/2021-21:14:33] [I] Format: ONNX
[07/02/2021-21:14:33] [I] Model: NP02_150_40.onnx
[07/02/2021-21:14:33] [I] Output:
[07/02/2021-21:14:33] [I] === Build Options ===
[07/02/2021-21:14:33] [I] Max batch: explicit
[07/02/2021-21:14:33] [I] Workspace: 16 MiB
[07/02/2021-21:14:33] [I] minTiming: 1
[07/02/2021-21:14:33] [I] avgTiming: 8
[07/02/2021-21:14:33] [I] Precision: FP32
[07/02/2021-21:14:33] [I] Calibration:
[07/02/2021-21:14:33] [I] Refit: Disabled
[07/02/2021-21:14:33] [I] Safe mode: Disabled
[07/02/2021-21:14:33] [I] Save engine:
[07/02/2021-21:14:33] [I] Load engine:
[07/02/2021-21:14:33] [I] Builder Cache: Enabled
[07/02/2021-21:14:33] [I] NVTX verbosity: 0
[07/02/2021-21:14:33] [I] Tactic sources: Using default tactic sources
[07/02/2021-21:14:33] [I] Input(s)s format: fp32:CHW
[07/02/2021-21:14:33] [I] Output(s)s format: fp32:CHW
[07/02/2021-21:14:33] [I] Input build shapes: model
[07/02/2021-21:14:33] [I] Input calibration shapes: model
[07/02/2021-21:14:33] [I] === System Options ===
[07/02/2021-21:14:33] [I] Device: 0
[07/02/2021-21:14:33] [I] DLACore:
[07/02/2021-21:14:33] [I] Plugins:
[07/02/2021-21:14:33] [I] === Inference Options ===
[07/02/2021-21:14:33] [I] Batch: Explicit
[07/02/2021-21:14:33] [I] Input inference shapes: model
[07/02/2021-21:14:33] [I] Iterations: 10
[07/02/2021-21:14:33] [I] Duration: 3s (+ 200ms warm up)
[07/02/2021-21:14:33] [I] Sleep time: 0ms
[07/02/2021-21:14:33] [I] Streams: 1
[07/02/2021-21:14:33] [I] ExposeDMA: Disabled
[07/02/2021-21:14:33] [I] Data transfers: Enabled
[07/02/2021-21:14:33] [I] Spin-wait: Disabled
[07/02/2021-21:14:33] [I] Multithreading: Disabled
[07/02/2021-21:14:33] [I] CUDA Graph: Disabled
[07/02/2021-21:14:33] [I] Separate profiling: Disabled
[07/02/2021-21:14:33] [I] Skip inference: Disabled
[07/02/2021-21:14:33] [I] Inputs:
[07/02/2021-21:14:33] [I] === Reporting Options ===
[07/02/2021-21:14:33] [I] Verbose: Disabled
[07/02/2021-21:14:33] [I] Averages: 10 inferences
[07/02/2021-21:14:33] [I] Percentile: 99
[07/02/2021-21:14:33] [I] Dump refittable layers:Disabled
[07/02/2021-21:14:33] [I] Dump output: Disabled
[07/02/2021-21:14:33] [I] Profile: Disabled
[07/02/2021-21:14:33] [I] Export timing to JSON file:
[07/02/2021-21:14:33] [I] Export output to JSON file:
[07/02/2021-21:14:33] [I] Export profile to JSON file:
[07/02/2021-21:14:33] [I]
[07/02/2021-21:14:33] [I] === Device Information ===
[07/02/2021-21:14:33] [I] Selected Device: GeForce RTX 3090
[07/02/2021-21:14:33] [I] Compute Capability: 8.6
[07/02/2021-21:14:33] [I] SMs: 82
[07/02/2021-21:14:33] [I] Compute Clock Rate: 1.695 GHz
[07/02/2021-21:14:33] [I] Device Global Memory: 24576 MiB
[07/02/2021-21:14:33] [I] Shared Memory per SM: 100 KiB
[07/02/2021-21:14:33] [I] Memory Bus Width: 384 bits (ECC disabled)
[07/02/2021-21:14:33] [I] Memory Clock Rate: 9.751 GHz
[07/02/2021-21:14:33] [I]

Input filename: NP02_150_40.onnx
ONNX IR version: 0.0.6
Opset version: 11
Producer name: pytorch
Producer version: 1.7
Domain:
Model version: 0
Doc string:

[07/02/2021-21:14:34] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/02/2021-21:14:35] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[07/02/2021-21:14:37] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[07/02/2021-21:14:39] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/02/2021-21:14:39] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[07/02/2021-21:14:39] [I] Engine built in 5.92113 sec.
[07/02/2021-21:14:39] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.1.0
[07/02/2021-21:14:39] [I] Starting inference
[07/02/2021-21:14:42] [I] Warmup completed 0 queries over 200 ms
[07/02/2021-21:14:42] [I] Timing trace has 0 queries over 3.07133 s
[07/02/2021-21:14:42] [I] Trace averages of 10 runs:
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.9292 ms - Host latency: 23.6067 ms (end to end 44.7186 ms, enqueue 4.23376 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.5352 ms - Host latency: 23.2247 ms (end to end 44.5199 ms, enqueue 4.31635 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.8595 ms - Host latency: 23.5326 ms (end to end 44.8566 ms, enqueue 4.55482 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.7895 ms - Host latency: 23.4797 ms (end to end 44.9019 ms, enqueue 1.92739 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.4299 ms - Host latency: 23.1133 ms (end to end 44.695 ms, enqueue 1.31323 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 23.0786 ms - Host latency: 23.7411 ms (end to end 45.3994 ms, enqueue 4.25176 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.7826 ms - Host latency: 23.4822 ms (end to end 44.6816 ms, enqueue 4.02943 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.6778 ms - Host latency: 23.3548 ms (end to end 44.9457 ms, enqueue 1.32758 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.6492 ms - Host latency: 23.3111 ms (end to end 44.6989 ms, enqueue 3.79288 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.7203 ms - Host latency: 23.4206 ms (end to end 45.1155 ms, enqueue 1.03423 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 23.0206 ms - Host latency: 23.689 ms (end to end 44.9721 ms, enqueue 4.12368 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 22.2535 ms - Host latency: 22.9387 ms (end to end 43.8511 ms, enqueue 4.16018 ms)
[07/02/2021-21:14:42] [I] Average on 10 runs - GPU latency: 23.093 ms - Host latency: 23.7927 ms (end to end 45.4809 ms, enqueue 3.56311 ms)
[07/02/2021-21:14:42] [I] Host Latency
[07/02/2021-21:14:42] [I] min: 22.6407 ms (end to end 43.0844 ms)
[07/02/2021-21:14:42] [I] max: 25.8501 ms (end to end 48.1566 ms)
[07/02/2021-21:14:42] [I] mean: 23.443 ms (end to end 44.8546 ms)
[07/02/2021-21:14:42] [I] median: 23.2444 ms (end to end 44.6142 ms)
[07/02/2021-21:14:42] [I] percentile: 25.8099 ms at 99% (end to end 48.0836 ms at 99%)
[07/02/2021-21:14:42] [I] throughput: 0 qps
[07/02/2021-21:14:42] [I] walltime: 3.07133 s
[07/02/2021-21:14:42] [I] Enqueue Time
[07/02/2021-21:14:42] [I] min: 0.787109 ms
[07/02/2021-21:14:42] [I] max: 6.34814 ms
[07/02/2021-21:14:42] [I] median: 4.07324 ms
[07/02/2021-21:14:42] [I] GPU Compute
[07/02/2021-21:14:42] [I] min: 21.9853 ms
[07/02/2021-21:14:42] [I] max: 25.1914 ms
[07/02/2021-21:14:42] [I] mean: 22.7598 ms
[07/02/2021-21:14:42] [I] median: 22.5603 ms
[07/02/2021-21:14:42] [I] percentile: 25.1525 ms at 99%
[07/02/2021-21:14:42] [I] total compute time: 3.04981 s
&&&& PASSED TensorRT.trtexec # E:\01-LYX\TensorRT-7.2.1.6.Windows10.x86_64.cuda-11.0.cudnn8.0\TensorRT-7.2.1.6\bin\trtexec.exe --onnx=NP02_150_40.onnx

I found out the reason myself: after upgrading from TensorRT 7.2.1 to 8.x, the problem was finally solved.