Hello,
I’m trying to execute some engines with a custom TensorRT script and I’m having trouble dealing with different batch sizes.
First, I’m referring to a ResNet-50 model: I downloaded the ONNX file and created an engine using trtexec. By default (not specifying input sizes) I get 1x3x224x224, i.e. a batch size of 1, since the ONNX input layer has these dimensions. Specifying a different input size (for example 32x3x224x224) I get an API error.
Running:
/usr/src/tensorrt/bin/trtexec --onnx=onnx/resnet50_Opset17.onnx --shapes=x:32x3x224x224 --int8 --fp16 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun
[03/05/2025-16:10:24] [E] Static model does not take explicit shapes since the shape of inference tensors will be determined by the model itself
[03/05/2025-16:10:24] [E] Network And Config setup failed
[03/05/2025-16:10:24] [E] Building engine failed
[03/05/2025-16:10:24] [E] Failed to create engine from model or file.
[03/05/2025-16:10:24] [E] Engine set up failed
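If I understand the error correctly, the ONNX file has a static input shape, so --shapes is rejected outright. As far as I can tell, to make trtexec accept a batch range the model would first need to be re-exported with a dynamic batch axis, and then built with something like the following (the min/opt/max values here are just an illustration, and I’m not sure dynamic shapes are even usable on DLA — layers may fall back to GPU):

```shell
# Assumes resnet50_Opset17.onnx was re-exported with a dynamic batch
# dimension on input "x" (e.g. via dynamic_axes in torch.onnx.export).
/usr/src/tensorrt/bin/trtexec --onnx=onnx/resnet50_Opset17.onnx \
    --minShapes=x:1x3x224x224 \
    --optShapes=x:32x3x224x224 \
    --maxShapes=x:32x3x224x224 \
    --int8 --fp16 --useDLACore=0 --allowGPUFallback
```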
After building an engine with a batch size of 1 (omitting --shapes) I get a throughput of about 153 qps.
I then changed the batch size of the ONNX so that the default input is 32x3x224x224, and running the same trtexec command I get a throughput of about 5 qps. [1] First, I would like to ask why this is the case: I would expect a bigger batch size to improve throughput, but I see no improvement. In fact 5 × 32 = 160 images/sec, which is very close to the 153 qps I measured before.
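To make the comparison explicit (as far as I know, trtexec’s qps counts queries, i.e. batches, not individual images, so images/sec is qps × batch size):

```python
# trtexec reports throughput in qps, where one query = one batch.
bs1_qps = 153.0    # measured with the batch-size-1 engine
bs32_qps = 5.0     # measured with the batch-size-32 engine

bs1_images_per_sec = bs1_qps * 1     # 153.0 images/sec
bs32_images_per_sec = bs32_qps * 32  # 160.0 images/sec
```

So in images per second the two engines are essentially equivalent, which makes it look like something is saturated regardless of batch size.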
Moving on, I take the bs1 and bs32 engines and run a custom inference loop using the following code:
import time
import torch
import torchvision

# Pre-generate 1000 fake images so data creation isn't part of the timed loop.
images = []
for _ in range(1000):
    image, _ = next(iter(torchvision.datasets.FakeData(size=1, image_size=(3, 224, 224))))
    images.append(image)

# preprocess, args, batch_size, input_buffer, bindings and context
# are set up earlier in the script.
i = 0
num_batches = 0
start_time = time.time()
while time.time() - start_time < args.duration:
    image = images[i % 1000]
    image = preprocess(image).unsqueeze(0).repeat(batch_size, 1, 1, 1)  # repeat to match batch size
    input_buffer.copy_(image)
    context.execute_async_v2(
        bindings,
        torch.cuda.current_stream().cuda_stream
    )
    torch.cuda.current_stream().synchronize()
    num_batches += 1
    i += 1
end_time = time.time()

total_time = end_time - start_time
throughput = num_batches * batch_size / total_time
Now:
- with a batch size of 1 and the bs1 engine I get a throughput of 109 inf/sec
- with a batch size of 32 and the bs1 engine I get an error (reported below) but 2715 inf/sec
- with a batch size of 1 and the bs32 engine I get a CUDA memory-alignment error
- with a batch size of 32 and the bs32 engine I get 152 inf/sec
The error reported is the following:
[E] 3: [executionContext.cpp::setInputShape::2013] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2013, condition: engineDims.d[i] == dims.d[i]. Static dimension mismatch while setting input shape.
)
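My own guess about the 2715 inf/sec figure (please correct me if this is wrong): since setInputShape rejects the 32-batch shape, the engine may still be running with batch 1, while my throughput formula multiplies num_batches by batch_size = 32, inflating the result by up to 32×:

```python
reported_inf_per_sec = 2715.0  # measured with batch-size-32 input on the bs1 engine
assumed_batch = 32             # what my throughput formula multiplies by

# If each launch actually processed only 1 image, the real rate would be:
launches_per_sec = reported_inf_per_sec / assumed_batch  # ~84.8
```

~85 true inf/sec would be in the same ballpark as the 109 inf/sec I measure with an explicit batch size of 1, which is why I suspect the 2715 number is an artifact of my counting rather than real throughput.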
The reason I’m using a batch size of 32 with a bs1 engine is that this seems to be what the following official NVIDIA Jetson DLA TensorRT tutorial does (creating a bs1 engine and then arbitrarily changing its batch size): GitHub - NVIDIA-AI-IOT/jetson_dla_tutorial: A tutorial for getting started with the Deep Learning Accelerator (DLA) on NVIDIA Jetson
My question, besides [1], is: how should I interpret these numbers? What am I doing right and what am I doing wrong? And why does a batch size of 32 on a bs1 engine appear to achieve such a huge throughput?
Thanks