TensorRT engine batch size confusion using NVDLA

Hello,

I’m trying to execute some engines using a custom TensorRT script and I’m having issues dealing with different batch sizes.
First, I’m referring to a ResNet-50 model, for which I downloaded the ONNX file and created an engine using trtexec. By default (without specifying input sizes) I get 1x3x224x224, i.e. a batch size of 1, since the ONNX input layer has those dimensions. Specifying a different input size (for example 32x3x224x224) gives an API error.
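For reference, the static input shape that trtexec picks up can be inspected directly from the ONNX file (a minimal sketch using the onnx Python package; the input is named x, as in the trtexec command below):

import onnx

model = onnx.load("onnx/resnet50_Opset17.onnx")
for inp in model.graph.input:
    dims = [d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # expected: x [1, 3, 224, 224]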

running:

/usr/src/tensorrt/bin/trtexec --onnx=onnx/resnet50_Opset17.onnx --shapes=x:32x3x224x224 --int8 --fp16 --useDLACore=0 --allowGPUFallback --useSpinWait --separateProfileRun
[03/05/2025-16:10:24] [E] Static model does not take explicit shapes since the shape of inference tensors will be determined by the model itself
[03/05/2025-16:10:24] [E] Network And Config setup failed
[03/05/2025-16:10:24] [E] Building engine failed
[03/05/2025-16:10:24] [E] Failed to create engine from model or file.
[03/05/2025-16:10:24] [E] Engine set up failed

After building an engine with a batch size of 1 (omitting --shapes) I get a throughput of about 153 qps.

I then change the batch size of the ONNX model so that its default input is 32x3x224x224; running the same trtexec command, I get a throughput of about 5 qps. [1] First, I would like to ask why this is the case. I would have expected a bigger batch size to improve throughput, but I see no improvement; in fact 5 * 32 = 160, which is very close to the 153 qps I measured before.
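One way to perform that rewrite of the ONNX input shape (a minimal sketch using the onnx Python package; the output path is illustrative, and depending on the model you may also need shape inference to propagate the new batch size):

import onnx

model = onnx.load("onnx/resnet50_Opset17.onnx")
# Overwrite the batch dimension of the first graph input
model.graph.input[0].type.tensor_type.shape.dim[0].dim_value = 32
onnx.save(model, "onnx/resnet50_Opset17_bs32.onnx")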

Moving on, I take the bs1 and bs32 engines and run a custom inference loop using the following code:

import time

import torch
import torchvision

# Pre-generate 1000 fake images; preprocess, input_buffer, bindings,
# context, batch_size and args come from the rest of the script
images = []
for _ in range(1000):
    image, _ = next(iter(torchvision.datasets.FakeData(size=1, image_size=(3, 224, 224))))
    images.append(image)

i = 0
num_batches = 0
start_time = time.time()
while time.time() - start_time < args.duration:
    image = images[i % 1000]
    image = preprocess(image).unsqueeze(0).repeat(batch_size, 1, 1, 1)  # Repeat to match batch size
    input_buffer.copy_(image)  # Host-to-device copy into the bound input
    context.execute_async_v2(
        bindings,
        torch.cuda.current_stream().cuda_stream
    )
    torch.cuda.current_stream().synchronize()
    num_batches += 1
    i += 1

end_time = time.time()
total_time = end_time - start_time

throughput = num_batches * batch_size / total_time
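For context, the loop assumes an engine, execution context and device buffers created elsewhere; a minimal sketch of that setup (TensorRT 8.x Python API to match execute_async_v2; the engine path and output shape are assumptions):

import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize a prebuilt engine (path is illustrative)
with open("resnet50_bs32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate device buffers matching the engine's static binding shapes
input_shape = tuple(engine.get_binding_shape(0))   # e.g. (32, 3, 224, 224)
output_shape = tuple(engine.get_binding_shape(1))  # e.g. (32, 1000)
input_buffer = torch.empty(input_shape, dtype=torch.float32, device="cuda")
output_buffer = torch.empty(output_shape, dtype=torch.float32, device="cuda")
bindings = [input_buffer.data_ptr(), output_buffer.data_ptr()]
batch_size = input_shape[0]  # Keep the host batch equal to the engine batch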

Now,

  • setting a batch size of 1 with the bs1 engine I get a throughput of 109 inf/sec
  • setting a batch size of 32 with the bs1 engine I get an error (reported below) but 2715 inf/sec
  • setting a batch size of 1 with the bs32 engine I get a CUDA memory alignment error
  • setting a batch size of 32 with the bs32 engine I get 152 inf/sec

The reported error is the following:

[E] 3: [executionContext.cpp::setInputShape::2013] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2013, condition: engineDims.d[i] == dims.d[i]. Static dimension mismatch while setting input shape.
)
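A hypothetical guard in the loop (using the engine object from the setup sketch above) would catch this mismatch before launch rather than at execute time:

# Hypothetical sanity check: fail fast if the host batch does not
# match the engine's static input shape
engine_bs = engine.get_binding_shape(0)[0]
if image.shape[0] != engine_bs:
    raise ValueError(f"engine was built for batch {engine_bs}, got {image.shape[0]}")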

The reason I’m feeding bs32 input to a bs1 engine is that this seems to be what the official NVIDIA Jetson DLA TensorRT tutorial does (creating a bs1 engine and then arbitrarily changing its batch size): GitHub - NVIDIA-AI-IOT/jetson_dla_tutorial: A tutorial for getting started with the Deep Learning Accelerator (DLA) on NVIDIA Jetson

My question, other than [1], is: how should I interpret these numbers? What am I doing right and what am I doing wrong? Why does bs32 with a bs1 engine achieve such a huge number?

Thanks

Hi,

This might be related to memory.

Ideally, the bs32 engine should have better throughput, but it might also require extra memory copies to accommodate the batch size, which impacts performance.

Moreover, to test DLA performance, it’s recommended to try a model that can run fully on the DLA (without --allowGPUFallback), so that DLA ↔ GPU data transfers won’t affect the results; see the example command below.
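For example, dropping --allowGPUFallback from the earlier command makes trtexec fail if any layer cannot be placed on the DLA:

/usr/src/tensorrt/bin/trtexec --onnx=onnx/resnet50_Opset17.onnx --int8 --fp16 --useDLACore=0 --useSpinWait --separateProfileRun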

Thanks.

I understand. Is the default amount of memory given by trtexec not enough for 32x3x224x224? Should I increase the amount of memory there, or are you referring only to my custom script?

Furthermore, could you tell me how I should interpret the numbers when feeding bs32 to a bs1 engine in my custom script?

I’m also interested in running experiments with other models that don’t necessarily run fully on the DLA, so I’m fine with taking DLA–GPU data transfer into account rather than measuring the DLA’s full capability. For this thread I’m only interested in clearing up my confusion about batch sizes.

Thank you so much!

Edit: I also wanted to add that I didn’t consider 2700 inferences per second too big a number because of these benchmark reports: Jetson Benchmarks | NVIDIA Developer. Now I do think that a ~40% increase from 109 to 153 with a larger batch size seems more reasonable, but if that’s the case, how was the reported figure of over 2900 samples/s reached?

Hi,

You can change the DLA memory usage via the --memPoolSize flag (“dlaSRAM” | “dlaLocalDRAM” | “dlaGlobalDRAM”):

$ /usr/src/tensorrt/bin/trtexec -h
  ...
  --memPoolSize=poolspec             Specify the size constraints of the designated memory pool(s)
                                     Supports the following base-2 suffixes: B (Bytes), G (Gibibytes), K (Kibibytes), M (Mebibytes).
                                     If none of suffixes is appended, the defualt unit is in MiB.
                                     Note: Also accepts decimal sizes, e.g. 0.25M. Will be rounded down to the nearest integer bytes.
                                     In particular, for dlaSRAM the bytes will be rounded down to the nearest power of 2.
                                   Pool constraint: poolspec ::= poolfmt[","poolspec]
                                                      poolfmt ::= pool:size
                                                    pool ::= "workspace"|"dlaSRAM"|"dlaLocalDRAM"|"dlaGlobalDRAM"|"tacticSharedMem"
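
For example, to enlarge the DLA memory pools when building (the sizes here are purely illustrative):

/usr/src/tensorrt/bin/trtexec --onnx=onnx/resnet50_Opset17.onnx --int8 --fp16 --useDLACore=0 --allowGPUFallback --memPoolSize=dlaSRAM:1M,dlaLocalDRAM:1024M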

You can find more details about DLA’s memory in the below document:

Thanks.