trtexec multi-source (streams) and multi-batch performance test fails

Description

I want to test the model's performance with multiple sources (streams) and a multi-image batch using the trtexec command, so I ran the following:

/usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_b16_int8_qat_640.engine --shapes=images:4x3x640x640 --streams=4

P.S. The .engine file was converted from the ONNX model with the following command (dynamic batch):

/usr/src/tensorrt/bin/trtexec --verbose --onnx=yolov7_qat_640.onnx --workspace=4096 --minShapes=images:1x3x640x640 --optShapes=images:12x3x640x640 --maxShapes=images:16x3x640x640 --saveEngine=yolov7_b16_int8_qat_640.engine --fp16 --int8

However, the following error occurs:

[06/02/2023-09:24:37] [I] === Model Options ===
[06/02/2023-09:24:37] [I] Format: *
[06/02/2023-09:24:37] [I] Model: 
[06/02/2023-09:24:37] [I] Output:
[06/02/2023-09:24:37] [I] === Build Options ===
[06/02/2023-09:24:37] [I] Max batch: explicit batch
[06/02/2023-09:24:37] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/02/2023-09:24:37] [I] minTiming: 1
[06/02/2023-09:24:37] [I] avgTiming: 8
[06/02/2023-09:24:37] [I] Precision: FP32
[06/02/2023-09:24:37] [I] LayerPrecisions: 
[06/02/2023-09:24:37] [I] Calibration: 
[06/02/2023-09:24:37] [I] Refit: Disabled
[06/02/2023-09:24:37] [I] Sparsity: Disabled
[06/02/2023-09:24:37] [I] Safe mode: Disabled
[06/02/2023-09:24:37] [I] DirectIO mode: Disabled
[06/02/2023-09:24:37] [I] Restricted mode: Disabled
[06/02/2023-09:24:37] [I] Build only: Disabled
[06/02/2023-09:24:37] [I] Save engine: 
[06/02/2023-09:24:37] [I] Load engine: yolov7_b16_int8_qat_640.engine
[06/02/2023-09:24:37] [I] Profiling verbosity: 0
[06/02/2023-09:24:37] [I] Tactic sources: Using default tactic sources
[06/02/2023-09:24:37] [I] timingCacheMode: local
[06/02/2023-09:24:37] [I] timingCacheFile: 
[06/02/2023-09:24:37] [I] Heuristic: Disabled
[06/02/2023-09:24:37] [I] Preview Features: Use default preview flags.
[06/02/2023-09:24:37] [I] Input(s)s format: fp32:CHW
[06/02/2023-09:24:37] [I] Output(s)s format: fp32:CHW
[06/02/2023-09:24:37] [I] Input build shape: images=4x3x640x640+4x3x640x640+4x3x640x640
[06/02/2023-09:24:37] [I] Input calibration shapes: model
[06/02/2023-09:24:37] [I] === System Options ===
[06/02/2023-09:24:37] [I] Device: 0
[06/02/2023-09:24:37] [I] DLACore: 
[06/02/2023-09:24:37] [I] Plugins:
[06/02/2023-09:24:37] [I] === Inference Options ===
[06/02/2023-09:24:37] [I] Batch: Explicit
[06/02/2023-09:24:37] [I] Input inference shape: images=4x3x640x640
[06/02/2023-09:24:37] [I] Iterations: 10
[06/02/2023-09:24:37] [I] Duration: 3s (+ 200ms warm up)
[06/02/2023-09:24:37] [I] Sleep time: 0ms
[06/02/2023-09:24:37] [I] Idle time: 0ms
[06/02/2023-09:24:37] [I] Streams: 4
[06/02/2023-09:24:37] [I] ExposeDMA: Disabled
[06/02/2023-09:24:37] [I] Data transfers: Enabled
[06/02/2023-09:24:37] [I] Spin-wait: Disabled
[06/02/2023-09:24:37] [I] Multithreading: Disabled
[06/02/2023-09:24:37] [I] CUDA Graph: Disabled
[06/02/2023-09:24:37] [I] Separate profiling: Disabled
[06/02/2023-09:24:37] [I] Time Deserialize: Disabled
[06/02/2023-09:24:37] [I] Time Refit: Disabled
[06/02/2023-09:24:37] [I] NVTX verbosity: 0
[06/02/2023-09:24:37] [I] Persistent Cache Ratio: 0
[06/02/2023-09:24:37] [I] Inputs:
[06/02/2023-09:24:37] [I] === Reporting Options ===
[06/02/2023-09:24:37] [I] Verbose: Disabled
[06/02/2023-09:24:37] [I] Averages: 10 inferences
[06/02/2023-09:24:37] [I] Percentiles: 90,95,99
[06/02/2023-09:24:37] [I] Dump refittable layers:Disabled
[06/02/2023-09:24:37] [I] Dump output: Disabled
[06/02/2023-09:24:37] [I] Profile: Disabled
[06/02/2023-09:24:37] [I] Export timing to JSON file: 
[06/02/2023-09:24:37] [I] Export output to JSON file: 
[06/02/2023-09:24:37] [I] Export profile to JSON file: 
[06/02/2023-09:24:37] [I] 
[06/02/2023-09:24:37] [I] === Device Information ===
[06/02/2023-09:24:37] [I] Selected Device: Xavier
[06/02/2023-09:24:37] [I] Compute Capability: 7.2
[06/02/2023-09:24:37] [I] SMs: 8
[06/02/2023-09:24:37] [I] Compute Clock Rate: 1.377 GHz
[06/02/2023-09:24:37] [I] Device Global Memory: 31002 MiB
[06/02/2023-09:24:37] [I] Shared Memory per SM: 96 KiB
[06/02/2023-09:24:37] [I] Memory Bus Width: 256 bits (ECC disabled)
[06/02/2023-09:24:37] [I] Memory Clock Rate: 1.377 GHz
[06/02/2023-09:24:37] [I] 
[06/02/2023-09:24:37] [I] TensorRT version: 8.5.2
[06/02/2023-09:24:38] [I] Engine loaded in 0.0275892 sec.
[06/02/2023-09:24:38] [I] [TRT] Loaded engine size: 39 MiB
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[06/02/2023-09:24:39] [I] Engine deserialized in 1.04122 sec.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +364, now: CPU 0, GPU 405 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +363, now: CPU 0, GPU 768 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +363, now: CPU 0, GPU 1131 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[06/02/2023-09:24:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +364, now: CPU 1, GPU 1495 (MiB)
[06/02/2023-09:24:39] [I] Setting persistentCacheLimit to 0 bytes.
[06/02/2023-09:24:39] [E] Error[1]: Unexpected exception cannot create std::vector larger than max_size()
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for input images
[06/02/2023-09:24:39] [I] Created input binding for images with dimensions 4x3x640x640
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Using random values for output outputs
[06/02/2023-09:24:39] [I] Created output binding for outputs with dimensions 4x25200x85
[06/02/2023-09:24:39] [I] Starting inference
[06/02/2023-09:24:39] [E] Error[2]: [executionContext.cpp::enqueueV3::2386] Error Code 2: Internal Error (Assertion mOptimizationProfile >= 0 failed. )
[06/02/2023-09:24:39] [E] Error occurred during inference
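The two failures in the log ("Could not set default profile 0 for execution context. Profile index must be set explicitly." and the assertion mOptimizationProfile >= 0 in enqueueV3) both point at optimization profiles: with --streams=4, trtexec creates one execution context per stream, and prior to TensorRT 8.6 a given optimization profile can only be used by one execution context at a time. One possible workaround is to build the engine with one profile per intended stream. The following is only a sketch in the style of the TensorRT Python API (file names and flag choices are assumptions; not tested on this model):

```
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("yolov7_qat_640.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)

# One optimization profile per intended stream, all with the same
# dynamic range, so each execution context can bind its own profile.
for _ in range(4):
    profile = builder.create_optimization_profile()
    profile.set_shape("images",
                      (1, 3, 640, 640),   # min
                      (12, 3, 640, 640),  # opt
                      (16, 3, 640, 640))  # max
    config.add_optimization_profile(profile)

with open("yolov7_multi_profile.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```

At inference time, each context would then select its own profile (e.g. via set_optimization_profile_async) before enqueueing.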
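As a side note, the per-stream I/O footprint implied by the bindings in the log is easy to sanity-check from the shapes. A small stand-alone helper (fp32_bytes is a hypothetical name, not part of TensorRT or trtexec) computing the fp32 buffer sizes for the 4x3x640x640 input and 4x25200x85 output:

```python
import math

def fp32_bytes(shape):
    """Bytes needed for an fp32 tensor of the given shape."""
    return math.prod(shape) * 4  # 4 bytes per float32 element

input_bytes = fp32_bytes((4, 3, 640, 640))   # 'images' binding
output_bytes = fp32_bytes((4, 25200, 85))    # 'outputs' binding
per_stream = input_bytes + output_bytes
print(per_stream / 2**20)  # ~51.4 MiB of I/O buffers per stream
```

The ~364 MiB of GPU memory logged per IExecutionContext is presumably dominated by activation memory rather than these I/O buffers, which trtexec allocates separately per stream.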

Environment

TensorRT Version : 8.5.2
GPU Type : Jetson AGX Xavier
Nvidia Driver Version :
CUDA Version : 11.4.315
CUDNN Version : 8.6.0.166
Operating System + Version : L4T 35.2.1 (JetPack 5.1)
Python Version (if applicable) : Python 3.8.10
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) : 1.12.0a0+2c916ef.nv22.3
Baremetal or Container (if container which image + tag) :

Relevant Files

Steps To Reproduce

Hi,

Could you please try the latest TensorRT version 8.6 and let us know if you still face the same issue?
Please share with us the repro ONNX model to try from our end.

For easy setup, you can also use the TensorRT NGC container.


Hi @spolisetty,

I tried "nvcr.io/nvidia/tensorrt:23.05-py3", which ships tensorrt-dev version 8.6.1.2-1+cuda12.0.

When I run the trtexec command, I get the following error:

Cuda failure: CUDA driver version is insufficient for CUDA runtime version

My platform is AGX Xavier; should I use a different version?

Which version of CUDA are you using? Please refer to the container release notes and make sure the driver version requirements are satisfied.

Thank you.

It seems my CUDA version does not support TensorRT 8.6 or later.

Has this problem been fixed in TensorRT 8.6.1?