Description
Inference core dumps when running multiple execution contexts in parallel.
The model is ONNX with dynamic shapes. I created the same number of optimization profiles as execution contexts and, for each execution context, called context->setOptimizationProfile(i) before inference. From the log output, you can see that the binding index for each profile and context is correct, but I have never gotten the inference to succeed.
The log shows "an illegal memory access was encountered". I then checked the device buffer and the host buffer; both were allocated with the correct memory size. I cannot find any clue as of now. Please help!
Thanks a lot!
The attachment is the full code and model for reproducing this issue.
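For reference, the relevant part of my setup looks roughly like this (a trimmed sketch of the attached code; helper names such as numContexts and bindingsPerProfile are illustrative):
```cpp
#include <NvInfer.h>
#include <vector>

// Trimmed sketch of the per-context setup: one execution context per
// optimization profile, each selecting its own profile before inference.
void setupContexts(nvinfer1::ICudaEngine* engine,
                   std::vector<nvinfer1::IExecutionContext*>& contexts,
                   int numContexts)
{
    // With multiple profiles, the engine repeats its bindings once per profile.
    const int bindingsPerProfile =
        engine->getNbBindings() / engine->getNbOptimizationProfiles();

    for (int i = 0; i < numContexts; ++i)
    {
        nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
        ctx->setOptimizationProfile(i); // called before inference, as described

        // Binding index of the input "image" within profile i's binding slice.
        const int inputIndex =
            engine->getBindingIndex("image") + i * bindingsPerProfile;
        ctx->setBindingDimensions(inputIndex, nvinfer1::Dims4{1, 3, 224, 224});

        contexts.push_back(ctx);
    }
}
```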
Environment
TensorRT Version : 8.0.1.6, C++ API
GPU Type : Tesla P4
Nvidia Driver Version : 440.33.01
CUDA Version : 10.2
CUDNN Version : 8.2
Operating System + Version : Ubuntu 16.04
Python Version (if applicable) : NA
TensorFlow Version (if applicable) : NA
PyTorch Version (if applicable) : NA
Baremetal or Container (if container which image + tag) : NA
Relevant Files
The attachment has the full code, model, and CMakeLists. Just modify the TensorRT path in the CMakeLists and it should build.
trt-conc.zip (15.0 MB)
Steps To Reproduce
Compile to produce the binary concurrency_test:
```
mkdir build && cd build
cmake ..
make -j4
```
Then run:
```
./concurrency_test ../mobilenetv1/params image softmax_0.tmp_0 1 1 2
```
NVES (April 6, 2022, 1:37pm):
Hi,
Request you to share the ONNX model and the script, if not shared already, so that we can assist you better.
Alongside, you can try a few things:
1) Validate your model with the below snippet:
check_model.py
```python
import onnx

filename = "yourONNXmodel"
model = onnx.load(filename)
onnx.checker.check_model(model)
```
2) Try running your model with the trtexec command.
In case you are still facing the issue, request you to share the trtexec --verbose log for further debugging.
Thanks!
I tried trtexec and found some clues.
For a dynamic-shape model with multiple execution contexts, --minShapes, --optShapes, and --maxShapes must specify the same batch size (explicit batch).
For example, this works well:
```
./trtexec --onnx=./mobilenetv1/params --minShapes=image:2x3x224x224 --optShapes=image:2x3x224x224 --maxShapes=image:2x3x224x224 --streams=2 --explicitBatch --shapes=image:2x3x224x224
```
while this does not:
```
./trtexec --onnx=./mobilenetv1/params --minShapes=image:1x3x224x224 --optShapes=image:2x3x224x224 --maxShapes=image:4x3x224x224 --streams=2 --explicitBatch --shapes=image:2x3x224x224
```
It fails with this error message:
```
[04/07/2022-12:39:36] [E] Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
)
[04/07/2022-12:39:36] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # ./trtexec --onnx=./mobilenetv1/params --minShapes=image:1x3x224x224 --optShapes=image:2x3x224x224 --maxShapes=image:4x3x224x224 --streams=2 --explicitBatch --shapes=image:2x3x224x224
```
From the trtexec source code, I found that multiple profiles are actually not supported as of now, even in the latest version of TensorRT (8.4):
```cpp
if (nOptProfiles > 1)
{
    sample::gLogWarning << "Multiple profiles are currently not supported. Running with one profile." << std::endl;
}
```
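For reference, the build-time side of what I am attempting looks roughly like this in the C++ API (a sketch only; the input name "image" and the 1/2/4 batch range come from the trtexec commands above, and numContexts is an illustrative parameter):
```cpp
#include <NvInfer.h>

// Sketch: add one optimization profile per planned execution context at
// build time. Input name "image" and the 1/2/4 batch range match the
// trtexec commands above; numContexts is an illustrative parameter.
void addProfiles(nvinfer1::IBuilder* builder,
                 nvinfer1::IBuilderConfig* config,
                 int numContexts)
{
    for (int i = 0; i < numContexts; ++i)
    {
        nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
        profile->setDimensions("image", nvinfer1::OptProfileSelector::kMIN,
                               nvinfer1::Dims4{1, 3, 224, 224});
        profile->setDimensions("image", nvinfer1::OptProfileSelector::kOPT,
                               nvinfer1::Dims4{2, 3, 224, 224});
        profile->setDimensions("image", nvinfer1::OptProfileSelector::kMAX,
                               nvinfer1::Dims4{4, 3, 224, 224});
        config->addOptimizationProfile(profile);
    }
}
```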
Isn't it correct to say that a serialized engine for a dynamic-shape model then has a fixed batch size, not a range, so I cannot use this engine for inference at a smaller batch size, only at the same batch size?
Yes. Please refer to the following similar issue. Currently, --streams with dynamic shapes is not supported in TRT.
GitHub issue (opened 02 Jul 2021, closed 01 Mar 2022; labels: Samples, triaged):
## Description
I'm trying to run benchmarking using TensorRT 8.0.1 using `trtexec`, and I receive the following error when setting more than one stream.
Command:
`trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2`
Error:
`Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()`
I can email the model files if needed.
## Environment
**TensorRT Version**: 8.0.1-1+cuda11.3
**NVIDIA GPU**: NVIDIA T4
**NVIDIA Driver Version**: 450.80.02
**CUDA Version**: 11.3
**CUDNN Version**:
**Operating System**: Ubuntu 20.04
**Python Version (if applicable)**:
**Tensorflow Version (if applicable)**:
**PyTorch Version (if applicable)**:
**Baremetal or Container (if so, version)**: nvcr.io/nvidia/tensorrt:21.06-py3
## Steps To Reproduce
The pipeline involves converting from ONNX->TRT and then benchmarking the engine file.
**Step 1: Run Docker**
`docker run --rm -it nvcr.io/nvidia/tensorrt:21.06-py3`
**Step 2: Upgrade to TensorRT 8.0.1**
A. Download from https://developer.nvidia.com/nvidia-tensorrt-8x-download
B. Run installation
```
dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.1.6-ga-20210626_1-1_amd64.deb
apt-get update
apt-get install tensorrt libcudnn8
```
**Step 3: Convert ONNX model to TensorRT in 8.0.1**
From ONNX->TRT:
```
trtexec --onnx=model.onnx --saveEngine=model-fp32.engine \
--workspace=4096 \
--minShapes=input_tensor:0:1x300x300x3 \
--maxShapes=input_tensor:0:32x300x300x3 \
--optShapes=input_tensor:0:8x300x300x3 \
--buildOnly
```
**Step 4: Run Benchmarking**
`trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose`
Output:
```
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:16] [I] === Model Options ===
[07/02/2021-15:05:16] [I] Format: *
[07/02/2021-15:05:16] [I] Model:
[07/02/2021-15:05:16] [I] Output:
[07/02/2021-15:05:16] [I] === Build Options ===
[07/02/2021-15:05:16] [I] Max batch: explicit
[07/02/2021-15:05:16] [I] Workspace: 16 MiB
[07/02/2021-15:05:16] [I] minTiming: 1
[07/02/2021-15:05:16] [I] avgTiming: 8
[07/02/2021-15:05:16] [I] Precision: FP32
[07/02/2021-15:05:16] [I] Calibration:
[07/02/2021-15:05:16] [I] Refit: Disabled
[07/02/2021-15:05:16] [I] Sparsity: Disabled
[07/02/2021-15:05:16] [I] Safe mode: Disabled
[07/02/2021-15:05:16] [I] Restricted mode: Disabled
[07/02/2021-15:05:16] [I] Save engine:
[07/02/2021-15:05:16] [I] Load engine: model-fp32.engine
[07/02/2021-15:05:16] [I] NVTX verbosity: 0
[07/02/2021-15:05:16] [I] Tactic sources: Using default tactic sources
[07/02/2021-15:05:16] [I] timingCacheMode: local
[07/02/2021-15:05:16] [I] timingCacheFile:
[07/02/2021-15:05:16] [I] Input(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Output(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Input build shape: input_tensor:0=1x300x300x3+1x300x300x3+1x300x300x3
[07/02/2021-15:05:16] [I] Input calibration shapes: model
[07/02/2021-15:05:16] [I] === System Options ===
[07/02/2021-15:05:16] [I] Device: 0
[07/02/2021-15:05:16] [I] DLACore:
[07/02/2021-15:05:16] [I] Plugins:
[07/02/2021-15:05:16] [I] === Inference Options ===
[07/02/2021-15:05:16] [I] Batch: Explicit
[07/02/2021-15:05:16] [I] Input inference shape: input_tensor:0=1x300x300x3
[07/02/2021-15:05:16] [I] Iterations: 10
[07/02/2021-15:05:16] [I] Duration: 3s (+ 200ms warm up)
[07/02/2021-15:05:16] [I] Sleep time: 0ms
[07/02/2021-15:05:16] [I] Streams: 2
[07/02/2021-15:05:16] [I] ExposeDMA: Disabled
[07/02/2021-15:05:16] [I] Data transfers: Enabled
[07/02/2021-15:05:16] [I] Spin-wait: Disabled
[07/02/2021-15:05:16] [I] Multithreading: Disabled
[07/02/2021-15:05:16] [I] CUDA Graph: Disabled
[07/02/2021-15:05:16] [I] Separate profiling: Disabled
[07/02/2021-15:05:16] [I] Time Deserialize: Disabled
[07/02/2021-15:05:16] [I] Time Refit: Disabled
[07/02/2021-15:05:16] [I] Skip inference: Disabled
[07/02/2021-15:05:16] [I] Inputs:
[07/02/2021-15:05:16] [I] === Reporting Options ===
[07/02/2021-15:05:16] [I] Verbose: Enabled
[07/02/2021-15:05:16] [I] Averages: 10 inferences
[07/02/2021-15:05:16] [I] Percentile: 99
[07/02/2021-15:05:16] [I] Dump refittable layers:Disabled
[07/02/2021-15:05:16] [I] Dump output: Disabled
[07/02/2021-15:05:16] [I] Profile: Disabled
[07/02/2021-15:05:16] [I] Export timing to JSON file:
[07/02/2021-15:05:16] [I] Export output to JSON file:
[07/02/2021-15:05:16] [I] Export profile to JSON file:
[07/02/2021-15:05:16] [I]
[07/02/2021-15:05:16] [I] === Device Information ===
[07/02/2021-15:05:16] [I] Selected Device: Tesla T4
[07/02/2021-15:05:16] [I] Compute Capability: 7.5
[07/02/2021-15:05:16] [I] SMs: 40
[07/02/2021-15:05:16] [I] Compute Clock Rate: 1.59 GHz
[07/02/2021-15:05:16] [I] Device Global Memory: 15109 MiB
[07/02/2021-15:05:16] [I] Shared Memory per SM: 64 KiB
[07/02/2021-15:05:16] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/02/2021-15:05:16] [I] Memory Clock Rate: 5.001 GHz
[07/02/2021-15:05:16] [I]
[07/02/2021-15:05:16] [I] TensorRT version: 8001
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Proposal version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Split version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[07/02/2021-15:05:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 355, GPU 250 (MiB)
[07/02/2021-15:05:17] [I] [TRT] Loaded engine size: 19 MB
[07/02/2021-15:05:17] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 355 MiB, GPU 250 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +482, GPU +206, now: CPU 838, GPU 476 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +394, GPU +172, now: CPU 1232, GPU 648 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1232, GPU 630 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Deserialization required 1204936 microseconds.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1232 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [I] Engine loaded in 1.74508 sec.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1212 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1213, GPU 640 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1213, GPU 648 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1219, GPU 1098 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1219, GPU 1108 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1550 MiB
[07/02/2021-15:05:18] [E] Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
)
[07/02/2021-15:05:18] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1518 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1058 (MiB)
```
This solved my puzzle. Thanks!
system (April 25, 2022, 5:25am):
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.