Description
Inference core dumps when running multiple execution contexts in parallel.
The model is ONNX with dynamic shapes. I created the same number of optimization profiles as execution contexts and, for each execution context, called context->setOptimizationProfile(i) before inference. From the log output, you can see that the binding index for each profile and context is correct, but I have never gotten the inference to succeed.
The log shows "an illegal memory access was encountered". I then checked the device buffer and the host buffer; both were allocated with the correct memory size. I cannot find any clue as of now. Please help!
Thanks a lot!
The attachment is the full code and model for reproducing this issue.
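For reference, the relevant part of my setup looks roughly like this (a trimmed sketch of the attached code; helper names such as numContexts and bindingsPerProfile are illustrative):
```cpp
#include <NvInfer.h>
#include <vector>

// Trimmed sketch of the per-context setup: one execution context per
// optimization profile, each selecting its own profile before inference.
void setupContexts(nvinfer1::ICudaEngine* engine,
                   std::vector<nvinfer1::IExecutionContext*>& contexts,
                   int numContexts)
{
    // With multiple profiles, the engine repeats its bindings once per profile.
    const int bindingsPerProfile =
        engine->getNbBindings() / engine->getNbOptimizationProfiles();

    for (int i = 0; i < numContexts; ++i)
    {
        nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();
        ctx->setOptimizationProfile(i); // called before inference, as described

        // Binding index of the input "image" within profile i's binding slice.
        const int inputIndex =
            engine->getBindingIndex("image") + i * bindingsPerProfile;
        ctx->setBindingDimensions(inputIndex, nvinfer1::Dims4{1, 3, 224, 224});

        contexts.push_back(ctx);
    }
}
```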
Environment
TensorRT Version : 8.0.1.6, C++ API
GPU Type : Tesla P4
Nvidia Driver Version : 440.33.01
CUDA Version : 10.2
CUDNN Version : 8.2
Operating System + Version : Ubuntu 16.04
Python Version (if applicable) : NA
TensorFlow Version (if applicable) : NA
PyTorch Version (if applicable) : NA
Baremetal or Container (if container which image + tag) : NA
Relevant Files
The attachment has the full code, model, and CMakeLists. Just modify the TensorRT path in the CMakeLists and it should build.
trt-conc.zip (15.0 MB)
Steps To Reproduce
Compile to produce the binary concurrency_test:
```
mkdir build && cd build
cmake ..
make -j4
```
Then run:
```
./concurrency_test ../mobilenetv1/params image softmax_0.tmp_0 1 1 2
```
NVES (April 6, 2022, 1:37pm):
Hi,
Request you to share the ONNX model and the script, if not shared already, so that we can assist you better.
Alongside, you can try a few things:
1) Validate your model with the below snippet:
check_model.py
```python
import onnx

filename = "yourONNXmodel"
model = onnx.load(filename)
onnx.checker.check_model(model)
```
2) Try running your model with the trtexec command.
In case you are still facing the issue, request you to share the trtexec --verbose log for further debugging.
Thanks!
I tried trtexec and found some clues.
For a dynamic-shape model with multiple execution contexts, --minShapes, --optShapes, and --maxShapes must specify the same batch size (explicit batch).
For example, this works well:
```
./trtexec --onnx=./mobilenetv1/params --minShapes=image:2x3x224x224 --optShapes=image:2x3x224x224 --maxShapes=image:2x3x224x224 --streams=2 --explicitBatch --shapes=image:2x3x224x224
```
while this does not:
```
./trtexec --onnx=./mobilenetv1/params --minShapes=image:1x3x224x224 --optShapes=image:2x3x224x224 --maxShapes=image:4x3x224x224 --streams=2 --explicitBatch --shapes=image:2x3x224x224
```
It fails with this error message:
```
[04/07/2022-12:39:36] [E] Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
)
[04/07/2022-12:39:36] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # ./trtexec --onnx=./mobilenetv1/params --minShapes=image:1x3x224x224 --optShapes=image:2x3x224x224 --maxShapes=image:4x3x224x224 --streams=2 --explicitBatch --shapes=image:2x3x224x224
```
From the trtexec source code, I found that multiple profiles are actually not supported as of now, even in the latest version of TensorRT (8.4):
```cpp
if (nOptProfiles > 1)
{
    sample::gLogWarning << "Multiple profiles are currently not supported. Running with one profile." << std::endl;
}
```
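For reference, the build-time side of what I am attempting looks roughly like this in the C++ API (a sketch only; the input name "image" and the 1/2/4 batch range come from the trtexec commands above, and numContexts is an illustrative parameter):
```cpp
#include <NvInfer.h>

// Sketch: add one optimization profile per planned execution context at
// build time. Input name "image" and the 1/2/4 batch range match the
// trtexec commands above; numContexts is an illustrative parameter.
void addProfiles(nvinfer1::IBuilder* builder,
                 nvinfer1::IBuilderConfig* config,
                 int numContexts)
{
    for (int i = 0; i < numContexts; ++i)
    {
        nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
        profile->setDimensions("image", nvinfer1::OptProfileSelector::kMIN,
                               nvinfer1::Dims4{1, 3, 224, 224});
        profile->setDimensions("image", nvinfer1::OptProfileSelector::kOPT,
                               nvinfer1::Dims4{2, 3, 224, 224});
        profile->setDimensions("image", nvinfer1::OptProfileSelector::kMAX,
                               nvinfer1::Dims4{4, 3, 224, 224});
        config->addOptimizationProfile(profile);
    }
}
```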
Isn't it correct to say that a serialized engine for a dynamic-shape model then has a fixed batch size, not a range, so I cannot use this engine for inference at a smaller batch size, only at the same batch size?
Yes. Please refer to the following similar issue. Currently, --streams with dynamic shapes is not supported in TRT.
GitHub issue (opened 02 Jul 2021, closed 01 Mar 2022; labels: Samples, triaged):
## Description
I'm trying to run benchmarking using TensorRT 8.0.1 using `trtexec`, and I receive the following error when setting more than one stream.
Command:
`trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2`
Error:
`Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()`
I can email the model files if needed.
## Environment
**TensorRT Version**: 8.0.1-1+cuda11.3
**NVIDIA GPU**: NVIDIA T4
**NVIDIA Driver Version**: 450.80.02
**CUDA Version**: 11.3
**CUDNN Version**:
**Operating System**: Ubuntu 20.04
**Python Version (if applicable)**:
**Tensorflow Version (if applicable)**:
**PyTorch Version (if applicable)**:
**Baremetal or Container (if so, version)**: nvcr.io/nvidia/tensorrt:21.06-py3
## Steps To Reproduce
The pipeline involves converting from ONNX->TRT and then benchmarking the engine file.
**Step 1: Run Docker**
`docker run --rm -it nvcr.io/nvidia/tensorrt:21.06-py3`
**Step 2: Upgrade to TensorRT 8.0.1**
A. Download from https://developer.nvidia.com/nvidia-tensorrt-8x-download
B. Run installation
```
dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.1.6-ga-20210626_1-1_amd64.deb
apt-get update
apt-get install tensorrt libcudnn8
```
**Step 3: Convert ONNX model to TensorRT in 8.0.1**
From ONNX->TRT:
```
trtexec --onnx=model.onnx --saveEngine=model-fp32.engine \
--workspace=4096 \
--minShapes=input_tensor:0:1x300x300x3 \
--maxShapes=input_tensor:0:32x300x300x3 \
--optShapes=input_tensor:0:8x300x300x3 \
--buildOnly
```
**Step 4: Run Benchmarking**
`trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose`
Output:
```
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:16] [I] === Model Options ===
[07/02/2021-15:05:16] [I] Format: *
[07/02/2021-15:05:16] [I] Model:
[07/02/2021-15:05:16] [I] Output:
[07/02/2021-15:05:16] [I] === Build Options ===
[07/02/2021-15:05:16] [I] Max batch: explicit
[07/02/2021-15:05:16] [I] Workspace: 16 MiB
[07/02/2021-15:05:16] [I] minTiming: 1
[07/02/2021-15:05:16] [I] avgTiming: 8
[07/02/2021-15:05:16] [I] Precision: FP32
[07/02/2021-15:05:16] [I] Calibration:
[07/02/2021-15:05:16] [I] Refit: Disabled
[07/02/2021-15:05:16] [I] Sparsity: Disabled
[07/02/2021-15:05:16] [I] Safe mode: Disabled
[07/02/2021-15:05:16] [I] Restricted mode: Disabled
[07/02/2021-15:05:16] [I] Save engine:
[07/02/2021-15:05:16] [I] Load engine: model-fp32.engine
[07/02/2021-15:05:16] [I] NVTX verbosity: 0
[07/02/2021-15:05:16] [I] Tactic sources: Using default tactic sources
[07/02/2021-15:05:16] [I] timingCacheMode: local
[07/02/2021-15:05:16] [I] timingCacheFile:
[07/02/2021-15:05:16] [I] Input(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Output(s)s format: fp32:CHW
[07/02/2021-15:05:16] [I] Input build shape: input_tensor:0=1x300x300x3+1x300x300x3+1x300x300x3
[07/02/2021-15:05:16] [I] Input calibration shapes: model
[07/02/2021-15:05:16] [I] === System Options ===
[07/02/2021-15:05:16] [I] Device: 0
[07/02/2021-15:05:16] [I] DLACore:
[07/02/2021-15:05:16] [I] Plugins:
[07/02/2021-15:05:16] [I] === Inference Options ===
[07/02/2021-15:05:16] [I] Batch: Explicit
[07/02/2021-15:05:16] [I] Input inference shape: input_tensor:0=1x300x300x3
[07/02/2021-15:05:16] [I] Iterations: 10
[07/02/2021-15:05:16] [I] Duration: 3s (+ 200ms warm up)
[07/02/2021-15:05:16] [I] Sleep time: 0ms
[07/02/2021-15:05:16] [I] Streams: 2
[07/02/2021-15:05:16] [I] ExposeDMA: Disabled
[07/02/2021-15:05:16] [I] Data transfers: Enabled
[07/02/2021-15:05:16] [I] Spin-wait: Disabled
[07/02/2021-15:05:16] [I] Multithreading: Disabled
[07/02/2021-15:05:16] [I] CUDA Graph: Disabled
[07/02/2021-15:05:16] [I] Separate profiling: Disabled
[07/02/2021-15:05:16] [I] Time Deserialize: Disabled
[07/02/2021-15:05:16] [I] Time Refit: Disabled
[07/02/2021-15:05:16] [I] Skip inference: Disabled
[07/02/2021-15:05:16] [I] Inputs:
[07/02/2021-15:05:16] [I] === Reporting Options ===
[07/02/2021-15:05:16] [I] Verbose: Enabled
[07/02/2021-15:05:16] [I] Averages: 10 inferences
[07/02/2021-15:05:16] [I] Percentile: 99
[07/02/2021-15:05:16] [I] Dump refittable layers:Disabled
[07/02/2021-15:05:16] [I] Dump output: Disabled
[07/02/2021-15:05:16] [I] Profile: Disabled
[07/02/2021-15:05:16] [I] Export timing to JSON file:
[07/02/2021-15:05:16] [I] Export output to JSON file:
[07/02/2021-15:05:16] [I] Export profile to JSON file:
[07/02/2021-15:05:16] [I]
[07/02/2021-15:05:16] [I] === Device Information ===
[07/02/2021-15:05:16] [I] Selected Device: Tesla T4
[07/02/2021-15:05:16] [I] Compute Capability: 7.5
[07/02/2021-15:05:16] [I] SMs: 40
[07/02/2021-15:05:16] [I] Compute Clock Rate: 1.59 GHz
[07/02/2021-15:05:16] [I] Device Global Memory: 15109 MiB
[07/02/2021-15:05:16] [I] Shared Memory per SM: 64 KiB
[07/02/2021-15:05:16] [I] Memory Bus Width: 256 bits (ECC enabled)
[07/02/2021-15:05:16] [I] Memory Clock Rate: 5.001 GHz
[07/02/2021-15:05:16] [I]
[07/02/2021-15:05:16] [I] TensorRT version: 8001
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Proposal version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::Split version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[07/02/2021-15:05:16] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[07/02/2021-15:05:17] [I] [TRT] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 355, GPU 250 (MiB)
[07/02/2021-15:05:17] [I] [TRT] Loaded engine size: 19 MB
[07/02/2021-15:05:17] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 355 MiB, GPU 250 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +482, GPU +206, now: CPU 838, GPU 476 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +394, GPU +172, now: CPU 1232, GPU 648 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1232, GPU 630 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Deserialization required 1204936 microseconds.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1232 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [I] Engine loaded in 1.74508 sec.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1212 MiB, GPU 630 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1213, GPU 640 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1213, GPU 648 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1219 MiB, GPU 1090 MiB
[07/02/2021-15:05:18] [V] [TRT] Using cublasLt a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1219, GPU 1098 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Using cuDNN as a tactic source
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1219, GPU 1108 (MiB)
[07/02/2021-15:05:18] [V] [TRT] Total per-runner device memory is 16729600
[07/02/2021-15:05:18] [V] [TRT] Total per-runner host memory is 101424
[07/02/2021-15:05:18] [V] [TRT] Allocated activation device memory of size 445687808
[07/02/2021-15:05:18] [I] [TRT] Could not set default profile 0 for execution context. Profile index must be set explicitly.
[07/02/2021-15:05:18] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 1219 MiB, GPU 1550 MiB
[07/02/2021-15:05:18] [E] Error[3]: [executionContext.cpp::setBindingDimensions::949] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::949, condition: mOptimizationProfile >= 0 && mOptimizationProfile < mEngine.getNbOptimizationProfiles()
)
[07/02/2021-15:05:18] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8001] # trtexec --loadEngine=model-fp32.engine --shapes=input_tensor:0:1x300x300x3 --streams=2 --verbose
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1518 (MiB)
[07/02/2021-15:05:18] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1219, GPU 1058 (MiB)
```
This solved my puzzle. Thanks!
system (April 25, 2022, 5:25am):
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.