Choosing the --useDLACore=1 option dumps the core

Hi all, I tried running trtexec with the --useDLACore=1 option but it crashed. It also crashed when I specified --useDLACore=0.

When this option was not used at all, it worked. Below are the execution logs from the crash:

./trtexec --onnx=…/data/resnet50/ResNet50.onnx --int8 --useDLACore=1 --loadInputs=~/program/nagaraj/tensor_rt_practice/pytorch_to_trt/input_tensor.dat
[11/24/2023-04:06:54] [I] === Model Options ===
[11/24/2023-04:06:54] [I] Format: ONNX
[11/24/2023-04:06:54] [I] Model: …/data/resnet50/ResNet50.onnx
[11/24/2023-04:06:54] [I] Output:
[11/24/2023-04:06:54] [I] === Build Options ===
[11/24/2023-04:06:54] [I] Max batch: explicit
[11/24/2023-04:06:54] [I] Workspace: 16 MiB
[11/24/2023-04:06:54] [I] minTiming: 1
[11/24/2023-04:06:54] [I] avgTiming: 8
[11/24/2023-04:06:54] [I] Precision: FP32+INT8
[11/24/2023-04:06:54] [I] Calibration: Dynamic
[11/24/2023-04:06:54] [I] Refit: Disabled
[11/24/2023-04:06:54] [I] Sparsity: Disabled
[11/24/2023-04:06:54] [I] Safe mode: Disabled
[11/24/2023-04:06:54] [I] Restricted mode: Disabled
[11/24/2023-04:06:54] [I] Save engine:
[11/24/2023-04:06:54] [I] Load engine:
[11/24/2023-04:06:54] [I] NVTX verbosity: 0
[11/24/2023-04:06:54] [I] Tactic sources: Using default tactic sources
[11/24/2023-04:06:54] [I] timingCacheMode: local
[11/24/2023-04:06:54] [I] timingCacheFile:
[11/24/2023-04:06:54] [I] Input(s)s format: fp32:CHW
[11/24/2023-04:06:54] [I] Output(s)s format: fp32:CHW
[11/24/2023-04:06:54] [I] Input build shapes: model
[11/24/2023-04:06:54] [I] Input calibration shapes: model
[11/24/2023-04:06:54] [I] === System Options ===
[11/24/2023-04:06:54] [I] Device: 0
[11/24/2023-04:06:54] [I] DLACore: 1
[11/24/2023-04:06:54] [I] Plugins:
[11/24/2023-04:06:54] [I] === Inference Options ===
[11/24/2023-04:06:54] [I] Batch: Explicit
[11/24/2023-04:06:54] [I] Input inference shapes: model
[11/24/2023-04:06:54] [I] Iterations: 10
[11/24/2023-04:06:54] [I] Duration: 3s (+ 200ms warm up)
[11/24/2023-04:06:54] [I] Sleep time: 0ms
[11/24/2023-04:06:54] [I] Streams: 1
[11/24/2023-04:06:54] [I] ExposeDMA: Disabled
[11/24/2023-04:06:54] [I] Data transfers: Enabled
[11/24/2023-04:06:54] [I] Spin-wait: Disabled
[11/24/2023-04:06:54] [I] Multithreading: Disabled
[11/24/2023-04:06:54] [I] CUDA Graph: Disabled
[11/24/2023-04:06:54] [I] Separate profiling: Disabled
[11/24/2023-04:06:54] [I] Time Deserialize: Disabled
[11/24/2023-04:06:54] [I] Time Refit: Disabled
[11/24/2023-04:06:54] [I] Skip inference: Disabled
[11/24/2023-04:06:54] [I] Inputs:
[11/24/2023-04:06:54] [I] ~/program/nagaraj/tensor_rt_practice/pytorch_to_trt/input_tensor.dat<-~/program/nagaraj/tensor_rt_practice/pytorch_to_trt/input_tensor.dat
[11/24/2023-04:06:54] [I] === Reporting Options ===
[11/24/2023-04:06:54] [I] Verbose: Disabled
[11/24/2023-04:06:54] [I] Averages: 10 inferences
[11/24/2023-04:06:54] [I] Percentile: 99
[11/24/2023-04:06:54] [I] Dump refittable layers:Disabled
[11/24/2023-04:06:54] [I] Dump output: Disabled
[11/24/2023-04:06:54] [I] Profile: Disabled
[11/24/2023-04:06:54] [I] Export timing to JSON file:
[11/24/2023-04:06:54] [I] Export output to JSON file:
[11/24/2023-04:06:54] [I] Export profile to JSON file:
[11/24/2023-04:06:54] [I]
[11/24/2023-04:06:54] [I] === Device Information ===
[11/24/2023-04:06:54] [I] Selected Device: Xavier
[11/24/2023-04:06:54] [I] Compute Capability: 7.2
[11/24/2023-04:06:54] [I] SMs: 6
[11/24/2023-04:06:54] [I] Compute Clock Rate: 1.109 GHz
[11/24/2023-04:06:54] [I] Device Global Memory: 7773 MiB
[11/24/2023-04:06:54] [I] Shared Memory per SM: 96 KiB
[11/24/2023-04:06:54] [I] Memory Bus Width: 256 bits (ECC disabled)
[11/24/2023-04:06:54] [I] Memory Clock Rate: 1.109 GHz
[11/24/2023-04:06:54] [I]
[11/24/2023-04:06:54] [I] TensorRT version: 8001
[11/24/2023-04:06:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 371, GPU 4527 (MiB)
[11/24/2023-04:06:55] [I] Start parsing network model
[11/24/2023-04:06:55] [I] [TRT] ----------------------------------------------------------------
[11/24/2023-04:06:55] [I] [TRT] Input filename: …/data/resnet50/ResNet50.onnx
[11/24/2023-04:06:55] [I] [TRT] ONNX IR version: 0.0.3
[11/24/2023-04:06:55] [I] [TRT] Opset version: 9
[11/24/2023-04:06:55] [I] [TRT] Producer name: onnx-caffe2
[11/24/2023-04:06:55] [I] [TRT] Producer version:
[11/24/2023-04:06:55] [I] [TRT] Domain:
[11/24/2023-04:06:55] [I] [TRT] Model version: 0
[11/24/2023-04:06:55] [I] [TRT] Doc string:
[11/24/2023-04:06:55] [I] [TRT] ----------------------------------------------------------------
[11/24/2023-04:06:55] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/24/2023-04:06:55] [I] Finish parsing network model
[11/24/2023-04:06:55] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 471, GPU 4725 (MiB)
[11/24/2023-04:06:55] [I] FP32 and INT8 precisions have been specified - more performance might be enabled by additionally specifying --fp16 or --best
[11/24/2023-04:06:55] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 471 MiB, GPU 4725 MiB
[11/24/2023-04:06:55] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[11/24/2023-04:06:57] [E] Error[9]: [standardEngineBuilder.cpp::isValidDLAConfig::2189] Error Code 9: Internal Error (Default DLA is enabled but layer (Unnamed Layer* 176) [Shuffle] + (Unnamed Layer* 177) [Shuffle] is not supported on DLA and falling back to GPU is not enabled.)
[11/24/2023-04:06:57] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

Thanks and Regards

Nagaraj Trivedi

Hi,

Default DLA is enabled but layer ... is not supported on DLA and falling back to GPU is not enabled.

Based on the log message, the model cannot fully run on DLA.
Please enable the GPU fallback (--allowGPUFallback) to allow TensorRT to place the non-supported layers back to the GPU.
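For example, the command from the original post could be re-run as below (a sketch that reuses your model and input paths, with only --allowGPUFallback added):

./trtexec --onnx=…/data/resnet50/ResNet50.onnx --int8 --useDLACore=1 --allowGPUFallback --loadInputs=~/program/nagaraj/tensor_rt_practice/pytorch_to_trt/input_tensor.dat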

You can find the DLA support matrix in the TensorRT documentation:

Thanks.

Hi, thank you for providing this information. It worked, but I found one more issue.
With --useDLACore the QPS (queries per second) is much lower than when inferencing without --useDLACore.
For your reference I have attached two log files, one with the --useDLACore option and the other without it:

  1. resnet50_withdla.txt (inference with --useDLACore)
  2. resnet50_without_dla.txt (inference without --useDLACore)

Also, from your experience resolving many such queries about the --useDLACore option, please let me know what kind of significant changes we can expect to see during inference.

Please provide the information I have asked for above.

Thanks and Regards

Nagaraj Trivedi
resnet50_without_dla.txt (20.2 KB)
resnet50_withdla.txt (12.9 KB)

Hi,

If the inference switches between DLA and GPU frequently, then the data transfer overhead might slow down the task.
For example: DLA -> GPU -> DLA -> GPU -> …
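To check how the layers were actually placed, the same command can be run with --verbose; the verbose build log normally reports which layers were assigned to DLA and which fell back to the GPU (a sketch reusing the command from your post):

./trtexec --onnx=…/data/resnet50/ResNet50.onnx --int8 --useDLACore=1 --allowGPUFallback --verbose --loadInputs=~/program/nagaraj/tensor_rt_practice/pytorch_to_trt/input_tensor.dat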

Thanks.

Hi, thank you for the response. It would be helpful if you could clarify why the --useDLACore option should be used. What benefit do we get from executing on DLA compared to the GPU? Does it increase inference speed, accuracy, or both?
Please clarify.

Thanks and Regards

Nagaraj Trivedi

Hi,

You can find more info in our document:

Q: Why does my network run slower when using DLA compared to without DLA?

A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Which implementation to use depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations at the same time to further increase the throughput of your network.
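As a rough sketch of what "use both implementations at the same time" could look like with trtexec (hypothetical; it assumes the same model and that both DLA cores on the Xavier module are free), one process can be launched per DLA core alongside one on the GPU:

./trtexec --onnx=…/data/resnet50/ResNet50.onnx --int8 --useDLACore=0 --allowGPUFallback &
./trtexec --onnx=…/data/resnet50/ResNet50.onnx --int8 --useDLACore=1 --allowGPUFallback &
./trtexec --onnx=…/data/resnet50/ResNet50.onnx --int8 &
wait

Since each process builds its own engine, in practice you would usually build and save the engines once with --saveEngine and then benchmark with --loadEngine.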

Thanks.
