DLA execution fails with out of memory error

I am trying to execute a UNet model with trtexec on the DLA. The build fails with an out-of-memory error for several batch sizes (4, 8, 16, 32). How much memory is available for DLA execution, and how can we estimate it? I also observe that this behaviour is inconsistent; what is the reason for the inconsistency?

/usr/src/tensorrt/bin/trtexec --avgRuns=100 --deploy=/home/nvidia/ThinCi_models/prototxt/UNet_512x512.prototxt --fp16 --batch=16 --output=loss --iterations=1000 --useSpinWait --useDLACore=1 --allowGPUFallback --workspace=2048
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --avgRuns=100 --deploy=/home/nvidia/ThinCi_models/prototxt/UNet_512x512.prototxt --fp16 --batch=16 --output=loss --iterations=1000 --useSpinWait --useDLACore=1 --allowGPUFallback --workspace=2048
[00/17/2020-15:16:06] [I] === Model Options ===
[00/17/2020-15:16:06] [I] Format: Caffe
[00/17/2020-15:16:06] [I] Model: 
[00/17/2020-15:16:06] [I] Prototxt: /home/nvidia/ThinCi_models/prototxt/UNet_512x512.prototxt
[00/17/2020-15:16:06] [I] Output: loss
[00/17/2020-15:16:06] [I] === Build Options ===
[00/17/2020-15:16:06] [I] Max batch: 16
[00/17/2020-15:16:06] [I] Workspace: 2048 MB
[00/17/2020-15:16:06] [I] minTiming: 1
[00/17/2020-15:16:06] [I] avgTiming: 8
[00/17/2020-15:16:06] [I] Precision: FP16
[00/17/2020-15:16:06] [I] Calibration: 
[00/17/2020-15:16:06] [I] Safe mode: Disabled
[00/17/2020-15:16:06] [I] Save engine: 
[00/17/2020-15:16:06] [I] Load engine: 
[00/17/2020-15:16:06] [I] Inputs format: fp32:CHW
[00/17/2020-15:16:06] [I] Outputs format: fp32:CHW
[00/17/2020-15:16:06] [I] Input build shapes: model
[00/17/2020-15:16:06] [I] === System Options ===
[00/17/2020-15:16:06] [I] Device: 0
[00/17/2020-15:16:06] [I] DLACore: 1(With GPU fallback)
[00/17/2020-15:16:06] [I] Plugins:
[00/17/2020-15:16:06] [I] === Inference Options ===
[00/17/2020-15:16:06] [I] Batch: 16
[00/17/2020-15:16:06] [I] Iterations: 1000 (200 ms warm up)
[00/17/2020-15:16:06] [I] Duration: 10s
[00/17/2020-15:16:06] [I] Sleep time: 0ms
[00/17/2020-15:16:06] [I] Streams: 1
[00/17/2020-15:16:06] [I] Spin-wait: Enabled
[00/17/2020-15:16:06] [I] Multithreading: Enabled
[00/17/2020-15:16:06] [I] CUDA Graph: Disabled
[00/17/2020-15:16:06] [I] Skip inference: Disabled
[00/17/2020-15:16:06] [I] Input inference shapes: model
[00/17/2020-15:16:06] [I] === Reporting Options ===
[00/17/2020-15:16:06] [I] Verbose: Disabled
[00/17/2020-15:16:06] [I] Averages: 100 inferences
[00/17/2020-15:16:06] [I] Percentile: 99
[00/17/2020-15:16:06] [I] Dump output: Disabled
[00/17/2020-15:16:06] [I] Profile: Disabled
[00/17/2020-15:16:06] [I] Export timing to JSON file: 
[00/17/2020-15:16:06] [I] Export profile to JSON file: 
[00/17/2020-15:16:06] [I] 
[00/17/2020-15:16:09] [W] [TRT] Default DLA is enabled but layer crop_d3c-d3cc is not supported on DLA, falling back to GPU.
[00/17/2020-15:16:09] [W] [TRT] Default DLA is enabled but layer crop_d2c-d2cc is not supported on DLA, falling back to GPU.
[00/17/2020-15:16:09] [W] [TRT] Default DLA is enabled but layer crop_d1c-d1cc is not supported on DLA, falling back to GPU.
[00/17/2020-15:16:09] [W] [TRT] Default DLA is enabled but layer crop_d0c-d0cc is not supported on DLA, falling back to GPU.
[00/17/2020-15:16:09] [W] [TRT] Default DLA is enabled but layer loss is not supported on DLA, falling back to GPU.
[00/17/2020-15:16:10] [W] [TRT] Internal DLA error for layer conv_d4b-c. Switching to GPU fallback.
[00/17/2020-15:16:10] [W] [TRT] Internal DLA error for layer conv_d4b-c. Switching to GPU fallback.
[00/17/2020-15:16:10] [W] [TRT] Internal DLA error for layer conv_u3b-c. Switching to GPU fallback.
[00/17/2020-15:16:12] [W] [TRT] Internal DLA error for layer conv_u0b-c. Switching to GPU fallback.
[00/17/2020-15:16:12] [I] [TRT] 
[00/17/2020-15:16:12] [I] [TRT] --------------- Layers running on DLA: 
[00/17/2020-15:16:12] [I] [TRT] {conv_d0a-b,relu_d0b,conv_d0b-c,relu_d0c,pool_d0c-1a,conv_d1a-b,relu_d1b,conv_d1b-c,relu_d1c,pool_d1c-2a,conv_d2a-b,relu_d2b,conv_d2b-c,relu_d2c,pool_d2c-3a,conv_d3a-b,relu_d3b,conv_d3b-c,relu_d3c,pool_d3c-4a,conv_d4a-b,relu_d4b}, {relu_d4c,upconv_d4c_u3a,relu_u3a}, {relu_u3c,conv_u3c-d,relu_u3d,upconv_u3d_u2a,relu_u2a}, {conv_u2b-c,relu_u2c,conv_u2c-d,relu_u2d,upconv_u2d_u1a,relu_u1a}, {conv_u1b-c,relu_u1c,conv_u1c-d,relu_u1d,upconv_u1d_u0a,relu_u0a}, {relu_u0c,conv_u0c-d,relu_u0d,conv_u0d-score}, 
[00/17/2020-15:16:12] [I] [TRT] --------------- Layers running on GPU: 
[00/17/2020-15:16:12] [I] [TRT] conv_d4b-c, crop_d3c-d3cc, crop_d2c-d2cc, crop_d1c-d1cc, crop_d0c-d0cc, u3a copy, d3cc copy, conv_u3b-c, u2a copy, d2cc copy, u1a copy, d1cc copy, u0a copy, d0cc copy, conv_u0b-c, loss, 
[00/17/2020-15:16:48] [W] [TRT] DLA Node compilation Failed.
[00/17/2020-15:16:48] [E] [TRT] Internal error: could not find any implementation for node {conv_d0a-b,relu_d0b,conv_d0b-c,relu_d0c,pool_d0c-1a,conv_d1a-b,relu_d1b,conv_d1b-c,relu_d1c,pool_d1c-2a,conv_d2a-b,relu_d2b,conv_d2b-c,relu_d2c,pool_d2c-3a,conv_d3a-b,relu_d3b,conv_d3b-c,relu_d3c,pool_d3c-4a,conv_d4a-b,relu_d4b}, try increasing the workspace size with IBuilder::setMaxWorkspaceSize()
[00/17/2020-15:16:48] [E] [TRT] ../builder/tacticOptimizer.cpp (1461) - OutOfMemory Error in computeCosts: 0
[00/17/2020-15:16:48] [E] Engine could not be created
&&&& FAILED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --avgRuns=100 --deploy=/home/nvidia/ThinCi_models/prototxt/UNet_512x512.prototxt --fp16 --batch=16 --output=loss --iterations=1000 --useSpinWait --useDLACore=1 --allowGPUFallback --workspace=2048
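
For reference, here is how the same build options map onto the TensorRT C++ API. This is only a minimal sketch assuming TensorRT 6.x (as shipped with JetPack 4.x); the logger and parsing code are illustrative rather than taken from our application, and config->setMaxWorkspaceSize() is the IBuilderConfig counterpart of the IBuilder::setMaxWorkspaceSize() call suggested in the error message.

// Sketch only: trtexec-equivalent build options via the TensorRT C++ API
// (assumes TensorRT 6.x; cleanup/destroy calls omitted for brevity).
#include <cstdio>
#include <NvInfer.h>
#include <NvCaffeParser.h>

using namespace nvinfer1;

class Logger : public ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) printf("%s\n", msg);
    }
} gLogger;

ICudaEngine* buildDlaEngine(const char* deployFile) {
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetworkV2(0U);  // implicit batch

    // Parse the Caffe prototxt; no caffemodel, so weights are generated
    // (same as running trtexec with --deploy only).
    auto* parser = nvcaffeparser1::createCaffeParser();
    const auto* blobs = parser->parse(deployFile, nullptr, *network, DataType::kFLOAT);
    network->markOutput(*blobs->find("loss"));                   // --output=loss

    IBuilderConfig* config = builder->createBuilderConfig();
    builder->setMaxBatchSize(16);                                // --batch=16
    config->setMaxWorkspaceSize(2048ULL << 20);                  // --workspace=2048 (MB)
    config->setFlag(BuilderFlag::kFP16);                         // --fp16
    config->setFlag(BuilderFlag::kGPU_FALLBACK);                 // --allowGPUFallback
    config->setDefaultDeviceType(DeviceType::kDLA);
    config->setDLACore(1);                                       // --useDLACore=1

    // Returns nullptr on the same "Engine could not be created" failure as above.
    return builder->buildEngineWithConfig(*network, *config);
}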

Hi,
Could you share UNet_512x512.prototxt with us so that we can reproduce the error? Please also share your release version ($ head -1 /etc/nv_tegra_release).

Please find attached the UNet file used:
https://drive.google.com/file/d/15fvkHNAawEMeh24WlRolgZ9jRzFWpEZy/view?usp=sharing

The JetPack release I'm using is the 4.3 production version.

R32 (release), REVISION: 2.2, GCID: 16669743, BOARD: t186ref, EABI: aarch64, DATE: Sat Sep 7 00:19:15 UTC 2019

Any update on this?
How much memory is available for the DLA?
Knowing this would help us estimate which models and batch sizes can run on the board.
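
In the meantime, this is roughly how we check the free and total device memory on the board (a sketch using cudaMemGetInfo(); on Jetson the GPU shares system DRAM, so this is an overall figure, not a DLA-specific budget):

// Sketch: report overall free/total device memory on the Jetson.
// Note: this is shared system/GPU memory, not a DLA-specific limit.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) {
        printf("cudaMemGetInfo failed\n");
        return 1;
    }
    printf("free: %zu MiB, total: %zu MiB\n", freeBytes >> 20, totalBytes >> 20);
    return 0;
}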

Hi,

Please upgrade to r32.2.3 and give it a try. r32.2.2 is not an officially listed release and may not be stable.