Performance without DLA

HuiW · July 29, 2020, 11:07am

Hi,

Is there a way to disable DLA1 and DLA2 when running jetson_benchmarks?

Our customer would like to know the NX performance without DLA1 and DLA2 enabled.

Thank you,

AastaLLL · July 30, 2020, 3:29am

Hi,

The simplest way is to set the DLA batch size (BatchSizeDLA) in nx-benchmarks.csv to zero.

https://github.com/NVIDIA-AI-IOT/jetson_benchmarks/blob/master/benchmark_csv/nx-benchmarks.csv

ModelName,FrameWork,Devices,BatchSizeGPU,BatchSizeDLA,WS_GPU,WS_DLA,input,output,URL
inception_v4,caffe,3,2,1,2048,1024,NA,prob,https://www.dropbox.com/s/b7masj8xdoycv2w/inception_v4.prototxt
vgg19_N2,caffe,1,1,0,2048,0,NA,prob,https://www.dropbox.com/s/t4qq079g5q4jibx/vgg19_N2.prototxt
super_resolution_bsd500,onnx,1,2,0,2048,0,NA,NA,https://www.dropbox.com/s/hdhxndo23cm9i5y/super_resolution_bsd500.zip
unet-segmentation,tensorrt,1,2,0,2048,None,"input_1,1,512,512",conv2d_19/Sigmoid,https://www.dropbox.com/s/85lttamnbjeig0e/unet-segmentation.uff
pose_estimation,caffe,1,2,0,2048,None,NA,Mconv7_stage2_L2,https://www.dropbox.com/s/hwa5i14v67u57ij/pose_estimation.prototxt
yolov3-tiny-416,onnx,1,8,0,2048,0,NA,NA,https://www.dropbox.com/s/ck9e40b57rd5o14/yolov3-tiny-416.zip
ResNet50_224x224,caffe,3,4,2,2048,1024,NA,prob,https://www.dropbox.com/s/9ohk387v0ki56wx/ResNet50_224x224.prototxt
ssd-mobilenet-v1,onnx,3,8,2,2048,1024,NA,NA,https://www.dropbox.com/s/gx5zayt76vszhpo/ssd-mobilenet-v1.zip

Thanks.

HuiW · August 4, 2020, 9:42am

Hi AastaLLL,

Thank you for your support.

May I double-check that the batch size you mentioned is the BatchSizeDLA column?

From that column, only three models enable DLA: inception_v4, ResNet50_224x224 and ssd-mobilenet-v1.
So I can only disable DLA for these three models?
Please correct me if I have mixed anything up.
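
The reading of the BatchSizeDLA column above can be checked with a short script. This is a minimal sketch, assuming the rows quoted earlier in the thread are representative of nx-benchmarks.csv; the URL, input and output columns are left out for brevity, and the helper name `models_using_dla` is made up for illustration.

```python
import csv
import io

# Rows of nx-benchmarks.csv as quoted in this thread
# (trailing input/output/URL columns omitted for brevity).
CSV_TEXT = """\
ModelName,FrameWork,Devices,BatchSizeGPU,BatchSizeDLA,WS_GPU,WS_DLA
inception_v4,caffe,3,2,1,2048,1024
vgg19_N2,caffe,1,1,0,2048,0
super_resolution_bsd500,onnx,1,2,0,2048,0
unet-segmentation,tensorrt,1,2,0,2048,None
pose_estimation,caffe,1,2,0,2048,None
yolov3-tiny-416,onnx,1,8,0,2048,0
ResNet50_224x224,caffe,3,4,2,2048,1024
ssd-mobilenet-v1,onnx,3,8,2,2048,1024
"""


def models_using_dla(csv_text):
    """Return the names of rows with Devices == 3 and BatchSizeDLA > 0."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row["ModelName"]
        for row in reader
        if int(row["Devices"]) == 3 and int(row["BatchSizeDLA"]) > 0
    ]


dla_models = models_using_dla(CSV_TEXT)
print(dla_models)
```

For the rows quoted in this thread, the result is the same three models listed above.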

Thank you,

HuiW · August 10, 2020, 3:05am

Hi AastaLLL,

We still have an NX performance issue, described below.
Thank you for any advice.

From the BatchSizeDLA column, only three models enable DLA: inception_v4, ResNet50_224x224 and ssd-mobilenet-v1.

1. So I can only disable DLA for these three models?
2. After I disabled DLA for the three models, only inception_v4 and ResNet50_224x224 ran. Are the performance numbers below reasonable?
3. Can I run the benchmark with DLA only? (I guess this applies only to the three models. Can I get DLA-only results for them, or should I assume the DLA performance is the total performance minus the DLA 0 performance?)
4. What are the TOPS of the NX GPU and of the DLA? (The NX total is 21 TOPS.)

Here are my test results on the NX (NX eMMC module + Nano B01 EVB board).

| Model Name | FPS | NX EMMC (Nano_b01) FPS --H | NX EMMC (Nano_b01) DLA 0 |
|---|---|---|---|
| inception_v4 | 311.73 | 240.93 | 93.3 |
| vgg19_N2 | 66.43 | 58.05 | 57.64 |
| super_resolution_bsd500 | 150.46 | 112.23 | 112.23 |
| unet-segmentation | 145.42 | 101.197 | 100.76 |
| pose_estimation | 237.1 | 171.26 | 174.26 |
| yolov3-tiny-416 | 546.69 | 413.74 | 414.74 |
| ResNet50_224x224 | 824.02 | 609.29 | 245.15 |
| ssd-mobilenet-v1 | 887.6 | 625.2 | no mobilenet-v1-bs0.onnx |

Thank you,

AastaLLL · August 10, 2020, 8:46am

Hi,

Sorry, some information was missing from my previous comment.

Setting the batch size to zero only means that the FPS calculation does not take that processor (GPU or DLA) into account.
The processor still runs in the background, which might affect the benchmarking result.

1. For a cleaner result, please set Devices to 1 in nx-benchmarks.csv.

3 if GPU+2DLA, 1 if GPU Only

3. This requires some changes to the source.
You can add code to skip the GPU inference here:

https://github.com/NVIDIA-AI-IOT/jetson_benchmarks/blob/master/utils/load_store_engine.py

#!/usr/bin/python
import os
import subprocess
import threading
import time

# Class for load, store, remove engine
class load_store_engine():
    def __init__(self, model_path, model_name, batch_size_gpu, batch_size_dla, num_devices, precision, ws_gpu, ws_dla, model_input, model_output ):
        self.model_path = model_path # Directory
        self.model_name = model_name # Model Name
        self.num_devices = num_devices # 3 if GPU+2DLA, 1 if GPU Only
        self.precision = precision # float16 or int8
        self.batch_size_gpu = batch_size_gpu # Batch Size for GPU
        self.batch_size_dla = batch_size_dla # Batch Size for DLA
        self.ws_gpu = ws_gpu # Workspace required for GPU
        self.ws_dla =ws_dla  # Workspace required for DLA
        self. model_input = model_input # Input name of the model
        self.model_output = model_output # Output name of the model
        self.trt_process = []


4. TOPS: 21 ≈ 12.3 (GPU) + 2 × 4.5 (each DLA). The parts sum to 21.3, which is quoted as 21.
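
Point 1 above (setting Devices to 1) can be applied by hand, or scripted. The following is a minimal sketch, assuming the CSV layout quoted earlier in this thread; `force_gpu_only` is a hypothetical helper and not part of jetson_benchmarks. It only rewrites the Devices column and leaves the batch sizes untouched.

```python
import csv
import io


def force_gpu_only(csv_text):
    """Return csv_text with the Devices column of every row set to 1 (GPU only)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["Devices"] = "1"  # 3 if GPU+2DLA, 1 if GPU only
        writer.writerow(row)
    return out.getvalue()


# Two rows copied from the CSV quoted in this thread.
sample = (
    "ModelName,FrameWork,Devices,BatchSizeGPU,BatchSizeDLA,WS_GPU,WS_DLA,input,output,URL\n"
    "inception_v4,caffe,3,2,1,2048,1024,NA,prob,"
    "https://www.dropbox.com/s/b7masj8xdoycv2w/inception_v4.prototxt\n"
    'unet-segmentation,tensorrt,1,2,0,2048,None,"input_1,1,512,512",conv2d_19/Sigmoid,'
    "https://www.dropbox.com/s/85lttamnbjeig0e/unet-segmentation.uff\n"
)

gpu_only = force_gpu_only(sample)
print(gpu_only)
```

To use it on the real file, read the text of benchmark_csv/nx-benchmarks.csv, pass it through the function, and write the result back after keeping a backup copy. The csv module is used rather than plain string splitting because the input column of unet-segmentation contains quoted commas.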

Thanks.

HuiW · August 17, 2020, 7:30am

Hi AastaLLL,

Thank you for your support.
After setting Devices to 1, I got the results below.

| Model Name | NX EMMC (NX EVB board) | NX EMMC (NX EVB board) DLA 0 |
|---|---|---|
| inception_v4 | 317.45 | 193.49 |
| ResNet50_224x224 | 879.33 | 621.26 |
| ssd-mobilenet-v1 | 892.74 | 770.18 |
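
As a rough way to read these numbers, the share of throughput that remains with DLA disabled can be computed per model. This is only a sketch: it assumes the first column is the default run (GPU + 2 DLA) and the "DLA 0" column is the run with Devices set to 1, which the thread implies but does not state outright.

```python
# FPS values copied from the table above: (default run, run with DLA disabled).
results = {
    "inception_v4": (317.45, 193.49),
    "ResNet50_224x224": (879.33, 621.26),
    "ssd-mobilenet-v1": (892.74, 770.18),
}


def retained_fraction(full_fps, no_dla_fps):
    """Fraction of the default-run FPS that remains when DLA is disabled."""
    return no_dla_fps / full_fps


fractions = {
    name: retained_fraction(full, no_dla)
    for name, (full, no_dla) in results.items()
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of the default FPS retained without DLA")
```

The fractions differ noticeably between models, so a single scaling factor would not describe all three.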

Are these results normal?

Also, the other models already have Devices set to 1.
Does DLA support the other models (vgg19_N2, yolov3-tiny-416, …)?
If so, why are they not configured to use DLA?

Thank you,

AastaLLL · August 28, 2020, 6:07am

Hi,

Could you add some logging here to check whether you are using the correct process for benchmarking?

https://github.com/NVIDIA-AI-IOT/jetson_benchmarks/blob/master/utils/load_store_engine.py#L27


      
        self.ws_dla =ws_dla  # Workspace required for DLA
        self. model_input = model_input # Input name of the model
        self.model_output = model_output # Output name of the model
        self.trt_process = []

    def engine_gen(self):
        cmd = []
        model = []
        self.framework = os.path.splitext(self.model_name)[1]
        precision_cmd = str('--' + str(self.precision))
        for device_id in range(0, self.num_devices):
            if device_id == 1 or device_id == 2:
                self.device = 'dla'
                model_base_path = self._model2deploy()
                dla_cmd = str('--useDLACore=' + str(device_id - 1))
                workspace_cmd = str('--workspace=' + str(self.ws_dla))
                _model = str(os.path.splitext(self.model_name)[0]) + '_b' + str(self.batch_size_dla) + '_ws' + str(
                    self.ws_dla) + '_' + str(self.device) + str(device_id)
                engine_CMD = str(
                    './trtexec' + " " + model_base_path + " " + precision_cmd + " " +'--allowGPUFallback' + " " + " " + dla_cmd + " " +
                    workspace_cmd)
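
One way to add the requested logging is to print which processor each `device_id` maps to. The sketch below is standalone and mirrors only the mapping visible in the excerpt (ids 1 and 2 become `--useDLACore=0` and `--useDLACore=1`). That id 0 is the GPU is an assumption based on the "3 if GPU+2DLA, 1 if GPU Only" comment, and `device_plan` is a hypothetical helper, not a function in load_store_engine.py.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("device_plan")


def device_plan(num_devices):
    """List (device, dla_flag) per device_id, logging each entry.

    Assumption: device_id 0 is the GPU; ids 1 and 2 are DLA cores 0 and 1,
    as in the engine_gen excerpt.
    """
    plan = []
    for device_id in range(0, num_devices):
        if device_id == 1 or device_id == 2:
            entry = ("dla", "--useDLACore=" + str(device_id - 1))
        else:
            entry = ("gpu", "")
        log.info("device_id=%d -> %s %s", device_id, entry[0], entry[1])
        plan.append(entry)
    return plan


gpu_only_plan = device_plan(1)  # Devices = 1 in the CSV
full_plan = device_plan(3)      # Devices = 3 in the CSV
```

With Devices set to 1 the loop body runs once, so no DLA entry should appear in the log. If one does, the CSV change was not picked up.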

DLA is a hardware-based inference engine, so not all TensorRT operations are supported on it.
Since this is a benchmark script, we turn off DLA for the models that cannot run on the DLA without GPU fallback.
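
To check whether one of the other models can run entirely on a DLA core, one option is to build it with trtexec without `--allowGPUFallback`. The build is then expected to fail if a layer is not supported on DLA. The sketch below only assembles the argument list and does not run anything. The flags are the ones visible in the excerpt plus trtexec's `--onnx`, while the helper name and the model path are placeholders.

```python
def dla_only_trtexec_args(onnx_path, dla_core=0, precision="fp16"):
    """Build a trtexec argument list targeting one DLA core, with no GPU fallback."""
    if dla_core not in (0, 1):
        raise ValueError("expected DLA core 0 or 1")
    if precision not in ("fp16", "int8"):
        raise ValueError("expected fp16 or int8")
    return [
        "./trtexec",
        "--onnx=" + onnx_path,
        "--" + precision,
        "--useDLACore=" + str(dla_core),
        # --allowGPUFallback is deliberately left out so that unsupported
        # layers are reported instead of being moved to the GPU.
    ]


# "model.onnx" is a placeholder path, not a file from the benchmark repository.
args = dla_only_trtexec_args("model.onnx", dla_core=1)
print(" ".join(args))
```

The list can be passed to `subprocess.run` on the device once the path points at a real model.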

Thanks.
