TensorRT model inference fully on DLA is slow due to abnormally slow cudaEventSynchronize time

Hello NVIDIA Community,

I’m working with a Jetson Orin 64GB, and I’m trying to run CNN inference entirely on the DLA, without GPU fallback. I used the following command with TensorRT 8.5.2:

trtexec --onnx='model.onnx' --saveEngine='model_dla.trt' --fp16 --useDLACore=0 --useSpinWait --separateProfileRun

and got the output below, which shows that the model runs fully on the DLA and none of the layers fall back to the GPU:

[11/07/2023-15:36:54] [I] === Model Options ===
[11/07/2023-15:36:54] [I] Format: ONNX
[11/07/2023-15:36:54] [I] Model: /home/orin-1/export/model_v1_3_1x3x768x960_aspp_false.onnx
[11/07/2023-15:36:54] [I] Output:
[11/07/2023-15:36:54] [I] === Build Options ===
[11/07/2023-15:36:54] [I] Max batch: explicit batch
[11/07/2023-15:36:54] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/07/2023-15:36:54] [I] minTiming: 1
[11/07/2023-15:36:54] [I] avgTiming: 8
[11/07/2023-15:36:54] [I] Precision: FP32+FP16
[11/07/2023-15:36:54] [I] LayerPrecisions: 
[11/07/2023-15:36:54] [I] Calibration: 
[11/07/2023-15:36:54] [I] Refit: Disabled
[11/07/2023-15:36:54] [I] Sparsity: Disabled
[11/07/2023-15:36:54] [I] Safe mode: Disabled
[11/07/2023-15:36:54] [I] DirectIO mode: Disabled
[11/07/2023-15:36:54] [I] Restricted mode: Disabled
[11/07/2023-15:36:54] [I] Build only: Disabled
[11/07/2023-15:36:54] [I] Save engine: /home/orin-1/export/model_v1_3_1x3x768x960_aspp_false_dla.trt
[11/07/2023-15:36:54] [I] Load engine: 
[11/07/2023-15:36:54] [I] Profiling verbosity: 0
[11/07/2023-15:36:54] [I] Tactic sources: Using default tactic sources
[11/07/2023-15:36:54] [I] timingCacheMode: local
[11/07/2023-15:36:54] [I] timingCacheFile: 
[11/07/2023-15:36:54] [I] Heuristic: Disabled
[11/07/2023-15:36:54] [I] Preview Features: Use default preview flags.
[11/07/2023-15:36:54] [I] Input(s)s format: fp32:CHW
[11/07/2023-15:36:54] [I] Output(s)s format: fp32:CHW
[11/07/2023-15:36:54] [I] Input build shapes: model
[11/07/2023-15:36:54] [I] Input calibration shapes: model
[11/07/2023-15:36:54] [I] === System Options ===
[11/07/2023-15:36:54] [I] Device: 0
[11/07/2023-15:36:54] [I] DLACore: 0(With GPU fallback)
[11/07/2023-15:36:54] [I] Plugins:
[11/07/2023-15:36:54] [I] === Inference Options ===
[11/07/2023-15:36:54] [I] Batch: Explicit
[11/07/2023-15:36:54] [I] Input inference shapes: model
[11/07/2023-15:36:54] [I] Iterations: 10
[11/07/2023-15:36:54] [I] Duration: 3s (+ 200ms warm up)
[11/07/2023-15:36:54] [I] Sleep time: 0ms
[11/07/2023-15:36:54] [I] Idle time: 0ms
[11/07/2023-15:36:54] [I] Streams: 1
[11/07/2023-15:36:54] [I] ExposeDMA: Disabled
[11/07/2023-15:36:54] [I] Data transfers: Enabled
[11/07/2023-15:36:54] [I] Spin-wait: Enabled
[11/07/2023-15:36:54] [I] Multithreading: Disabled
[11/07/2023-15:36:54] [I] CUDA Graph: Disabled
[11/07/2023-15:36:54] [I] Separate profiling: Enabled
[11/07/2023-15:36:54] [I] Time Deserialize: Disabled
[11/07/2023-15:36:54] [I] Time Refit: Disabled
[11/07/2023-15:36:54] [I] NVTX verbosity: 0
[11/07/2023-15:36:54] [I] Persistent Cache Ratio: 0
[11/07/2023-15:36:54] [I] Inputs:
[11/07/2023-15:36:54] [I] === Reporting Options ===
[11/07/2023-15:36:54] [I] Verbose: Disabled
[11/07/2023-15:36:54] [I] Averages: 10 inferences
[11/07/2023-15:36:54] [I] Percentiles: 90,95,99
[11/07/2023-15:36:54] [I] Dump refittable layers:Disabled
[11/07/2023-15:36:54] [I] Dump output: Disabled
[11/07/2023-15:36:54] [I] Profile: Disabled
[11/07/2023-15:36:54] [I] Export timing to JSON file: 
[11/07/2023-15:36:54] [I] Export output to JSON file: 
[11/07/2023-15:36:54] [I] Export profile to JSON file: ./DLA/model_dla_pixel.json
[11/07/2023-15:36:54] [I] 
[11/07/2023-15:36:54] [I] === Device Information ===
[11/07/2023-15:36:54] [I] Selected Device: Orin
[11/07/2023-15:36:54] [I] Compute Capability: 8.7
[11/07/2023-15:36:54] [I] SMs: 16
[11/07/2023-15:36:54] [I] Compute Clock Rate: 1.3 GHz
[11/07/2023-15:36:54] [I] Device Global Memory: 30588 MiB
[11/07/2023-15:36:54] [I] Shared Memory per SM: 164 KiB
[11/07/2023-15:36:54] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/07/2023-15:36:54] [I] Memory Clock Rate: 1.3 GHz
[11/07/2023-15:36:54] [I] 
[11/07/2023-15:36:54] [I] TensorRT version: 8.5.2
[11/07/2023-15:36:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 246, GPU 5598 (MiB)
[11/07/2023-15:36:56] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +286, now: CPU 571, GPU 5906 (MiB)
[11/07/2023-15:36:56] [I] Start parsing network model
[11/07/2023-15:36:56] [I] [TRT] ----------------------------------------------------------------
[11/07/2023-15:36:56] [I] [TRT] Input filename:   /home/orin-1/yue/TLR/export/model_v1_3_1x3x768x960_aspp_false.onnx
[11/07/2023-15:36:56] [I] [TRT] ONNX IR version:  0.0.7
[11/07/2023-15:36:56] [I] [TRT] Opset version:    14
[11/07/2023-15:36:56] [I] [TRT] Producer name:    pytorch
[11/07/2023-15:36:56] [I] [TRT] Producer version: 2.0.0
[11/07/2023-15:36:56] [I] [TRT] Domain:           
[11/07/2023-15:36:56] [I] [TRT] Model version:    0
[11/07/2023-15:36:56] [I] [TRT] Doc string:       
[11/07/2023-15:36:56] [I] [TRT] ----------------------------------------------------------------
[11/07/2023-15:36:56] [I] Finish parsing network model
[11/07/2023-15:37:00] [I] [TRT] ---------- Layers Running on DLA ----------
[11/07/2023-15:37:00] [I] [TRT] [DlaLayer] {ForeignNode[/encoder/stem/conv/Conv.../headers/header_nb_cls/Conv]}
[11/07/2023-15:37:00] [I] [TRT] ---------- Layers Running on GPU ----------
[11/07/2023-15:37:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +307, now: CPU 1128, GPU 6437 (MiB)
[11/07/2023-15:37:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +81, now: CPU 1210, GPU 6518 (MiB)
[11/07/2023-15:37:01] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/07/2023-15:37:11] [I] [TRT] Total Activation Memory: 32106590208
[11/07/2023-15:37:11] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[11/07/2023-15:37:12] [I] [TRT] Total Host Persistent Memory: 128
[11/07/2023-15:37:12] [I] [TRT] Total Device Persistent Memory: 0
[11/07/2023-15:37:12] [I] [TRT] Total Scratch Memory: 0
[11/07/2023-15:37:12] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 43 MiB, GPU 30 MiB
[11/07/2023-15:37:12] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 7 steps to complete.
[11/07/2023-15:37:12] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.047329ms to assign 7 blocks to 7 nodes requiring 32440320 bytes.
[11/07/2023-15:37:12] [I] [TRT] Total Activation Memory: 32440320
[11/07/2023-15:37:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +43, GPU +0, now: CPU 43, GPU 0 (MiB)
[11/07/2023-15:37:12] [I] Engine built in 18.1508 sec.
[11/07/2023-15:37:12] [I] [TRT] Loaded engine size: 43 MiB
[11/07/2023-15:37:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +43, GPU +0, now: CPU 43, GPU 0 (MiB)
[11/07/2023-15:37:12] [I] Engine deserialized in 0.00748767 sec.
[11/07/2023-15:37:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +30, now: CPU 43, GPU 30 (MiB)
[11/07/2023-15:37:12] [I] Setting persistentCacheLimit to 0 bytes.
[11/07/2023-15:37:12] [I] Using random values for input input_x
[11/07/2023-15:37:12] [I] Created input binding for input_x with dimensions 1x3x768x960
[11/07/2023-15:37:12] [I] Using random values for output output_hm
[11/07/2023-15:37:12] [I] Created output binding for output_hm with dimensions 1x2x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_wh
[11/07/2023-15:37:12] [I] Created output binding for output_wh with dimensions 1x2x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_reg
[11/07/2023-15:37:12] [I] Created output binding for output_reg with dimensions 1x2x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_bulb_cls
[11/07/2023-15:37:12] [I] Created output binding for output_bulb_cls with dimensions 1x5x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_arrow_cls
[11/07/2023-15:37:12] [I] Created output binding for output_arrow_cls with dimensions 1x5x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_nb_cls
[11/07/2023-15:37:12] [I] Created output binding for output_nb_cls with dimensions 1x4x192x240
[11/07/2023-15:37:12] [I] Starting inference
[11/07/2023-15:37:16] [I] Warmup completed 2 queries over 200 ms
[11/07/2023-15:37:16] [I] Timing trace has 29 queries over 3.39795 s
[11/07/2023-15:37:16] [I] 
[11/07/2023-15:37:16] [I] === Trace details ===
[11/07/2023-15:37:16] [I] Trace averages of 10 runs:
[11/07/2023-15:37:16] [I] Average on 10 runs - GPU latency: 113.262 ms - Host latency: 113.745 ms (enqueue 0.211328 ms)
[11/07/2023-15:37:16] [I] Average on 10 runs - GPU latency: 113.274 ms - Host latency: 113.757 ms (enqueue 0.204419 ms)
[11/07/2023-15:37:16] [I] 
[11/07/2023-15:37:16] [I] === Performance summary ===
[11/07/2023-15:37:16] [I] Throughput: 8.53455 qps
[11/07/2023-15:37:16] [I] Latency: min = 113.672 ms, max = 113.83 ms, mean = 113.749 ms, median = 113.745 ms, percentile(90%) = 113.768 ms, percentile(95%) = 113.77 ms, percentile(99%) = 113.83 ms
[11/07/2023-15:37:16] [I] Enqueue Time: min = 0.183838 ms, max = 0.274994 ms, mean = 0.203462 ms, median = 0.19751 ms, percentile(90%) = 0.226868 ms, percentile(95%) = 0.231934 ms, percentile(99%) = 0.274994 ms
[11/07/2023-15:37:16] [I] H2D Latency: min = 0.30481 ms, max = 0.31781 ms, mean = 0.308573 ms, median = 0.307861 ms, percentile(90%) = 0.312988 ms, percentile(95%) = 0.31311 ms, percentile(99%) = 0.31781 ms
[11/07/2023-15:37:16] [I] GPU Compute Time: min = 113.248 ms, max = 113.348 ms, mean = 113.268 ms, median = 113.261 ms, percentile(90%) = 113.284 ms, percentile(95%) = 113.287 ms, percentile(99%) = 113.348 ms
[11/07/2023-15:37:16] [I] D2H Latency: min = 0.118896 ms, max = 0.178284 ms, mean = 0.172727 ms, median = 0.174438 ms, percentile(90%) = 0.176758 ms, percentile(95%) = 0.177246 ms, percentile(99%) = 0.178284 ms
[11/07/2023-15:37:16] [I] Total Host Walltime: 3.39795 s
[11/07/2023-15:37:16] [I] Total GPU Compute Time: 3.28477 s
[11/07/2023-15:37:16] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/07/2023-15:37:16] [I] 

I then ran the command below to generate the Nsight Systems profiling report:

nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --output=model_dla.nvvp /usr/src/tensorrt/bin/trtexec --loadEngine=model_dla.trt --iterations=50 --idleTime=1 --duration=0

and got the profiling report below, which shows that the model inference itself (blue circle) takes only about 0.5 ms while cudaEventSynchronize takes more than 100 ms:

My questions are:

  1. Why is cudaEventSynchronize taking that much time compared to the model inference itself?
  2. How can I check what cudaEventSynchronize is waiting for?

Hi,

Could you enable DLA trace and share the output with us?

$ nsys profile --soc-metrics=true ...

Thanks.

Thanks for your quick response! Below is the output. I also attached the profiling report.
model_dla.nvvp.nsys-rep.zip (3.8 MB)

[11/07/2023-22:21:59] [I] === Model Options ===
[11/07/2023-22:21:59] [I] Format: *
[11/07/2023-22:21:59] [I] Model: 
[11/07/2023-22:21:59] [I] Output:
[11/07/2023-22:21:59] [I] === Build Options ===
[11/07/2023-22:21:59] [I] Max batch: 1
[11/07/2023-22:21:59] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/07/2023-22:21:59] [I] minTiming: 1
[11/07/2023-22:21:59] [I] avgTiming: 8
[11/07/2023-22:21:59] [I] Precision: FP32
[11/07/2023-22:21:59] [I] LayerPrecisions: 
[11/07/2023-22:21:59] [I] Calibration: 
[11/07/2023-22:21:59] [I] Refit: Disabled
[11/07/2023-22:21:59] [I] Sparsity: Disabled
[11/07/2023-22:21:59] [I] Safe mode: Disabled
[11/07/2023-22:21:59] [I] DirectIO mode: Disabled
[11/07/2023-22:21:59] [I] Restricted mode: Disabled
[11/07/2023-22:21:59] [I] Build only: Disabled
[11/07/2023-22:21:59] [I] Save engine: 
[11/07/2023-22:21:59] [I] Load engine: /home/orin-1/yue/TLR/export/model_v1_3_1x3x768x960_aspp_false_dla.trt
[11/07/2023-22:21:59] [I] Profiling verbosity: 0
[11/07/2023-22:21:59] [I] Tactic sources: Using default tactic sources
[11/07/2023-22:21:59] [I] timingCacheMode: local
[11/07/2023-22:21:59] [I] timingCacheFile: 
[11/07/2023-22:21:59] [I] Heuristic: Disabled
[11/07/2023-22:21:59] [I] Preview Features: Use default preview flags.
[11/07/2023-22:21:59] [I] Input(s)s format: fp32:CHW
[11/07/2023-22:21:59] [I] Output(s)s format: fp32:CHW
[11/07/2023-22:21:59] [I] Input build shapes: model
[11/07/2023-22:21:59] [I] Input calibration shapes: model
[11/07/2023-22:21:59] [I] === System Options ===
[11/07/2023-22:21:59] [I] Device: 0
[11/07/2023-22:21:59] [I] DLACore: 
[11/07/2023-22:21:59] [I] Plugins:
[11/07/2023-22:21:59] [I] === Inference Options ===
[11/07/2023-22:21:59] [I] Batch: 1
[11/07/2023-22:21:59] [I] Input inference shapes: model
[11/07/2023-22:21:59] [I] Iterations: 50
[11/07/2023-22:21:59] [I] Duration: 0s (+ 200ms warm up)
[11/07/2023-22:21:59] [I] Sleep time: 0ms
[11/07/2023-22:21:59] [I] Idle time: 1ms
[11/07/2023-22:21:59] [I] Streams: 1
[11/07/2023-22:21:59] [I] ExposeDMA: Disabled
[11/07/2023-22:21:59] [I] Data transfers: Enabled
[11/07/2023-22:21:59] [I] Spin-wait: Disabled
[11/07/2023-22:21:59] [I] Multithreading: Disabled
[11/07/2023-22:21:59] [I] CUDA Graph: Disabled
[11/07/2023-22:21:59] [I] Separate profiling: Disabled
[11/07/2023-22:21:59] [I] Time Deserialize: Disabled
[11/07/2023-22:21:59] [I] Time Refit: Disabled
[11/07/2023-22:21:59] [I] NVTX verbosity: 0
[11/07/2023-22:21:59] [I] Persistent Cache Ratio: 0
[11/07/2023-22:21:59] [I] Inputs:
[11/07/2023-22:21:59] [I] === Reporting Options ===
[11/07/2023-22:21:59] [I] Verbose: Disabled
[11/07/2023-22:21:59] [I] Averages: 10 inferences
[11/07/2023-22:21:59] [I] Percentiles: 90,95,99
[11/07/2023-22:21:59] [I] Dump refittable layers:Disabled
[11/07/2023-22:21:59] [I] Dump output: Disabled
[11/07/2023-22:21:59] [I] Profile: Disabled
[11/07/2023-22:21:59] [I] Export timing to JSON file: 
[11/07/2023-22:21:59] [I] Export output to JSON file: 
[11/07/2023-22:21:59] [I] Export profile to JSON file: 
[11/07/2023-22:21:59] [I] 
[11/07/2023-22:21:59] [I] === Device Information ===
[11/07/2023-22:21:59] [I] Selected Device: Orin
[11/07/2023-22:21:59] [I] Compute Capability: 8.7
[11/07/2023-22:21:59] [I] SMs: 16
[11/07/2023-22:21:59] [I] Compute Clock Rate: 1.3 GHz
[11/07/2023-22:21:59] [I] Device Global Memory: 30588 MiB
[11/07/2023-22:21:59] [I] Shared Memory per SM: 164 KiB
[11/07/2023-22:21:59] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/07/2023-22:21:59] [I] Memory Clock Rate: 1.3 GHz
[11/07/2023-22:21:59] [I] 
[11/07/2023-22:21:59] [I] TensorRT version: 8.5.2
[11/07/2023-22:21:59] [I] Engine loaded in 0.0240994 sec.
[11/07/2023-22:21:59] [I] [TRT] Loaded engine size: 43 MiB
[11/07/2023-22:22:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +43, GPU +0, now: CPU 43, GPU 0 (MiB)
[11/07/2023-22:22:00] [I] Engine deserialized in 0.582743 sec.
[11/07/2023-22:22:00] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +30, now: CPU 43, GPU 30 (MiB)
[11/07/2023-22:22:00] [I] Setting persistentCacheLimit to 0 bytes.
[11/07/2023-22:22:00] [I] Using random values for input input_x
[11/07/2023-22:22:00] [I] Created input binding for input_x with dimensions 1x3x768x960
[11/07/2023-22:22:00] [I] Using random values for output output_hm
[11/07/2023-22:22:00] [I] Created output binding for output_hm with dimensions 1x2x192x240
[11/07/2023-22:22:00] [I] Using random values for output output_wh
[11/07/2023-22:22:00] [I] Created output binding for output_wh with dimensions 1x2x192x240
[11/07/2023-22:22:00] [I] Using random values for output output_reg
[11/07/2023-22:22:00] [I] Created output binding for output_reg with dimensions 1x2x192x240
[11/07/2023-22:22:00] [I] Using random values for output output_bulb_cls
[11/07/2023-22:22:00] [I] Created output binding for output_bulb_cls with dimensions 1x5x192x240
[11/07/2023-22:22:00] [I] Using random values for output output_arrow_cls
[11/07/2023-22:22:00] [I] Created output binding for output_arrow_cls with dimensions 1x5x192x240
[11/07/2023-22:22:00] [I] Using random values for output output_nb_cls
[11/07/2023-22:22:00] [I] Created output binding for output_nb_cls with dimensions 1x4x192x240
[11/07/2023-22:22:00] [I] Starting inference
[11/07/2023-22:22:05] [I] Warmup completed 2 queries over 200 ms
[11/07/2023-22:22:05] [I] Timing trace has 50 queries over 5.78734 s
[11/07/2023-22:22:05] [I] 
[11/07/2023-22:22:05] [I] === Trace details ===
[11/07/2023-22:22:05] [I] Trace averages of 10 runs:
[11/07/2023-22:22:05] [I] Average on 10 runs - GPU latency: 113.42 ms - Host latency: 114.003 ms (enqueue 0.74332 ms)
[11/07/2023-22:22:05] [I] Average on 10 runs - GPU latency: 113.464 ms - Host latency: 114.045 ms (enqueue 0.718201 ms)
[11/07/2023-22:22:05] [I] Average on 10 runs - GPU latency: 113.532 ms - Host latency: 114.111 ms (enqueue 0.749951 ms)
[11/07/2023-22:22:05] [I] Average on 10 runs - GPU latency: 113.453 ms - Host latency: 114.038 ms (enqueue 0.795557 ms)
[11/07/2023-22:22:05] [I] Average on 10 runs - GPU latency: 113.552 ms - Host latency: 114.129 ms (enqueue 0.798877 ms)
[11/07/2023-22:22:05] [I] 
[11/07/2023-22:22:05] [I] === Performance summary ===
[11/07/2023-22:22:05] [I] Throughput: 8.63954 qps
[11/07/2023-22:22:05] [I] Latency: min = 113.909 ms, max = 114.516 ms, mean = 114.065 ms, median = 114.023 ms, percentile(90%) = 114.236 ms, percentile(95%) = 114.306 ms, percentile(99%) = 114.516 ms
[11/07/2023-22:22:05] [I] Enqueue Time: min = 0.656738 ms, max = 0.947754 ms, mean = 0.761181 ms, median = 0.749146 ms, percentile(90%) = 0.854004 ms, percentile(95%) = 0.875977 ms, percentile(99%) = 0.947754 ms
[11/07/2023-22:22:05] [I] H2D Latency: min = 0.326904 ms, max = 0.391113 ms, mean = 0.342562 ms, median = 0.337036 ms, percentile(90%) = 0.361084 ms, percentile(95%) = 0.365234 ms, percentile(99%) = 0.391113 ms
[11/07/2023-22:22:05] [I] GPU Compute Time: min = 113.332 ms, max = 113.932 ms, mean = 113.484 ms, median = 113.444 ms, percentile(90%) = 113.662 ms, percentile(95%) = 113.692 ms, percentile(99%) = 113.932 ms
[11/07/2023-22:22:05] [I] D2H Latency: min = 0.127441 ms, max = 0.246582 ms, mean = 0.238352 ms, median = 0.240234 ms, percentile(90%) = 0.244171 ms, percentile(95%) = 0.245605 ms, percentile(99%) = 0.246582 ms
[11/07/2023-22:22:05] [I] Total Host Walltime: 5.78734 s
[11/07/2023-22:22:05] [I] Total GPU Compute Time: 5.67421 s
[11/07/2023-22:22:05] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/07/2023-22:22:05] [I] 

Hi,

It looks like cudaEventSynchronize is waiting for the DLA task.
If a model can run on DLA directly, TensorRT only needs to launch the task and then wait for it to finish.

You can also see that the DLA is always active in your use case.
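Below is a minimal Python sketch (not the exact trtexec internals; it assumes the serialized engine file model_dla.trt and uses pycuda for the stream and buffers) illustrating why the host time shows up under cudaEventSynchronize: enqueueing the inference returns almost immediately, and the host then blocks until the DLA task completes.

import time
import numpy as np
import tensorrt as trt
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
runtime.DLA_core = 0            # deserialize the DLA portion onto core 0
with open("model_dla.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate device buffers for every binding (inputs are left uninitialized,
# which is enough for a timing illustration).
bindings = []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    nbytes = trt.volume(engine.get_binding_shape(i)) * np.dtype(dtype).itemsize
    bindings.append(int(cuda.mem_alloc(nbytes)))

stream = cuda.Stream()
t0 = time.perf_counter()
context.execute_async_v2(bindings, stream.handle)   # enqueue: sub-millisecond
t1 = time.perf_counter()
stream.synchronize()            # host blocks here until the DLA task finishes
t2 = time.perf_counter()
print(f"enqueue: {1e3 * (t1 - t0):.3f} ms, wait for DLA: {1e3 * (t2 - t1):.3f} ms")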

Thanks.

Thanks for your response. Here I have attached the profiling report of the same model running fully on the GPU.


I have several questions:

  1. What is cudaEventSynchronize doing in the GPU-only case? I thought cudaEventSynchronize only starts after the TensorRT processing ends.
  2. Why is DLA (120 ms) so much slower than GPU (6 ms)?
  3. How can I improve DLA speed, other than by using INT8?

Hi,

1.
The GPU might need to do some data reformatting for DLA.
You can find the info below:

2.
We expect DLA to be used at lower input resolutions, so performance with FP16 at this resolution will be slower.

3. INT8 is recommended to improve DLA performance (see the build sketch below).
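
As an illustration of point 3, here is a hedged sketch of building an INT8 engine targeted at DLA core 0 with the TensorRT Python builder API (the file name model.onnx and the calibrator class MyCalibrator are placeholders, not taken from this thread):

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:            # placeholder model path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)          # FP16 as a secondary precision
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
# config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # only if some layers must run on the GPU
config.int8_calibrator = MyCalibrator()        # placeholder IInt8EntropyCalibrator2 implementation

engine_bytes = builder.build_serialized_network(network, config)
with open("model_dla_int8.trt", "wb") as f:
    f.write(engine_bytes)

This mirrors what trtexec does with --int8 --fp16 --useDLACore=0.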

Thanks


I am using the fx2trt pipeline to do INT8 quantization for TRT. Here is my pipeline:


# First set up the qconfig
qconfig = torch.ao.quantization.qconfig.QConfig(
    activation=torch.ao.quantization.observer.HistogramObserver.with_args(
        qscheme=torch.per_tensor_symmetric, dtype=torch.qint8
    ),
    weight=torch.ao.quantization.observer.default_per_channel_weight_observer
)
no_quant_qconfig = torch.ao.quantization.qconfig.QConfig(
    activation=torch.ao.quantization.observer.NoopObserver,
    weight=torch.ao.quantization.observer.NoopObserver
)

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(qconfig)  # Default qconfig for most layers
qconfig_mapping.set_object_type(xnn.layers.AddBlock_noq, no_quant_qconfig)  # Skip quantization for this block type

model_prepared = quantize_fx.prepare_fx(model_to_quantize,
                                        qconfig_mapping,
                                        [x],
                                        backend_config=get_tensorrt_backend_config())


# Then calibrate with sample images
for i in tqdm(range(1)):
    img, _, _ = dataset.get_item(i)
    img = img[:IMAGE_HEIGHT, :IMAGE_WIDTH, :]
    x = (img.astype(np.float32) - IMAGE_MEAN) / IMAGE_STD  # normalize
    x = np.transpose(x, (2, 0, 1))                         # HWC -> CHW
    x = x[np.newaxis, ...]                                  # add batch dimension
    x = torch.from_numpy(x).float()
    model_prepared(x)  # forward pass feeds the observers


# Then do quantization with FX
model_quantized = quantize_fx.convert_to_reference_fx(model_prepared)
model_traced = acc_tracer.trace(model_quantized, [x])

interp = TRTInterpreter(
    model_traced,
    [InputTensorSpec(shape=torch.Size([1, 3, IMAGE_HEIGHT, IMAGE_WIDTH]),
                     dtype=torch.float,
                     device=torch.device("cuda"),
                     has_batch_dim=True  # True if the first dimension of the shape is the batch size
                     )],
    logger_level=trt.Logger.VERBOSE
)
res = interp.run(lower_precision=LowerPrecision.INT8, strict_type_constraints=True, max_workspace_size=1 << 35)
engine, input_names, output_names = res.engine, res.input_names, res.output_names
trt_mod = TRTModule(engine, input_names, output_names)

# Serialize the engine
serialized_engine = engine.serialize()

My problem is that most of the layers are not supported by DLA, presumably because the layers have been optimized and fused (just my guess; I actually do not know why).
Here is the log from when I iterate over the layers of the TRT network and set them up for DLA:

GPU | Layer: 0 | Name: pixel_values.per_tensor_quant.scale | Type: LayerType.CONSTANT
 - Output 0 data type: DataType.FLOAT
GPU | Layer: 1 | Name: [QUANTIZE]-[unknown_ir_ops.quantize_per_tensor]-[pixel_values_per_tensor_quant] | Type: LayerType.QUANTIZE
 - Input 0 data type: DataType.FLOAT
 - Input 1 data type: DataType.FLOAT
 - Output 0 data type: DataType.FLOAT
GPU | Layer: 2 | Name: (Unnamed Layer* 1) [Quantize]_output.dequant.scale | Type: LayerType.CONSTANT
 - Output 0 data type: DataType.FLOAT
GPU | Layer: 3 | Name: [DEQUANTIZE]-[unknown_ir_ops.dequantize]-[(Unnamed Layer* 1) [Quantize]_output_.dequant] | Type: LayerType.DEQUANTIZE
 - Input 0 data type: DataType.FLOAT
 - Input 1 data type: DataType.FLOAT
 - Output 0 data type: DataType.FLOAT
GPU | Layer: 4 | Name: conv2d_95_weight | Type: LayerType.CONSTANT
 - Output 0 data type: DataType.FLOAT
GPU | Layer: 5 | Name: [CONVOLUTION]-[unknown_ir_ops.conv2d]-[conv2d_95] | Type: LayerType.CONVOLUTION
 - Input 0 data type: DataType.FLOAT
 - Input 1 data type: DataType.FLOAT
 - Output 0 data type: DataType.FLOAT
**DLA** | Layer: 6 | Name: [RELU]-[acc_ops.relu]-[relu_84] | Type: LayerType.ACTIVATION
 - Input 0 data type: DataType.FLOAT
 - Output 0 data type: DataType.FLOAT

I did this inside interp.run() with the following code:

for i in range(self.network.num_layers):
    layer = self.network.get_layer(i)

    if builder_config.can_run_on_DLA(layer):
        builder_config.set_device_type(layer, trt.DeviceType.DLA)
        print(f"DLA | Layer: {i} | Name: {layer.name} | Type: {layer.type}")
    else:
        builder_config.set_device_type(layer, trt.DeviceType.GPU)
        print(f"GPU | Layer: {i} | Name: {layer.name} | Type: {layer.type}")

    # Check input data types
    for j in range(layer.num_inputs):
        tensor = layer.get_input(j)
        print(f" - Input {j} data type: {tensor.dtype}")

    # Check output data types
    for k in range(layer.num_outputs):
        tensor = layer.get_output(k)
        print(f" - Output {k} data type: {tensor.dtype}")

The reason I adopted this pipeline is that the torch -> onnx -> trt_int8 (calibration) path has too large an accuracy drop, and I think this pipeline could give higher accuracy (at least for the FX model before converting to TRT; correct me if I am wrong).
My question is: in this pipeline, is it possible to run the INT8 TRT model on DLA? If not, how can I convert a torch model to an INT8 TRT model with the minimum accuracy drop among the available options?

Hi,

Since DLA is a hardware-based inference engine, there are some limitations when deploying the model.
Please find the DLA support matrix in the document below:

We do have several samples that deploy engines on DLA. Please give them a check:

Maybe you can also try JetPack 6 DP since it contains the latest DLA software.

Thanks.

Thanks for your response. This helped; the quantized TRT model now works with no problem!
But I have another question about sparsity. Here is what I did to prune and quantize the model.

  1. First I applied sparsity as follows
from apex.contrib.sparsity import ASP
model_sparse.model.cuda()
optimizer_sparse = torch.optim.AdamW(model_sparse.parameters(), lr=learning_rate, weight_decay=0.05)
ASP.prune_trained_model(model_sparse, optimizer_sparse)
trainer.fit(model=model_sparse, train_dataloaders=train_loader)
torch.save({"state_dict": model_sparse.state_dict()}, "/home/orin-1/yue/TLR/models/model_sparse.ckpt")
  2. Then I reloaded the pruned model from the checkpoint, applied quantization, and exported the ONNX as follows (model definition and parameter-loading code omitted):
def prune_trained_model_custom(model, optimizer, compute_sparse_masks=True):
    asp = ASP()
    asp.init_model_for_pruning(model, mask_calculator="m4n2_1d", verbosity=2, whitelist=[quant_nn.QuantLinear, quant_nn.QuantConv2d], allow_recompute_mask=False)
    asp.init_optimizer_for_pruning(optimizer)
    if compute_sparse_masks:
        asp.compute_sparse_masks()

prune_trained_model_custom(model.model, optimizer_sparse)
model.optimizer = optimizer_sparse
trainer.fit(model=model, train_dataloaders=train_loader)
quant_nn.TensorQuantizer.use_fb_fake_quant = True
torch.onnx.export(model.model.cuda(), dummy_input.cuda(), "/home/orin-1/yue/TLR/export_fine/qat_sparse_864_gpu.onnx", verbose=False, input_names=input_names, output_names=output_names)

  3. Then I ran the provided script to remove the Q/DQ nodes and save the calibration cache, and exported the quantized TRT engine as follows. I also tried GPU only, without DLA; still no speedup.

/usr/src/tensorrt/bin/trtexec --onnx='qat_sparse_864_gpu_noqdq.onnx' --saveEngine='qat_sparse_864_gpu_noqdq.trt' --int8 --fp16 --calib='qat_sparse_864_gpu_precision_config_calib.cache' --profilingVerbosity=detailed --sparsity=force --verbose --allowGPUFallback --useDLACore=0

However, this sparse model does not give any speedup: according to the following log, none of the layers are eligible for sparse math. But I am sure the structured sparsity meets the requirement that two out of every four elements are exactly zero across the input-channel dimension (a quick way to verify this is sketched after the log); you can also observe it in the ONNX model.

[01/04/2024-18:25:05] [I] [TRT] (Sparsity) Layers eligible for sparse math:
[01/04/2024-18:25:05] [I] [TRT] (Sparsity) TRT inference plan picked sparse implementation for layers:
[01/04/2024-18:25:05] [V] [TRT] Total number of generated kernels selected for the engine: 0
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUDNN
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[01/04/2024-18:25:05] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
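
For reference, a small check like the sketch below can verify the 2:4 pattern on the pruned PyTorch model before export (model here stands for the pruned network, which is an assumption; the grouping follows the input-channel layout described above):

import torch

def is_2_4_sparse(weight: torch.Tensor) -> bool:
    # Conv2d weight layout is (out_ch, in_ch, kH, kW); group in chunks of 4
    # along the input-channel axis and require at least 2 zeros per group.
    in_ch = weight.shape[1]
    if in_ch % 4 != 0:
        return False
    w = weight.detach().permute(0, 2, 3, 1).reshape(-1, in_ch)
    groups = w.reshape(w.shape[0], -1, 4)
    zeros_per_group = (groups == 0).sum(dim=-1)
    return bool((zeros_per_group >= 2).all())

for name, module in model.named_modules():   # 'model' is the pruned model (assumption)
    if isinstance(module, torch.nn.Conv2d):
        print(name, is_2_4_sparse(module.weight))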

I have attached my onnx model and trt engine for you to reproduce. Thanks!
Desktop.zip (15.0 MB)

So my question is: am I applying sparsity correctly, and if not, how do I get the claimed speedup from adding structured sparsity?

Let’s follow up on the new sparsity issue in the separate topic you created:

Thanks
