Hello NVIDIA Community,
I’m working with a Jetson Orin 64GB and trying to run CNN inference entirely on the DLA, without GPU fallback. I used the following command with TensorRT version 8.5.2:
trtexec --onnx='model.onnx' --saveEngine='model_dla.trt' --fp16 --useDLACore=0 --useSpinWait --separateProfileRun
I get the output below, which shows that the model runs fully on the DLA and none of the layers are on the GPU:
[11/07/2023-15:36:54] [I] === Model Options ===
[11/07/2023-15:36:54] [I] Format: ONNX
[11/07/2023-15:36:54] [I] Model: /home/orin-1/export/model_v1_3_1x3x768x960_aspp_false.onnx
[11/07/2023-15:36:54] [I] Output:
[11/07/2023-15:36:54] [I] === Build Options ===
[11/07/2023-15:36:54] [I] Max batch: explicit batch
[11/07/2023-15:36:54] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/07/2023-15:36:54] [I] minTiming: 1
[11/07/2023-15:36:54] [I] avgTiming: 8
[11/07/2023-15:36:54] [I] Precision: FP32+FP16
[11/07/2023-15:36:54] [I] LayerPrecisions:
[11/07/2023-15:36:54] [I] Calibration:
[11/07/2023-15:36:54] [I] Refit: Disabled
[11/07/2023-15:36:54] [I] Sparsity: Disabled
[11/07/2023-15:36:54] [I] Safe mode: Disabled
[11/07/2023-15:36:54] [I] DirectIO mode: Disabled
[11/07/2023-15:36:54] [I] Restricted mode: Disabled
[11/07/2023-15:36:54] [I] Build only: Disabled
[11/07/2023-15:36:54] [I] Save engine: /home/orin-1/export/model_v1_3_1x3x768x960_aspp_false_dla.trt
[11/07/2023-15:36:54] [I] Load engine:
[11/07/2023-15:36:54] [I] Profiling verbosity: 0
[11/07/2023-15:36:54] [I] Tactic sources: Using default tactic sources
[11/07/2023-15:36:54] [I] timingCacheMode: local
[11/07/2023-15:36:54] [I] timingCacheFile:
[11/07/2023-15:36:54] [I] Heuristic: Disabled
[11/07/2023-15:36:54] [I] Preview Features: Use default preview flags.
[11/07/2023-15:36:54] [I] Input(s)s format: fp32:CHW
[11/07/2023-15:36:54] [I] Output(s)s format: fp32:CHW
[11/07/2023-15:36:54] [I] Input build shapes: model
[11/07/2023-15:36:54] [I] Input calibration shapes: model
[11/07/2023-15:36:54] [I] === System Options ===
[11/07/2023-15:36:54] [I] Device: 0
[11/07/2023-15:36:54] [I] DLACore: 0(With GPU fallback)
[11/07/2023-15:36:54] [I] Plugins:
[11/07/2023-15:36:54] [I] === Inference Options ===
[11/07/2023-15:36:54] [I] Batch: Explicit
[11/07/2023-15:36:54] [I] Input inference shapes: model
[11/07/2023-15:36:54] [I] Iterations: 10
[11/07/2023-15:36:54] [I] Duration: 3s (+ 200ms warm up)
[11/07/2023-15:36:54] [I] Sleep time: 0ms
[11/07/2023-15:36:54] [I] Idle time: 0ms
[11/07/2023-15:36:54] [I] Streams: 1
[11/07/2023-15:36:54] [I] ExposeDMA: Disabled
[11/07/2023-15:36:54] [I] Data transfers: Enabled
[11/07/2023-15:36:54] [I] Spin-wait: Enabled
[11/07/2023-15:36:54] [I] Multithreading: Disabled
[11/07/2023-15:36:54] [I] CUDA Graph: Disabled
[11/07/2023-15:36:54] [I] Separate profiling: Enabled
[11/07/2023-15:36:54] [I] Time Deserialize: Disabled
[11/07/2023-15:36:54] [I] Time Refit: Disabled
[11/07/2023-15:36:54] [I] NVTX verbosity: 0
[11/07/2023-15:36:54] [I] Persistent Cache Ratio: 0
[11/07/2023-15:36:54] [I] Inputs:
[11/07/2023-15:36:54] [I] === Reporting Options ===
[11/07/2023-15:36:54] [I] Verbose: Disabled
[11/07/2023-15:36:54] [I] Averages: 10 inferences
[11/07/2023-15:36:54] [I] Percentiles: 90,95,99
[11/07/2023-15:36:54] [I] Dump refittable layers:Disabled
[11/07/2023-15:36:54] [I] Dump output: Disabled
[11/07/2023-15:36:54] [I] Profile: Disabled
[11/07/2023-15:36:54] [I] Export timing to JSON file:
[11/07/2023-15:36:54] [I] Export output to JSON file:
[11/07/2023-15:36:54] [I] Export profile to JSON file: ./DLA/model_dla_pixel.json
[11/07/2023-15:36:54] [I]
[11/07/2023-15:36:54] [I] === Device Information ===
[11/07/2023-15:36:54] [I] Selected Device: Orin
[11/07/2023-15:36:54] [I] Compute Capability: 8.7
[11/07/2023-15:36:54] [I] SMs: 16
[11/07/2023-15:36:54] [I] Compute Clock Rate: 1.3 GHz
[11/07/2023-15:36:54] [I] Device Global Memory: 30588 MiB
[11/07/2023-15:36:54] [I] Shared Memory per SM: 164 KiB
[11/07/2023-15:36:54] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/07/2023-15:36:54] [I] Memory Clock Rate: 1.3 GHz
[11/07/2023-15:36:54] [I]
[11/07/2023-15:36:54] [I] TensorRT version: 8.5.2
[11/07/2023-15:36:54] [I] [TRT] [MemUsageChange] Init CUDA: CPU +220, GPU +0, now: CPU 246, GPU 5598 (MiB)
[11/07/2023-15:36:56] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +286, now: CPU 571, GPU 5906 (MiB)
[11/07/2023-15:36:56] [I] Start parsing network model
[11/07/2023-15:36:56] [I] [TRT] ----------------------------------------------------------------
[11/07/2023-15:36:56] [I] [TRT] Input filename: /home/orin-1/yue/TLR/export/model_v1_3_1x3x768x960_aspp_false.onnx
[11/07/2023-15:36:56] [I] [TRT] ONNX IR version: 0.0.7
[11/07/2023-15:36:56] [I] [TRT] Opset version: 14
[11/07/2023-15:36:56] [I] [TRT] Producer name: pytorch
[11/07/2023-15:36:56] [I] [TRT] Producer version: 2.0.0
[11/07/2023-15:36:56] [I] [TRT] Domain:
[11/07/2023-15:36:56] [I] [TRT] Model version: 0
[11/07/2023-15:36:56] [I] [TRT] Doc string:
[11/07/2023-15:36:56] [I] [TRT] ----------------------------------------------------------------
[11/07/2023-15:36:56] [I] Finish parsing network model
[11/07/2023-15:37:00] [I] [TRT] ---------- Layers Running on DLA ----------
[11/07/2023-15:37:00] [I] [TRT] [DlaLayer] {ForeignNode[/encoder/stem/conv/Conv.../headers/header_nb_cls/Conv]}
[11/07/2023-15:37:00] [I] [TRT] ---------- Layers Running on GPU ----------
[11/07/2023-15:37:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +307, now: CPU 1128, GPU 6437 (MiB)
[11/07/2023-15:37:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +82, GPU +81, now: CPU 1210, GPU 6518 (MiB)
[11/07/2023-15:37:01] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/07/2023-15:37:11] [I] [TRT] Total Activation Memory: 32106590208
[11/07/2023-15:37:11] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[11/07/2023-15:37:12] [I] [TRT] Total Host Persistent Memory: 128
[11/07/2023-15:37:12] [I] [TRT] Total Device Persistent Memory: 0
[11/07/2023-15:37:12] [I] [TRT] Total Scratch Memory: 0
[11/07/2023-15:37:12] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 43 MiB, GPU 30 MiB
[11/07/2023-15:37:12] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 7 steps to complete.
[11/07/2023-15:37:12] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.047329ms to assign 7 blocks to 7 nodes requiring 32440320 bytes.
[11/07/2023-15:37:12] [I] [TRT] Total Activation Memory: 32440320
[11/07/2023-15:37:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +43, GPU +0, now: CPU 43, GPU 0 (MiB)
[11/07/2023-15:37:12] [I] Engine built in 18.1508 sec.
[11/07/2023-15:37:12] [I] [TRT] Loaded engine size: 43 MiB
[11/07/2023-15:37:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +43, GPU +0, now: CPU 43, GPU 0 (MiB)
[11/07/2023-15:37:12] [I] Engine deserialized in 0.00748767 sec.
[11/07/2023-15:37:12] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +30, now: CPU 43, GPU 30 (MiB)
[11/07/2023-15:37:12] [I] Setting persistentCacheLimit to 0 bytes.
[11/07/2023-15:37:12] [I] Using random values for input input_x
[11/07/2023-15:37:12] [I] Created input binding for input_x with dimensions 1x3x768x960
[11/07/2023-15:37:12] [I] Using random values for output output_hm
[11/07/2023-15:37:12] [I] Created output binding for output_hm with dimensions 1x2x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_wh
[11/07/2023-15:37:12] [I] Created output binding for output_wh with dimensions 1x2x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_reg
[11/07/2023-15:37:12] [I] Created output binding for output_reg with dimensions 1x2x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_bulb_cls
[11/07/2023-15:37:12] [I] Created output binding for output_bulb_cls with dimensions 1x5x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_arrow_cls
[11/07/2023-15:37:12] [I] Created output binding for output_arrow_cls with dimensions 1x5x192x240
[11/07/2023-15:37:12] [I] Using random values for output output_nb_cls
[11/07/2023-15:37:12] [I] Created output binding for output_nb_cls with dimensions 1x4x192x240
[11/07/2023-15:37:12] [I] Starting inference
[11/07/2023-15:37:16] [I] Warmup completed 2 queries over 200 ms
[11/07/2023-15:37:16] [I] Timing trace has 29 queries over 3.39795 s
[11/07/2023-15:37:16] [I]
[11/07/2023-15:37:16] [I] === Trace details ===
[11/07/2023-15:37:16] [I] Trace averages of 10 runs:
[11/07/2023-15:37:16] [I] Average on 10 runs - GPU latency: 113.262 ms - Host latency: 113.745 ms (enqueue 0.211328 ms)
[11/07/2023-15:37:16] [I] Average on 10 runs - GPU latency: 113.274 ms - Host latency: 113.757 ms (enqueue 0.204419 ms)
[11/07/2023-15:37:16] [I]
[11/07/2023-15:37:16] [I] === Performance summary ===
[11/07/2023-15:37:16] [I] Throughput: 8.53455 qps
[11/07/2023-15:37:16] [I] Latency: min = 113.672 ms, max = 113.83 ms, mean = 113.749 ms, median = 113.745 ms, percentile(90%) = 113.768 ms, percentile(95%) = 113.77 ms, percentile(99%) = 113.83 ms
[11/07/2023-15:37:16] [I] Enqueue Time: min = 0.183838 ms, max = 0.274994 ms, mean = 0.203462 ms, median = 0.19751 ms, percentile(90%) = 0.226868 ms, percentile(95%) = 0.231934 ms, percentile(99%) = 0.274994 ms
[11/07/2023-15:37:16] [I] H2D Latency: min = 0.30481 ms, max = 0.31781 ms, mean = 0.308573 ms, median = 0.307861 ms, percentile(90%) = 0.312988 ms, percentile(95%) = 0.31311 ms, percentile(99%) = 0.31781 ms
[11/07/2023-15:37:16] [I] GPU Compute Time: min = 113.248 ms, max = 113.348 ms, mean = 113.268 ms, median = 113.261 ms, percentile(90%) = 113.284 ms, percentile(95%) = 113.287 ms, percentile(99%) = 113.348 ms
[11/07/2023-15:37:16] [I] D2H Latency: min = 0.118896 ms, max = 0.178284 ms, mean = 0.172727 ms, median = 0.174438 ms, percentile(90%) = 0.176758 ms, percentile(95%) = 0.177246 ms, percentile(99%) = 0.178284 ms
[11/07/2023-15:37:16] [I] Total Host Walltime: 3.39795 s
[11/07/2023-15:37:16] [I] Total GPU Compute Time: 3.28477 s
[11/07/2023-15:37:16] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/07/2023-15:37:16] [I]
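As a sanity check, the figures in the performance summary above are mutually consistent. Below is a minimal Python sketch that uses only numbers copied from the log. It assumes the metric definitions that trtexec prints in its verbose output: throughput is the number of queries divided by total host walltime, and latency is the sum of H2D latency, GPU compute time, and D2H latency. The variable names are my own.

```python
# Figures copied from the trtexec performance summary above.
queries = 29                   # "Timing trace has 29 queries"
host_walltime_s = 3.39795      # Total Host Walltime
total_gpu_compute_s = 3.28477  # Total GPU Compute Time
mean_h2d_ms = 0.308573         # H2D Latency (mean)
mean_gpu_compute_ms = 113.268  # GPU Compute Time (mean)
mean_d2h_ms = 0.172727         # D2H Latency (mean)
mean_enqueue_ms = 0.203462     # Enqueue Time (mean)

# Throughput = queries / total host walltime.
throughput_qps = queries / host_walltime_s

# Mean GPU compute time per query, from the total.
gpu_compute_per_query_ms = total_gpu_compute_s / queries * 1000.0

# Host latency = H2D + GPU compute + D2H.
host_latency_ms = mean_h2d_ms + mean_gpu_compute_ms + mean_d2h_ms

# Fraction of the per-query compute time that the host spends enqueueing.
enqueue_share = mean_enqueue_ms / mean_gpu_compute_ms

print(f"throughput:            {throughput_qps:.5f} qps")
print(f"GPU compute per query: {gpu_compute_per_query_ms:.3f} ms")
print(f"host latency:          {host_latency_ms:.3f} ms")
print(f"enqueue share:         {enqueue_share:.4%}")
```

The recomputed values agree with the reported throughput, mean GPU Compute Time, and mean Latency to within rounding. The enqueue call accounts for well under 1% of the per-query time, so almost all of the roughly 113 ms per query is spent after the work has been enqueued.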
I then ran the command below to get the Nsight Systems profiling report:
nsys profile --trace=cuda,nvtx,cublas,cudla,cusparse,cudnn,nvmedia --output=model_dla.nvvp /usr/src/tensorrt/bin/trtexec --loadEngine=model_dla.trt --iterations=50 --idleTime=1 --duration=0
This gives the profiling report below, in which the model inference (blue circle) takes only 0.5 ms while cudaEventSynchronize takes more than 100 ms:
My questions are:
- Why does cudaEventSynchronize take so much time compared to the model inference itself?
- How can I check what cudaEventSynchronize is waiting for?