How to profile the inference time of each layer of a .engine model to find the bottleneck in DeepStream?

I want to know the inference time of each layer of a .engine model in DeepStream to better understand where the bottleneck is. Is there any tool that supports this in DeepStream 6.2? Thanks

Hi @linhbkpro2010,
since it's a .engine file, you should be using the nvinfer plugin, which is based on TensorRT.

So you can use TensorRT directly to profile it with the command below:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=swin_tiny_patch4_window7_224_bs8_best.engine --dumpProfile
…
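
If the goal is per-layer timing, you can also add --separateProfileRun so that the profiling pass does not perturb the end-to-end timing measurement; a sketch of the extra flag, assuming TensorRT 8.5's trtexec:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=swin_tiny_patch4_window7_224_bs8_best.engine --dumpProfile --separateProfileRun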

I ran the command to check the inference time of each layer, but I got this error:

&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
[05/19/2023-02:35:26] [I] === Model Options ===
[05/19/2023-02:35:26] [I] Format: *
[05/19/2023-02:35:26] [I] Model: 
[05/19/2023-02:35:26] [I] Output:
[05/19/2023-02:35:26] [I] === Build Options ===
[05/19/2023-02:35:26] [I] Max batch: 1
[05/19/2023-02:35:26] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[05/19/2023-02:35:26] [I] minTiming: 1
[05/19/2023-02:35:26] [I] avgTiming: 8
[05/19/2023-02:35:26] [I] Precision: FP32
[05/19/2023-02:35:26] [I] LayerPrecisions: 
[05/19/2023-02:35:26] [I] Calibration: 
[05/19/2023-02:35:26] [I] Refit: Disabled
[05/19/2023-02:35:26] [I] Sparsity: Disabled
[05/19/2023-02:35:26] [I] Safe mode: Disabled
[05/19/2023-02:35:26] [I] DirectIO mode: Disabled
[05/19/2023-02:35:26] [I] Restricted mode: Disabled
[05/19/2023-02:35:26] [I] Build only: Disabled
[05/19/2023-02:35:26] [I] Save engine: 
[05/19/2023-02:35:26] [I] Load engine: model_b1_gpu0_fp32.engine
[05/19/2023-02:35:26] [I] Profiling verbosity: 0
[05/19/2023-02:35:26] [I] Tactic sources: Using default tactic sources
[05/19/2023-02:35:26] [I] timingCacheMode: local
[05/19/2023-02:35:26] [I] timingCacheFile: 
[05/19/2023-02:35:26] [I] Heuristic: Disabled
[05/19/2023-02:35:26] [I] Preview Features: Use default preview flags.
[05/19/2023-02:35:26] [I] Input(s)s format: fp32:CHW
[05/19/2023-02:35:26] [I] Output(s)s format: fp32:CHW
[05/19/2023-02:35:26] [I] Input build shapes: model
[05/19/2023-02:35:26] [I] Input calibration shapes: model
[05/19/2023-02:35:26] [I] === System Options ===
[05/19/2023-02:35:26] [I] Device: 0
[05/19/2023-02:35:26] [I] DLACore: 
[05/19/2023-02:35:26] [I] Plugins:
[05/19/2023-02:35:26] [I] === Inference Options ===
[05/19/2023-02:35:26] [I] Batch: 1
[05/19/2023-02:35:26] [I] Input inference shapes: model
[05/19/2023-02:35:26] [I] Iterations: 10
[05/19/2023-02:35:26] [I] Duration: 3s (+ 200ms warm up)
[05/19/2023-02:35:26] [I] Sleep time: 0ms
[05/19/2023-02:35:26] [I] Idle time: 0ms
[05/19/2023-02:35:26] [I] Streams: 1
[05/19/2023-02:35:26] [I] ExposeDMA: Disabled
[05/19/2023-02:35:26] [I] Data transfers: Enabled
[05/19/2023-02:35:26] [I] Spin-wait: Disabled
[05/19/2023-02:35:26] [I] Multithreading: Disabled
[05/19/2023-02:35:26] [I] CUDA Graph: Disabled
[05/19/2023-02:35:26] [I] Separate profiling: Disabled
[05/19/2023-02:35:26] [I] Time Deserialize: Disabled
[05/19/2023-02:35:26] [I] Time Refit: Disabled
[05/19/2023-02:35:26] [I] NVTX verbosity: 0
[05/19/2023-02:35:26] [I] Persistent Cache Ratio: 0
[05/19/2023-02:35:26] [I] Inputs:
[05/19/2023-02:35:26] [I] === Reporting Options ===
[05/19/2023-02:35:26] [I] Verbose: Disabled
[05/19/2023-02:35:26] [I] Averages: 10 inferences
[05/19/2023-02:35:26] [I] Percentiles: 90,95,99
[05/19/2023-02:35:26] [I] Dump refittable layers:Disabled
[05/19/2023-02:35:26] [I] Dump output: Disabled
[05/19/2023-02:35:26] [I] Profile: Disabled
[05/19/2023-02:35:26] [I] Export timing to JSON file: 
[05/19/2023-02:35:26] [I] Export output to JSON file: 
[05/19/2023-02:35:26] [I] Export profile to JSON file: 
[05/19/2023-02:35:26] [I] 
[05/19/2023-02:35:26] [I] === Device Information ===
[05/19/2023-02:35:26] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[05/19/2023-02:35:26] [I] Compute Capability: 7.5
[05/19/2023-02:35:26] [I] SMs: 68
[05/19/2023-02:35:26] [I] Compute Clock Rate: 1.545 GHz
[05/19/2023-02:35:26] [I] Device Global Memory: 11016 MiB
[05/19/2023-02:35:26] [I] Shared Memory per SM: 64 KiB
[05/19/2023-02:35:26] [I] Memory Bus Width: 352 bits (ECC disabled)
[05/19/2023-02:35:26] [I] Memory Clock Rate: 7 GHz
[05/19/2023-02:35:26] [I] 
[05/19/2023-02:35:26] [I] TensorRT version: 8.5.2
[05/19/2023-02:35:26] [I] Engine loaded in 0.211239 sec.
[05/19/2023-02:35:26] [I] [TRT] Loaded engine size: 175 MiB
[05/19/2023-02:35:27] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[05/19/2023-02:35:27] [E] Error[1]: [pluginV2Runner.cpp::load::300] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[05/19/2023-02:35:27] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::66] Error Code 4: Internal Error (Engine deserialization failed.)
[05/19/2023-02:35:27] [E] Engine deserialization failed
[05/19/2023-02:35:27] [E] Got invalid engine!
[05/19/2023-02:35:27] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine

My .engine model was converted following the repo GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO models. I use this command to convert the .pt model to .engine:

deepstream-app -c deepstream_app_config_orig.txt

I'm unfamiliar with C++. Please give me some advice on profiling the inference of each layer. Thanks.

Are you running trtexec with the engine on a different type of GPU?

Yes, maybe I used a different GPU when running

deepstream-app -c deepstream_app_config_orig.txt

and

/usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine

But I ran the two commands again with CUDA_VISIBLE_DEVICES=0 set for both, and I still got almost the same error as above:

root@a8db07bc1951:/opt/nvidia/deepstream/deepstream-6.2/sources/DeepStream-Yolo# CUDA_VISIBLE_DEVICES=0 /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
…
[05/19/2023-02:54:23] [I] Engine loaded in 0.208848 sec.
[05/19/2023-02:54:24] [I] [TRT] Loaded engine size: 175 MiB
[05/19/2023-02:54:24] [E] Error[1]: [pluginV2Runner.cpp::load::300] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[05/19/2023-02:54:24] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::66] Error Code 4: Internal Error (Engine deserialization failed.)
[05/19/2023-02:54:24] [E] Engine deserialization failed
[05/19/2023-02:54:24] [E] Got invalid engine!
[05/19/2023-02:54:24] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine

I don't know where the problem is.

Does your model need a TensorRT plugin?
If it does, you need to specify "--plugins=$TRT_PLUGIN_LIB" so trtexec loads the plugin lib.
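
For example, assuming the custom plugin library built by the DeepStream-Yolo repo is named libnvdsinfer_custom_impl_Yolo.so (the exact path depends on where you built it), the call might look like:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine --dumpProfile --plugins=./nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so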

Thanks a lot. What a quick, awesome response. You are right. Now I can see the expected info.

Are there any other tools that are better for visualizing and analyzing the per-layer profile?

Regarding the image in comment #6 above, why do you think the output is not friendly enough?

I have generated the .engine model (by using TensorRT in Python; I didn't use trtexec). With trtexec and the above command I can get the profile for each layer.

I found that NVIDIA has the tool TensorRT Engine Explorer with more features. But this tool needs 3 JSON files: profile.json, graph.json, and meta_profile.json. Since I already have the .engine model, I can get the first 2 JSON files by using trtexec --exportProfile .... How can I get the third JSON file, meta_profile.json?
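
For reference, a sketch of exporting both of those files in one run, assuming TensorRT 8.5's trtexec (substitute your plugin lib for $TRT_PLUGIN_LIB; note that graph.json only contains full layer details if the engine was built with detailed profiling verbosity):

$ /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine --plugins=$TRT_PLUGIN_LIB --dumpProfile --exportProfile=profile.json --exportLayerInfo=graph.json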

@mchi I know that there are many ways to convert a .onnx model to a .engine model.

Hi @johnminho,
You can refer to yolo_deepstream/Guidance_of_QAT_performance_optimization.md at main · NVIDIA-AI-IOT/yolo_deepstream · GitHub to use TensorRT Engine Explorer with the JSON files.

Thanks. The reference you sent only draws the graph of the engine model. I want to deeply analyze the profile of each layer (inference); TensorRT Engine Explorer can give me more insight.

I want to confirm: is there any way to create the meta_profile.json file from a generated .engine?

As I understand it, I need to use the TensorRT Engine Explorer code to generate the engine model and the 3 JSON files, so that I can then use the TensorRT Engine Explorer tools to analyze the model. Is that right?
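
For the analysis step, a minimal Python sketch of loading the exported files into TensorRT Engine Explorer; this assumes the trex package from TensorRT's tools/experimental/trt-engine-explorer is installed, and the exact API may differ between versions:

    # Hypothetical sketch: load trtexec's exported JSON files into TREx.
    import trex

    # graph.json comes from --exportLayerInfo, profile.json from --exportProfile.
    plan = trex.EnginePlan("graph.json", "profile.json")
    df = plan.df  # pandas DataFrame with one row per engine layer
    print(df.columns)  # inspect the available per-layer timing fields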

@mchi

I use this example from NVIDIA yolo_deepstream/deepstream_yolo at main · NVIDIA-AI-IOT/yolo_deepstream · GitHub to generate the FP32 model. But when I export the graph.json file, it only contains layer names. How do I set ProfilingVerbosity::DETAILED in DeepStream to get a full graph.json for later use in TensorRT Engine Explorer? I searched in /opt/nvidia/deepstream/deepstream-6.2/sources/apps/sample_apps/deepstream-app/ but I didn't find where I need to set it. Sorry, I'm unfamiliar with C++.
I also tried changing

    void setProfilingVerbosity(ProfilingVerbosity verbosity) noexcept
    {
        // Original pass-through, commented out:
        // mImpl->setProfilingVerbosity(verbosity);
        // Hard-code DETAILED regardless of what the caller requests:
        mImpl->setProfilingVerbosity(ProfilingVerbosity::DETAILED);
    }

But it has no effect. What is wrong?
Is there any way to set it in deepstream_config.txt? Thanks.
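
For comparison, when building an engine directly with the TensorRT Python API (as mentioned earlier in the thread), the verbosity can be set on the builder config. A minimal sketch, assuming TensorRT 8.5 and a placeholder model.onnx:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open("model.onnx", "rb") as f:  # placeholder model path
        assert parser.parse(f.read())

    config = builder.create_builder_config()
    # Keep per-layer details in the engine so --exportLayerInfo is complete.
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

    engine_bytes = builder.build_serialized_network(network, config)
    with open("model_detailed.engine", "wb") as f:
        f.write(engine_bytes)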