How to profile the inference time of each layer of a .engine model to find the bottleneck in DeepStream?

I want to know the inference time of each layer of a .engine model in DeepStream to better understand where the bottleneck is. Is there any tool that supports this in DeepStream 6.2? Thanks

Hi @linhbkpro2010,
since it's a .engine file, you should be using the nvinfer plugin, which is based on TensorRT.

So you can use TensorRT directly to profile it with the command below:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=swin_tiny_patch4_window7_224_bs8_best.engine --dumpProfile
…
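
If the goal is per-layer timing, you can also add --separateProfileRun so that the profiling pass does not perturb the end-to-end timing measurement; a sketch of the extra flag, assuming TensorRT 8.5's trtexec:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=swin_tiny_patch4_window7_224_bs8_best.engine --dumpProfile --separateProfileRun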

I ran the command to check the inference time of each layer, but I got this error:

&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
[05/19/2023-02:35:26] [I] === Model Options ===
[05/19/2023-02:35:26] [I] Format: *
[05/19/2023-02:35:26] [I] Model: 
[05/19/2023-02:35:26] [I] Output:
[05/19/2023-02:35:26] [I] === Build Options ===
[05/19/2023-02:35:26] [I] Max batch: 1
[05/19/2023-02:35:26] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[05/19/2023-02:35:26] [I] minTiming: 1
[05/19/2023-02:35:26] [I] avgTiming: 8
[05/19/2023-02:35:26] [I] Precision: FP32
[05/19/2023-02:35:26] [I] LayerPrecisions: 
[05/19/2023-02:35:26] [I] Calibration: 
[05/19/2023-02:35:26] [I] Refit: Disabled
[05/19/2023-02:35:26] [I] Sparsity: Disabled
[05/19/2023-02:35:26] [I] Safe mode: Disabled
[05/19/2023-02:35:26] [I] DirectIO mode: Disabled
[05/19/2023-02:35:26] [I] Restricted mode: Disabled
[05/19/2023-02:35:26] [I] Build only: Disabled
[05/19/2023-02:35:26] [I] Save engine: 
[05/19/2023-02:35:26] [I] Load engine: model_b1_gpu0_fp32.engine
[05/19/2023-02:35:26] [I] Profiling verbosity: 0
[05/19/2023-02:35:26] [I] Tactic sources: Using default tactic sources
[05/19/2023-02:35:26] [I] timingCacheMode: local
[05/19/2023-02:35:26] [I] timingCacheFile: 
[05/19/2023-02:35:26] [I] Heuristic: Disabled
[05/19/2023-02:35:26] [I] Preview Features: Use default preview flags.
[05/19/2023-02:35:26] [I] Input(s)s format: fp32:CHW
[05/19/2023-02:35:26] [I] Output(s)s format: fp32:CHW
[05/19/2023-02:35:26] [I] Input build shapes: model
[05/19/2023-02:35:26] [I] Input calibration shapes: model
[05/19/2023-02:35:26] [I] === System Options ===
[05/19/2023-02:35:26] [I] Device: 0
[05/19/2023-02:35:26] [I] DLACore: 
[05/19/2023-02:35:26] [I] Plugins:
[05/19/2023-02:35:26] [I] === Inference Options ===
[05/19/2023-02:35:26] [I] Batch: 1
[05/19/2023-02:35:26] [I] Input inference shapes: model
[05/19/2023-02:35:26] [I] Iterations: 10
[05/19/2023-02:35:26] [I] Duration: 3s (+ 200ms warm up)
[05/19/2023-02:35:26] [I] Sleep time: 0ms
[05/19/2023-02:35:26] [I] Idle time: 0ms
[05/19/2023-02:35:26] [I] Streams: 1
[05/19/2023-02:35:26] [I] ExposeDMA: Disabled
[05/19/2023-02:35:26] [I] Data transfers: Enabled
[05/19/2023-02:35:26] [I] Spin-wait: Disabled
[05/19/2023-02:35:26] [I] Multithreading: Disabled
[05/19/2023-02:35:26] [I] CUDA Graph: Disabled
[05/19/2023-02:35:26] [I] Separate profiling: Disabled
[05/19/2023-02:35:26] [I] Time Deserialize: Disabled
[05/19/2023-02:35:26] [I] Time Refit: Disabled
[05/19/2023-02:35:26] [I] NVTX verbosity: 0
[05/19/2023-02:35:26] [I] Persistent Cache Ratio: 0
[05/19/2023-02:35:26] [I] Inputs:
[05/19/2023-02:35:26] [I] === Reporting Options ===
[05/19/2023-02:35:26] [I] Verbose: Disabled
[05/19/2023-02:35:26] [I] Averages: 10 inferences
[05/19/2023-02:35:26] [I] Percentiles: 90,95,99
[05/19/2023-02:35:26] [I] Dump refittable layers:Disabled
[05/19/2023-02:35:26] [I] Dump output: Disabled
[05/19/2023-02:35:26] [I] Profile: Disabled
[05/19/2023-02:35:26] [I] Export timing to JSON file: 
[05/19/2023-02:35:26] [I] Export output to JSON file: 
[05/19/2023-02:35:26] [I] Export profile to JSON file: 
[05/19/2023-02:35:26] [I] 
[05/19/2023-02:35:26] [I] === Device Information ===
[05/19/2023-02:35:26] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[05/19/2023-02:35:26] [I] Compute Capability: 7.5
[05/19/2023-02:35:26] [I] SMs: 68
[05/19/2023-02:35:26] [I] Compute Clock Rate: 1.545 GHz
[05/19/2023-02:35:26] [I] Device Global Memory: 11016 MiB
[05/19/2023-02:35:26] [I] Shared Memory per SM: 64 KiB
[05/19/2023-02:35:26] [I] Memory Bus Width: 352 bits (ECC disabled)
[05/19/2023-02:35:26] [I] Memory Clock Rate: 7 GHz
[05/19/2023-02:35:26] [I] 
[05/19/2023-02:35:26] [I] TensorRT version: 8.5.2
[05/19/2023-02:35:26] [I] Engine loaded in 0.211239 sec.
[05/19/2023-02:35:26] [I] [TRT] Loaded engine size: 175 MiB
[05/19/2023-02:35:27] [W] [TRT] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[05/19/2023-02:35:27] [E] Error[1]: [pluginV2Runner.cpp::load::300] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[05/19/2023-02:35:27] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::66] Error Code 4: Internal Error (Engine deserialization failed.)
[05/19/2023-02:35:27] [E] Engine deserialization failed
[05/19/2023-02:35:27] [E] Got invalid engine!
[05/19/2023-02:35:27] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine

My .engine model was converted following the repo GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO models. I use this command to convert the .pt model to .engine:

deepstream-app -c deepstream_app_config_orig.txt

I'm unfamiliar with C++. Please give me some advice on profiling the inference of each layer. Thanks.

Are you running trtexec with the engine on a different type of GPU?

Yes, maybe I used a different GPU when running

deepstream-app -c deepstream_app_config_orig.txt

and

/usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine

But I ran the two commands again with CUDA_VISIBLE_DEVICES=0 set for both, and I still got almost the same error as above:

root@a8db07bc1951:/opt/nvidia/deepstream/deepstream-6.2/sources/DeepStream-Yolo# CUDA_VISIBLE_DEVICES=0 /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
…
[05/19/2023-02:54:23] [I] Engine loaded in 0.208848 sec.
[05/19/2023-02:54:24] [I] [TRT] Loaded engine size: 175 MiB
[05/19/2023-02:54:24] [E] Error[1]: [pluginV2Runner.cpp::load::300] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[05/19/2023-02:54:24] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::66] Error Code 4: Internal Error (Engine deserialization failed.)
[05/19/2023-02:54:24] [E] Engine deserialization failed
[05/19/2023-02:54:24] [E] Got invalid engine!
[05/19/2023-02:54:24] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine

I don't know where the problem is.

Does your model need a TensorRT plugin?
If it does, you need to specify "--plugins=$TRT_PLUGIN_LIB" so trtexec loads the plugin lib.
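
For example, assuming the custom plugin library built by the DeepStream-Yolo repo is named libnvdsinfer_custom_impl_Yolo.so (the exact path depends on where you built it), the call might look like:

$ /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine --dumpProfile --plugins=./nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so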

Thanks a lot. What a quick, awesome response. You are right. Now I can see the expected info.

Are there any other tools that are better for visualizing and analyzing the per-layer profile?

Regarding the image in comment #6 above, why do you think the output is not friendly enough?

I have generated the .engine model (by using TensorRT in Python; I didn't use trtexec). With trtexec and the above command I can get the profile for each layer.

I found that NVIDIA has the tool TensorRT Engine Explorer with more features. But this tool needs 3 JSON files: profile.json, graph.json, and meta_profile.json. Since I already have the .engine model, I can get the first 2 JSON files by using trtexec --exportProfile .... How can I get the third JSON file, meta_profile.json?
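
For reference, a sketch of exporting both of those files in one run, assuming TensorRT 8.5's trtexec (substitute your plugin lib for $TRT_PLUGIN_LIB; note that graph.json only contains full layer details if the engine was built with detailed profiling verbosity):

$ /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine --plugins=$TRT_PLUGIN_LIB --dumpProfile --exportProfile=profile.json --exportLayerInfo=graph.json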

@mchi I know that there are many ways to convert a .onnx model to a .engine model.

Hi @johnminho,
You can refer to yolo_deepstream/Guidance_of_QAT_performance_optimization.md at main · NVIDIA-AI-IOT/yolo_deepstream · GitHub to use TensorRT Engine Explorer with the JSON files.

Thanks. The reference you sent only draws the graph of the engine model. I want to deeply analyze the profile of each layer (inference); TensorRT Engine Explorer can give me more insight.

I want to confirm: is there any way to create the meta_profile.json file from a generated .engine?

As I understand it, I need to use the TensorRT Engine Explorer code to generate the engine model and the 3 JSON files, so that I can then use the TensorRT Engine Explorer tools to analyze the model. Is that right?
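
For the analysis step, a minimal Python sketch of loading the exported files into TensorRT Engine Explorer; this assumes the trex package from TensorRT's tools/experimental/trt-engine-explorer is installed, and the exact API may differ between versions:

    # Hypothetical sketch: load trtexec's exported JSON files into TREx.
    import trex

    # graph.json comes from --exportLayerInfo, profile.json from --exportProfile.
    plan = trex.EnginePlan("graph.json", "profile.json")
    df = plan.df  # pandas DataFrame with one row per engine layer
    print(df.columns)  # inspect the available per-layer timing fields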

@mchi

I use this example from NVIDIA yolo_deepstream/deepstream_yolo at main · NVIDIA-AI-IOT/yolo_deepstream · GitHub to generate the FP32 model. But when I export the graph.json file, it only contains layer names. How do I set ProfilingVerbosity::DETAILED in DeepStream to get a full graph.json for later use in TensorRT Engine Explorer? I searched in /opt/nvidia/deepstream/deepstream-6.2/sources/apps/sample_apps/deepstream-app/ but I didn't find where I need to set it. Sorry, I'm unfamiliar with C++.
I also tried changing

    void setProfilingVerbosity(ProfilingVerbosity verbosity) noexcept
    {
        // Original pass-through, commented out:
        // mImpl->setProfilingVerbosity(verbosity);
        // Hard-code DETAILED regardless of what the caller requests:
        mImpl->setProfilingVerbosity(ProfilingVerbosity::DETAILED);
    }

But it has no effect. What is wrong?
Is there any way to set it in deepstream_config.txt? Thanks.
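
For comparison, when building an engine directly with the TensorRT Python API (as mentioned earlier in the thread), the verbosity can be set on the builder config. A minimal sketch, assuming TensorRT 8.5 and a placeholder model.onnx:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open("model.onnx", "rb") as f:  # placeholder model path
        assert parser.parse(f.read())

    config = builder.create_builder_config()
    # Keep per-layer details in the engine so --exportLayerInfo is complete.
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

    engine_bytes = builder.build_serialized_network(network, config)
    with open("model_detailed.engine", "wb") as f:
        f.write(engine_bytes)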