Performance regression when using CUDA Graph with MPS enabled

Hi,

To improve GPU utilization, I adopted CUDA Graphs, which launch all of an AI model's kernels in one call and thereby reduce the pthread_mutex_lock contention discussed in "Model inference on multiple cuda streams with tensorrt api" (Jetson & Embedded Systems / Jetson AGX Orin - NVIDIA Developer Forums). In doing so I noticed a peculiar behavior: using CUDA Graphs normally reduces inference latency, but with MPS enabled, using CUDA Graphs actually increased inference latency, which seems counter-intuitive.
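
For context on where the CUDA Graph speed-up comes from: the application captures every kernel launch of the model once and then replays the whole sequence with a single cudaGraphLaunch call, which is roughly what trtexec does when --useCudaGraph is passed. The snippet below is only an illustrative sketch against the TensorRT C++ API, not trtexec's actual implementation; the runWithCudaGraph helper, its parameters, and the omitted engine setup and error handling are assumptions.

#include <cuda_runtime.h>
#include "NvInfer.h"

// Sketch only: capture one TensorRT inference into a CUDA graph and replay it.
// `context` is assumed to be an IExecutionContext whose input/output tensor
// addresses are already bound via setTensorAddress(); error checks are omitted.
void runWithCudaGraph(nvinfer1::IExecutionContext* context, int numIterations) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // One warm-up enqueue outside capture so lazy initialization is not recorded.
    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);

    // Record every kernel launch of the model into a single graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    context->enqueueV3(stream);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 12: cudaGraphInstantiate(&graphExec, graph, 0)

    // Steady state: one cudaGraphLaunch replays the whole model per inference,
    // instead of one launch call per kernel.
    for (int i = 0; i < numIterations; ++i) {
        cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);
    }

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}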

Here are the steps to reproduce the issue. You need an x86 machine that supports NVIDIA MPS and has TensorRT installed.

  1. Pre-compile the TensorRT engine file with the following command.
    /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --saveEngine=resnet.engine --buildOnly
  2. Run the following script to get the latency results of four cases. Each case runs three processes concurrently.
#!/bin/bash
echo "Run trtexec with MPS but without CudaGraph"
export CUDA_VISIBLE_DEVICES=0
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-cuda-mps-control -d
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine | tee log_mps_1.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine | tee log_mps_2.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine | tee log_mps_3.txt
wait  # make sure all three concurrent processes finish before the next case starts

echo "Run trtexec with MPS and CudaGraph"
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph | tee log_mps_cuda_graph_1.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph | tee log_mps_cuda_graph_2.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph | tee log_mps_cuda_graph_3.txt
wait

echo quit | sudo nvidia-cuda-mps-control
sudo nvidia-smi -i 0 -c DEFAULT

echo "Run trtexec without MPS and without CudaGraph"
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine | tee log_no_mps_1.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine | tee log_no_mps_2.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine | tee log_no_mps_3.txt
wait

echo "Run trtexec without MPS but with CudaGraph"
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph | tee log_no_mps_cudagraph_1.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph | tee log_no_mps_cudagraph_2.txt &
/usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph | tee log_no_mps_cudagraph_3.txt
wait
  3. Compare the latency before and after enabling CUDA Graph, with and without MPS respectively.

The data collected during my tests is summarized in the table below.

|  | MPS ✅ & CUDA Graph ✅ | MPS ✅ & CUDA Graph ❌ | MPS ❌ & CUDA Graph ✅ | MPS ❌ & CUDA Graph ❌ |
| --- | --- | --- | --- | --- |
| Mean latency of process 1 | 3.23736 ms | 3.16783 ms | 4.41656 ms | 4.63009 ms |
| Mean latency of process 2 | 3.23528 ms | 3.16744 ms | 4.41148 ms | 4.63431 ms |
| Mean latency of process 3 | 3.21404 ms | 3.07404 ms | 4.40596 ms | 4.6277 ms |

From the table, it is evident that with MPS disabled, enabling CUDA Graph reduced latency as expected. With MPS enabled, however, enabling CUDA Graph slightly increased latency.

Hi,
Can you try running your model with the trtexec command and share the --verbose log if the issue persists?

You can refer to the link below for the list of supported operators; if any operator is not supported, you need to create a custom plugin for that operation.

Also, please share your model and script if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:

Thanks!

Hi,

  1. The issue persists. Below is the log from running with --useCudaGraph and MPS enabled; there appear to be no unsupported operators in the test model.
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # /usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph --verbose
[10/18/2023-10:37:41] [I] === Model Options ===
[10/18/2023-10:37:41] [I] Format: *
[10/18/2023-10:37:41] [I] Model: 
[10/18/2023-10:37:41] [I] Output:
[10/18/2023-10:37:41] [I] === Build Options ===
[10/18/2023-10:37:41] [I] Max batch: 1
[10/18/2023-10:37:41] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/18/2023-10:37:41] [I] minTiming: 1
[10/18/2023-10:37:41] [I] avgTiming: 8
[10/18/2023-10:37:41] [I] Precision: FP32
[10/18/2023-10:37:41] [I] LayerPrecisions: 
[10/18/2023-10:37:41] [I] Layer Device Types: 
[10/18/2023-10:37:41] [I] Calibration: 
[10/18/2023-10:37:41] [I] Refit: Disabled
[10/18/2023-10:37:41] [I] Version Compatible: Disabled
[10/18/2023-10:37:41] [I] TensorRT runtime: full
[10/18/2023-10:37:41] [I] Lean DLL Path: 
[10/18/2023-10:37:41] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[10/18/2023-10:37:41] [I] Exclude Lean Runtime: Disabled
[10/18/2023-10:37:41] [I] Sparsity: Disabled
[10/18/2023-10:37:41] [I] Safe mode: Disabled
[10/18/2023-10:37:41] [I] Build DLA standalone loadable: Disabled
[10/18/2023-10:37:41] [I] Allow GPU fallback for DLA: Disabled
[10/18/2023-10:37:41] [I] DirectIO mode: Disabled
[10/18/2023-10:37:41] [I] Restricted mode: Disabled
[10/18/2023-10:37:41] [I] Skip inference: Disabled
[10/18/2023-10:37:41] [I] Save engine: 
[10/18/2023-10:37:41] [I] Load engine: ./resnet.engine
[10/18/2023-10:37:41] [I] Profiling verbosity: 0
[10/18/2023-10:37:41] [I] Tactic sources: Using default tactic sources
[10/18/2023-10:37:41] [I] timingCacheMode: local
[10/18/2023-10:37:41] [I] timingCacheFile: 
[10/18/2023-10:37:41] [I] Heuristic: Disabled
[10/18/2023-10:37:41] [I] Preview Features: Use default preview flags.
[10/18/2023-10:37:41] [I] MaxAuxStreams: -1
[10/18/2023-10:37:41] [I] BuilderOptimizationLevel: -1
[10/18/2023-10:37:41] [I] Input(s)s format: fp32:CHW
[10/18/2023-10:37:41] [I] Output(s)s format: fp32:CHW
[10/18/2023-10:37:41] [I] Input build shapes: model
[10/18/2023-10:37:41] [I] Input calibration shapes: model
[10/18/2023-10:37:41] [I] === System Options ===
[10/18/2023-10:37:41] [I] Device: 0
[10/18/2023-10:37:41] [I] DLACore: 
[10/18/2023-10:37:41] [I] Plugins:
[10/18/2023-10:37:41] [I] setPluginsToSerialize:
[10/18/2023-10:37:41] [I] dynamicPlugins:
[10/18/2023-10:37:41] [I] ignoreParsedPluginLibs: 0
[10/18/2023-10:37:41] [I] 
[10/18/2023-10:37:41] [I] === Inference Options ===
[10/18/2023-10:37:41] [I] Batch: 1
[10/18/2023-10:37:41] [I] Input inference shapes: model
[10/18/2023-10:37:41] [I] Iterations: 10
[10/18/2023-10:37:41] [I] Duration: 3s (+ 200ms warm up)
[10/18/2023-10:37:41] [I] Sleep time: 0ms
[10/18/2023-10:37:41] [I] Idle time: 0ms
[10/18/2023-10:37:41] [I] Inference Streams: 1
[10/18/2023-10:37:41] [I] ExposeDMA: Disabled
[10/18/2023-10:37:41] [I] Data transfers: Enabled
[10/18/2023-10:37:41] [I] Spin-wait: Disabled
[10/18/2023-10:37:41] [I] Multithreading: Disabled
[10/18/2023-10:37:41] [I] CUDA Graph: Enabled
[10/18/2023-10:37:41] [I] Separate profiling: Disabled
[10/18/2023-10:37:41] [I] Time Deserialize: Disabled
[10/18/2023-10:37:41] [I] Time Refit: Disabled
[10/18/2023-10:37:41] [I] NVTX verbosity: 0
[10/18/2023-10:37:41] [I] Persistent Cache Ratio: 0
[10/18/2023-10:37:41] [I] Inputs:
[10/18/2023-10:37:41] [I] === Reporting Options ===
[10/18/2023-10:37:41] [I] Verbose: Enabled
[10/18/2023-10:37:41] [I] Averages: 10 inferences
[10/18/2023-10:37:41] [I] Percentiles: 90,95,99
[10/18/2023-10:37:41] [I] Dump refittable layers:Disabled
[10/18/2023-10:37:41] [I] Dump output: Disabled
[10/18/2023-10:37:41] [I] Profile: Disabled
[10/18/2023-10:37:41] [I] Export timing to JSON file: 
[10/18/2023-10:37:41] [I] Export output to JSON file: 
[10/18/2023-10:37:41] [I] Export profile to JSON file: 
[10/18/2023-10:37:41] [I] 
[10/18/2023-10:37:41] [I] === Device Information ===
[10/18/2023-10:37:41] [I] Selected Device: NVIDIA GeForce RTX 3070
[10/18/2023-10:37:41] [I] Compute Capability: 8.6
[10/18/2023-10:37:41] [I] SMs: 46
[10/18/2023-10:37:41] [I] Device Global Memory: 7970 MiB
[10/18/2023-10:37:41] [I] Shared Memory per SM: 100 KiB
[10/18/2023-10:37:41] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/18/2023-10:37:41] [I] Application Compute Clock Rate: 1.815 GHz
[10/18/2023-10:37:41] [I] Application Memory Clock Rate: 7.001 GHz
[10/18/2023-10:37:41] [I] 
[10/18/2023-10:37:41] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[10/18/2023-10:37:41] [I] 
[10/18/2023-10:37:41] [I] TensorRT version: 8.6.1
[10/18/2023-10:37:41] [I] Loading standard plugins
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::ModulatedDeformConv2d version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::Proposal version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::Split version 1
[10/18/2023-10:37:41] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[10/18/2023-10:37:41] [I] Engine loaded in 0.0975095 sec.
[10/18/2023-10:37:41] [I] [TRT] Loaded engine size: 100 MiB
[10/18/2023-10:37:41] [V] [TRT] Deserialization required 30352 microseconds.
[10/18/2023-10:37:41] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +98, now: CPU 0, GPU 98 (MiB)
[10/18/2023-10:37:41] [I] Engine deserialized in 0.117426 sec.
[10/18/2023-10:37:41] [V] [TRT] Total per-runner device persistent memory is 6656
[10/18/2023-10:37:41] [V] [TRT] Total per-runner host persistent memory is 341296
[10/18/2023-10:37:41] [V] [TRT] Allocated activation device memory of size 7225344
[10/18/2023-10:37:41] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +7, now: CPU 0, GPU 105 (MiB)
[10/18/2023-10:37:41] [V] [TRT] CUDA lazy loading is enabled.
[10/18/2023-10:37:41] [I] Setting persistentCacheLimit to 0 bytes.
[10/18/2023-10:37:41] [V] Using enqueueV3.
[10/18/2023-10:37:41] [I] Using random values for input gpu_0/data_0
[10/18/2023-10:37:41] [I] Input binding for gpu_0/data_0 with dimensions 1x3x224x224 is created.
[10/18/2023-10:37:41] [I] Output binding for gpu_0/softmax_1 with dimensions 1x1000 is created.
[10/18/2023-10:37:41] [I] Starting inference
[10/18/2023-10:37:44] [I] Warmup completed 45 queries over 200 ms
[10/18/2023-10:37:44] [I] Timing trace has 949 queries over 3.01127 s
[10/18/2023-10:37:44] [I] 
[10/18/2023-10:37:44] [I] === Trace details ===
[10/18/2023-10:37:44] [I] Trace averages of 10 runs:
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.38534 ms - Host latency: 3.42051 ms (enqueue 0.0159241 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.31755 ms - Host latency: 3.35573 ms (enqueue 0.019455 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.32769 ms - Host latency: 3.37976 ms (enqueue 0.0345459 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.27536 ms - Host latency: 3.32983 ms (enqueue 0.0387238 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.14922 ms - Host latency: 3.20332 ms (enqueue 0.0343262 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.30936 ms - Host latency: 3.36407 ms (enqueue 0.0348511 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.08388 ms - Host latency: 3.13969 ms (enqueue 0.0437256 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.27116 ms - Host latency: 3.32559 ms (enqueue 0.0395142 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.26963 ms - Host latency: 3.32234 ms (enqueue 0.0344849 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.24925 ms - Host latency: 3.30248 ms (enqueue 0.0341461 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.30568 ms - Host latency: 3.34762 ms (enqueue 0.0234436 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.13865 ms - Host latency: 3.17893 ms (enqueue 0.0260559 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.18228 ms - Host latency: 3.21661 ms (enqueue 0.00500488 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11603 ms - Host latency: 3.16116 ms (enqueue 0.0237976 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.29492 ms - Host latency: 3.35035 ms (enqueue 0.0371338 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.15721 ms - Host latency: 3.21032 ms (enqueue 0.0352173 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.10867 ms - Host latency: 3.15661 ms (enqueue 0.0263855 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.22579 ms - Host latency: 3.27024 ms (enqueue 0.0250488 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.25582 ms - Host latency: 3.29649 ms (enqueue 0.0195312 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.00176 ms - Host latency: 3.05145 ms (enqueue 0.0329224 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.19447 ms - Host latency: 3.24615 ms (enqueue 0.0350159 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.14891 ms - Host latency: 3.20087 ms (enqueue 0.0347656 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.18823 ms - Host latency: 3.23987 ms (enqueue 0.03479 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.14594 ms - Host latency: 3.19772 ms (enqueue 0.0355652 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11348 ms - Host latency: 3.16763 ms (enqueue 0.0348267 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.10956 ms - Host latency: 3.16118 ms (enqueue 0.0346863 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.25875 ms - Host latency: 3.30996 ms (enqueue 0.0347168 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.15363 ms - Host latency: 3.20421 ms (enqueue 0.036084 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11829 ms - Host latency: 3.16501 ms (enqueue 0.0284546 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.10044 ms - Host latency: 3.13716 ms (enqueue 0.0109741 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11215 ms - Host latency: 3.16125 ms (enqueue 0.0297607 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.26449 ms - Host latency: 3.31696 ms (enqueue 0.0349365 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.2004 ms - Host latency: 3.2532 ms (enqueue 0.0359131 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.05183 ms - Host latency: 3.10415 ms (enqueue 0.035498 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.14993 ms - Host latency: 3.20262 ms (enqueue 0.0348389 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.19109 ms - Host latency: 3.24485 ms (enqueue 0.0347046 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.20216 ms - Host latency: 3.2545 ms (enqueue 0.0345093 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.07948 ms - Host latency: 3.13358 ms (enqueue 0.0352173 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.04823 ms - Host latency: 3.09904 ms (enqueue 0.0346924 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.26766 ms - Host latency: 3.31926 ms (enqueue 0.0346069 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.17325 ms - Host latency: 3.22472 ms (enqueue 0.0343384 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.30436 ms - Host latency: 3.35972 ms (enqueue 0.0348999 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.19774 ms - Host latency: 3.2494 ms (enqueue 0.0352661 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.12941 ms - Host latency: 3.18009 ms (enqueue 0.0343994 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11346 ms - Host latency: 3.16705 ms (enqueue 0.0355957 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.14379 ms - Host latency: 3.19878 ms (enqueue 0.035498 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.04998 ms - Host latency: 3.08921 ms (enqueue 0.0143799 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.12054 ms - Host latency: 3.15658 ms (enqueue 0.0101929 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.07766 ms - Host latency: 3.12703 ms (enqueue 0.0310791 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.10438 ms - Host latency: 3.15458 ms (enqueue 0.0347412 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.21853 ms - Host latency: 3.27021 ms (enqueue 0.0355957 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.0587 ms - Host latency: 3.11307 ms (enqueue 0.0297241 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.21034 ms - Host latency: 3.25096 ms (enqueue 0.0249634 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.13027 ms - Host latency: 3.17253 ms (enqueue 0.024353 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.16071 ms - Host latency: 3.21141 ms (enqueue 0.0312378 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.21045 ms - Host latency: 3.26603 ms (enqueue 0.0342773 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 2.98785 ms - Host latency: 3.03947 ms (enqueue 0.0351685 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.14626 ms - Host latency: 3.19873 ms (enqueue 0.0343262 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.06135 ms - Host latency: 3.1147 ms (enqueue 0.0342407 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.18064 ms - Host latency: 3.2334 ms (enqueue 0.0344727 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.1447 ms - Host latency: 3.19924 ms (enqueue 0.0344238 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.04973 ms - Host latency: 3.10295 ms (enqueue 0.0340332 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.21558 ms - Host latency: 3.26929 ms (enqueue 0.0345947 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11433 ms - Host latency: 3.1688 ms (enqueue 0.0327148 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.06506 ms - Host latency: 3.10103 ms (enqueue 0.00969238 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.12603 ms - Host latency: 3.17283 ms (enqueue 0.0217285 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.18435 ms - Host latency: 3.23345 ms (enqueue 0.0345947 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.09822 ms - Host latency: 3.13298 ms (enqueue 0.0163818 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.1532 ms - Host latency: 3.20513 ms (enqueue 0.0351807 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.2179 ms - Host latency: 3.27122 ms (enqueue 0.0353516 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.21426 ms - Host latency: 3.26541 ms (enqueue 0.0347412 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.17522 ms - Host latency: 3.22646 ms (enqueue 0.0347656 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.29526 ms - Host latency: 3.34583 ms (enqueue 0.0348633 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.1106 ms - Host latency: 3.16504 ms (enqueue 0.0357178 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.1446 ms - Host latency: 3.19636 ms (enqueue 0.0362061 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.32976 ms - Host latency: 3.38081 ms (enqueue 0.0346191 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.28567 ms - Host latency: 3.33757 ms (enqueue 0.0347412 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.11675 ms - Host latency: 3.16899 ms (enqueue 0.0341064 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.13113 ms - Host latency: 3.18379 ms (enqueue 0.0345215 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.15481 ms - Host latency: 3.20852 ms (enqueue 0.0340332 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.05974 ms - Host latency: 3.11086 ms (enqueue 0.0325684 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.10681 ms - Host latency: 3.1408 ms (enqueue 0.0095459 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.23757 ms - Host latency: 3.27566 ms (enqueue 0.0116455 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.13401 ms - Host latency: 3.17573 ms (enqueue 0.0232422 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.28965 ms - Host latency: 3.34226 ms (enqueue 0.034375 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.17151 ms - Host latency: 3.22546 ms (enqueue 0.0343506 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.09888 ms - Host latency: 3.14211 ms (enqueue 0.0217041 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.00835 ms - Host latency: 3.04468 ms (enqueue 0.0174316 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.12939 ms - Host latency: 3.17878 ms (enqueue 0.0333984 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.04224 ms - Host latency: 3.09573 ms (enqueue 0.0349121 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.33564 ms - Host latency: 3.38826 ms (enqueue 0.0339355 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.0915 ms - Host latency: 3.14399 ms (enqueue 0.0343994 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.19629 ms - Host latency: 3.24934 ms (enqueue 0.0343506 ms)
[10/18/2023-10:37:44] [I] Average on 10 runs - GPU latency: 3.22124 ms - Host latency: 3.27395 ms (enqueue 0.0345947 ms)
[10/18/2023-10:37:44] [I] 
[10/18/2023-10:37:44] [I] === Performance summary ===
[10/18/2023-10:37:44] [I] Throughput: 315.15 qps
[10/18/2023-10:37:44] [I] Latency: min = 2.52075 ms, max = 4.0654 ms, mean = 3.2168 ms, median = 3.2207 ms, percentile(90%) = 3.5321 ms, percentile(95%) = 3.63696 ms, percentile(99%) = 3.81171 ms
[10/18/2023-10:37:44] [I] Enqueue Time: min = 0.003479 ms, max = 0.0953979 ms, mean = 0.0308373 ms, median = 0.0344238 ms, percentile(90%) = 0.0356445 ms, percentile(95%) = 0.0369263 ms, percentile(99%) = 0.0440979 ms
[10/18/2023-10:37:44] [I] H2D Latency: min = 0.0285645 ms, max = 0.129883 ms, mean = 0.0450907 ms, median = 0.0473633 ms, percentile(90%) = 0.0495605 ms, percentile(95%) = 0.0517273 ms, percentile(99%) = 0.0692139 ms
[10/18/2023-10:37:44] [I] GPU Compute Time: min = 2.48022 ms, max = 4.02945 ms, mean = 3.16732 ms, median = 3.16931 ms, percentile(90%) = 3.48267 ms, percentile(95%) = 3.58502 ms, percentile(99%) = 3.7724 ms
[10/18/2023-10:37:44] [I] D2H Latency: min = 0.00268555 ms, max = 0.0164795 ms, mean = 0.00439466 ms, median = 0.00415039 ms, percentile(90%) = 0.00518799 ms, percentile(95%) = 0.00567627 ms, percentile(99%) = 0.0107422 ms
[10/18/2023-10:37:44] [I] Total Host Walltime: 3.01127 s
[10/18/2023-10:37:44] [I] Total GPU Compute Time: 3.00578 s
[10/18/2023-10:37:44] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/18/2023-10:37:44] [V] 
[10/18/2023-10:37:44] [V] === Explanations of the performance metrics ===
[10/18/2023-10:37:44] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[10/18/2023-10:37:44] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[10/18/2023-10:37:44] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[10/18/2023-10:37:44] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[10/18/2023-10:37:44] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[10/18/2023-10:37:44] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[10/18/2023-10:37:44] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[10/18/2023-10:37:44] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[10/18/2023-10:37:44] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # /usr/src/tensorrt/bin/trtexec --loadEngine=./resnet.engine --useCudaGraph --verbose
  2. My test model is the one shipped with TensorRT: /TensorRT-Installation-Directory/data/resnet50/ResNet50.onnx

Thanks.

Hi @zhi_xz,
Apologies for the delay. Can you please share your ONNX model and repro script?
Thanks