TensorRT engine generation from an ONNX model gives different results on a PC and on a Jetson Xavier NX

Description

Same model, but different platforms and software SDKs (NVIDIA libraries and Python packages; as far as I understand, these differences should not matter).
On the PC the TRT engine is generated successfully; on the Jetson NX it is not.

Environment

PC:
TensorRT Version: 8.4.1.5
GPU Type: Quadro T2000
Nvidia Driver Version: R471.68 (r471_59-5) / 30.0.14.7168 (8-5-2021)
CUDA Version: 11.4
CUDNN Version: 8.1.1
Operating System + Version: Windows 10
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): NA
Baremetal or Container (if container which image + tag): Baremetal

Jetson XAVIER NX:
TensorRT Version: 8.4.0.11
GPU Type: Volta
Jetpack: 5.0.1

Relevant Files

model2_folded.onnx (14.4 KB)
trtReportModel2.txt (238.0 KB)

Steps To Reproduce

Try to build the engine for the onnx that is attached here.

The error is-
10: [optimizer.cpp::computeCosts::3826] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[55…Div_261]}.)

When I build an engine for this ONNX model on my PC, the build succeeds. When I try to build the engine for the same model on the Jetson NX, I get the error above.

Hi,
Please share the ONNX model and the script, if you have not already, so that we can assist you better.
In the meantime, you can try a few things:

  1. Validate your model with the snippet below.

check_model.py

import onnx

filename = "yourONNXmodel"  # replace with the path to your ONNX model
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid

  2. Try running your model with the trtexec command.

If you are still facing the issue, please share the trtexec --verbose log for further debugging.
Thanks!

Hi,

Yes, we could build TRT engine successfully on PC.
We are moving this post to Jetson NX forum to get better help.

Thank you.

The ONNX model and the verbose report are attached.
I also checked the model as you suggested.

I tried to build the engine using trtexec, but the engine I got does not behave as a dynamic model.
For example, the verbose log from my own build (before the error) shows the following:

*************** Autotuning format combination: Float(E0,2,1), Int32(1), Float(2,1), Int32(1), Int32(1), Int32(1), Float(128,1), Float(128,1), Float(128,1), Float(128,1), Int32(), Int32(), Int32() → Float(E0,2,1), Float((* 128 (# 1 (SHAPE keypoints))),128,1) where E0=(* 2 (# 1 (SHAPE keypoints))) ***************

But when I run trtexec, I see the following instead:

*************** Autotuning format combination: Float(1,1,1), Float(2,2,2,2,1), Float(1,1,1,1) → Float(2,2,1), Float(1,1,1) ***************

So it does not help me solve the problem.

Hi, the issue hasn’t been resolved yet

JetPack 4.6.1 docker can build your onnx model correctly.

on Jetson terminal

git clone https://github.com/naisy/docker
cd docker
sudo su
./run-jetson-jp461-base.sh

web browser

http://your_jetson_ip:8888
password: jupyter
  • upload your onnx model to jupyterlab
  • launch jupyterlab terminal

on jupyterlab terminal

trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine

I am not using jetpack 5.0.1 yet.
If you need JetPack 5.0.1, please check carefully how to install ONNX related packages.

Hi,

We tested your model on TensorRT 8.4.1 (JetPack 5 GA).
It runs successfully without issue.

Please wait for our announcement for the release.
Thanks.

I built an engine using trtexec, but it does not help me because its default parameters do not meet my requirements.
My parameters are:
common:
input sizes:
keypoints- min: [1,1,2] opt: [1,2565,2] max: [1,8000,2]
scores.1- min: [1,1] opt: [1,2565] max: [1,8000]
score_map- min: [1,320,256] opt: [1,320,256] max: [1,760,690]
dense_feat_map- min: [1,128,80,64] opt: [1,128,80,64] max: [1,128,190,173]

output: “console”

Onnx:
modelCheck:true
executionProviders:[“CUDAExecutionProvider”, “CPUExecutionProvider”]
prioritizedProvider:“CUDAExecutionProvider”

Trt:
modelSupport:true
modelParse:true
modelOpt:true
modelOptSerialize:true
modelOptDeSerialize:true
maxBatchSize:100
maxWorkspaceSize: 3221225472
precision:fp32

Perhaps you know the correct way to pass these to trtexec and can try again with these parameters.
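
For anyone following along, the ranges above can be turned into trtexec flags mechanically. This is only a sketch: the helper names (shape_flag, trtexec_args) are made up for illustration, and score_map is kept three-dimensional exactly as listed above, so adjust the rank if the model declares something different. trtexec takes the workspace size in MiB, so the 3221225472 bytes above become 3072.

```python
# Shape ranges copied from the post above; keys are the ONNX input names.
PROFILE = {
    "keypoints": {"min": (1, 1, 2), "opt": (1, 2565, 2), "max": (1, 8000, 2)},
    "scores.1": {"min": (1, 1), "opt": (1, 2565), "max": (1, 8000)},
    "score_map": {"min": (1, 320, 256), "opt": (1, 320, 256), "max": (1, 760, 690)},
    "dense_feat_map": {
        "min": (1, 128, 80, 64),
        "opt": (1, 128, 80, 64),
        "max": (1, 128, 190, 173),
    },
}
MAX_WORKSPACE_BYTES = 3221225472  # maxWorkspaceSize from the post


def shape_flag(kind, profile):
    """Build one of --minShapes / --optShapes / --maxShapes (hypothetical helper)."""
    spec = ",".join(
        f"{name}:{'x'.join(str(d) for d in ranges[kind])}"
        for name, ranges in profile.items()
    )
    return f"--{kind}Shapes={spec}"


def trtexec_args(onnx_path, engine_path, profile, workspace_bytes):
    """Assemble a trtexec argument list; the workspace is converted to MiB."""
    workspace_mib = workspace_bytes // (1024 * 1024)
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        shape_flag("min", profile),
        shape_flag("opt", profile),
        shape_flag("max", profile),
        f"--workspace={workspace_mib}",
    ]


if __name__ == "__main__":
    print(" ".join(trtexec_args("model2_folded.onnx", "model2_folded.engine",
                                PROFILE, MAX_WORKSPACE_BYTES)))
```

Without the three shape flags, trtexec has no range to build for, which would be consistent with the fixed-size autotuning line quoted earlier.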

I listed my execution parameters in the previous comment. Do you know whether it should work the same way with them as in your test?
I am afraid the result will be the same as with plain trtexec (which uses its default parameters and creates a wrong engine).

In addition, can you tell me the approximate release date?
Will I need to update the whole JetPack, or can I update just the TensorRT version on my current JetPack 5?

Thank you both for your response

Hi,

The JetPack 5 GA should be available this month but is slightly delayed.

Would you mind sharing more details about the parameter you mentioned here?
Do you need a specific input dimension?

Here is the output for TensorRT for your reference (with random input):

...
[07/29/2022-10:50:22] [I] === Performance summary ===
[07/29/2022-10:50:22] [I] Throughput: 8550.86 qps
[07/29/2022-10:50:22] [I] Latency: min = 0.0963745 ms, max = 0.557739 ms, mean = 0.111848 ms, median = 0.108337 ms, percentile(99%) = 0.128326 ms
[07/29/2022-10:50:22] [I] Enqueue Time: min = 0.06604 ms, max = 0.250122 ms, mean = 0.0696324 ms, median = 0.0688477 ms, percentile(99%) = 0.0869141 ms
[07/29/2022-10:50:22] [I] H2D Latency: min = 0.010437 ms, max = 0.231018 ms, mean = 0.0145556 ms, median = 0.0144043 ms, percentile(99%) = 0.017334 ms
[07/29/2022-10:50:22] [I] GPU Compute Time: min = 0.0732422 ms, max = 0.536133 ms, mean = 0.0893532 ms, median = 0.0859375 ms, percentile(99%) = 0.104614 ms
[07/29/2022-10:50:22] [I] D2H Latency: min = 0.00439453 ms, max = 0.131958 ms, mean = 0.00793955 ms, median = 0.0078125 ms, percentile(99%) = 0.0131836 ms
[07/29/2022-10:50:22] [I] Total Host Walltime: 3.00028 s
[07/29/2022-10:50:22] [I] Total GPU Compute Time: 2.29236 s
[07/29/2022-10:50:22] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/29/2022-10:50:22] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/29/2022-10:50:22] [W] * GPU compute time is unstable, with coefficient of variance = 8.29879%.
[07/29/2022-10:50:22] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/29/2022-10:50:22] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/29/2022-10:50:22] [I]
[07/29/2022-10:50:22] [I] Output Tensors:
[07/29/2022-10:50:22] [I] kpts: (1x1x2)
[07/29/2022-10:50:22] [I] 0 0
[07/29/2022-10:50:22] [I] descs: (1x1x1)
[07/29/2022-10:50:22] [I] -1
[07/29/2022-10:50:22] [I] scores: (1x1)
[07/29/2022-10:50:22] [I] -0.999984
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=model2_folded.onnx --dumpOutput

Thanks.

This input size seems too large.
You can increase the workspace limit, but it still seems to be insufficient.

For example, this will result in an error.

input sizes:
keypoints- min: [1,1,2] opt: [1,2565,2] max: [1,8000,2]
scores.1- min: [1,1] opt: [1,2565] max: [1,8000]
score_map- min: [1,320,256] opt: [1,320,256] max: [1,760,690]
dense_feat_map- : [1,1,1,1]

trtexec --onnx=model2_folded.onnx  --saveEngine=model2_folded.engine --minShapes="keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256" --optShapes="keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256" --maxShapes="keypoints:1x8000x2,scores.1:1x8000,score_map:1x1x760x690"

It passes with the workspace option (workspace: 1000 MiB).

trtexec --onnx=model2_folded.onnx  --saveEngine=model2_folded.engine --minShapes="keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256" --optShapes="keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256" --maxShapes="keypoints:1x8000x2,scores.1:1x8000,score_map:1x1x760x690" --workspace=1000

It will pass even if you reduce input size.

trtexec --onnx=model2_folded.onnx  --saveEngine=model2_folded.engine --minShapes="keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256" --optShapes="keypoints:1x256x2,scores.1:1x256,score_map:1x1x320x256" --maxShapes="keypoints:1x800x2,scores.1:1x800,score_map:1x1x320x256"

But you will also need a dense_feat_map.
I tried AGX Xavier 32GB JetPack 4.6.1 with workspace specified as 30000 (MiB), but CUDA may have an 8GB limit.

trtexec --onnx=model2_folded.onnx  --saveEngine=model2_folded.engine --minShapes="keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64" --optShapes="keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64" --maxShapes="keypoints:1x8000x2,scores.1:1x8000,score_map:1x1x320x256,dense_feat_map:1x128x190x173" --workspace=30000

logs

&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x8000x2,scores.1:1x8000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000
[07/29/2022-12:17:34] [I] === Model Options ===
[07/29/2022-12:17:34] [I] Format: ONNX
[07/29/2022-12:17:34] [I] Model: model2_folded.onnx
[07/29/2022-12:17:34] [I] Output:
[07/29/2022-12:17:34] [I] === Build Options ===
[07/29/2022-12:17:34] [I] Max batch: explicit batch
[07/29/2022-12:17:34] [I] Workspace: 30000 MiB
[07/29/2022-12:17:34] [I] minTiming: 1
[07/29/2022-12:17:34] [I] avgTiming: 8
[07/29/2022-12:17:34] [I] Precision: FP32
[07/29/2022-12:17:34] [I] Calibration: 
[07/29/2022-12:17:34] [I] Refit: Disabled
[07/29/2022-12:17:34] [I] Sparsity: Disabled
[07/29/2022-12:17:34] [I] Safe mode: Disabled
[07/29/2022-12:17:34] [I] DirectIO mode: Disabled
[07/29/2022-12:17:34] [I] Restricted mode: Disabled
[07/29/2022-12:17:34] [I] Save engine: model2_folded.engine
[07/29/2022-12:17:34] [I] Load engine: 
[07/29/2022-12:17:34] [I] Profiling verbosity: 0
[07/29/2022-12:17:34] [I] Tactic sources: Using default tactic sources
[07/29/2022-12:17:34] [I] timingCacheMode: local
[07/29/2022-12:17:34] [I] timingCacheFile: 
[07/29/2022-12:17:34] [I] Input(s)s format: fp32:CHW
[07/29/2022-12:17:34] [I] Output(s)s format: fp32:CHW
[07/29/2022-12:17:34] [I] Input build shape: score_map=1x1x320x256+1x1x320x256+1x1x320x256
[07/29/2022-12:17:34] [I] Input build shape: dense_feat_map=1x128x80x64+1x128x80x64+1x128x190x173
[07/29/2022-12:17:34] [I] Input build shape: keypoints=1x1x2+1x2565x2+1x8000x2
[07/29/2022-12:17:34] [I] Input build shape: scores.1=1x1+1x2565+1x8000
[07/29/2022-12:17:34] [I] Input calibration shapes: model
[07/29/2022-12:17:34] [I] === System Options ===
[07/29/2022-12:17:34] [I] Device: 0
[07/29/2022-12:17:34] [I] DLACore: 
[07/29/2022-12:17:34] [I] Plugins:
[07/29/2022-12:17:34] [I] === Inference Options ===
[07/29/2022-12:17:34] [I] Batch: Explicit
[07/29/2022-12:17:34] [I] Input inference shape: score_map=1x1x320x256
[07/29/2022-12:17:34] [I] Input inference shape: scores.1=1x2565
[07/29/2022-12:17:34] [I] Input inference shape: keypoints=1x2565x2
[07/29/2022-12:17:34] [I] Input inference shape: dense_feat_map=1x128x80x64
[07/29/2022-12:17:34] [I] Iterations: 10
[07/29/2022-12:17:34] [I] Duration: 3s (+ 200ms warm up)
[07/29/2022-12:17:34] [I] Sleep time: 0ms
[07/29/2022-12:17:34] [I] Idle time: 0ms
[07/29/2022-12:17:34] [I] Streams: 1
[07/29/2022-12:17:34] [I] ExposeDMA: Disabled
[07/29/2022-12:17:34] [I] Data transfers: Enabled
[07/29/2022-12:17:34] [I] Spin-wait: Disabled
[07/29/2022-12:17:34] [I] Multithreading: Disabled
[07/29/2022-12:17:34] [I] CUDA Graph: Disabled
[07/29/2022-12:17:34] [I] Separate profiling: Disabled
[07/29/2022-12:17:34] [I] Time Deserialize: Disabled
[07/29/2022-12:17:34] [I] Time Refit: Disabled
[07/29/2022-12:17:34] [I] Skip inference: Disabled
[07/29/2022-12:17:34] [I] Inputs:
[07/29/2022-12:17:34] [I] === Reporting Options ===
[07/29/2022-12:17:34] [I] Verbose: Disabled
[07/29/2022-12:17:34] [I] Averages: 10 inferences
[07/29/2022-12:17:34] [I] Percentile: 99
[07/29/2022-12:17:34] [I] Dump refittable layers:Disabled
[07/29/2022-12:17:34] [I] Dump output: Disabled
[07/29/2022-12:17:34] [I] Profile: Disabled
[07/29/2022-12:17:34] [I] Export timing to JSON file: 
[07/29/2022-12:17:34] [I] Export output to JSON file: 
[07/29/2022-12:17:34] [I] Export profile to JSON file: 
[07/29/2022-12:17:34] [I] 
[07/29/2022-12:17:34] [I] === Device Information ===
[07/29/2022-12:17:34] [I] Selected Device: Xavier
[07/29/2022-12:17:34] [I] Compute Capability: 7.2
[07/29/2022-12:17:34] [I] SMs: 8
[07/29/2022-12:17:34] [I] Compute Clock Rate: 1.377 GHz
[07/29/2022-12:17:34] [I] Device Global Memory: 31928 MiB
[07/29/2022-12:17:34] [I] Shared Memory per SM: 96 KiB
[07/29/2022-12:17:34] [I] Memory Bus Width: 256 bits (ECC disabled)
[07/29/2022-12:17:34] [I] Memory Clock Rate: 1.377 GHz
[07/29/2022-12:17:34] [I] 
[07/29/2022-12:17:34] [I] TensorRT version: 8.2.1
[07/29/2022-12:17:35] [I] [TRT] [MemUsageChange] Init CUDA: CPU +362, GPU +0, now: CPU 381, GPU 8226 (MiB)
[07/29/2022-12:17:36] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 8226 MiB
[07/29/2022-12:17:36] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 8332 MiB
[07/29/2022-12:17:36] [I] Start parsing network model
[07/29/2022-12:17:36] [I] [TRT] ----------------------------------------------------------------
[07/29/2022-12:17:36] [I] [TRT] Input filename:   model2_folded.onnx
[07/29/2022-12:17:36] [I] [TRT] ONNX IR version:  0.0.7
[07/29/2022-12:17:36] [I] [TRT] Opset version:    13
[07/29/2022-12:17:36] [I] [TRT] Producer name:    
[07/29/2022-12:17:36] [I] [TRT] Producer version: 
[07/29/2022-12:17:36] [I] [TRT] Domain:           
[07/29/2022-12:17:36] [I] [TRT] Model version:    0
[07/29/2022-12:17:36] [I] [TRT] Doc string:       
[07/29/2022-12:17:36] [I] [TRT] ----------------------------------------------------------------
[07/29/2022-12:17:36] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/29/2022-12:17:36] [W] [TRT] Output type must be INT32 for shape outputs
[07/29/2022-12:17:36] [W] [TRT] Output type must be INT32 for shape outputs
[07/29/2022-12:17:36] [W] [TRT] Output type must be INT32 for shape outputs
[07/29/2022-12:17:36] [W] [TRT] Output type must be INT32 for shape outputs
[07/29/2022-12:17:36] [I] Finish parsing network model
[07/29/2022-12:17:36] [W] [TRT] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[07/29/2022-12:17:36] [I] [TRT] ---------- Layers Running on DLA ----------
[07/29/2022-12:17:36] [I] [TRT] ---------- Layers Running on GPU ----------
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Conv_10
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Conv_8
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Conv_6
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Conv_4
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Conv_2
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] {ForeignNode[19 + (Unnamed Layer* 29) [Shuffle]...Concat_45]}
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Transpose_242 + Flatten_243
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] Transpose_91
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] [HostToDeviceCopy]
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] PWN(Clip_46)
[07/29/2022-12:17:36] [I] [TRT] [GpuLayer] {ForeignNode[55...Div_261]}
[07/29/2022-12:17:37] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +233, now: CPU 713, GPU 8566 (MiB)
[07/29/2022-12:17:37] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +308, GPU +308, now: CPU 1021, GPU 8874 (MiB)
[07/29/2022-12:17:37] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/29/2022-12:18:04] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[07/29/2022-12:18:04] [W] [TRT]  (# 3 (SHAPE dense_feat_map))
[07/29/2022-12:18:04] [W] [TRT]  (# 1 (SHAPE keypoints))
[07/29/2022-12:18:04] [W] [TRT]  (# 2 (SHAPE dense_feat_map))
[07/29/2022-12:18:09] [W] [TRT] Skipping tactic 0 due to Myelin error: autotuning: CUDA error 2 allocating 0-byte buffer: out of memory
[07/29/2022-12:18:09] [E] Error[10]: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[55...Div_261]}.)
[07/29/2022-12:18:09] [E] Error[2]: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
[07/29/2022-12:18:09] [E] Engine could not be created from network
[07/29/2022-12:18:09] [E] Building engine failed
[07/29/2022-12:18:09] [E] Failed to create engine from model.
[07/29/2022-12:18:09] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x8000x2,scores.1:1x8000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000

How about reducing the input size?
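
When comparing logs like the one above between machines, it can help to pull out just the warning and error lines. A minimal sketch, assuming the "[timestamp] [level] message" layout seen in these trtexec logs; the function name extract_messages is hypothetical, and the sample lines are copied from the log above.

```python
import re

# Matches lines such as "[07/29/2022-12:18:09] [E] Engine could not be created".
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z])\] ?(?P<msg>.*)$")


def extract_messages(log_text, levels=("E",)):
    """Return (level, message) pairs for log lines whose level is in `levels`."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in levels:
            found.append((match.group("level"), match.group("msg")))
    return found


SAMPLE = """\
[07/29/2022-12:17:36] [I] Finish parsing network model
[07/29/2022-12:18:09] [W] [TRT] Skipping tactic 0 due to Myelin error: autotuning: CUDA error 2 allocating 0-byte buffer: out of memory
[07/29/2022-12:18:09] [E] Error[10]: [optimizer.cpp::computeCosts::2011] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[55...Div_261]}.)
[07/29/2022-12:18:09] [E] Engine could not be created from network
"""

if __name__ == "__main__":
    for level, msg in extract_messages(SAMPLE, levels=("W", "E")):
        print(level, msg)
```

In this log the only warning immediately before the failure is the Myelin out-of-memory message, which is why the suggestion is to reduce the input size.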

@AastaLLL :
The model is dynamic, so I need to create the engine with a range of input sizes.
The range for each input is:
keypoints- min: [1,1,2] opt: [1,2565,2] max: [1,8000,2]
scores.1- min: [1,1] opt: [1,2565] max: [1,8000]
score_map- min: [1,320,256] opt: [1,320,256] max: [1,760,690]
dense_feat_map- min: [1,128,80,64] opt: [1,128,80,64] max: [1,128,190,173]

From your log, the input sizes do not appear to be dynamic: the output sizes should vary as a function of the input sizes.
For example, the kpts output size should be the same as the keypoints input size.

I used the parser, and after that I built the engine with build_engine (using model optimize). Could you try to reproduce that?
I see that @naisy succeeded in reproducing the problem. I am trying to follow what he wrote, but it is not really clear to me that this is a memory problem. I succeeded in building the engine as-is on my PC (T2000), so I don't understand why trtexec also crashes on his GPU (with 32 GB of RAM).
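
To make the parser-then-build flow easier to reproduce, here is a rough sketch of how the ranges above could be attached to an optimization profile with the TensorRT Python API. It is not my actual tool: the names PROFILE, check_profile and build_engine are illustrative, and the TensorRT import is done lazily so the shape check can run on a machine without TensorRT. The hasattr branch is an assumption about covering both newer (8.4) and older builder configs.

```python
# Per-input (min, opt, max) ranges as listed above.
PROFILE = {
    "keypoints": ((1, 1, 2), (1, 2565, 2), (1, 8000, 2)),
    "scores.1": ((1, 1), (1, 2565), (1, 8000)),
    "score_map": ((1, 320, 256), (1, 320, 256), (1, 760, 690)),
    "dense_feat_map": ((1, 128, 80, 64), (1, 128, 80, 64), (1, 128, 190, 173)),
}


def check_profile(profile):
    """Raise ValueError unless min/opt/max share a rank and min <= opt <= max."""
    for name, (lo, opt, hi) in profile.items():
        if not (len(lo) == len(opt) == len(hi)):
            raise ValueError(f"{name}: min/opt/max ranks differ")
        for a, b, c in zip(lo, opt, hi):
            if not (a <= b <= c):
                raise ValueError(f"{name}: need min <= opt <= max in every dimension")


def build_engine(onnx_path, profile, workspace_bytes):
    """Parse the ONNX file and build a serialized engine with one dynamic profile."""
    import tensorrt as trt  # lazy import: only needed when actually building

    check_profile(profile)
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [parser.get_error(i).desc() for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    if hasattr(config, "set_memory_pool_limit"):  # TensorRT 8.4 and later
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)
    else:  # older releases
        config.max_workspace_size = workspace_bytes

    opt_profile = builder.create_optimization_profile()
    for name, (lo, opt, hi) in profile.items():
        opt_profile.set_shape(name, lo, opt, hi)
    config.add_optimization_profile(opt_profile)

    # Returns None if the build fails (errors are reported through the logger).
    return builder.build_serialized_network(network, config)
```

A call such as build_engine("model2_folded.onnx", PROFILE, 3221225472) would be the equivalent of the settings listed earlier in the thread.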

I don’t understand how the build can work on my PC with a T2000 GPU but fail on your AGX.
In addition, I don’t understand the limit you mentioned. If CUDA really has this limit, how are we supposed to use GPUs with more than 8 GB of RAM?
And even so, if the build exceeds this limit, why does it work on my T2000 GPU?
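
As a rough sanity check on the "input size too large" idea, the input buffers themselves can be sized by hand. This sketch assumes all four inputs are FP32 and uses the maximum shapes from my profile; it counts only the input buffers, not intermediate tensors or tactic scratch space, so it cannot say where the out-of-memory error really comes from.

```python
from functools import reduce
from operator import mul

# Maximum shapes from the optimization profile in this thread.
MAX_SHAPES = {
    "keypoints": (1, 8000, 2),
    "scores.1": (1, 8000),
    "score_map": (1, 760, 690),
    "dense_feat_map": (1, 128, 190, 173),
}
BYTES_PER_FLOAT32 = 4  # assumption: every input is FP32


def buffer_bytes(shape):
    """Size in bytes of a dense FP32 tensor with the given shape."""
    return reduce(mul, shape, 1) * BYTES_PER_FLOAT32


sizes = {name: buffer_bytes(shape) for name, shape in MAX_SHAPES.items()}
total = sum(sizes.values())

if __name__ == "__main__":
    for name, size in sizes.items():
        print(f"{name}: {size} bytes")
    print(f"total: {total} bytes ({total / 2**20:.2f} MiB)")
```

The largest input, dense_feat_map, is 16829440 bytes at its maximum shape, and all four inputs together are 19023040 bytes (about 18 MiB).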

I ran the trtexec executable from the TensorRT 8.4 GA release, with the same command you used, on my PC with a 4 GB GPU (T2000):

[07/31/2022-10:07:36] [W] --workspace flag has been deprecated by --memPoolSize flag.
[07/31/2022-10:07:36] [I] === Model Options ===
[07/31/2022-10:07:36] [I] Format: ONNX
[07/31/2022-10:07:36] [I] Model: C:/projects/Dleware/Testers/Playground/Models/ASLtorch/Delivery/model2_folded.onnx
[07/31/2022-10:07:36] [I] Output:
[07/31/2022-10:07:36] [I] === Build Options ===
[07/31/2022-10:07:36] [I] Max batch: explicit batch
[07/31/2022-10:07:36] [I] Memory Pools: workspace: 30000 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/31/2022-10:07:36] [I] minTiming: 1
[07/31/2022-10:07:36] [I] avgTiming: 8
[07/31/2022-10:07:36] [I] Precision: FP32
[07/31/2022-10:07:36] [I] LayerPrecisions:
[07/31/2022-10:07:36] [I] Calibration:
[07/31/2022-10:07:36] [I] Refit: Disabled
[07/31/2022-10:07:36] [I] Sparsity: Disabled
[07/31/2022-10:07:36] [I] Safe mode: Disabled
[07/31/2022-10:07:36] [I] DirectIO mode: Disabled
[07/31/2022-10:07:36] [I] Restricted mode: Disabled
[07/31/2022-10:07:36] [I] Build only: Disabled
[07/31/2022-10:07:36] [I] Save engine: model2_folded.engine
[07/31/2022-10:07:36] [I] Load engine:
[07/31/2022-10:07:36] [I] Profiling verbosity: 0
[07/31/2022-10:07:36] [I] Tactic sources: Using default tactic sources
[07/31/2022-10:07:36] [I] timingCacheMode: local
[07/31/2022-10:07:36] [I] timingCacheFile:
[07/31/2022-10:07:36] [I] Input(s)s format: fp32:CHW
[07/31/2022-10:07:36] [I] Output(s)s format: fp32:CHW
[07/31/2022-10:07:36] [I] Input build shape: keypoints=1x1x2+1x2565x2+1x8000x2
[07/31/2022-10:07:36] [I] Input build shape: scores.1=1x1+1x2565+1x8000
[07/31/2022-10:07:36] [I] Input build shape: score_map=1x1x320x256+1x1x320x256+1x1x320x256
[07/31/2022-10:07:36] [I] Input build shape: dense_feat_map=1x128x80x64+1x128x80x64+1x128x190x173
[07/31/2022-10:07:36] [I] Input calibration shapes: model
[07/31/2022-10:07:36] [I] === System Options ===
[07/31/2022-10:07:36] [I] Device: 0
[07/31/2022-10:07:36] [I] DLACore:
[07/31/2022-10:07:36] [I] Plugins:
[07/31/2022-10:07:36] [I] === Inference Options ===
[07/31/2022-10:07:36] [I] Batch: Explicit
[07/31/2022-10:07:36] [I] Input inference shape: keypoints=1x2565x2
[07/31/2022-10:07:36] [I] Input inference shape: scores.1=1x2565
[07/31/2022-10:07:36] [I] Input inference shape: score_map=1x1x320x256
[07/31/2022-10:07:36] [I] Input inference shape: dense_feat_map=1x128x80x64
[07/31/2022-10:07:36] [I] Iterations: 10
[07/31/2022-10:07:36] [I] Duration: 3s (+ 200ms warm up)
[07/31/2022-10:07:36] [I] Sleep time: 0ms
[07/31/2022-10:07:36] [I] Idle time: 0ms
[07/31/2022-10:07:36] [I] Streams: 1
[07/31/2022-10:07:36] [I] ExposeDMA: Disabled
[07/31/2022-10:07:36] [I] Data transfers: Enabled
[07/31/2022-10:07:36] [I] Spin-wait: Disabled
[07/31/2022-10:07:36] [I] Multithreading: Disabled
[07/31/2022-10:07:36] [I] CUDA Graph: Disabled
[07/31/2022-10:07:36] [I] Separate profiling: Disabled
[07/31/2022-10:07:36] [I] Time Deserialize: Disabled
[07/31/2022-10:07:36] [I] Time Refit: Disabled
[07/31/2022-10:07:36] [I] Inputs:
[07/31/2022-10:07:36] [I] === Reporting Options ===
[07/31/2022-10:07:36] [I] Verbose: Disabled
[07/31/2022-10:07:36] [I] Averages: 10 inferences
[07/31/2022-10:07:36] [I] Percentile: 99
[07/31/2022-10:07:36] [I] Dump refittable layers:Disabled
[07/31/2022-10:07:36] [I] Dump output: Disabled
[07/31/2022-10:07:36] [I] Profile: Disabled
[07/31/2022-10:07:36] [I] Export timing to JSON file:
[07/31/2022-10:07:36] [I] Export output to JSON file:
[07/31/2022-10:07:36] [I] Export profile to JSON file:
[07/31/2022-10:07:36] [I]
[07/31/2022-10:07:36] [I] === Device Information ===
[07/31/2022-10:07:36] [I] Selected Device: Quadro T2000
[07/31/2022-10:07:36] [I] Compute Capability: 7.5
[07/31/2022-10:07:36] [I] SMs: 16
[07/31/2022-10:07:36] [I] Compute Clock Rate: 1.785 GHz
[07/31/2022-10:07:36] [I] Device Global Memory: 4096 MiB
[07/31/2022-10:07:36] [I] Shared Memory per SM: 64 KiB
[07/31/2022-10:07:36] [I] Memory Bus Width: 128 bits (ECC disabled)
[07/31/2022-10:07:36] [I] Memory Clock Rate: 4.001 GHz
[07/31/2022-10:07:36] [I]
[07/31/2022-10:07:36] [I] TensorRT version: 8.4.2
[07/31/2022-10:07:37] [I] [TRT] [MemUsageChange] Init CUDA: CPU +412, GPU +0, now: CPU 8675, GPU 901 (MiB)
[07/31/2022-10:07:38] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +215, GPU +68, now: CPU 9078, GPU 969 (MiB)
[07/31/2022-10:07:38] [I] Start parsing network model
[07/31/2022-10:07:38] [I] [TRT] ----------------------------------------------------------------
[07/31/2022-10:07:38] [I] [TRT] Input filename:   C:/projects/Dleware/Testers/Playground/Models/ASLtorch/Delivery/model2_folded.onnx
[07/31/2022-10:07:38] [I] [TRT] ONNX IR version:  0.0.7
[07/31/2022-10:07:38] [I] [TRT] Opset version:    13
[07/31/2022-10:07:38] [I] [TRT] Producer name:
[07/31/2022-10:07:38] [I] [TRT] Producer version:
[07/31/2022-10:07:38] [I] [TRT] Domain:
[07/31/2022-10:07:38] [I] [TRT] Model version:    0
[07/31/2022-10:07:38] [I] [TRT] Doc string:
[07/31/2022-10:07:38] [I] [TRT] ----------------------------------------------------------------
[07/31/2022-10:07:38] [W] [TRT] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[07/31/2022-10:07:38] [I] Finish parsing network model
[07/31/2022-10:07:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +539, GPU +206, now: CPU 9479, GPU 1175 (MiB)
[07/31/2022-10:07:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +407, GPU +146, now: CPU 9886, GPU 1321 (MiB)
[07/31/2022-10:07:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.0.4
[07/31/2022-10:07:39] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/31/2022-10:07:50] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[07/31/2022-10:07:50] [W] [TRT]  (# 1 (SHAPE keypoints))
[07/31/2022-10:07:50] [W] [TRT]  (# 2 (SHAPE dense_feat_map))
[07/31/2022-10:07:50] [W] [TRT]  (# 3 (SHAPE dense_feat_map))
[07/31/2022-10:07:52] [I] [TRT] Detected 4 inputs and 3 output network tensors.
[07/31/2022-10:07:52] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are:
[07/31/2022-10:07:52] [W] [TRT]  (# 1 (SHAPE keypoints))
[07/31/2022-10:07:52] [W] [TRT]  (# 2 (SHAPE dense_feat_map))
[07/31/2022-10:07:52] [W] [TRT]  (# 3 (SHAPE dense_feat_map))
[07/31/2022-10:07:52] [I] [TRT] Total Host Persistent Memory: 8832
[07/31/2022-10:07:52] [I] [TRT] Total Device Persistent Memory: 2460160
[07/31/2022-10:07:52] [I] [TRT] Total Scratch Memory: 21085440
[07/31/2022-10:07:52] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 375 MiB
[07/31/2022-10:07:52] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.972ms to assign 13 blocks to 22 nodes requiring 27150848 bytes.
[07/31/2022-10:07:52] [I] [TRT] Total Activation Memory: 27150848
[07/31/2022-10:07:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +6, now: CPU 0, GPU 6 (MiB)
[07/31/2022-10:07:52] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[07/31/2022-10:07:52] [W] [TRT] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[07/31/2022-10:07:52] [I] Engine built in 15.3986 sec.
[07/31/2022-10:07:52] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 10203, GPU 1403 (MiB)
[07/31/2022-10:07:52] [I] [TRT] Loaded engine size: 0 MiB
[07/31/2022-10:07:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)
[07/31/2022-10:07:52] [I] Engine deserialized in 0.0029188 sec.
[07/31/2022-10:07:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +28, now: CPU 0, GPU 30 (MiB)
[07/31/2022-10:07:52] [I] Using random values for input keypoints
[07/31/2022-10:07:52] [I] Created input binding for keypoints with dimensions 1x2565x2
[07/31/2022-10:07:52] [I] Using random values for input scores.1
[07/31/2022-10:07:52] [I] Created input binding for scores.1 with dimensions 1x2565
[07/31/2022-10:07:52] [I] Using random values for input score_map
[07/31/2022-10:07:52] [I] Created input binding for score_map with dimensions 1x1x320x256
[07/31/2022-10:07:52] [I] Using random values for input dense_feat_map
[07/31/2022-10:07:52] [I] Created input binding for dense_feat_map with dimensions 1x128x80x64
[07/31/2022-10:07:52] [I] Using random values for output scores
[07/31/2022-10:07:52] [I] Created output binding for scores with dimensions 2565x1
[07/31/2022-10:07:52] [I] Using random values for output descs
[07/31/2022-10:07:52] [I] Created output binding for descs with dimensions 1x2565x128
[07/31/2022-10:07:52] [I] Using random values for output kpts
[07/31/2022-10:07:52] [I] Created output binding for kpts with dimensions 1x2565x2
[07/31/2022-10:07:52] [I] Starting inference
[07/31/2022-10:07:55] [I] Warmup completed 210 queries over 200 ms
[07/31/2022-10:07:55] [I] Timing trace has 3524 queries over 3.00251 s
[07/31/2022-10:07:55] [I]
[07/31/2022-10:07:55] [I] === Trace details ===
[07/31/2022-10:07:55] [I] Trace averages of 10 runs:
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.859322 ms - Host latency: 1.24322 ms (enqueue 0.0853851 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.886624 ms - Host latency: 1.27203 ms (enqueue 0.0851501 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.887576 ms - Host latency: 1.27273 ms (enqueue 0.0850586 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.906545 ms - Host latency: 1.31155 ms (enqueue 0.0981308 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.851265 ms - Host latency: 1.24614 ms (enqueue 0.0927414 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833945 ms - Host latency: 1.21683 ms (enqueue 0.0826996 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843884 ms - Host latency: 1.23357 ms (enqueue 0.0887588 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836957 ms - Host latency: 1.23304 ms (enqueue 0.103876 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.848743 ms - Host latency: 1.25468 ms (enqueue 0.103326 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838507 ms - Host latency: 1.23635 ms (enqueue 0.110602 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.856219 ms - Host latency: 1.25658 ms (enqueue 0.100003 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837473 ms - Host latency: 1.22844 ms (enqueue 0.0931549 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.844672 ms - Host latency: 1.24457 ms (enqueue 0.0841492 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832016 ms - Host latency: 1.21906 ms (enqueue 0.0833954 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.856549 ms - Host latency: 1.26299 ms (enqueue 0.0895905 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840356 ms - Host latency: 1.23782 ms (enqueue 0.0828003 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835391 ms - Host latency: 1.21867 ms (enqueue 0.085907 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838684 ms - Host latency: 1.22591 ms (enqueue 0.0845184 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835736 ms - Host latency: 1.21844 ms (enqueue 0.084845 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.824088 ms - Host latency: 1.20951 ms (enqueue 0.0883972 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830377 ms - Host latency: 1.21364 ms (enqueue 0.0848358 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.825198 ms - Host latency: 1.21041 ms (enqueue 0.0879944 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.826975 ms - Host latency: 1.21212 ms (enqueue 0.0891357 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.825629 ms - Host latency: 1.2089 ms (enqueue 0.084552 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842841 ms - Host latency: 1.22822 ms (enqueue 0.0841461 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837549 ms - Host latency: 1.23329 ms (enqueue 0.0915497 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838528 ms - Host latency: 1.22437 ms (enqueue 0.085199 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830536 ms - Host latency: 1.21369 ms (enqueue 0.0863037 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832925 ms - Host latency: 1.22036 ms (enqueue 0.0900391 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834775 ms - Host latency: 1.21823 ms (enqueue 0.0831207 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828998 ms - Host latency: 1.21858 ms (enqueue 0.0882782 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840463 ms - Host latency: 1.23916 ms (enqueue 0.101453 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.845233 ms - Host latency: 1.25261 ms (enqueue 0.0965973 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838043 ms - Host latency: 1.22797 ms (enqueue 0.084845 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.86246 ms - Host latency: 1.28496 ms (enqueue 0.112051 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836343 ms - Host latency: 1.22968 ms (enqueue 0.0884583 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837421 ms - Host latency: 1.23841 ms (enqueue 0.0906982 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827765 ms - Host latency: 1.2217 ms (enqueue 0.0824036 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.847058 ms - Host latency: 1.23698 ms (enqueue 0.0850952 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836847 ms - Host latency: 1.22327 ms (enqueue 0.0854126 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835754 ms - Host latency: 1.2186 ms (enqueue 0.08479 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.853723 ms - Host latency: 1.26522 ms (enqueue 0.0907776 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83313 ms - Host latency: 1.22538 ms (enqueue 0.0969055 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830518 ms - Host latency: 1.21332 ms (enqueue 0.0853516 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.8349 ms - Host latency: 1.21843 ms (enqueue 0.0847107 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.852478 ms - Host latency: 1.24062 ms (enqueue 0.0872131 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829071 ms - Host latency: 1.21252 ms (enqueue 0.0853027 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835474 ms - Host latency: 1.22286 ms (enqueue 0.0851624 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831311 ms - Host latency: 1.21435 ms (enqueue 0.0846191 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842072 ms - Host latency: 1.23209 ms (enqueue 0.0908997 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838806 ms - Host latency: 1.22181 ms (enqueue 0.0851318 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.847821 ms - Host latency: 1.25037 ms (enqueue 0.105359 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834015 ms - Host latency: 1.22733 ms (enqueue 0.083136 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.844849 ms - Host latency: 1.23366 ms (enqueue 0.0888733 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.849872 ms - Host latency: 1.23564 ms (enqueue 0.0852966 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833441 ms - Host latency: 1.2165 ms (enqueue 0.0830872 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.924591 ms - Host latency: 1.34191 ms (enqueue 0.103687 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.823163 ms - Host latency: 1.20475 ms (enqueue 0.0827515 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829382 ms - Host latency: 1.21262 ms (enqueue 0.0825256 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834332 ms - Host latency: 1.21812 ms (enqueue 0.0824036 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833228 ms - Host latency: 1.21859 ms (enqueue 0.0854614 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836633 ms - Host latency: 1.22004 ms (enqueue 0.0821838 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.82868 ms - Host latency: 1.21219 ms (enqueue 0.0826416 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829584 ms - Host latency: 1.21762 ms (enqueue 0.0818848 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83006 ms - Host latency: 1.21465 ms (enqueue 0.0850342 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838202 ms - Host latency: 1.23391 ms (enqueue 0.0910584 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828882 ms - Host latency: 1.21259 ms (enqueue 0.0840637 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834589 ms - Host latency: 1.22388 ms (enqueue 0.0918884 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827295 ms - Host latency: 1.21053 ms (enqueue 0.0827881 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.845959 ms - Host latency: 1.23838 ms (enqueue 0.0887329 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830682 ms - Host latency: 1.21265 ms (enqueue 0.0824219 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833289 ms - Host latency: 1.23255 ms (enqueue 0.0954773 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.826154 ms - Host latency: 1.20916 ms (enqueue 0.0828064 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836102 ms - Host latency: 1.22457 ms (enqueue 0.0873901 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829858 ms - Host latency: 1.2126 ms (enqueue 0.0825317 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834741 ms - Host latency: 1.22208 ms (enqueue 0.0898682 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840527 ms - Host latency: 1.23145 ms (enqueue 0.0823486 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843439 ms - Host latency: 1.2346 ms (enqueue 0.0867188 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832532 ms - Host latency: 1.22645 ms (enqueue 0.0942383 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832129 ms - Host latency: 1.21588 ms (enqueue 0.083667 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831354 ms - Host latency: 1.2136 ms (enqueue 0.0828125 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832275 ms - Host latency: 1.21531 ms (enqueue 0.082721 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83382 ms - Host latency: 1.219 ms (enqueue 0.0854553 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835278 ms - Host latency: 1.21858 ms (enqueue 0.0821716 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.853497 ms - Host latency: 1.27411 ms (enqueue 0.105786 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830609 ms - Host latency: 1.21467 ms (enqueue 0.0827515 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.846484 ms - Host latency: 1.23685 ms (enqueue 0.0847046 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.8297 ms - Host latency: 1.21359 ms (enqueue 0.0842102 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839795 ms - Host latency: 1.22993 ms (enqueue 0.0882019 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840765 ms - Host latency: 1.23369 ms (enqueue 0.0849793 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84856 ms - Host latency: 1.26513 ms (enqueue 0.110126 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.876874 ms - Host latency: 1.31198 ms (enqueue 0.105914 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.844049 ms - Host latency: 1.25071 ms (enqueue 0.0927063 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.901624 ms - Host latency: 1.29758 ms (enqueue 0.0908752 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83172 ms - Host latency: 1.21497 ms (enqueue 0.0856445 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834332 ms - Host latency: 1.22054 ms (enqueue 0.0878113 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83125 ms - Host latency: 1.21384 ms (enqueue 0.0851257 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.863159 ms - Host latency: 1.25837 ms (enqueue 0.0903931 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829565 ms - Host latency: 1.21222 ms (enqueue 0.0845459 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.879639 ms - Host latency: 1.26652 ms (enqueue 0.0863159 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840601 ms - Host latency: 1.23518 ms (enqueue 0.0894775 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830005 ms - Host latency: 1.21495 ms (enqueue 0.0847656 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.855969 ms - Host latency: 1.33939 ms (enqueue 0.0988647 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835327 ms - Host latency: 1.22085 ms (enqueue 0.0910889 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827295 ms - Host latency: 1.21193 ms (enqueue 0.0858887 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.82666 ms - Host latency: 1.21025 ms (enqueue 0.0844605 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83783 ms - Host latency: 1.22771 ms (enqueue 0.0938232 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827844 ms - Host latency: 1.21084 ms (enqueue 0.0840576 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.841052 ms - Host latency: 1.22742 ms (enqueue 0.0859619 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829785 ms - Host latency: 1.21326 ms (enqueue 0.0840454 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836353 ms - Host latency: 1.23527 ms (enqueue 0.100244 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838611 ms - Host latency: 1.23417 ms (enqueue 0.0828979 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.848975 ms - Host latency: 1.25283 ms (enqueue 0.103711 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833533 ms - Host latency: 1.2226 ms (enqueue 0.0829956 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829346 ms - Host latency: 1.21124 ms (enqueue 0.0821533 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835974 ms - Host latency: 1.22661 ms (enqueue 0.0830322 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830969 ms - Host latency: 1.21372 ms (enqueue 0.0828491 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.857263 ms - Host latency: 1.28945 ms (enqueue 0.0991089 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842749 ms - Host latency: 1.23348 ms (enqueue 0.0822754 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.853088 ms - Host latency: 1.27131 ms (enqueue 0.0949341 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833948 ms - Host latency: 1.22043 ms (enqueue 0.0870605 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833875 ms - Host latency: 1.22531 ms (enqueue 0.0857422 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.825049 ms - Host latency: 1.20707 ms (enqueue 0.0846069 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833215 ms - Host latency: 1.21714 ms (enqueue 0.0849365 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83678 ms - Host latency: 1.22214 ms (enqueue 0.0841919 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828503 ms - Host latency: 1.21194 ms (enqueue 0.0845825 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827576 ms - Host latency: 1.21047 ms (enqueue 0.0846191 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831665 ms - Host latency: 1.21471 ms (enqueue 0.0841309 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831457 ms - Host latency: 1.21527 ms (enqueue 0.0852661 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832117 ms - Host latency: 1.21449 ms (enqueue 0.0840088 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837939 ms - Host latency: 1.22415 ms (enqueue 0.0847778 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830176 ms - Host latency: 1.2139 ms (enqueue 0.0838257 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831213 ms - Host latency: 1.2146 ms (enqueue 0.0860962 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.822961 ms - Host latency: 1.20868 ms (enqueue 0.0834595 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830054 ms - Host latency: 1.21355 ms (enqueue 0.0828735 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830432 ms - Host latency: 1.21323 ms (enqueue 0.0816284 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832568 ms - Host latency: 1.2194 ms (enqueue 0.0872559 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835754 ms - Host latency: 1.22704 ms (enqueue 0.08396 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831616 ms - Host latency: 1.21447 ms (enqueue 0.0838013 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834436 ms - Host latency: 1.22274 ms (enqueue 0.0897705 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828882 ms - Host latency: 1.21113 ms (enqueue 0.0824707 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838953 ms - Host latency: 1.22628 ms (enqueue 0.0826782 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828052 ms - Host latency: 1.21205 ms (enqueue 0.0820801 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829907 ms - Host latency: 1.21862 ms (enqueue 0.0878052 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835913 ms - Host latency: 1.21791 ms (enqueue 0.0829346 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832397 ms - Host latency: 1.21688 ms (enqueue 0.082251 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.823804 ms - Host latency: 1.21432 ms (enqueue 0.0803467 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.863525 ms - Host latency: 1.2825 ms (enqueue 0.0903198 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.849744 ms - Host latency: 1.26742 ms (enqueue 0.094812 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.824084 ms - Host latency: 1.20725 ms (enqueue 0.0803833 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.82655 ms - Host latency: 1.20924 ms (enqueue 0.0800903 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829138 ms - Host latency: 1.21207 ms (enqueue 0.0799561 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843408 ms - Host latency: 1.2635 ms (enqueue 0.097998 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.82804 ms - Host latency: 1.21045 ms (enqueue 0.0807617 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828015 ms - Host latency: 1.21122 ms (enqueue 0.0806763 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828418 ms - Host latency: 1.21178 ms (enqueue 0.0808594 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.845337 ms - Host latency: 1.26279 ms (enqueue 0.0958618 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831995 ms - Host latency: 1.21536 ms (enqueue 0.0819824 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827942 ms - Host latency: 1.21389 ms (enqueue 0.0829956 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839441 ms - Host latency: 1.24434 ms (enqueue 0.0817383 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.846704 ms - Host latency: 1.26348 ms (enqueue 0.109119 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830005 ms - Host latency: 1.2189 ms (enqueue 0.0919189 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.825879 ms - Host latency: 1.20835 ms (enqueue 0.0826538 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84104 ms - Host latency: 1.2292 ms (enqueue 0.0850342 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830029 ms - Host latency: 1.21318 ms (enqueue 0.0826172 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84021 ms - Host latency: 1.22241 ms (enqueue 0.0847046 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.825671 ms - Host latency: 1.20869 ms (enqueue 0.0822144 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834583 ms - Host latency: 1.21918 ms (enqueue 0.0835083 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830225 ms - Host latency: 1.21354 ms (enqueue 0.0820679 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.854334 ms - Host latency: 1.27435 ms (enqueue 0.10719 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.826697 ms - Host latency: 1.20953 ms (enqueue 0.0827881 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831457 ms - Host latency: 1.21637 ms (enqueue 0.0831787 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840698 ms - Host latency: 1.242 ms (enqueue 0.091272 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843127 ms - Host latency: 1.23904 ms (enqueue 0.0945801 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837549 ms - Host latency: 1.23268 ms (enqueue 0.0886963 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.824243 ms - Host latency: 1.20665 ms (enqueue 0.0825195 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84281 ms - Host latency: 1.24524 ms (enqueue 0.0898438 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827795 ms - Host latency: 1.20991 ms (enqueue 0.0820679 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.82627 ms - Host latency: 1.21797 ms (enqueue 0.0864502 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84646 ms - Host latency: 1.23036 ms (enqueue 0.0817139 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.822717 ms - Host latency: 1.20627 ms (enqueue 0.0834106 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.860059 ms - Host latency: 1.24708 ms (enqueue 0.0837891 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83938 ms - Host latency: 1.22191 ms (enqueue 0.0858765 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840784 ms - Host latency: 1.23271 ms (enqueue 0.0846069 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832605 ms - Host latency: 1.21559 ms (enqueue 0.085083 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.824805 ms - Host latency: 1.20996 ms (enqueue 0.0847412 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828223 ms - Host latency: 1.21141 ms (enqueue 0.0846069 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84939 ms - Host latency: 1.24041 ms (enqueue 0.0853516 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832678 ms - Host latency: 1.21617 ms (enqueue 0.0845337 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831787 ms - Host latency: 1.22168 ms (enqueue 0.0962524 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832239 ms - Host latency: 1.21715 ms (enqueue 0.0844238 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836414 ms - Host latency: 1.21998 ms (enqueue 0.084375 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829456 ms - Host latency: 1.21276 ms (enqueue 0.0842773 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839575 ms - Host latency: 1.24779 ms (enqueue 0.108923 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843457 ms - Host latency: 1.23511 ms (enqueue 0.0834717 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831482 ms - Host latency: 1.21654 ms (enqueue 0.0855713 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.841089 ms - Host latency: 1.24626 ms (enqueue 0.0929321 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842285 ms - Host latency: 1.23074 ms (enqueue 0.0886597 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834741 ms - Host latency: 1.22548 ms (enqueue 0.0908936 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.82688 ms - Host latency: 1.21039 ms (enqueue 0.083374 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829871 ms - Host latency: 1.21349 ms (enqueue 0.0830688 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832446 ms - Host latency: 1.21677 ms (enqueue 0.0822388 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832727 ms - Host latency: 1.223 ms (enqueue 0.0895142 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827905 ms - Host latency: 1.2104 ms (enqueue 0.0820923 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838623 ms - Host latency: 1.22957 ms (enqueue 0.0838013 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833032 ms - Host latency: 1.21613 ms (enqueue 0.0821045 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83147 ms - Host latency: 1.22078 ms (enqueue 0.090918 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84906 ms - Host latency: 1.2694 ms (enqueue 0.0916382 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836707 ms - Host latency: 1.22567 ms (enqueue 0.0859131 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835107 ms - Host latency: 1.22338 ms (enqueue 0.0846313 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834888 ms - Host latency: 1.21793 ms (enqueue 0.0832397 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.851978 ms - Host latency: 1.28446 ms (enqueue 0.0956665 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833508 ms - Host latency: 1.2301 ms (enqueue 0.0862305 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829077 ms - Host latency: 1.21282 ms (enqueue 0.0830688 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.826477 ms - Host latency: 1.21061 ms (enqueue 0.0822998 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830774 ms - Host latency: 1.22504 ms (enqueue 0.0905273 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830835 ms - Host latency: 1.21519 ms (enqueue 0.082019 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831982 ms - Host latency: 1.21604 ms (enqueue 0.082666 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839136 ms - Host latency: 1.23425 ms (enqueue 0.0901855 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827686 ms - Host latency: 1.21062 ms (enqueue 0.082373 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.860083 ms - Host latency: 1.28042 ms (enqueue 0.0949951 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.8323 ms - Host latency: 1.21609 ms (enqueue 0.0834717 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.844287 ms - Host latency: 1.23547 ms (enqueue 0.0877686 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831152 ms - Host latency: 1.21455 ms (enqueue 0.0828125 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83562 ms - Host latency: 1.22092 ms (enqueue 0.0844971 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829712 ms - Host latency: 1.21309 ms (enqueue 0.0822266 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.85166 ms - Host latency: 1.24553 ms (enqueue 0.0846436 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835083 ms - Host latency: 1.22151 ms (enqueue 0.0829102 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837891 ms - Host latency: 1.22278 ms (enqueue 0.0889648 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843823 ms - Host latency: 1.23188 ms (enqueue 0.0839355 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839087 ms - Host latency: 1.23643 ms (enqueue 0.0982422 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828711 ms - Host latency: 1.21157 ms (enqueue 0.0838379 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837036 ms - Host latency: 1.22014 ms (enqueue 0.0825684 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.846606 ms - Host latency: 1.2481 ms (enqueue 0.0932861 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833008 ms - Host latency: 1.22271 ms (enqueue 0.087207 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842822 ms - Host latency: 1.24768 ms (enqueue 0.0946533 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827319 ms - Host latency: 1.21104 ms (enqueue 0.0823975 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.887769 ms - Host latency: 1.36157 ms (enqueue 0.103345 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.862524 ms - Host latency: 1.27756 ms (enqueue 0.0913574 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834155 ms - Host latency: 1.23677 ms (enqueue 0.090332 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836133 ms - Host latency: 1.22549 ms (enqueue 0.0843994 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833081 ms - Host latency: 1.2166 ms (enqueue 0.0843262 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840698 ms - Host latency: 1.23267 ms (enqueue 0.0935059 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834692 ms - Host latency: 1.2189 ms (enqueue 0.0838867 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.86814 ms - Host latency: 1.25515 ms (enqueue 0.0849854 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.8354 ms - Host latency: 1.21902 ms (enqueue 0.0838623 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833423 ms - Host latency: 1.22114 ms (enqueue 0.0847656 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835596 ms - Host latency: 1.22009 ms (enqueue 0.0844238 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835547 ms - Host latency: 1.22056 ms (enqueue 0.0854492 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837036 ms - Host latency: 1.21936 ms (enqueue 0.0842773 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833472 ms - Host latency: 1.22209 ms (enqueue 0.0941162 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.852734 ms - Host latency: 1.25601 ms (enqueue 0.0841797 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831982 ms - Host latency: 1.21462 ms (enqueue 0.0817139 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.823267 ms - Host latency: 1.20984 ms (enqueue 0.0813477 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828247 ms - Host latency: 1.21101 ms (enqueue 0.080542 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.846338 ms - Host latency: 1.24124 ms (enqueue 0.0855469 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834302 ms - Host latency: 1.22117 ms (enqueue 0.0799316 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832105 ms - Host latency: 1.22332 ms (enqueue 0.0886475 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827441 ms - Host latency: 1.21021 ms (enqueue 0.0802734 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.853833 ms - Host latency: 1.25996 ms (enqueue 0.101245 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840527 ms - Host latency: 1.2418 ms (enqueue 0.102295 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.846533 ms - Host latency: 1.25808 ms (enqueue 0.104297 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843579 ms - Host latency: 1.24536 ms (enqueue 0.102295 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837549 ms - Host latency: 1.24902 ms (enqueue 0.11499 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.850586 ms - Host latency: 1.25159 ms (enqueue 0.10437 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.859717 ms - Host latency: 1.28196 ms (enqueue 0.12063 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842871 ms - Host latency: 1.25647 ms (enqueue 0.116211 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836279 ms - Host latency: 1.23013 ms (enqueue 0.103198 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83877 ms - Host latency: 1.23752 ms (enqueue 0.103418 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.847217 ms - Host latency: 1.27427 ms (enqueue 0.12207 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837036 ms - Host latency: 1.24502 ms (enqueue 0.117725 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834644 ms - Host latency: 1.23037 ms (enqueue 0.105566 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837769 ms - Host latency: 1.24167 ms (enqueue 0.111035 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.849097 ms - Host latency: 1.25066 ms (enqueue 0.103589 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837012 ms - Host latency: 1.22368 ms (enqueue 0.0866699 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839429 ms - Host latency: 1.22771 ms (enqueue 0.0900147 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830762 ms - Host latency: 1.21409 ms (enqueue 0.084375 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.858447 ms - Host latency: 1.29075 ms (enqueue 0.0987549 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.844263 ms - Host latency: 1.24761 ms (enqueue 0.10625 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.856323 ms - Host latency: 1.27632 ms (enqueue 0.10105 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840283 ms - Host latency: 1.23445 ms (enqueue 0.086377 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.839526 ms - Host latency: 1.23699 ms (enqueue 0.0917969 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834082 ms - Host latency: 1.21821 ms (enqueue 0.0840332 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832935 ms - Host latency: 1.21506 ms (enqueue 0.0844482 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831714 ms - Host latency: 1.2146 ms (enqueue 0.0837158 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.844043 ms - Host latency: 1.23508 ms (enqueue 0.0981201 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84353 ms - Host latency: 1.23154 ms (enqueue 0.083252 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843994 ms - Host latency: 1.25059 ms (enqueue 0.100586 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836694 ms - Host latency: 1.21963 ms (enqueue 0.0837646 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830542 ms - Host latency: 1.21228 ms (enqueue 0.0830322 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838672 ms - Host latency: 1.22827 ms (enqueue 0.0833252 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835303 ms - Host latency: 1.22607 ms (enqueue 0.0852051 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830957 ms - Host latency: 1.22275 ms (enqueue 0.0892822 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828638 ms - Host latency: 1.21067 ms (enqueue 0.0824219 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83584 ms - Host latency: 1.22908 ms (enqueue 0.0960449 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831714 ms - Host latency: 1.21516 ms (enqueue 0.0822022 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836694 ms - Host latency: 1.21973 ms (enqueue 0.0834961 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832959 ms - Host latency: 1.22087 ms (enqueue 0.0822266 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833423 ms - Host latency: 1.22505 ms (enqueue 0.0906494 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.841968 ms - Host latency: 1.25115 ms (enqueue 0.0937012 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.851416 ms - Host latency: 1.26257 ms (enqueue 0.110278 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842554 ms - Host latency: 1.24771 ms (enqueue 0.0902344 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834863 ms - Host latency: 1.21853 ms (enqueue 0.0855713 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834644 ms - Host latency: 1.21873 ms (enqueue 0.0829346 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831226 ms - Host latency: 1.21499 ms (enqueue 0.0823486 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.901685 ms - Host latency: 1.28892 ms (enqueue 0.0876221 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832471 ms - Host latency: 1.21553 ms (enqueue 0.0826172 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829224 ms - Host latency: 1.21357 ms (enqueue 0.0890137 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827173 ms - Host latency: 1.21021 ms (enqueue 0.0822754 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830151 ms - Host latency: 1.21355 ms (enqueue 0.0831055 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830884 ms - Host latency: 1.21467 ms (enqueue 0.0822022 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.854419 ms - Host latency: 1.28079 ms (enqueue 0.119238 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831274 ms - Host latency: 1.21978 ms (enqueue 0.0849365 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.832373 ms - Host latency: 1.2167 ms (enqueue 0.083374 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.840161 ms - Host latency: 1.22998 ms (enqueue 0.0827148 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831934 ms - Host latency: 1.21558 ms (enqueue 0.0825195 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838452 ms - Host latency: 1.23484 ms (enqueue 0.0985107 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831519 ms - Host latency: 1.21516 ms (enqueue 0.0822754 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.84165 ms - Host latency: 1.2355 ms (enqueue 0.0889404 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831055 ms - Host latency: 1.21545 ms (enqueue 0.0821045 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.831885 ms - Host latency: 1.21841 ms (enqueue 0.0866699 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.826978 ms - Host latency: 1.21055 ms (enqueue 0.0817871 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838721 ms - Host latency: 1.22283 ms (enqueue 0.0838867 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836816 ms - Host latency: 1.23162 ms (enqueue 0.0831543 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842847 ms - Host latency: 1.24116 ms (enqueue 0.0947021 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.833057 ms - Host latency: 1.21687 ms (enqueue 0.0821777 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829883 ms - Host latency: 1.21526 ms (enqueue 0.0871094 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835864 ms - Host latency: 1.23181 ms (enqueue 0.0931152 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828931 ms - Host latency: 1.21106 ms (enqueue 0.0821045 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.849536 ms - Host latency: 1.25818 ms (enqueue 0.0858398 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.838379 ms - Host latency: 1.23892 ms (enqueue 0.0863525 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.843872 ms - Host latency: 1.24458 ms (enqueue 0.0919922 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828955 ms - Host latency: 1.21326 ms (enqueue 0.0821777 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830762 ms - Host latency: 1.21721 ms (enqueue 0.0829346 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.823364 ms - Host latency: 1.20754 ms (enqueue 0.0821777 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835498 ms - Host latency: 1.22427 ms (enqueue 0.0867432 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830493 ms - Host latency: 1.21829 ms (enqueue 0.0818848 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.835034 ms - Host latency: 1.22585 ms (enqueue 0.086084 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.849927 ms - Host latency: 1.29221 ms (enqueue 0.0899414 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.834277 ms - Host latency: 1.2312 ms (enqueue 0.0897461 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.836182 ms - Host latency: 1.23203 ms (enqueue 0.0893799 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.83457 ms - Host latency: 1.21765 ms (enqueue 0.0828125 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829785 ms - Host latency: 1.2179 ms (enqueue 0.0825928 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830591 ms - Host latency: 1.21311 ms (enqueue 0.0823486 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.837012 ms - Host latency: 1.23003 ms (enqueue 0.090332 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.828931 ms - Host latency: 1.21121 ms (enqueue 0.0816895 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.842188 ms - Host latency: 1.25808 ms (enqueue 0.109253 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.830713 ms - Host latency: 1.21748 ms (enqueue 0.0827393 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.841187 ms - Host latency: 1.24834 ms (enqueue 0.106104 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.827417 ms - Host latency: 1.21809 ms (enqueue 0.0899902 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.825659 ms - Host latency: 1.20979 ms (enqueue 0.0828857 ms)
[07/31/2022-10:07:55] [I] Average on 10 runs - GPU latency: 0.829321 ms - Host latency: 1.21692 ms (enqueue 0.0875488 ms)
[07/31/2022-10:07:55] [I]
[07/31/2022-10:07:55] [I] === Performance summary ===
[07/31/2022-10:07:55] [I] Throughput: 1173.68 qps
[07/31/2022-10:07:55] [I] Latency: min = 1.15747 ms, max = 2.0061 ms, mean = 1.2299 ms, median = 1.21655 ms, percentile(99%) = 1.42554 ms
[07/31/2022-10:07:55] [I] Enqueue Time: min = 0.0787354 ms, max = 0.267578 ms, mean = 0.0883669 ms, median = 0.0834961 ms, percentile(99%) = 0.17807 ms
[07/31/2022-10:07:55] [I] H2D Latency: min = 0.245361 ms, max = 0.545654 ms, mean = 0.264068 ms, median = 0.259033 ms, percentile(99%) = 0.372803 ms
[07/31/2022-10:07:55] [I] GPU Compute Time: min = 0.776611 ms, max = 1.62408 ms, mean = 0.838034 ms, median = 0.831894 ms, percentile(99%) = 0.944931 ms
[07/31/2022-10:07:55] [I] D2H Latency: min = 0.115479 ms, max = 0.33667 ms, mean = 0.127797 ms, median = 0.124512 ms, percentile(99%) = 0.208435 ms
[07/31/2022-10:07:55] [I] Total Host Walltime: 3.00251 s
[07/31/2022-10:07:55] [I] Total GPU Compute Time: 2.95323 s
[07/31/2022-10:07:55] [W] * GPU compute time is unstable, with coefficient of variance = 4.18864%.
[07/31/2022-10:07:55] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/31/2022-10:07:55] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/31/2022-10:07:55] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8402] # trtexec.exe --onnx=C:/projects/Dleware/Testers/Playground/Models/ASLtorch/Delivery/model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x8000x2,scores.1:1x8000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000

On my Jetson AGX Xavier 32GB (JetPack 4.6.1 Docker, TensorRT 8.2.1), your model builds once the max shape of keypoints and scores.1 is reduced from 8000 to 3000.

command:

trtexec --onnx=model2_folded.onnx  --saveEngine=model2_folded.engine --minShapes="keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64" --optShapes="keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64" --maxShapes="keypoints:1x3000x2,scores.1:1x3000,score_map:1x1x320x256,dense_feat_map:1x128x190x173" --workspace=30000
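
To try other upper bounds without hand-editing the long shape strings, the command above can be generated with a small Python helper. This is only a sketch: `build_trtexec_cmd` and its parameters are hypothetical names, not part of TensorRT, and it uses only the trtexec flags already shown in this thread. It clamps the opt shape so that it never exceeds the chosen max.

```python
def build_trtexec_cmd(max_keypoints,
                      onnx="model2_folded.onnx",
                      engine="model2_folded.engine",
                      workspace_mib=30000):
    """Return the trtexec argument list for a given keypoint upper bound.

    Hypothetical helper: the shape values are copied from the command in
    this thread; only the keypoints/scores.1 maximum is varied.
    """
    # The opt shape must lie inside [min, max], so clamp it to the new max.
    opt_keypoints = min(2565, max_keypoints)
    score_map = "score_map:1x1x320x256"

    min_shapes = f"keypoints:1x1x2,scores.1:1x1,{score_map},dense_feat_map:1x128x80x64"
    opt_shapes = (f"keypoints:1x{opt_keypoints}x2,scores.1:1x{opt_keypoints},"
                  f"{score_map},dense_feat_map:1x128x80x64")
    max_shapes = (f"keypoints:1x{max_keypoints}x2,scores.1:1x{max_keypoints},"
                  f"{score_map},dense_feat_map:1x128x190x173")

    return [
        "trtexec",
        f"--onnx={onnx}",
        f"--saveEngine={engine}",
        f"--minShapes={min_shapes}",
        f"--optShapes={opt_shapes}",
        f"--maxShapes={max_shapes}",
        f"--workspace={workspace_mib}",
    ]


if __name__ == "__main__":
    # Print one command per candidate upper bound; run them by hand or
    # pass the list to subprocess.run() on the target device.
    for limit in (3000, 5000, 8000):
        print(" ".join(build_trtexec_cmd(limit)))
```

Stepping the limit up from 3000 toward 8000 should show at which size the build starts to fail on a given board.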

log:

&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x3000x2,scores.1:1x3000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000
[08/01/2022-10:27:47] [I] === Model Options ===
[08/01/2022-10:27:47] [I] Format: ONNX
[08/01/2022-10:27:47] [I] Model: model2_folded.onnx
[08/01/2022-10:27:47] [I] Output:
[08/01/2022-10:27:47] [I] === Build Options ===
[08/01/2022-10:27:47] [I] Max batch: explicit batch
[08/01/2022-10:27:47] [I] Workspace: 30000 MiB
[08/01/2022-10:27:47] [I] minTiming: 1
[08/01/2022-10:27:47] [I] avgTiming: 8
[08/01/2022-10:27:47] [I] Precision: FP32
[08/01/2022-10:27:47] [I] Calibration: 
[08/01/2022-10:27:47] [I] Refit: Disabled
[08/01/2022-10:27:47] [I] Sparsity: Disabled
[08/01/2022-10:27:47] [I] Safe mode: Disabled
[08/01/2022-10:27:47] [I] DirectIO mode: Disabled
[08/01/2022-10:27:47] [I] Restricted mode: Disabled
[08/01/2022-10:27:47] [I] Save engine: model2_folded.engine
[08/01/2022-10:27:47] [I] Load engine: 
[08/01/2022-10:27:47] [I] Profiling verbosity: 0
[08/01/2022-10:27:47] [I] Tactic sources: Using default tactic sources
[08/01/2022-10:27:47] [I] timingCacheMode: local
[08/01/2022-10:27:47] [I] timingCacheFile: 
[08/01/2022-10:27:47] [I] Input(s)s format: fp32:CHW
[08/01/2022-10:27:47] [I] Output(s)s format: fp32:CHW
[08/01/2022-10:27:47] [I] Input build shape: score_map=1x1x320x256+1x1x320x256+1x1x320x256
[08/01/2022-10:27:47] [I] Input build shape: dense_feat_map=1x128x80x64+1x128x80x64+1x128x190x173
[08/01/2022-10:27:47] [I] Input build shape: keypoints=1x1x2+1x2565x2+1x3000x2
[08/01/2022-10:27:47] [I] Input build shape: scores.1=1x1+1x2565+1x3000
[08/01/2022-10:27:47] [I] Input calibration shapes: model
[08/01/2022-10:27:47] [I] === System Options ===
[08/01/2022-10:27:47] [I] Device: 0
[08/01/2022-10:27:47] [I] DLACore: 
[08/01/2022-10:27:47] [I] Plugins:
[08/01/2022-10:27:47] [I] === Inference Options ===
[08/01/2022-10:27:47] [I] Batch: Explicit
[08/01/2022-10:27:47] [I] Input inference shape: score_map=1x1x320x256
[08/01/2022-10:27:47] [I] Input inference shape: scores.1=1x2565
[08/01/2022-10:27:47] [I] Input inference shape: keypoints=1x2565x2
[08/01/2022-10:27:47] [I] Input inference shape: dense_feat_map=1x128x80x64
[08/01/2022-10:27:47] [I] Iterations: 10
[08/01/2022-10:27:47] [I] Duration: 3s (+ 200ms warm up)
[08/01/2022-10:27:47] [I] Sleep time: 0ms
[08/01/2022-10:27:47] [I] Idle time: 0ms
[08/01/2022-10:27:47] [I] Streams: 1
[08/01/2022-10:27:47] [I] ExposeDMA: Disabled
[08/01/2022-10:27:47] [I] Data transfers: Enabled
[08/01/2022-10:27:47] [I] Spin-wait: Disabled
[08/01/2022-10:27:47] [I] Multithreading: Disabled
[08/01/2022-10:27:47] [I] CUDA Graph: Disabled
[08/01/2022-10:27:47] [I] Separate profiling: Disabled
[08/01/2022-10:27:47] [I] Time Deserialize: Disabled
[08/01/2022-10:27:47] [I] Time Refit: Disabled
[08/01/2022-10:27:47] [I] Skip inference: Disabled
[08/01/2022-10:27:47] [I] Inputs:
[08/01/2022-10:27:47] [I] === Reporting Options ===
[08/01/2022-10:27:47] [I] Verbose: Disabled
[08/01/2022-10:27:47] [I] Averages: 10 inferences
[08/01/2022-10:27:47] [I] Percentile: 99
[08/01/2022-10:27:47] [I] Dump refittable layers:Disabled
[08/01/2022-10:27:47] [I] Dump output: Disabled
[08/01/2022-10:27:47] [I] Profile: Disabled
[08/01/2022-10:27:47] [I] Export timing to JSON file: 
[08/01/2022-10:27:47] [I] Export output to JSON file: 
[08/01/2022-10:27:47] [I] Export profile to JSON file: 
[08/01/2022-10:27:47] [I] 
[08/01/2022-10:27:47] [I] === Device Information ===
[08/01/2022-10:27:47] [I] Selected Device: Xavier
[08/01/2022-10:27:47] [I] Compute Capability: 7.2
[08/01/2022-10:27:47] [I] SMs: 8
[08/01/2022-10:27:47] [I] Compute Clock Rate: 1.377 GHz
[08/01/2022-10:27:47] [I] Device Global Memory: 31928 MiB
[08/01/2022-10:27:47] [I] Shared Memory per SM: 96 KiB
[08/01/2022-10:27:47] [I] Memory Bus Width: 256 bits (ECC disabled)
[08/01/2022-10:27:47] [I] Memory Clock Rate: 1.377 GHz
[08/01/2022-10:27:47] [I] 
[08/01/2022-10:27:47] [I] TensorRT version: 8.2.1
[08/01/2022-10:27:48] [I] [TRT] [MemUsageChange] Init CUDA: CPU +362, GPU +0, now: CPU 381, GPU 3064 (MiB)
[08/01/2022-10:27:48] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 381 MiB, GPU 3064 MiB
[08/01/2022-10:27:48] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 486 MiB, GPU 3171 MiB
[08/01/2022-10:27:48] [I] Start parsing network model
[08/01/2022-10:27:48] [I] [TRT] ----------------------------------------------------------------
[08/01/2022-10:27:48] [I] [TRT] Input filename:   model2_folded.onnx
[08/01/2022-10:27:48] [I] [TRT] ONNX IR version:  0.0.7
[08/01/2022-10:27:48] [I] [TRT] Opset version:    13
[08/01/2022-10:27:48] [I] [TRT] Producer name:    
[08/01/2022-10:27:48] [I] [TRT] Producer version: 
[08/01/2022-10:27:48] [I] [TRT] Domain:           
[08/01/2022-10:27:48] [I] [TRT] Model version:    0
[08/01/2022-10:27:48] [I] [TRT] Doc string:       
[08/01/2022-10:27:48] [I] [TRT] ----------------------------------------------------------------
[08/01/2022-10:27:48] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [W] [TRT] Output type must be INT32 for shape outputs
[08/01/2022-10:27:49] [I] Finish parsing network model
[08/01/2022-10:27:49] [W] [TRT] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU
[08/01/2022-10:27:49] [I] [TRT] ---------- Layers Running on DLA ----------
[08/01/2022-10:27:49] [I] [TRT] ---------- Layers Running on GPU ----------
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_10
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_8
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_6
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_4
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Conv_2
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] {ForeignNode[19 + (Unnamed Layer* 29) [Shuffle]...Concat_45]}
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Transpose_242 + Flatten_243
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] Transpose_91
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] [HostToDeviceCopy]
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] PWN(Clip_46)
[08/01/2022-10:27:49] [I] [TRT] [GpuLayer] {ForeignNode[55...Div_261]}
[08/01/2022-10:27:50] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +226, GPU +381, now: CPU 713, GPU 3554 (MiB)
[08/01/2022-10:27:51] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +308, GPU +510, now: CPU 1021, GPU 4064 (MiB)
[08/01/2022-10:27:51] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/01/2022-10:28:18] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[08/01/2022-10:28:18] [W] [TRT]  (# 3 (SHAPE dense_feat_map))
[08/01/2022-10:28:18] [W] [TRT]  (# 1 (SHAPE keypoints))
[08/01/2022-10:28:18] [W] [TRT]  (# 2 (SHAPE dense_feat_map))
[08/01/2022-10:28:35] [I] [TRT] Detected 4 inputs and 3 output network tensors.
[08/01/2022-10:28:35] [W] [TRT] Myelin graph with multiple dynamic values may have poor performance if they differ. Dynamic values are: 
[08/01/2022-10:28:35] [W] [TRT]  (# 3 (SHAPE dense_feat_map))
[08/01/2022-10:28:35] [W] [TRT]  (# 1 (SHAPE keypoints))
[08/01/2022-10:28:35] [W] [TRT]  (# 2 (SHAPE dense_feat_map))
[08/01/2022-10:28:35] [I] [TRT] Total Host Persistent Memory: 11440
[08/01/2022-10:28:35] [I] [TRT] Total Device Persistent Memory: 0
[08/01/2022-10:28:35] [I] [TRT] Total Scratch Memory: 18433596128
[08/01/2022-10:28:35] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 17667 MiB
[08/01/2022-10:28:35] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 1.21227ms to assign 12 blocks to 19 nodes requiring 18452722176 bytes.
[08/01/2022-10:28:35] [I] [TRT] Total Activation Memory: 18452722176
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1481, GPU 9003 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1481, GPU 9003 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1480, GPU 8983 (MiB)
[08/01/2022-10:28:35] [I] [TRT] Loaded engine size: 6 MiB
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1487, GPU 8983 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1487, GPU 8983 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[08/01/2022-10:28:35] [I] Engine built in 48.4199 sec.
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1375, GPU 8986 (MiB)
[08/01/2022-10:28:35] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1375, GPU 8986 (MiB)
[08/01/2022-10:28:38] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +17597, now: CPU 0, GPU 17597 (MiB)
[08/01/2022-10:28:38] [I] Using random values for input keypoints
[08/01/2022-10:28:38] [I] Created input binding for keypoints with dimensions 1x2565x2
[08/01/2022-10:28:38] [I] Using random values for input scores.1
[08/01/2022-10:28:38] [I] Created input binding for scores.1 with dimensions 1x2565
[08/01/2022-10:28:38] [I] Using random values for input score_map
[08/01/2022-10:28:38] [I] Created input binding for score_map with dimensions 1x1x320x256
[08/01/2022-10:28:38] [I] Using random values for input dense_feat_map
[08/01/2022-10:28:38] [I] Created input binding for dense_feat_map with dimensions 1x128x80x64
[08/01/2022-10:28:38] [I] Using random values for output scores
[08/01/2022-10:28:38] [I] Created output binding for scores with dimensions 2565x1
[08/01/2022-10:28:38] [I] Using random values for output descs
[08/01/2022-10:28:38] [I] Created output binding for descs with dimensions 1x2565x128
[08/01/2022-10:28:38] [I] Using random values for output kpts
[08/01/2022-10:28:38] [I] Created output binding for kpts with dimensions 1x2565x2
[08/01/2022-10:28:38] [I] Starting inference
[08/01/2022-10:28:42] [I] Warmup completed 1 queries over 200 ms
[08/01/2022-10:28:42] [I] Timing trace has 17 queries over 3.52191 s
[08/01/2022-10:28:42] [I] 
[08/01/2022-10:28:42] [I] === Trace details ===
[08/01/2022-10:28:42] [I] Trace averages of 10 runs:
[08/01/2022-10:28:42] [I] Average on 10 runs - GPU latency: 207.042 ms - Host latency: 207.221 ms (end to end 207.23 ms, enqueue 0.532111 ms)
[08/01/2022-10:28:42] [I] 
[08/01/2022-10:28:42] [I] === Performance summary ===
[08/01/2022-10:28:42] [I] Throughput: 4.82692 qps
[08/01/2022-10:28:42] [I] Latency: min = 205.671 ms, max = 208.697 ms, mean = 207.162 ms, median = 207.218 ms, percentile(99%) = 208.697 ms
[08/01/2022-10:28:42] [I] End-to-End Host Latency: min = 205.679 ms, max = 208.708 ms, mean = 207.171 ms, median = 207.223 ms, percentile(99%) = 208.708 ms
[08/01/2022-10:28:42] [I] Enqueue Time: min = 0.421387 ms, max = 0.639282 ms, mean = 0.496025 ms, median = 0.485596 ms, percentile(99%) = 0.639282 ms
[08/01/2022-10:28:42] [I] H2D Latency: min = 0.120239 ms, max = 0.126587 ms, mean = 0.122923 ms, median = 0.122711 ms, percentile(99%) = 0.126587 ms
[08/01/2022-10:28:42] [I] GPU Compute Time: min = 205.492 ms, max = 208.514 ms, mean = 206.984 ms, median = 207.036 ms, percentile(99%) = 208.514 ms
[08/01/2022-10:28:42] [I] D2H Latency: min = 0.0415039 ms, max = 0.0603027 ms, mean = 0.0559046 ms, median = 0.0562744 ms, percentile(99%) = 0.0603027 ms
[08/01/2022-10:28:42] [I] Total Host Walltime: 3.52191 s
[08/01/2022-10:28:42] [I] Total GPU Compute Time: 3.51872 s
[08/01/2022-10:28:42] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/01/2022-10:28:42] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=model2_folded.onnx --saveEngine=model2_folded.engine --minShapes=keypoints:1x1x2,scores.1:1x1,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --optShapes=keypoints:1x2565x2,scores.1:1x2565,score_map:1x1x320x256,dense_feat_map:1x128x80x64 --maxShapes=keypoints:1x3000x2,scores.1:1x3000,score_map:1x1x320x256,dense_feat_map:1x128x190x173 --workspace=30000

Can you try TensorRT 8.2.1 on your PC?
There are several fixes between 8.2.1 and 8.4.0.
If it is simply a version issue, it may work with the JetPack 5.0.1 you are targeting.

My Jetson Xavier NX (JetPack 5.0.1) can build the model, but only if I reduce the max shapes of keypoints, scores.1, and also dense_feat_map. That doesn’t help me, because I need to run the model over the full range of sizes.
I guess TensorRT’s optimization uses a lot of memory, so maybe the next TensorRT version will handle it?

And I still don’t understand why my PC can build the model as is, while the Jetson Xavier NX needs smaller max shapes for the engine build.
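
One way to test the memory guess is to compare what the builder asks for against what the device reports. The sketch below pulls those numbers out of a trtexec log. `memory_report` is a hypothetical helper, and the sample lines are copied from the AGX Xavier log above (max keypoints 3000). In that log the activation memory alone is about 17597 MiB, so a board with less memory than that would presumably struggle with the same profile, although this does not by itself explain why the PC build succeeds.

```python
import re

# Lines copied from the AGX Xavier trtexec log earlier in this thread.
LOG_EXCERPT = """\
[08/01/2022-10:27:47] [I] Device Global Memory: 31928 MiB
[08/01/2022-10:28:35] [I] [TRT] Total Scratch Memory: 18433596128
[08/01/2022-10:28:35] [I] [TRT] Total Activation Memory: 18452722176
"""

BYTES_PER_MIB = 2 ** 20


def memory_report(log_text):
    """Extract memory figures (in MiB) from trtexec log text.

    Hypothetical helper: it only matches the three line formats shown in
    LOG_EXCERPT and returns whichever of them it finds.
    """
    report = {}

    # "Total Scratch Memory" / "Total Activation Memory" are logged in bytes.
    for name, value in re.findall(r"Total (Scratch|Activation) Memory: (\d+)", log_text):
        report[name.lower()] = int(value) / BYTES_PER_MIB

    # "Device Global Memory" is already logged in MiB.
    device = re.search(r"Device Global Memory: (\d+) MiB", log_text)
    if device:
        report["device"] = float(device.group(1))

    return report


if __name__ == "__main__":
    stats = memory_report(LOG_EXCERPT)
    for key, mib in stats.items():
        print(f"{key}: {mib:.0f} MiB")
    if "activation" in stats and "device" in stats:
        fits = stats["activation"] < stats["device"]
        print("activation memory fits in device memory:", fits)
```

Running the same script on the failing Jetson Xavier NX log and on the PC log would show whether the requested memory differs between the two builds.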