Can't Duplicate TAO results for Yolo V3 Inferencing

Description

I can successfully use all aspects of “tao yolo_v3 …” including training, pruning, inferencing, etc. via the example Jupyter notebooks.

Unfortunately, I cannot duplicate the results using a Python or C++ program outside of the tao interface. The engine loads and appears to process an image, but it never finds anything in the image. I suspect the image preprocessing isn’t correct, but I’m not sure what I’m doing wrong; I’ve been following the example provided in …/tensorrt/python/yolo_v3.

What I really need is access to the source code for “tao inference” to make sure I’m following the preprocessing, the engine creation, context, binding, etc., correctly. Is that available somewhere?

Environment

TensorRT Version: 7.2.1.6
GPU Type: Tesla T4
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.1
Operating System + Version: aws-marketplace/NVIDIA Deep Learning AMI v21.06.0-46a68101-e56b-41cd-8e32-631ac6e5d02b
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable): 1.15.4
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): tao yolo_v3 run /bin/bash

Relevant Files

https://drive.google.com/drive/folders/1dHwGyah-pXWkLhDThcu6qM40S6Bq9YWs?usp=sharing

Steps To Reproduce

This uses the environment for cv_samples_v1.2.0.

Step 1: Copy the files in the above google drive folder to /tmp

Step 2: Start the tao container

workon launcher
tao yolo_v3 run /bin/bash
cd /workspace/tao-experiments/yolo_v3
mkdir tensorrt
cd tensorrt

Step 3: Copy the files from /tmp to the …/tensorrt subdirectory

Step 4: Run the python file

python3 trt.py

The output will be:

Reading engine from file /workspace/tao-experiments/yolo_v3/tensorrt/trt.engine
[array([0], dtype=int32), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
...

The output should NOT be all zeros.
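To make the printed list easier to read, here is a minimal sketch of how the four output buffers could be decoded. It assumes the list follows the BatchedNMS plugin's output order (keep count, boxes, scores, classes) and that each image has 200 box slots; `decode_batched_nms` is a hypothetical helper, not part of TAO or TensorRT.

```python
import numpy as np

def decode_batched_nms(outputs, batch_index=0, max_boxes=200):
    """Turn four flat output buffers into a list of detections.

    ASSUMPTION: outputs are ordered keep count, boxes, scores, classes,
    and each image has max_boxes slots.
    """
    keep_count, boxes, scores, classes = outputs
    # Number of valid detections for this image; the remaining slots are padding.
    n = int(np.asarray(keep_count).reshape(-1)[batch_index])
    boxes = np.asarray(boxes, dtype=np.float32).reshape(-1, max_boxes, 4)[batch_index]
    scores = np.asarray(scores, dtype=np.float32).reshape(-1, max_boxes)[batch_index]
    classes = np.asarray(classes, dtype=np.float32).reshape(-1, max_boxes)[batch_index]
    return [
        {"box": boxes[i].tolist(), "score": float(scores[i]), "class_id": int(classes[i])}
        for i in range(n)
    ]
```

With the output shown above, the first array is `[0]`, so this would return an empty list: the engine ran but kept no boxes, which would be consistent with a bad input tensor rather than a broken engine.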

Hi,
Can you try running your model with the trtexec command and share the "--verbose" log if the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the link below for the list of supported operators. If an operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script, if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the links below:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#error-messaging
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#faq

Thanks!

Here it is running trtexec like this:

/usr/src/tensorrt/bin/trtexec --loadEngine=trt.engine --verbose

Results are:

&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --loadEngine=trt.engine --verbose
[11/01/2021-09:15:32] [I] === Model Options ===
[11/01/2021-09:15:32] [I] Format: *
[11/01/2021-09:15:32] [I] Model: 
[11/01/2021-09:15:32] [I] Output:
[11/01/2021-09:15:32] [I] === Build Options ===
[11/01/2021-09:15:32] [I] Max batch: 1
[11/01/2021-09:15:32] [I] Workspace: 16 MiB
[11/01/2021-09:15:32] [I] minTiming: 1
[11/01/2021-09:15:32] [I] avgTiming: 8
[11/01/2021-09:15:32] [I] Precision: FP32
[11/01/2021-09:15:32] [I] Calibration: 
[11/01/2021-09:15:32] [I] Refit: Disabled
[11/01/2021-09:15:32] [I] Sparsity: Disabled
[11/01/2021-09:15:32] [I] Safe mode: Disabled
[11/01/2021-09:15:32] [I] Restricted mode: Disabled
[11/01/2021-09:15:32] [I] Save engine: 
[11/01/2021-09:15:32] [I] Load engine: trt.engine
[11/01/2021-09:15:32] [I] NVTX verbosity: 0
[11/01/2021-09:15:32] [I] Tactic sources: Using default tactic sources
[11/01/2021-09:15:32] [I] timingCacheMode: local
[11/01/2021-09:15:32] [I] timingCacheFile: 
[11/01/2021-09:15:32] [I] Input(s)s format: fp32:CHW
[11/01/2021-09:15:32] [I] Output(s)s format: fp32:CHW
[11/01/2021-09:15:32] [I] Input build shapes: model
[11/01/2021-09:15:32] [I] Input calibration shapes: model
[11/01/2021-09:15:32] [I] === System Options ===
[11/01/2021-09:15:32] [I] Device: 0
[11/01/2021-09:15:32] [I] DLACore: 
[11/01/2021-09:15:32] [I] Plugins:
[11/01/2021-09:15:32] [I] === Inference Options ===
[11/01/2021-09:15:32] [I] Batch: 1
[11/01/2021-09:15:32] [I] Input inference shapes: model
[11/01/2021-09:15:32] [I] Iterations: 10
[11/01/2021-09:15:32] [I] Duration: 3s (+ 200ms warm up)
[11/01/2021-09:15:32] [I] Sleep time: 0ms
[11/01/2021-09:15:32] [I] Streams: 1
[11/01/2021-09:15:32] [I] ExposeDMA: Disabled
[11/01/2021-09:15:32] [I] Data transfers: Enabled
[11/01/2021-09:15:32] [I] Spin-wait: Disabled
[11/01/2021-09:15:32] [I] Multithreading: Disabled
[11/01/2021-09:15:32] [I] CUDA Graph: Disabled
[11/01/2021-09:15:32] [I] Separate profiling: Disabled
[11/01/2021-09:15:32] [I] Time Deserialize: Disabled
[11/01/2021-09:15:32] [I] Time Refit: Disabled
[11/01/2021-09:15:32] [I] Skip inference: Disabled
[11/01/2021-09:15:32] [I] Inputs:
[11/01/2021-09:15:32] [I] === Reporting Options ===
[11/01/2021-09:15:32] [I] Verbose: Enabled
[11/01/2021-09:15:32] [I] Averages: 10 inferences
[11/01/2021-09:15:32] [I] Percentile: 99
[11/01/2021-09:15:32] [I] Dump refittable layers:Disabled
[11/01/2021-09:15:32] [I] Dump output: Disabled
[11/01/2021-09:15:32] [I] Profile: Disabled
[11/01/2021-09:15:32] [I] Export timing to JSON file: 
[11/01/2021-09:15:32] [I] Export output to JSON file: 
[11/01/2021-09:15:32] [I] Export profile to JSON file: 
[11/01/2021-09:15:32] [I] 
[11/01/2021-09:15:32] [I] === Device Information ===
[11/01/2021-09:15:32] [I] Selected Device: Xavier
[11/01/2021-09:15:32] [I] Compute Capability: 7.2
[11/01/2021-09:15:32] [I] SMs: 8
[11/01/2021-09:15:32] [I] Compute Clock Rate: 1.377 GHz
[11/01/2021-09:15:32] [I] Device Global Memory: 15816 MiB
[11/01/2021-09:15:32] [I] Shared Memory per SM: 96 KiB
[11/01/2021-09:15:32] [I] Memory Bus Width: 256 bits (ECC disabled)
[11/01/2021-09:15:32] [I] Memory Clock Rate: 1.377 GHz
[11/01/2021-09:15:32] [I] 
[11/01/2021-09:15:32] [I] TensorRT version: 8001
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::Proposal version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[11/01/2021-09:15:32] [V] [TRT] Registered plugin creator - ::Split version 1
[11/01/2021-09:15:33] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 439, GPU 10609 (MiB)
[11/01/2021-09:15:33] [I] [TRT] Loaded engine size: 67 MB
[11/01/2021-09:15:33] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 439 MiB, GPU 10609 MiB
[11/01/2021-09:15:33] [V] [TRT] Using cublas a tactic source
[11/01/2021-09:15:33] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +227, GPU +295, now: CPU 669, GPU 10973 (MiB)
[11/01/2021-09:15:33] [V] [TRT] Using cuDNN as a tactic source
[11/01/2021-09:15:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +398, now: CPU 976, GPU 11371 (MiB)
[11/01/2021-09:15:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 976, GPU 11368 (MiB)
[11/01/2021-09:15:34] [V] [TRT] Deserialization required 1569689 microseconds.
[11/01/2021-09:15:34] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 976 MiB, GPU 11368 MiB
[11/01/2021-09:15:34] [I] Engine loaded in 2.50073 sec.
[11/01/2021-09:15:34] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 908 MiB, GPU 11300 MiB
[11/01/2021-09:15:34] [V] [TRT] Using cublas a tactic source
[11/01/2021-09:15:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 908, GPU 11300 (MiB)
[11/01/2021-09:15:34] [V] [TRT] Using cuDNN as a tactic source
[11/01/2021-09:15:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +4, now: CPU 908, GPU 11304 (MiB)
[11/01/2021-09:15:34] [V] [TRT] Total per-runner device memory is 68746240
[11/01/2021-09:15:34] [V] [TRT] Total per-runner host memory is 94624
[11/01/2021-09:15:34] [V] [TRT] Allocated activation device memory of size 43777024
[11/01/2021-09:15:34] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 908 MiB, GPU 11412 MiB
[11/01/2021-09:15:34] [I] Created input binding for Input with dimensions 1x3x384x1248
[11/01/2021-09:15:34] [I] Created output binding for BatchedNMS with dimensions 1x1
[11/01/2021-09:15:34] [I] Created output binding for BatchedNMS_1 with dimensions 1x200x4
[11/01/2021-09:15:34] [I] Created output binding for BatchedNMS_2 with dimensions 1x200
[11/01/2021-09:15:34] [I] Created output binding for BatchedNMS_3 with dimensions 1x200
[11/01/2021-09:15:34] [I] Starting inference
[11/01/2021-09:15:38] [I] Warmup completed 9 queries over 200 ms
[11/01/2021-09:15:38] [I] Timing trace has 130 queries over 3.05179 s
[11/01/2021-09:15:38] [I] 
[11/01/2021-09:15:38] [I] === Trace details ===
[11/01/2021-09:15:38] [I] Trace averages of 10 runs:
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3161 ms - Host latency: 23.471 ms (end to end 23.4806 ms, enqueue 1.26517 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3321 ms - Host latency: 23.4872 ms (end to end 23.4954 ms, enqueue 1.14398 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3143 ms - Host latency: 23.4692 ms (end to end 23.4774 ms, enqueue 1.09438 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.2999 ms - Host latency: 23.4548 ms (end to end 23.4637 ms, enqueue 1.07769 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3279 ms - Host latency: 23.4828 ms (end to end 23.4927 ms, enqueue 1.07867 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3138 ms - Host latency: 23.4686 ms (end to end 23.4775 ms, enqueue 1.09276 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3294 ms - Host latency: 23.4843 ms (end to end 23.4938 ms, enqueue 1.07729 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3175 ms - Host latency: 23.4723 ms (end to end 23.4805 ms, enqueue 1.09846 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3053 ms - Host latency: 23.4603 ms (end to end 23.4681 ms, enqueue 1.09309 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3012 ms - Host latency: 23.4562 ms (end to end 23.4624 ms, enqueue 1.07896 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.2908 ms - Host latency: 23.4458 ms (end to end 23.4544 ms, enqueue 1.08989 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.3076 ms - Host latency: 23.4624 ms (end to end 23.4699 ms, enqueue 1.10022 ms)
[11/01/2021-09:15:38] [I] Average on 10 runs - GPU latency: 23.297 ms - Host latency: 23.4526 ms (end to end 23.4625 ms, enqueue 1.11892 ms)
[11/01/2021-09:15:38] [I] 
[11/01/2021-09:15:38] [I] === Performance summary ===
[11/01/2021-09:15:38] [I] Throughput: 42.5979 qps
[11/01/2021-09:15:38] [I] Latency: min = 23.3889 ms, max = 23.5685 ms, mean = 23.4667 ms, median = 23.4631 ms, percentile(99%) = 23.5569 ms
[11/01/2021-09:15:38] [I] End-to-End Host Latency: min = 23.3938 ms, max = 23.5757 ms, mean = 23.4753 ms, median = 23.4705 ms, percentile(99%) = 23.569 ms
[11/01/2021-09:15:38] [I] Enqueue Time: min = 1.01788 ms, max = 1.59755 ms, mean = 1.10842 ms, median = 1.09326 ms, percentile(99%) = 1.56818 ms
[11/01/2021-09:15:38] [I] H2D Latency: min = 0.150635 ms, max = 0.157227 ms, mean = 0.151296 ms, median = 0.151306 ms, percentile(99%) = 0.152344 ms
[11/01/2021-09:15:38] [I] GPU Compute Time: min = 23.2334 ms, max = 23.4141 ms, mean = 23.3118 ms, median = 23.3079 ms, percentile(99%) = 23.4016 ms
[11/01/2021-09:15:38] [I] D2H Latency: min = 0.00268555 ms, max = 0.00415039 ms, mean = 0.00365777 ms, median = 0.00366211 ms, percentile(99%) = 0.00415039 ms
[11/01/2021-09:15:38] [I] Total Host Walltime: 3.05179 s
[11/01/2021-09:15:38] [I] Total GPU Compute Time: 3.03053 s
[11/01/2021-09:15:38] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/01/2021-09:15:38] [V] 
[11/01/2021-09:15:38] [V] === Explanations of the performance metrics ===
[11/01/2021-09:15:38] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[11/01/2021-09:15:38] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[11/01/2021-09:15:38] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[11/01/2021-09:15:38] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[11/01/2021-09:15:38] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[11/01/2021-09:15:38] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[11/01/2021-09:15:38] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[11/01/2021-09:15:38] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[11/01/2021-09:15:38] [V] End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
[11/01/2021-09:15:38] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --loadEngine=trt.engine --verbose
[11/01/2021-09:15:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 928, GPU 11365 (MiB)
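The binding shapes reported in the log above can be used to sanity-check the host buffers that a standalone script allocates. This is only a sketch: the shapes are copied from the log, the int32 type for `BatchedNMS` matches the array printed by trt.py, and float32 for the other bindings is an assumption based on the "fp32:CHW" lines. `binding_nbytes` and `check_buffer` are hypothetical helper names.

```python
import numpy as np

# Shapes taken from the trtexec log; dtypes other than int32 are assumed float32.
BINDINGS = {
    "Input": ((1, 3, 384, 1248), np.float32),
    "BatchedNMS": ((1, 1), np.int32),
    "BatchedNMS_1": ((1, 200, 4), np.float32),
    "BatchedNMS_2": ((1, 200), np.float32),
    "BatchedNMS_3": ((1, 200), np.float32),
}

def binding_nbytes(name):
    """Expected buffer size in bytes for one binding."""
    shape, dtype = BINDINGS[name]
    return int(np.prod(shape)) * np.dtype(dtype).itemsize

def check_buffer(name, array):
    """Return True if a host array matches the binding's dtype and byte size."""
    _, dtype = BINDINGS[name]
    return array.dtype == np.dtype(dtype) and array.nbytes == binding_nbytes(name)
```

For example, `check_buffer("Input", img)` returns False if the image was left as uint8 or float64, which is an easy mistake to make when copying data to the device.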

So it seems to load the engine fine and the bindings seem correct, which makes me think I’m just not preprocessing the image correctly. That is why I’m wondering if the code for “tao inference” is available, so I can try copying the preprocessing part exactly.
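For reference, here is a rough sketch of the kind of preprocessing that would produce a tensor matching the 1x3x384x1248 input binding. The details are assumptions and not confirmed TAO behaviour: it assumes the network expects BGR channel order with per-channel mean subtraction (103.939, 116.779, 123.68) and no further scaling, and it uses a plain nearest-neighbour resize in NumPy as a stand-in for whatever resize “tao inference” really uses. `preprocess` is a hypothetical name.

```python
import numpy as np

MODEL_H, MODEL_W = 384, 1248  # from the input binding 1x3x384x1248
# ASSUMPTION: Caffe-style per-channel means in BGR order; not confirmed for TAO.
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess(image_rgb):
    """Convert an HxWx3 uint8 RGB image into a 1x3x384x1248 float32 tensor."""
    h, w = image_rgb.shape[:2]
    # Nearest-neighbour resize by index lookup (stand-in for a real resize).
    rows = np.arange(MODEL_H) * h // MODEL_H
    cols = np.arange(MODEL_W) * w // MODEL_W
    resized = image_rgb[rows][:, cols]
    # RGB -> BGR, convert to float, subtract the assumed channel means.
    bgr = resized[..., ::-1].astype(np.float32)
    bgr -= BGR_MEAN
    # HWC -> CHW, add the batch dimension, and make the memory contiguous
    # so it can be copied to the device as one flat buffer.
    chw = bgr.transpose(2, 0, 1)
    return np.ascontiguousarray(chw[np.newaxis])
```

If the actual pipeline uses RGB order, a 1/255 scale, or letterboxing instead, the outputs could plausibly come back empty in the same way, so each of these assumptions is worth checking against the training spec.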

The specific code I’m running is at:

The serialized engine is built using the tao example Jupyter notebooks as-is. Mine is at the following link, but it won’t do you any good since it’s machine-specific, right?

And the image is just the first image from the tao examples but here’s a link to a copy of it:

Again, if you could just provide me the preprocessing and inference code for “tao yolo_v3 inference…” I’m sure I could figure it out quickly.

Hi,

This looks TAO-related, so we are moving this post to the TAO forum to get better help.

Thank you.

@bwallach
Please refer to some related topics, for example: Inferring Yolo_v3.trt model in python - #33 by Morganh, and YOLO v4 inference with TensorRT after training with TLT 3.0 - #3 by johan_b.