Script used for NVIDIA Deep Learning Inference performance

Hi all, could you please provide the script used to get the NVIDIA Deep Learning Inference TensorRT performance numbers shown at this link:

[url]https://developer.nvidia.com/deep-learning-performance-training-inference[/url]

Hello,

Data is from our monthly performance baselines. Unfortunately, I cannot share the infrastructure for that beyond what is already described in the blog. All benchmarks use the standard examples shipped with each container.

For example:
trtexec for the CNN performance
sampleNMT for the RNN performance data
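A minimal sketch of invoking the CNN benchmark, assuming the sample-binary layout of the TensorRT container (the /workspace/tensorrt/bin path is an assumption; verify inside your own container):

```shell
#!/bin/sh
# Assumed sample location inside the TensorRT container;
# verify with: find / -name trtexec
SAMPLES=/workspace/tensorrt/bin

if [ -x "$SAMPLES/trtexec" ]; then
    # CNN performance: trtexec with a UFF model (flags as used later in this thread)
    "$SAMPLES/trtexec" --uff=/workspace/tensorrt/python/data/resnet50/resnet50-infer-5.uff \
                       --output=GPU_0/tower_0/Softmax --uffInput=input,3,224,224
else
    echo "trtexec not found; run this inside the TensorRT container"
fi
```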

regards,
NVES

Hi NVES,

What is the location of the trtexec script within the container image nvcr.io/nvidia/tensorflow:18.10-py3? I can't find it.

Also, does it use TensorRT directly or the integrated TensorFlow-TensorRT (TF-TRT)?

Never mind, I found it within the container image tensorrt:18.10-py3. It uses TensorRT directly. Thanks.

Hi NVES, could you please provide a sample command line to run the ResNet-50 network with synthetic data, as described in the blog, using the trtexec script for CNN performance? I have used the command below but don't see a flag to specify synthetic data:

./trtexec --uff=/workspace/tensorrt/python/data/resnet50/resnet50-infer-5.uff --output=Binary_3 --uffInput=Input_0,1,28,28 --useDLA=1 --int8 --allowGPUFallback

Also, I am getting the errors below:
tracefile

uff: /workspace/tensorrt/python/data/resnet50/resnet50-infer-5.uff
output: prob
uffInput: input_tensor,3,256,256
useDLA: 1
int8
allowGPUFallback
UFFParser: Parser error: input: Invalid number of Dimensions 0
Engine could not be created
Engine could not be created

I am running the test from the container image tensorrt:18.10-py3, driver version 410.72, CUDA 10.0, on a T4 GPU.

Working!

Command used:

./trtexec --uff=/workspace/tensorrt/python/data/resnet50/resnet50-infer-5.uff --output=GPU_0/tower_0/Softmax --uffInput=input,3,224,224

Hi all, I ran the test with batch size 128, but I didn't get results similar to those described in the blog.

What flags did you use to get that performance?

Here is the command used:

./trtexec --uff=/workspace/tensorrt/python/data/resnet50/resnet50-infer-5.uff --output=GPU_0/tower_0/Softmax --uffInput=input,3,224,224 --iterations=40 --int8 --batch=128 --device=4 --workspace=1024 --avgRuns=100

Tracefile

uff: /workspace/tensorrt/python/data/resnet50/resnet50-infer-5.uff
output: GPU_0/tower_0/Softmax
uffInput: input,3,224,224
iterations: 40
int8
batch: 128
device: 4
workspace: 1024
avgRuns: 100
name=input, bindingIndex=0, buffers.size()=2
name=GPU_0/tower_0/Softmax, bindingIndex=1, buffers.size()=2
Average over 100 runs is 34.1673 ms (host walltime is 34.2887 ms, 99% percentile time is 37.038).
Average over 100 runs is 34.3756 ms (host walltime is 34.4982 ms, 99% percentile time is 35.1684).
Average over 100 runs is 34.6299 ms (host walltime is 34.7522 ms, 99% percentile time is 35.4479).
Average over 100 runs is 34.8164 ms (host walltime is 34.9412 ms, 99% percentile time is 35.3997).
Average over 100 runs is 35.0114 ms (host walltime is 35.134 ms, 99% percentile time is 35.5574).
Average over 100 runs is 35.2389 ms (host walltime is 35.3653 ms, 99% percentile time is 36.0468).
Average over 100 runs is 34.7587 ms (host walltime is 34.8917 ms, 99% percentile time is 35.8625).
Average over 100 runs is 34.816 ms (host walltime is 34.9391 ms, 99% percentile time is 35.3997).
Average over 100 runs is 34.9231 ms (host walltime is 35.045 ms, 99% percentile time is 35.418).
Average over 100 runs is 35.1043 ms (host walltime is 35.2264 ms, 99% percentile time is 35.5972).
Average over 100 runs is 35.2336 ms (host walltime is 35.3578 ms, 99% percentile time is 35.8802).
Average over 100 runs is 35.4735 ms (host walltime is 35.6041 ms, 99% percentile time is 36.2429).
Average over 100 runs is 35.5457 ms (host walltime is 35.659 ms, 99% percentile time is 36.217).
Average over 100 runs is 35.813 ms (host walltime is 35.9252 ms, 99% percentile time is 36.3454).
Average over 100 runs is 35.9142 ms (host walltime is 36.0264 ms, 99% percentile time is 36.7248).
Average over 100 runs is 36.093 ms (host walltime is 36.2069 ms, 99% percentile time is 36.6874).
Average over 100 runs is 36.2502 ms (host walltime is 36.3641 ms, 99% percentile time is 37.2336).
Average over 100 runs is 36.3277 ms (host walltime is 36.4782 ms, 99% percentile time is 37.2342).
Average over 100 runs is 36.4765 ms (host walltime is 36.6239 ms, 99% percentile time is 37.3).
Average over 100 runs is 36.6658 ms (host walltime is 36.7935 ms, 99% percentile time is 37.2819).
Average over 100 runs is 36.7708 ms (host walltime is 36.9009 ms, 99% percentile time is 37.8935).
Average over 100 runs is 36.9433 ms (host walltime is 37.0748 ms, 99% percentile time is 37.6883).
Average over 100 runs is 37.0133 ms (host walltime is 37.1455 ms, 99% percentile time is 37.6402).
Average over 100 runs is 37.1856 ms (host walltime is 37.3171 ms, 99% percentile time is 37.9271).
Average over 100 runs is 37.3445 ms (host walltime is 37.4794 ms, 99% percentile time is 38.1726).
Average over 100 runs is 37.5002 ms (host walltime is 37.6331 ms, 99% percentile time is 38.1787).
Average over 100 runs is 37.5798 ms (host walltime is 37.7111 ms, 99% percentile time is 38.3487).
Average over 100 runs is 37.749 ms (host walltime is 37.8806 ms, 99% percentile time is 38.6729).
Average over 100 runs is 37.8694 ms (host walltime is 37.9995 ms, 99% percentile time is 38.8259).
Average over 100 runs is 38.0057 ms (host walltime is 38.1356 ms, 99% percentile time is 38.9382).
Average over 100 runs is 38.1585 ms (host walltime is 38.288 ms, 99% percentile time is 39.2839).
Average over 100 runs is 38.2864 ms (host walltime is 38.4141 ms, 99% percentile time is 39.1737).
Average over 100 runs is 38.3479 ms (host walltime is 38.4745 ms, 99% percentile time is 39.2103).
Average over 100 runs is 38.4953 ms (host walltime is 38.6273 ms, 99% percentile time is 39.386).
Average over 100 runs is 38.6081 ms (host walltime is 38.7308 ms, 99% percentile time is 39.6979).
Average over 100 runs is 38.7908 ms (host walltime is 38.9145 ms, 99% percentile time is 39.7195).
Average over 100 runs is 38.9192 ms (host walltime is 39.0432 ms, 99% percentile time is 40.2901).
Average over 100 runs is 39.0548 ms (host walltime is 39.1784 ms, 99% percentile time is 40.0624).
Average over 100 runs is 39.1139 ms (host walltime is 39.2349 ms, 99% percentile time is 40.2408).
Average over 100 runs is 39.2214 ms (host walltime is 39.3428 ms, 99% percentile time is 40.2525).

Throughput = 128/39.2214 ms = ~3,264 img/s
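The arithmetic above can be checked with a one-liner (batch size divided by per-batch latency in seconds gives images per second):

```shell
# 128 images per batch / 39.2214 ms per batch -> images per second
awk 'BEGIN { printf "%.0f img/s\n", 128 / (39.2214 / 1000) }'
# prints: 3264 img/s
```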

Use the Caffe sample to reproduce the benchmarks.

Unfortunately, the performance is still worse on V100 when the TF-TRT benchmark is run as shown below; there is a large gap between the trtexec benchmark and a practical TF-TRT workload :)

Model         Batch  Throughput (img/s)  Latency (ms)  Precision
resnet_v1_50      1                 249          4.01  FP32
resnet_v1_50      2                 409          4.89  FP32
resnet_v1_50      4                 626          6.74  FP32
resnet_v1_50      8                 876          9.46  FP32
resnet_v1_50     16                1089         14.96  FP32
resnet_v1_50     32                1241         26.3   FP32
resnet_v1_50     64                1221         55.34  FP32
resnet_v1_50    128                1192        110.83  FP32
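Reading the columns as batch size, throughput (img/s), and latency (ms) — an interpretation on my part — the batch-1 row is self-consistent, since throughput at batch 1 should equal one image divided by the per-batch latency:

```shell
# batch 1 at 4.01 ms/batch -> ~249 img/s, matching the first row above
awk 'BEGIN { printf "%.0f img/s\n", 1 / (4.01 / 1000) }'
# prints: 249 img/s
```

At larger batches the reported throughput exceeds batch/latency somewhat, which suggests the harness overlaps batches rather than running them strictly back to back.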

See the attached file (r50.PNG) for a better view.