Inference speed on A30 seems slow

A30 running RetinaNet with a ResNet18 backbone.
TensorRT 8.0.1, CUDA 11.5.

Inference speed with the INT8 engine is ~550 fps on the A30,
while on a T4 it is ~305 fps.
Shouldn't the A30 deliver more fps?

May I know how you checked the inference speed on the A30 and T4?

Hi Morgan, thank you for your reply. The check was done using trtexec.
I followed these steps:
1. Build libnvinfer_plugin.so.8.0.1 from the TensorRT OSS sources.
2. Copy the new libnvinfer_plugin.so.8.0.1 to where trtexec can find it.
3. Run trtexec with the INT8 engine converted from the unpruned RetinaNet .tlt file.
For the A30 the result is 550 fps; for the T4, 305 fps.

git clone -b release/8.0 https://github.com/nvidia/TensorRT
cd TensorRT/
git submodule update --init --recursive
export TRT_SOURCE=`pwd`
cd $TRT_SOURCE
mkdir build
cd build
cmake .. -DGPU_ARCHS=80 -DTRT_LIB_DIR=/usr/lib/x86_64-linux-gnu/ -DCMAKE_C_COMPILER=/usr/bin/gcc -DTRT_BIN_DIR=`pwd`/out
make nvinfer_plugin -j$(nproc)
cp /home/dell/Deepstream_6.0_Triton_w_TRTOSS/TensorRT/build/libnvinfer_plugin.so.8.0.1 /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.0.1
root@8c264ed9cce8:/opt/nvidia/deepstream/deepstream-6.0/TensorRT/build# /usr/src/tensorrt/bin/trtexec --batch=1 --loadEngine=/home/dell/TAO_experiments/retinanet/save_export/trt.int8.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --batch=1 --loadEngine=/home/dell/TAO_experiments/retinanet/save_export/trt.int8.engine

[02/18/2022-03:48:34] [I] === Performance summary ===
[02/18/2022-03:48:34] [I] Throughput: 550.952 qps
[02/18/2022-03:48:34] [I] Latency: min = 2.02393 ms, max = 3.0589 ms, mean = 2.07785 ms, median = 2.07349 ms, percentile(99%) = 2.12457 ms
[02/18/2022-03:48:34] [I] End-to-End Host Latency: min = 2.03735 ms, max = 5.41118 ms, mean = 3.54507 ms, median = 3.54333 ms, percentile(99%) = 3.55957 ms
[02/18/2022-03:48:34] [I] Enqueue Time: min = 0.575287 ms, max = 0.791992 ms, mean = 0.590547 ms, median = 0.581482 ms, percentile(99%) = 0.662598 ms
[02/18/2022-03:48:34] [I] H2D Latency: min = 0.220947 ms, max = 0.36792 ms, mean = 0.264913 ms, median = 0.262451 ms, percentile(99%) = 0.310577 ms
[02/18/2022-03:48:34] [I] GPU Compute Time: min = 1.79404 ms, max = 2.74124 ms, mean = 1.80616 ms, median = 1.80432 ms, percentile(99%) = 1.8114 ms
[02/18/2022-03:48:34] [I] D2H Latency: min = 0.00585938 ms, max = 0.00968933 ms, mean = 0.00678125 ms, median = 0.00671387 ms, percentile(99%) = 0.00799561 ms
[02/18/2022-03:48:34] [I] Total Host Walltime: 3.00389 s
[02/18/2022-03:48:34] [I] Total GPU Compute Time: 2.98919 s
[02/18/2022-03:48:34] [I] Explanations of the performance metrics are printed in the verbose logs.
[02/18/2022-03:48:34] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --batch=1 --loadEngine=/home/dell/TAO_experiments/retinanet/save_export/trt.int8.engine
[02/18/2022-03:48:34] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1600, GPU 1166 (MiB)
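By the way, in case it's useful: the throughput figure can be pulled out of a saved trtexec log with a small shell one-liner (just a sketch; here I feed it the summary line from the run above instead of a real log file):

```shell
# Summary line copied from the trtexec run above; in practice,
# pipe in a saved log (e.g. trtexec ... > run.log 2>&1) instead.
line='[02/18/2022-03:48:34] [I] Throughput: 550.952 qps'
echo "$line" | grep -o 'Throughput: [0-9.]* qps' | awk '{print $2}'
# prints: 550.952
```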

Thanks for the info.
It is expected that the A30 gets better fps than the T4.

Yes, I was hoping the A30 would be more like 4x faster, but I am only seeing ~1.8x (550/305). Do you think I should expect more?

Judging just from the spec sheets: A30 = 330 TOPS, T4 = 130 TOPS.

But of course there are many other considerations; I would just be interested in your thoughts.
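To put numbers on that gap (measured fps from the runs above, TOPS from the datasheets; just back-of-the-envelope arithmetic):

```shell
# Measured fps ratio vs the naive INT8 TOPS ratio (awk for float math).
awk 'BEGIN {
  observed = 550 / 305   # A30 fps / T4 fps, measured above
  spec     = 330 / 130   # A30 TOPS / T4 TOPS, from the datasheets
  printf "observed %.2fx vs spec-sheet %.2fx\n", observed, spec
}'
# prints: observed 1.80x vs spec-sheet 2.54x
```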

Also, would you know if there is a RetinaNet / ResNet18 benchmark?

Could you share the spec? Thanks.

A30 Tensor Core GPU for AI Inference | NVIDIA

t4-tensor-core-datasheet-951643.pdf (nvidia.com)

Thanks for the info. Could you check more results with different batch sizes? The results in A30 Tensor Core GPU for AI Inference | NVIDIA also compare different batch sizes.

Hi Morgan, thank you for the comparison info, it is very helpful. I am continuing to make some comparisons. For the INT8, unpruned case with the ResNet18 backbone:
batch size 8 = 1025 fps
batch size 16 = 1096 fps
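As a quick sanity check, the per-image time implied by those throughputs (batch sizes and fps from above; this ignores host-side overhead):

```shell
# fps at batch size 1, 8, 16 on the A30 (numbers from the runs above);
# derive the implied milliseconds per image.
awk 'BEGIN {
  split("1 8 16", bs); split("550 1025 1096", fps)
  for (i = 1; i <= 3; i++)
    printf "bs=%s: %.2f ms/image\n", bs[i], 1000 / fps[i]
}'
# bs=1 -> 1.82 ms, bs=8 -> 0.98 ms, bs=16 -> 0.91 ms
```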

I see from the caption on the diagram you posted the following:

TensorRT, NGC Container 20.12, Latency <7ms, Dataset=Synthetic,​ 1x GPU: T4 (BS=31, INT8)  |  V100 (BS=43, Mixed precision)  |  A30 (BS=96, INT8)  |  A100 (BS=174, INT8)

Just wondering, are you aware of any other flags used when running trtexec on the A30 or T4 in this case?
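For reference, this is roughly how I am sweeping batch sizes with trtexec (a sketch; the DRY_RUN switch is just mine so the loop can be previewed without a GPU):

```shell
# Batch-size sweep over the same engine (paths from the runs above).
# DRY_RUN=1 only prints each command; set DRY_RUN=0 to actually run.
ENGINE=/home/dell/TAO_experiments/retinanet/save_export/trt.int8.engine
DRY_RUN=1
for bs in 1 8 16; do
  cmd="/usr/src/tensorrt/bin/trtexec --batch=$bs --loadEngine=$ENGINE"
  if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else $cmd; fi
done
```

Note that `--batch` only applies to implicit-batch engines; for an explicit-batch engine the batch dimension is fixed at build time (or supplied via `--shapes` for dynamic shapes).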

Please note that the above figure is about image classification.

Right, thank you once again for the information!
