Performance on T4 is over 3x slower than that on 2080Ti on DeepStream5

geralt_of_rivia · July 31, 2020, 6:20am

I’m running a DeepStream app on an RTX 2080Ti and on a T4, and with the same configuration the T4 is over 3x slower. The config file follows:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

[tiled-display]
enable=1
rows=1
columns=1
width=1920
height=1080
gpu-id=0
#(0): nvbuf-mem-default - Default memory allocated, specific to particular platform
#(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory, applicable for Tesla
#(2): nvbuf-mem-cuda-device - Allocate Device cuda memory, applicable for Tesla
#(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory, applicable for Tesla
#(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson
nvbuf-memory-type=0

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=3
uri=file:///root/apps/video.mp4
num-sources=40
gpu-id=0
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
cudadec-memtype=0

[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=0
source-id=0
gpu-id=0
nvbuf-memory-type=0

[sink1]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265
codec=1
sync=0
#iframeinterval=200
bitrate=4000000
output-file=out.mp4
source-id=0

[osd]
enable=0
gpu-id=0
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0

[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=0
batch-size=40
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=1920
height=1080
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
gpu-id=0
labelfile-path=/root/apps/model.txt
batch-size=40
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=1
gie-unique-id=1
nvbuf-memory-type=0
config-file=model.txt

[tracker]
enable=1
tracker-width=640
tracker-height=384
gpu-id=0
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_nvdcf.so
ll-config-file=/root/apps/tracker.yml
enable-batch-process=1

[tests]
file-loop=0

In the above screenshot, the RTX (left) runs at about 17 FPS and the T4 (right) at about 4-5 FPS. Both run the exact same configuration, video, and other settings. I have spent a lot of time trying to find the cause. Even if the RTX is somewhat more powerful than the T4, I would not expect the T4 to be 3x slower. GPU utilization is about 89-92% on the RTX and 96-100% on the T4. Any idea why this might be happening?

• Hardware Platform: RTX 2080Ti and T4
• DeepStream Version: 5
• TensorRT Version: 7.0
• NVIDIA GPU Driver Version (valid for GPU only): 440.33.01 on both the cards

Amycao · July 31, 2020, 10:44am

Hi,
Peak throughput for the 2080 Ti:
 14.2 TFLOPS peak single precision (FP32)
 28.5 TFLOPS peak half precision (FP16)

Peak throughput for the T4:
 8.1 TFLOPS single precision (FP32)
 65 TFLOPS mixed precision (FP16/FP32)
 130 TOPS INT8 precision

The 2080 Ti figures come from https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf and the T4 figures from the "NVIDIA T4 Tensor Core GPU for AI Inference | NVIDIA Data Center" page.
Which precision mode is your inference running in?

Also, you should compare performance when neither GPU is at full utilization. In your test the T4 is almost fully utilized. You could set the interval of nvinfer in the config to a larger value so that more batches are skipped, which eases the GPU load, and see what happens.
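
A minimal sketch of that suggestion, using the [primary-gie] section from the config posted above. The value 4 is only an illustration and is not a number recommended anywhere in this thread; every other key in the section is assumed to stay as posted.

```ini
[primary-gie]
enable=1
gpu-id=0
batch-size=40
# interval = number of consecutive batches skipped between inference calls.
# The posted config uses interval=1; 4 is an arbitrary illustrative value.
interval=4
gie-unique-id=1
config-file=model.txt
```

With a larger interval the detector runs on fewer batches. If the FPS gap between the two cards narrows once the T4 drops below full utilization, that points at inference load rather than decode or muxing.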

geralt_of_rivia · July 31, 2020, 11:13am

The FPS I see is consistent throughout the run. I’m not sure I can use a larger interval setting, since I’m already using interval=1.

Edit:

I was using FP16 on the RTX and FP32 on the T4. With FP16 on the T4 I get ~9 FPS, which is still far too low compared to the RTX.

Edit2:
I read in section 3.4 of the DeepStream Release Notes that, to avoid reduced performance on T4, I need CUDA 10.1 installed alongside CUDA 10.2 and NVIDIA driver 418.126.02. Is this issue related to that?
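
For reference on the FP16/FP32 mismatch mentioned in the first edit: in Gst-nvinfer the precision is normally chosen in the inference config file that config-file points to (model.txt in the posted config). That file is not shown in this thread, so the fragment below is a sketch that assumes it uses the standard [property] group.

```ini
[property]
# Inference precision used when the TensorRT engine is built:
# 0=FP32, 1=INT8, 2=FP16
network-mode=2
```

Using the same network-mode value on both machines keeps the comparison like for like.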

geralt_of_rivia · August 2, 2020, 4:56am

@Amycao any update on this? I’m stuck because of this bug and there’s no workaround. Thanks.

Amycao · August 3, 2020, 10:01am

You could try the steps in section 3.4 of the DeepStream Release Notes.

qianyilong · August 10, 2020, 5:14pm

Did those release notes get updated? Section 3.4 no longer seems to cover the mixed CUDA solution. I was working on applying that fix last week, and when I came back to it today the release notes seemed to have changed.

geralt_of_rivia · August 10, 2020, 5:16pm

Yeah, that’s what I was wondering as well. I will try the new GA release and update here. Thanks.