Performance on T4 is over 3x slower than that on 2080Ti on DeepStream5

I’m running a DeepStream app on an RTX 2080Ti as well as on a T4, and I have observed that with the same configuration, performance on the T4 is over 3x slower. Following is the config file:

[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5

[tiled-display]
enable=1
rows=1
columns=1
width=1920
height=1080
gpu-id=0
#(0): nvbuf-mem-default - Default memory allocated, specific to particular platform
#(1): nvbuf-mem-cuda-pinned - Allocate Pinned/Host cuda memory, applicable for Tesla
#(2): nvbuf-mem-cuda-device - Allocate Device cuda memory, applicable for Tesla
#(3): nvbuf-mem-cuda-unified - Allocate Unified cuda memory, applicable for Tesla
#(4): nvbuf-mem-surface-array - Allocate Surface Array memory, applicable for Jetson
nvbuf-memory-type=0

[source0]
enable=1
#Type - 1=CameraV4L2 2=URI 3=MultiURI
type=3
uri=file:///root/apps/video.mp4
num-sources=40
gpu-id=0
# (0): memtype_device   - Memory type Device
# (1): memtype_pinned   - Memory type Host Pinned
# (2): memtype_unified  - Memory type Unified
cudadec-memtype=0

[sink0]
enable=0
#Type - 1=FakeSink 2=EglSink 3=File
type=2
sync=0
source-id=0
gpu-id=0
nvbuf-memory-type=0

[sink1]
enable=0
type=3
#1=mp4 2=mkv
container=1
#1=h264 2=h265
codec=1
sync=0
#iframeinterval=200
bitrate=4000000
output-file=out.mp4
source-id=0

[osd]
enable=0
gpu-id=0
border-width=3
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0

[streammux]
gpu-id=0
##Boolean property to inform muxer that sources are live
live-source=0
batch-size=40
##time out in usec, to wait after the first buffer is available
##to push the batch even if the complete batch is not formed
batched-push-timeout=40000
## Set muxer output width and height
width=1920
height=1080
##Enable to maintain aspect ratio wrt source, and allow black borders, works
##along with width, height properties
enable-padding=0
nvbuf-memory-type=0

# config-file property is mandatory for any gie section.
# Other properties are optional and if set will override the properties set in
# the infer config file.
[primary-gie]
enable=1
gpu-id=0
labelfile-path=/root/apps/model.txt
batch-size=40
bbox-border-color0=1;0;0;1
bbox-border-color1=0;1;1;1
bbox-border-color2=0;0;1;1
bbox-border-color3=0;1;0;1
interval=1
gie-unique-id=1
nvbuf-memory-type=0
config-file=model.txt

[tracker]
enable=1
tracker-width=640
tracker-height=384
gpu-id=0
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_nvdcf.so
ll-config-file=/root/apps/tracker.yml
enable-batch-process=1

[tests]
file-loop=0

In the above screenshot, the RTX on the left runs at about 17 FPS and the T4 on the right at about 4-5 FPS. Both are running the exact same configuration, video, and other settings. I have spent a lot of time trying to debug the cause. Even if the RTX is somewhat more powerful than the T4, I don’t think that should make the T4 over 3x slower. GPU utilization is about 89-92% on the RTX and 96-100% on the T4. Any reason why this might be happening?

• Hardware Platform: RTX 2080Ti and T4
• DeepStream Version: 5
• TensorRT Version: 7.0
• NVIDIA GPU Driver Version (valid for GPU only): 440.33.01 on both the cards

Hi,
These are the peak compute figures for the 2080 Ti:
 14.2 TFLOPS of peak single precision (FP32) performance
 28.5 TFLOPS of peak half precision (FP16) performance

And these are the figures for the T4:
 8.1 TFLOPS single precision (FP32) performance
 65 TFLOPS mixed precision (FP16/FP32) performance
 130 TOPS INT8 precision performance

You can find these numbers here:
2080 Ti: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
T4: https://www.nvidia.com/en-in/data-center/tesla-t4/
Which precision mode is your inference running in?

Besides, you should compare performance when neither GPU is at full utilization. In your test, I see the T4 is almost fully utilized. You could set the interval of nvinfer in the config to a larger value so that more batches are skipped, which eases the GPU load, and see what happens.
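
For reference, the interval change suggested above would go in the [primary-gie] section of the app config posted earlier. This is only a minimal sketch, assuming interval means the number of consecutive batches skipped between inferences (so the current interval=1 infers on every second batch). The value 4 is an illustrative choice, not a recommendation:

```ini
[primary-gie]
enable=1
gpu-id=0
batch-size=40
# interval = number of consecutive batches skipped between inferences.
# interval=1 (current setting): infer on every 2nd batch
# interval=4 (illustrative):    infer on every 5th batch
interval=4
gie-unique-id=1
config-file=model.txt
```

If the T4 FPS rises noticeably with the larger interval and its utilization drops below 100%, that would suggest inference is the bottleneck rather than decode or the tracker.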

The FPS I get is consistent throughout the run of the program. And I’m not sure whether I can use a larger interval setting, since I’m already using interval=1 right now.

Edit:

I was using FP16 on the RTX and FP32 on the T4. If I use FP16 on the T4, I get ~9 FPS, which is still far too low compared to the RTX.
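
For anyone making the same comparison: the precision is set in the nvinfer config file (the one referenced by config-file in [primary-gie]), not in the app config above. A sketch of the relevant key, assuming the standard Gst-nvinfer [property] group, with the rest of that file omitted:

```ini
[property]
gpu-id=0
# network-mode selects the inference precision:
#   0 = FP32, 1 = INT8, 2 = FP16
# Use the same value on both cards so the comparison is like for like.
network-mode=2
```

If a previously built engine file is being reused, it may need to be regenerated after changing this value for the new precision to take effect.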

Edit2:
I read in section 3.4 of the DeepStream Release Notes that, to avoid reduced performance on T4, I need CUDA 10.1 installed together with CUDA 10.2 and NVIDIA driver 418.126.02. Is this bug related to that?

@amycao any update on this? I’m stuck because of this bug and there’s no way around it. Thanks.

You could try the steps in section 3.4 of the DeepStream Release Notes.

Did those release notes get updated? Section 3.4 no longer seems to mention the mixed CUDA solution. I was working on getting that fix applied last week, and when I came back to it today the release notes seemed to have changed.

Yeah, that’s what I was wondering as well. I will try the new GA release and update here. Thanks.