We've run into an issue that we can't explain to ourselves. Maybe somebody here can help us.
We've got an AGX Xavier that we use to train a custom YOLOv4-tiny model (416x416). Inference with this model on the Xavier runs at about 300 FPS using TensorRT and DeepStream.
Recently we rented an Oracle Cloud server with a Tesla V100 16 GB on board and expected roughly a 10x performance increase on most of the tasks we usually run.
According to the hardware specs, the V100 has exactly 10x as many CUDA cores as the Xavier AGX (5120 vs. 512).
But unexpectedly, we see only slightly better performance than on the Xavier AGX:
the same 300 FPS inference speed
2.5x training speed (on the Darknet framework)
What are we doing wrong?
Some details on our environment:
Xavier:
32 GB memory
ARM processor
TensorRT 7
CUDA 10.2
First, your pipeline uses a lot of resources besides the GPU (CPU, codec engines, memory, …). The performance of the whole pipeline depends on every component in it, not only on inference.
Second, even considering inference alone, different models get different speed-ups going from Jetson AGX Xavier to a data-center GPU. Please refer to the Performance — DeepStream 6.1.1 Release documentation: the FaceDetectIR-ResNet18 sample runs at 2007 FPS on AGX Xavier and 5578 FPS on A100, while PeopleNet-ResNet34 goes from 305 FPS on AGX Xavier to 3345 FPS on A100.
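To separate the model's raw speed from pipeline overhead, it can help to benchmark the TensorRT engine on its own with trtexec, which ships with TensorRT. A minimal sketch, assuming a prebuilt engine file named yolov4-tiny-416.engine (the file name and path are placeholders for your own engine):

# trtexec is installed with TensorRT under /usr/src/tensorrt/bin
# --loadEngine times inference only on a prebuilt engine, with no decoding or preprocessing involved
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4-tiny-416.engine --iterations=1000

The throughput it reports is pure GPU inference, so it gives an upper bound you can compare between the Xavier and the V100.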
We ran several experiments and found that the bottleneck in most of our DeepStream pipelines was the video decoder. No matter what you do, you won't get more than ~500 FPS of decoded video out of nvv4l2decoder. It seems the HW decoder can't be shared across multiple streams in a pipeline, even if you try to connect the decoders in parallel.
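A decode-only pipeline along these lines (a sketch, not our exact command; sample.mp4 stands for any H.264 test clip) is enough to see the ceiling, since fpsdisplaysink with a fakesink inside just counts decoded buffers:

# -v makes fpsdisplaysink print current and average FPS while the clip is decoded
gst-launch-1.0 -v \
filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! \
fpsdisplaysink text-overlay=false video-sink=fakesink sync=false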
You also can't run more than one nvinfer at the same time (adding a queue doesn't help); see the sketch below. If you connect them in parallel, they appear to work in a blocking manner, and the processing time grows linearly with the number of nvinfer elements in the pipeline.
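For reference, the parallel layout that did not scale looked roughly like this (a sketch with two branches; on dGPU each nvinfer needs its own nvstreammux in front of it, and the config file is the same one we use elsewhere):

gst-launch-1.0 \
filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! tee name=t \
t. ! queue ! m0.sink_0 \
t. ! queue ! m1.sink_0 \
nvstreammux name=m0 width=1920 height=1080 batch-size=1 ! nvinfer config-file-path=config_infer_primary_yoloV4.txt ! fakesink \
nvstreammux name=m1 width=1920 height=1080 batch-size=1 ! nvinfer config-file-path=config_infer_primary_yoloV4.txt ! fakesink

Doubling the branches roughly doubles the processing time instead of the throughput.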
The only scenario where we got an impressive ~3000 FPS was this one:
gst-launch-1.0 \
filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! queue ! tee name=t \
t. ! queue ! m.sink_0 \
t. ! queue ! m.sink_1 \
t. ! queue ! m.sink_2 \
t. ! queue ! m.sink_3 \
t. ! queue ! m.sink_4 \
t. ! queue ! m.sink_5 \
t. ! queue ! m.sink_6 \
t. ! queue ! m.sink_7 \
nvstreammux name=m width=1920 height=1080 batch-size=8 ! queue ! \
nvinfer config-file-path=config_infer_primary_yoloV4.txt batch-size=8 ! fakesink
Here we use a single decoder (~500 FPS output), then multiply its output by 8 with the "tee" element (~500 FPS x 8), then feed it as a batch of 8 into a single nvinfer (batch-size=8) via nvstreammux, and voilà: we see ~3000 FPS of pure inference that is no longer bottlenecked by the video decoder.
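If you want gst-launch to report the throughput of this batched pipeline directly, one option (a sketch) is to swap the final fakesink for fpsdisplaysink and run gst-launch-1.0 with -v to see the printed rate:

... ! nvinfer config-file-path=config_infer_primary_yoloV4.txt batch-size=8 ! fpsdisplaysink text-overlay=false video-sink=fakesink sync=false

Note that downstream of nvstreammux one buffer carries a whole batch, so the number fpsdisplaysink prints is batches per second; multiply it by 8 to get the per-frame rate.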
P.S. All experiments were performed on the same Tesla V100 dGPU.