We've run into an issue that we can't explain to ourselves. Maybe somebody here can help us.
We've got an AGX Xavier that we use to train a custom YOLOv4-tiny model (416x416). Inference with this model on the Xavier runs at about 300 FPS using TensorRT and DeepStream.
Recently we rented an Oracle Cloud server with a Tesla V100 16 GB on board and expected roughly a 10x performance increase on most of the tasks we usually run.
According to the hardware specs, the V100 has exactly 10x as many CUDA cores as the Xavier AGX (5120 vs. 512).
But unexpectedly, we see only slightly better performance than on the Xavier AGX:
the same 300 FPS inference speed
2.5x training speed (on the Darknet framework)
What are we doing wrong?
Some details on our environment:
Xavier:
32 GB memory
ARM processor
TensorRT 7
CUDA 10.2
First, your pipeline uses a lot of resources besides the GPU (CPU, codec engines, memory, …). The performance of the whole pipeline depends on every component in it, not only on inference.
Second, even considering inference alone, different models get different speed-ups going from Jetson AGX Xavier to a data-center GPU. Please refer to the Performance — DeepStream 6.1.1 Release documentation: the FaceDetectIR-ResNet18 sample runs at 2007 FPS on AGX Xavier and 5578 FPS on A100, while PeopleNet-ResNet34 goes from 305 FPS on AGX Xavier to 3345 FPS on A100.
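To separate the model's raw speed from pipeline overhead, it can help to benchmark the TensorRT engine on its own with trtexec, which ships with TensorRT. A minimal sketch, assuming a prebuilt engine file named yolov4-tiny-416.engine (the file name and path are placeholders for your own engine):

# trtexec is installed with TensorRT under /usr/src/tensorrt/bin
# --loadEngine times inference only on a prebuilt engine, with no decoding or preprocessing involved
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4-tiny-416.engine --iterations=1000

The throughput it reports is pure GPU inference, so it gives an upper bound you can compare between the Xavier and the V100.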
We ran several experiments and found that the bottleneck in most of our DeepStream pipelines was the video decoder. No matter what you do, you won't get more than ~500 FPS of decoded video out of nvv4l2decoder. It seems the HW decoder can't be shared across multiple streams in a pipeline, even if you try to connect the decoders in parallel.
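A decode-only pipeline along these lines (a sketch, not our exact command; sample.mp4 stands for any H.264 test clip) is enough to see the ceiling, since fpsdisplaysink with a fakesink inside just counts decoded buffers:

# -v makes fpsdisplaysink print current and average FPS while the clip is decoded
gst-launch-1.0 -v \
filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! \
fpsdisplaysink text-overlay=false video-sink=fakesink sync=false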
You also can't run more than one nvinfer at the same time (adding a queue doesn't help); see the sketch below. If you connect them in parallel, they appear to work in a blocking manner, and the processing time grows linearly with the number of nvinfer elements in the pipeline.
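For reference, the parallel layout that did not scale looked roughly like this (a sketch with two branches; on dGPU each nvinfer needs its own nvstreammux in front of it, and the config file is the same one we use elsewhere):

gst-launch-1.0 \
filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! tee name=t \
t. ! queue ! m0.sink_0 \
t. ! queue ! m1.sink_0 \
nvstreammux name=m0 width=1920 height=1080 batch-size=1 ! nvinfer config-file-path=config_infer_primary_yoloV4.txt ! fakesink \
nvstreammux name=m1 width=1920 height=1080 batch-size=1 ! nvinfer config-file-path=config_infer_primary_yoloV4.txt ! fakesink

Doubling the branches roughly doubles the processing time instead of the throughput.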
The only scenario where we got an impressive ~3000 FPS was this one:
gst-launch-1.0 \
filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! queue ! tee name=t \
t. ! queue ! m.sink_0 \
t. ! queue ! m.sink_1 \
t. ! queue ! m.sink_2 \
t. ! queue ! m.sink_3 \
t. ! queue ! m.sink_4 \
t. ! queue ! m.sink_5 \
t. ! queue ! m.sink_6 \
t. ! queue ! m.sink_7 \
nvstreammux name=m width=1920 height=1080 batch-size=8 ! queue ! \
nvinfer config-file-path=config_infer_primary_yoloV4.txt batch-size=8 ! fakesink
Here we use a single decoder (~500 FPS output), then multiply its output by 8 with the "tee" element (~500 FPS x 8), then feed it as a batch of 8 into a single nvinfer (batch-size=8) via nvstreammux, and voilà: we see ~3000 FPS of pure inference that is no longer bottlenecked by the video decoder.
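If you want gst-launch to report the throughput of this batched pipeline directly, one option (a sketch) is to swap the final fakesink for fpsdisplaysink and run gst-launch-1.0 with -v to see the printed rate:

... ! nvinfer config-file-path=config_infer_primary_yoloV4.txt batch-size=8 ! fpsdisplaysink text-overlay=false video-sink=fakesink sync=false

Note that downstream of nvstreammux one buffer carries a whole batch, so the number fpsdisplaysink prints is batches per second; multiply it by 8 to get the per-frame rate.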
P.S. All experiments were performed on the same Tesla V100 dGPU.