Video Transcoding using multiple GPUs (32 live streaming jobs)

Hello everyone.

I’m having problems with ffmpeg transcoding.

I have the following configuration:

  • 1 Quadro RTX4000
  • 6 Quadro M4000
  • 32 GB RAM
  • Intel i9-9900X (10 cores, 20 threads)
  • Motherboard Asus WS SAGE X299
  • 2x250GB SSD RAID0
  • I know that one Quadro M4000 is capable of transcoding 8 jobs with the following characteristics:

    Video Input:
    1080p, 24 fps, h264

    Video Output:
    720p, 24 fps, h264
    480p, 24 fps, h264
    360p, 24 fps, h264
    180p, 24 fps, h264

    Live streaming Input (Ethernet) -> Transcoding -> Live streaming output (localhost) (Multi resolution)

    These jobs are live streams, which means I need to sustain 1x speed.
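Some quick arithmetic on what sustaining 1x means in aggregate (using my numbers from above: 32 jobs, 4 renditions per job, 24 fps):

```shell
#!/bin/sh
# Aggregate encode throughput required at 1x: jobs x renditions x fps.
jobs=32
renditions=4
fps=24
echo "encoded frames/sec at 1x: $(( jobs * renditions * fps ))"   # prints 3072
```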

    I’m using the following command:

    ffmpeg -stream_loop -1 -hwaccel_device 0 -hwaccel cuda -hwaccel_output_format cuda \
    -ignore_unknown -threads 1 -re -i 'http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd' \
    -filter_complex '[0:v:3]yadif_cuda,scale_cuda=1280:720,split=4[720p][v1][v2][v3]; \
    [v1]scale_cuda=284:180[180p];[v2]scale_cuda=640:360[360p];[v3]scale_cuda=640:480[480p]' \
    -c:v h264_nvenc -map '[180p]' -b:v 256k -maxrate 256k -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 \
    -r 24 -map '0:4' -c:a copy -b:a 32k    -f mpegts 'udp://127.0.0.1:10000?pkt_size=1316' -c:v h264_nvenc \
    -map '[360p]' -b:v 1228800 -maxrate 1228800 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:20000?pkt_size=1316' -c:v h264_nvenc \
    -map '[480p]' -b:v 2048000 -maxrate 2048000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:30000?pkt_size=1316' -c:v h264_nvenc \
    -map '[720p]' -b:v 3072000 -maxrate 3072000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
    -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:40000?pkt_size=1316'
    

    I want to transcode 56 of these jobs (7 GPUs with 8 jobs each), but today I am only able to transcode 32 jobs (4 GPUs with 8 jobs each). If I launch one more job, the speed starts to drop below 1x.
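To launch that many jobs I use a small wrapper script that round-robins over the GPUs via -hwaccel_device. A simplified sketch follows; the job count, port scheme, and script itself are placeholders for illustration, not my exact setup:

```shell
#!/bin/sh
# Round-robin launcher sketch: job N goes to GPU (N mod 7).
# The port scheme is a made-up example; pick ports that do not collide.
NUM_GPUS=7
JOBS=56
job=0
while [ "$job" -lt "$JOBS" ]; do
  gpu=$(( job % NUM_GPUS ))
  base_port=$(( 10000 + job * 100 ))   # four renditions use base_port+0..+3
  echo "job $job -> GPU $gpu, base port $base_port"
  # Real command would be: ffmpeg -hwaccel_device "$gpu" ... \
  #   -f mpegts "udp://127.0.0.1:$base_port?pkt_size=1316" ...
  job=$(( job + 1 ))
done
```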

    With 32 jobs, the CPU load average is very high, but CPU utilization is only about 20%. RAM bandwidth is at 4% of its
    capacity, and I am not writing to the SSD.

    I have run several analyses with VTune, and the results say I have both frontend-bound and backend-bound stalls, but I am not sure how to interpret them. I suspect the nature of the jobs (live streaming) produces cache misses and branch mispredictions, resulting in stalls. VTune also reports that CPI is too high (>2.5) and that instruction retiring accounts for approximately 15% of clock ticks.

    This is an image from VTune:
    https://drive.google.com/file/d/1naqotnGl1pr8osi9p5mLcSH5dGoVF6ZS/view?usp=sharing

    Does anybody have a similar configuration? What do you recommend to improve performance? Do you think a dual-socket server would help?
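One experiment along those lines would be pinning each ffmpeg process to a fixed set of cores with taskset, grouping jobs by which PCIe host bridge their GPU sits behind (see the topology output below). The core ranges here are arbitrary placeholders, not measured recommendations:

```shell
#!/bin/sh
# Affinity sketch: pick a core set based on which GPU a job uses.
# Core ranges are placeholders; measure before settling on a split.
GPU="${1:-0}"
case "$GPU" in
  0|1|2|3) CORES="0-9"   ;;  # GPUs behind the first PCIe host bridge
  4|5|6)   CORES="10-19" ;;  # GPUs behind the second PCIe host bridge
  *) echo "unknown GPU $GPU" >&2; exit 1 ;;
esac
echo "taskset -c $CORES ffmpeg -hwaccel_device $GPU ..."
# On an actual dual-socket machine, numactl --cpunodebind/--membind
# would replace taskset so memory stays local to the GPU's NUMA node.
```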

    This is the output of nvidia-smi topo --matrix

    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    CPU Affinity
    GPU0     X      PIX     PIX     PIX     SYS     SYS     SYS     0-19
    GPU1    PIX      X      PIX     PIX     SYS     SYS     SYS     0-19
    GPU2    PIX     PIX      X      PIX     SYS     SYS     SYS     0-19
    GPU3    PIX     PIX     PIX      X      SYS     SYS     SYS     0-19
    GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     0-19
    GPU5    SYS     SYS     SYS     SYS     PIX      X      PIX     0-19
    GPU6    SYS     SYS     SYS     SYS     PIX     PIX      X      0-19
    
    Legend:
    
      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing a single PCIe switch
      NV#  = Connection traversing a bonded set of # NVLinks
    

    Probably your issue is related to my findings here:
    https://devtalk.nvidia.com/default/topic/1049717/video-codec-and-optical-flow-sdk/performance-limit-at-around-2500-fps-/

    You will need to patch and recompile ffmpeg with the patch I sent and test if it also fixes your issue.
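In case it helps, the rough build sequence looks like the following. It is printed as a dry run here, and the patch filename is a placeholder; the configure flags are my assumption and vary by ffmpeg version, so adjust them for your checkout:

```shell
#!/bin/sh
# Dry-run sketch of rebuilding ffmpeg with a patch applied.
# run() only prints each command; remove it to actually execute them.
run() { echo "+ $*"; }
run git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
run make -C nv-codec-headers install
run git clone https://git.ffmpeg.org/ffmpeg.git
run git -C ffmpeg apply ../nvenc-limit.patch   # placeholder patch filename
run sh -c 'cd ffmpeg && ./configure --enable-nonfree --enable-cuda-nvcc --enable-nvenc --enable-nvdec'
run make -C ffmpeg -j"$(nproc)"
```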

    Hello malakudi,

    Thanks for your comment. I am going to patch ffmpeg and report back with the results.

    Hello malakudi,

    I have applied your patch and it works.
    I am now transcoding 64 jobs across all seven GPUs (RTX 4000: 16 jobs; each M4000: 8 jobs). Have you found an explanation for why the patch helps?

    If you want, follow up on the open ticket at http://trac.ffmpeg.org/ticket/7674 with your use case and confirm that the fix I posted works for you too.

    My understanding of the code is not deep enough to explain why it affects performance and why commenting it out brings performance back. I found it by cherry-picking commits.
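For what it's worth, git bisect can automate that kind of search. The toy illustration below uses a throwaway repo, with a marker file standing in for the real test (which in your case would be "launch one more job and check whether speed stays at 1x"):

```shell
#!/bin/sh
# Toy git-bisect demo: find the commit that introduced a "slowdown".
# The repo, commits, and grep test are stand-ins for an ffmpeg checkout
# and a real performance check. The temp dir is left behind; clean up as needed.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email dev@example.com
git config user.name dev
echo fast > perf.txt
git add perf.txt
git commit -qm "good: fast encode"
git commit -qm "unrelated change" --allow-empty
echo slow > perf.txt
git commit -qam "bad: introduces the slowdown"
git commit -qm "later work" --allow-empty
git bisect start HEAD HEAD~3 >/dev/null
# Exit 0 marks a commit good, nonzero marks it bad.
git bisect run sh -c 'grep -q fast perf.txt' >/dev/null
culprit=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $culprit"
git bisect reset >/dev/null 2>&1
```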