I’m having problems with ffmpeg transcoding.
I have the following configuration:
I know that one Quadro M4000 is capable of transcoding 8 jobs with the following characteristics:
1080p, 24 fps, h264
720p, 24 fps, h264
480p, 24 fps, h264
360p, 24 fps, h264
180p, 24 fps, h264
Live streaming input (Ethernet) → Transcoding → Live streaming output (localhost, multi-resolution)
These jobs are live streams, which means I need to keep a 1x (real-time) speed.
I’m using the following command:
ffmpeg -stream_loop -1 -hwaccel_device 0 -hwaccel cuda -hwaccel_output_format cuda \
  -ignore_unknown -threads 1 -re -i 'http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd' \
  -filter_complex '[0:v:3]yadif_cuda,scale_cuda=1280:720,split=4[720p][v1][v2][v3]; \
    [v1]scale_cuda=284:180[180p];[v2]scale_cuda=640:360[360p];[v3]scale_cuda=640:480[480p]' \
  -c:v h264_nvenc -map '[180p]' -b:v 256k -maxrate 256k -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 \
  -r 24 -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:10000?pkt_size=1316' -c:v h264_nvenc \
  -map '[360p]' -b:v 1228800 -maxrate 1228800 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:20000?pkt_size=1316' -c:v h264_nvenc \
  -map '[480p]' -b:v 2048000 -maxrate 2048000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:30000?pkt_size=1316' -c:v h264_nvenc \
  -map '[720p]' -b:v 3072000 -maxrate 3072000 -vsync 1 -sc_threshold 0 -g 90 -keyint_min 30 -r 24 \
  -map '0:4' -c:a copy -b:a 32k -f mpegts 'udp://127.0.0.1:40000?pkt_size=1316'
I want to transcode 56 of these jobs (7 GPUs with 8 jobs each), but today I'm only able to transcode 32 jobs (4 GPUs with 8 jobs each). If I launch another job, the speed starts to drop below 1x.
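In case the way the jobs are distributed matters: each job is an independent ffmpeg process bound to one GPU via -hwaccel_device, roughly like the sketch below (the input URL is reused from above; the port scheme and showing only the 720p rung are simplifications, not my exact script):

INPUT_URL='http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd'
for gpu in 0 1 2 3 4 5 6; do        # 7 GPUs
  for job in 0 1 2 3 4 5 6 7; do    # 8 jobs per GPU
    port=$(( 10000 + gpu * 8000 + job * 1000 ))   # one UDP port per job (placeholder scheme)
    ffmpeg -hwaccel_device "$gpu" -hwaccel cuda -hwaccel_output_format cuda \
      -re -i "$INPUT_URL" \
      -map 0:v:3 -vf 'yadif_cuda,scale_cuda=1280:720' \
      -c:v h264_nvenc -b:v 3072000 -maxrate 3072000 -g 90 -keyint_min 30 -r 24 \
      -map 0:4 -c:a copy \
      -f mpegts "udp://127.0.0.1:${port}?pkt_size=1316" &
  done
done
wait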
With 32 jobs, the CPU load average is very high, but CPU utilization is only 20%. RAM bandwidth is at about 4% of its capacity, and I'm not writing to the SSD.
I have run several analyses with VTune, and the results say I have problems in both the front-end and the back-end, but I'm not sure how to interpret them. I think the nature of the jobs (live streaming) produces cache misses and branch mispredictions, resulting in stalls. The VTune results also say that the CPI is too high (>2.5) and that instruction retiring accounts for approximately 15% of clock ticks.
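If it helps, I can also collect per-core and per-process numbers outside of VTune with standard tools, for example (sketch; assumes the sysstat package is installed):

mpstat -P ALL 1 10                 # per-core utilization; the overall 20% figure can hide saturated cores
pidstat -t -u -w -C ffmpeg 1 10    # per-thread CPU usage and context switches of the ffmpeg processes
vmstat 1 10                        # run-queue length vs. CPU/IO wait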
This is an image from VTune:
Does anybody have a similar configuration? What do you recommend to improve the performance? Do you think a server with two sockets could improve the performance?
This is the output of nvidia-smi topo --matrix
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  CPU Affinity
GPU0     X    PIX   PIX   PIX   SYS   SYS   SYS   0-19
GPU1    PIX    X    PIX   PIX   SYS   SYS   SYS   0-19
GPU2    PIX   PIX    X    PIX   SYS   SYS   SYS   0-19
GPU3    PIX   PIX   PIX    X    SYS   SYS   SYS   0-19
GPU4    SYS   SYS   SYS   SYS    X    PIX   PIX   0-19
GPU5    SYS   SYS   SYS   SYS   PIX    X    PIX   0-19
GPU6    SYS   SYS   SYS   SYS   PIX   PIX    X    0-19

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
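One thing I have considered but not yet validated is pinning each ffmpeg process to the CPUs and memory closest to its GPU, roughly like this sketch (the node number is an assumption to be checked against lscpu and the matrix above; only a single 720p output is shown):

GPU=0
NODE=0    # assumed NUMA node local to GPU0-GPU3; GPU4-GPU6 would use the other node
numactl --cpunodebind="$NODE" --membind="$NODE" \
  ffmpeg -hwaccel_device "$GPU" -hwaccel cuda -hwaccel_output_format cuda \
    -re -i 'http://dash.akamaized.net/dash264/TestCasesHD/2b/qualcomm/2/MultiRes.mpd' \
    -map 0:v:3 -vf 'yadif_cuda,scale_cuda=1280:720' \
    -c:v h264_nvenc -b:v 3072000 -maxrate 3072000 -g 90 -keyint_min 30 -r 24 \
    -map 0:4 -c:a copy \
    -f mpegts 'udp://127.0.0.1:40000?pkt_size=1316'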