Slow parallel performance when using three (3) Nvidia P100 cards for encoding/decoding on the same server.

Background:

Right now we have a server set up with three (3) Nvidia P100 cards, mainly for video encoding/decoding using NVENC/NVDEC. We are using FFmpeg compiled against CUDA.

cat /usr/local/cuda/version.txt
CUDA Version 9.1.85

Everything goes fine if I start, let’s say, 12 ffmpeg processes for live encoding from H.264 to HEVC, all of them on card “0”.

No problems; everything runs smoothly.

But as soon as I start using either of the other two Nvidia cards (1 or 2), the output FPS of the first 12 processes (running on card 0) goes down.

As far as I understand, this shouldn’t be an issue, because each ffmpeg process is assigned to a different card. I’m using hwaccel cuvid & scale_npp from ffmpeg, roughly as in the sketch below.
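For reference, each process looks roughly like this (input/output, resolution and bitrate are placeholders, not our exact command; only -hwaccel_device changes per card):

ffmpeg -hwaccel cuvid -hwaccel_device 0 -c:v h264_cuvid -i input.ts \
       -vf scale_npp=1280:720 -c:v hevc_nvenc -b:v 3M -f mpegts output.ts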

Topology (nvidia-smi topo -m):

GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      SYS     SYS     0-11,24-35
GPU1    SYS      X      NODE    12-23,36-47
GPU2    SYS     NODE     X      12-23,36-47

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

nvidia-smi -q | grep Link -A2
        GPU Link Info
            PCIe Generation
                Max                 : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        GPU Link Info
            PCIe Generation
                Max                 : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        GPU Link Info
            PCIe Generation
                Max                 : 3
            Link Width
                Max                 : 16x
                Current             : 16x

Any help will be greatly appreciated,
thanks in advance.

Gonzalo,

I’m dealing with the same issue. Did you ever find a solution?

So far, I’ve traced this problem to the PCIe and QPI buses running at their maximum transfer rate, but I can’t yet explain why cards on different buses, with jobs CPU-pinned to each corresponding bus (see the sketch below), would need to use QPI or the other PCIe bus at all.
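For concreteness, this is the kind of pinning I mean (node assignments assumed from the topology above: GPU0 local to CPUs 0-11,24-35 on node 0, GPU1/GPU2 local to CPUs 12-23,36-47 on node 1; inputs/outputs are placeholders):

numactl --cpunodebind=0 --membind=0 ffmpeg -hwaccel cuvid -hwaccel_device 0 \
        -c:v h264_cuvid -i in0.ts -vf scale_npp=1280:720 -c:v hevc_nvenc out0.ts
numactl --cpunodebind=1 --membind=1 ffmpeg -hwaccel cuvid -hwaccel_device 1 \
        -c:v h264_cuvid -i in1.ts -vf scale_npp=1280:720 -c:v hevc_nvenc out1.ts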