Slow parallel performance when using three (3) Nvidia P100 cards for encoding/decoding on the same server.

Background:

Right now we have a server set up with three (3) Nvidia P100 cards, mainly for video encoding/decoding using NVENC/NVDEC. We are using FFmpeg compiled against CUDA.

cat /usr/local/cuda/version.txt
CUDA Version 9.1.85

Everything goes fine if I start, let’s say, 12 ffmpeg processes for live encoding from H.264 to HEVC, all of them on card “0”.

No problems; everything runs smoothly.

But as soon as I start using either of the other two Nvidia cards (1 or 2), the output FPS of the first 12 processes (running on card 0) goes down.

As far as I understand, this shouldn’t be an issue, because each ffmpeg process is assigned to a different card. I’m using hwaccel cuvid & scale_npp from ffmpeg, roughly as in the sketch below.
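For reference, each process looks roughly like this (input/output, resolution and bitrate are placeholders, not our exact command; only -hwaccel_device changes per card):

ffmpeg -hwaccel cuvid -hwaccel_device 0 -c:v h264_cuvid -i input.ts \
       -vf scale_npp=1280:720 -c:v hevc_nvenc -b:v 3M -f mpegts output.ts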

Topology (nvidia-smi topo -m):

GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      SYS     SYS     0-11,24-35
GPU1    SYS      X      NODE    12-23,36-47
GPU2    SYS     NODE     X      12-23,36-47

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

nvidia-smi -q | grep Link -A2
        GPU Link Info
            PCIe Generation
                Max                 : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        GPU Link Info
            PCIe Generation
                Max                 : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        GPU Link Info
            PCIe Generation
                Max                 : 3
            Link Width
                Max                 : 16x
                Current             : 16x

Any help will be greatly appreciated,
thanks in advance.

Gonzalo,

I’m dealing with the same issue. Did you ever find a solution?

So far, I’ve traced this problem to the PCIe and QPI buses running at their maximum transfer rate, but I can’t yet explain why cards on different buses, with jobs CPU-pinned to each corresponding bus (see the sketch below), would need to use QPI or the other PCIe bus at all.
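For concreteness, this is the kind of pinning I mean (node assignments assumed from the topology above: GPU0 local to CPUs 0-11,24-35 on node 0, GPU1/GPU2 local to CPUs 12-23,36-47 on node 1; inputs/outputs are placeholders):

numactl --cpunodebind=0 --membind=0 ffmpeg -hwaccel cuvid -hwaccel_device 0 \
        -c:v h264_cuvid -i in0.ts -vf scale_npp=1280:720 -c:v hevc_nvenc out0.ts
numactl --cpunodebind=1 --membind=1 ffmpeg -hwaccel cuvid -hwaccel_device 1 \
        -c:v h264_cuvid -i in1.ts -vf scale_npp=1280:720 -c:v hevc_nvenc out1.ts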