NVENC and NVDEC work on only one GPU in multi-GPU setups with NVIDIA Container Toolkit on driver >=570

This is a critical issue: NVENC and NVDEC work on only one GPU in multi-GPU setups with NVIDIA Container Toolkit on driver versions newer than 565, that is, 570 and higher.

It concerns NVENC crashing (because it does not find a CUDA device) on systems with multiple NVIDIA GPUs whenever any index other than ‘0’ is used. Many attempts involved exposing only selected devices with the NVIDIA_VISIBLE_DEVICES environment variable, assigning them by index or by GPU UUID.

Only one GPU works (it may be the first GPU, last GPU, or anything in between), and everything else fails in FFmpeg:

[h264_nvenc @ 0x] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x] No capable devices found

Moreover, GStreamer also fails in a similar way when FFmpeg fails:

nvh264encoder gstnvh264encoder.cpp:2158:gst_nv_h264_encoder_register_cuda:<cudacontext0> Failed to open session
nvh265encoder gstnvh265encoder.cpp:2196:gst_nv_h265_encoder_register_cuda:<cudacontext0> Failed to open session
nvenc gstnvenc.c:685:gst_nv_enc_register: NvEncOpenEncodeSessionEx failed: codec h264, device 0, error code 2
nvenc gstnvenc.c:685:gst_nv_enc_register: NvEncOpenEncodeSessionEx failed: codec h265, device 0, error code 2

The above is on driver version 580.82.07 with five NVIDIA Titan Xp GPUs.
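
For anyone who wants to check each GPU in turn, here is a minimal sketch in Python. It assumes ffmpeg is on the PATH with h264_nvenc and lavfi enabled, and that the environment sees `gpu_count` devices; the helper names are made up for illustration and are not part of FFmpeg. It runs a one-second synthetic encode per index through the encoder's `-gpu` option and looks for the two FFmpeg error strings quoted above.

```python
import subprocess

# Error strings taken from the FFmpeg log lines quoted in this thread.
NVENC_FAILURE_MARKERS = (
    "OpenEncodeSessionEx failed",
    "No capable devices found",
)


def build_ffmpeg_cmd(gpu_index):
    """Build a short synthetic h264_nvenc encode pinned to one GPU index."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-f", "lavfi", "-i", "testsrc=duration=1:size=1280x720:rate=30",
        "-pix_fmt", "yuv420p",
        "-c:v", "h264_nvenc", "-gpu", str(gpu_index),
        "-f", "null", "-",
    ]


def nvenc_failed(stderr_text):
    """Return True if the log contains one of the known failure strings."""
    return any(marker in stderr_text for marker in NVENC_FAILURE_MARKERS)


def probe(gpu_count):
    """Run the test encode on every index; map index -> True if it worked."""
    results = {}
    for index in range(gpu_count):
        proc = subprocess.run(
            build_ffmpeg_cmd(index), capture_output=True, text=True
        )
        results[index] = proc.returncode == 0 and not nvenc_failed(proc.stderr)
    return results
```

Calling `probe(5)` on a five-GPU machine returns a dictionary of index to pass/fail, which makes it easy to see which single GPU still works.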

Driver versions 565 and 550 work fine, so this is a regression in driver versions 570 and higher; I am therefore raising it in the forum for the driver team.

This is widely known to happen in Kubernetes, but it may also happen in Docker.

CC @amrits @generix

We are also closely monitoring this issue. In our tests, FFmpeg NVENC works well within Kubernetes pods on Tesla T4 nodes with multiple GPUs. The GitHub issue "NVENC Fails in Kubernetes Pods on all but the last GPU with Driver 570.x or 580.x" (Issue #1249, NVIDIA/nvidia-container-toolkit) also indicates stable behavior on V100 GPUs. Given these findings, we suspect there may be driver-level issues affecting NVENC support on the GeForce series, particularly models like the 3060, 4090, and 5090. We’re continuing to look into this and will provide updates as we learn more.

It has been confirmed that driver version 565.57.01 does not have this issue, but both the 570 and 580 series are affected. What is the current status regarding this problem?

@ktsong I have confirmed that the 580 driver series has the issue on NVIDIA RTX 5070 Ti GPUs. This forces us to downgrade drivers and OS. When should we expect a fix?

Is anyone looking into this issue? We are also forced to use older drivers which blocks the use of our newly purchased RTX 5090 GPUs in our cluster …

I’ve figured out why it doesn’t work. It has nothing to do with the mismatch of /dev/nvidiaX between container and host.

When NVENC is initialized, NVENC’s user‑space stack (libnvcuvid/libnvidia-encode) queries the NVIDIA Resource Manager via /dev/nvidiactl and gets an “attached GPU IDs” list (NV0000_CTRL_CMD_GPU_GET_ATTACHED_IDS, cmd 0x201) that includes all host GPUs, even inside a 1‑GPU pod.

When that list contains multiple GPUs, the NVENC open path takes a multi‑GPU/peer‑init branch and tries to touch the other GPU’s device node (/dev/nvidiaY), which is not mounted in the pod.

That peer‑init step fails, so the code bails out before class enumeration (0x00800201) and before allocating the required RM object (class 0xC661), returning NV_ENC_ERR_UNSUPPORTED_DEVICE even though the target GPU itself is fine.
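
To make the failure path above concrete, here is a rough model of it in Python. This is not driver code, only a sketch of the behaviour described: the host-wide GPU list is compared with the device nodes that are actually mounted in the pod, and anything missing is a peer the init step would fail on. The function names are hypothetical, and comparing /dev/nvidiaN minor numbers is an assumption for illustration; the IDs RM returns are not necessarily those minors.

```python
import re

# Matches per-GPU nodes such as nvidia0 or nvidia3, but not nvidiactl,
# nvidia-uvm or nvidia-modeset.
GPU_NODE = re.compile(r"^nvidia(\d+)$")


def gpu_minors_from_names(names):
    """Extract GPU minor numbers from a directory listing of /dev."""
    minors = set()
    for name in names:
        match = GPU_NODE.match(name)
        if match:
            minors.add(int(match.group(1)))
    return minors


def unreachable_peers(host_minors, container_minors):
    """GPUs the host-wide list reports but the pod cannot open."""
    return sorted(set(host_minors) - set(container_minors))


# Typical use inside a pod (host_minors has to come from the host side):
#   import os
#   mounted = gpu_minors_from_names(os.listdir("/dev"))
#   print(unreachable_peers(range(5), mounted))
```

In a 1-GPU pod on a five-GPU host this reports four unreachable peers, which matches the situation in which the open path bails out.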

The issue needs to be fixed by Nvidia.

@amrits @generix Can a ticket be opened at NVIDIA regarding the above for 590, 580, and 570 (all currently supported driver branches are affected)?

I can also confirm this problem. This issue currently prevents us from bumping drivers in our cluster and starting to use 5090s - it would be great if this were fixed ASAP!

I investigated further and figured out the internal logic.

  1. libnvidia-encode.so calls libnvcuvid.so to set up/init GPUs
  2. libnvcuvid.so communicates with RM via /dev/nvidiactl, and it can see all GPUs
  3. When multiple GPUs are available, it picks one as the “primary” GPU: the one with the “lexicographically smallest” UUID. It does this even though it could find out which GPU is really available from libcuda.so

So, when the host has multiple GPUs, only the container that holds the GPU with the “smallest” UUID works.

This doesn’t mean that NVENC can only work on one GPU on the host. The “primary” GPU setup happens only during the GPU init phase. If that phase passes, the actual NVENC encoding work can be done on a non-primary GPU.
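
If that reading is right, you can predict which container will work from the host's UUID list. A small sketch in Python follows. It assumes the input is the text printed by `nvidia-smi --query-gpu=index,uuid --format=csv,noheader` on the host, and that plain string comparison is a fair stand-in for whatever ordering the library really applies. The function names are made up for illustration.

```python
def parse_index_uuid(csv_text):
    """Parse 'index, uuid' lines into a dict of index -> UUID string."""
    gpus = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        index, uuid = (field.strip() for field in line.split(",", 1))
        gpus[int(index)] = uuid
    return gpus


def predicted_primary(uuids):
    """Assumption: the 'primary' GPU is the smallest UUID as a string."""
    return min(uuids)


def expected_to_work(container_uuid, host_uuids):
    """True if the container's GPU is the predicted primary on the host."""
    return container_uuid == predicted_primary(host_uuids)
```

Feeding the host output into `parse_index_uuid` and then calling `expected_to_work` with the UUID assigned to a pod gives a quick way to check the prediction against what you observe.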

I’m not sure whether this is intended, since NVIDIA has not fixed it for a long time.