ehfd
October 11, 2025, 9:19am
1
opened 07:42AM - 18 Aug 25 UTC
### Summary
We are experiencing a critical regression with NVENC hardware encoding when using NVIDIA driver version `570.x` in a multi-GPU Kubernetes environment. On a node with four identical GPUs, any containerized application managed by the GPU Operator can only use the NVENC encoder successfully if it is scheduled on the last enumerated GPU (e.g., GPU 3 of 4). Pods scheduled on any other GPU (0, 1, or 2) fail to initialize the encoder.
This issue is a clear regression, as the entire setup works perfectly with the `550.x` driver series. Host-level encoding works on all cards, and we have confirmed there is **no** driver version mismatch between the host and the container. The problem appears to be specific to how the 570.x driver exposes NVENC capabilities to the containerized environment in a multi-GPU configuration.
### Environment Details
* **Hardware:**
* **CPU:** AMD Ryzen Threadripper 7970X (32-Cores)
* **GPU:** 4 x NVIDIA GeForce RTX 4080 SUPER
* **Motherboard:** ASUSTeK Pro WS TRX50-SAGE WIFI
* **Software:**
* **Orchestrator:** Kubernetes
* **GPU Management:** NVIDIA GPU Operator
* **Host Driver (Problematic):** `570.x` (e.g., 570.124.06)
* **Host Driver (Working):** `550.x` series
* **Container:** Using a container with correctly matched user-space libraries for the host driver.
* **Application:** An Unreal Engine-based rendering service, and standard `ffmpeg`.
### Steps to Reproduce
1. Configure a Kubernetes node with multiple identical GPUs (e.g., 4x 4080 SUPER) and install NVIDIA host driver `570.x`.
2. Deploy the NVIDIA GPU Operator.
3. Deploy a Kubernetes `Deployment` that requests a single GPU (`spec.containers.resources.limits: nvidia.com/gpu: 1`).
4. Ensure pods from the Deployment are scheduled on different physical GPUs (e.g., GPU 0, GPU 1, etc.).
5. Inside a pod scheduled on any GPU *except the last one*, attempt to initialize an NVENC session using any application (`ffmpeg`, custom code, etc.).
### Expected Behavior
The containerized application should be able to successfully initialize the NVENC hardware encoder and perform video encoding, regardless of which physical GPU (0, 1, 2, or 3) is assigned to the pod.
### Actual Behavior
1. **Consistent Failure on first N-1 GPUs:** NVENC initialization fails on pods assigned to GPU 0, GPU 1, and GPU 2.
2. **Consistent Success on the last GPU:** A pod that is scheduled on GPU 3 works perfectly and can encode video without issue.
3. **Application-Agnostic Failure:** The issue is not tied to our application. A standard `ffmpeg` command inside a failing pod reproduces the error perfectly:
```bash
$ ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 10 -c:v h264_nvenc -f null -
...
[h264_nvenc @ 0x55de29791c00] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x55de29791c00] No capable devices found
Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0
```
4. Our Unreal Engine application logs corresponding errors:
```
LogAVCodecs: Error: Error Creating: Failed to create encoder [NVENC 2]
LogPixelStreaming: Error: Could not create encoder.
```
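The failure signature in the `ffmpeg` output above is stable enough to match on, for example in a pod health probe. A minimal sketch, assuming the log text has already been captured into a variable; the function name `has_nvenc_unsupported_device` is hypothetical and not part of any NVIDIA or FFmpeg tooling:

```bash
#!/usr/bin/env bash
# Sketch only: detect the NVENC "unsupported device" signature quoted above.

# has_nvenc_unsupported_device LOG_TEXT
#   Returns 0 if LOG_TEXT contains the failure line, 1 otherwise.
has_nvenc_unsupported_device() {
  # -F treats the pattern as a fixed string, so the parentheses are literal.
  printf '%s\n' "$1" | grep -qF 'OpenEncodeSessionEx failed: unsupported device (2)'
}

# Example use inside a pod (requires an NVENC-enabled ffmpeg):
#   log="$(ffmpeg -f lavfi -i testsrc=size=1920x1080:rate=30 -t 1 \
#            -c:v h264_nvenc -f null - 2>&1)"
#   if has_nvenc_unsupported_device "$log"; then
#     echo "NVENC unavailable on this pod's GPU" >&2
#   fi
```

This only reports the symptom; it does not distinguish this regression from other causes of the same NVENC error code.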
### Troubleshooting and Analysis
* **This is a clear regression,** as downgrading the host driver to the `550.x` series resolves the issue completely on the exact same hardware and software stack.
* The issue is **specific to the container environment.** Running `ffmpeg` with NVENC directly on the host OS works correctly for all 4 GPUs simultaneously.
* The problem is tied to the **logical GPU index**, not a specific faulty card. Physically swapping the GPUs does not change the behavior; the failure always occurs on the first N-1 logical GPUs.
* Based on this evidence, the behavior strongly suggests a bug in the `570.x` driver or a related component of the GPU Operator toolkit. The issue likely lies in the enumeration or initialization process for NVENC capabilities when exposing them to a container in a multi-GPU system.
### Workaround
The only known workaround is to **downgrade the NVIDIA host driver to a version in the 550.x series.**
opened 01:07AM - 30 Jul 25 UTC
So, this is a weird and pretty specific problem. I am at a loss as to what I can do next because I am unsure if this is an issue with nvidia or ffmpeg (nvdec specifically). The issue has been observed in the frigate tensorrt images and ubuntu cuda images with ffmpeg 7.
ffmpeg crashes (because it cannot find a CUDA device) on a host with multiple nvidia gpus whenever I try to use any index other than '0'. I have tried exposing only the target device through the NVIDIA_VISIBLE_DEVICES env var, assigning it by index or by GPU-UUID. The weird part is that I can load ONNX models onto GPU 1, which suggests this may be something specific to ffmpeg.
I am opening this issue to seek advice and to see if there are any other users having this issue.
Error from ffmpeg output:
```logs
2025-07-29 16:57:53.773105842 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [AVHWDeviceContext @ 0x5dfde7cb58c0] cu->cuDeviceGet(&hwctx->internal->cuda_device, device_idx) failed -> CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2025-07-29 16:57:53.773621224 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : Device creation failed: -542398533.
2025-07-29 16:57:53.774103329 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [vist#0:0/h264 @ 0x5dfde7c75f40] [dec:h264 @ 0x5dfde7d32440] No device available for decoder: device type cuda needed for codec h264.
2025-07-29 16:57:53.774607022 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [vist#0:0/h264 @ 0x5dfde7c75f40] [dec:h264 @ 0x5dfde7d32440] Hardware device setup failed for decoder: Generic error in an external library
2025-07-29 16:57:53.775085378 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : [vost#0:0/rawvideo @ 0x5dfde7c80000] Error initializing a simple filtergraph
2025-07-29 16:57:53.775575817 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : Error opening output file pipe:.
2025-07-29 16:57:53.776048525 [2025-07-29 16:57:53] ffmpeg.alley.detect ERROR : Error opening output files: Generic error in an external library
```
Edit: Other issues where this error occurs:
- https://github.com/blakeblackshear/frigate/discussions/18018
- https://github.com/blakeblackshear/frigate/discussions/18722
I have also opened a ticket on the ffmpeg bug tracker: https://trac.ffmpeg.org/ticket/11694
opened 01:21PM - 06 Jun 25 UTC
lifecycle/stale
### 🐛 Describe the bug
When deploying GPU-bound pods using the NVIDIA device plugin (`nvidia-device-plugin` Helm chart v0.17.1), **FFmpeg NVENC fails inside the container unless the assigned GPU is mounted at the path `/dev/nvidiaN` where `N` matches its `index` in `nvidia-smi`.**
This issue occurs **only when using `deviceListStrategy: volume-mounts`**, which is required for secure GPU isolation in our multi-tenant environment. Using `envvar` is not an option, as users can override `NVIDIA_VISIBLE_DEVICES` in untrusted Docker images.
As a result, **only pods where the assigned GPU's `nvidia-smi` index matches the container path `/dev/nvidiaN` succeed. All others fail with `unsupported device` errors in FFmpeg**.
---
### 🛠️ Helm values
```yaml
deviceIDStrategy: uuid
deviceListStrategy: volume-mounts
runtimeClassName: nvidia
```
---
### 🧠 Root cause
NVENC appears to rely on the assumption that:
```
/dev/nvidiaN <—> GPU with index N from `nvidia-smi`
```
If this alignment is broken (e.g. GPU with `index: 0` is mounted as `/dev/nvidia5`), the encoder fails:
```
[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found
```
This behavior is reproducible and consistent across all tested environments.
---
### 🖥️ Host configuration
* 6× NVIDIA RTX 4090 (UUID-assigned, known-good hardware)
* Host `/dev/nvidia[0-5]` layout matches `nvidia-smi` output
* `nvidia-smi`, CUDA, and NVENC work fine directly on host
* Issue **only occurs inside container** when mount path/index diverge from `nvidia-smi`
---
### ✅ Working pod example
* GPU UUID: `GPU-46b5dd79-...`
* `nvidia-smi index`: `0`
* Mounted as: `/dev/nvidia0`
* ✅ `ffmpeg -c:v h264_nvenc` works
---
### ❌ Failing pod example
* GPU UUID: `GPU-dada647b-...`
* `nvidia-smi index`: `0`
* Mounted as: `/dev/nvidia5`
* ❌ `ffmpeg -c:v h264_nvenc` fails with:
```text
[h264_nvenc @ 0x637317ea8e80] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x637317ea8e80] No capable devices found
```
---
### 🔍 Additional observations
* All expected character devices (`nvidia[0-9]`, `nvidiactl`, `uvm`, etc.) are present inside the pod.
* The mounted `/dev/nvidiaX` files have correct major/minor numbers.
* The issue **only depends on the alignment between `nvidia-smi index` and the mounted path**.
* The `Device Minor:` in `/proc/driver/nvidia/gpus/.../information` **does not determine NVENC success**, only the mount path does.
---
### ✅ Expected behavior
All GPUs assigned to a container should be fully usable via NVENC — regardless of physical or logical index — **as long as the device is properly mounted**.
The device plugin should ensure that **`/dev/nvidiaN` always maps to the GPU with `nvidia-smi index N`**, or NVENC workloads will fail.
---
### 🌎 Environment
* **Host OS:** Ubuntu 22.04
* **GPUs:** 6× NVIDIA RTX 4090
* **Container runtime:** containerd
* **Kubernetes:** v1.32.x (K3s)
* **NVIDIA Driver:** 570.133.20 (also tested with 575)
* **NVIDIA device plugin:** v0.17.1 (Helm)
* **nvidia-container-runtime:** 3.14.0-1
* **nvidia-container-toolkit:** 1.17.6-1
* **NVIDIA_DRIVER_CAPABILITIES:** `compute,video,utility,graphics,display` (set in the deployment image)
* **FFmpeg:** NVENC-enabled build (confirmed working directly on host)
---
### 🧪 Steps to reproduce
1. Deploy multiple pods with:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
2. Inside each pod, run:
```bash
nvidia-smi --query-gpu=gpu_uuid,index,name --format=csv,noheader
ls -l /dev/nvidia[0-9]
ffmpeg -hide_banner -f lavfi -i testsrc=duration=3:size=1280x720:rate=30 -c:v h264_nvenc -y /tmp/test.mp4
```
3. Observe:
* If `/dev/nvidiaN` matches the `index: N` reported by `nvidia-smi`, encoding works.
* If not, FFmpeg fails.
---
### 💡 Suggested improvement
Ensure the device plugin **mounts GPU devices inside the pod at the `/dev/nvidiaN` path where `N` is the GPU's index reported by `nvidia-smi`**.
This will restore NVENC compatibility and likely benefit other workloads that rely on this path/index alignment.
---
### 🚫 Partial workaround
None identified.
Detecting the mismatch inside user space (via `nvidia-smi` + `ls -l /dev/nvidia*`) lets us fail fast, but does not resolve the root problem — NVENC will still fail to initialize.
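The fail-fast check mentioned above can be sketched as follows. This is an illustration under assumptions, not the reporter's actual script: the function name `check_alignment` is hypothetical, and it only tests whether each index reported by `nvidia-smi` has a matching `/dev/nvidiaN` node.

```bash
#!/usr/bin/env bash
# Sketch only: flag GPUs whose nvidia-smi index has no matching /dev/nvidiaN.

# check_alignment INDICES DEVICE_PATHS
#   INDICES:      newline-separated GPU indices, as printed by nvidia-smi
#   DEVICE_PATHS: newline-separated device node paths (e.g. /dev/nvidia5)
#   Prints one line per mismatch; returns 1 if any index lacks its node.
check_alignment() {
  local indices="$1" paths="$2" idx status=0
  while IFS= read -r idx; do
    [ -z "$idx" ] && continue
    # -x matches the whole line, -F treats the path as a fixed string.
    if ! printf '%s\n' "$paths" | grep -qxF "/dev/nvidia${idx}"; then
      echo "mismatch: index ${idx} has no /dev/nvidia${idx}"
      status=1
    fi
  done <<< "$indices"
  return "$status"
}

# Example use inside a pod (requires nvidia-smi):
#   check_alignment \
#     "$(nvidia-smi --query-gpu=index --format=csv,noheader)" \
#     "$(ls /dev/nvidia[0-9]* 2>/dev/null)" || exit 1
```

As noted above, this only lets the workload exit early; it does not make NVENC initialize.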
opened 11:15AM - 16 Apr 25 UTC
bug
### NVIDIA Open GPU Kernel Modules Version
570.86.16
### Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [x] I confirm that this does not happen with the proprietary driver package.
### Operating System and Version
Ubuntu 22.04.5 LTS
### Kernel Release
5.15.0-113-generic
### Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
### Hardware: GPU
NVIDIA GeForce RTX 4090
### Describe the bug
After installing version 570.86.16 of the open GPU kernel modules, I encountered an error when using the NVENC functionality inside a container with Nvidia Container Runtime. The error indicates that the device is unsupported. However, when running the application directly on the host machine, the NVENC feature works correctly.
I confirm that this does not happen with the proprietary driver package.
### To Reproduce
`docker run --runtime=nvidia --gpus '"device=0,1"' jrottenberg/ffmpeg:4.1-nvidia -report -loglevel debug -f lavfi -i testsrc=duration=5:size=1280x720:rate=30 -c:v h264_nvenc -preset fast -y /tmp/test_output.mp4`

### Bug Incidence
Always
### nvidia-bug-report.log.gz
[nvidia-bug-report.log.gz](https://github.com/user-attachments/files/19776094/nvidia-bug-report.log.gz)
### More Info
Similar issues were #104 and #378, but the NVENC problem has reappeared in the new version of the driver.
The above are critical issues where NVENC and NVDEC work on only one GPU in multi-GPU setups with the NVIDIA Container Toolkit on driver versions newer than 565, that is, 570 and later.
The symptom is NVENC failing (because it cannot find a CUDA device) on hosts with multiple NVIDIA GPUs when trying to use any index other than ‘0’. Several reporters have tried exposing only the target device through the NVIDIA_VISIBLE_DEVICES environment variable, assigning it by index or by GPU UUID.
Only one GPU works (it may be the first GPU, the last GPU, or anything in between), and every other GPU fails in FFmpeg:
[h264_nvenc @ 0x] OpenEncodeSessionEx failed: unsupported device (2): (no details)
[h264_nvenc @ 0x] No capable devices found
Moreover, GStreamer also fails in a similar way when FFmpeg fails:
nvh264encoder gstnvh264encoder.cpp:2158:gst_nv_h264_encoder_register_cuda:<cudacontext0> Failed to open session
nvh265encoder gstnvh265encoder.cpp:2196:gst_nv_h265_encoder_register_cuda:<cudacontext0> Failed to open session
nvenc gstnvenc.c:685:gst_nv_enc_register: NvEncOpenEncodeSessionEx failed: codec h264, device 0, error code 2
nvenc gstnvenc.c:685:gst_nv_enc_register: NvEncOpenEncodeSessionEx failed: codec h265, device 0, error code 2
The above is on driver version 580.82.07 with five NVIDIA Titan Xp GPUs.
Driver versions 550 and 565 work fine, so this is a regression in driver version 570 and later; I am therefore raising it in the forum for the driver team.
This is widely known to happen in Kubernetes, but it may also happen in Docker.
CC @amrits @generix
ktsong
October 23, 2025, 6:03am
2
We are also closely monitoring this issue. Based on our tests, ffmpeg NVENC functionality within K8S pods works well on Tesla T4 nodes with multiple GPU cards. Since the issue "NVENC Fails in Kubernetes Pods on all but the last GPU with Driver 570.x or 580.x" (Issue #1249 · NVIDIA/nvidia-container-toolkit · GitHub) has indicated stable behavior on V100 GPUs, and considering the current findings, we suspect there might be some driver-level issues affecting NVENC support for the GeForce series, particularly models like the 3060, 4090, and 5090. We’re continuing to look into this and will provide updates as we learn more.
It has been confirmed that driver version 565.57.01 does not have this issue, but both the 570 and 580 series are affected. What is the current status regarding this problem?
@ktsong I have confirmed that the 580 driver series has the issue on NVIDIA RTX 5070 Ti GPUs. This forces us to downgrade drivers and OS. When should we expect a fix?
Is anyone looking into this issue? We are also forced to use older drivers which blocks the use of our newly purchased RTX 5090 GPUs in our cluster …
tzmtl
February 17, 2026, 1:32am
6
I’ve figured out why it doesn’t work. It has nothing to do with the mismatch of /dev/nvidiaX between container and host.
When NVENC is initialized, NVENC’s user‑space stack (libnvcuvid/libnvidia-encode) queries the NVIDIA Resource Manager via /dev/nvidiactl and gets an “attached GPU IDs” list (NV0000_CTRL_CMD_GPU_GET_ATTACHED_IDS, cmd 0x201) that includes all host GPUs, even inside a 1‑GPU pod.
When that list contains multiple GPUs, the NVENC open path takes a multi‑GPU/peer‑init branch and tries to touch the other GPU’s device node (/dev/nvidiaY), which is not mounted in the pod.
That peer‑init step fails, so the code bails out before class enumeration (0x00800201) and before allocating the required RM object (class 0xC661), returning NV_ENC_ERR_UNSUPPORTED_DEVICE even though the target GPU itself is fine.
The issue needs to be fixed by Nvidia.
ehfd
February 17, 2026, 8:14am
7
@amrits @generix Can a ticket be opened in NVIDIA regarding the above for 590, 580, and 570 (all currently supported driver branches affected)?
I can also confirm this problem. This issue currently prevents us from bumping drivers in our cluster and starting to use 5090s - would be great if this is fixed ASAP!
tzmtl
February 27, 2026, 10:50pm
9
I investigated further and worked out the internal logic.
libnvidia-encode.so calls libnvcuvid.so to set up and initialize GPUs.
libnvcuvid.so communicates with RM via /dev/nvidiactl, and it can see all GPUs.
When multiple GPUs are available, it picks one as the “primary” GPU: the one with the “lexicographically smallest” UUID. It does this even though it could find out which GPU is really available from libcuda.so.
So when the host has multiple GPUs, only the container that holds the GPU with the “smallest” UUID works.
This doesn’t mean that NVENC on the host can only work on one GPU. The “primary” GPU setup happens only during the GPU init phase. If that phase passes, the real NVENC encoding work can be done on a non-primary GPU.
I’m not sure whether this is intended, since NVIDIA has gone a long time without fixing it.
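To illustrate the selection rule described above, here is a minimal sketch that predicts which pod would pass NVENC init, given the host's GPU UUIDs. It assumes a plain byte-wise string comparison (the exact comparison used inside the driver libraries is not known), and the function names `primary_uuid` and `nvenc_expected_to_work` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: model the "lexicographically smallest UUID is primary" rule.

# primary_uuid UUID...
#   Prints the smallest UUID under byte-wise (C locale) ordering.
primary_uuid() {
  printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}

# nvenc_expected_to_work POD_UUID HOST_UUID...
#   Returns 0 if the pod's GPU is the one the rule would pick as primary.
nvenc_expected_to_work() {
  local pod_uuid="$1"
  shift
  [ "$(primary_uuid "$@")" = "$pod_uuid" ]
}

# Example use (host UUIDs collected on the host, pod UUID inside the pod):
#   mapfile -t host_uuids < <(nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader)
#   nvenc_expected_to_work "$pod_uuid" "${host_uuids[@]}" && echo "should work"
```

If this model holds, it would also explain why different reports see the first, the last, or a middle GPU as the only working one: the outcome depends on UUID ordering, not on the GPU index.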