GPU decoder unit stuck at 100% usage

When I transcode content with GPU acceleration in ffmpeg (cuvid decode, hwupload_cuda filter, nvenc encode), the decoder unit gets stuck at 100% usage and stays there even after I kill the process.
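For context, the command is roughly of this shape (a sketch only; the file names and bitrate are placeholders, not my exact command):

# GPU decode (cuvid) + GPU encode (nvenc); paths and bitrate are placeholders
ffmpeg -c:v h264_cuvid -i input.mp4 -c:v h264_nvenc -b:v 5M output.mp4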

I was running multiple encoding processes with nvidia-patch applied (GitHub - keylase/nvidia-patch: This patch removes restriction on maximum number of simultaneous NVENC video encoding sessions imposed by Nvidia to consumer-grade GPUs).

I tried to recover the system with these commands:

sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
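Before the rmmod, it might also be worth checking whether anything still holds the NVIDIA device nodes or modules open (generic checks, not part of my original steps):

sudo fuser -v /dev/nvidia*    # list processes with NVIDIA device files open
lsmod | grep nvidia           # confirm which nvidia modules are loaded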

But after those commands the PC locked up.

GPU type: RTX 3070, RTX 3090

pc1
OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.8.0-45-generic x86_64)
GPU: RTX 3090
Driver: 460.32.03, CUDA 11.2

pc2
OS: Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-66-lowlatency x86_64)
GPU: 2x RTX 3070
Driver: 460.32.03, CUDA 11.2

pc3
OS: Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-66-generic x86_64)
GPU: RTX 3070
Driver: 460.39, CUDA 11.2

How can I solve this on Ubuntu?

A reboot may be one possible way to solve it.

It may be that the patch is doing something that leads to this condition.

I found something strange on PC3.
I had launched an unsupported patch there, so that server was not actually patched.
But that server is hitting the same situation.

Hi,

We have observed a similar issue on a server with two RTX 3090 cards, Ubuntu 18.04, driver 460.39, CUDA 11.2.

The program is based on DeepStream and uses the GPU decoder to process ~40 H.265 streams. It runs in a Docker container based on nvcr.io/nvidia/deepstream:5.1-21.02-base.
The GPU decoder on one card, say GPU 1, randomly goes up to 100% usage. nvidia-smi pmon shows no program using the decoder on GPU 1, while the other card, GPU 0, runs fine. Terminating all processes reported by nvidia-smi -i 1 does not bring the usage back down.
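For reference, the checks we run are just the standard nvidia-smi views (nothing custom):

nvidia-smi pmon -i 1 -s u    # per-process sm/mem/enc/dec utilization on GPU 1
nvidia-smi -i 1              # overall status and process list for GPU 1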

A full reboot does resolve it.

We also tried nvidia-smi -r -i 1 to reset GPU 1 after all processes on it had terminated. The operation reports success, but subsequent nvidia-smi calls then report "Unable to determine device handle: Unknown Error".
We then tried nvidia-smi -r -i 0 to reset the other, still-working GPU, also with a successful result, and magically both devices could be recognized again! However, manually stopping all services, resetting all GPUs, and restarting all services is still painful.
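So the workaround amounts to roughly this sequence (the service name is a placeholder for whatever workload holds the GPUs):

sudo systemctl stop deepstream-app.service    # placeholder: stop everything using the GPUs
sudo nvidia-smi -r -i 1                       # reset the stuck GPU (reports success, but devices then show an error)
sudo nvidia-smi -r -i 0                       # reset the other GPU; after this both are recognized again
sudo systemctl start deepstream-app.service   # placeholder: restart the workload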

The same decoding workload has been tested on a T4, which never ran into this.

The nvidia-bug-report.log.gz is attached in case it helps. Thanks. nvidia-bug-report.log.gz (1.1 MB)

Edit: the bug report was collected after resetting GPU 1 and before resetting GPU 0.