11 GB of GPU RAM used, and no process listed by nvidia-smi

In my GPU #0, 11341MiB of GPU RAM is used, and no process is listed by nvidia-smi. How is that possible, and how can I get my memory back?

Thu Aug 18 14:27:58 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 29%   61C    P2    71W / 250W |  11341MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 22%   42C    P0    71W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:82:00.0     Off |                  N/A |
| 22%   35C    P0    69W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
|  0%   33C    P0    60W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I had launched a Theano Python script with the lib.cnmem=0.9 flag, which explains why it used 11341MiB of GPU memory (the CNMeM library is a “simple library to help the Deep Learning frameworks manage CUDA memory”). However, I killed the script and was expecting the GPU memory to be released. pkill -9 python did not help.

I use a GeForce GTX Titan Maxwell with Ubuntu 14.04.4 LTS x64.
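
The symptom described above (memory in use on a GPU whose process table is empty) can also be detected from a script. The following is a minimal sketch, assuming the installed nvidia-smi supports the --query-gpu and --query-compute-apps options; the function names and the 100 MiB threshold are hypothetical choices made for illustration. As far as I know, graphics clients such as an X server are not reported by --query-compute-apps, so a GPU that drives a display may be flagged even when nothing is wrong.

```shell
# Print the index of every GPU that reports more than $3 MiB in use while
# no compute process is listed for it.
#   $1: output of the --query-gpu call in report_orphaned_gpus
#   $2: output of the --query-compute-apps call in report_orphaned_gpus
#   $3: threshold in MiB (hypothetical default: 100)
orphaned_gpus() {
    printf '%s\n---\n%s\n' "$2" "$1" | awk -F', *' -v limit="${3:-100}" '
        $0 == "---"       { in_gpus = 1; next }   # separator: GPU rows follow
        !in_gpus && NF    { busy[$1] = 1; next }  # UUIDs that have a process
        in_gpus && NF && $3 + 0 > limit && !($2 in busy) { print $1 }
    '
}

# Query the driver and report suspicious GPUs (assumes nvidia-smi is on PATH).
report_orphaned_gpus() {
    gpus=$(nvidia-smi --query-gpu=index,uuid,memory.used \
                      --format=csv,noheader,nounits) || return 1
    apps=$(nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory \
                      --format=csv,noheader,nounits) || return 1
    orphaned_gpus "$gpus" "$apps" 100
}
```

Calling report_orphaned_gpus on a machine in the state shown above would be expected to print the index of the affected GPU; the parsing is kept in a separate function so it can be tried on saved output without a GPU.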

It’s probably the result of a corrupted context on the GPU, perhaps associated with your killed script.

You can try using nvidia-smi to reset the GPUs. If that doesn’t work, reboot the server.

Thanks, following your comment I tried

sudo nvidia-smi --gpu-reset -i 0

but it didn’t work:

Unable to reset this GPU because it’s being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.
Terminating early due to previous errors.

Any other ideas?

I’d rather avoid resetting the server, as other processes are running on it.

Thanks for your help,
Franck

  1. log out of the username that issued the interrupted work to that gpu

  2. as root, find all running processes associated with the username that issued the interrupted work on that gpu:

ps -ef|grep username

  3. as root, kill all of those

  4. as root, retry the nvidia-smi gpu reset

If that doesn’t work, I’m out of ideas.
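
The steps above can be sketched as a small helper that only prints the commands, so they can be reviewed before anything is killed. This is an illustration under stated assumptions rather than a tested recipe: reset_plan is a hypothetical name, the user name alice is a placeholder, and sending SIGTERM before SIGKILL is an addition that the steps themselves do not mention.

```shell
# Print, without executing anything, the commands for the steps above.
#   $1: user name that launched the interrupted job
#   $2: GPU index as shown by nvidia-smi
reset_plan() {
    printf 'ps -f -u %s\n' "$1"                   # list that user's processes
    printf 'pkill -TERM -u %s\n' "$1"             # ask them to exit cleanly
    printf 'sleep 5\n'                            # give them a moment
    printf 'pkill -KILL -u %s\n' "$1"             # force whatever is left
    printf 'nvidia-smi --gpu-reset -i %s\n' "$2"  # retry the reset
}

# Review the plan first:
#   reset_plan alice 0
# Then, if it looks right, run it as root:
#   reset_plan alice 0 | sudo sh
```

Printing the plan instead of running it directly is a deliberate choice here, since pkill -u ends every process of that user, not only the ones that touched the GPU.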

Apart from nvidia-smi, on Linux you can check which processes might be using the GPU using the command

sudo fuser -v /dev/nvidia*

(this will list processes that have NVIDIA GPU device nodes open).
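
To see who owns each of those processes and how long it has been running, the PIDs can be passed on to ps. The sketch below uses lsof rather than fuser because its tabular output is easier to parse; lsof_pids and show_gpu_holders are hypothetical helper names, and the sketch assumes lsof is installed and that the PID is the second column of its output.

```shell
# Extract the unique PIDs from `lsof` output given on stdin
# (the PID is the second column; the first row is the header).
lsof_pids() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ { print $2 }' | sort -un
}

# Show PID, owner, elapsed time and command line of every process that has
# an NVIDIA device node open. Root is needed to see other users' processes.
show_gpu_holders() {
    for pid in $(sudo lsof /dev/nvidia* 2>/dev/null | lsof_pids); do
        ps -o pid=,user=,etime=,args= -p "$pid"
    done
}
```

If show_gpu_holders prints nothing while memory is still reported as used, no process visible to lsof holds the device nodes, which matches the situation in the original question.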


@monoid Thanks, unfortunately it didn’t list any unwanted process.

@txbob Thanks, I’ll keep it as a last resort, as there are a few other processes being run by the same user. If I do end up using it, I’ll let you know how it goes.

Any updates here? I ran into the same problem recently.
Is it possible to reset the GPU device without a system reboot?

@nicklhy Sorry, I don’t have any more information on my side. Did txbob’s suggestion work for you? I could not try it as I had to keep some processes alive, and then one day the server rebooted as a result of a power outage. I haven’t had the issue since then.

I was facing the same problem while working with Python. In my case, I simply killed the Python process from the system monitor and it worked.


In my case, I killed all processes belonging to the user:

pkill -u [username]


I use nvtop (https://github.com/Syllo/nvtop) for monitoring GPUs anyway; it is a useful program. It lists processes like htop, but only those using the GPU, and you can kill them directly from its console. This helped me because nvidia-smi -r gives me: “GPU Reset couldn't run because GPU 00000000:01:00.0 is the primary GPU.”

If you’re OK with killing all Python processes (replace the 0 in /dev/nvidia0 with the GPU number):

for i in $(sudo lsof /dev/nvidia0 | grep python  | awk '{print $2}' | sort -u); do kill -9 $i; done

Please refer to this: https://stackoverflow.com/questions/4354257/can-i-stop-all-processes-using-cuda-in-linux-without-rebooting
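
The one-liner above sends SIGKILL straight away, which gives a process no chance to clean up after itself. A gentler variant, sketched here under the assumption that the caller is allowed to signal the target processes (root, or the owner of those processes), sends SIGTERM first and only falls back to SIGKILL after a grace period. terminate_pids and the GRACE variable are hypothetical names.

```shell
# Send SIGTERM to each PID, wait up to $GRACE seconds (default 5) for all of
# them to disappear, then send SIGKILL to whatever is still there.
terminate_pids() {
    grace=${GRACE:-5}
    for pid in "$@"; do
        kill -TERM "$pid" 2>/dev/null || true
    done
    while [ "$grace" -gt 0 ]; do
        alive=0
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests that the PID exists
            kill -0 "$pid" 2>/dev/null && alive=1
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        grace=$((grace - 1))
    done
    for pid in "$@"; do
        kill -KILL "$pid" 2>/dev/null || true
    done
    return 0
}

# Example (run from a root shell; lsof -t prints bare PIDs):
#   terminate_pids $(lsof -t /dev/nvidia0)
```

Unlike the grep python filter, the example call targets every process that has the device node open, so it is worth checking the PID list before running it.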


Killing the Python process worked for me.
I am using a Jupyter notebook. In subsequent runs, I shut down the notebook kernel via Kernel -> Shutdown in the notebook, which releases the memory. I used watch nvidia-smi to track GPU memory usage.
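
Instead of watching the output by eye, a script can poll until the memory has actually been released. This is a rough sketch, assuming nvidia-smi supports --query-gpu=memory.used and prints a bare number when nounits is given; gpu_mem_used and wait_for_release are hypothetical names.

```shell
# Print the memory currently in use on GPU $1, in MiB.
gpu_mem_used() {
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i "$1"
}

# Poll once per second until GPU $1 uses less than $2 MiB.
# Gives up after $3 attempts (default 30).
# Returns 0 on success, 1 on timeout, 2 if the query itself fails.
wait_for_release() {
    tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        used=$(gpu_mem_used "$1") || return 2
        [ "$used" -lt "$2" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Example: after shutting down the kernel, wait for GPU 0 to drop below 100 MiB
#   wait_for_release 0 100 && echo "GPU 0 is free again"
```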