GPU memory is empty, but a CUDA out-of-memory error occurs

While training this code with Ray Tune (1 GPU per trial), a CUDA out of memory error occurred on GPU:0 and GPU:1 after a few hours of training (about 20 trials). Even after terminating the training process, those GPUs still give out-of-memory errors.

nvidia-smi result

As shown above, all of my GPU devices are currently empty, and there is no Python process running other than these two.
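One way to double-check that nothing is still holding the devices is to list the processes that have the NVIDIA device nodes open. A minimal sketch, assuming the fuser utility from psmisc is available (without root it only reports processes owned by the current user):

import glob
import subprocess

# List every process that still has an NVIDIA device node open.
# Leaked workers from a killed Ray Tune run would show up here.
for dev in sorted(glob.glob('/dev/nvidia*')):
    print(dev)
    subprocess.run(['fuser', '-v', dev])

Probing the devices directly from PyTorch gives: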

import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working

torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory

However, GPU:0 and GPU:1 give out-of-memory errors even though nothing is allocated on them. If I reboot the computer (Ubuntu 18.04.3), everything returns to normal, but the same problem occurs as soon as I train the code again.
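For reference, the per-device probe above can be wrapped in a loop so every visible device is checked in one pass; a small sketch using the same tiny-allocation test:

import torch

# Probe every visible CUDA device with a tiny allocation and report
# which ones fail (the broken ones raise "CUDA error: out of memory").
for i in range(torch.cuda.device_count()):
    try:
        torch.rand(1, 2).to(f'cuda:{i}')
        print(f'cuda:{i}: ok')
    except RuntimeError as e:
        print(f'cuda:{i}: {e}')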

  • numba and TensorFlow show the same problem, so it does not seem to be specific to PyTorch:
>>> from numba import cuda
>>> device = cuda.get_current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/api.py", line 460, in get_current_device
    return current_context().device
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
    return self._activate_context_for(0)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
    newctx = gpu.get_primary_context()
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 542, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 302, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 342, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
  • dmesg result (no "GPU has fallen off the bus" error):
dmesg | grep -i -e nvidia -e nvrm
[    5.946174] nvidia: loading out-of-tree module taints kernel.
[    5.946181] nvidia: module license 'NVIDIA' taints kernel.
[    5.956595] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.968280] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    5.970485] nvidia 0000:09:00.0: enabling device (0000 -> 0003)
[    5.970571] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.015145] nvidia 0000:0a:00.0: enabling device (0000 -> 0003)
[    6.015394] nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.064993] nvidia 0000:42:00.0: enabling device (0000 -> 0003)
[    6.065072] nvidia 0000:42:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.115778] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    6.164680] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.27.04  Fri Dec 11 23:35:05 UTC 2020
[    6.174137] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  460.27.04  Fri Dec 11 23:24:19 UTC 2020
[    6.176472] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[    6.176567] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 0
[    6.176635] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[    6.176636] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1
[    6.176709] [drm] [nvidia-drm] [GPU ID 0x00004200] Loading driver
[    6.176710] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:42:00.0 on minor 2
[    6.176760] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[    6.176761] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 3
[    6.189768] nvidia-uvm: Loaded the UVM driver, major device number 511.
[    6.744582] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input12
[    6.744664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input15
[    6.744755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input17
[    6.744852] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input19
[    6.744952] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input11
[    6.745301] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input16
[    6.745739] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input18
[    6.746280] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input20
[    7.117377] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input9
[    7.117453] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input10
[    7.117505] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input13
[    7.117559] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input14
[    7.117591] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input21
[    7.117650] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input22
[    7.117683] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input23
[    7.117720] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input24
[    9.462521] caller os_map_kernel_space.part.8+0x74/0x90 [nvidia] mapping multiple BARs

How can I debug this problem, or resolve it without rebooting?

Setup

  • Ubuntu 18.04.3
  • GPU: RTX 2080 Ti (×4)
  • CUDA version: 10.2
  • NVIDIA driver version: 460.27.04
  • cuDNN: 7.6.4.38
  • Python 3.8.4
  • PyTorch: 1.7.0, 1.9.0, 1.9.0+cu111
  • CPU: AMD Ryzen Threadripper 2950X 16-Core Processor (32 threads)
  • Memory: 125 GB
  • Power supply: 2000 W

We are having what appears to be the same issue with RTX 2080 Ti GPUs on our HPC cluster. We have found we can recover them by running nvidia-smi --gpu-reset -i $n, where $n is the ID of the GPU returning the OOM message. This avoids having to reboot the host, but we are still searching for the underlying cause.
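For anyone who wants to script the recovery, a rough sketch that probes each device and prints the reset command for the ones that fail (the indices assume the default CUDA device ordering matches nvidia-smi; the reset normally needs root and will refuse to run while any process is still using the GPU):

import torch

# Find the devices that fail a tiny allocation and print the matching
# nvidia-smi reset command. The command is printed rather than executed so
# the probing process itself is not holding the device when the reset runs.
bad = []
for i in range(torch.cuda.device_count()):
    try:
        torch.rand(1, 2).to(f'cuda:{i}')
    except RuntimeError:
        bad.append(i)

for i in bad:
    print(f'sudo nvidia-smi --gpu-reset -i {i}')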