GPU memory is empty, but a CUDA out-of-memory error occurs

While training this code with Ray Tune (1 GPU per trial), a CUDA out of memory error occurred on GPU:0 and GPU:1 after a few hours of training (about 20 trials). Even after terminating the training process, those GPUs still give out-of-memory errors.

nvidia-smi result

As shown above, all of my GPU devices are currently empty, and there is no Python process running other than these two.
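One way to double-check that nothing is still holding the devices is to list the processes that have the NVIDIA device nodes open. A minimal sketch, assuming the fuser utility from psmisc is available (without root it only reports processes owned by the current user):

import glob
import subprocess

# List every process that still has an NVIDIA device node open.
# Leaked workers from a killed Ray Tune run would show up here.
for dev in sorted(glob.glob('/dev/nvidia*')):
    print(dev)
    subprocess.run(['fuser', '-v', dev])

Probing the devices directly from PyTorch gives: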

import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working

torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory

However, GPU:0 and GPU:1 give out-of-memory errors even though nothing is allocated on them. If I reboot the computer (Ubuntu 18.04.3), everything returns to normal, but the same problem occurs as soon as I train the code again.
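For reference, the per-device probe above can be wrapped in a loop so every visible device is checked in one pass; a small sketch using the same tiny-allocation test:

import torch

# Probe every visible CUDA device with a tiny allocation and report
# which ones fail (the broken ones raise "CUDA error: out of memory").
for i in range(torch.cuda.device_count()):
    try:
        torch.rand(1, 2).to(f'cuda:{i}')
        print(f'cuda:{i}: ok')
    except RuntimeError as e:
        print(f'cuda:{i}: {e}')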

  • numba and TensorFlow show the same problem, so it does not seem to be specific to PyTorch:
>>> from numba import cuda
>>> device = cuda.get_current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/api.py", line 460, in get_current_device
    return current_context().device
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
    return self._activate_context_for(0)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
    newctx = gpu.get_primary_context()
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 542, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 302, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 342, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
  • dmesg result (no "GPU has fallen off the bus" error):
dmesg | grep -i -e nvidia -e nvrm
[    5.946174] nvidia: loading out-of-tree module taints kernel.
[    5.946181] nvidia: module license 'NVIDIA' taints kernel.
[    5.956595] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.968280] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    5.970485] nvidia 0000:09:00.0: enabling device (0000 -> 0003)
[    5.970571] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.015145] nvidia 0000:0a:00.0: enabling device (0000 -> 0003)
[    6.015394] nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.064993] nvidia 0000:42:00.0: enabling device (0000 -> 0003)
[    6.065072] nvidia 0000:42:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.115778] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    6.164680] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.27.04  Fri Dec 11 23:35:05 UTC 2020
[    6.174137] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  460.27.04  Fri Dec 11 23:24:19 UTC 2020
[    6.176472] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[    6.176567] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 0
[    6.176635] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[    6.176636] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1
[    6.176709] [drm] [nvidia-drm] [GPU ID 0x00004200] Loading driver
[    6.176710] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:42:00.0 on minor 2
[    6.176760] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[    6.176761] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 3
[    6.189768] nvidia-uvm: Loaded the UVM driver, major device number 511.
[    6.744582] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input12
[    6.744664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input15
[    6.744755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input17
[    6.744852] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input19
[    6.744952] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input11
[    6.745301] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input16
[    6.745739] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input18
[    6.746280] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input20
[    7.117377] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input9
[    7.117453] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input10
[    7.117505] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input13
[    7.117559] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input14
[    7.117591] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input21
[    7.117650] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input22
[    7.117683] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input23
[    7.117720] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input24
[    9.462521] caller os_map_kernel_space.part.8+0x74/0x90 [nvidia] mapping multiple BARs

How can I debug this problem, or resolve it without rebooting?

Setup

  • Ubuntu 18.04.3
  • GPU: RTX 2080 Ti (×4)
  • CUDA version: 10.2
  • NVIDIA driver version: 460.27.04
  • cuDNN: 7.6.4.38
  • Python 3.8.4
  • PyTorch: 1.7.0, 1.9.0, 1.9.0+cu111
  • CPU: AMD Ryzen Threadripper 2950X 16-Core Processor (32 threads)
  • Memory: 125 GB
  • Power supply: 2000 W

We are having what appears to be the same issue with RTX 2080 Ti GPUs on our HPC cluster. We have found we can recover them by running nvidia-smi --gpu-reset -i $n, where $n is the ID of the GPU returning the OOM message. This avoids having to reboot the host, but we are still searching for the underlying cause.
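For anyone who wants to script the recovery, a rough sketch that probes each device and prints the reset command for the ones that fail (the indices assume the default CUDA device ordering matches nvidia-smi; the reset normally needs root and will refuse to run while any process is still using the GPU):

import torch

# Find the devices that fail a tiny allocation and print the matching
# nvidia-smi reset command. The command is printed rather than executed so
# the probing process itself is not holding the device when the reset runs.
bad = []
for i in range(torch.cuda.device_count()):
    try:
        torch.rand(1, 2).to(f'cuda:{i}')
    except RuntimeError:
        bad.append(i)

for i in bad:
    print(f'sudo nvidia-smi --gpu-reset -i {i}')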