CUDA Out of Memory on RTX 3060 with TF/PyTorch

Dear NVIDIA developer team,

This week I upgraded my graphics card from an RTX 2060 to an RTX 3060 because it has more VRAM, so that I could run deep learning experiments faster.

The problem is that I now cannot train at all with the new GPU due to a constant OOM issue. Both PyTorch (1.7.1+cu11.0, 1.8.0+cu11.1) and tensorflow-gpu (2.4.3, CUDA 11.1) give the same OOM error.

From my observation, GPU memory usage with tensorflow-gpu rises to about 9.x GB of the available 12 GB of VRAM before it fails with OOM. With PyTorch, however, I didn't observe any spike in GPU memory usage at all.

Hence, I am wondering: might this be an issue in the CUDA driver itself, which perhaps doesn't support the RTX 3060 yet, since the card is less than a month old?
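One quick way to check this hypothesis (a sketch, assuming PyTorch is installed) is to verify that the installed build actually ships kernels for the RTX 3060's compute capability; Ampere consumer cards are sm_86 and need CUDA 11.1+ builds:

```python
import torch

# Print the PyTorch build and its bundled CUDA version.
print(torch.__version__, torch.version.cuda)

if torch.cuda.is_available():
    # An RTX 3060 should report compute capability (8, 6).
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))
    # The architectures this build was compiled for should include 'sm_86'.
    print(torch.cuda.get_arch_list())
```

If 'sm_86' is missing from the architecture list, the binary cannot run kernels on this GPU, which would explain failures independent of the actual memory usage.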

Reproduce the issue

PyTorch
I have tried this and this, but neither helped much.

To test PyTorch: here.

TensorFlow
To test TensorFlow:
test_tf.py (2.5 KB)
Error snapshot:


Hi @briliantnugraha,

Thanks for raising this issue.
If I understand the use case correctly, you are seeing an OOM error on your 3060 using the PyTorch 1.8.0+CUDA11.1 binaries (pip wheels or conda binaries) by running the CIFAR10 script?

If so, could you run a quick test and try to allocate a single tensor on this device via:

import torch

x = torch.randn(1024**3, device='cuda')
print(x.shape)

and check whether this also runs OOM.
This would allocate 4 GiB on your device and should work fine.
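For reference, the size works out as follows: `torch.randn` produces float32 values (4 bytes per element), so `1024**3` elements occupy exactly 4 GiB:

```python
# float32 tensors use 4 bytes per element.
numel = 1024 ** 3
size_gib = numel * 4 / 2 ** 30
print(size_gib)  # -> 4.0
```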

Since you are seeing an OOM using the CIFAR10 example, I guess the OOM might be a red herring, as this example should not use the complete device memory.

x = torch.randn(1024**3, device='cuda')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 7.79 GiB total capacity; 5.61 GiB already allocated; 107.19 MiB free; 5.61 GiB reserved in total by PyTorch)

It seems that you’ve already allocated data on this device before running the code.
Could you empty the device and run:

import torch
print(torch.cuda.memory_summary())
x = torch.randn(1024**3, device='cuda')
print(torch.cuda.memory_summary())
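To "empty the device" from within the same Python process, one option (a sketch; note it only returns cached-but-unused blocks to the driver, not tensors that are still referenced) is `torch.cuda.empty_cache()`:

```python
import torch

if torch.cuda.is_available():
    # Drop Python references to any large tensors first, then
    # release cached (unused) blocks back to the driver.
    torch.cuda.empty_cache()
    # Both counters should now be small for this process.
    print(torch.cuda.memory_allocated(0), torch.cuda.memory_reserved(0))
```

Memory held by *other* processes is not affected by this; that has to be freed by stopping those processes.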

Hello, I have the same problem as described, and I've tried the proposed test code:

The environment:

I’m running the code on Ubuntu 21.04, under PyCharm pro.

My test code:

import torch
try:
    print(torch.cuda.memory_summary())
    x = torch.randn(1024**3, device='cuda')
    print(torch.cuda.memory_summary())
except Exception as ex:
    print(str(ex))
    print(torch.cuda.memory_summary())

The output:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

Traceback (most recent call last):
  File "/home/jero/Proyectos/EmocionesBasicas/emociones/services/test_cuda.py", line 4, in <module>
    x = torch.randn(1024**3, device='cuda')
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 5.81 GiB total capacity; 0 bytes already allocated; 3.94 GiB free; 0 bytes reserved in total by PyTorch)
(env) jero@nassat:~/Proyectos/EmocionesBasicas/emociones/services$ python test_cuda.py 
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 5.81 GiB total capacity; 0 bytes already allocated; 3.94 GiB free; 0 bytes reserved in total by PyTorch)
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|
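The error above is at least self-consistent: 4.00 GiB was requested while only 3.94 GiB was free, because other processes already hold part of the card. Recent PyTorch releases expose `torch.cuda.mem_get_info` (an assumption: it may not exist in the 1.8-era builds discussed here), which lets you size a test allocation to the memory that is actually free:

```python
import torch

if torch.cuda.is_available():
    # (free, total) bytes as reported by the driver for device 0.
    free, total = torch.cuda.mem_get_info(0)
    print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
    # Allocate roughly half of the free memory (float32 = 4 bytes/element).
    numel = int(free // 2 // 4)
    x = torch.randn(numel, device="cuda")
    print(f"allocated {x.numel() * 4 / 2**30:.2f} GiB")
```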

When I run my project, the GPU memory usage is:

Do you know if I can move the last two processes off the GPU? The last one is a web service that runs the models.
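One common approach (a sketch, not specific to any framework) is to hide the GPU from the web-service process entirely by setting `CUDA_VISIBLE_DEVICES` to an empty string before any CUDA library is imported; that process then falls back to CPU:

```python
import os

# Must run before torch / TensorFlow are imported in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# After this, `import torch; torch.cuda.is_available()` reports False
# in this process, so its models run on the CPU.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect can be had from the shell that launches the service (`CUDA_VISIBLE_DEVICES="" python service.py`, where `service.py` stands in for your entry point).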

When I get the OOM, the memory usage is:

2021-08-26 08:37:48,875  log_emociones - ERROR - Detectada excepción al crear el clasficador de Emociones Básicas: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 5.81 GiB total capacity; 418.75 MiB already allocated; 12.69 MiB free; 472.00 MiB reserved in total by PyTorch)
2021-08-26 08:37:48,875  log_emociones - ERROR - Resumen de la memoria de NVIDIA:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428797 KB |  428797 KB |  428797 KB |       0 B  |
|       from large pool |  428288 KB |  428288 KB |  428288 KB |       0 B  |
|       from small pool |     509 KB |     509 KB |     509 KB |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |  428797 KB |  428797 KB |  428797 KB |       0 B  |
|       from large pool |  428288 KB |  428288 KB |  428288 KB |       0 B  |
|       from small pool |     509 KB |     509 KB |     509 KB |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  483328 KB |  483328 KB |  483328 KB |       0 B  |
|       from large pool |  481280 KB |  481280 KB |  481280 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |   54530 KB |   54552 KB |  265212 KB |  210681 KB |
|       from large pool |   52992 KB |   52992 KB |  263168 KB |  210176 KB |
|       from small pool |    1538 KB |    2044 KB |    2044 KB |     505 KB |
|---------------------------------------------------------------------------|
| Allocations           |     203    |     203    |     203    |       0    |
|       from large pool |      75    |      75    |      75    |       0    |
|       from small pool |     128    |     128    |     128    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |     203    |     203    |     203    |       0    |
|       from large pool |      75    |      75    |      75    |       0    |
|       from small pool |     128    |     128    |     128    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      21    |      21    |      21    |       0    |
|       from large pool |      20    |      20    |      20    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      19    |      19    |      20    |       1    |
|       from large pool |      18    |      18    |      19    |       1    |
|       from small pool |       1    |       1    |       1    |       0    |
|===========================================================================|