When you start a new thread, it defaults to device 0. If you want a thread to use device 1, you must set that explicitly in that thread; it will not pick up device changes made by other threads.
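As a minimal sketch of that point (not from the original exchange; it assumes two GPUs are present), each host thread has to make its own device selection:

#include <cuda_runtime.h>
#include <thread>

int main()
{
    cudaSetDevice(0);                 // affects the main thread only

    std::thread worker([] {
        cudaSetDevice(1);             // without this, the worker defaults to device 0
        void *d = nullptr;
        cudaMalloc(&d, 1 << 20);      // this 1 MiB allocation lands on device 1
        cudaFree(d);
    });
    worker.join();
    return 0;
}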
Regarding the rest of your statements, I can't explain it. cudaMemcpy doesn't return an out-of-memory error in my experience. My guess is that it is coming from a cudaMalloc call unintentionally directed at your "full" device. It might be something else, of course. I can't describe the behavior of code I haven't seen, so this is just guesswork.
For cudaMalloc, I check the cudaError_t and it is OK, because the allocation is on GPU1.
But the cudaMemcpy call copies memory from GPU1 to the CPU. I thought I did not need to set the device, because the source pointer is on GPU1. So, for cudaMemcpy, do I still need to set the device to GPU1? Otherwise, will it use GPU0 for the copy and possibly allocate a buffer on GPU0?
For cudaMemcpy specifically, it should not be necessary to do cudaSetDevice before an H->D or D->H transfer. The device is inferred from the device pointer passed to the CUDA runtime.
I can consistently reproduce it with two processes: one process allocates GPU0 memory until it is full; the other allocates memory on GPU1 and copies it to the host. With the same code in a single process there is no error; I can only reproduce it with two processes.
gpu-alloc.cpp
#include <iostream>
#include <cstdint>
#include <cuda_runtime.h>

int
main()
{
    constexpr size_t MEM_PIECE_SIZE = 1024 * 1024;
    size_t memAllocated = 0;
    cudaSetDevice(0);
    // Allocate 1 MiB pieces on device 0 until allocation fails
    do {
        uint8_t *d;
        if (cudaMalloc(&d, MEM_PIECE_SIZE) != cudaSuccess) {
            break;
        }
        memAllocated += MEM_PIECE_SIZE;
    } while (true);
    std::cout << memAllocated << " allocated" << std::endl;
    std::cout << "Input anything to quit" << std::endl;
    char c;
    std::cin >> c;
    // As a test, no need to free memory
}
gpu-test.cpp
#include <iostream>
#include <cstdint>
#include <cuda_runtime.h>

int
main()
{
    uint8_t *d1;
    constexpr size_t IMAGE_SIZE = 1920 * 1080 * 3;
    cudaSetDevice(1);
    cudaError_t e;
    // Allocation on device 1 succeeds
    if ((e = cudaMalloc(&d1, IMAGE_SIZE)) != cudaSuccess) {
        std::cerr << "Cannot allocate image:" << cudaGetErrorString(e) << std::endl;
        return -1;
    }
    cudaSetDevice(0);
    uint8_t *p = new uint8_t[IMAGE_SIZE];
    // D->H copy from the device-1 pointer, issued while device 0 is current
    if ((e = cudaMemcpy(p, d1, IMAGE_SIZE, cudaMemcpyDeviceToHost)) != cudaSuccess) {
        std::cerr << "Cannot copy image:" << cudaGetErrorString(e) << std::endl;
        return -1;
    }
    // As a test, no need to free memory
}
Run ./gpu-alloc in one terminal so that GPU0 is filled; then, in another terminal, run ./gpu-test, and I see "Cannot copy image:out of memory".
Running ./gpu-test under cuda-memcheck, I get:
cuda-memcheck ./gpu-test
========= CUDA-MEMCHECK
========= Program hit cudaErrorMemoryAllocation (error 2) due to "out of memory" on CUDA API call to cudaMemcpy.
Cannot copy image:========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3343d3]
out of memory========= Host Frame:/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 (cudaMemcpy + 0x1df) [0x3b8ff]
========= Host Frame:./gpu-test [0xb1f]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
========= Host Frame:./gpu-test [0x999]
=========
========= ERROR SUMMARY: 1 error
I am using CUDA Toolkit 10.1; I am not sure whether this is fixed in 11.3.
I filed bug 3326958. I am not sure about the "Relevant Area"; if it is not appropriate, please help me correct it. Also, the formatting of the description does not seem to be working. :-)
So, in this case, cudaMemcpy can return "out of memory".
cudaMemcpy is both a synchronizing and a blocking call. It synchronizes on device activity, and it blocks the host thread (i.e. it is not asynchronous like a kernel launch). Synchronizing on device activity means that it waits until previously issued work on the device is complete before the copy is allowed to begin.
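As an illustration of that behavior (a sketch assuming a trivial spin kernel, not code from this thread), the copy below cannot begin until the preceding kernel launch has finished, and the host thread is blocked until the copy completes:

#include <cuda_runtime.h>

__global__ void busy_kernel(int *out)
{
    for (int i = 0; i < 1000000; ++i) {   // spin so the copy demonstrably has to wait
        atomicAdd(out, 1);
    }
}

int main()
{
    int *d = nullptr;
    int h = 0;
    cudaMalloc(&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));
    busy_kernel<<<1, 1>>>(d);                                 // asynchronous launch
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);   // waits for the kernel, then blocks the host
    cudaFree(d);
    return 0;
}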
It appears this is expected behavior and not a bug. We can demonstrate with a fairly simple test case that when you do this (the sequence below is a sketch of the relevant calls from gpu-test.cpp above):
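    cudaSetDevice(1);
    cudaMalloc(&d1, IMAGE_SIZE);       // allocation on device 1 succeeds
    cudaSetDevice(0);                  // device 0 is now the current device
    cudaMemcpy(p, d1, IMAGE_SIZE, cudaMemcpyDeviceToHost);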
the final cudaMemcpy operation synchronizes on both device 0 and device 1. In order to synchronize on both devices, the CUDA runtime must have a context available on both devices. This context creation will occur prior to the actual synchronization, if needed.
Given that, your case is expected. There was not enough memory left on device 0 to allow the 2nd process to create a context as needed by the runtime API, and so the error occurred.
To phrase this in a general way: if you intend to do a cudaSetDevice(x) in your process, whether explicitly or implicitly (all threads start out with an implicit cudaSetDevice(0) if you use any runtime API call before explicitly issuing a cudaSetDevice() call), you should be confident that you have the permission and the ability to establish a context on that device x. If not, the CUDA runtime may not work properly.
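One way to check that up front (a sketch, not from the original thread; the cudaFree(0) call is simply a common idiom to force the runtime to create the context immediately) is to establish the context on every device the process intends to touch at startup and verify the result:

#include <cuda_runtime.h>
#include <iostream>

bool can_use_device(int dev)
{
    if (cudaSetDevice(dev) != cudaSuccess) {
        return false;
    }
    // cudaFree(0) forces the runtime to establish the context for the
    // current device now, rather than lazily inside a later API call.
    return cudaFree(0) == cudaSuccess;
}

int main()
{
    for (int dev : {0, 1}) {
        std::cout << "device " << dev << ": "
                  << (can_use_device(dev) ? "context OK" : "cannot establish context")
                  << std::endl;
    }
    return 0;
}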
This also explains why it requires 2 processes to witness the issue. My single-process test case cannot demonstrate the issue.
It occurs to me that CUDA_VISIBLE_DEVICES might be usable to work around this issue, if the GPUs being used are in disjoint sets from one process to another. This particular idea is not applicable here, but it might be for other users who are experiencing this.
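For example, a process that only ever needs the second GPU could be launched along these lines (hypothetical invocation; note that inside the process the visible GPU is then enumerated as device 0):

CUDA_VISIBLE_DEVICES=1 ./some_app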