Got out of memory from cudaMemcpy

Hello NV experts, I have a 4 Tesla P4 GPU server, GPU0’s memroy is almost full. I did:

  1. cudaSetDevice to GPU1
  2. cudaMalloc
  3. Do other things
  4. And may from another thread cudaMemcpy from the memory allocated in GPU1 to CPU, then I got out of memory
  5. If I cudaSetDevice(1) before cudaMemcpy, it runs OK.

I have to set device before cudaMemcpy call?

nvidia-smi shows

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:00:09.0 Off |                    0 |
| N/A   45C    P0    23W /  75W |   7575MiB /  7611MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            On   | 00000000:00:0A.0 Off |                    0 |
| N/A   44C    P0    23W /  75W |    595MiB /  7611MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P4            On   | 00000000:00:0B.0 Off |                    0 |
| N/A   41C    P0    23W /  75W |   1441MiB /  7611MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P4            On   | 00000000:00:0C.0 Off |                    0 |
| N/A   41C    P0    23W /  75W |   1479MiB /  7611MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

My CUDA Toolkit version is 10.1.

Please help, thanks!

when you start a new thread, it defaults to device 0. If you want a thread to use device 1, you must do that explicitly in that thread. It will not pick up device changes from other threads.

Regarding the rest of your statements, I can’t explain it. cudaMemcpy doesn’t return out of memory error in my experience. My guess is that is coming from a cudaMalloc call unintentionally directed at your “full” device. It might be something else of course. I can’t describe behavior of code I haven’t seen, so just guesswork here.

Thanks for replying!

cudaMalloc, I check cudaError_t, it’s ok, because it’s on GPU1.

But the cudaMemcpy call is to copy memory from GPU1 to CPU. I thought I no need to set device, because the source pointer is in GPU1. So, for cudaMemcpy I still need to set device to GPU1? Otherwise, it will use GPU0 to copy, then may allocate buffer in GPU0?

Please help to confirm. Thanks!

for cudaMemcpy, specifically, it should not be necessary to do cudaSetDevice before an H->D or D->H transfer. The device is inferred from the device pointer passed to the cuda runtime.

This was my understanding too.

But without setting device to GPU1, it returns out of memory; adding setting device to GPU1 fixes the issue. I tried several times forth and back.

Maybe you can also reproduce it on your side. Just try:

  1. Get 2 GPU machine
  2. cudaMalloc until GPU0 is full(make sure memory free is small enough)
  3. Set device to GPU1 and cudaMalloc(a three-channel 1920x1080 image size)
  4. Set device to GPU0
  5. cudaMemcpy from device memory allocated in step 3 to host.

See whether it will happen or not.

Thanks!

On a system with 2 GPUs , device zero is a GTX 960 with 2GB. Device 1 is a GT640 with 1GB. CUDA 11.3, driver 465.19.01

test case:

$ cat t156.cu
#include <iostream>
const int s2 = 6*1048576;

int main(){
  char *d1, *d2;
  char *h1;
  cudaError_t err = cudaSetDevice(0);
  err = cudaMalloc(&d1, 1048576ULL*1961); // 1962 instead of 1961 fails with OOM
  if (err != cudaSuccess) std::cout << "err 1" << std::endl;
  err = cudaSetDevice(1);
  if (err != cudaSuccess) std::cout << "err 2" << std::endl;
  err = cudaMalloc(&d2, s2);
  if (err != cudaSuccess) std::cout << "err 3" << std::endl;
  err = cudaSetDevice(0);
  if (err != cudaSuccess) std::cout << "err 4" << std::endl;
  h1 = new char[s2];
  err = cudaMemcpy(h1, d2, s2,cudaMemcpyDeviceToHost);
  std::cout << cudaGetErrorString(err) << std::endl;
}
$ nvcc -o t156 t156.cu
$ cuda-memcheck ./t156
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$

nvidia-smi output (at idle):

$ nvidia-smi
Tue Jun 15 13:09:50 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 35%   36C    P0    25W / 130W |      0MiB /  2001MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:03:00.0 N/A |                  N/A |
| N/A   35C    P0    N/A /  N/A |      0MiB /   981MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks Bob for trying it out!

My side the GPU load was from other processes. I will investigate more, if any findings will update.

I can constantly reproduce it in two processes: One process allocates GPU0 memory until it’s full; another allocates memory in GPU1 and copy to host. If same code in a single process, no error; only can reproduce when in two processes.

gpu-alloc.cpp

#include <iostream>
#include <cuda_runtime.h>

int
main()
{
    constexpr size_t MEM_PIECE_SIZE = 1024*1024;
    size_t memAllocated = 0;
    cudaSetDevice(0);
    do {
        uint8_t *d;
        if (cudaMalloc(&d, MEM_PIECE_SIZE) != cudaSuccess) {
            break;
        }
        memAllocated += MEM_PIECE_SIZE;
    } while (true);
    std::cout << memAllocated << " allocated" << std::endl;
    std::cout << "Input anything to quit" << std::endl;
    char c;
    std::cin >> c;
    // As a test, no need to free memory
}

gpu-test.cpp

#include <iostream>
#include <cuda_runtime.h>

int
main()
{
    uint8_t *d1;
    constexpr size_t IMAGE_SIZE = 1920 * 1080 * 3;
    cudaSetDevice(1);
    cudaError_t e;
    if ((e = cudaMalloc(&d1, IMAGE_SIZE)) != cudaSuccess) {
        std::cerr << "Cannot allocate image:" << cudaGetErrorString(e) << std::endl;
        return -1;
    }
    cudaSetDevice(0);
    uint8_t *p = new uint8_t[IMAGE_SIZE];
    if ((e = cudaMemcpy(p, d1, IMAGE_SIZE, cudaMemcpyDeviceToHost)) != cudaSuccess) {
        std::cerr << "Cannot copy image:" << cudaGetErrorString(e) << std::endl;
        return -1;
    }
    // As a test, no need to free memory
}

Build with command

$ g++ -o gpu-test -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -std=c++11 gpu-test.cpp -lcudart
$ g++ -o gpu-alloc -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -std=c++11 gpu-alloc.cpp -lcudart

Then

  • Run ./gpu-alloc first and keep it running;
  • In another terminal run ./gpu-test, I can see “Cannot copy image:out of memory”.
  • Run ./gpu-test with cuda-memcheck, I got
cuda-memcheck ./gpu-test
========= CUDA-MEMCHECK
========= Program hit cudaErrorMemoryAllocation (error 2) due to "out of memory" on CUDA API call to cudaMemcpy. 
Cannot copy image:=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3343d3]
out of memory=========     Host Frame:/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 (cudaMemcpy + 0x1df) [0x3b8ff]
=========     Host Frame:./gpu-test [0xb1f]

=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./gpu-test [0x999]
=========
========= ERROR SUMMARY: 1 error

I am using CUDA Toolkit 10.1, not sure whether it is fixed on 11.3.

I see the same issue on my CUDA 11.3 machine. I suggest filing a bug.

You already have a possible workaround; do cudaSetDevice() for the relevant pointer before the cudaMemcpy() operation.

Thanks Bob!

I filed a bug 3326958. Not sure about the “Relevant Area”, if not appropriate, please help to correct. And the format of the description looks not working. :-)

So, in this case, cudaMemcpy can return “out of memory”

cudaMemcpy is both a synchronizing and blocking call. It synchronizes on device activity, and blocks the host thread (i.e. it is not asynchronous like a kernel launch). Synchronizing on device activity means that it waits until previously issued activity to the device is complete, before the copy is allowed to begin.

It appears this is expected behavior and not a bug. We can demonstrate(*) with a fairly simple test case that when you do this:

cudaSetDevice(0);
cudaMalloc(&d_x, some_size);
cudaSetDevice(1);
cudaMemcpy(h_x, d_x, some_size, cudaMemcpyDeviceToHost);

that the final cudaMemcpy operation is synchronizing both for device 0 and for device 1. In order to synchronize on both devices, the cuda runtime must have available to it a context on both devices. This context creation will occur prior to the actual synchronization, if needed.

Given that, your case is expected. There was not enough memory left on device 0 to allow the 2nd process to create a context as needed by the runtime API, and so the error occurred.

To phrase this in a general way, I would say that if you intend to do a cudaSetDevice(x) (whether explicitly or implicitly - all threads start out with an implicit cudaSetDevice(0) if you use any runtime API calls prior to explicitly issuing a cudaSetDevice() call) in your process, you should be confident that you have permission and ability to establish a context on that device x. If not, the CUDA runtime may not work properly.

This also explains why it requires 2 processes to witness the issue. My single-process test case cannot demonstrate the issue.

Bob, thank you for the detailed explanation! I understand now. Sorry for my late reply! The bug can be closed.