When you start a new thread, it defaults to device 0. If you want a thread to use device 1, you must set that explicitly in that thread; it will not pick up device changes made by other threads.
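As a minimal sketch of that point (not from the original exchange; it assumes two GPUs are present), each host thread has to make its own device selection:

#include <cuda_runtime.h>
#include <thread>

int main()
{
    cudaSetDevice(0);                 // affects the main thread only

    std::thread worker([] {
        cudaSetDevice(1);             // without this, the worker defaults to device 0
        void *d = nullptr;
        cudaMalloc(&d, 1 << 20);      // this 1 MiB allocation lands on device 1
        cudaFree(d);
    });
    worker.join();
    return 0;
}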
Regarding the rest of your statements, I can't explain it. cudaMemcpy doesn't return an out-of-memory error in my experience. My guess is that it is coming from a cudaMalloc call unintentionally directed at your "full" device. It might be something else, of course. I can't describe the behavior of code I haven't seen, so this is just guesswork.
For cudaMalloc, I check the cudaError_t and it is OK, because the allocation is on GPU1.
But the cudaMemcpy call copies memory from GPU1 to the CPU. I thought I did not need to set the device, because the source pointer is on GPU1. So, for cudaMemcpy, do I still need to set the device to GPU1? Otherwise, will it use GPU0 for the copy and possibly allocate a buffer on GPU0?
For cudaMemcpy specifically, it should not be necessary to do cudaSetDevice before an H->D or D->H transfer. The device is inferred from the device pointer passed to the CUDA runtime.
I can consistently reproduce it with two processes: one process allocates GPU0 memory until it is full; the other allocates memory on GPU1 and copies it to the host. With the same code in a single process there is no error; I can only reproduce it with two processes.
gpu-alloc.cpp
#include <iostream>
#include <cstdint>
#include <cuda_runtime.h>

int
main()
{
    constexpr size_t MEM_PIECE_SIZE = 1024 * 1024;
    size_t memAllocated = 0;
    cudaSetDevice(0);
    // Allocate 1 MiB pieces on device 0 until allocation fails
    do {
        uint8_t *d;
        if (cudaMalloc(&d, MEM_PIECE_SIZE) != cudaSuccess) {
            break;
        }
        memAllocated += MEM_PIECE_SIZE;
    } while (true);
    std::cout << memAllocated << " allocated" << std::endl;
    std::cout << "Input anything to quit" << std::endl;
    char c;
    std::cin >> c;
    // As a test, no need to free memory
}
gpu-test.cpp
#include <iostream>
#include <cstdint>
#include <cuda_runtime.h>

int
main()
{
    uint8_t *d1;
    constexpr size_t IMAGE_SIZE = 1920 * 1080 * 3;
    cudaSetDevice(1);
    cudaError_t e;
    // Allocation on device 1 succeeds
    if ((e = cudaMalloc(&d1, IMAGE_SIZE)) != cudaSuccess) {
        std::cerr << "Cannot allocate image:" << cudaGetErrorString(e) << std::endl;
        return -1;
    }
    cudaSetDevice(0);
    uint8_t *p = new uint8_t[IMAGE_SIZE];
    // D->H copy from the device-1 pointer, issued while device 0 is current
    if ((e = cudaMemcpy(p, d1, IMAGE_SIZE, cudaMemcpyDeviceToHost)) != cudaSuccess) {
        std::cerr << "Cannot copy image:" << cudaGetErrorString(e) << std::endl;
        return -1;
    }
    // As a test, no need to free memory
}
Run ./gpu-alloc in one terminal so that GPU0 is filled; then, in another terminal, run ./gpu-test, and I see "Cannot copy image:out of memory".
Running ./gpu-test under cuda-memcheck, I get:
cuda-memcheck ./gpu-test
========= CUDA-MEMCHECK
========= Program hit cudaErrorMemoryAllocation (error 2) due to "out of memory" on CUDA API call to cudaMemcpy.
Cannot copy image:========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3343d3]
out of memory========= Host Frame:/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudart.so.10.1 (cudaMemcpy + 0x1df) [0x3b8ff]
========= Host Frame:./gpu-test [0xb1f]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
========= Host Frame:./gpu-test [0x999]
=========
========= ERROR SUMMARY: 1 error
I am using CUDA Toolkit 10.1; I am not sure whether this is fixed in 11.3.
I filed bug 3326958. I am not sure about the "Relevant Area"; if it is not appropriate, please help me correct it. Also, the formatting of the description does not seem to be working. :-)
So, in this case, cudaMemcpy can return "out of memory".
cudaMemcpy is both a synchronizing and a blocking call. It synchronizes on device activity, and it blocks the host thread (i.e. it is not asynchronous like a kernel launch). Synchronizing on device activity means that it waits until previously issued work on the device is complete before the copy is allowed to begin.
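As an illustration of that behavior (a sketch assuming a trivial spin kernel, not code from this thread), the copy below cannot begin until the preceding kernel launch has finished, and the host thread is blocked until the copy completes:

#include <cuda_runtime.h>

__global__ void busy_kernel(int *out)
{
    for (int i = 0; i < 1000000; ++i) {   // spin so the copy demonstrably has to wait
        atomicAdd(out, 1);
    }
}

int main()
{
    int *d = nullptr;
    int h = 0;
    cudaMalloc(&d, sizeof(int));
    cudaMemset(d, 0, sizeof(int));
    busy_kernel<<<1, 1>>>(d);                                 // asynchronous launch
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);   // waits for the kernel, then blocks the host
    cudaFree(d);
    return 0;
}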
It appears this is expected behavior and not a bug. We can demonstrate with a fairly simple test case that when you do this (the sequence below is a sketch of the relevant calls from gpu-test.cpp above):
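    cudaSetDevice(1);
    cudaMalloc(&d1, IMAGE_SIZE);       // allocation on device 1 succeeds
    cudaSetDevice(0);                  // device 0 is now the current device
    cudaMemcpy(p, d1, IMAGE_SIZE, cudaMemcpyDeviceToHost);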
the final cudaMemcpy operation synchronizes on both device 0 and device 1. In order to synchronize on both devices, the CUDA runtime must have a context available on both devices. This context creation will occur prior to the actual synchronization, if needed.
Given that, your case is expected. There was not enough memory left on device 0 to allow the 2nd process to create a context as needed by the runtime API, and so the error occurred.
To phrase this in a general way: if you intend to do a cudaSetDevice(x) in your process, whether explicitly or implicitly (all threads start out with an implicit cudaSetDevice(0) if you use any runtime API call before explicitly issuing a cudaSetDevice() call), you should be confident that you have the permission and the ability to establish a context on that device x. If not, the CUDA runtime may not work properly.
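One way to check that up front (a sketch, not from the original thread; the cudaFree(0) call is simply a common idiom to force the runtime to create the context immediately) is to establish the context on every device the process intends to touch at startup and verify the result:

#include <cuda_runtime.h>
#include <iostream>

bool can_use_device(int dev)
{
    if (cudaSetDevice(dev) != cudaSuccess) {
        return false;
    }
    // cudaFree(0) forces the runtime to establish the context for the
    // current device now, rather than lazily inside a later API call.
    return cudaFree(0) == cudaSuccess;
}

int main()
{
    for (int dev : {0, 1}) {
        std::cout << "device " << dev << ": "
                  << (can_use_device(dev) ? "context OK" : "cannot establish context")
                  << std::endl;
    }
    return 0;
}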
This also explains why it requires 2 processes to witness the issue. My single-process test case cannot demonstrate the issue.
It occurs to me that CUDA_VISIBLE_DEVICES might be usable to work around this issue, if the GPUs being used are in disjoint sets from one process to another. This particular idea is not applicable here, but it might be for other users who are experiencing this.
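For example, a process that only ever needs the second GPU could be launched along these lines (hypothetical invocation; note that inside the process the visible GPU is then enumerated as device 0):

CUDA_VISIBLE_DEVICES=1 ./some_app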