I am wondering how unified memory works on multi-GPU systems, both NVLink-based and non-NVLink-based. I have provided a simple code example below. In this case I have 4 GPUs, and I am binding one thread to each GPU using OpenMP; I assume OpenMP makes the kernel submissions to the different GPUs concurrent. Each kernel does some computation on a different address range of the same arrays "a" and "b".
My questions are:
Is this code valid on both kinds of systems?
Will the CUDA runtime do P2P GPU transfers for array "a", even though it is only read here?
Will the CUDA runtime do P2P GPU transfers for array "b"?
Thanks in advance
#include <cuda_runtime.h>

__global__ void kernel(int *a, int *b, int n) {
    // grid-stride loop so a single-block launch still covers all n elements
    for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += gridDim.x * blockDim.x)
        b[tid] = a[tid];
}

int main(int argc, char const *argv[]) {
    int *a, *b, n = 65536;
    cudaMallocManaged(&a, n * sizeof(int));   // pass the address of the pointer
    cudaMallocManaged(&b, n * sizeof(int));
    int num_of_gpu = 4;
    #pragma omp parallel for
    for (int i = 0; i < num_of_gpu; ++i) {
        cudaSetDevice(i);
        // each thread/GPU works on its own quarter of a and b
        kernel<<<1, 1024>>>(&a[(n / num_of_gpu) * i], &b[(n / num_of_gpu) * i], n / num_of_gpu);
        cudaDeviceSynchronize();
    }
    cudaFree(a);
    cudaFree(b);
    return 0;
}
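One way to see how a given system will behave with this pattern is to query the relevant device attributes at runtime. Below is a minimal sketch (not part of the original question) using standard CUDA runtime calls; it simply reports, for each GPU, whether concurrent managed access is supported and whether each GPU pair has peer (NVLink or PCIe P2P) access:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int i = 0; i < ndev; ++i) {
        int concurrent = 0;
        // non-zero where GPUs can access managed memory concurrently with the CPU
        // (typically Pascal or newer on Linux); zero on Windows and pre-Pascal GPUs
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, i);
        printf("GPU %d concurrentManagedAccess = %d\n", i, concurrent);
        for (int j = 0; j < ndev; ++j) {
            if (i == j) continue;
            int canPeer = 0;
            // non-zero if GPU i can directly access GPU j's memory (NVLink or PCIe P2P)
            cudaDeviceCanAccessPeer(&canPeer, i, j);
            printf("  GPU %d -> GPU %d peer access: %d\n", i, j, canPeer);
        }
    }
    return 0;
}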
How to allocate unified memory that can use all available GPU memory?
I have found that unified memory only considers one GPU when calling cudaMallocManaged, even when there are multiple GPUs in the system.
It can’t be done. There is no facility in CUDA to create a single allocation (one pointer) where part of it is associated with one GPU and part with another.
Having said that, you can allocate unified memory that is larger than the memory available on a single GPU. That requires an oversubscription-ready system (a Pascal or newer GPU, and Linux) with sufficient host memory, and in a multi-GPU setup there are additional issues to consider. These issues depend on the system configuration (whether or not the GPUs are on the same fabric). You can get insight by reading the managed memory section of the programming guide. I’ve already provided the links above.
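For illustration, here is a minimal sketch of oversubscribing a single managed allocation, assuming an oversubscription-ready system (Pascal or newer GPU, Linux) with enough host memory. The ~1.5x sizing relative to GPU 0 and the touch kernel are arbitrary choices for the example, not part of the answer above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float *p, size_t n) {
    // grid-stride loop: touching a page faults it into GPU memory on demand
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += size_t(gridDim.x) * blockDim.x)
        p[i] = 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // assumption: allocate ~1.5x the memory of GPU 0; requires enough host RAM
    // and an oversubscription-ready system (Pascal or newer, Linux)
    size_t bytes = prop.totalGlobalMem + prop.totalGlobalMem / 2;
    size_t n = bytes / sizeof(float);

    float *buf = nullptr;
    if (cudaMallocManaged(&buf, n * sizeof(float)) != cudaSuccess) {
        printf("allocation failed\n");
        return 1;
    }

    cudaSetDevice(0);
    touch<<<1024, 256>>>(buf, n);   // pages are migrated/evicted as the kernel runs
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}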