Multiple GPUs and unified memory


I am wondering how unified memory works on multi-GPU systems, both NVLink-based and non-NVLink-based. I have provided a simple code example. In this case, I have 4 GPUs, and I am binding one thread to each GPU using OpenMP. I assume OpenMP makes the kernel submissions to the different GPUs concurrent. Each kernel does some computation on a different offset into the same arrays “a” and “b”.

My questions are:

  • Is this code valid for both systems?
  • Will the CUDA runtime do a P2P GPU transfer for array "a", even though it is only read here?
  • Will the CUDA runtime do a P2P GPU transfer for array "b"?

Thanks in advance

__global__ void kernel(int *a, int *b, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n)
    b[tid] = a[tid];
}

int main(int argc, char const *argv[]) {
  int *a, *b, n = 65536;
  cudaMallocManaged(&a, n * sizeof(int));
  cudaMallocManaged(&b, n * sizeof(int));

  int num_of_gpu = 4;
  int chunk = n / num_of_gpu;
  #pragma omp parallel for
  for (int i = 0; i < num_of_gpu; ++i) {
    cudaSetDevice(i);  // bind this OpenMP thread to GPU i
    kernel<<< chunk / 1024, 1024 >>>(&a[chunk * i], &b[chunk * i], chunk);
    cudaDeviceSynchronize();
  }
  return 0;
}

You may want to read the documentation:

There’s not enough information in your question to answer it. NVLink doesn’t make code that was previously valid become “invalid”.
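One thing you can do on your specific systems is query the relevant capabilities at runtime: whether each GPU supports concurrent managed access (i.e. can fault managed pages over on demand) and whether P2P access is possible between each pair of GPUs. A minimal host-side sketch using the standard CUDA runtime API (compile with nvcc; output depends on your hardware, so none is shown):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int num_gpus = 0;
  cudaGetDeviceCount(&num_gpus);
  for (int i = 0; i < num_gpus; ++i) {
    int concurrent = 0;
    // Can this device access managed memory concurrently with the CPU
    // (pages migrated/faulted on demand rather than bulk-copied)?
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, i);
    printf("GPU %d: concurrentManagedAccess = %d\n", i, concurrent);
    for (int j = 0; j < num_gpus; ++j) {
      if (i == j) continue;
      int can_peer = 0;
      // Is a direct P2P path (e.g. over NVLink or PCIe) available i -> j?
      cudaDeviceCanAccessPeer(&can_peer, i, j);
      printf("  P2P %d -> %d: %d\n", i, j, can_peer);
    }
  }
  return 0;
}
```

On systems where both attributes come back 1, the runtime has the option of servicing managed-memory accesses over the peer path rather than migrating pages through host memory.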