Persistent Memory on Devices in Multi-GPU Programming

Dear All,

I'd like to ask how to make device memory persist after a kernel returns. At the moment, every time I want to call my kernel on the different devices, I have to allocate the memory on each device and copy it from host to device again:

for(devID = 0; devID < N_GPU; devID++) {
      cudaSetDevice(devID);
      cublasCreate(&handle_M[devID]);
      cudaMalloc((void**) &d_R_M[devID], vectSize_M);
      cudaMalloc((void**) &d_B_M[devID], vectSize_M);
      cudaMalloc((void**) &d_data, bdiaSize);
      cudaMalloc((void**) &d_offsets, offsetSize);
      cudaMalloc((void**) &d_X, vectSize);
      cudaMalloc((void**) &d_V_M[devID], vectSize_M);

      cudaMemcpy(d_B_M[devID], h_B_M[devID], vectSize_M, cudaMemcpyDefault);
      cudaMemcpy(d_R_M[devID], h_R_M[devID], vectSize_M, cudaMemcpyDefault);
      cudaMemcpy(d_data, h_data, bdiaSize, cudaMemcpyDefault);
      cudaMemcpy(d_offsets, h_offsets, offsetSize, cudaMemcpyDefault);
      cudaMemcpy(d_X, h_X0, vectSize, cudaMemcpyDefault);
      // R = B (Data Parallel)
      cublasScopy(handle_M[devID], N_M, d_B_M[devID], 1, d_R_M[devID], 1);

      spMV_bdia_gh_kernel_M <<< grid_M, threads_M >>> (d_data, d_X, d_offsets, d_V_M[devID]);
      cudaDeviceSynchronize();
      cublasSaxpy(handle_M[devID], N_M, &minus, d_V_M[devID], 1, d_R_M[devID], 1);
}

The next operation does the same thing: I have to allocate the same data in device memory and copy it from host to each device before calling my kernel there. Otherwise, I get an 'illegal memory access' error.

Any suggestions?

Thanks,

Problem solved. I should have used a different variable for each device, so instead of d_data, I should have used d_data[devID].
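
For anyone who finds this later, here is a minimal sketch of the pattern, reusing the names from my snippet above: keep one pointer per device in an array, do the cudaMalloc/cudaMemcpy loop once, and then reuse those pointers for every later kernel launch. Memory obtained with cudaMalloc stays valid on its device until cudaFree (or context destruction), so nothing needs to be re-allocated or re-copied between kernel calls. The kernel signature, the element types (float*/int*), N_GPU, grid_M/threads_M, and the sizes are assumptions based on the original code.

#include <cuda_runtime.h>

#define N_GPU 2   // assumed device count

// One pointer per device, so every GPU keeps its own persistent copy.
float *d_data[N_GPU], *d_X[N_GPU], *d_V_M[N_GPU];
int   *d_offsets[N_GPU];

// Kernel and launch configuration from the original code, defined elsewhere;
// parameter types are assumed here.
__global__ void spMV_bdia_gh_kernel_M(float *data, float *x, int *offsets, float *v);
extern dim3 grid_M, threads_M;

// Run ONCE: allocate and upload. The pointers stay valid after kernels return.
void setupDevices(const float *h_data, const int *h_offsets, const float *h_X0,
                  size_t bdiaSize, size_t offsetSize, size_t vectSize, size_t vectSize_M)
{
    for (int devID = 0; devID < N_GPU; devID++) {
        cudaSetDevice(devID);
        cudaMalloc((void**) &d_data[devID],    bdiaSize);
        cudaMalloc((void**) &d_offsets[devID], offsetSize);
        cudaMalloc((void**) &d_X[devID],       vectSize);
        cudaMalloc((void**) &d_V_M[devID],     vectSize_M);

        cudaMemcpy(d_data[devID],    h_data,    bdiaSize,   cudaMemcpyHostToDevice);
        cudaMemcpy(d_offsets[devID], h_offsets, offsetSize, cudaMemcpyHostToDevice);
        cudaMemcpy(d_X[devID],       h_X0,      vectSize,   cudaMemcpyHostToDevice);
    }
}

// Run as many times as needed: no cudaMalloc/cudaMemcpy, only launches,
// each device dereferencing only its own pointers.
void spmvAllDevices(void)
{
    for (int devID = 0; devID < N_GPU; devID++) {
        cudaSetDevice(devID);
        spMV_bdia_gh_kernel_M <<< grid_M, threads_M >>> (d_data[devID], d_X[devID],
                                                         d_offsets[devID], d_V_M[devID]);
    }
    for (int devID = 0; devID < N_GPU; devID++) {
        cudaSetDevice(devID);
        cudaDeviceSynchronize();
    }
}

With a single d_data pointer, each loop iteration overwrote it with an allocation belonging to a different device, which is what led to the illegal memory access; indexing by devID keeps each device's allocation alive and addressable for the whole run.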