Managed memory slow to copy back to host

My kernels have completed and I want the results in CPU memory, but this takes a long time.

After calling the kernel multiple times, I do:
CHECK_CUDA_ERRORS( cudaPeekAtLastError() );
CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );
CHECK_CUDA_ERRORS( cudaMemPrefetchAsync(ld_data_managed, Nx * Ny * sizeof(double), cudaCpuDeviceId, NULL));
CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

Each transfer is 0.2 ms, with a 0.6 ms gap between transfers. Without that prefetch call, the copy back is even slower.

Can you identify what is going on based on the timing signature above? I really want to use unified memory instead of doing the allocations and copies myself, but I don’t know how to improve the speed.
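
For reference, here is a stripped-down sketch of the overall pattern. The kernel body and the host-side initialization are only placeholders for my real code; the sizes match my actual case.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA_ERRORS(call) do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// placeholder kernel; the real kernels do the actual work on ld_data_managed
__global__ void update(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0;
}

int main()
{
    const int Nx = 2048, Ny = 2048;                     // 4M doubles, ~32 MB
    const size_t bytes = (size_t)Nx * Ny * sizeof(double);

    double *ld_data_managed = NULL;
    CHECK_CUDA_ERRORS( cudaMallocManaged(&ld_data_managed, bytes) );

    // first touch on the host (placeholder for the real initialization)
    for (size_t i = 0; i < (size_t)Nx * Ny; ++i) ld_data_managed[i] = 0.0;

    // several kernel calls, all on the NULL stream
    for (int iter = 0; iter < 10; ++iter)
        update<<<(Nx * Ny + 255) / 256, 256>>>(ld_data_managed, Nx * Ny);

    CHECK_CUDA_ERRORS( cudaPeekAtLastError() );
    CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

    // migrate the managed buffer back to host memory before the CPU reads it
    CHECK_CUDA_ERRORS( cudaMemPrefetchAsync(ld_data_managed, bytes, cudaCpuDeviceId, NULL) );
    CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

    printf("first element: %f\n", ld_data_managed[0]);
    CHECK_CUDA_ERRORS( cudaFree(ld_data_managed) );
    return 0;
}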

I doubt I will be able to identify anything, but I have a few questions:

A Tesla P100 SXM2 is generally not found “by itself” in a machine. There are usually 4 or 8 of them, typically in a 2-socket server.

  • is anything else going on with respect to the other GPUs in the system when you capture this?
  • have you done process placement to make sure your application process is running on the socket that is “closest” to the GPU you are using?
  • how big is Nx*Ny? Does this correspond to the 4M elements you mentioned in your other question?
  • is the CPU you are on “idle” other than the application process you are running?
  • it looks to me like you are doing everything in the NULL stream, is that correct? (if so, see the sketch after this list)
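
Just to illustrate that last point, here is a minimal sketch of issuing the prefetch on a dedicated stream instead of the NULL stream. It assumes the CHECK_CUDA_ERRORS macro and the managed allocation from your snippet; I am not claiming this is the fix, only that it is easy to try.

cudaStream_t xfer_stream;
CHECK_CUDA_ERRORS( cudaStreamCreateWithFlags(&xfer_stream, cudaStreamNonBlocking) );

// ... kernels launched as before ...

CHECK_CUDA_ERRORS( cudaPeekAtLastError() );
CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

// issue the prefetch to the host on the dedicated stream and wait on that stream only
CHECK_CUDA_ERRORS( cudaMemPrefetchAsync(ld_data_managed, Nx * Ny * sizeof(double), cudaCpuDeviceId, xfer_stream) );
CHECK_CUDA_ERRORS( cudaStreamSynchronize(xfer_stream) );

CHECK_CUDA_ERRORS( cudaStreamDestroy(xfer_stream) );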

I don’t think I need any more help. On a V100 it was a lot better: there are still 20 µs gaps, but the overall time is very good. Just to close out, some answers to your questions:
  • Yes, it is a system with 4 P100s.
  • I don’t think there was anything else going on; the behavior is very repeatable.
  • Yes, it is a 2k x 2k row-major array of doubles.
  • Yes, I am only using the NULL stream, for no good reason other than that I wanted to optimize the serial runtime first.

Thank you for your help. Your questions alone prompted me to keep working with managed memory.