Managed memory slow to copy back to host

My kernels have completed and I want the results in CPU memory, but this takes a long time.

After calling the kernel multiple times, I do:
CHECK_CUDA_ERRORS( cudaPeekAtLastError() );
CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );
CHECK_CUDA_ERRORS( cudaMemPrefetchAsync(ld_data_managed, Nx * Ny * sizeof(double), cudaCpuDeviceId, NULL));
CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

Each transfer is 0.2 ms, with a 0.6 ms gap between transfers. Without that prefetch call, the copy back is even slower.

Can you identify what is going on based on the timing signature above? I really want to use unified memory instead of doing the allocations and copies myself, but I don’t know how to improve the speed.
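
For reference, here is a stripped-down sketch of the overall pattern. The kernel body and the host-side initialization are only placeholders for my real code; the sizes match my actual case.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA_ERRORS(call) do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// placeholder kernel; the real kernels do the actual work on ld_data_managed
__global__ void update(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0;
}

int main()
{
    const int Nx = 2048, Ny = 2048;                     // 4M doubles, ~32 MB
    const size_t bytes = (size_t)Nx * Ny * sizeof(double);

    double *ld_data_managed = NULL;
    CHECK_CUDA_ERRORS( cudaMallocManaged(&ld_data_managed, bytes) );

    // first touch on the host (placeholder for the real initialization)
    for (size_t i = 0; i < (size_t)Nx * Ny; ++i) ld_data_managed[i] = 0.0;

    // several kernel calls, all on the NULL stream
    for (int iter = 0; iter < 10; ++iter)
        update<<<(Nx * Ny + 255) / 256, 256>>>(ld_data_managed, Nx * Ny);

    CHECK_CUDA_ERRORS( cudaPeekAtLastError() );
    CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

    // migrate the managed buffer back to host memory before the CPU reads it
    CHECK_CUDA_ERRORS( cudaMemPrefetchAsync(ld_data_managed, bytes, cudaCpuDeviceId, NULL) );
    CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

    printf("first element: %f\n", ld_data_managed[0]);
    CHECK_CUDA_ERRORS( cudaFree(ld_data_managed) );
    return 0;
}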

I doubt I will be able to identify anything, but I have a few questions:

A Tesla P100 SXM2 is generally not found “by itself” in a machine. There are usually 4 or 8 of them, typically in a 2-socket server.

  • is anything else going on with respect to the other GPUs in the system when you capture this?
  • have you done process placement to make sure your application process is running on the socket that is “closest” to the GPU you are using?
  • how big is Nx*Ny? Does this correspond to the 4M elements you mentioned in your other question?
  • is the CPU you are on “idle” other than the application process you are running?
  • it looks to me like you are doing everything in the NULL stream, is that correct? (if so, see the sketch after this list)
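
Just to illustrate that last point, here is a minimal sketch of issuing the prefetch on a dedicated stream instead of the NULL stream. It assumes the CHECK_CUDA_ERRORS macro and the managed allocation from your snippet; I am not claiming this is the fix, only that it is easy to try.

cudaStream_t xfer_stream;
CHECK_CUDA_ERRORS( cudaStreamCreateWithFlags(&xfer_stream, cudaStreamNonBlocking) );

// ... kernels launched as before ...

CHECK_CUDA_ERRORS( cudaPeekAtLastError() );
CHECK_CUDA_ERRORS( cudaDeviceSynchronize() );

// issue the prefetch to the host on the dedicated stream and wait on that stream only
CHECK_CUDA_ERRORS( cudaMemPrefetchAsync(ld_data_managed, Nx * Ny * sizeof(double), cudaCpuDeviceId, xfer_stream) );
CHECK_CUDA_ERRORS( cudaStreamSynchronize(xfer_stream) );

CHECK_CUDA_ERRORS( cudaStreamDestroy(xfer_stream) );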

I don’t think I need any more help. On a V100 it was a lot better: there are still 20 µs gaps, but the overall time is very good. Just to close out, some answers to your questions:
  • Yes, it is a system with 4 P100s.
  • I don’t think there was anything else going on; the behavior is very repeatable.
  • Yes, it is a 2k x 2k row-major array of doubles.
  • Yes, I am only using the NULL stream, for no good reason other than that I wanted to optimize the serial runtime first.

Thank you for your help. Your questions alone prompted me to keep working with managed memory.