Hello there
I’ve been experimenting with unified memory lately. I have a very simple test case that I copied from the blog post https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/.
My goal is to pipeline the H2D and D2H prefetch traffic, as in the blog post. I allocate the memory with cudaMallocManaged
and initialize it on the host. Then I issue explicit prefetches to the GPU to bring data in, and later prefetch back to the host to evict it.
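In isolation, the pattern for a single buffer looks roughly like this (a simplified sketch, not the exact code I ran; GPU_DEVICE_ID and the stream are placeholders, and CPU_DEVICE_ID in my code is just cudaCpuDeviceId):

float *buf;
checkCudaErrors(cudaMallocManaged(&buf, bytes_alloc));
// Touch the data on the host first, so the pages start out resident on the CPU
init_data_cpu(buf, num_floats);
// Explicitly prefetch to the GPU before the kernel needs the data...
cudaMemPrefetchAsync(buf, bytes_alloc, GPU_DEVICE_ID, stream);
// ...and later prefetch back to the host (cudaCpuDeviceId) to evict it from GPU memory
cudaMemPrefetchAsync(buf, bytes_alloc, cudaCpuDeviceId, stream);

The full pipelined version with three buffers is further down.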
I was able to overlap the transfers with smaller memory chunks. Specifically:
- If I use 3 buffers of 1 GB each, all the memory transfers overlap correctly
- If I use 3 buffers of 2 GB each, the D2H and H2D traffic is serialized
- If I use 3 buffers of 1.3 GB each, the D2H and H2D traffic is sometimes serialized and sometimes overlapped
The attached picture shows the third case. I ran the same function three times and profiled it.
Link to a larger image: https://ibb.co/NKX8gDJ
What I’m not sure about is why the same sequence of operations results in different overlap patterns, and only for the larger memory chunks.
For your reference, this is the code:
checkCudaErrors(cudaMallocManaged(&dinput1, bytes_alloc));
checkCudaErrors(cudaMallocManaged(&dinput2, bytes_alloc));
checkCudaErrors(cudaMallocManaged(&dinput3, bytes_alloc));
// init data to all 0
init_data_cpu(dinput1, num_floats);
init_data_cpu(dinput2, num_floats);
init_data_cpu(dinput3, num_floats);
checkCudaErrors(cudaDeviceSynchronize());
// do first prefetch for the first kernel
cudaMemPrefetchAsync(dinput1, bytes_alloc, GPU_DEVICE_ID, s2);
cudaEventRecord(e1, s2);
// start the first kernel on stream 1, but wait until the prefetch has completed
cudaEventSynchronize(e1);
cudaEventSynchronize(e2);
// The kernel just adds 1 to each array element; linear grid and blocks, 1024 threads/block (sketched after the listing)
launchSimpleKern(dinput1, num_floats, s1);
cudaEventRecord(e1, s1);
// launch the prefetch for the second kernel
// so it overlaps with compute, and is scheduled after prefetch 1 on s2
cudaStreamSynchronize(s2);
cudaMemPrefetchAsync(dinput2, bytes_alloc, GPU_DEVICE_ID, s2);
cudaEventRecord(e2, s2);
// launch the offload of buffer 1 on s1, so it happens after the compute kernel
// no event wait because this happens after compute on s1 is done
cudaMemPrefetchAsync(dinput1, bytes_alloc, CPU_DEVICE_ID, s1);
// launch the second kernel on s2 (right after prefetch 2)
// it waits on e1 (kernel 1) and e2 (prefetch 2)
cudaEventSynchronize(e1);
cudaEventSynchronize(e2);
launchSimpleKern(dinput2, num_floats, s2);
cudaEventRecord(e1, s2);
// launch the prefetch for the third kernel
// so it overlaps with compute, launched on s3
// cudaStreamSynchronize(s2);
cudaMemPrefetchAsync(dinput3, bytes_alloc, GPU_DEVICE_ID, s3);
cudaEventRecord(e2, s3);
// launch the third kernel, wait on e1 (kernel 2) and e2 (prefetch 3)
cudaEventSynchronize(e1);
launchSimpleKern(dinput3, num_floats, s3);
cudaEventRecord(e1, s3);
// prefetch buffer 1 back to the GPU
cudaMemPrefetchAsync(dinput1, bytes_alloc, GPU_DEVICE_ID, s1);
cudaEventRecord(e2, s1);
// offload buffer 2 to the host, after compute 2 finishes on s2
cudaMemPrefetchAsync(dinput2, bytes_alloc, CPU_DEVICE_ID, s2);
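launchSimpleKern isn’t shown above; it’s just a thin wrapper around an element-wise increment kernel, roughly like this (a sketch matching the description in the comments, not the exact code):

__global__ void simpleKern(float *data, size_t n) {
    // Each thread adds 1 to one array element
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] += 1.0f;
    }
}

void launchSimpleKern(float *data, size_t n, cudaStream_t stream) {
    // Linear grid and blocks, 1024 threads per block, launched on the given stream
    const int threads = 1024;
    const int blocks = (int)((n + threads - 1) / threads);
    simpleKern<<<blocks, threads, 0, stream>>>(data, n);
}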
Thanks a lot for your help.