Unexpected managed (unified) memory behaviour

Hello there

I’ve been experimenting with unified memory lately. I have a very simple test case which I copied from the blog post https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/.

My goal is to overlap the H2D and D2H prefetch traffic in a pipeline, as in the blog post. I allocate the memory with cudaMallocManaged and initialize it on the host. Then I issue explicit prefetches to the GPU to bring data in, and later prefetch back to the host to evict it (see the short sketch below).
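Concretely, by "prefetch to the GPU" and "prefetch to the host" I just mean cudaMemPrefetchAsync with the GPU device ordinal or cudaCpuDeviceId as the destination, roughly like this (ptr, bytes and s are placeholders; CPU_DEVICE_ID in the full listing further down stands for cudaCpuDeviceId):

    // bring the managed buffer onto the GPU ahead of the kernel that reads it
    cudaMemPrefetchAsync(ptr, bytes, GPU_DEVICE_ID, s);
    // later, evict it back to host memory to make room for the next buffer
    cudaMemPrefetchAsync(ptr, bytes, cudaCpuDeviceId, s);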

I was able to overlap the transfers with smaller memory chunks. Specifically:

  1. With 3 buffers of 1 GB each, all the memory transfers overlap correctly
  2. With 3 buffers of 2 GB each, the D2H and H2D traffic is serialized
  3. With 3 buffers of 1.3 GB each, the D2H and H2D traffic is sometimes serialized and sometimes overlapped

The attached picture shows the third case; I ran the same function three times and profiled it.


Link to a larger image: https://ibb.co/NKX8gDJ

What I don't understand is why the same sequence of operations produces different overlap patterns, and why this only happens with the larger memory chunks.

For your reference, this is the code:

    checkCudaErrors(cudaMallocManaged(&dinput1, bytes_alloc));
    checkCudaErrors(cudaMallocManaged(&dinput2, bytes_alloc));
    checkCudaErrors(cudaMallocManaged(&dinput3, bytes_alloc));

    // init data to all 0
    init_data_cpu(dinput1, num_floats);
    init_data_cpu(dinput2, num_floats);
    init_data_cpu(dinput3, num_floats);
    checkCudaErrors(cudaDeviceSynchronize());

    // do first prefetch for the first kernel
    cudaMemPrefetchAsync(dinput1, bytes_alloc, GPU_DEVICE_ID, s2);
    cudaEventRecord(e1, s2);

    // start the first kernel on stream 1, but wait until the prefetch has completed
    cudaEventSynchronize(e1);
    cudaEventSynchronize(e2);
    // The kernel just adds 1 to each array element; linear grid and blocks, 1024 threads/block
    launchSimpleKern(dinput1, num_floats, s1);  
    cudaEventRecord(e1, s1);

    // launch the prefetch for the second kernel
    // to overlap with the compute kernel; ordered after prefetch 1 since both are issued on s2
    cudaStreamSynchronize(s2);
    cudaMemPrefetchAsync(dinput2, bytes_alloc, GPU_DEVICE_ID, s2);
    cudaEventRecord(e2, s2);

    // launch the offload of buffer 1 on s1, so it happens after the compute kernel
    // no event wait needed because stream ordering on s1 makes it run after the compute
    cudaMemPrefetchAsync(dinput1, bytes_alloc, CPU_DEVICE_ID, s1);

    // launch the second kernel on s2 (right after prefetch 2)
    // it waits on e1 (kernel 1) and e2 (prefetch)
    cudaEventSynchronize(e1);
    cudaEventSynchronize(e2);
    launchSimpleKern(dinput2, num_floats, s2);
    cudaEventRecord(e1, s2);

    // launch the prefetch for the third kernel
    // to be paralleled with compute, launch on s3
    // cudaStreamSynchronize(s2);
    cudaMemPrefetchAsync(dinput3, bytes_alloc, GPU_DEVICE_ID, s3);
    cudaEventRecord(e2, s3);

    // launch the third kernel, waiting on e1 (kernel 2); prefetch 3 is already ordered before it on s3
    cudaEventSynchronize(e1);
    launchSimpleKern(dinput3, num_floats, s3);
    cudaEventRecord(e1, s3);

    // prefetch buffer 1 back to the GPU (after its offload, both on s1)
    cudaMemPrefetchAsync(dinput1, bytes_alloc, GPU_DEVICE_ID, s1);
    cudaEventRecord(e2, s1);

    // offload content for 2, after compute 2 finishes on s2
    cudaMemPrefetchAsync(dinput2, bytes_alloc, CPU_DEVICE_ID, s2);

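For completeness, s1/s2/s3 and e1/e2 are ordinary streams and events created with cudaStreamCreate and cudaEventCreate. The kernel and the host-side init are essentially the following (simplified sketch):

    __global__ void simpleKern(float *data, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;   // add 1 to each element
    }

    // launch wrapper: linear grid, 1024 threads per block
    void launchSimpleKern(float *data, size_t n, cudaStream_t s)
    {
        const unsigned int threads = 1024;
        const unsigned int blocks  = (unsigned int)((n + threads - 1) / threads);
        simpleKern<<<blocks, threads, 0, s>>>(data, n);
    }

    // host-side init: set every element to 0 on the CPU
    void init_data_cpu(float *data, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            data[i] = 0.0f;
    }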
Thanks a lot for your help.