Hello there
I’ve been experimenting with unified memory lately. I have a very simple test case that I copied from the blog post https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/.
My goal is to pipeline the H2D and D2H prefetch traffic, as in the blog post. I allocate the memory with cudaMallocManaged
and initialize it on the host. Then I issue explicit prefetches to the GPU to bring data in, and later prefetch back to the host to evict it.
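In isolation, the pattern for a single buffer looks roughly like this (a simplified sketch, not the exact code I ran; GPU_DEVICE_ID and the stream are placeholders, and CPU_DEVICE_ID in my code is just cudaCpuDeviceId):

float *buf;
checkCudaErrors(cudaMallocManaged(&buf, bytes_alloc));
// Touch the data on the host first, so the pages start out resident on the CPU
init_data_cpu(buf, num_floats);
// Explicitly prefetch to the GPU before the kernel needs the data...
cudaMemPrefetchAsync(buf, bytes_alloc, GPU_DEVICE_ID, stream);
// ...and later prefetch back to the host (cudaCpuDeviceId) to evict it from GPU memory
cudaMemPrefetchAsync(buf, bytes_alloc, cudaCpuDeviceId, stream);

The full pipelined version with three buffers is further down.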
I was able to overlap the transfers with smaller memory chunks. Specifically:
- If I use 3 buffers of 1 GB each, all the memory transfers overlap correctly
- If I use 3 buffers of 2 GB each, the D2H and H2D traffic is serialized
- If I use 3 buffers of 1.3 GB each, the D2H and H2D traffic is sometimes serialized and sometimes overlapped
The attached picture shows the third case. I ran the same function three times and profiled it.
Link to a larger image: https://ibb.co/NKX8gDJ
What I’m not sure about is why the same sequence of operations results in different overlap patterns, and only for the larger memory chunks.
For your reference, this is the code:
checkCudaErrors(cudaMallocManaged(&dinput1, bytes_alloc));
checkCudaErrors(cudaMallocManaged(&dinput2, bytes_alloc));
checkCudaErrors(cudaMallocManaged(&dinput3, bytes_alloc));
// init data to all 0
init_data_cpu(dinput1, num_floats);
init_data_cpu(dinput2, num_floats);
init_data_cpu(dinput3, num_floats);
checkCudaErrors(cudaDeviceSynchronize());
// do first prefetch for the first kernel
cudaMemPrefetchAsync(dinput1, bytes_alloc, GPU_DEVICE_ID, s2);
cudaEventRecord(e1, s2);
// start the first kernel on stream 1, but wait until the prefetch has completed
cudaEventSynchronize(e1);
cudaEventSynchronize(e2);
// The kernel just adds 1 to each array element; linear grid and blocks, 1024 threads/block (sketched after the listing)
launchSimpleKern(dinput1, num_floats, s1);
cudaEventRecord(e1, s1);
// launch the prefetch for the second kernel
// so it overlaps with compute, and is scheduled after prefetch 1 on s2
cudaStreamSynchronize(s2);
cudaMemPrefetchAsync(dinput2, bytes_alloc, GPU_DEVICE_ID, s2);
cudaEventRecord(e2, s2);
// launch the offload of buffer 1 on s1, so it happens after the compute kernel
// no event wait because this happens after compute on s1 is done
cudaMemPrefetchAsync(dinput1, bytes_alloc, CPU_DEVICE_ID, s1);
// launch the second kernel on s2 (right after prefetch 2)
// it waits on e1 (kernel 1) and e2 (prefetch 2)
cudaEventSynchronize(e1);
cudaEventSynchronize(e2);
launchSimpleKern(dinput2, num_floats, s2);
cudaEventRecord(e1, s2);
// launch the prefetch for the third kernel
// so it overlaps with compute, launched on s3
// cudaStreamSynchronize(s2);
cudaMemPrefetchAsync(dinput3, bytes_alloc, GPU_DEVICE_ID, s3);
cudaEventRecord(e2, s3);
// launch the third kernel, wait on e1 (kernel 2) and e2 (prefetch 3)
cudaEventSynchronize(e1);
launchSimpleKern(dinput3, num_floats, s3);
cudaEventRecord(e1, s3);
// prefetch buffer 1 back to the GPU
cudaMemPrefetchAsync(dinput1, bytes_alloc, GPU_DEVICE_ID, s1);
cudaEventRecord(e2, s1);
// offload buffer 2 to the host, after compute 2 finishes on s2
cudaMemPrefetchAsync(dinput2, bytes_alloc, CPU_DEVICE_ID, s2);
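launchSimpleKern isn’t shown above; it’s just a thin wrapper around an element-wise increment kernel, roughly like this (a sketch matching the description in the comments, not the exact code):

__global__ void simpleKern(float *data, size_t n) {
    // Each thread adds 1 to one array element
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx < n) {
        data[idx] += 1.0f;
    }
}

void launchSimpleKern(float *data, size_t n, cudaStream_t stream) {
    // Linear grid and blocks, 1024 threads per block, launched on the given stream
    const int threads = 1024;
    const int blocks = (int)((n + threads - 1) / threads);
    simpleKern<<<blocks, threads, 0, stream>>>(data, n);
}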
Thanks a lot for your help.