Asynchronous memory transfer on Jetson TX1

Hi,

Because pinned memory allocated by cudaHostAlloc (or cudaMallocHost) on Tegra is uncacheable,
we cannot easily transfer memory blocks asynchronously and overlap data transfers.
A simple solution to the uncacheable problem is to use Unified Memory: we can allocate page-locked memory through cudaMallocManaged.
However, what we want to do is explicitly and asynchronously transfer memory blocks and overlap memory transfers for performance.
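For context, the two allocation paths being compared look roughly like this (a minimal sketch; the size N is illustrative and not from the real code):

#include <cuda_runtime.h>

constexpr int N = 1 << 20;  // illustrative size only

int main() {
    // Pinned (page-locked) host memory: uncacheable on Tegra, so CPU-side access
    // is slow, but it can be used as the source/target of cudaMemcpyAsync.
    int *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, sizeof(int) * N, cudaHostAllocDefault);

    // Managed (Unified) memory: cacheable on the CPU side; the CUDA driver
    // migrates/synchronizes it between CPU and GPU.
    int *managed = nullptr;
    cudaMallocManaged((void **)&managed, sizeof(int) * N);

    cudaFreeHost(pinned);
    cudaFree(managed);
    return 0;
}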

ref.
https://stackoverflow.com/questions/27972491/cpu-memory-access-latency-of-data-allocated-with-malloc-vs-cudahostalloc-on
https://devtalk.nvidia.com/default/topic/908507/pinned-memory-slows-cpu-computation/
https://devtalk.nvidia.com/default/topic/979941/jetson-tk1/cuda-memory-performance/
https://devtalk.nvidia.com/default/topic/949519/jetson-tx1/uncached-memory-created-by-cudahostalloc-and-cudamemcpyasync-issues-on-tx1/

Now my question is: is there any way to overlap memory transfers (and kernel executions) on Jetson TX1 (or Tegra)?

Here are my attempts on Jetson TX1.

  1. Transfer managed memory on the host to device memory (outside the managed memory) through cudaMemcpyAsync.
    // N, SZ, kernel, and the streams[] array (created with cudaStreamCreate) are defined elsewhere.
    int* managed, *device;
    CHECK(cudaMalloc((void **)&device, sizeof(int)*N));
    CHECK(cudaMallocManaged((void **)&managed, sizeof(int)*N));
    dim3 block(128);
    dim3 grid((SZ + block.x - 1) / block.x);

    for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
        // Explicit async H->D copy from the managed buffer, then the kernel in the same stream.
        CHECK(cudaMemcpyAsync((void *)(device + i), (void *)(managed + i), sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[n]));
        kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
    }
    

    Kernel executions and memory transfers (H->D) were overlapped.

  2. Transfer device memory (outside the managed memory) to managed memory on the host through cudaMemcpyAsync.
    for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
        // Kernel first, then explicit async D->H copy back into the managed buffer.
        kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
        CHECK(cudaMemcpyAsync((void *)(managed + i), (void *)(device + i), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[n]));
    }
    

    Kernel executions and memory transfers (D->H) were overlapped.

  3. Both directions combined.
    for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
        // H->D copy, kernel, and D->H copy all issued to the same stream.
        CHECK(cudaMemcpyAsync((void *)(device + i), (void *)(managed + i), sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[n]));
        kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
        CHECK(cudaMemcpyAsync((void *)(managed + i), (void *)(device + i), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[n]));
    }
    

    All streams were completely serialized! Why?

Thanks !

Hi,

Here are two suggestions:

1. Please check our relevant CUDA sample:
/usr/local/cuda-9.0/samples/0_Simple/UnifiedMemoryStreams/UnifiedMemoryStreams.cu

2. Check the CUDA stream behavior in detail with NVVP

Thanks.

Thanks for your answer !

I was roughly able to implement the desired behavior thanks to the example.
(TX1 doesn’t support Unified Memory profiling…)

The results are based on nvvp (see the attachments): explicit H->D and D->H transfers were not overlapped when each stream was attached to a managed memory region.

I guess I’m digressing, but I have tried the two approaches shown below (for pipelining host procedures and kernels), and I want to ask why the first one is slower than the second, even though explicit memory transfers should always be more efficient than UM migrations.

// Approach 1: explicit async copies between per-stream managed buffers and device
// memory, pipelined with CPU processing of the previous stream's result.
CHECK(cudaMalloc((void **)&device, sizeof(int)*N));

CHECK(cudaMallocManaged((void **)&managed, sizeof(int*)*NUM_STREAMS));
for (auto i = 0; i < NUM_STREAMS; ++i) {
    CHECK(cudaMallocManaged((void **)&managed[i], sizeof(int)*SZ));
    CHECK(cudaStreamAttachMemAsync(streams[i], (void *)managed[i], 0, cudaMemAttachSingle));
}
for (auto i = 1; i < NUM_STREAMS; ++i) {
    CHECK(cudaStreamSynchronize(streams[i-1]));
    CHECK(cudaMemcpyAsync((void *)(device + i*SZ), (void *)managed[i], sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[i]));
    kernel<<<grid, block, 0, streams[i]>>>(device + i*SZ, SZ);
    CHECK(cudaMemcpyAsync((void *)(managed[i]), (void *)(device + i*SZ), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[i]));
    cpu_process(managed[i-1], SZ);
}

// Approach 2: no explicit copies; only kernels and CPU processing are pipelined,
// relying on UM for any data movement.
CHECK(cudaMallocManaged((void **)&managed, sizeof(int*)*NUM_STREAMS));
for (auto i = 0; i < NUM_STREAMS; ++i) {
    CHECK(cudaMallocManaged((void **)&managed[i], sizeof(int)*SZ));
    CHECK(cudaStreamAttachMemAsync(streams[i], (void *)managed[i], 0, cudaMemAttachSingle));
}
for (auto i = 1; i < NUM_STREAMS; ++i) {
    CHECK(cudaStreamSynchronize(streams[i-1]));
    kernel<<<grid, block, 0, streams[i]>>>(device + i*SZ, SZ);
    cpu_process(managed[i-1], SZ);
}

Thank you.

Hi,

A quick check of your source: this line makes the calling thread wait for stream i-1 to finish before the next iteration's work is issued, so it may serialize the streams.
It’s recommended to check it further.

CHECK(cudaStreamSynchronize(streams[i-1]));
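
For example, one possible restructuring (just a sketch built on your snippet, reusing the same CHECK, streams, grid, block, kernel, and cpu_process assumptions, not a verified fix) is to enqueue the current stream's work first and synchronize stream i-1 only right before the CPU consumes managed[i-1]:

for (auto i = 1; i < NUM_STREAMS; ++i) {
    // Enqueue this iteration's asynchronous work first ...
    CHECK(cudaMemcpyAsync((void *)(device + i*SZ), (void *)managed[i], sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[i]));
    kernel<<<grid, block, 0, streams[i]>>>(device + i*SZ, SZ);
    CHECK(cudaMemcpyAsync((void *)managed[i], (void *)(device + i*SZ), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[i]));

    // ... and only then wait for the previous stream, right before the CPU
    // actually needs managed[i-1] (which is attached to that stream).
    CHECK(cudaStreamSynchronize(streams[i-1]));
    cpu_process(managed[i-1], SZ);
}

This way the synchronization only guards the CPU access to the previous stream's buffer instead of stalling the whole issue loop.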

By the way, could you profile the default UnifiedMemoryStreams sample with NVVP in your environment?
We checked it on Jetson TX2, and the concurrent streams run correctly in our environment.

Thanks.

I profiled the default sample and, as you said, kernels in different streams run concurrently on TX1 too.
However, that is not what I wanted to ask.

When I copy data from managed regions on the host to a (non-managed) device memory block through the cudaMemcpyAsync API,
the following things can be observed:

  • kernels in different streams can run concurrently.
  • memory transfers and kernel executions in different streams can be overlapped.
  • H->D transfers and D->H transfers in different streams cannot run concurrently.

For example, the DtoH and HtoD copies issued from different streams in this code cannot be overlapped, as you can see in the attachment.

cudaMalloc(&device, sizeof(int)*N);

cudaMallocManaged(&managed, sizeof(int*)*NUM_STREAMS);
for (auto i = 0; i < NUM_STREAMS; ++i) {
    cudaMallocManaged(&managed[i], sizeof(int)*SZ);
}
for(auto i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamAttachMemAsync(streams[i], (void *)managed[i], 0, cudaMemAttachSingle);
}
for(auto i = 0; i < NUM_STREAMS; ++i) {
    cudaMemcpyAsync((void *)(device+i*SZ), (void *)managed[i], sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync((void *)(managed[i]), (void *)(device+i*SZ), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[i]);
}

Note:
The reason I want to do such a strange thing is that Unified Memory is necessary to get (cacheable) page-locked memory, due to the uncacheable behaviour of pinned memory on Tegra, while explicit memory transfers, instead of migrations, are desirable for performance.
Is this approach wrong?

Thanks.

(Attachment: Screenshot from 2018-01-09 19-22-28.png)

Hi,

In case you hit the memory bus limit, could you share how much data you transfer at a time?
Also, could you arrange a sample that reproduces this issue?

Thanks.

Hi,

I solved this myself. Thanks for your answers.
The reason was simple: TX1 has only one copy engine, so memory transfers cannot run concurrently.
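
For anyone who finds this later, the copy engine count can be checked from cudaDeviceProp (a small self-contained sketch, not from my original code):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount is the number of copy engines:
    // 1 = copies can overlap kernel execution, but HtoD and DtoH copies cannot overlap each other;
    // 2 = HtoD and DtoH copies can also run concurrently.
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}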

BTW, can I assume that a managed host memory block is page-locked?

Hi,

1. Page-locked memory is CPU pinned memory.
2. Unified memory consists of two memory buffers (CPU & GPU) with the same address, which are automatically synchronized by the CUDA driver.
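
A minimal sketch of item 2, where the same managed pointer is used from both CPU and GPU and the driver handles the synchronization (illustrative code, not taken from a sample):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    constexpr int n = 256;
    int *data = nullptr;
    cudaMallocManaged((void **)&data, sizeof(int) * n);

    for (int i = 0; i < n; ++i) data[i] = i;  // CPU writes through the managed pointer

    increment<<<1, n>>>(data, n);             // GPU uses the same pointer
    cudaDeviceSynchronize();                  // synchronize before the CPU touches it again

    printf("data[0] = %d\n", data[0]);        // prints 1
    cudaFree(data);
    return 0;
}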

Thanks

Yes, I understand the difference between their purposes, but explicitly transferring managed memory blocks is not forbidden, and I can transfer them via cudaMemcpyAsync.
So I am wondering whether I can treat unified memory on the CPU side as cached pinned memory, as long as I do not use the unified memory on the device.

Thanks.

Hi,

You can check our document:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

Thanks.