Asynchronous memory transfer on Jetson TX1

Hi,

Because pinned memory allocated by cudaHostAlloc (or cudaMallocHost) on Tegra is uncacheable,
we cannot easily transfer memory blocks asynchronously and overlap data transfers.
A simple solution to the uncacheable problem is to use Unified Memory: we can allocate page-locked memory through cudaMallocManaged.
However, what we want to do is explicitly and asynchronously transfer memory blocks and overlap memory transfers for performance.
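For context, the two allocation paths being compared look roughly like this (a minimal sketch; the size N is illustrative and not from the real code):

#include <cuda_runtime.h>

constexpr int N = 1 << 20;  // illustrative size only

int main() {
    // Pinned (page-locked) host memory: uncacheable on Tegra, so CPU-side access
    // is slow, but it can be used as the source/target of cudaMemcpyAsync.
    int *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, sizeof(int) * N, cudaHostAllocDefault);

    // Managed (Unified) memory: cacheable on the CPU side; the CUDA driver
    // migrates/synchronizes it between CPU and GPU.
    int *managed = nullptr;
    cudaMallocManaged((void **)&managed, sizeof(int) * N);

    cudaFreeHost(pinned);
    cudaFree(managed);
    return 0;
}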

ref.
https://stackoverflow.com/questions/27972491/cpu-memory-access-latency-of-data-allocated-with-malloc-vs-cudahostalloc-on
https://devtalk.nvidia.com/default/topic/908507/pinned-memory-slows-cpu-computation/
https://devtalk.nvidia.com/default/topic/979941/jetson-tk1/cuda-memory-performance/
https://devtalk.nvidia.com/default/topic/949519/jetson-tx1/uncached-memory-created-by-cudahostalloc-and-cudamemcpyasync-issues-on-tx1/

Now my question is: is there any way to overlap memory transfers (and kernel executions) on Jetson TX1 (or Tegra)?

Here are my attempts on Jetson TX1.

  1. Transfer managed memory on the host to device memory (outside the managed memory) through cudaMemcpyAsync.
    // N, SZ, kernel, and the streams[] array (created with cudaStreamCreate) are defined elsewhere.
    int* managed, *device;
    CHECK(cudaMalloc((void **)&device, sizeof(int)*N));
    CHECK(cudaMallocManaged((void **)&managed, sizeof(int)*N));
    dim3 block(128);
    dim3 grid((SZ + block.x - 1) / block.x);

    for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
        // Explicit async H->D copy from the managed buffer, then the kernel in the same stream.
        CHECK(cudaMemcpyAsync((void *)(device + i), (void *)(managed + i), sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[n]));
        kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
    }
    

    Kernel executions and memory transfers (H->D) were overlapped.

  2. Transfer device memory (outside the managed memory) to managed memory on the host through cudaMemcpyAsync.
    for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
        // Kernel first, then explicit async D->H copy back into the managed buffer.
        kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
        CHECK(cudaMemcpyAsync((void *)(managed + i), (void *)(device + i), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[n]));
    }
    

    Kernel executions and memory transfers (D->H) were overlapped.

  3. Both directions combined.
    for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
        // H->D copy, kernel, and D->H copy all issued to the same stream.
        CHECK(cudaMemcpyAsync((void *)(device + i), (void *)(managed + i), sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[n]));
        kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
        CHECK(cudaMemcpyAsync((void *)(managed + i), (void *)(device + i), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[n]));
    }
    

    All streams were completely serialized! Why?

Thanks !

Hi,

Here are two suggestions:

1. Please check our relevant CUDA sample:
/usr/local/cuda-9.0/samples/0_Simple/UnifiedMemoryStreams/UnifiedMemoryStreams.cu

2. Check the CUDA stream behavior in detail with NVVP

Thanks.

Thanks for your answer !

I was roughly able to implement the desired behavior thanks to the example.
(TX1 doesn’t support Unified Memory profiling…)

The results are based on nvvp (see the attachments): explicit H->D and D->H transfers were not overlapped when each stream was attached to a managed memory region.

I guess I’m digressing, but I have tried the two approaches shown below (for pipelining host procedures and kernels), and I want to ask why the first one is slower than the second, even though explicit memory transfers should always be more efficient than UM migrations.

// Approach 1: explicit async copies between per-stream managed buffers and device
// memory, pipelined with CPU processing of the previous stream's result.
CHECK(cudaMalloc((void **)&device, sizeof(int)*N));

CHECK(cudaMallocManaged((void **)&managed, sizeof(int*)*NUM_STREAMS));
for (auto i = 0; i < NUM_STREAMS; ++i) {
    CHECK(cudaMallocManaged((void **)&managed[i], sizeof(int)*SZ));
    CHECK(cudaStreamAttachMemAsync(streams[i], (void *)managed[i], 0, cudaMemAttachSingle));
}
for (auto i = 1; i < NUM_STREAMS; ++i) {
    CHECK(cudaStreamSynchronize(streams[i-1]));
    CHECK(cudaMemcpyAsync((void *)(device + i*SZ), (void *)managed[i], sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[i]));
    kernel<<<grid, block, 0, streams[i]>>>(device + i*SZ, SZ);
    CHECK(cudaMemcpyAsync((void *)(managed[i]), (void *)(device + i*SZ), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[i]));
    cpu_process(managed[i-1], SZ);
}

// Approach 2: no explicit copies; only kernels and CPU processing are pipelined,
// relying on UM for any data movement.
CHECK(cudaMallocManaged((void **)&managed, sizeof(int*)*NUM_STREAMS));
for (auto i = 0; i < NUM_STREAMS; ++i) {
    CHECK(cudaMallocManaged((void **)&managed[i], sizeof(int)*SZ));
    CHECK(cudaStreamAttachMemAsync(streams[i], (void *)managed[i], 0, cudaMemAttachSingle));
}
for (auto i = 1; i < NUM_STREAMS; ++i) {
    CHECK(cudaStreamSynchronize(streams[i-1]));
    kernel<<<grid, block, 0, streams[i]>>>(device + i*SZ, SZ);
    cpu_process(managed[i-1], SZ);
}

Thank you.

Hi,

A quick check of your source: this line makes the calling thread wait for stream i-1 to finish before the next iteration's work is issued, so it may serialize the streams.
It’s recommended to check it further.

CHECK(cudaStreamSynchronize(streams[i-1]));
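
For example, one possible restructuring (just a sketch built on your snippet, reusing the same CHECK, streams, grid, block, kernel, and cpu_process assumptions, not a verified fix) is to enqueue the current stream's work first and synchronize stream i-1 only right before the CPU consumes managed[i-1]:

for (auto i = 1; i < NUM_STREAMS; ++i) {
    // Enqueue this iteration's asynchronous work first ...
    CHECK(cudaMemcpyAsync((void *)(device + i*SZ), (void *)managed[i], sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[i]));
    kernel<<<grid, block, 0, streams[i]>>>(device + i*SZ, SZ);
    CHECK(cudaMemcpyAsync((void *)managed[i], (void *)(device + i*SZ), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[i]));

    // ... and only then wait for the previous stream, right before the CPU
    // actually needs managed[i-1] (which is attached to that stream).
    CHECK(cudaStreamSynchronize(streams[i-1]));
    cpu_process(managed[i-1], SZ);
}

This way the synchronization only guards the CPU access to the previous stream's buffer instead of stalling the whole issue loop.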

By the way, could you profile the default UnifiedMemoryStreams sample with NVVP in your environment?
We checked it on Jetson TX2, and the concurrent streams run correctly in our environment.

Thanks.

I profiled the default sample and, as you said, kernels in different streams run concurrently on TX1 too.
However, that is not what I wanted to ask.

When I copy data from managed regions on the host to a (non-managed) device memory block through the cudaMemcpyAsync API,
the following things can be observed:

  • kernels in different streams can run concurrently.
  • memory transfers and kernel executions in different streams can be overlapped.
  • H->D transfers and D->H transfers in different streams cannot run concurrently.

For example, the DtoH and HtoD copies issued from different streams in this code cannot be overlapped, as you can see in the attachment.

cudaMalloc(&device, sizeof(int)*N);

cudaMallocManaged(&managed, sizeof(int*)*NUM_STREAMS);
for (auto i = 0; i < NUM_STREAMS; ++i) {
    cudaMallocManaged(&managed[i], sizeof(int)*SZ);
}
for(auto i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamAttachMemAsync(streams[i], (void *)managed[i], 0, cudaMemAttachSingle);
}
for(auto i = 0; i < NUM_STREAMS; ++i) {
    cudaMemcpyAsync((void *)(device+i*SZ), (void *)managed[i], sizeof(int)*SZ, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync((void *)(managed[i]), (void *)(device+i*SZ), sizeof(int)*SZ, cudaMemcpyDeviceToHost, streams[i]);
}

Note:
The reason I want to do such a strange thing is that Unified Memory is necessary to get (cacheable) page-locked memory, due to the uncacheable behaviour of pinned memory on Tegra, while explicit memory transfers, instead of migrations, are desirable for performance.
Is this approach wrong?

Thanks.

(Attachment: Screenshot from 2018-01-09 19-22-28.png)

Hi,

In case you hit the memory bus limit, could you share how much data you transfer at a time?
Also, could you arrange a sample that reproduces this issue?

Thanks.

Hi,

I solved this myself. Thanks for your answers.
The reason was simple: TX1 has only one copy engine, so memory transfers cannot run concurrently.
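
For anyone who finds this later, the copy engine count can be checked from cudaDeviceProp (a small self-contained sketch, not from my original code):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    // asyncEngineCount is the number of copy engines:
    // 1 = copies can overlap kernel execution, but HtoD and DtoH copies cannot overlap each other;
    // 2 = HtoD and DtoH copies can also run concurrently.
    printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);
    return 0;
}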

BTW, can I assume that a managed host memory block is page-locked?

Hi,

1. Page-locked memory is CPU pinned memory.
2. Unified memory consists of two memory buffers (CPU & GPU) with the same address, which are automatically synchronized by the CUDA driver.
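
A minimal sketch of item 2, where the same managed pointer is used from both CPU and GPU and the driver handles the synchronization (illustrative code, not taken from a sample):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    constexpr int n = 256;
    int *data = nullptr;
    cudaMallocManaged((void **)&data, sizeof(int) * n);

    for (int i = 0; i < n; ++i) data[i] = i;  // CPU writes through the managed pointer

    increment<<<1, n>>>(data, n);             // GPU uses the same pointer
    cudaDeviceSynchronize();                  // synchronize before the CPU touches it again

    printf("data[0] = %d\n", data[0]);        // prints 1
    cudaFree(data);
    return 0;
}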

Thanks

Yes, I understand the difference between their purposes, but explicitly transferring managed memory blocks is not forbidden, and I can transfer them via cudaMemcpyAsync.
So I am wondering whether I can treat unified memory on the CPU side as cached pinned memory, as long as I do not use the unified memory on the device.

Thanks.

Hi,

You can check our document:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

Thanks.