CUDA memory performance

konstantin_a · December 3, 2016, 10:22pm

Hi all,

I’m trying to wrap my head around CUDA memory organization on Jetson TK1, and I don’t understand how I can hide memory transfer latency there. Normally one would allocate pinned memory with cudaHostAlloc or cudaMallocHost and perform asynchronous device/host transfers in parallel with kernel execution. This works reasonably well on my laptop with a discrete GPU.
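For reference, the usual discrete-GPU pattern described above can be sketched roughly as follows. This is only an illustration: the `scale` kernel, the `CHECK` macro, the chunk count, and the buffer sizes are made up for the example and are not taken from my benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: print the CUDA error and bail out of main().
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const int kChunks = 4;               // assumed number of streams/chunks
    const int kChunkElems = 1 << 20;     // assumed chunk size (elements)
    const size_t chunkBytes = kChunkElems * sizeof(float);

    float *host = nullptr;
    float *dev = nullptr;
    // Pinned (page-locked) host buffer, so copies can be truly asynchronous.
    CHECK(cudaHostAlloc((void **)&host, kChunks * chunkBytes, cudaHostAllocDefault));
    CHECK(cudaMalloc((void **)&dev, kChunks * chunkBytes));
    for (int i = 0; i < kChunks * kChunkElems; ++i) host[i] = 1.0f;

    cudaStream_t streams[kChunks];
    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamCreate(&streams[c]));

    // Each stream copies its chunk in, runs the kernel, and copies it back.
    // Work queued in different streams is allowed to overlap.
    for (int c = 0; c < kChunks; ++c) {
        float *h = host + (size_t)c * kChunkElems;
        float *d = dev + (size_t)c * kChunkElems;
        CHECK(cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[c]));
        scale<<<(kChunkElems + 255) / 256, 256, 0, streams[c]>>>(d, kChunkElems, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[c]));
    }
    CHECK(cudaDeviceSynchronize());
    printf("host[0] = %f\n", host[0]);

    for (int c = 0; c < kChunks; ++c) CHECK(cudaStreamDestroy(streams[c]));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return 0;
}
```

Whether the copies and kernels actually overlap depends on the device and is best confirmed in the profiler.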

However, I find that on K1 cudaMallocHost, cudaHostAlloc with cudaHostAllocDefault, and cudaHostAlloc with cudaHostAllocMapped all do pretty much the same thing: they allocate non-cacheable memory mapped to /dev/nvmap. This, of course, severely degrades CPU performance for memory-intensive code. On the other hand, for memory allocated with malloc or mmap, cudaMemcpyAsync is not asynchronous.

So, it looks like there are only two possibilities on K1:

1. Allocate zero-copy memory and suffer from performance degradation on the CPU.
2. Use normal memory and have all transfers be synchronous.

Am I missing something?

Here are the results from my benchmark (Tegra K1 CUDA memory benchmark · GitHub):

Benchmark                                   Time           CPU Iterations
-------------------------------------------------------------------------
BM_malloc_read                       14971441 ns   14963282 ns         47   2.48962GB/s
BM_malloc_write                      10169105 ns   10166477 ns         57   3.66429GB/s
BM_cuda_copy_h2d                     28176609 ns   25812538 ns         27   1.44321GB/s
BM_cuda_copy_d2h                     24490177 ns   23068667 ns         30   1.61487GB/s
BM_cuda_copy_h2d_async/manual_time   28274816 ns   25883084 ns         25
BM_cuda_copy_d2h_async/manual_time   24495113 ns   23077762 ns         28
BM_cuda_malloc_read                 200448152 ns  200174523 ns          4   190.569MB/s
BM_cuda_malloc_write                 47369395 ns   47274455 ns         11   806.926MB/s
BM_pinned_read                      200451211 ns  200078443 ns          3    190.66MB/s
BM_pinned_write                      47364799 ns   47276902 ns         11   806.884MB/s
BM_mapped_read                      200649663 ns  200352668 ns          3   190.399MB/s
BM_mapped_write                      47381870 ns   47285940 ns         11    806.73MB/s
BM_managed_read                      15006465 ns   14947227 ns         47    2.4923GB/s
BM_managed_write                     10032148 ns   10020794 ns         61   3.71756GB/s

konstantin_a · December 4, 2016, 5:50am

I guess cudaMemcpyAsync is not synchronous after all; it’s just CPU-intensive. I can see in the profiler that copies overlap with kernel execution, even though the memory is not pinned. Interesting.

kayccc · December 9, 2016, 2:55am

Hi konstantin_a,

Regarding the 2014 article http://arrayfire.com/zero-copy-on-tegra-k1/, which states that zero-copy is faster than cudaMalloc: the article is misleading because it generalizes the zero-copy case, which is not accurate.
Zero-copy is only faster in cases where the access pattern does not benefit from caches.

Zero-copy memory on Tegra is uncached on both the CPU and the GPU, so every access by a CUDA kernel goes to DRAM. If the kernel repeatedly accesses the same memory location, cudaMalloc memory is likely to be faster.
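One way to check this on a given board is to time the same kernel against a zero-copy input and a cudaMalloc input. The following is a rough sketch under assumptions: the `rereadSum` kernel, the `timeKernel` helper, and the sizes are hypothetical and chosen only to produce an access pattern that re-reads a small window of memory many times.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread re-reads values from a small 1024-element
// window many times, a pattern that can benefit from GPU caches.
__global__ void rereadSum(const float *in, float *out, int n, int passes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int p = 0; p < passes; ++p) acc += in[(i + p) % 1024];
    out[i] = acc;
}

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical helper: returns the kernel time in milliseconds via CUDA events.
static float timeKernel(const float *in, float *out, int n, int passes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    rereadSum<<<(n + 255) / 256, 256>>>(in, out, n, passes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;       // assumed problem size (must be >= 1024)
    const int passes = 256;      // assumed number of re-reads per thread
    const size_t bytes = n * sizeof(float);

    // Enable mapped host allocations before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Zero-copy input: host allocation that the GPU accesses directly.
    float *hostMapped = nullptr;
    float *zeroCopyIn = nullptr;
    CHECK(cudaHostAlloc((void **)&hostMapped, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) hostMapped[i] = 1.0f;
    CHECK(cudaHostGetDevicePointer((void **)&zeroCopyIn, hostMapped, 0));

    // cudaMalloc input: same data, copied explicitly.
    float *devIn = nullptr;
    float *devOut = nullptr;
    CHECK(cudaMalloc((void **)&devIn, bytes));
    CHECK(cudaMalloc((void **)&devOut, bytes));
    CHECK(cudaMemcpy(devIn, hostMapped, bytes, cudaMemcpyHostToDevice));

    float msZeroCopy = timeKernel(zeroCopyIn, devOut, n, passes);
    CHECK(cudaGetLastError());
    float msDevice = timeKernel(devIn, devOut, n, passes);
    CHECK(cudaGetLastError());
    printf("zero-copy input:  %.3f ms\n", msZeroCopy);
    printf("cudaMalloc input: %.3f ms\n", msDevice);

    CHECK(cudaFree(devIn));
    CHECK(cudaFree(devOut));
    CHECK(cudaFreeHost(hostMapped));
    return 0;
}
```

The first launch also pays one-time startup costs, so for a fair comparison it is worth running each variant several times and discarding the first measurement. The cudaMalloc timing here excludes the cudaMemcpy, which matters if the data is only used once.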

You could also refer to the discussion in another thread. It concerns the TX1 board, but the concept is the same:
uncached memory created by cudaHostAlloc and cudaMemcpyAsync issues on TX1 - Jetson TX1 - NVIDIA Developer Forums
