CUDA memory performance

Hi all,

I’m trying to wrap my head around CUDA memory organization on Jetson TK1, and I don’t understand how I can hide memory transfer latency there. Normally one would allocate pinned memory with cudaHostAlloc or cudaMallocHost and perform asynchronous device/host transfers in parallel with kernel execution. This works reasonably well on my laptop with a discrete GPU.
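For reference, the usual discrete-GPU pattern described above looks roughly like the sketch below. The kernel `scale`, the helper `pipelined_scale_sum`, the two-stream layout and the chunking are illustrative assumptions, not code from my benchmark, and error handling is kept minimal.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Fills a pinned buffer with 1.0f, processes it chunk by chunk on two streams
// (H2D copy, kernel, D2H copy per chunk), and returns the sum of the results.
// Returns -1.0 if an allocation or the final synchronization fails.
double pipelined_scale_sum(int n, int chunks, float factor) {
    const int kStreams = 2;
    const int chunk = (n + chunks - 1) / chunks;
    float* host = nullptr;
    float* dev = nullptr;

    // Pinned (page-locked) host memory is what lets the copies run asynchronously.
    if (cudaMallocHost((void**)&host, n * sizeof(float)) != cudaSuccess) return -1.0;
    if (cudaMalloc((void**)&dev, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(host);
        return -1.0;
    }
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    for (int c = 0; c < chunks; ++c) {
        const int offset = c * chunk;
        if (offset >= n) break;
        const int count = (offset + chunk <= n) ? chunk : n - offset;
        cudaStream_t st = streams[c % kStreams];
        // Work queued on one stream may overlap with work queued on the other.
        cudaMemcpyAsync(dev + offset, host + offset, count * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        scale<<<(count + 255) / 256, 256, 0, st>>>(dev + offset, count, factor);
        cudaMemcpyAsync(host + offset, dev + offset, count * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }

    double sum = -1.0;
    if (cudaDeviceSynchronize() == cudaSuccess) {
        sum = 0.0;
        for (int i = 0; i < n; ++i) sum += host[i];
    }
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return sum;
}
```

Whether the copies actually overlap with the kernels depends on the device; the sketch only sets up the conditions for it.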

However, on the K1 I find that cudaMallocHost, cudaHostAlloc with cudaHostAllocDefault, and cudaHostAlloc with cudaHostAllocMapped all do pretty much the same thing: they allocate non-cacheable memory mapped to /dev/nvmap. This, of course, decimates CPU performance for memory-intensive code. On the other hand, cudaMemcpyAsync is not asynchronous for memory allocated with malloc or mmap.

So, it looks like there are only two possibilities on K1:

  • Allocate zero-copy memory and suffer from performance degradation on CPU
  • Use normal memory and have all transfers be synchronous
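To make the first option concrete, here is a minimal zero-copy sketch. The kernel `increment` and the helper `zero_copy_increment` are hypothetical names chosen for illustration. The call to cudaSetDeviceFlags is there on the assumption that the device has no unified addressing; its result is ignored because newer runtimes enable host mapping implicitly.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds one to every element.
__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Runs the kernel directly on mapped host memory (no cudaMemcpy at all).
// Returns how many elements hold the expected value, or -1 on error.
int zero_copy_increment(int n) {
    // Request host-memory mapping; ignore and clear the error that some
    // runtimes report when the context is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    int* host = nullptr;
    int* dev_view = nullptr;
    if (cudaHostAlloc((void**)&host, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    // The device-side alias of the same physical memory.
    if (cudaHostGetDevicePointer((void**)&dev_view, host, 0) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }

    for (int i = 0; i < n; ++i) host[i] = i;  // CPU writes into the mapped buffer
    increment<<<(n + 255) / 256, 256>>>(dev_view, n);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }

    int ok = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] == i + 1) ++ok;  // CPU reads the kernel's results in place
    cudaFreeHost(host);
    return ok;
}
```

The CPU-side loops in this sketch are exactly the accesses that become slow when the mapped buffer is uncached.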

Am I missing something?

Here are the results from my benchmark (https://gist.github.com/konstantin-azarov/666093bc0162bc23d7b32cf5926b1174):

Benchmark                                   Time           CPU Iterations
-------------------------------------------------------------------------
BM_malloc_read                       14971441 ns   14963282 ns         47   2.48962GB/s
BM_malloc_write                      10169105 ns   10166477 ns         57   3.66429GB/s
BM_cuda_copy_h2d                     28176609 ns   25812538 ns         27   1.44321GB/s
BM_cuda_copy_d2h                     24490177 ns   23068667 ns         30   1.61487GB/s
BM_cuda_copy_h2d_async/manual_time   28274816 ns   25883084 ns         25
BM_cuda_copy_d2h_async/manual_time   24495113 ns   23077762 ns         28
BM_cuda_malloc_read                 200448152 ns  200174523 ns          4   190.569MB/s
BM_cuda_malloc_write                 47369395 ns   47274455 ns         11   806.926MB/s
BM_pinned_read                      200451211 ns  200078443 ns          3    190.66MB/s
BM_pinned_write                      47364799 ns   47276902 ns         11   806.884MB/s
BM_mapped_read                      200649663 ns  200352668 ns          3   190.399MB/s
BM_mapped_write                      47381870 ns   47285940 ns         11    806.73MB/s
BM_managed_read                      15006465 ns   14947227 ns         47    2.4923GB/s
BM_managed_write                     10032148 ns   10020794 ns         61   3.71756GB/s

I guess cudaMemcpyAsync is not actually synchronous; it’s just CPU-intensive. I can see in the profiler that copies overlap with kernel execution even though the memory is not pinned. Interesting.
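One rough way to see how much host time the call itself consumes is to time the enqueue separately from the completion. This is only a sketch: `CopyTiming` and `time_pageable_h2d` are made-up names, and host-side wall-clock timing like this is noisy.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdlib>
#include <cstring>

// Hypothetical result type: time spent inside the cudaMemcpyAsync call,
// and time until the copy has actually finished (both in milliseconds).
struct CopyTiming {
    double enqueue_ms;
    double total_ms;
};

// Copies `bytes` from pageable (malloc) memory to the device on a stream.
// Returns {-1, -1} if an allocation fails.
CopyTiming time_pageable_h2d(size_t bytes) {
    using clock = std::chrono::steady_clock;
    CopyTiming t = {-1.0, -1.0};

    void* host = std::malloc(bytes);
    void* dev = nullptr;
    if (host == nullptr) return t;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        std::free(host);
        return t;
    }
    std::memset(host, 1, bytes);  // touch the pages so they are resident

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    auto start = clock::now();
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    auto enqueued = clock::now();   // the call has returned to the host
    cudaStreamSynchronize(stream);
    auto done = clock::now();       // the copy has completed

    t.enqueue_ms = std::chrono::duration<double, std::milli>(enqueued - start).count();
    t.total_ms = std::chrono::duration<double, std::milli>(done - start).count();

    cudaStreamDestroy(stream);
    cudaFree(dev);
    std::free(host);
    return t;
}
```

If enqueue_ms comes out close to total_ms, the host thread is effectively busy for the whole copy, even if the profiler shows the transfer overlapping with kernels.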

Hi konstantin_a,

Regarding the article http://arrayfire.com/zero-copy-on-tegra-k1/ from 2014, which states that zero-copy is faster than cudaMalloc: the article is misleading because it generalizes from the zero-copy case. This is not really accurate.
Zero-copy is only faster in some cases where the access pattern does not benefit from caches.

Zero-copy memory on Tegra is uncached on both the CPU and the GPU, so every access by a CUDA kernel goes to DRAM. If the kernel repeatedly accesses the same memory locations, cudaMalloc memory is likely to be faster.
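If you want to check this for your own access pattern, a small timing harness along the following lines may help. It is a sketch under assumptions: the kernel `repeated_reads`, the helper `time_repeated_reads`, and the launch configuration are illustrative only, and the numbers will vary with the board and the CUDA version.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread re-reads a small buffer many times, so
// the same locations are touched over and over (a cache-friendly pattern).
__global__ void repeated_reads(const float* data, int n, int repeats, float* out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int r = 0; r < repeats; ++r) acc += data[(tid + r) % n];
    out[tid] = acc;  // store the result so the loads are not optimized away
}

// Times the kernel on whatever device-accessible pointer it is given.
// Returns the elapsed time in milliseconds, or -1 on error.
float time_repeated_reads(const float* dev_data, int n, int repeats) {
    const int threads = 256;
    const int blocks = 64;
    float* out = nullptr;
    if (cudaMalloc((void**)&out, threads * blocks * sizeof(float)) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    repeated_reads<<<blocks, threads>>>(dev_data, n, repeats, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return ms;
}
```

Call it once with a pointer from cudaMalloc and once with the device pointer of a cudaHostAllocMapped buffer (obtained via cudaHostGetDevicePointer), using the same n and repeats, and compare the two times.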

You could also refer to the discussion in another thread. It is about the TX1 board, but the concept is the same:
https://devtalk.nvidia.com/default/topic/949519/jetson-tx1/uncached-memory-created-by-cudahostalloc-and-cudamemcpyasync-issues-on-tx1/