Hi all,
I’m trying to wrap my head around CUDA memory organization on the Jetson TK1, and I don’t understand how to hide memory transfer latency there. Normally one would allocate pinned memory with cudaHostAlloc or cudaMallocHost and perform asynchronous device/host transfers in parallel with kernel execution. This works reasonably well on my laptop with a discrete GPU.
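For reference, here’s roughly the pattern I mean (a minimal sketch, not my actual code; the kernel and buffer size are just placeholders):

```
#include <cuda_runtime.h>

// Placeholder kernel, just to have something running alongside the copies.
__global__ void process(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 24;
    const size_t bytes = n * sizeof(float);

    float* h_buf;
    float* d_buf;
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void**)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory these copies are truly asynchronous on a discrete GPU,
    // so work queued in other streams can overlap with them.
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```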
However, I find that on the TK1 cudaMallocHost, cudaHostAlloc with cudaHostAllocDefault, and cudaHostAlloc with cudaHostAllocMapped all do pretty much the same thing: they allocate non-cacheable memory mapped to /dev/nvmap. This, of course, decimates CPU performance for memory-intensive code. On the other hand, for memory allocated with malloc or mmap, cudaMemcpyAsync is not actually asynchronous.
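To be concrete, the zero-copy/mapped path I’m talking about looks roughly like this (a sketch with error checking omitted; `scale` is just a placeholder kernel):

```
#include <cuda_runtime.h>

// Placeholder kernel operating directly on the mapped host allocation.
__global__ void scale(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

int main() {
    const size_t n = 1 << 20;

    cudaSetDeviceFlags(cudaDeviceMapHost);

    float* h_ptr;
    float* d_ptr;
    // On Tegra this comes back as uncached memory backed by /dev/nvmap,
    // which is exactly what kills CPU-side reads in the numbers below.
    cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    // The kernel accesses the host allocation directly; no cudaMemcpy at all.
    scale<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaDeviceSynchronize();

    cudaFreeHost(h_ptr);
    return 0;
}
```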
So, it looks like there are only two possibilities on the TK1:
- Allocate zero-copy memory and suffer from performance degradation on the CPU
- Use normal memory and have all transfers be synchronous
Am I missing something?
Here are the results from my benchmark (Tegra K1 CUDA memory benchmark · GitHub):
Benchmark                                    Time            CPU   Iterations     Bandwidth
--------------------------------------------------------------------------------------------
BM_malloc_read                        14971441 ns    14963282 ns           47   2.48962GB/s
BM_malloc_write                       10169105 ns    10166477 ns           57   3.66429GB/s
BM_cuda_copy_h2d                      28176609 ns    25812538 ns           27   1.44321GB/s
BM_cuda_copy_d2h                      24490177 ns    23068667 ns           30   1.61487GB/s
BM_cuda_copy_h2d_async/manual_time    28274816 ns    25883084 ns           25
BM_cuda_copy_d2h_async/manual_time    24495113 ns    23077762 ns           28
BM_cuda_malloc_read                  200448152 ns   200174523 ns            4   190.569MB/s
BM_cuda_malloc_write                  47369395 ns    47274455 ns           11   806.926MB/s
BM_pinned_read                       200451211 ns   200078443 ns            3    190.66MB/s
BM_pinned_write                       47364799 ns    47276902 ns           11   806.884MB/s
BM_mapped_read                       200649663 ns   200352668 ns            3   190.399MB/s
BM_mapped_write                       47381870 ns    47285940 ns           11    806.73MB/s
BM_managed_read                       15006465 ns    14947227 ns           47    2.4923GB/s
BM_managed_write                      10032148 ns    10020794 ns           61   3.71756GB/s
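To show what the read/write numbers mean, here’s a simplified sketch of how a case like BM_pinned_read could be measured with Google Benchmark (not the exact code from the gist; the buffer size here is a placeholder):

```
#include <benchmark/benchmark.h>
#include <cuda_runtime.h>

static void BM_pinned_read(benchmark::State& state) {
    const size_t n = 1 << 23;  // placeholder size, not the gist's value
    float* buf;
    cudaHostAlloc((void**)&buf, n * sizeof(float), cudaHostAllocDefault);

    for (auto _ : state) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)  // plain CPU-side read of the pinned buffer
            sum += buf[i];
        benchmark::DoNotOptimize(sum);
    }
    // Reported as the bandwidth column above.
    state.SetBytesProcessed(int64_t(state.iterations()) * n * sizeof(float));

    cudaFreeHost(buf);
}
BENCHMARK(BM_pinned_read);

BENCHMARK_MAIN();
```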