Hi,
Because pinned memory allocated by cudaHostAlloc (or cudaMallocHost) on Tegra is uncacheable, CPU access to it is slow,
so we cannot easily transfer memory blocks asynchronously and overlap data transfers with computation.
A simple workaround for the uncacheable problem is Unified Memory: we can allocate the host-side buffers through cudaMallocManaged instead.
However, what we really want, for performance, is to transfer memory blocks explicitly and asynchronously and to overlap those transfers (and kernel executions).
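For concreteness, the two allocation paths discussed above look roughly like this (a minimal sketch; "pinned" is my own name, while N and the CHECK error-checking macro are the same ones used in the attempts further down):

int *pinned;
CHECK(cudaHostAlloc((void **)&pinned, sizeof(int) * N, cudaHostAllocDefault));   // pinned, but uncacheable on Tegra: slow for CPU reads/writes
int *managed;
CHECK(cudaMallocManaged((void **)&managed, sizeof(int) * N));                    // Unified Memory alternative, cacheable on the CPU side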
References:
https://stackoverflow.com/questions/27972491/cpu-memory-access-latency-of-data-allocated-with-malloc-vs-cudahostalloc-on
https://devtalk.nvidia.com/default/topic/908507/pinned-memory-slows-cpu-computation/
https://devtalk.nvidia.com/default/topic/979941/jetson-tk1/cuda-memory-performance/
https://devtalk.nvidia.com/default/topic/949519/jetson-tx1/uncached-memory-created-by-cudahostalloc-and-cudamemcpyasync-issues-on-tx1/
Now my question is: is there any way to overlap memory transfers (and kernel executions) on the Jetson TX1 (or Tegra)?
Here are my attempts on the Jetson TX1.
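All three attempts share the same setup (the CHECK macro, the kernel, and an array of streams), which the snippets below do not show; here is roughly what it looks like. NSTREAMS and the kernel body are placeholders of my own, the rest matches the snippets:

#include <cstdio>
#include <cstdlib>

#define CHECK(call) do { cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        printf("CUDA error %s at %s:%d\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); } } while (0)

// placeholder kernel; the real one is application-specific
__global__ void kernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1;
}

// in main(), before the loops shown below:
cudaStream_t streams[NSTREAMS];              // NSTREAMS is roughly N / SZ (one stream per chunk)
for (int s = 0; s < NSTREAMS; ++s)
    CHECK(cudaStreamCreate(&streams[s]));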
- transfer managed memory on the host to device memory (outside the managed allocation) through cudaMemcpyAsync.
int *managed, *device;
CHECK(cudaMalloc((void **)&device, sizeof(int) * N));
CHECK(cudaMallocManaged((void **)&managed, sizeof(int) * N));

dim3 block(128);
dim3 grid((SZ + block.x - 1) / block.x);

for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
    // copy one chunk from the managed buffer into plain device memory,
    // then process that chunk in the same stream
    CHECK(cudaMemcpyAsync((void *)(device + i), (void *)(managed + i),
                          sizeof(int) * SZ, cudaMemcpyHostToDevice, streams[n]));
    kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
}
Kernel executions and memory transfers (H->D) were overlapped.
- transfer device memory (outside the managed allocation) to managed memory on the host through cudaMemcpyAsync.
for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
    // process a chunk in device memory, then copy the result back
    // into the managed buffer in the same stream
    kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
    CHECK(cudaMemcpyAsync((void *)(managed + i), (void *)(device + i),
                          sizeof(int) * SZ, cudaMemcpyDeviceToHost, streams[n]));
}
Kernel executions and memory transfers (D->H) were overlapped.
- both directions (H->D copy, kernel, and D->H copy in each stream)
for (auto i = 0, n = 0; i < N - SZ; i += SZ, ++n) {
    // H->D copy, kernel, and D->H copy all issued in the same stream
    CHECK(cudaMemcpyAsync((void *)(device + i), (void *)(managed + i),
                          sizeof(int) * SZ, cudaMemcpyHostToDevice, streams[n]));
    kernel<<<grid, block, 0, streams[n]>>>(device + i, SZ);
    CHECK(cudaMemcpyAsync((void *)(managed + i), (void *)(device + i),
                          sizeof(int) * SZ, cudaMemcpyDeviceToHost, streams[n]));
}
All of the streams were completely serialized! Why?
Thanks!