Zero Copy vs. CudaMemcpy on Jetson TK1

dwyer2bp · April 20, 2016, 2:30pm

Hi,

I want to open up a discussion here, in order to better understand how to efficiently use NVIDIA TK1’s (physically) unified memory architecture. I have an example problem which I expected to run faster with method #1 than with (the more common) method #2:

1. cudaHostAlloc() & cudaHostGetDevicePointer() functions (with cudaDeviceMapHost flag set)
2. cudaMalloc() & cudaMemcpy(host to device)
   … run kernel …
   cudaMemcpy(device to host)
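
For readers without access to the linked program, the two methods listed above might look roughly like the following. This is only a minimal sketch, not the poster's code: `scaleKernel` and the `CHECK` macro are hypothetical stand-ins, and the sketch assumes that "cudaDeviceMapHost flag set" means calling `cudaSetDeviceFlags(cudaDeviceMapHost)` before allocating with `cudaHostAllocMapped`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the poster's kernel: doubles each element in place.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 640 * 480;                 // array size from the post
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Enable mapped pinned memory; must happen before the context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Method 1: zero copy. The kernel reads and writes host memory directly.
    float *hMapped = NULL, *dMapped = NULL;
    CHECK(cudaHostAlloc((void **)&hMapped, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&dMapped, hMapped, 0));
    for (int i = 0; i < n; ++i) hMapped[i] = 1.0f;
    scaleKernel<<<blocks, threads>>>(dMapped, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());          // results are visible to the CPU after this

    // Method 2: explicit device buffer with a copy in and a copy out.
    float *hBuf = (float *)malloc(bytes), *dBuf = NULL;
    if (hBuf == NULL) return 1;
    for (int i = 0; i < n; ++i) hBuf[i] = 1.0f;
    CHECK(cudaMalloc((void **)&dBuf, bytes));
    CHECK(cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice));
    scaleKernel<<<blocks, threads>>>(dBuf, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost));

    // Both buffers should now hold the doubled input.
    printf("zero copy: %f  memcpy: %f\n", hMapped[0], hBuf[0]);

    CHECK(cudaFree(dBuf));
    free(hBuf);
    CHECK(cudaFreeHost(hMapped));
    return 0;
}
```

In method 1 the only synchronization point is `cudaDeviceSynchronize()`, while in method 2 the blocking `cudaMemcpy` back to the host plays that role.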

Link to the program:
https://drive.google.com/open?id=0B1VzyJ5ock3XYVM3LXMwajc4TXc

Main launches each method individually for a given # of iterations and computes the average cycle duration. For some reason method 2 outperforms method 1, even though it seems to be doing MUCH more memory transfer between host and device (my example is performed on a 640x480 float array, which is both the input and the output of the kernel algorithm).
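
One way to compute such an average cycle duration is with CUDA events around a loop of full cycles. The sketch below is an assumption about how this could be done, not the linked program: the iteration count, the `addOneKernel` kernel, and the `CHECK` macro are hypothetical, and only the method 2 cycle (copy in, kernel, copy out) is timed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: adds 1 to each element in place.
__global__ void addOneKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 640 * 480;                 // array size from the post
    const size_t bytes = n * sizeof(float);
    const int iterations = 100;              // assumed iteration count
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *hBuf = (float *)calloc(n, sizeof(float));
    float *dBuf = NULL;
    if (hBuf == NULL) return 1;
    CHECK(cudaMalloc((void **)&dBuf, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    for (int it = 0; it < iterations; ++it) {
        // One full method 2 cycle: copy in, run kernel, copy out.
        CHECK(cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice));
        addOneKernel<<<blocks, threads>>>(dBuf, n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));       // wait until the stop event has completed

    float totalMs = 0.0f;
    CHECK(cudaEventElapsedTime(&totalMs, start, stop));
    printf("average cycle: %f ms over %d iterations\n",
           totalMs / iterations, iterations);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dBuf));
    free(hBuf);
    return 0;
}
```

To time method 1 the same way, the loop body would become a kernel launch on the mapped device pointer followed by `cudaDeviceSynchronize()`. Untimed warm-up iterations before the loop may also be worth adding, since the first launch typically carries one-time setup cost.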

What I cannot seem to grasp is WHY this is seemingly backwards, given that (at least in my mind) no matter which method is used, the same memory should be accessed during the kernel’s for loop. The only difference I can imagine is that method #2 executes 2 memory transfers of the entire array (once before kernel execution, and once after kernel execution).

I’d appreciate any input from those who know better. Thanks in advance!!

spencer_k · May 6, 2016, 3:45pm

I have run into this very same issue, but I have no input as to why.

codesign · May 10, 2016, 6:42am

I had similar problems when using unified memory on TK1. It seems to be a driver problem that has been resolved on TX1.

ctichenor · May 13, 2016, 4:44pm

Hello codesign, what exactly is the issue that appears to be fixed on the TX1 that is still an issue on the TK1? If you could post steps/code to reproduce the issue we would like to investigate this further and see if it might be possible for us to resolve the issue on TK1.

codesign · May 18, 2016, 1:47pm

I filed a bug for this issue (#1719505), providing a test case. According to the bug report, R23.x (and newer) contains a fix for the issue. Unfortunately, TK1 only supports R21.4.
