Zero Copy vs. CudaMemcpy on Jetson TK1

Hi,

I want to open up a discussion here to better understand how to use the NVIDIA TK1’s (physically) unified memory architecture efficiently. I have an example problem where I expected method #1 below to outperform the (more common) method #2 (a minimal sketch of both paths follows the list):

  1. cudaHostAlloc() & cudaHostGetDevicePointer() (with the cudaDeviceMapHost flag set via cudaSetDeviceFlags())

  2. cudaMalloc() & cudaMemcpy(host to device)
    … run kernel …
    cudaMemcpy(device to host)
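
To make the comparison concrete, here is roughly what the two paths look like. The kernel body, variable names, and missing error handling are simplified stand-ins, not my actual code; the full program is at the link below.

#include <cuda_runtime.h>
#include <cstdlib>

__global__ void process(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for the real per-element work
}

int main()
{
    const int n = 640 * 480;
    const size_t bytes = n * sizeof(float);

    // Method 1: zero copy (mapped pinned host memory)
    cudaSetDeviceFlags(cudaDeviceMapHost);               // must be set before the CUDA context is created
    float *hIn, *hOut, *dIn, *dOut;
    cudaHostAlloc((void**)&hIn,  bytes, cudaHostAllocMapped);
    cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dIn,  hIn,  0);
    cudaHostGetDevicePointer((void**)&dOut, hOut, 0);
    process<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();                              // results are directly visible in hOut

    // Method 2: device memory with explicit copies
    float *hBuf = (float*)malloc(bytes);                  // contents don't matter for the timing comparison
    float *gIn, *gOut;
    cudaMalloc((void**)&gIn,  bytes);
    cudaMalloc((void**)&gOut, bytes);
    cudaMemcpy(gIn, hBuf, bytes, cudaMemcpyHostToDevice);
    process<<<(n + 255) / 256, 256>>>(gIn, gOut, n);
    cudaMemcpy(hBuf, gOut, bytes, cudaMemcpyDeviceToHost); // blocks until the kernel has finished

    cudaFreeHost(hIn); cudaFreeHost(hOut);
    cudaFree(gIn); cudaFree(gOut); free(hBuf);
    return 0;
}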

Link to the program:
https://drive.google.com/open?id=0B1VzyJ5ock3XYVM3LXMwajc4TXc

main() launches each method individually for a given number of iterations and computes the average cycle duration. For some reason method #2 outperforms method #1, even though it appears to be doing MUCH more memory transfer between host and device (my example operates on a 640x480 float array, which is both the input and the output of the kernel algorithm).
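
The measurement loop is along these lines (simplified; runMethod1()/runMethod2() here stand in for the two code paths and are not the actual function names in my program):

#include <cuda_runtime.h>
#include <cstdio>

// Times one full "cycle" per iteration (copies + kernel for method 2,
// kernel launch + sync for method 1) and returns the average in milliseconds.
static float averageMs(void (*cycle)(void), int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float totalMs = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        cudaEventRecord(start);
        cycle();                              // one host -> device -> host round trip
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        totalMs += ms;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / iterations;
}

// Usage (runMethod1/runMethod2 are placeholders for the two paths above):
//   printf("zero copy: %.3f ms   memcpy: %.3f ms\n",
//          averageMs(runMethod1, 100), averageMs(runMethod2, 100));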

What I cannot seem to grasp is WHY this is seemingly backwards, given that (at least in my mind) the same memory should be accessed during the kernel’s for loop no matter which method is used. The only difference I can see is that method #2 performs two memory transfers of the entire array (once before kernel execution and once after kernel execution, roughly 1.2 MB each way for the 640x480 float array).

I’d appreciate any input from those who know better. Thanks in advance!!

I have run into this very same issue, but I have no input as to why.

I had similar problems when using unified memory on the TK1. It seems to be a driver problem that has been resolved on the TX1.

Hello codesign, what exactly is the issue that appears to be fixed on the TX1 but still occurs on the TK1? If you could post steps/code to reproduce it, we would like to investigate further and see whether we can resolve the issue on the TK1.

I filed a bug for this issue (#1719505), providing a test case. According to the bug report, R23.x (and newer) contains a fix for the issue. Unfortunately, TK1 only supports R21.4.