I want to open up a discussion here, in order to better understand how to efficiently use NVIDIA TK1’s (physically) unified memory architecture. I have an example problem which I thought would have improvements using #1 over (the more common) #2:
1. cudaHostAlloc() & cudaHostGetDevicePointer() (with the cudaDeviceMapHost flag set)
2. cudaMalloc() & cudaMemcpy(host to device) … run kernel … cudaMemcpy(device to host)
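To make the comparison concrete, here is a minimal sketch of the two paths as I understand them. The kernel name, launch configuration, and the doubling operation are placeholders (not my actual algorithm), and error checking is elided:

```cuda
#include <cuda_runtime.h>

const int W = 640, H = 480;
const size_t BYTES = W * H * sizeof(float);

__global__ void processKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // stand-in for the real per-element work
}

// Method 1: zero-copy via mapped pinned host memory
void runZeroCopy() {
    cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the context is created
    float *h, *d;
    cudaHostAlloc(&h, BYTES, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d, h, 0);      // device alias of the same physical memory
    processKernel<<<(W * H + 255) / 256, 256>>>(d, W * H);
    cudaDeviceSynchronize();                 // results are visible in h directly, no copies
    cudaFreeHost(h);
}

// Method 2: explicit device allocation plus two full-array copies
void runCopy(float *h) {
    float *d;
    cudaMalloc(&d, BYTES);
    cudaMemcpy(d, h, BYTES, cudaMemcpyHostToDevice);
    processKernel<<<(W * H + 255) / 256, 256>>>(d, W * H);
    cudaMemcpy(h, d, BYTES, cudaMemcpyDeviceToHost);
    cudaFree(d);
}
```

My expectation was that on the TK1, where host and device share the same physical DRAM, method 1 should win by skipping both cudaMemcpy calls entirely.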
Link to the program:
Main launches each method individually for a given number of iterations and computes the average cycle duration. For some reason, method 2 outperforms method 1, even though it appears to do MUCH more memory transfer between host and device (my example operates on a 640x480 float array that is both the input and the output of the kernel).
What I cannot seem to grasp is WHY this is seemingly backwards: as far as I can tell, no matter which method is used, the same physical memory should be accessed during the kernel's for loop. The only difference I can see is that method 2 performs two transfers of the entire array (once before kernel execution and once after).
I’d appreciate any input from those who know better. Thanks in advance!!