I have some code which executes the same function on the host, and then on the device, and outputs each to a file (it’s just some array arithmetic, nothing exciting).
I was using the standard method of allocating memory on the device and then copying the array contents over from the host (see “Standard CUDA pipeline”: http://arrayfire.com/zero-copy-on-tegra-k1/).
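For reference, here is a minimal sketch of that standard pipeline. It is not the actual code from this project: the kernel `scaleAdd`, the array size, and the launch configuration are placeholders chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the "array arithmetic".
__global__ void scaleAdd(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                  // assumed array size
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable host memory.
    float *hIn  = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    // Separate device allocations.
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);

    // Copy in, run the kernel, copy the result back.
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    scaleAdd<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    printf("out[10] = %f\n", hOut[10]);

    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return 0;
}
```

In this version the host arrays are plain `malloc` memory, so the CPU pass over them goes through the normal CPU caches.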
I was unhappy with the results, as the GPU version was not much faster than the CPU, so I implemented the “Zero Copy Access” method and got a tremendous improvement in speed (and verified that the two outputs, from the CPU and the GPU, were the same).
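A minimal sketch of the zero copy variant, under the same assumptions (the kernel `scaleAdd`, the helper `scaleAddHost`, and the sizes are hypothetical, and error checking is omitted). The host buffers are allocated as mapped pinned memory with `cudaHostAlloc`, and the kernel works on device pointers obtained from `cudaHostGetDevicePointer`, so no `cudaMemcpy` is needed. The CPU reference pass runs over the same mapped input buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the "array arithmetic".
__global__ void scaleAdd(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i] + 1.0f;
}

// Same arithmetic on the CPU, used to check the GPU result.
void scaleAddHost(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i) out[i] = 2.0f * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 20;                  // assumed array size
    const size_t bytes = n * sizeof(float);

    // Allow mapped host allocations; set before other CUDA calls.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned host memory that is also mapped into the device address space.
    float *hIn = nullptr, *hOut = nullptr;
    cudaHostAlloc((void **)&hIn, bytes, cudaHostAllocMapped);
    cudaHostAlloc((void **)&hOut, bytes, cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    // Device-side aliases of the same buffers; no cudaMemcpy involved.
    float *dIn = nullptr, *dOut = nullptr;
    cudaHostGetDevicePointer((void **)&dIn, hIn, 0);
    cudaHostGetDevicePointer((void **)&dOut, hOut, 0);

    scaleAdd<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();                // results are visible in hOut after this

    // CPU pass reading the mapped input buffer, writing to ordinary memory.
    float *ref = (float *)malloc(bytes);
    scaleAddHost(hIn, ref, n);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (ref[i] != hOut[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    free(ref);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return mismatches == 0 ? 0 : 1;
}
```

The only difference from the standard pipeline that the CPU code sees is how `hIn` and `hOut` were allocated, which makes the allocation type the first thing to look at when host timings change.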
However, this drastically slowed down the host (CPU) execution speed, by an order of magnitude!
Why is the CPU execution on the host arrays so slow when using this zero copy access method?
Thanks for the help.