I was unhappy with the results, as the GPU version was not much faster than the CPU, so I implemented the “Zero Copy Access” method and got a tremendous improvement in speed (and verified that the two outputs, from CPU and GPU, were the same).
HOWEVER: this drastically slowed down the HOST (CPU) execution speed, by roughly an order of magnitude!
Why is CPU access to the host arrays so slow when using this zero-copy access method?
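For context, here is a minimal sketch of the zero-copy setup described above. The kernel name `scale` and the buffer sizes are hypothetical, not taken from the original code. The comment on the allocation flags points at the usual culprit for the slowdown: if the buffer was allocated with `cudaHostAllocWriteCombined` (often recommended for zero-copy to speed up GPU reads), CPU *reads* of that buffer bypass the CPU cache and become dramatically slower.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real computation.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *h_ptr, *d_ptr;

    // Mapped ("zero-copy") pinned allocation: the GPU accesses host memory
    // directly over PCIe.  NOTE: adding cudaHostAllocWriteCombined here
    // speeds up GPU reads but makes CPU reads of the same buffer extremely
    // slow, because write-combined memory is not cached on the CPU.
    cudaHostAlloc(&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);

    for (int i = 0; i < n; ++i) h_ptr[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n, 2.0f);
    cudaDeviceSynchronize();  // GPU writes are visible to the host after this

    printf("h_ptr[0] = %f\n", h_ptr[0]);
    cudaFreeHost(h_ptr);
    return 0;
}
```

If the real allocation does use `cudaHostAllocWriteCombined`, dropping that flag (or keeping a separate cached host copy for CPU-side work) should restore normal CPU speed.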
Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.
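A minimal sketch of the Unified Memory alternative, assuming CUDA 6+ and a compute capability 3.0+ GPU; the `scale` kernel is again a hypothetical stand-in:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real computation.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data;

    // One allocation, one pointer, usable from both host and device.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU touches local, cached pages

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // runtime migrates pages to the GPU
    cudaDeviceSynchronize();  // required before the CPU touches managed data again

    printf("data[0] = %f\n", data[0]);  // pages migrate back; CPU access stays cached
    cudaFree(data);
    return 0;
}
```

Because the data physically migrates to whichever processor is using it, both the GPU kernel and the subsequent CPU loop run at normal local-memory speed, unlike the pinned zero-copy buffer.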