I have been testing two implementations of a GPU algorithm that makes a series of kernel calls operating on 2-D data:

Version one reduces the answers down to per-block partial results, copies those small per-block arrays to host memory, and then iterates over them on the CPU to produce the final answers.

Version two uses __threadfence() after the global writes; then the single thread where blockIdx.x == 0 && blockIdx.y == 0 && threadIdx.x == 0 iterates through the per-block results in GPU global memory and writes the final answers back to global memory (which is then copied to the CPU).
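For context, here is a minimal sketch of what I mean (names are made up; the retirement-counter pattern follows the CUDA `threadfenceReduction` SDK sample, which uses the *last* block to finish rather than block (0,0), but the idea is the same):

```cuda
#include <cstdio>

__device__ unsigned int retirementCount = 0;  // assumed helper, not in my real code

__global__ void reduceSum(const float *in, float *blockResults, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Per-block partial result. In version one, the host copies
    // blockResults[] back and loops over it on the CPU.
    if (tid == 0) blockResults[blockIdx.x] = sdata[0];

    // Version two: finish the reduction on the GPU.
    __threadfence();  // make this block's partial visible to other blocks
    if (tid == 0) {
        unsigned int done = atomicInc(&retirementCount, gridDim.x);
        if (done == gridDim.x - 1) {  // last block to finish does the final pass
            float total = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b)
                total += blockResults[b];
            blockResults[0] = total;  // final answer stays in global memory
        }
    }
}
```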
After testing, both return the same answers, but version one seems slightly faster.
Since CPU clock speeds tend to be higher than GPU clock speeds, is there ever a situation where the __threadfence() approach will be faster than a final step on the CPU?