Say I want to time a memory fetch from device global memory.
I don’t understand why time2 and time3 always give the same result. My kernel does take a long time to get the result ready for fetching, but shouldn’t cudaThreadSynchronize() block until kernel_call has finished? The copy from device memory to host memory should also take a noticeable amount of time. Thanks.
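For reference, here is a minimal sketch of the measurement described above, timing the kernel and the device-to-host copy as two separate intervals with CUDA events. It is not the original code: the body of `kernel_call`, the `Timings` struct, the `timeKernelAndCopy` helper, and the `CHECK` macro are all hypothetical stand-ins. It uses events rather than a host timer plus cudaThreadSynchronize() (which is deprecated in favor of cudaDeviceSynchronize()), because events are timestamped on the device and do not depend on the resolution of the host clock.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check helper: print the failing call and bail out.
// Early returns leak resources, which is tolerable only in a sketch.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return false;                                             \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the slow kernel; the real kernel_call is unknown.
__global__ void kernel_call(float *out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = (float)i;
        for (int k = 0; k < iters; ++k)   // artificial work to make it slow
            v = v * 1.0000001f + 0.5f;
        out[i] = v;
    }
}

struct Timings {
    float kernel_ms;  // time between the events around the kernel launch
    float copy_ms;    // time between the events around the D2H copy
};

// Times the kernel and the device-to-host copy separately.
// Returns false if any CUDA call fails.
bool timeKernelAndCopy(int n, int iters, Timings *t)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h_out = (float *)malloc(bytes);
    if (!h_out) return false;

    float *d_out = nullptr;
    cudaEvent_t e0, e1, e2;
    CHECK(cudaMalloc(&d_out, bytes));
    CHECK(cudaEventCreate(&e0));
    CHECK(cudaEventCreate(&e1));
    CHECK(cudaEventCreate(&e2));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    CHECK(cudaEventRecord(e0, 0));                 // before the kernel
    kernel_call<<<blocks, threads>>>(d_out, n, iters);
    CHECK(cudaGetLastError());                     // catch launch errors
    CHECK(cudaEventRecord(e1, 0));                 // after kernel, before copy
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(e2, 0));                 // after the copy
    CHECK(cudaEventSynchronize(e2));               // wait until e2 has occurred

    CHECK(cudaEventElapsedTime(&t->kernel_ms, e0, e1));
    CHECK(cudaEventElapsedTime(&t->copy_ms, e1, e2));

    cudaEventDestroy(e0);
    cudaEventDestroy(e1);
    cudaEventDestroy(e2);
    cudaFree(d_out);
    free(h_out);
    return true;
}
```

Calling `timeKernelAndCopy(1 << 20, 1000, &t)` from `main` and printing `t.kernel_ms` and `t.copy_ms` should show two distinct, nonzero intervals. If a host-timer version reports identical values for two timestamps, the timer's resolution and the size of the transfer are worth checking first.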