I don’t understand why my time3 and time2 always give the same result. My kernel takes a long time before the result is ready for fetching, but shouldn’t cudaThreadSynchronize() block until the kernel call is done? Also, fetching from device memory to host memory should take a while, or at least a noticeable amount of time. Thanks.
Is there a typo in the memory transfer directions? It looks like it should be cudaMemcpyHostToDevice in the first memcpy and cudaMemcpyDeviceToHost in the second; only then are you actually timing the fetch. Right now, since nothing depends on the second memcpy, some optimization might be happening (wild guess). And you don’t need the second cudaThreadSynchronize(), since there would be no GPU threads still running at that point.
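For reference, here is a minimal, untested sketch of the timing pattern being discussed, since the original code isn’t quoted here. The names kernel_call, h_data, d_data, and N are placeholders, and the kernel body is just dummy work. Note the transfer directions: HostToDevice for the upload before the kernel, DeviceToHost for the fetch being timed. Also, cudaThreadSynchronize() is deprecated in newer toolkits, so the sketch uses cudaDeviceSynchronize(), which does the same thing.

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void kernel_call(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // placeholder work
}

static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int N = 1 << 24;
    float *h_data = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();

    // Upload: host -> device
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
    double time1 = seconds_since(t0);

    // The kernel launch is asynchronous; block until the kernel has finished
    // so that time2 actually includes the kernel execution.
    kernel_call<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();
    double time2 = seconds_since(t0);

    // Fetch: device -> host. cudaMemcpy blocks until the copy completes,
    // so no extra synchronize is needed afterwards.
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    double time3 = seconds_since(t0);

    printf("time1 = %.6f s, time2 = %.6f s, time3 = %.6f s\n", time1, time2, time3);

    cudaFree(d_data);
    free(h_data);
    return 0;
}

With the directions fixed, the gap between time2 and time3 should reflect the device-to-host copy; if the second memcpy copies in the wrong direction (or copies nothing useful), that gap can easily come out as essentially zero.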