Question: time counting with/without memcpy

Hi all,

I’m not sure whether this question has been discussed before; if it has, sorry for the repost.

The problem is this: when I tried to measure the time of the matrixMul example from the CUDA SDK, I moved the statement that copies device memory back to the host so that it comes after the CUT_SAFE_CALL(cutStopTimer(timer)) line. I then got two very different timing results (I use 512 by 512 for all matrices to make them bigger):

./debug/matrixMul    Processing time: 3.235000 (ms)
./release/matrixMul  Processing time: 0.031000 (ms)

Why does it run so fast in release mode??

But if I don’t change anything except the matrix size, the results are about the same:
./debug/matrixMul    Processing time: 5.632000 (ms)
./release/matrixMul  Processing time: 5.035000 (ms)

I’m using nvcc 1.1 and SDK 1.1 btw.

first of all, why are you still using 1.1? 2.0 is pretty cool, we promise :(

the answer, if I’m understanding what you did correctly, is that you don’t have a cudaThreadSynchronize() before you call cutStopTimer(). kernel launches are asynchronous, so you’re measuring just the launch overhead, not the kernel’s execution time. cudaMemcpy blocks until all preceding work on the device has finished and the copy itself is complete, so with the copy inside the timed region you were effectively measuring the kernel too. that should explain the difference.
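for reference, here’s a sketch of the fixed timing pattern. it reuses the names from the SDK example (matrixMul, timer, the d_/h_ buffers); treat the kernel arguments as placeholders, since they depend on your version of the sample:

```cuda
// Launch the kernel -- this call returns immediately (asynchronous).
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, wA, wB);

// Block until the kernel has actually finished before stopping the timer.
// (cudaThreadSynchronize() is the CUDA 1.x/2.x name; newer toolkits call
// this cudaDeviceSynchronize().)
CUDA_SAFE_CALL(cudaThreadSynchronize());
CUT_SAFE_CALL(cutStopTimer(timer));

// The copy back can now safely sit outside the timed region:
// cudaMemcpy blocks until the copy is complete regardless.
CUDA_SAFE_CALL(cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost));
```

with the sync in place, the timer reads roughly the same whether the memcpy is inside or outside the timed region, which is what you saw in the larger-matrix case (there the copy cost was small relative to the kernel).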

OH, OK, that makes sense. Thanks! Yep, I’ll try 2.0.