I’m not sure whether anyone has asked this before; if so, sorry for the repost.
The problem: when I tried to time matrixMul from the CUDA SDK examples, I moved the statement that copies device memory back to the host so that it comes after the CUT_SAFE_CALL(cutStopTimer(timer)) line. The debug and release builds then gave two very different times (I use 512 by 512 for all matrices to make them bigger):
Processing time: 3.235000 (ms)
Processing time: 0.031000 (ms)
Why does it run so much faster in release mode?
But if I change nothing except the matrix size, the two results are about the same:
./debug/matrixMul
Processing time: 5.632000 (ms)
./release/matrixMul
Processing time: 5.035000 (ms)
I’m using nvcc 1.1 and SDK 1.1 btw.
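One thing that might be relevant, offered as a sketch and not as a confirmed explanation: kernel launches return control to the host before the kernel has finished, so a host timer stopped right after the launch may only measure launch overhead unless something synchronizes first. A blocking copy back to the host waits for the kernel, which is why moving the copy after the timer stop can change the number. If the debug build synchronizes somewhere before the timer stops (for example inside an error-checking macro) and the release build does not, that would match the gap above. The code below is not the SDK sample: the kernel matMulNaive and the helper timeLaunchMs are hypothetical names, it uses std::chrono in place of the cutil timer, and it calls cudaDeviceSynchronize(), which is the current replacement for the older cudaThreadSynchronize().

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical naive kernel (not the SDK's tiled matrixMul):
// one thread computes one element of C = A * B, all n x n, row-major.
__global__ void matMulNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += A[row * n + k] * B[k * n + col];
        }
        C[row * n + col] = sum;
    }
}

// Hypothetical helper: returns the host-side time in ms measured around the
// kernel launch. The device-to-host copy is deliberately placed AFTER the
// timer stop, as in the post. With syncBeforeStop == false the timer may
// stop while the kernel is still running.
double timeLaunchMs(int n, bool syncBeforeStop, std::vector<float>& hC) {
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hA(count, 1.0f), hB(count, 2.0f);
    hC.assign(count, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    auto t0 = std::chrono::steady_clock::now();
    matMulNaive<<<grid, block>>>(dA, dB, dC, n);  // returns immediately
    if (syncBeforeStop) {
        cudaDeviceSynchronize();  // block the host until the kernel is done
    }
    auto t1 = std::chrono::steady_clock::now();

    // Blocking copy: waits for the kernel to finish if it has not already.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling timeLaunchMs(512, false, c) and then timeLaunchMs(512, true, c) from main and printing both values should show whether the synchronization accounts for the difference on a given machine. The first call in a process may also include one-time startup cost, so it is probably worth running each variant more than once.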