When the program is built in the EmuRelease configuration, what is the targeted device? When I call the cutGetTimerValue function to get the execution time, is it the execution time on the targeted device or on the host processor? I ran the same program several times, but every time it gives me different results, and the result also seems to depend on the host processor’s load.
Everything runs on the CPU in emulation mode. I wouldn’t worry about the timings in emulation, since the mode emulates execution of the threads, blocks, etc. all sequentially on the host, so the numbers say nothing about GPU performance.
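To picture why emulation timings track the host rather than a GPU, here is a minimal sketch in plain C of a "launch" that walks every block and thread one after another on the CPU. It is a deliberately simplified model of the description above, not how the toolkit implements emulation; the names scale_thread and emulate_launch are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel body: each "thread" scales one element. */
static void scale_thread(float *data, size_t n, float factor,
                         size_t block_idx, size_t block_dim, size_t thread_idx)
{
    size_t i = block_idx * block_dim + thread_idx; /* global element index */
    if (i < n) {
        data[i] *= factor;
    }
}

/* Assumed simplified model of emulation: every block and every thread is
 * executed in turn on the host CPU, so total time grows with
 * grid_dim * block_dim and with whatever else the CPU is busy doing. */
static void emulate_launch(float *data, size_t n, float factor,
                           size_t grid_dim, size_t block_dim)
{
    assert(data != NULL);
    for (size_t b = 0; b < grid_dim; ++b) {
        for (size_t t = 0; t < block_dim; ++t) {
            scale_thread(data, n, factor, b, block_dim, t);
        }
    }
}
```

Calling emulate_launch(data, n, 2.0f, 3, 4) on a 10-element array visits 12 "threads" strictly in order, with the last two skipped by the bounds check. Nothing runs in parallel, which is why the measured time depends on host load.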
Also, cutGetTimerValue() only returns the time elapsed between the cutStartTimer and cutStopTimer calls. It doesn’t care what you did in between, so you can’t say whether it is the execution time on the target device or on the host.
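As a rough illustration of that point, the sketch below is a start/stop wall-clock timer in plain C. It is not the cutil implementation; the wall_timer type and its functions are hypothetical names, and it assumes a POSIX system that provides clock_gettime with CLOCK_MONOTONIC. Like the timer described above, it reports only the elapsed time between the two calls and knows nothing about where the work in between was executed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* needed for clock_gettime under strict -std modes */
#endif

#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical timer type: just two timestamps taken on the host. */
typedef struct {
    struct timespec start;
    struct timespec stop;
} wall_timer;

/* Record the host's monotonic clock at the start of the timed region. */
static void wall_timer_start(wall_timer *t)
{
    assert(t != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t->start);
}

/* Record the host's monotonic clock at the end of the timed region. */
static void wall_timer_stop(wall_timer *t)
{
    assert(t != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t->stop);
}

/* Elapsed milliseconds between start and stop, whatever happened in between. */
static double wall_timer_elapsed_ms(const wall_timer *t)
{
    assert(t != NULL);
    return (double)(t->stop.tv_sec - t->start.tv_sec) * 1000.0
         + (double)(t->stop.tv_nsec - t->start.tv_nsec) / 1.0e6;
}
```

Because the value is a difference of two host timestamps, anything that delays the host between the calls (other processes, scheduling, I/O) shows up in the result. That matches the run-to-run variation reported in the question.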
Is there any way to get the execution time (or the number of cycles) that a segment of a program would take on a real GPU? I don’t have a CUDA-compatible graphics card at hand.