Within the CUDA code, we measure the time with cudaEventElapsedTime(elapsed, start, stop) and get about 2 ms, which is pretty fast. However, if we measure the execution time of this CUDA function from the surrounding C code, we get about 200 ms. Hence we have a discrepancy of almost 200 ms while measuring “the same thing”.
We would appreciate any help or hints!
Thanks in advance,
To find out quickly which of the two times is right, just put the kernel launch in a for loop and run the kernel 100 times. That’ll tell you quite plainly which time is more correct.
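Here is a rough sketch of that loop test, timed from the host side. The kernel `dummyKernel`, the helper `averageLaunchMs` and the problem size are made up for illustration, so swap in your own kernel and launch configuration. It uses cudaDeviceSynchronize, which replaced the now-deprecated cudaThreadSynchronize in newer toolkits.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: average host-measured time per launch, in milliseconds.
double averageLaunchMs(int n, int iterations) {
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        dummyKernel<<<blocks, threads>>>(d_data, n);
    }
    // Kernel launches are asynchronous, so wait for all of them
    // before stopping the host clock.
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(d_data);
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
}

int main() {
    const int n = 1 << 20;
    // A single launch may include one-time startup costs.
    printf("1 launch:     %.3f ms per launch\n", averageLaunchMs(n, 1));
    // Averaging over many launches spreads any fixed cost out.
    printf("100 launches: %.3f ms per launch\n", averageLaunchMs(n, 100));
    return 0;
}
```

If the per-launch average over 100 runs comes out close to the event time, the extra host-side time is probably a fixed cost paid once, not part of the kernel itself.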
You could try inserting a cudaThreadSynchronize() between the kernel launch and the finish event. That will make sure the event doesn’t fire until after the kernel is done.
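Something along these lines, as a sketch only. The kernel `scaleKernel` and the helper `timeKernelWithEventsMs` are placeholders, and cudaDeviceSynchronize is used here because cudaThreadSynchronize is deprecated in newer toolkits. The cudaEventSynchronize call before reading the elapsed time is an extra precaution, so that the stop event has definitely been reached.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being timed.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: event-measured time of one launch, in milliseconds.
float timeKernelWithEventsMs(int n) {
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    scaleKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();     // host waits here until the kernel has finished
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}

int main() {
    printf("kernel time from events: %.3f ms\n", timeKernelWithEventsMs(1 << 20));
    return 0;
}
```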