I have a timer that is producing output I don't expect. The code I am using is below. Basically, I create a timer, launch the kernel, print the timer value, copy the results back to the host, and print the timer value again.
[codebox]// start a timer
printf("\nStarting Kernel Now...\n");
cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));
// run the kernel
invokeKernel<<<grid, threads>>>( …parameters… );
// output the amount of time elapsed so far
printf("\nKernel Done. Execution time: %f (ms)\n", cutGetTimerValue(timer));
// copy the output back to main memory
printf("\nGetting Output from GPU.\n");
cutilSafeCall(cudaMemcpy(…, …, …, cudaMemcpyDeviceToHost));
// output the amount of time elapsed so far
cutilCheckError(cutStopTimer(timer));
printf("\nTransfer done. Total time: %f (ms)\n", cutGetTimerValue(timer));[/codebox]
In my application, the kernel is slow. I expect it to take about 30 seconds to complete. The output data is small (2 MB). However, when I run my application, the first timer output is something like 0.05 ms, and the second output is about 30 seconds.
The output of the application is correct, so there is no way my kernel is running in 0.05 ms. Likewise, there is no way it should take 30 seconds to transfer 2 MB of data from the GPU to the host. That leads me to believe I am using the timers incorrectly.
Does the invokeKernel launch return immediately after starting the kernel, or does it wait for all threads to complete? Any advice on troubleshooting this would be much appreciated.
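If the launch does return immediately, the numbers above would make sense: the first timer read would only cover the launch itself, and the blocking cudaMemcpy would absorb the wait for the kernel to finish. Here is a minimal sketch of how the two phases could be timed separately with CUDA events instead of the cutil timer. Everything in it is an assumption for illustration and not my real code: `busyKernel` is a hypothetical stand-in for my kernel, `measure` is a made-up helper name, and the sizes are arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: every thread just burns time.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (float)i;
    for (int k = 0; k < iters; ++k)
        v = v * 1.000001f + 0.5f;
    out[i] = v;  // store the result so the loop is not optimized away
}

// Measures kernel time and copy time separately using CUDA events.
// Returns 0 on success, 1 if any allocation or CUDA call failed.
int measure(float *kernelMs, float *copyMs)
{
    const int threads = 256, blocks = 64, n = threads * blocks;
    const size_t bytes = n * sizeof(float);
    float *h_out = (float *)malloc(bytes);
    float *d_out = NULL;
    cudaEvent_t start, kernelDone, copyDone;
    bool ok = (h_out != NULL);

    ok &= (cudaMalloc(&d_out, bytes) == cudaSuccess);
    ok &= (cudaEventCreate(&start) == cudaSuccess);
    ok &= (cudaEventCreate(&kernelDone) == cudaSuccess);
    ok &= (cudaEventCreate(&copyDone) == cudaSuccess);

    if (ok) {
        // Events are recorded on the GPU timeline, in stream order.
        ok &= (cudaEventRecord(start, 0) == cudaSuccess);
        busyKernel<<<blocks, threads>>>(d_out, 100000);  // host does not wait here
        ok &= (cudaGetLastError() == cudaSuccess);       // catches launch errors
        ok &= (cudaEventRecord(kernelDone, 0) == cudaSuccess);
        ok &= (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
        ok &= (cudaEventRecord(copyDone, 0) == cudaSuccess);

        // Wait until the last event has actually happened before reading times.
        ok &= (cudaEventSynchronize(copyDone) == cudaSuccess);
        ok &= (cudaEventElapsedTime(kernelMs, start, kernelDone) == cudaSuccess);
        ok &= (cudaEventElapsedTime(copyMs, kernelDone, copyDone) == cudaSuccess);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(kernelDone);
    cudaEventDestroy(copyDone);
    cudaFree(d_out);
    free(h_out);
    return ok ? 0 : 1;
}

int main()
{
    float kernelMs = 0.0f, copyMs = 0.0f;
    if (measure(&kernelMs, &copyMs) != 0) {
        fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    printf("Kernel time: %f (ms)\n", kernelMs);
    printf("Copy time:   %f (ms)\n", copyMs);
    return 0;
}
```

If I wanted to keep the cutil timer, I assume the equivalent change would be to call cudaDeviceSynchronize() (cudaThreadSynchronize() on older toolkits) right after the launch and before the first cutGetTimerValue, so that the host clock is only read once the kernel has finished.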
Thanks,
Bill