Timing the Kernel

I have a timer that is producing output I don't expect. The code I am using is below. Basically, I create a timer, run the kernel, print the timer value, copy the results back to the host, and print the timer value again.

[codebox]//start a timer
printf("\nStarting Kernel Now...\n");
cutilCheckError(cutCreateTimer(&timer));
cutilCheckError(cutStartTimer(timer));

//run the kernel
invokeKernel<<<grid, threads>>>( …parameters… );

//output the amount of time elapsed so far
printf("\nKernel Done. Execution time: %f (ms)\n", cutGetTimerValue(timer));

// Copy the output back to main memory
printf("\nGetting Output from GPU.\n");
cutilSafeCall(cudaMemcpy(…, …, …, cudaMemcpyDeviceToHost));

//output the amount of time elapsed so far
cutilCheckError(cutStopTimer(timer));
printf("\nTransfer done. Total time: %f (ms)\n", cutGetTimerValue(timer));[/codebox]

In my application, the kernel is slow. I expect it to take about 30 seconds to complete. The output data is small (2 MB). However, when I run my application, the first timer output is something like 0.05 ms, and the second output is about 30 seconds.

The output of the application is correct, so there is no way my kernel is running in 0.05 ms. Likewise, there is no way it should take 30 seconds to transfer 2 MB of data from the GPU to the host. That leads me to believe I am using the timers incorrectly.

Does the invokeKernel call return immediately after launching the kernel, or does it wait for all threads to complete? Any advice in troubleshooting this would be much appreciated.

Thanks,

Bill

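Kernel launches are asynchronous, so the 0.05 ms you measured after invokeKernel is only the time to queue the kernel, not the time to execute it. The blocking cudaMemcpy then has to wait for the kernel to finish before it can copy anything, which is why the 30 seconds shows up in your second reading. Add a call to cudaThreadSynchronize() after the launch to make the host block until the kernel finishes before you read the timer. That will correct your timing.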
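Here is a minimal sketch of that fix, in case it helps to see the two readings side by side. It assumes a made-up `busyKernel` in place of your invokeKernel, uses `std::chrono` instead of the cutil timer so it builds without the SDK, and calls `cudaDeviceSynchronize()`, which is the current name for the now-deprecated `cudaThreadSynchronize()`. The launch sizes and iteration count are arbitrary.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the original kernel: just keeps the device busy.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = static_cast<float>(i);
    for (int k = 0; k < iters; ++k)
        v = v * 1.0000001f + 0.5f;
    out[i] = v;
}

// Milliseconds elapsed on the host since `start`.
static double elapsedMs(std::chrono::steady_clock::time_point start)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main()
{
    const int blocks = 256, threads = 256;   // arbitrary launch shape
    const int n = blocks * threads;

    float *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    auto start = std::chrono::steady_clock::now();

    // The launch only queues the kernel; control returns to the host right away.
    busyKernel<<<blocks, threads>>>(dOut, 1 << 20);
    printf("Launch returned after: %f ms\n", elapsedMs(start));

    // Block the host until the kernel has actually finished.
    cudaError_t err = cudaDeviceSynchronize();
    printf("Kernel finished after: %f ms\n", elapsedMs(start));

    if (err != cudaSuccess)
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(dOut);
    return err == cudaSuccess ? 0 : 1;
}
```

The first number should be small regardless of how long the kernel runs; the second one is the reading that includes the kernel's execution.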

The kernel call only queues the kernel for execution; the kernel itself runs on the device asynchronously.

You need to insert a cudaThreadSynchronize() call right after the kernel call. This will cause your program to wait for the kernel to finish.
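If you only care about the time spent on the device, CUDA events are another option. This is a different technique from the host timer in the question, not something the code above uses: the events are recorded in the same stream as the kernel, so the elapsed time is measured on the GPU. A rough sketch, again with a made-up `busyKernel` and arbitrary sizes standing in for the real kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the original kernel: just keeps the device busy.
__global__ void busyKernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = static_cast<float>(i);
    for (int k = 0; k < iters; ++k)
        v = v * 1.0000001f + 0.5f;
    out[i] = v;
}

int main()
{
    const int blocks = 256, threads = 256;   // arbitrary launch shape
    const int n = blocks * threads;

    float *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record an event before and after the kernel in the default stream.
    cudaEventRecord(start, 0);
    busyKernel<<<blocks, threads>>>(dOut, 1 << 20);
    cudaEventRecord(stop, 0);

    // Wait until the stop event (and therefore the kernel) has completed.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel execution time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return 0;
}
```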

Thank you to both of you. Works perfectly.