NVIDIA Developer Forums

Strange Performance Issues at the First Kernel Execution


allanmulin August 8, 2009, 1:33pm 1

Hello,

I have written a while loop that executes:

a kernel, and
a host function that performs exactly the same operation as the kernel.

The objective is to measure the time that each function (host and device) takes. I have noticed that the first execution of the kernel (the first loop iteration) is much faster than the others. The results are below:

Type N: 1000
Type numThreadPerBlock (<= 512): 512

elapsedTimeCPU = 1.875000 miliseconds
elapsedTimeGPU = 0.088000 miliseconds
factor = 21.306818 (elapsedTimeCPU/elapsedTimeGPU)

Type N: 1000
Type numThreadPerBlock (<= 512): 512

elapsedTimeCPU = 1.848000 miliseconds
elapsedTimeGPU = 0.267000 miliseconds
factor = 6.921349 (elapsedTimeCPU/elapsedTimeGPU)

Type N: 1000
Type numThreadPerBlock (<= 512): 512

elapsedTimeCPU = 1.847000 miliseconds
elapsedTimeGPU = 0.268000 miliseconds
factor = 6.891791 (elapsedTimeCPU/elapsedTimeGPU)

Type N: 1000
Type numThreadPerBlock (<= 512): 512

elapsedTimeCPU = 1.847000 miliseconds
elapsedTimeGPU = 0.268000 miliseconds
factor = 6.891791 (elapsedTimeCPU/elapsedTimeGPU)

Type N: 1000
Type numThreadPerBlock (<= 512): 512

elapsedTimeCPU = 1.862000 miliseconds
elapsedTimeGPU = 0.269000 miliseconds
factor = 6.921933 (elapsedTimeCPU/elapsedTimeGPU)

Type N: 1000
Type numThreadPerBlock (<= 512): 512

elapsedTimeCPU = 1.850000 miliseconds
elapsedTimeGPU = 0.269000 miliseconds
factor = 6.877324 (elapsedTimeCPU/elapsedTimeGPU)

I have already checked for errors in the first execution, but I found nothing. I am using the timer functions of the cutil library, and I call cutilSafeThreadSync() before starting and before stopping the timer.

Has anyone else noticed this, or could someone try to reproduce it with a simple kernel?

SPWorley August 8, 2009, 2:41pm 2

You didn’t post your code, but what you’re likely timing is the queueing speed, not the execution speed.

Call cudaThreadSynchronize() before your timer call to make sure the kernels have finished running before you stop timing.
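For anyone wanting to reproduce this, here is a minimal sketch of the synchronized timing described above. The original code was not posted, so the kernel `scaleKernel` and the helper `timeLaunchMs` are hypothetical stand-ins, and a `std::chrono` host timer replaces the cutil timers. It also uses `cudaDeviceSynchronize()`, which superseded the now-deprecated `cudaThreadSynchronize()` in later CUDA releases.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (the original kernel was not posted).
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Times one kernel launch with a host timer, in milliseconds.
// Returns -1.0 if any CUDA call fails.
double timeLaunchMs(int n, int threadsPerBlock) {
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0;
    if (cudaMemset(d_data, 0, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_data);
        return -1.0;
    }
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    // Drain any previously queued work so it is not charged to this launch.
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();

    // The launch is asynchronous: control returns once the kernel is queued.
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaError_t launchErr = cudaGetLastError();

    // Wait for the kernel to finish BEFORE reading the timer again;
    // without this, only the queueing time is measured.
    cudaError_t syncErr = cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    cudaFree(d_data);
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Same N and block size as in the question; repeat to compare iterations.
    for (int iter = 0; iter < 6; ++iter) {
        double ms = timeLaunchMs(1000, 512);
        printf("iteration %d: elapsedTimeGPU = %f milliseconds\n", iter, ms);
    }
    return 0;
}
```

If the per-iteration times still differ after both synchronization points are in place, the difference is at least not coming from unfinished work leaking across the timer calls.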
