SPMT: Single Program Multiple (Execution) Time

I successfully compiled and ran the following program, which squares the first 1000 integers, on my 9600 GT (512 MB). I am wondering why I get a different execution time every time I run it. This happens in emulation mode as well.

The execution time on GPU is : 0.061770, 0.066406, 0.054441 (ms)
The execution time in emulation mode: 4.440767, 4.114464, 4.896113 (ms)

I am calling it SPMT: Single Program Multiple (Execution) Time :)

Thanks in advance.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;    // Pointers to host & device arrays
    const int N = 1000;  // Number of elements in arrays

    unsigned int timer = 0;
    cutCreateTimer(&timer);
    unsigned int timercpu = 0;
    cutCreateTimer(&timercpu);

    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);      // Allocate array on host
    cudaMalloc((void **)&a_d, size);  // Allocate array on device
    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    cutStartTimer(timer);  // Start timer
    square_array<<<n_blocks, block_size>>>(a_d, N);
    cutStopTimer(timer);   // Stop timer

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    printf("CUDA execution time = %f ms\n", cutGetTimerValue(timer));
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    // Cleanup
    free(a_h); cudaFree(a_d);
    return 0;
}


To get better timing results, you should time a (preferably large) number of iterations and take the average.

The first iteration is usually slower.

This code should give you more accurate timing results:

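[codebox]
// Make sure all previously issued GPU work has finished before timing
cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );

for(int i = 0; i < NUM_ITERATIONS; i++){
    // kernel launch goes here, e.g. square_array<<<n_blocks, block_size>>>(a_d, N);
    cutilCheckMsg("Kernel() execution failed\n");
}

// Wait for all launches to complete before stopping the timer
cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError( cutStopTimer(hTimer) );

double gpuTime = cutGetTimerValue(hTimer) / NUM_ITERATIONS;
printf("Avg. time: %f ms\n", gpuTime);
[/codebox]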



Thanks Nico…

I will try it…but will I get the same execution time every time I compile my program?

Also keep in mind the limitations mentioned by tmurray in this post:

If you want accurate GPU execution time, you really should use events…

Thanks friends…

Well, my question is exactly why this variation in execution time occurs.

And the answer is that you cannot conclude there is any significant variation in execution time. You have presented 3 execution time data points (which include host to device data transfers) obtained while running what amounts to a null kernel, and measured using an event timer whose precision is low relative to the running time. It is impossible to draw any conclusions on the basis of what you have done.

Thanks a ton, Avidday…

But actually I am unable to understand your point… I have a simple question… I have a well-defined set of program instructions, and it should take the same time every time I execute it.

Is it because of the (random) memory allocation using malloc that I am getting this variation? And why is it impossible to draw any conclusions on the basis of what I have done?

The whole point is you are not measuring how long it takes for the GPU to run the code, you are measuring how long it takes for the kernel to launch, and then you are claiming that there is inexplicable variation between durations of 61, 66 and 54 microseconds measured using the standard host real time clock, which probably doesn’t have resolution below about 10 microseconds anyway.

Thanks Avidday!! I get your point now. I am measuring how long it takes for the kernel to launch, not how long the GPU takes to run my code. :)

Now, can you tell me exactly how to get the time the GPU takes to run my code?

Call cudaThreadSynchronize() before stopping the timer.
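A minimal sketch of that suggestion, under a few assumptions: it targets a recent CUDA toolkit, where `cudaDeviceSynchronize()` replaces the deprecated `cudaThreadSynchronize()`, and it uses `std::chrono` instead of the cutil timer from the original program. The helper name `time_square_ms` is hypothetical.

```cuda
#include <chrono>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: wall-clock time in ms for one launch,
// including the wait for the kernel to finish.
double time_square_ms(float *a_d, int N, int block_size)
{
    int n_blocks = (N + block_size - 1) / block_size;

    cudaDeviceSynchronize();  // earlier GPU work must not be counted
    auto start = std::chrono::steady_clock::now();

    square_array<<<n_blocks, block_size>>>(a_d, N);  // returns immediately
    cudaDeviceSynchronize();  // block until the kernel has actually finished

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the second synchronize, the measured interval would cover only the launch call, which is what the original program timed. For a kernel this small, the result is still dominated by launch overhead and host clock resolution, so averaging over many launches remains advisable.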

Avidday: When we place a cudaThreadSynchronize() call before stopping the timer, it blocks the CPU host code from continuing until all the threads have finished their jobs. Don't you think that this slows down the CPU and thus the whole program? Since the GPU and CPU work independently, the two should be executing their respective code SIMULTANEOUSLY. Is there any method for getting the timer value WITHOUT stopping the CPU from moving ahead?

yes, use cudaEvents.
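A rough sketch of event-based timing, assuming the default stream and the same `square_array` kernel as in the original post. The helper name `event_time_square_ms` is hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: GPU-side elapsed time in ms for one launch.
float event_time_square_ms(float *a_d, int N, int block_size)
{
    int n_blocks = (N + block_size - 1) / block_size;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both records are queued in stream 0 and do not block the host.
    cudaEventRecord(start, 0);
    square_array<<<n_blocks, block_size>>>(a_d, N);
    cudaEventRecord(stop, 0);

    // The host can keep working and poll the event instead of blocking.
    while (cudaEventQuery(stop) == cudaErrorNotReady) {
        // ... unrelated CPU work would go here ...
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // time between the two events
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The events are timestamped on the GPU, so the result does not depend on the host clock. If the CPU has nothing else to do, `cudaEventSynchronize(stop)` can replace the polling loop.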


Nico: then what if someone wants to synchronize the threads but does not want to stop the CPU from going ahead? I mean, can we use both CUDA events and Synchthread() simultaneously, thereby getting an accurate value of the kernel execution time while allowing the CPU to move ahead (rather than blocking it until all the threads have finished their jobs)?

Can’t you perform the kernels + GPU stream/event synchronization and the CPU parts in separate host threads?


Thanks a lot Nico!!

Could you please tell me how to perform the kernels + GPU stream/event synchronization and the CPU parts in separate host threads?

I might be asking a very simple question, but I am trying my best to understand all these things.

Here’s a basic thread tutorial for boost.
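As a rough illustration of the separate-host-thread idea, here is a sketch that uses `std::thread` from C++11 in place of boost (my substitution, not what the tutorial covers). It assumes CUDA 4.0 or later, where all host threads in a process share the same device context, so a pointer allocated in the main thread can be used by the worker. The helper name `time_in_worker_thread` and the dummy CPU loop are hypothetical.

```cuda
#include <thread>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Hypothetical helper: a worker thread launches and times the kernel
// while the calling thread carries on with its own CPU work.
float time_in_worker_thread(float *a_d, int N, int block_size,
                            long long *cpu_result)
{
    float gpu_ms = 0.0f;

    std::thread worker([&] {
        int n_blocks = (N + block_size - 1) / block_size;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        square_array<<<n_blocks, block_size>>>(a_d, N);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  // blocks only this worker thread

        cudaEventElapsedTime(&gpu_ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    });

    // Stand-in for real CPU work, running concurrently with the GPU.
    long long sum = 0;
    for (long long i = 0; i < 1000000; ++i) sum += i;
    *cpu_result = sum;

    worker.join();  // gpu_ms is safe to read only after the join
    return gpu_ms;
}
```

The blocking wait lives in the worker, so the main thread is never stalled by the GPU. Whether this is worth the extra complexity over simply polling an event from a single thread depends on how much independent CPU work there is.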