Hi,
I successfully compiled and ran the following program, which squares the first 1000 integers, on my 9600 GT (512 MB). I am wondering why I get a different execution time every time I run it. This happens in emulation mode as well.
The execution times on the GPU are: 0.061770, 0.066406, 0.054441 ms
The execution times in emulation mode are: 4.440767, 4.114464, 4.896113 ms
I am calling it SPMT: Single Program Multiple (Execution) Time :)
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;   // Pointers to host & device arrays
    const int N = 1000; // Number of elements in arrays
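    // [The posted listing appears to be cut off here. For context, the remainder
    //  of this standard example presumably looks roughly like the following;
    //  block size and initialization are assumptions, not necessarily the
    //  original poster's exact code.]
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);                    // Allocate array on host
    cudaMalloc((void **)&a_d, size);                // Allocate array on device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;  // Initialize host data
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
    square_array<<<n_blocks, block_size>>>(a_d, N); // Launch the kernel

    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);

    free(a_h);                                      // Cleanup
    cudaFree(a_d);
    return 0;
}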
And the answer is that you cannot conclude there is any significant variation in execution time. You have presented three execution-time data points (which include host-to-device data transfers), obtained while running what amounts to a null kernel and measured with an event timer whose precision is low relative to the running time. It is impossible to draw any conclusions on that basis.
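For illustration of the precision point, a minimal sketch (assuming a_d, n_blocks and block_size are set up as in the program above, and an arbitrary repetition count): launch the kernel many times and average, so that the total run time is large compared to the timer's resolution.

#include <stdio.h>

// Launch the kernel many times and report the average, so the total run time
// dwarfs the timer's resolution.
void time_kernel_averaged(float *a_d, int n_blocks, int block_size, int N)
{
    const int reps = 1000;                 // arbitrary repetition count
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        square_array<<<n_blocks, block_size>>>(a_d, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);            // wait until every launch has actually finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    printf("average kernel time: %f ms\n", total_ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}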
But actually I am unable to understand your point… I have a simple question: I have a well-defined set of program instructions, and it should take the same time every time I execute it.
Is it because of the (random) memory allocation using malloc that I am getting this variation? And why is it impossible to draw any conclusions on the basis of what I have done?
The whole point is that you are not measuring how long it takes the GPU to run the code; you are measuring how long it takes for the kernel to launch. You are then claiming that there is inexplicable variation between durations of 61, 66 and 54 microseconds, measured with the standard host real-time clock, which probably doesn't have resolution below about 10 microseconds anyway.
Thanks, Avidday!! I get your point now: I am measuring how long it takes for the kernel to launch, not how long it takes the GPU to run my code. :)
Now can you tell me exactly how to measure the time it takes for the GPU to run my code?
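A minimal sketch of one common approach, assuming a_d, n_blocks and block_size are set up as in the program above (wall_ms() is just an illustrative helper, not a CUDA API): start a host timer, launch the kernel, call cudaThreadSynchronize() so that the kernel has actually finished, and only then stop the timer.

#include <stdio.h>
#include <sys/time.h>

// Host wall-clock time in milliseconds (microsecond granularity at best).
static double wall_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

void time_launch_vs_run(float *a_d, int n_blocks, int block_size, int N)
{
    double t0 = wall_ms();
    square_array<<<n_blocks, block_size>>>(a_d, N); // returns as soon as the launch is queued
    double t1 = wall_ms();                          // so this only measures launch overhead
    cudaThreadSynchronize();                        // block until the kernel has actually finished
    double t2 = wall_ms();
    printf("launch: %f ms, launch + kernel: %f ms\n", t1 - t0, t2 - t0);
}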
Avidday: When we place a cudaThreadSynchronize() call before stopping the timer, it blocks the host code from continuing until all the threads have finished their work. Don't you think we are slowing down the CPU, and thus the whole program? Since the GPU and CPU work independently, the two should be executing their respective code SIMULTANEOUSLY. Is there any method for reading the timer value WITHOUT stopping the CPU from moving ahead?
Nico: Then what if someone wants to synchronize the threads but does not want to stop the CPU from going ahead? I mean, can we use both CUDA events and cudaThreadSynchronize() together, thereby getting an accurate value for the kernel execution time while still allowing the CPU to move ahead (rather than blocking it until all the threads have finished their jobs)?
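One way to get both, sketched here under the assumption that there really is independent host-side work to overlap (do_independent_cpu_work() is a placeholder, not a real API): record CUDA events around the kernel, let the CPU carry on with its own work, and only synchronize on the stop event at the point where the elapsed time is actually needed.

void do_independent_cpu_work(void);                 // placeholder for host-side work

float time_kernel_without_stalling(float *a_d, int n_blocks, int block_size, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                      // queued on the GPU, returns immediately
    square_array<<<n_blocks, block_size>>>(a_d, N);
    cudaEventRecord(stop, 0);                       // also returns immediately

    do_independent_cpu_work();                      // CPU work overlaps the kernel here

    // Or poll instead of blocking:
    // while (cudaEventQuery(stop) == cudaErrorNotReady) { /* do more CPU work */ }

    cudaEventSynchronize(stop);                     // block only when the number is needed
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

Because the event timestamps are taken on the GPU itself, the measured interval covers the kernel, regardless of what the host does in between.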