SPMT: Single Program Multiple (Execution) Time

Kiran_CUDA · June 23, 2009, 12:31pm

Hi,
I compiled and successfully ran the following program, which squares the first 1000 integers, on my 9600 GT (512 MB). I am wondering why I get a different execution time every time I execute it. This happens in emulation mode as well.

The execution time on GPU is : 0.061770, 0.066406, 0.054441 (ms)
The execution time in emulation mode: 4.440767, 4.114464, 4.896113 (ms)

I am calling it SPMT: Single Program Multiple (Execution) Time :)

Thanks in advance.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

// Kernel that executes on the device: squares each element in place
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
    float *a_h, *a_d;   // Pointers to host & device arrays
    const int N = 1000; // Number of elements in arrays

    unsigned int timer = 0;
    cutCreateTimer( &timer );
    unsigned int timercpu = 0;
    cutCreateTimer( &timercpu );

    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);        // Allocate array on host
    cudaMalloc((void **) &a_d, size);   // Allocate array on device

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);

    cutStartTimer( timer ); // Start timer
    square_array <<< n_blocks, block_size >>> (a_d, N);
    cutStopTimer( timer );  // Stop timer

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    printf("CUDA execution time = %f ms\n", cutGetTimerValue( timer ));
    getchar();

    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
    getchar();

    // Cleanup
    free(a_h); cudaFree(a_d);
    return 0;
}

Nico · June 23, 2009, 12:48pm

In order to get better timing results, you should perform it on a (preferably large) number of iterations.

The first iteration is usually slower.

This code should give you more accurate timing results:

[codebox]cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError( cutResetTimer(hTimer) );
cutilCheckError( cutStartTimer(hTimer) );

for(int i = 0; i < NUM_ITERATIONS; i++){
    Kernel<<<...,...>>>(...);
    cutilCheckMsg("Kernel() execution failed\n");
}

cutilSafeCall( cudaThreadSynchronize() );
cutilCheckError( cutStopTimer(hTimer) );

double gpuTime = cutGetTimerValue(hTimer) / NUM_ITERATIONS;
printf("Avg. time: %f ms\n", gpuTime);[/codebox]

N.

Kiran_CUDA · June 23, 2009, 12:51pm

Thanks Nico…

I will try it, but will I get the same execution time every time I run my program?

jph4599 · June 23, 2009, 12:51pm

Also keep in mind the limitations mentioned by tmurray in this post:

If you want accurate GPU execution time, you really should use events…
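A minimal sketch of what event-based timing might look like, assuming the `square_array` kernel from the first post. The iteration count `NUM_ITERATIONS`, the warm-up launch, and the zero-filled input are illustrative choices, not part of the original program, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

int main()
{
    const int N = 1000;
    const int NUM_ITERATIONS = 1000;   // assumption: chosen for illustration
    const int block_size = 4;          // same launch configuration as the original post
    const int n_blocks = (N + block_size - 1) / block_size;

    float *a_d = nullptr;
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMemset(a_d, 0, N * sizeof(float));   // zeros stay zero when squared repeatedly

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization cost is not timed.
    square_array<<<n_blocks, block_size>>>(a_d, N);

    // Events are recorded in the GPU's own stream, so the interval between
    // them reflects time on the device rather than host launch overhead.
    cudaEventRecord(start, 0);
    for (int i = 0; i < NUM_ITERATIONS; ++i)
        square_array<<<n_blocks, block_size>>>(a_d, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until the stop event has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Avg. kernel time: %f ms\n", ms / NUM_ITERATIONS);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a_d);
    return 0;
}
```

Averaging over many launches also helps when a single run is short compared with the timer's resolution.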

Kiran_CUDA · June 24, 2009, 10:20am

Thanks friends…

Well, my question is exactly why this variation in execution time occurs.

avidday · June 24, 2009, 10:33am

And the answer is that you cannot conclude there is any significant variation in execution time. You have presented 3 execution time data points (which include host to device data transfers) obtained while running what amounts to a null kernel, and measured using an event timer whose precision is low relative to the running time. It is impossible to draw any conclusions on the basis of what you have done.

Kiran_CUDA · June 24, 2009, 11:27am

Thanks a ton, Avidday…

But actually I am unable to understand your point… I have a simple question: I have a well-defined set of program instructions, and it should take the same time every time I execute it.

Is it because of the (random) memory allocation using malloc that I am getting this variation? And why is it impossible to draw any conclusions on the basis of what I have done?

avidday · June 24, 2009, 12:57pm

The whole point is you are not measuring how long it takes for the GPU to run the code, you are measuring how long it takes for the kernel to launch, and then you are claiming that there is inexplicable variation between durations of 61, 66 and 54 microseconds measured using the standard host real time clock, which probably doesn’t have resolution below about 10 microseconds anyway.

Kiran_CUDA · June 26, 2009, 11:57am

Thanks Avidday !! I got your points now. I am measuring how long it takes for the kernel to launch and not how long it is taking for the GPU to run my code. :)

Now can you tell me exactly how I can get the time it takes for the GPU to run my code?

avidday · June 26, 2009, 12:24pm

Call cudaThreadSynchronize() before stopping the timer.
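As a rough sketch of the difference this makes, the following times the same launch twice: once stopping right after the launch call returns, and once after synchronizing. It assumes a recent toolkit, so it uses `cudaDeviceSynchronize()` (the current name for the now-deprecated `cudaThreadSynchronize()`) and `std::chrono` in place of the cutil timer; neither substitution comes from the original program.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

int main()
{
    const int N = 1000;
    const int block_size = 4;   // same launch configuration as the original post
    const int n_blocks = (N + block_size - 1) / block_size;

    float *a_d = nullptr;
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMemset(a_d, 0, N * sizeof(float));

    // Warm-up launch so one-time initialization cost is not timed.
    square_array<<<n_blocks, block_size>>>(a_d, N);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    square_array<<<n_blocks, block_size>>>(a_d, N);  // launch is asynchronous
    auto t1 = std::chrono::steady_clock::now();      // kernel may still be running here
    cudaDeviceSynchronize();                         // block until the kernel has finished
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> launch_only = t1 - t0;
    std::chrono::duration<double, std::milli> launch_and_run = t2 - t0;
    printf("launch only:        %f ms\n", launch_only.count());
    printf("launch + execution: %f ms\n", launch_and_run.count());

    cudaFree(a_d);
    return 0;
}
```

The first figure corresponds to what the original program measured; the second includes the kernel's execution, plus whatever overhead the synchronization call itself adds.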

Kiran_CUDA · July 4, 2009, 5:52am

Avidday: When we place a cudaThreadSynchronize() call before stopping the timer, it will block the CPU host code from continuing until all the threads have finished their jobs. Don't you think that this slows down the CPU and thus the whole program? Since the GPU and CPU work independently, the two should be executing their respective code SIMULTANEOUSLY. Is there any method for getting the timer value WITHOUT stopping the CPU from moving ahead?

Nico · July 4, 2009, 8:52am

yes, use cudaEvents.
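One way this might look, as a sketch rather than a recommended pattern: record events around the launch, then poll with `cudaEventQuery()` so the host keeps doing its own work until the GPU is done. The `cpu_steps` counter is a hypothetical stand-in for real CPU work, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

int main()
{
    const int N = 1000;
    const int block_size = 4;
    const int n_blocks = (N + block_size - 1) / block_size;

    float *a_d = nullptr;
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMemset(a_d, 0, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    square_array<<<n_blocks, block_size>>>(a_d, N);
    cudaEventRecord(stop, 0);          // returns immediately; host is not blocked

    // Poll the stop event without blocking. cudaEventQuery() returns
    // cudaErrorNotReady while the GPU has not yet reached the event.
    unsigned long cpu_steps = 0;       // hypothetical placeholder for CPU work
    while (cudaEventQuery(stop) == cudaErrorNotReady) {
        ++cpu_steps;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms, CPU steps done meanwhile: %lu\n", ms, cpu_steps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a_d);
    return 0;
}
```

For a kernel this small the GPU may already be finished by the first poll, so the CPU step count can be zero.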

N.

Kiran_CUDA · July 4, 2009, 11:27am

Nico: then what if someone wants to synchronize the threads but does not want to stop the CPU from going ahead? I mean, can we use both CUDA events and Synchthread() together, thereby getting an accurate value of the kernel execution time while allowing the CPU to move ahead (rather than blocking it until all the threads have finished their jobs)?

Nico · July 4, 2009, 11:39am

Can’t you perform the kernels + GPU stream/event synchronization and the CPU parts in separate host threads?

N.

Kiran_CUDA · July 4, 2009, 12:12pm

Thanks a lot Nico!!

Could you please tell me how to perform the kernels + GPU stream/event synchronization and the CPU parts in separate host threads?

I might be asking a very simple question, but I am trying my best to understand all these things.

Nico · July 4, 2009, 12:19pm

Here’s a basic thread tutorial for boost.
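For illustration, here is a sketch of the same idea using `std::thread` from the C++ standard library instead of boost (a substitution made here, not something from the tutorial). The helper `gpu_worker` and the CPU loop are hypothetical. It also assumes a toolkit recent enough (CUDA 4.0 or later) that device memory allocated in one host thread can be used from another; with the toolkits current in 2009 each host thread had its own context.

```cuda
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

__global__ void square_array(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] * a[idx];
}

// Hypothetical worker: launches the kernel, waits on the stop event,
// and stores the elapsed GPU time. Only this thread blocks.
void gpu_worker(float *a_d, int n, float *elapsed_ms)
{
    const int block_size = 4;
    const int n_blocks = (n + block_size - 1) / block_size;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    square_array<<<n_blocks, block_size>>>(a_d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsed_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main()
{
    const int N = 1000;
    float *a_d = nullptr;
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMemset(a_d, 0, N * sizeof(float));

    float elapsed_ms = 0.0f;
    std::thread worker(gpu_worker, a_d, N, &elapsed_ms);

    // The main thread is free to do unrelated CPU work in the meantime.
    double cpu_sum = 0.0;
    for (int i = 0; i < 1000000; ++i) cpu_sum += i * 0.5;

    worker.join();   // elapsed_ms is safe to read only after the join
    printf("kernel time: %f ms, cpu_sum: %f\n", elapsed_ms, cpu_sum);

    cudaFree(a_d);
    return 0;
}
```

On Linux this may need the host compiler's thread flag passed through nvcc, for example `-Xcompiler -pthread`.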

N.
