Execution time: the first execution time is always slow

When I execute the .exe file compiled from the following code, why is the first execution time always slower than the others? (N=512 is slower than N=1024, and so on…)

Result (sum, elapsed time):

512      0.0888103
1024     0.000553656
2048     0.000540972
4096     0.000565986
8192     0.000561761
16384    0.000560941
32768    0.000612148
65536    0.000697729
131072   0.000837562
262144   0.00144884

//-----------Main.cpp
#include <iostream>
#include "stopWatch.c"

using namespace std;

extern "C" void reduce(int *, int *, unsigned int);

int main()
{
    double GPU;
    for(unsigned int z = 512; z <= 262144; z *= 2)
    {
        //unsigned int N = 512;
        int *h_a = new int[z];

        for(unsigned int i = 0; i < z; i++)
            *(h_a + i) = 1;

        int *h_sum = new int[z / 512];
        watchStart();
        reduce(h_a, h_sum, z);
        watchStop();
        GPU = elapsedTime();

        cout << *h_sum << endl;
        cout << GPU << endl;

        delete[] h_a;
        delete[] h_sum;
    }

    return 0;
}


//------------Reduce.cu
#include <stdlib.h>
#include <stdio.h>

__global__ void sum_kernel(int *, int *);

extern "C" void reduce(int *h_a, int *h_sum, unsigned int N)
{
    unsigned int threadSize = 512;
    unsigned int blockSize = N / threadSize;
    int *d_a = 0, *d_output = 0;
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMalloc((void **)&d_output, blockSize * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

    sum_kernel<<<blockSize, threadSize>>>(d_a, d_output);
    sum_kernel<<<1, blockSize>>>(d_output, d_output);
    cudaMemcpy(h_sum, d_output, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_output);
}

__global__ void sum_kernel(int *d_a, int *d_output)
{
    __shared__ int share_a[512];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    share_a[threadIdx.x] = d_a[idx];
    __syncthreads();

    for(int distance = blockDim.x / 2; distance > 0; distance /= 2)
    {
        if(threadIdx.x < distance)
            share_a[threadIdx.x] += share_a[threadIdx.x + distance];
        __syncthreads();
    }

    if(threadIdx.x == 0)
        d_output[blockIdx.x] = share_a[0];
}

Thank you very much!

Initialisation overhead. The first kernel run should always be left out of your performance measurements.

I have noticed the same even with non-CUDA applications. I was using the Intel timebase mechanism to measure time.

I just assumed that the OS reads the EXE file off the disk, and there could be some page faults and more cache misses the first time. From the next run onwards, everything is already in place, so there is less overhead and your time is measured more accurately.

You can get rid of the setup time by simply allocating and deallocating some GPU memory before measuring time. Something like:

float *dummy;
CUDA_SAFE_CALL( cudaMalloc( (void**)&dummy, 100 * sizeof(float) ) );
CUDA_SAFE_CALL( cudaFree( dummy ) );
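This works because the CUDA runtime sets up its context lazily on the first call that needs it, so the dummy allocation simply absorbs that one-time cost. A minimal alternative sketch, assuming device 0 and the runtime API (no allocation needed):

cudaSetDevice(0);   // choose the device explicitly (optional)
cudaFree(0);        // no-op, but still forces the context to be created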

The author was talking about why the first EXE execution is slower compared to subsequent invocations. I don't think your answer actually answers that.

Well, it isn't really a good idea to time the whole .exe execution. There are a number of different factors which can greatly influence those timings.

Generally Sarnath is correct. When you start an .exe, Windows needs to load the file into memory. Loading the file for the first time requires reading from disk. Subsequent executions may not trigger actual disk reads because the file may still reside in the disk cache.

However, I think chris777 is timing kernel execution, not .exe running time. In that case CUDA initialization overhead is responsible for this.
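One way to see this is to do one untimed warm-up launch and then time only the second launch, for example with CUDA events. A minimal sketch (the helper name time_kernel is made up; it assumes it sits next to sum_kernel in Reduce.cu, with d_a and d_output already allocated and filled as above):

void time_kernel(int *d_a, int *d_output, unsigned int blockSize, unsigned int threadSize)
{
    // Warm-up launch: absorbs context creation and other one-time costs.
    sum_kernel<<<blockSize, threadSize>>>(d_a, d_output);
    cudaDeviceSynchronize();

    // Time only the second launch with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    sum_kernel<<<blockSize, threadSize>>>(d_a, d_output);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}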

Thanks for your reply.

I have another case besides this example; the following code is executed on the CPU only:

Result (sum, elapsed time):

512      4.64281e-006
1024     6.9097e-006
2048     1.26728e-005
4096     2.40148e-005
8192     4.51537e-005
16384    8.96457e-005
32768    0.00017837
65536    0.000394578
131072   0.000711628
262144   0.00142254

//-------Main.cpp
#include <iostream>
#include "stopWatch.c"

using namespace std;

void reduce(float *h_a, unsigned int N)
{
    for(unsigned int i = 1; i < N; i++)
        *h_a += *(h_a + i);
}

int main()
{
    for(unsigned int N = 512; N <= 262144; N *= 2)
    {
        //unsigned int N = 262144;
        float *h_a = new float[N];
        double CPU;

        for(unsigned int i = 0; i < N; i++)
            *(h_a + i) = 1.0f;

        watchStart();
        reduce(h_a, N);
        watchStop();
        CPU = elapsedTime();

        cout << *h_a << endl;
        cout << CPU << endl;

        delete[] h_a;
    }
    return 0;
}

Why is the first execution time faster than the others here?

Because the number of iterations is smaller.

What does that mean? I don’t understand that.

There are 512 threads in a block, but actually only 256 of the threads do the additions in each block.

512      0.0888103
1024     0.000553656

It's because of initialization overhead that the first execution time, 0.0888103 (N=512), is slower than 0.000553656 (N=1024).

In the CPU version, why doesn't the initialization overhead take as much time as in the GPU version?

Is it because of the number of iterations? Why?

Thank you. :)

Because your CPU code has to do fewer iterations for the first case than for the second (512 vs. 1024). Initialisation overhead is something you have on the GPU (the program needs to be transferred to the GPU, etc.).
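You can check this against the numbers you posted: the CPU time per element is roughly constant (4.64e-6 / 512 ≈ 9.1e-9 and 1.42e-3 / 262144 ≈ 5.4e-9), so the total grows with N, while the GPU times sit in a narrow band (about 0.00055 to 0.00084) for every size from 1024 up to 131072, i.e. they are dominated by a fixed per-call overhead rather than by N.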

Thanks for the reply, and I will try to understand that.

Thank you.

I don't think "initialization overhead" is responsible for the FIRST EXE execution latency. "Initialization" has to happen for every EXE invocation, right?

I have seen this first-time overhead even with NON-CUDA applications. Do a small check on it. It is just related to caching at various levels (file caching, disk data caching, physically tagged L1/L2/L3 caches, and so on…).
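A small check along these lines could look like the sketch below (plain C++ with std::chrono, no CUDA, all names made up): write to a freshly allocated buffer twice and time both passes. The first pass pays for the page faults that map the memory in; the second pass finds everything already resident.

#include <chrono>
#include <iostream>

int main()
{
    const unsigned int N = 1 << 24;   // 16M ints, about 64 MB
    int *buf = new int[N];

    for (int pass = 0; pass < 2; ++pass)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        for (unsigned int i = 0; i < N; ++i)
            buf[i] = 1;               // first pass faults the pages in, second pass reuses them
        auto t1 = std::chrono::high_resolution_clock::now();
        std::cout << "pass " << pass << ": "
                  << std::chrono::duration<double>(t1 - t0).count() << " s" << std::endl;
    }

    std::cout << buf[N - 1] << std::endl;   // keep the stores observable
    delete[] buf;
    return 0;
}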

But he is not speaking of the first EXE execution overhead. He is speaking of the first kernel invocation, which is slower (while having less to do).