Timing Issue

I have two kernels that do the same thing in slightly different ways. The first kernel uses three separate float arrays, whereas the second uses a single float3 array. The idea is to time both and determine which implementation is better. However, the problem is that whichever kernel executes first takes almost twice as long as the kernel executed second.

Here are the kernels:

__global__ void addDevice(float *A, float *B, float *C, float nStrm, float bSize, float dt) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    float a, b, c;

    a = A[tx];
    b = B[tx];
    c = C[tx];
    b = nStrm * a + (1 - nStrm) * b + bSize * c * dt;
}

__global__ void addDevice3(float3 *ABC, float nStrm, float bSize, float dt) {
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    float3 abc;
    float b;

    abc = ABC[tx];
    b = nStrm * abc.x + (1 - nStrm) * abc.y + bSize * abc.z * dt;
}

Say I call addDevice first: it takes approx. 0.06 s and addDevice3 takes approx. 0.03 s. But if I call addDevice3 first, it takes approx. 0.06 s and addDevice takes 0.03 s. So, as I said, whichever kernel executes first takes approximately twice as long as the one executed second. I am using CUDA events to time the kernels, recording an event right before and after each kernel execution, so it looks to me that I'm timing correctly.
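For reference, this is roughly how I'm timing each launch (a sketch; the launch configuration names nBlocks/nThreads and the device pointers are placeholders, and error checking is omitted):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // record just before the launch
addDevice<<<nBlocks, nThreads>>>(d_A, d_B, d_C, nStrm, bSize, dt);
cudaEventRecord(stop, 0);               // record just after the launch
cudaEventSynchronize(stop);             // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
```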

As can be seen above, the two kernels use different arrays, so I don't think caching explains why the second kernel runs faster. The question is: why is this happening?

Thanks.

CUDA needs some initialization (context creation), and this happens on the first CUDA call. I often see this behavior too; launching some kernel (or making any CUDA call) in advance of the code you want to measure should solve it.
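For example, a trivial warm-up before the timed section (a sketch; the empty kernel name is made up, and cudaFree(0) is a common alternative way to force context creation):

```cuda
__global__ void warmup() {}   // does nothing; only triggers initialization

int main() {
    warmup<<<1, 1>>>();       // first CUDA call pays the context-creation cost
    cudaDeviceSynchronize();  // make sure initialization has completed

    // ... now run and time addDevice / addDevice3 as before ...
    return 0;
}
```

With this in place, the order in which addDevice and addDevice3 are launched should no longer change their measured times.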