I am getting some weird timing for a very simple kernel.
The following function is used to find the min of an array:

__global__ void k_gpuFCT_timestep_pass1(float *dt, int nbrPoints, float *d)
{
    float minv = dt[0];
    for (int i = 0; i < nbrPoints; i++) minv = min(minv, dt[i]);
    d[SOLVERDATA_TIMESTEP] = minv;
}
This is run with only one block and one thread. d and dt are two cudaMalloc'd arrays; d holds about 16 floats and dt holds 32000 floats.
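In other words, the launch is simply:

k_gpuFCT_timestep_pass1<<<1, 1>>>(dt, nbrPoints, d);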
Timed like this, the routine takes about 6.1 ms. If I comment out the d[SOLVERDATA_TIMESTEP] = minv line, the timing drops to 0.03 ms.
I cannot figure out why that one extra line is such a problem. Any ideas?
In fact, if somebody has a better way to compute the min of an array without having to switch back to 1 block x 1 thread, I would appreciate the tip. I have looked at the scan SDK example and adapted it, but that is far too expensive for such a simple problem.
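For what it's worth, the kind of multi-block alternative I have in mind is a per-block tree reduction along these lines (a rough sketch only; the kernel name, BLOCK_SIZE, and partialMins are placeholder names of mine, and I have not benchmarked this):

#define BLOCK_SIZE 256

__global__ void k_minReduce(float *dt, int nbrPoints, float *partialMins)
{
    __shared__ float smem[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element; out-of-range threads load FLT_MAX.
    smem[tid] = (i < nbrPoints) ? dt[i] : 3.402823466e38f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] = min(smem[tid], smem[tid + s]);
        __syncthreads();
    }

    // Thread 0 writes this block's partial minimum.
    if (tid == 0) partialMins[blockIdx.x] = smem[0];
}

A second pass (or the 1x1 kernel above) would then reduce the per-block partial minima. Is that the recommended pattern, or is there something simpler?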
I have also tried replacing d[SOLVERDATA_TIMESTEP] with a __device__ variable, without much luck.
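That attempt looked roughly like this (reconstructed from memory; g_minTimestep is a placeholder name):

__device__ float g_minTimestep;

__global__ void k_gpuFCT_timestep_pass1(float *dt, int nbrPoints)
{
    float minv = dt[0];
    for (int i = 0; i < nbrPoints; i++) minv = min(minv, dt[i]);
    g_minTimestep = minv;   // same slow timing as the d[] write
}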