Speed not stable and performance lost: maybe a HUGE bug

I am working on applying CUDA to image processing, and recently I have found something rather confusing.

Environment: CUDA 1.0, 8800 GTS, 4 GB memory, Core 2 Duo 6500E, 162.01_int driver, Windows XP, DX9 (I think it has nothing to do with DX, though)
shared memory usage: ~15 KB/block

In my program, I have to run a kernel 74 times. I wrote the call out 74 times rather than using a for loop.

__global__ Myfunc();

…74 times…
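For reference, the 74 separate calls could be replaced by a plain host-side loop; each <<<...>>> launch is queued independently either way. A minimal sketch, with placeholder grid/block sizes (not the poster's actual configuration):

```cuda
__global__ void Myfunc()
{
    // kernel body elided
}

int main()
{
    dim3 grid(64), block(256);       // placeholder launch configuration

    for (int i = 0; i < 74; ++i)
        Myfunc<<<grid, block>>>();   // launches are queued asynchronously

    cudaThreadSynchronize();         // wait for all 74 kernels to finish
    return 0;
}
```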

I timed a single run and it took only 0.041 ms, which is pretty good. The first 20 runs cost only 1-2 ms. GOOD.
40 runs: 14 ms.
But I have to run it 74 times, and that cost 53 ms! This is by no means fast compared to the CPU.

In the SDK “transpose” example a similar problem occurred:
iterations : naive : optimized
1 : 0.580 : 0.011
2 : 0.292 : 0.011
8 : 0.081 : 0.012 // still the 10x speedup the intro promised
9 : 0.072 : 0.435 // !
16 : 0.046 : 3.553 // not only unstable, but also much slower
As the iteration count increases further, the speed is still not stable and the performance loss is huge:
40 : 2.207 : 1.895
100 : 3.156 : 1.051
200 : 3.473 : 0.767 // optimized 75x slower, naive 6x slower

I don't know what caused this. Is the compiler hiding the memory transfer time? Is CPU-GPU communication causing the slowdown?

Has anybody run the transpose example and got a better result? Which speed is the real one? Can the lost performance be recovered by changing settings, or by using CUDA 1.1 and a later driver? Thanks for any replies.

At least, it may be that the timing function is not reliable enough to estimate the speed of your program.

Kernels are launched asynchronously, and the queue depth for kernel launches is ~16 in my tests. You clearly are not calling cudaThreadSynchronize() to wait for the kernel to complete before marking the end time.
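The failure mode can be sketched with the SDK's cutil timer (cutStartTimer/cutStopTimer are host-side wall-clock calls; the kernel name and launch configuration are illustrative):

```cuda
unsigned int timer = 0;
cutCreateTimer(&timer);

// WRONG: the launch returns immediately, so this times only launch overhead
cutStartTimer(timer);
Myfunc<<<grid, block>>>();
cutStopTimer(timer);

// RIGHT: block until the kernel has actually finished before stopping the timer
cutResetTimer(timer);
cutStartTimer(timer);
Myfunc<<<grid, block>>>();
cudaThreadSynchronize();
cutStopTimer(timer);
```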

Thanks for the reply.
So is the slower number the real speed, or can performance be improved by synchronization?

It is impossible to even know what you are measuring now, so I have no idea what the “real” execution time is. Some of the values you quote could be measuring the time of the last 4 kernel launches, or the last 2. Until you put a cudaThreadSynchronize() before every call to your timer, you are measuring garbage data and can draw absolutely no conclusions from it.

Synchronizing cannot enhance performance; it can only slow things down by preventing the driver from overlapping CPU and GPU computation. Synchronizing will, however, cause you to measure the proper times of kernel launches.

Note that you can also use the CUDA PROFILER for this task. See the docs in the toolkit download.

Thanks, really.

Another question:
You mentioned that the times I quoted could be the time of the last 2 or 4 kernel launches, but in the transpose example it is an average time over the iterations.
Is there something wrong with the example? The initial value of the iteration count is 1, so is the 0.580 / 0.011 result reliable?

There is nothing wrong with the way the transpose sample measures average time. cudaThreadSynchronize() is called before starting and stopping each timer. So the elapsed time is exactly the time for as many kernels as were launched between the cudaThreadSynchronize() calls.
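That pattern can be sketched as follows (numIterations, the timer handle, and the kernel arguments are illustrative names, not the sample's exact code):

```cuda
cudaThreadSynchronize();                 // drain any previously queued work
cutStartTimer(timer);

for (int i = 0; i < numIterations; ++i)
    transpose<<<grid, threads>>>(d_odata, d_idata, size_x, size_y);

cudaThreadSynchronize();                 // ensure every iteration has completed
cutStopTimer(timer);

float avgMs = cutGetTimerValue(timer) / numIterations;  // average per launch
```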