Same kernels, different running times: asking for help

I wrote 3 simple kernels and found that it takes quite a long time to run them 1000 times (the purpose of the kernels is to simulate the logistic map). Then I found that the first 300 iterations take 69 ms, but the next 50 iterations take 158 ms. I use CUDA 2.2 on a 9600GT, and the OS is Windows XP, but it also happens on RHEL 5.1.
Can anyone tell me why this happens?

The most likely reason is that you are not using cudaThreadSynchronize(), so your timing measurements are not measuring what you think they are measuring. Kernel launches are asynchronous and can queue up on the device. When you stop the stopwatch, they may not have finished yet. Calling cudaThreadSynchronize() before stopping it guarantees that they have finished.

This is not necessarily your problem but it’s the most likely culprit.
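
To make the point concrete, here is a minimal sketch of the two ways to time the loop. The kernel logistic_step, the array size, and the launch configuration are made up for illustration and are not the original poster's code. It is written for a recent toolkit, where cudaThreadSynchronize() is deprecated and the equivalent call is cudaDeviceSynchronize(); on CUDA 2.2 you would call cudaThreadSynchronize() in the same places and use your own host timer.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one logistic-map step x <- r * x * (1 - x) per element.
__global__ void logistic_step(float *x, float r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = r * x[i] * (1.0f - x[i]);
    }
}

int main()
{
    const int n = 1 << 20;        // assumed problem size
    const int iterations = 1000;  // same count as in the question
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    // All-zero input is a fixed point of the map; good enough for timing.
    cudaMemset(d_x, 0, n * sizeof(float));

    // Warm-up launch, then drain the queue so the stopwatch starts clean.
    logistic_step<<<blocks, threads>>>(d_x, 3.9f, n);
    cudaDeviceSynchronize();

    typedef std::chrono::steady_clock clock_type;
    clock_type::time_point start = clock_type::now();

    for (int k = 0; k < iterations; ++k) {
        logistic_step<<<blocks, threads>>>(d_x, 3.9f, n);  // returns before the kernel finishes
    }

    // Stopping here measures mostly launch/queueing time, not execution time.
    clock_type::time_point launched = clock_type::now();

    // Block the host until every queued kernel has actually finished.
    cudaDeviceSynchronize();
    clock_type::time_point finished = clock_type::now();

    double ms_launch = std::chrono::duration<double, std::milli>(launched - start).count();
    double ms_total  = std::chrono::duration<double, std::milli>(finished - start).count();
    printf("without synchronize: %.3f ms\n", ms_launch);
    printf("with synchronize:    %.3f ms\n", ms_total);

    cudaFree(d_x);
    return 0;
}
```

Only the second number reflects the time the kernels really took. The first one depends on how many launches the driver is willing to queue before a launch call itself starts to block, which could explain timings that jump partway through a loop.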

I used __syncthreads(). Are the two the same?

__syncthreads() is used on the device side to synchronize threads within a kernel (specifically, within a block).

cudaThreadSynchronize() is used on the host side to make sure the kernel calls have finished. You usually don’t need cudaThreadSynchronize() because most operations naturally wait for the previous operations to finish. For example, cudaMemcpy() waits for any queued kernels to finish (otherwise it would give wrong results!). But when making timing measurements, you do need to use cudaThreadSynchronize().
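
A small sketch of the difference, with a made-up kernel (reverse_in_block) and sizes chosen only for illustration: __syncthreads() appears inside the kernel and is a barrier for the threads of one block, while on the host the blocking cudaMemcpy() is what waits for the kernel, so no explicit host-side synchronize is needed to get correct results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

// Hypothetical kernel: each block reverses its own 256-element chunk
// by staging it in shared memory.
__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;

    tile[t] = data[base + t];

    // Device side: wait until every thread in THIS block has written its
    // element to tile. It says nothing about other blocks or about the host.
    __syncthreads();

    data[base + t] = tile[BLOCK - 1 - t];
}

int main()
{
    const int n = 4 * BLOCK;
    int h[n];
    for (int i = 0; i < n; ++i) {
        h[i] = i;
    }

    int *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Asynchronous launch: control returns to the host immediately.
    reverse_in_block<<<n / BLOCK, BLOCK>>>(d);

    // Host side: this blocking copy waits for the queued kernel to finish
    // before it reads the data, so the result is correct without an
    // explicit synchronize call.
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h[0] = %d\n", h[0]);  // first chunk reversed, so this prints 255

    cudaFree(d);
    return 0;
}
```

If you wrapped a host timer around just the launch line, you would again need an explicit host-side synchronize before stopping the timer; the __syncthreads() inside the kernel does not help with that.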

I’ll try that, thanks!

So it is, thank you very much.