Kernel launch

Hi everyone,
I have a question. I’m working on an implementation of the Nelder-Mead method on the GPU. The method itself runs on the CPU, but the error computations (for a set of points) run on the GPU. Convergence requires many iterations (kernel calls). The kernel itself is very simple, but it is called hundreds of times. I noticed that some calls (~10%) take more than 1 ms, while the others take only 0.01 ms. Does anyone have an idea why that happens, and is it possible to avoid the delay?

Thanks a lot in advance!
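
For anyone who wants to reproduce the measurement, here is a minimal host-side sketch of how the per-call times could be collected and the share of slow calls counted. The names time_calls and slow_fraction are made up for illustration and are not from the original code. It assumes that work wraps one kernel launch followed by a blocking synchronization such as cudaDeviceSynchronize(), because kernel launches are asynchronous and the timer would otherwise only see the launch itself.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: wall-clock time in milliseconds of each of `calls`
// invocations of `work`. In a CUDA program, `work` is assumed to launch the
// kernel and then block until it has finished.
std::vector<double> time_calls(const std::function<void()>& work, std::size_t calls) {
    std::vector<double> ms;
    ms.reserve(calls);
    for (std::size_t i = 0; i < calls; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Hypothetical helper: fraction of calls that took longer than `threshold_ms`.
double slow_fraction(const std::vector<double>& ms, double threshold_ms) {
    if (ms.empty()) return 0.0;
    std::size_t slow = 0;
    for (double t : ms) {
        if (t > threshold_ms) ++slow;
    }
    return static_cast<double>(slow) / static_cast<double>(ms.size());
}
```

Calling slow_fraction(time_calls(work, 1000), 1.0) would then give the share of calls above 1 ms, which makes it easier to compare runs across drivers or operating systems.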

I have the same problem with my kernels and have not found a solution for it. However, the kernel call time is shorter on Windows 7 than on Windows XP.