I have a question. I’m working on an implementation of the Nelder-Mead method on the GPU. The method itself runs on the CPU, but the error computation (over a set of points) is done on the GPU. Convergence requires many iterations, i.e. many kernel calls. The kernel itself is very simple, but it gets called hundreds of times. I noticed that some calls (~10%) take more than 1 ms, while others take only 0.01 ms… Does anyone have an idea why this happens, and is it possible to avoid that delay?
Thanks a lot in advance!
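In case it helps to reproduce what I'm seeing: a minimal sketch of how the per-call times could be measured with CUDA events (which time actual GPU execution) instead of a host timer (which also picks up launch overhead and driver batching). `errorKernel` here is just a placeholder for my real error kernel, and the grid/block sizes are arbitrary:

```cuda
// Sketch: time each kernel launch with CUDA events to separate
// GPU execution time from host-side launch overhead.
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real per-point error kernel.
__global__ void errorKernel(const float* pts, float* err, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) err[i] = pts[i] * pts[i];  // dummy error term
}

int main() {
    const int n = 1 << 16;
    float *pts, *err;
    cudaMalloc(&pts, n * sizeof(float));
    cudaMalloc(&err, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int it = 0; it < 100; ++it) {
        cudaEventRecord(start);
        errorKernel<<<(n + 255) / 256, 256>>>(pts, err, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait for this launch to finish
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iter %3d: %.4f ms\n", it, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(pts);
    cudaFree(err);
    return 0;
}
```

If the event times are uniform but the host-measured times are spiky, the variance would be on the launch/driver side rather than in the kernel itself.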