I am implementing an iterative solver using the cuBLAS and cuSPARSE Fortran libraries. I have noticed that in every iteration, after a few library calls, a significant delay occurs. The delay recurs at least twice per iteration and is not affected by the order of the library calls. I measured these delays at 0.06 to 0.08 s, depending on the problem size, while the computing time of the library calls is about 0.003 s. These delays seriously hurt performance. The machine has two Tesla M2070 GPUs, CUDA 5.5 and PGI 14.3. Do you have any idea why this might happen?