Tips to avoid laggy display during long kernels

Dear CUDA users,
I am currently doing image processing on the GPU, and I have one kernel that takes roughly 500 to 700 milliseconds when running on big images.
The problem is that the whole display, and even the mouse cursor, becomes laggy (OS: Windows 7).

My idea was to split my kernel into 4 or 8 kernel launches, hoping that the driver could refresh the display more often (between kernel launches).
Unfortunately, it does not help at all. What else could I try to avoid this freezing display effect?
Note: I am prepared to trade performance for smoothness!

Insert a cudaStreamQuery(0) after each kernel launch to prevent the launches from being sent together as one batch.
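A minimal sketch of that suggestion, assuming the work can be split by image rows. The kernel processRows, the helper processInChunks, and the doubling operation are hypothetical placeholders, not the original poster's code. The idea is that polling the default stream right after each launch pushes that launch to the GPU immediately instead of letting the driver queue several launches and submit them as one long batch.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real image-processing kernel:
// handles rows [rowBegin, rowEnd) of an image that is `width` pixels wide.
__global__ void processRows(const float* in, float* out,
                            int width, int rowBegin, int rowEnd)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = rowBegin + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < rowEnd)
        out[y * width + x] = 2.0f * in[y * width + x];  // placeholder work
}

// Hypothetical helper: launches the work in `numChunks` row slices and
// polls the default stream after every launch.
cudaError_t processInChunks(const float* dIn, float* dOut,
                            int width, int height, int numChunks)
{
    if (numChunks <= 0) return cudaErrorInvalidValue;

    const dim3 block(16, 16);
    const int rowsPerChunk = (height + numChunks - 1) / numChunks;

    for (int rowBegin = 0; rowBegin < height; rowBegin += rowsPerChunk) {
        int rowEnd = rowBegin + rowsPerChunk;
        if (rowEnd > height) rowEnd = height;

        dim3 grid((width + block.x - 1) / block.x,
                  (rowEnd - rowBegin + block.y - 1) / block.y);
        processRows<<<grid, block>>>(dIn, dOut, width, rowBegin, rowEnd);

        // Catch invalid launch configurations early.
        cudaError_t status = cudaGetLastError();
        if (status != cudaSuccess) return status;

        // Non-blocking poll of the default stream. It returns cudaSuccess
        // if the chunk already finished, or cudaErrorNotReady if it is
        // still running; neither case is a failure.
        status = cudaStreamQuery(0);
        if (status == cudaErrorNotReady) {
            // Reset the runtime's error state in case the "not ready"
            // result was recorded, so the next launch check stays clean.
            (void)cudaGetLastError();
        } else if (status != cudaSuccess) {
            return status;
        }
    }

    // Wait for all chunks and report any execution error.
    return cudaDeviceSynchronize();
}
```

A call such as processInChunks(dIn, dOut, width, height, 8) would replace the single long launch. More chunks give the display more chances to refresh, at the cost of extra launch overhead.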

OK, I will try that ASAP. So do you think (or know) that splitting my kernel will solve my issue?

It’s hard to say without seeing at least a sketch of your code.

- Are you doing graphics interoperability?
- Can you overlap memory copies and kernel calls?


No graphics interop.
I don’t overlap memory transfers and kernels, but the data processed is quite small (512² images) and the transfer takes only a small percentage of the whole processing time.

cudaStreamQuery(0) solved the issue. Thanks!