cudaStreamQuery returns cudaSuccess directly after kernel launch

I am using two streams to overlap a kernel execution with a memory copy from device to page-locked host memory. The strange thing is that when I call cudaStreamQuery on the kernel's stream directly after the kernel launch, it always returns cudaSuccess, as if the launch did not return until the kernel had finished executing. When I do the same thing after the asynchronous memory copy, I get cudaErrorNotReady, as expected.
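
The pattern described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual code: the kernel `busyKernel`, the helper `runExperiment`, the `QueryResult` struct, the buffer size, and the iteration count are all placeholders I made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: loops long enough that it should still be
// running when the host queries the stream right after the launch.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Hypothetical container for the status codes collected below.
struct QueryResult {
    cudaError_t launch;           // cudaGetLastError() right after the launch
    cudaError_t kernelQuery;      // cudaStreamQuery() right after the launch
    cudaError_t copyQuery;        // cudaStreamQuery() right after cudaMemcpyAsync
    cudaError_t kernelAfterSync;  // cudaStreamQuery() after synchronizing
};

QueryResult runExperiment()
{
    const int n = 1 << 20;                 // assumed size, for illustration
    const size_t bytes = n * sizeof(float);
    float *dA = 0, *dB = 0, *hB = 0;
    cudaStream_t kernelStream, copyStream;
    QueryResult r;

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMallocHost((void **)&hB, bytes);   // page-locked host memory
    cudaMemset(dA, 0, bytes);
    cudaMemset(dB, 0, bytes);
    cudaStreamSynchronize(0);              // let the memsets finish first
    cudaStreamCreate(&kernelStream);
    cudaStreamCreate(&copyStream);

    // Launch the kernel in its own stream and query that stream at once.
    busyKernel<<<(n + 255) / 256, 256, 0, kernelStream>>>(dA, n, 10000);
    r.launch = cudaGetLastError();
    r.kernelQuery = cudaStreamQuery(kernelStream);

    // Start the device-to-host copy in the other stream and query it too.
    cudaMemcpyAsync(hB, dB, bytes, cudaMemcpyDeviceToHost, copyStream);
    r.copyQuery = cudaStreamQuery(copyStream);

    // Wait for both streams, then query the kernel stream once more.
    cudaStreamSynchronize(kernelStream);
    cudaStreamSynchronize(copyStream);
    r.kernelAfterSync = cudaStreamQuery(kernelStream);

    cudaStreamDestroy(kernelStream);
    cudaStreamDestroy(copyStream);
    cudaFreeHost(hB);
    cudaFree(dA);
    cudaFree(dB);
    return r;
}

int main()
{
    QueryResult r = runExperiment();
    printf("launch status:        %s\n", cudaGetErrorString(r.launch));
    printf("kernel stream query:  %s\n", cudaGetErrorString(r.kernelQuery));
    printf("copy stream query:    %s\n", cudaGetErrorString(r.copyQuery));
    printf("kernel query (synced): %s\n", cudaGetErrorString(r.kernelAfterSync));
    return 0;
}
```

The sketch also records `cudaGetLastError()` right after the launch, because a launch that fails may leave nothing pending in the stream, in which case a query reporting cudaSuccess would not say anything about kernel execution. Which value the two immediate queries return depends on timing and hardware, so no expected output is given here.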

I am using a GTX295 on Kubuntu 9.04.

Any ideas why this is happening?
Thanks :-)