CUDA kernel not launching asynchronously?

I have an algorithm that needs to access a fair amount of memory, so to speed it up I split the work into streams so that the memory copies and the kernel can execute in parallel. Unfortunately, my first attempt didn’t speed it up as much as I had hoped. I then added some timing code to see what was going on, and it looks like the kernel isn’t launching asynchronously: I read the time on either side of the kernel call, and the difference between the two readings is about 200 ms. I’m using GetTickCount() on Windows 7, with a Compute Capability 1.2 card (Quadro FX 880M) and CUDA SDK 4.2 installed. I did the same thing around the cudaMemcpyAsync calls and got the same result. Any ideas?
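For anyone who wants to reproduce the measurement, here is a minimal sketch of the timing test described above. It is not my actual code: busyKernel, the buffer size, and the iteration count are made-up placeholders, and it uses std::chrono instead of GetTickCount(), so it assumes a toolkit with C++11 support (newer than the 4.2 SDK mentioned above). If the launch is asynchronous, the first number should be far smaller than the second; if the two are about equal, the launch is blocking.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: spins long enough that a blocking launch is obvious.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < iters; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Milliseconds of wall-clock time elapsed since t0.
static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20;           // placeholder problem size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d = NULL;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up launch so one-time setup cost is not counted in the measurement.
    busyKernel<<<blocks, threads, 0, stream>>>(d, n, 1);
    cudaStreamSynchronize(stream);

    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    busyKernel<<<blocks, threads, 0, stream>>>(d, n, 10000);
    double launchMs = msSince(t0);   // time until the launch call returned
    cudaStreamSynchronize(stream);
    double totalMs = msSince(t0);    // time until the kernel actually finished

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));

    std::printf("launch returned after %.3f ms, kernel finished after %.3f ms\n",
                launchMs, totalMs);

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The same pattern works around cudaMemcpyAsync, with the caveat that the host buffer has to be page-locked (allocated with cudaMallocHost, for example) for the copy to be asynchronous at all.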


I read about CUDA_LAUNCH_BLOCKING, but it wasn’t set on my system. I tried explicitly setting it to 0, but it made no difference.

Turns out I had COMPUTE_PROFILE set to 1, which enables the CUDA command-line profiler, and that automatically serializes the streams.
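In case it saves someone else the same hunt, here is a small host-only sketch that prints the two environment variables mentioned in this thread as the process actually sees them. The helper name envEnabled is made up, and treating any value other than empty or "0" as "enabled" is my assumption about how these variables are interpreted, not something taken from the documentation. It contains no device code, so it should build with nvcc or a plain C++ compiler.

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: true when the variable is set to anything other than "" or "0".
// (Assumed interpretation; the runtime's own parsing may differ.)
static bool envEnabled(const char *name)
{
    const char *v = std::getenv(name);
    if (v == NULL || v[0] == '\0')
        return false;
    return !(v[0] == '0' && v[1] == '\0');
}

int main()
{
    // The two variables discussed in this thread.
    const char *names[] = { "COMPUTE_PROFILE", "CUDA_LAUNCH_BLOCKING" };

    for (int i = 0; i < 2; ++i) {
        const char *v = std::getenv(names[i]);
        std::printf("%s = %s\n", names[i], v ? v : "(not set)");
        if (envEnabled(names[i]))
            std::printf("  warning: %s looks enabled; launches may not overlap\n",
                        names[i]);
    }
    return 0;
}
```

Running this from the same shell or IDE that launches the CUDA program matters, since a variable set system-wide can differ from what a particular process inherits.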