Kernels in separate streams do not run concurrently + cudaMemcpyToArrayAsync() runs in the wrong stream

Hello,

I am currently developing an application in CUDA and I have some questions regarding concurrency. The problem is that I cannot get two kernels to execute concurrently. I have checked that the card is capable of concurrent kernel execution and that the two kernels are in separate streams. I have also checked that no other kernel is blocking either execution. The program looks like this:

-cudaMemcpyAsync, host to device (Stream 1)
-Execution of kernel A (Stream 1)
-cudaMemcpyToArrayAsync, device to device (runs in Stream 0, although Stream 1 was specified)
-cudaStreamSynchronize (Stream 2)
-Binding of textures
-Execution of kernel B (Stream 1)

-cudaMemcpyAsync, host to device (Stream 2)
-Execution of kernel A (Stream 2)
-cudaMemcpyToArrayAsync, device to device (runs in Stream 0, although Stream 2 was specified)
-cudaStreamSynchronize (Stream 1)
-Binding of textures
-Execution of kernel B (Stream 2)

-cudaMemcpyAsync, host to device (Stream 1)
-Execution of kernel A (Stream 1)
-cudaMemcpyToArrayAsync, device to device (runs in Stream 0, although Stream 1 was specified)
-cudaStreamSynchronize (Stream 2)
-Binding of textures
-Execution of kernel B (Stream 1)

And so on…
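
To make the schedule above concrete, here is a minimal sketch of the loop. It is not my actual code: kernelA, kernelB, runPipeline, the buffer names, the problem size and the launch configuration are all placeholders I made up for illustration. The texture binding is only marked with a comment, and the placeholder kernel B reads a plain buffer instead of a texture to keep the sketch short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper: returns 1 from the enclosing function on failure.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                         cudaGetErrorString(err_));                        \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Placeholder for kernel A (assumption: any per-element work).
__global__ void kernelA(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Placeholder for kernel B (assumption: the real one reads through a texture).
__global__ void kernelB(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Issues the schedule from the list, alternating between two streams.
int runPipeline(int iterations) {
    const int n = 1 << 15;  // small enough for a 1D CUDA array
    const size_t bytes = n * sizeof(float);
    const int block = 256, grid = (n + block - 1) / block;

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    std::printf("concurrentKernels = %d\n", prop.concurrentKernels);

    cudaStream_t stream[2];
    float *h_in[2], *d_in[2], *d_mid[2], *d_out[2];
    cudaArray_t arr[2];
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    for (int s = 0; s < 2; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaMallocHost(&h_in[s], bytes));  // pinned memory for async copies
        for (int i = 0; i < n; ++i) h_in[s][i] = 1.0f;
        CHECK(cudaMalloc(&d_in[s], bytes));
        CHECK(cudaMalloc(&d_mid[s], bytes));
        CHECK(cudaMalloc(&d_out[s], bytes));
        CHECK(cudaMallocArray(&arr[s], &desc, n, 0));
    }

    for (int it = 0; it < iterations; ++it) {
        const int s = it % 2;       // stream used by this iteration
        const int other = 1 - s;    // stream used by the previous iteration
        CHECK(cudaMemcpyAsync(d_in[s], h_in[s], bytes,
                              cudaMemcpyHostToDevice, stream[s]));
        kernelA<<<grid, block, 0, stream[s]>>>(d_in[s], d_mid[s], n);
        // Stream passed as the last argument (newer toolkits flag this call as deprecated).
        CHECK(cudaMemcpyToArrayAsync(arr[s], 0, 0, d_mid[s], bytes,
                                     cudaMemcpyDeviceToDevice, stream[s]));
        CHECK(cudaStreamSynchronize(stream[other]));
        // Texture binding would go here in the real program.
        kernelB<<<grid, block, 0, stream[s]>>>(d_mid[s], d_out[s], n);
        CHECK(cudaGetLastError());
    }
    CHECK(cudaDeviceSynchronize());

    for (int s = 0; s < 2; ++s) {
        CHECK(cudaFreeArray(arr[s]));
        CHECK(cudaFree(d_out[s]));
        CHECK(cudaFree(d_mid[s]));
        CHECK(cudaFree(d_in[s]));
        CHECK(cudaFreeHost(h_in[s]));
        CHECK(cudaStreamDestroy(stream[s]));
    }
    return 0;
}

int main() {
    return runPipeline(6);
}
```

The sketch only reproduces the order in which the work is issued; the timing of the placeholder kernels will of course differ from my real ones.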

The first cudaMemcpyAsync does execute concurrently with the previous execution of kernel B. Kernel A, however, waits until kernel B has finished, whereas I expected it to start right after the cudaMemcpyAsync and run alongside kernel B. I see this behavior in nvvp 5.0.0 with CUDA 5.0.

Also, as you can see, the cudaMemcpyToArrayAsync runs in the default stream, which is not what I want. I don't know why it keeps executing in the default stream even though I pass the stream as an argument.
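
For reference, this is the shape of the call I mean, as a minimal stand-alone sketch rather than my real code. The names copyToArrayInStream, d_src, arr and stream and the size are placeholders. The stream is the last parameter of cudaMemcpyToArrayAsync and defaults to 0 when it is omitted. The sketch copies into the array through a non-default stream and reads the data back to confirm the copy happened.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-check helper: returns 1 from the enclosing function on failure.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                         cudaGetErrorString(err_));                        \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Returns 0 if the data copied into the array through `stream` reads back intact.
int copyToArrayInStream() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_src(n), h_back(n, 0.0f);
    for (int i = 0; i < n; ++i) h_src[i] = static_cast<float>(i);

    float* d_src = NULL;
    cudaArray_t arr = NULL;
    cudaStream_t stream;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    CHECK(cudaStreamCreate(&stream));              // non-default stream
    CHECK(cudaMalloc(&d_src, bytes));
    CHECK(cudaMallocArray(&arr, &desc, n, 0));     // 1D array of n floats
    CHECK(cudaMemcpy(d_src, h_src.data(), bytes, cudaMemcpyHostToDevice));

    // Device-to-device copy into the array; the stream is the final argument.
    CHECK(cudaMemcpyToArrayAsync(arr, 0, 0, d_src, bytes,
                                 cudaMemcpyDeviceToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));

    // Read the array back to the host to verify its contents.
    CHECK(cudaMemcpyFromArray(h_back.data(), arr, 0, 0, bytes,
                              cudaMemcpyDeviceToHost));

    CHECK(cudaFreeArray(arr));
    CHECK(cudaFree(d_src));
    CHECK(cudaStreamDestroy(stream));

    for (int i = 0; i < n; ++i) {
        if (h_back[i] != h_src[i]) return 2;       // data mismatch
    }
    return 0;
}

int main() {
    return copyToArrayInStream();
}
```

Profiling a small case like this in nvvp should show which stream the copy is attributed to, independent of the rest of my program.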

Any help is appreciated. Thanks.