Concurrent Memory Copy and Kernel Execution

The SDK example “simpleStreams” demonstrates concurrent execution of kernels and memory copies. It achieves the overlap as follows:
enqueue 4 kernels in 4 different streams, then enqueue 4 asynchronous memory copies in the same 4 streams
→ command order: kernel1-kernel2-kernel3-kernel4-cpy1-cpy2-cpy3-cpy4 (the number refers to the stream)
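
For reference, here is a minimal sketch of that issue order. It is not the SDK sample itself: the kernel `scale`, the buffer names, and the sizes are placeholders I made up for illustration. It also uses `cudaDeviceSynchronize`, which newer toolkits provide; on the 2.3/3.0beta toolkits listed below the equivalent call is `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the SDK sample's workload.
__global__ void scale(int *data, int n, int factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int kStreams = 4;
    const int n = 1 << 20;                 // elements per stream (assumed size)
    const size_t bytes = n * sizeof(int);  // bytes per stream

    int *h_out = NULL, *d_buf = NULL;
    // Asynchronous copies need page-locked (pinned) host memory.
    cudaMallocHost((void **)&h_out, kStreams * bytes);
    cudaMalloc((void **)&d_buf, kStreams * bytes);
    cudaMemset(d_buf, 0, kStreams * bytes);

    cudaStream_t stream[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&stream[s]);

    // kernel1-kernel2-kernel3-kernel4: one kernel per stream.
    for (int s = 0; s < kStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf + s * n, n, 2);

    // cpy1-cpy2-cpy3-cpy4: one async copy per stream, issued after all kernels.
    for (int s = 0; s < kStreams; ++s)
        cudaMemcpyAsync(h_out + s * n, d_buf + s * n, bytes,
                        cudaMemcpyDeviceToHost, stream[s]);

    cudaDeviceSynchronize();  // wait for all streams to drain
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(h_out);
    cudaFree(d_buf);
    return 0;
}
```

With this order, the copy in stream 1 only depends on kernel 1, so it is free to overlap with the kernels still running in the other streams.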

A problem occurs when you surround the asynchronous tasks with events. In this case, the memory copies only start after all kernels have finished, although they should execute in parallel with the kernels.
→ command order: event1-kernel1-event1- … -event4-kernel4-event4-event1-cpy1-event1- … -event4-cpy4-event4
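
A minimal sketch of the event-wrapped order, to make the pattern concrete. The kernel `scale`, the event array names, and the sizes are placeholders, and using separate start/stop events for kernels and copies is my assumption about how the timing would be set up. `cudaDeviceSynchronize` is the name in newer toolkits; on 2.3/3.0beta the equivalent is `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real workload.
__global__ void scale(int *data, int n, int factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int kStreams = 4;
    const int n = 1 << 20;                 // elements per stream (assumed size)
    const size_t bytes = n * sizeof(int);

    int *h_out = NULL, *d_buf = NULL;
    cudaMallocHost((void **)&h_out, kStreams * bytes);  // pinned host memory
    cudaMalloc((void **)&d_buf, kStreams * bytes);
    cudaMemset(d_buf, 0, kStreams * bytes);

    cudaStream_t stream[kStreams];
    cudaEvent_t kStart[kStreams], kStop[kStreams];  // around each kernel
    cudaEvent_t cStart[kStreams], cStop[kStreams];  // around each copy
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaEventCreate(&kStart[s]);
        cudaEventCreate(&kStop[s]);
        cudaEventCreate(&cStart[s]);
        cudaEventCreate(&cStop[s]);
    }

    // event-kernel-event, once per stream.
    for (int s = 0; s < kStreams; ++s) {
        cudaEventRecord(kStart[s], stream[s]);
        scale<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf + s * n, n, 2);
        cudaEventRecord(kStop[s], stream[s]);
    }

    // event-cpy-event, once per stream, issued after all kernels.
    for (int s = 0; s < kStreams; ++s) {
        cudaEventRecord(cStart[s], stream[s]);
        cudaMemcpyAsync(h_out + s * n, d_buf + s * n, bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
        cudaEventRecord(cStop[s], stream[s]);
    }

    cudaDeviceSynchronize();  // all events have completed after this

    for (int s = 0; s < kStreams; ++s) {
        float kMs = 0.0f, cMs = 0.0f;
        cudaEventElapsedTime(&kMs, kStart[s], kStop[s]);
        cudaEventElapsedTime(&cMs, cStart[s], cStop[s]);
        printf("stream %d: kernel %.3f ms, copy %.3f ms\n", s + 1, kMs, cMs);
    }

    for (int s = 0; s < kStreams; ++s) {
        cudaEventDestroy(kStart[s]);
        cudaEventDestroy(kStop[s]);
        cudaEventDestroy(cStart[s]);
        cudaEventDestroy(cStop[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_out);
    cudaFree(d_buf);
    return 0;
}
```

The per-stream elapsed times alone do not show whether the copies overlapped with the kernels; comparing the time from `kStart[0]` to each `cStart[s]` (or looking at a profiler timeline) is one way to see when the copies actually began.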

Has anyone experienced similar problems and found a solution or workaround? I definitely need to surround the asynchronous calls with events to get the timing information.

Tested with CUDA environment:
NVIDIA Driver 190.53 and 195.17 for Linux (SLED) with CUDA Support
CUDA toolkit 2.3 and 3.0beta for Linux (SLED)
CUDA SDK 2.3 and 3.0beta code samples for Linux (SLED)
Compiler for CPU host code: gcc/g++ (SUSE Linux) 4.3.2
System: Intel® Xeon® CPU E5420@2.50GHz, 8GB RAM, GeForce GTX 260, Tesla T10 Processor