concurrent copy and execute - strange behavior

I’ve been playing around with getting concurrent copy and execute working under OpenCL and have been seeing some strange behavior. It doesn’t look like the expected behavior at all.

According to the Nsight activity analysis, and judging by the profiler timestamps as well, no matter how I allocate the memory or order the write, kernel, and read commands, I get concurrent copy and execute on reads but not on writes.

I create 4 command queues. I then tried two approaches:

  1. issue the 4 write commands, then the 4 kernels, then the 4 read commands
  2. issue 4 sets of write, kernel, read
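To make the two issue orders concrete, here is a minimal sketch that only models the order in which the commands are enqueued. It makes no OpenCL calls. The function names and the "W0"/"K0"/"R0" labels are made up for illustration, and it assumes one write, kernel, read set per command queue.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: each entry names a command and its queue,
// e.g. "W2" = write on queue 2, "K2" = kernel, "R2" = read.

// Approach 1: all writes first, then all kernels, then all reads.
std::vector<std::string> issueBreadthFirst(int numQueues) {
    assert(numQueues >= 0);
    const char stages[] = {'W', 'K', 'R'};
    std::vector<std::string> order;
    for (char stage : stages)
        for (int q = 0; q < numQueues; ++q)
            order.push_back(stage + std::to_string(q));
    return order;
}

// Approach 2: one complete write, kernel, read set per queue.
std::vector<std::string> issueDepthFirst(int numQueues) {
    assert(numQueues >= 0);
    const char stages[] = {'W', 'K', 'R'};
    std::vector<std::string> order;
    for (int q = 0; q < numQueues; ++q)
        for (char stage : stages)
            order.push_back(stage + std::to_string(q));
    return order;
}
```

With 4 queues, the first function yields W0 W1 W2 W3 K0 ... R3 and the second yields W0 K0 R0 W1 K1 R1 and so on.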

I also tried allocating the memory three ways: using enqueue map (pinned memory), C++ new, and cudaHostAlloc (which didn’t do what I was hoping).

In all cases, all the write commands happen first, before anything else. Then the kernels run one after the other, with each matching read happening concurrently with a kernel launch whenever it can. I couldn’t get the write commands to overlap with the kernels no matter what I did. On the other hand, the behavior I expected from CUDA, where there is no overlap with non-pinned memory, didn’t happen either.
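For anyone who wants to check the overlap from raw timestamps rather than from the Nsight timeline, here is a small sketch. It assumes the start and end times of each command have already been read, for example with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) on a queue created with CL_QUEUE_PROFILING_ENABLE. The Interval struct and overlapNs function are names I made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Start/end of one command in nanoseconds (hypothetical helper type).
struct Interval {
    std::uint64_t start;
    std::uint64_t end;
};

// Returns how many nanoseconds the two commands were running at the
// same time; 0 means they did not overlap at all.
std::uint64_t overlapNs(const Interval& a, const Interval& b) {
    assert(a.start <= a.end && b.start <= b.end);
    const std::uint64_t lo = std::max(a.start, b.start);
    const std::uint64_t hi = std::min(a.end, b.end);
    return hi > lo ? hi - lo : 0;
}
```

A nonzero result for a write and a kernel from different queues would indicate the overlap I was hoping for; in my runs only read/kernel pairs would show it.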

Any ideas? (How do I get full overlap?)

By the way 1: the mapped memory was marked as pinned copies; the other two (cudaHostAlloc and new) were marked as pageable memory.
By the way 2: this is with Windows 7 64-bit, NVIDIA driver 280.26, 32-bit application.

Thanks