why is cudaMemsetAsync(), cudaMemcpyAsync(), or even cudaEventRecord() killing parallel kernel exec

Hello, I’m stumped by a strange issue: even issuing a cudaEventRecord() to a stream prevents tasks from executing in parallel. Earlier I was surprised to find that cudaMemcpyAsync() and cudaMemsetAsync() also prevent everything issued after them from running before they complete, but at least that behavior is stated in the CUDA programming guide. It makes no sense to me why cudaEventRecord() does this too.

I’m using CUDA 5 with a GTX 560Ti.
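As an aside (my addition, not part of the original report), it’s worth ruling out the hardware first: whether a device can run kernels from different streams concurrently at all can be queried from its properties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    // concurrentKernels is 1 if the device can execute kernels from
    // different streams simultaneously; asyncEngineCount tells how many
    // copy engines can overlap transfers with kernel execution.
    printf("%s: concurrentKernels = %d, asyncEngineCount = %d\n",
           prop.name, prop.concurrentKernels, prop.asyncEngineCount);
    return 0;
}
```

A GTX 560 Ti (Fermi, compute capability 2.1) should report concurrentKernels = 1, so the capability itself isn’t the problem here.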

// Dummy kernel that just streams data through shared memory;
// the scratch writes race between threads, but the result is unused.
__global__ void Dummy(uint16_t *data, int n)
{
    __shared__ uint16_t scratch[512];
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        scratch[i % 512] = data[i];
}


void TestOverlap()
{
  uint16_t *data;
  const int N = 80000000;
  assert(cudaMalloc(&data, N * sizeof(uint16_t)) == cudaSuccess);

  cudaStream_t stream1, stream2;
  assert(cudaStreamCreate(&stream1) == 0);
  assert(cudaStreamCreate(&stream2) == 0);
  cudaEvent_t stream1Event;
  assert(cudaEventCreate(&stream1Event) == 0);
 
  for (int repeat = 0; repeat < 2; ++repeat)
  {
    // Kernel on stream1; the memsets on stream2 should overlap it.
    Dummy<<<9, 64, 0, stream1>>>(data, 100000);
    // Uncommenting this record kills the overlap:
    //assert(cudaEventRecord(stream1Event, stream1) == 0);
    for (int i = 0; i < 9; ++i)
      cudaMemsetAsync(&data[i * 1000000], i, 10000000, stream2);
  }
  assert(cudaDeviceSynchronize() == 0);  // cudaThreadSynchronize() is deprecated
  cudaDeviceReset();
}

Here’s the timeline from CUDA Visual Profiler:


Also, can someone explain why cudaMemsetAsync() or cudaMemcpyAsync() issued to a different stream should act as a synchronization point (preventing kernels issued after them from executing before they complete)? Conceptually I’m issuing it to another, parallel stream, so why synchronize?

Thank you for any help

I vaguely remember that one of the Nvidia employees explained the reasoning a while ago on the forums. The only thread I could quickly dig up, however, only contains a confirmation by Tim Murray that this is indeed the case.

OK, that gave me a clue. I tried changing the cudaEventCreate() call to cudaEventCreateWithFlags() with cudaEventDisableTiming, and the memsets now overlap!
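For anyone else hitting this, the fix is just swapping the event creation call in the repro above. A disable-timing event apparently avoids the extra bookkeeping that serializes the work:

```cuda
cudaEvent_t stream1Event;
// Before: a default (timing) event, which broke concurrency here
// assert(cudaEventCreate(&stream1Event) == 0);

// After: a non-timing event; with this the memsets overlap the kernel
assert(cudaEventCreateWithFlags(&stream1Event, cudaEventDisableTiming) == 0);
```

A timing event has to capture a GPU timestamp at its exact position in the work queue, which constrains how the driver may reorder or overlap surrounding work; cudaEventDisableTiming drops that requirement.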

I will try to see if the problem exists on Linux too. Apparently WDDM has limitations for concurrent kernel execution, as Greg describes here:
https://devtalk.nvidia.com/default/topic/538232/cuda-programming-and-performance/concurrent-kernels/

Basically, you can lose parallelism when the kernel launches get split across WDDM command buffers. I think that would also explain another concurrency problem I’m having (why I can’t overlap 18 kernels, but can overlap 2).
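One workaround that is sometimes suggested for WDDM batching (an assumption on my part, I haven’t verified it on this exact setup) is to nudge the driver into submitting its queued command buffer right after the launches you want batched together, e.g. with cudaStreamQuery():

```cuda
for (int repeat = 0; repeat < 2; ++repeat)
{
    Dummy<<<9, 64, 0, stream1>>>(data, 100000);
    // On WDDM, launches can sit in a software queue until the driver
    // decides to submit; querying the stream tends to force the pending
    // command buffer to be flushed to the GPU immediately.
    cudaStreamQuery(stream1);
    for (int i = 0; i < 9; ++i)
        cudaMemsetAsync(&data[i * 1000000], i, 10000000, stream2);
}
```

cudaStreamQuery() is cheap and non-blocking (it returns cudaSuccess or cudaErrorNotReady without waiting), so it is commonly used purely for this flush side effect on WDDM.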