why is cudaMemsetAsync(), cudaMemcpyAsync(), or even cudaEventRecord() killing parallel kernel exec

Hello, I’m stumped by a strange issue: even issuing a cudaEventRecord() to a stream prevents tasks from executing in parallel. Earlier I was surprised to find that cudaMemcpyAsync() and cudaMemsetAsync() also prevent everything issued after them from running before they complete, but at least that behavior is stated in the CUDA programming guide. It makes no sense to me why cudaEventRecord() does this too.

I’m using CUDA 5 with a GTX 560Ti.
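As an aside (my addition, not part of the original report), it’s worth ruling out the hardware first: whether a device can run kernels from different streams concurrently at all can be queried from its properties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    // concurrentKernels is 1 if the device can execute kernels from
    // different streams simultaneously; asyncEngineCount tells how many
    // copy engines can overlap transfers with kernel execution.
    printf("%s: concurrentKernels = %d, asyncEngineCount = %d\n",
           prop.name, prop.concurrentKernels, prop.asyncEngineCount);
    return 0;
}
```

A GTX 560 Ti (Fermi, compute capability 2.1) should report concurrentKernels = 1, so the capability itself isn’t the problem here.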

// Dummy kernel that just streams data through shared memory;
// the scratch writes race between threads, but the result is unused.
__global__ void Dummy(uint16_t *data, int n)
{
    __shared__ uint16_t scratch[512];
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        scratch[i % 512] = data[i];
}


void TestOverlap()
{
  uint16_t *data;
  const int N = 80000000;
  assert(cudaMalloc(&data, N * sizeof(uint16_t)) == cudaSuccess);

  cudaStream_t stream1, stream2;
  assert(cudaStreamCreate(&stream1) == 0);
  assert(cudaStreamCreate(&stream2) == 0);
  cudaEvent_t stream1Event;
  assert(cudaEventCreate(&stream1Event) == 0);
 
  for (int repeat = 0; repeat < 2; ++repeat)
  {
    // Kernel on stream1; the memsets on stream2 should overlap it.
    Dummy<<<9, 64, 0, stream1>>>(data, 100000);
    // Uncommenting this record kills the overlap:
    //assert(cudaEventRecord(stream1Event, stream1) == 0);
    for (int i = 0; i < 9; ++i)
      cudaMemsetAsync(&data[i * 1000000], i, 10000000, stream2);
  }
  assert(cudaDeviceSynchronize() == 0);  // cudaThreadSynchronize() is deprecated
  cudaDeviceReset();
}

Here’s the timeline from CUDA Visual Profiler:


Also, can someone explain why cudaMemsetAsync() or cudaMemcpyAsync() issued to a different stream should act as a synchronization point (preventing kernels issued after them from executing before they complete)? Conceptually I’m issuing it to another, parallel stream, so why synchronize?

Thank you for any help

I vaguely remember that one of the Nvidia employees explained the reasoning a while ago on the forums. The only thread I could quickly dig up, however, only contains a confirmation by Tim Murray that this is indeed the case.

OK, that gave me a clue. I tried changing the cudaEventCreate() call to cudaEventCreateWithFlags() with cudaEventDisableTiming, and the memsets now overlap!
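For anyone else hitting this, the fix is just swapping the event creation call in the repro above. A disable-timing event apparently avoids the extra bookkeeping that serializes the work:

```cuda
cudaEvent_t stream1Event;
// Before: a default (timing) event, which broke concurrency here
// assert(cudaEventCreate(&stream1Event) == 0);

// After: a non-timing event; with this the memsets overlap the kernel
assert(cudaEventCreateWithFlags(&stream1Event, cudaEventDisableTiming) == 0);
```

A timing event has to capture a GPU timestamp at its exact position in the work queue, which constrains how the driver may reorder or overlap surrounding work; cudaEventDisableTiming drops that requirement.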

I will try to see if the problem exists on Linux too. Apparently WDDM has limitations for concurrent kernel execution, as Greg describes here:
https://devtalk.nvidia.com/default/topic/538232/cuda-programming-and-performance/concurrent-kernels/

Basically, you can lose parallelism when the kernel launches get split across WDDM command buffers. I think that would also explain another concurrency problem I’m having (why I can’t overlap 18 kernels, but can overlap 2).
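One workaround that is sometimes suggested for WDDM batching (an assumption on my part, I haven’t verified it on this exact setup) is to nudge the driver into submitting its queued command buffer right after the launches you want batched together, e.g. with cudaStreamQuery():

```cuda
for (int repeat = 0; repeat < 2; ++repeat)
{
    Dummy<<<9, 64, 0, stream1>>>(data, 100000);
    // On WDDM, launches can sit in a software queue until the driver
    // decides to submit; querying the stream tends to force the pending
    // command buffer to be flushed to the GPU immediately.
    cudaStreamQuery(stream1);
    for (int i = 0; i < 9; ++i)
        cudaMemsetAsync(&data[i * 1000000], i, 10000000, stream2);
}
```

cudaStreamQuery() is cheap and non-blocking (it returns cudaSuccess or cudaErrorNotReady without waiting), so it is commonly used purely for this flush side effect on WDDM.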