Why do cudaMemset() and cudaMemcpy() with cudaMemcpyDeviceToDevice implicitly synchronize all streams in the application, if they are internally implemented as “ordinary” kernels?
I need to use memsets/memcopies in a multi-stream application without synchronization. Do I really need to write my own dedicated kernels to replace all memsets/memcopies, or is there a way to disable the synchronization?
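For reference, the kind of dedicated replacement kernel I mean would be something like this grid-stride sketch (`my_memset` is a name I made up for illustration, not an existing API):

```cuda
// Illustrative grid-stride replacement for cudaMemset. Launched into a
// user-created stream, it obeys normal stream semantics like any other kernel.
__global__ void my_memset(unsigned char *dst, unsigned char val, size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = val;
}

// usage: my_memset<<<160, 256, 0, myStream>>>(d_buf, 0, nbytes);
```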
TBH I’m sort of confused by your question. Perhaps I shouldn’t have answered this way. In your question you specifically call out cudaMemset:
and also something called “memsetasync”:
Have you tried using cudaMemsetAsync()? It allows you to specify a stream parameter and should obey stream semantics. Of course, as you point out, if it is implemented as a GPU kernel, it may not be able to run at the same time as another kernel that “fully” occupies the GPU. It’s difficult to tell which case(s) you are asking about.
We can see from the profiler trace that the kernel launched into stream s1 almost fully overlaps with the kernel launched into stream s2. The cudaMemsetAsync() operation, launched into stream s1 after the kernel, does not impede the progress of the kernel launched into stream s2. (The kernels have an overall duration of approximately 500 milliseconds each, but the difference in their start times is only 30 microseconds: they are basically fully overlapped/concurrent.) The start time of the cudaMemsetAsync() operation is given approximately by the start time of the s1 kernel plus its duration, which is exactly what we would expect from stream semantics.
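A minimal reproduction along these lines (this is a sketch, not the exact test code; the spin duration is approximate and GPU-clock-dependent) would be:

```cuda
#include <cuda_runtime.h>

// Illustrative long-running kernel: busy-waits for a given number of
// GPU clock cycles (~500 ms at typical clocks for the value below).
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    unsigned char *d_buf;
    size_t n = 1 << 20;
    cudaMalloc(&d_buf, n);

    spin<<<1, 1, 0, s1>>>(500000000LL);   // kernel in s1
    spin<<<1, 1, 0, s2>>>(500000000LL);   // kernel in s2, overlaps with s1
    cudaMemsetAsync(d_buf, 0, n, s1);     // waits only for the s1 kernel

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```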
Perhaps you are barking up the wrong tree. Perhaps your kernels don’t run “in parallel” because they have some resource usage which prevents it. (That’s just a guess; there’s not enough info to make a determination, so of course my guess may be wrong.)
I have an additional question regarding our test example.
We have an application with multiple threads and use the `--default-stream per-thread` compile option (but similar behavior is observed if custom non-blocking streams are used).
We use cudaMemcpyAsync with pinned host memory in all examples.
Although the number of threads is larger, the maximal number of concurrent streams is limited to 10.
Each CPU thread (stream) repeatedly runs the following workflow:
cudaMemcpyAsync (HtoD, 1MB) → kernel “convert” → Kernel1 → Kernel2
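In code, one CPU thread’s loop looks roughly like this (kernel names are from the list above; buffer names and launch configuration are placeholders):

```cuda
// Per-thread loop. With --default-stream per-thread, each CPU thread's
// default-stream work is issued into its own independent stream, so no
// explicit stream argument is needed.
while (processing) {
    cudaMemcpyAsync(d_in, h_pinned, 1 << 20,            // HtoD, 1 MB
                    cudaMemcpyHostToDevice);
    convert<<<grid, block>>>(d_in, d_work);
    Kernel1<<<grid, block>>>(d_work);
    Kernel2<<<grid, block>>>(d_work);
}
```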
In this case the Nsight Systems report looks very nice; we have 10 parallel streams throughout the whole processing (report below): report_A0.nsys-rep (376.9 KB)
One cudaMemcpyAsync (DtoH, 4 B) + cudaStreamSynchronize(cudaStreamPerThread) is added between Kernel1 and Kernel2.
Overall the report still looks nice, but closer inspection reveals some empty spaces between consecutive kernels (report below).
Two additional cudaMemcpyAsync (DtoH, 4 B) + cudaStreamSynchronize(cudaStreamPerThread) pairs are added after Kernel2.
In this case the timeline is not densely filled, and it appears to be far from full concurrency (report below).
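For clarity, the per-iteration sequence in this last case is as follows (the `h_flag*`/`d_flag*` names are placeholders for the 4-byte result buffers, with the host side assumed pinned). Note that each cudaStreamSynchronize() blocks the CPU thread until its stream drains, so the next launch cannot even be issued until then; that repeated launch latency is a plausible contributor to the gaps:

```cuda
// Case C per-iteration sequence (one CPU thread, per-thread default stream).
cudaMemcpyAsync(d_in, h_pinned, 1 << 20, cudaMemcpyHostToDevice);
convert<<<grid, block>>>(d_in, d_work);
Kernel1<<<grid, block>>>(d_work);

cudaMemcpyAsync(h_flag1, d_flag1, 4, cudaMemcpyDeviceToHost);
cudaStreamSynchronize(cudaStreamPerThread);   // CPU thread stalls here

Kernel2<<<grid, block>>>(d_work);

cudaMemcpyAsync(h_flag2, d_flag2, 4, cudaMemcpyDeviceToHost);
cudaStreamSynchronize(cudaStreamPerThread);   // and here
cudaMemcpyAsync(h_flag3, d_flag3, 4, cudaMemcpyDeviceToHost);
cudaStreamSynchronize(cudaStreamPerThread);   // and here
```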