Memset/memcpyDtoD implicitly synchronizes all streams -- a way to disable it?

Hi I have 2 simple questions:

  1. Why do cudaMemset and cudaMemcpy with cudaMemcpyDeviceToDevice implicitly synchronize all streams in the application if they are internally implemented as “ordinary” kernels?

  2. I need to use memsets/memcopies in a multi-stream application without synchronization. Do I really have to write my own dedicated kernels and replace all memsets/memcopies, or is there a way to disable the synchronization?


Stream #1 (CPU thread #1): kernelA (very long kernel) → memsetasync
Stream #2 (CPU thread #2): kernelB

Calls are made in this order chronologically: kernelA → memsetasync → kernelB

However, now kernelA and kernelB can’t run in parallel since memset synchronizes the streams.
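The pattern above can be sketched roughly like this (kernelA/kernelB are hypothetical placeholders; the point is that a plain cudaMemset has no stream parameter, so it lands in the legacy default stream and serializes against both created streams):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(int *p) { /* long-running work */ }
__global__ void kernelB(int *p) { /* independent work */ }

int main() {
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  int *d;
  cudaMalloc(&d, 1 << 20);

  kernelA<<<1, 1, 0, s1>>>(d);  // stream #1
  cudaMemset(d, 0, 1 << 20);    // no stream parameter: issued to the
                                // legacy default stream, which waits for
                                // s1 and blocks s2
  kernelB<<<1, 1, 0, s2>>>(d);  // stream #2: cannot start until the
                                // default-stream memset completes
  cudaDeviceSynchronize();
}
```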

Because there is no way to specify a non-default stream for them to run in, they run in the default stream, and the default stream has the behavior you describe.

I haven’t tried it myself, but it might be interesting to try overriding the default stream behavior.
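For reference, there are two documented ways to get per-thread default-stream semantics (this is a sketch of the general mechanism, not something tested in this thread): compile with nvcc --default-stream per-thread, or define CUDA_API_PER_THREAD_DEFAULT_STREAM before any CUDA header is included:

```cuda
// Option 1 (compile flag):  nvcc --default-stream per-thread app.cu
// Option 2 (source macro), must precede any CUDA include:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

int main() {
  int *d;
  cudaMalloc(&d, 4096);
  // With per-thread semantics this memset targets cudaStreamPerThread
  // rather than the legacy default stream, so it should no longer
  // implicitly synchronize with other (non-blocking) streams.
  cudaMemset(d, 0, 4096);
  cudaDeviceSynchronize();
  cudaFree(d);
}
```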

TBH I’m sort of confused by your question. Perhaps I shouldn’t have answered this way. In your question you specifically call out cudaMemset:

and also something called “memsetasync”:

Have you tried using cudaMemsetAsync()? It allows you to specify a stream parameter and should obey stream semantics. Of course, as you point out, if it is implemented as a GPU kernel, it may not be able to run at the same time as another kernel that “fully” occupies the GPU. It’s difficult to tell which case(s) you are asking about.

Although I haven’t used multiple threads, I don’t seem to have any difficulty achieving this:

Here is my test case:

# cat
const size_t delay = 1000000000ULL;  // ~1 s worth of clock64() ticks

__global__ void k(){
  size_t start = clock64();
  while (clock64() < start+delay) {};  // spin to keep the stream busy
}

int main(){

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  int *d;
  const size_t s = 32*1048576*2;
  cudaMalloc(&d, s);
  k<<<1,1,0,s1>>>();             // kernel in stream s1
  k<<<1,1,0,s2>>>();             // kernel in stream s2, concurrent with s1
  cudaMemsetAsync(d, 0, s, s1);  // queued behind the s1 kernel only
  cudaDeviceSynchronize();
}

# nvcc -o t17
# nsys nvprof --print-gpu-trace ./t17
WARNING: t17 and any of its children processes will be profiled.

Generating '/tmp/nsys-report-4769.qdstrm'
[1/3] [========================100%] report3.nsys-rep
[2/3] [========================100%] report3.sqlite
[3/3] Executing 'cuda_gpu_trace' stats report

  Start (ns)    Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd     Device      Ctx  Strm      Name
 -------------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  -------------  ---  ----  -------------
   685,765,672    490,196,340     121     1     1     1     1     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    13  k()
   685,795,944    490,199,092     123     1     1     1     1     1     1       16         0.000         0.000                                                     NVIDIA L4 (0)    1    14  k()
 1,175,962,620        157,248     122                                                                               67.109        426,745.266  Device              NVIDIA L4 (0)    1    13  [CUDA memset]


CUDA 12.2.1, L4

We can see from the profiler trace that the kernel launched into the stream s1 almost fully overlaps with the kernel launched into stream s2. The cudaMemsetAsync() operation, launched into stream s1 after the kernel, does not impede the progress of the kernel launched into stream s2. (The kernels have an overall duration of approximately 500 milliseconds each, but the difference in start time is only 30 microseconds: they are basically fully overlapped/concurrent). The start time of the cudaMemsetAsync() operation is given approximately by the start time of the s1 kernel plus its duration; exactly what we would expect from stream semantics.

Perhaps you are barking up the wrong tree. Perhaps your kernels don’t run “in parallel” because they have some resource usage which prevents it. (That’s just a guess; there’s not enough info to make a determination, and of course my guess may be wrong.)

Thank you, it seems that cudaMemsetAsync works perfectly when you pass it the stream parameter. We have also tried building with “--default-stream per-thread” and it works as expected.

It seems something else is causing the kernels to block.

I have an additional question regarding our test example.
We have an application with multiple threads and use the “--default-stream per-thread” option (but similar behavior is observed if custom non-blocking streams are used).
We use cudaMemcpyAsync with pinned host memory in all examples.
Although the number of threads is larger, the maximal number of parallel streams is limited to 10.

Example 1:
Each CPU thread (stream) repeatedly runs the following workflow:
cudaMemcpyAsync (HtoD, 1MB) → kernel “convert” → Kernel1 → Kernel2
In this case the Nsight Systems report looks very nice; we have 10 parallel streams through the whole processing (report below)
report_A0.nsys-rep (376.9 KB)
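The per-thread loop in Example 1 might look roughly like this (convert/Kernel1/Kernel2, the buffers, and the launch configurations are placeholders reconstructed from the description above, not the actual application code):

```cuda
#include <cuda_runtime.h>

__global__ void convert(const char *in, float *out) { /* ... */ }
__global__ void Kernel1(float *buf) { /* ... */ }
__global__ void Kernel2(float *buf) { /* ... */ }

// Body executed repeatedly by each CPU thread; compiled with
// --default-stream per-thread, so every launch below goes to that
// thread's own cudaStreamPerThread.
void worker(char *h_pinned, char *d_in, float *d_buf, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    cudaMemcpyAsync(d_in, h_pinned, 1 << 20,   // HtoD, 1 MB, pinned host
                    cudaMemcpyHostToDevice);   // memory: fully async
    convert<<<256, 256>>>(d_in, d_buf);
    Kernel1<<<256, 256>>>(d_buf);
    Kernel2<<<256, 256>>>(d_buf);
  }
}
```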

Example 2:
One cudaMemcpyAsync (DtoH, 4B) + cudaStreamSynchronize(cudaStreamPerThread) is added between Kernel1 and Kernel2.
Overall the report still looks nice, but closer inspection reveals some empty spaces between consecutive kernels (report below)

report_A1.nsys-rep (377.4 KB)

Example 3:
Two additional cudaMemcpyAsync (DtoH, 4B) + cudaStreamSynchronize(cudaStreamPerThread) pairs are added after Kernel2.
In this case the timeline is not densely filled and appears to be far from full concurrency (report below)

report_A3.nsys-rep (411.5 KB)

Can you maybe explain this behavior? Is it expected?
According to the CUDA Runtime API documentation, we would expect cudaMemcpyAsync with pinned memory to be fully asynchronous and not cause such behavior:
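One general point about the API worth noting here (a sketch of the mechanism, not a diagnosis of these particular reports): the cudaMemcpyAsync itself can indeed be fully asynchronous with pinned memory, but the cudaStreamSynchronize that follows it blocks the calling CPU thread until the stream drains, so the next kernel cannot even be launched until then:

```cuda
#include <cuda_runtime.h>

__global__ void Kernel1(int *d) { /* ... */ }
__global__ void Kernel2(int *d) { /* ... */ }

// h_pinned is assumed to be allocated with cudaHostAlloc.
void step(int *d, int *h_pinned) {
  Kernel1<<<256, 256>>>(d);
  // Asynchronous with respect to the host (pinned destination) ...
  cudaMemcpyAsync(h_pinned, d, 4, cudaMemcpyDeviceToHost,
                  cudaStreamPerThread);
  // ... but this call blocks the CPU thread until Kernel1 and the copy
  // have finished. Kernel2 is launched only afterwards, and that launch
  // delay can appear as a gap between kernels on the timeline.
  cudaStreamSynchronize(cudaStreamPerThread);
  Kernel2<<<256, 256>>>(d);
}
```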


Sorry, this is a lot of data to sort out, and a lot of work. I normally wouldn’t embark on such an exercise without access to the code to study. Good luck!