Why do cudaMemset() and cudaMemcpy() with cudaMemcpyDeviceToDevice implicitly synchronize all streams in the application, if they are internally implemented as “ordinary” kernels?
I need to use memsets/memcopies in a multi-stream application without synchronization. Do I really need to write my own dedicated kernels to replace all memsets/memcopies, or is there a way to disable the synchronization?
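For reference, the kind of dedicated replacement kernel I mean would be something like this grid-stride sketch (`my_memset` is a name I made up for illustration, not an existing API):

```cuda
// Illustrative grid-stride replacement for cudaMemset. Launched into a
// user-created stream, it obeys normal stream semantics like any other kernel.
__global__ void my_memset(unsigned char *dst, unsigned char val, size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = val;
}

// usage: my_memset<<<160, 256, 0, myStream>>>(d_buf, 0, nbytes);
```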
TBH I’m sort of confused by your question. Perhaps I shouldn’t have answered this way. In your question you specifically call out cudaMemset:
and also something called “memsetasync”:
Have you tried using cudaMemsetAsync()? It allows you to specify a stream parameter and should obey stream semantics. Of course, as you point out, if it is implemented as a GPU kernel, it may not be able to run at the same time as another kernel that “fully” occupies the GPU. It’s difficult to tell which case(s) you are asking about.
We can see from the profiler trace that the kernel launched into stream s1 almost fully overlaps with the kernel launched into stream s2. The cudaMemsetAsync() operation, launched into stream s1 after the kernel, does not impede the progress of the kernel launched into stream s2. (The kernels have an overall duration of approximately 500 milliseconds each, but the difference in their start times is only 30 microseconds: they are basically fully overlapped/concurrent.) The start time of the cudaMemsetAsync() operation is given approximately by the start time of the s1 kernel plus its duration, which is exactly what we would expect from stream semantics.
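A minimal reproduction along these lines (this is a sketch, not the exact test code; the spin duration is approximate and GPU-clock-dependent) would be:

```cuda
#include <cuda_runtime.h>

// Illustrative long-running kernel: busy-waits for a given number of
// GPU clock cycles (~500 ms at typical clocks for the value below).
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    unsigned char *d_buf;
    size_t n = 1 << 20;
    cudaMalloc(&d_buf, n);

    spin<<<1, 1, 0, s1>>>(500000000LL);   // kernel in s1
    spin<<<1, 1, 0, s2>>>(500000000LL);   // kernel in s2, overlaps with s1
    cudaMemsetAsync(d_buf, 0, n, s1);     // waits only for the s1 kernel

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```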
Perhaps you are barking up the wrong tree. Perhaps your kernels don’t run “in parallel” because they have some resource usage which prevents it. (That’s just a guess; there’s not enough info to make a determination, so of course my guess may be wrong.)
I have an additional question regarding our test example.
We have an application with multiple threads and use the `--default-stream per-thread` compile option (but similar behavior is observed if custom non-blocking streams are used).
We use cudaMemcpyAsync with pinned host memory in all examples.
Although the number of threads is larger, the maximal number of concurrent streams is limited to 10.
Each CPU thread (stream) repeatedly runs the following workflow:
cudaMemcpyAsync (HtoD, 1MB) → kernel “convert” → Kernel1 → Kernel2
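In code, one CPU thread’s loop looks roughly like this (kernel names are from the list above; buffer names and launch configuration are placeholders):

```cuda
// Per-thread loop. With --default-stream per-thread, each CPU thread's
// default-stream work is issued into its own independent stream, so no
// explicit stream argument is needed.
while (processing) {
    cudaMemcpyAsync(d_in, h_pinned, 1 << 20,            // HtoD, 1 MB
                    cudaMemcpyHostToDevice);
    convert<<<grid, block>>>(d_in, d_work);
    Kernel1<<<grid, block>>>(d_work);
    Kernel2<<<grid, block>>>(d_work);
}
```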
In this case the Nsight Systems report looks very nice; we have 10 parallel streams throughout the whole processing (report below): report_A0.nsys-rep (376.9 KB)
One cudaMemcpyAsync (DtoH, 4 B) + cudaStreamSynchronize(cudaStreamPerThread) is added between Kernel1 and Kernel2.
Overall the report still looks nice, but closer inspection reveals some empty spaces between consecutive kernels (report below).
Two additional cudaMemcpyAsync (DtoH, 4 B) + cudaStreamSynchronize(cudaStreamPerThread) pairs are added after Kernel2.
In this case the timeline is not densely filled, and it appears to be far from full concurrency (report below).
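For clarity, the per-iteration sequence in this last case is as follows (the `h_flag*`/`d_flag*` names are placeholders for the 4-byte result buffers, with the host side assumed pinned). Note that each cudaStreamSynchronize() blocks the CPU thread until its stream drains, so the next launch cannot even be issued until then; that repeated launch latency is a plausible contributor to the gaps:

```cuda
// Case C per-iteration sequence (one CPU thread, per-thread default stream).
cudaMemcpyAsync(d_in, h_pinned, 1 << 20, cudaMemcpyHostToDevice);
convert<<<grid, block>>>(d_in, d_work);
Kernel1<<<grid, block>>>(d_work);

cudaMemcpyAsync(h_flag1, d_flag1, 4, cudaMemcpyDeviceToHost);
cudaStreamSynchronize(cudaStreamPerThread);   // CPU thread stalls here

Kernel2<<<grid, block>>>(d_work);

cudaMemcpyAsync(h_flag2, d_flag2, 4, cudaMemcpyDeviceToHost);
cudaStreamSynchronize(cudaStreamPerThread);   // and here
cudaMemcpyAsync(h_flag3, d_flag3, 4, cudaMemcpyDeviceToHost);
cudaStreamSynchronize(cudaStreamPerThread);   // and here
```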