Why do cudaMemset and cudaMemcpyDeviceToDevice implicitly synchronize all streams in the application if they are internally implemented as “ordinary” kernels?
I need to use memsets/memcopies in a multi-stream application without synchronization. Do I really need to write my own dedicated kernels to replace all memsets/memcopies, or is there a way to disable the synchronization?
Because there is no way to specify a non-default stream for them to run in, they run in the default stream, and the default stream has the synchronizing behavior you describe.
TBH, I’m somewhat confused by your question, and perhaps I shouldn’t have answered this way. In your question you specifically call out cudaMemset, and also something called “memsetasync”.
Have you tried using cudaMemsetAsync()? It takes a stream parameter and should obey stream semantics. Of course, as you point out, if it is implemented as a GPU kernel, it may not be able to run at the same time as another kernel that “fully” occupies the GPU. It’s difficult to tell which case(s) you are asking about.
We can see from the profiler trace that the kernel launched into stream s1 almost fully overlaps with the kernel launched into stream s2. The cudaMemsetAsync() operation, launched into stream s1 after the kernel, does not impede the progress of the kernel launched into stream s2. (The kernels have an overall duration of approximately 500 milliseconds each, but the difference in start time is only 30 microseconds: they are basically fully overlapped/concurrent.) The start time of the cudaMemsetAsync() operation is given approximately by the start time of the s1 kernel plus its duration, which is exactly what we would expect from stream semantics.
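For reference, a minimal sketch of the kind of test described above might look like this. The kernel name busy and the iteration count are illustrative, just something long-running enough that overlap is visible in the profiler:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Long-running dummy kernel so overlap is visible in the profiler timeline.
__global__ void busy(float *d, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int k = 0; k < iters; ++k)
            v = v * 0.999f + 1.0f;
        d[i] = v;
    }
}

int main()
{
    const int N = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    busy<<<(N + 255) / 256, 256, 0, s1>>>(a, N, 100000);
    busy<<<(N + 255) / 256, 256, 0, s2>>>(b, N, 100000);

    // Queued in s1: it starts only after the s1 kernel finishes,
    // but it does not block the kernel still running in s2.
    cudaMemsetAsync(a, 0, N * sizeof(float), s1);

    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}
```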
Perhaps you are barking up the wrong tree. Perhaps your kernels don’t run “in parallel” because they have some resource usage which prevents it. (That’s just a guess; there’s not enough info to make a determination, and of course my guess may be wrong.)
Thank you, it seems that cudaMemsetAsync works perfectly when you pass it the stream parameter. We have also tried building with --default-stream per-thread, and it works as expected.
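For reference, the relevant nvcc invocation looks like this (app.cu is a placeholder file name):

```
nvcc --default-stream per-thread -o app app.cu
```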
It seems something else is causing the kernels to block.
I have an additional question regarding our test example.
We have an application with multiple threads and use the --default-stream per-thread option (similar behavior is observed if custom non-blocking streams are used).
We use cudaMemcpyAsync with pinned host memory in all examples.
Although the number of threads is larger, the maximum number of concurrent streams is limited to 10.
Example 1:
Each CPU thread (stream) has the following workflow (repeatedly):
cudaMemcpyAsync (HtoD, 1MB) → kernel “convert” → Kernel1 → Kernel2
In this case the Nsight Systems report looks very nice; we have 10 parallel streams through the whole processing run (report below, sketch of the loop below that): report_A0.nsys-rep (376.9 KB)
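A minimal sketch of one iteration of the per-thread pipeline, assuming --default-stream per-thread so the unqualified default stream is private to the calling host thread. The kernel names are from the description above, but their signatures and launch configurations are assumed:

```cpp
#include <cuda_runtime.h>

// Placeholder kernels standing in for the ones named in the post.
__global__ void convert(const float *in, float *out, int n) { /* ... */ }
__global__ void Kernel1(float *buf, int n) { /* ... */ }
__global__ void Kernel2(float *buf, int n) { /* ... */ }

// One iteration of the per-thread workflow; each host thread runs this
// repeatedly on its own buffers. h_pinned must be pinned host memory.
void pipeline_step(const float *h_pinned, float *d_in, float *d_buf, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    cudaMemcpyAsync(d_in, h_pinned, 1 << 20, cudaMemcpyHostToDevice); // 1 MB HtoD
    convert<<<grid, block>>>(d_in, d_buf, n);
    Kernel1<<<grid, block>>>(d_buf, n);
    Kernel2<<<grid, block>>>(d_buf, n);
}
```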
Example 2:
One cudaMemcpyAsync (DtoH, 4 B) + cudaStreamSynchronize(cudaStreamPerThread) is added between Kernel1 and Kernel2.
Overall the report still looks nice, but closer inspection reveals some empty gaps between consecutive kernels (report below)
Example 3:
Two additional cudaMemcpyAsync (DtoH, 4 B) + cudaStreamSynchronize(cudaStreamPerThread) pairs are added after Kernel2.
In this case the timeline is not densely filled and it looks to be far from full concurrency (report below, sketch of the modified loop below that)
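A sketch of the Example 2/3 modification, reusing the placeholder kernels from the Example 1 sketch above. Note that cudaStreamSynchronize blocks the calling host thread, so the next kernel cannot even be enqueued until the 4-byte copy completes, which may be related to the gaps:

```cpp
// Example 2: a tiny DtoH readback plus a blocking sync between kernels.
// h_flag must be pinned host memory for the copy to be truly asynchronous.
void pipeline_step_with_readback(const float *h_pinned, float *h_flag,
                                 float *d_in, float *d_buf, int n)
{
    dim3 block(256), grid((n + 255) / 256);
    cudaMemcpyAsync(d_in, h_pinned, 1 << 20, cudaMemcpyHostToDevice);
    convert<<<grid, block>>>(d_in, d_buf, n);
    Kernel1<<<grid, block>>>(d_buf, n);

    // 4-byte result copied back and waited on (Example 2).
    cudaMemcpyAsync(h_flag, d_buf, 4, cudaMemcpyDeviceToHost);
    cudaStreamSynchronize(cudaStreamPerThread); // blocks this host thread

    Kernel2<<<grid, block>>>(d_buf, n);
    // Example 3 adds two more of the same copy+sync pairs after Kernel2.
}
```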
Sorry, this is a lot of data to sort out, and a lot of work. I normally wouldn’t embark on such an exercise without access to the code to study. Good luck!