The problem I’m seeing is that a DtoH cudaMemcpyAsync is blocking a cudaMemsetAsync from completing in a timely manner. That is, I issue a cudaMemcpyAsync in one stream, then issue a cudaMemsetAsync in a different stream; the cudaMemcpyAsync takes a long time to complete, and only after it finishes does the cudaMemsetAsync run and complete. This happens despite the two calls being issued in different streams (and using pinned host memory). Is that expected behavior? Normally I would suspect a CPU lock or some other higher-level error, but I’ve spent a lot of time looking at the CPU code and I’m currently out of ideas… which usually means one of my assumptions is wrong.
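In case it helps, here’s a rough sketch of the call pattern I’m describing. This is not the real code — all names, sizes, and the surrounding setup are placeholders I made up for illustration:

```cuda
// Hypothetical sketch of the pattern described above (error checking omitted).
const size_t N = 1 << 20;  // placeholder size

cudaStream_t copyStream, memsetStream;
cudaStreamCreate(&copyStream);
cudaStreamCreate(&memsetStream);

float *h_dst;          // pinned host buffer
float *d_src, *d_buf;  // device buffers
cudaMallocHost(&h_dst, N * sizeof(float));  // pinned allocation
cudaMalloc(&d_src, N * sizeof(float));
cudaMalloc(&d_buf, N * sizeof(float));

// Stream A (issued from one thread): DtoH copy, then sync on that stream.
cudaMemcpyAsync(h_dst, d_src, N * sizeof(float),
                cudaMemcpyDeviceToHost, copyStream);

// Stream B (issued from another thread): memset in a different stream.
// Expectation: this can overlap with the DtoH copy. Observation in the
// profiler: it only becomes eligible to run after the copy finishes.
cudaMemsetAsync(d_buf, 0, N * sizeof(float), memsetStream);

cudaStreamSynchronize(copyStream);
cudaStreamSynchronize(memsetStream);
```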
This is on a V100 (6 copy engines) with CUDA 10.1.
I was going to attach two screenshots of the nvvp timeline, but there’s not an easy way to attach them to this thread.
In the first screenshot, you would see that the DtoH copy is initiated on the second thread, in the small sliver I’ve highlighted with a red circle. There’s also a cudaStreamSynchronize call that syncs on the stream after the cudaMemcpyAsync calls are done.
In the second screenshot, you would see that the cudaMemsetAsync call takes a long time to become eligible to run, but once it does run, it completes very quickly.
My apologies that this is a poor question with respect to the amount of concrete information (i.e., code) I can divulge — I can’t copy-paste the code, as it’s proprietary and not mine, and I’m struggling at the moment to build a minimal example that reproduces the issue. That said, there’s a good chance one of my assumptions is wrong, and maybe someone here can help me figure out which one. Thanks for your time.