I’ve observed that cudaMemcpyAsync, as well as many other theoretically non-blocking CUDA APIs, actually pauses the calling thread for a few microseconds (~2 us). Is it possible to get a general idea of what happens at the CUDA driver level? Does the thread remain blocked, waiting for the command to be accepted by the GPU?
If multiple threads within the same CUDA context are launching commands in succession, does the CUDA driver accept them in parallel, or is there a mechanism that serializes them (at the CPU thread level, I mean)?
In our application all these pauses affect the system more than the data transfers and the actual GPU work, which is usually extremely fast (congratulations!).
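For reference, a minimal sketch of how the host-side cost of the call itself can be measured (assumptions: device 0, pinned host memory, a small transfer, and a dedicated stream; exact numbers will vary by system):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 4096;
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);   // pinned memory, so the copy is truly async
    cudaMalloc(&d, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up so one-time driver initialization does not skew the numbers.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Time only the host-side duration of the API calls, not the copies.
    const int iters = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("avg host-side cost of cudaMemcpyAsync: %.2f us\n", us / iters);

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```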
Thanks
Any CUDA API call may block or synchronize for various reasons, such as contention for or unavailability of internal resources. Such behavior is subject to change, and undocumented behavior should not be relied upon.
There are many questions like this on these forums; here is one recent one. Some of the questions you can find show attempts to use debuggers to figure out what is going on at a lower level, and some of them show indications of lock negotiation, consistent with the statement in the documentation.
No, I don’t have amazing suggestions for how to make this go away. Issuing all work from a single thread should certainly mitigate the variation in latency. Yes, I understand that is not a very palatable suggestion. I have no others.
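Purely as an illustration of that suggestion, here is a minimal sketch of funneling all CUDA work through one dedicated submission thread (the SubmitThread class is hypothetical scaffolding, not a CUDA API; the CUDA calls would go inside the closures the worker executes):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// All CUDA runtime calls are made by a single worker thread; other threads
// only enqueue closures, so they never contend for driver-internal locks.
class SubmitThread {
public:
    SubmitThread() : worker_([this] { run(); }) {}
    ~SubmitThread() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Callable from any thread; the closure should contain the CUDA calls,
    // e.g. a cudaMemcpyAsync or a kernel launch on some stream.
    void enqueue(std::function<void()> work) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(work)); }
        cv_.notify_one();
    }
private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (done_ && q_.empty()) return;
            auto work = std::move(q_.front());
            q_.pop();
            lk.unlock();
            work();              // the only place CUDA APIs are invoked
            lk.lock();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;
};
```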
Thank you very much for your fast reply.
Do you think using a combined CPU+GPU (Grace) would reduce those latencies?
I see that NVIDIA Aerial (which is a low-latency application with many very small and fast kernels) is trying to reduce CPU/GPU interactions to zero, so I think this is still the bottleneck of the system.
You may want to try a scaling experiment using host CPUs of different single-thread performance. To first order, single-thread performance will be dominated by CPU clock frequency.
The 2 microsecond delay mentioned in the original post seems indicative of basic host / device communication overhead. Given that, minimizing CPU-GPU interactions of this sort seems like a sensible optimization strategy.
My (limited) understanding is that this overhead is largely attributable to PCIe hardware mechanisms, and only the smaller portion of it to software overhead on the host. The scaling experiment I suggested would explore the impact of the software component.
My usual recommendation is to use CPUs running at > 3.5 GHz to minimize impacts from driver overhead. SPEC CPU2017 Int Speed results suggest that, among CPUs for which data was submitted, the highest single-thread performance is offered by the AMD EPYC 4564P and the AMD Ryzen 9 7950X. According to their specifications, these processors have a nominal clock of 4.5 GHz and a maximum boost clock of 5.7 GHz.
I do not know whether newer PCIe standards offer lower latencies; improvements in PCIe are typically focused on increasing bandwidth. As far as I know, there are no GPUs yet that offer PCIe5 support, but once there are, you could compare PCIe5 with PCIe4 to see whether you observe any improvement.
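On the software side, one mechanism for reducing per-call CPU/GPU interactions in workloads with many very small kernels (the Aerial-style scenario mentioned above) is CUDA graphs: record the launch sequence once via stream capture, then replay it with a single cudaGraphLaunch per iteration. A minimal sketch, where smallKernel is a hypothetical placeholder and cudaGraphInstantiate uses the CUDA 12 signature:

```cpp
#include <cuda_runtime.h>

__global__ void smallKernel(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record 20 tiny launches into a graph via stream capture.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 20; ++i)
        smallKernel<<<1, 256, 0, s>>>(d);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature

    // Each iteration now costs one host-side launch instead of twenty.
    for (int iter = 0; iter < 1000; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```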