GPU work launch latencies have had a lower bound in the range of ~5us for quite a while. Some of this is attributable to library (CPU code) overhead, so having a fast CPU may help. More recently, for certain kinds of work issuance strategies, CUDA Graphs can reduce this further.
I wouldn’t be able to give you a detailed breakdown of where exactly this ~5us launch latency is spent.
Observed work launch latencies that are much longer than 5us (say, above 50us) are most likely due to some other contributor, and profiling is your best avenue to identify the culprit. Windows WDDM is also a contributor to work launch latency, and the usual advice there is to switch to TCC mode (or Linux) if possible.
Multithreading may not help much. The CUDA runtime API is generally thread-safe, but that doesn’t mean threading has no impact on performance. Careful study of the runtime API behavior will show that multithreaded calls to the runtime often negotiate for locks (under the hood), so it’s possible to demonstrate that in various multithreading scenarios the runtime API latency increases. You can find forum posts here to this effect. So issuing work from a single thread may be a good choice if it’s amenable to your workflow.
In a recent thread we established that the launch overhead of null kernels (kernels that don’t do anything) appears to have been reduced to about 3 microseconds with recent hardware and software, which constitutes a new “speed of light”.
As a general principle, any time multiple software instances attempt to access a single physical resource, latency is likely to increase, as some form of communication has to occur to negotiate access between these instances.
As GPU performance increases, launch overhead is more likely to have a noticeable negative impact, because kernels finish faster while the launch cost stays roughly the same. Programmers should therefore strive to pack a sufficient amount of work into each kernel launch. As a rule of thumb, one might want to target a minimum kernel runtime of around 1 millisecond for high-end GPUs. Obviously that is not always realizable.
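To put rough numbers on that rule of thumb, here is a minimal sketch. It assumes a pessimistic model in which the launch overhead is fully exposed for every launch (in practice asynchronous launches often overlap with execution), and it uses the ~5us figure quoted earlier in this thread. The function names and the example kernel runtimes are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Fraction of total time spent on launch overhead, assuming (pessimistically)
// that no launch overlaps with kernel execution.
double launch_overhead_fraction(double launch_us, double kernel_us) {
    assert(launch_us >= 0.0 && kernel_us >= 0.0 && launch_us + kernel_us > 0.0);
    return launch_us / (launch_us + kernel_us);
}

// Prints the overhead share for a few hypothetical kernel runtimes.
void print_overhead_table() {
    const double launch_us = 5.0;                    // ~5us figure from this thread
    const double kernel_us[] = {5.0, 50.0, 1000.0};  // hypothetical kernel runtimes
    for (double k : kernel_us) {
        std::printf("kernel %7.1f us -> %5.2f%% launch overhead\n",
                    k, 100.0 * launch_overhead_fraction(launch_us, k));
    }
}
```

Under this model a 5us kernel spends half of its time on launch overhead, while a 1 millisecond kernel spends about 0.5%, which is the motivation for packing more work into each launch.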
When I profiled the code with the Nsight Systems tool, I saw that every thread goes into a waiting state when it calls an API. In the OS runtime libraries row I see pthread_rwlock_rdlock, pthread_rwlock_wrlock, or pthread_mutex_lock, and that is where most of the API latency is spent. Then the API call finishes and the thread returns to the running state.
I also have another question: when the CPU thread goes into a waiting state, why does the API call take so long to reach the GPU, instead of being submitted and executed immediately, even if the GPU is idle?
Do you have any suggestions for my case, or can you point me to something that would help with this? Note that I am using Linux.
Sorry if these are naive questions, but I am really confused. I need to know what makes the thread go into a waiting state immediately after calling the GPU API. Can I do anything to get the lock released faster?
Also, I can’t understand why the API call waits until the CPU thread is released instead of being submitted to the GPU for execution, even though the CPU has already called the API. What is the API waiting for?
Suppose I have 4 threads, and suppose each thread makes a call to the CUDA runtime API (typically a function call beginning with cuda..).
The CUDA runtime API may be designed in such a way that those threads cannot all independently run in parallel, i.e. concurrently. The CUDA runtime API may choose to implement a lock:
CPU thread makes a runtime call
other CPU threads are doing the same thing
CPU thread must acquire a lock
if the CPU thread does not immediately acquire the lock, perhaps because another CPU thread “owns” the lock, then it must wait
eventually, that CPU thread is granted the lock
CPU thread can then complete whatever activity was requested from the CUDA runtime API
CPU thread returns from the runtime API library call
All of the above activity is taking place on the host, using CPU threads. Even if there is GPU activity, the CPU thread that waits is not waiting explicitly on GPU activity; it is waiting on release of the lock.
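The sequence above can be sketched with ordinary C++ threads. This is only an illustration under stated assumptions: FakeRuntime, api_call, and run_threads are hypothetical names, and the single std::mutex is a stand-in for whatever locking the real library does internally, which is not documented here. No GPU is involved at all, which is the point: the waiting happens entirely on the host.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a library that serializes its callers with an
// internal lock. This is NOT the CUDA runtime; all names are made up.
struct FakeRuntime {
    std::mutex lock;             // internal lock shared by all calling threads
    long calls_completed = 0;    // protected by lock
    std::atomic<long> waits{0};  // times a caller found the lock unavailable

    void api_call() {
        if (!lock.try_lock()) {  // lock not acquired immediately
            waits.fetch_add(1, std::memory_order_relaxed);
            lock.lock();         // thread blocks on the host until granted
        }
        ++calls_completed;       // the "requested activity"
        lock.unlock();           // return from the library call
    }
};

// Runs num_threads threads that each make calls_per_thread calls and returns
// the total number of completed calls.
long run_threads(FakeRuntime& rt, int num_threads, int calls_per_thread) {
    assert(num_threads > 0 && calls_per_thread >= 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&rt, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) rt.api_call();
        });
    }
    for (auto& th : threads) th.join();
    return rt.calls_completed;   // safe to read: all threads have been joined
}
```

Every call completes regardless of the thread count, but with several threads the waits counter will typically be nonzero (the exact value varies from run to run), and each such wait adds to the latency that thread observes for its call.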
I don’t know of anything that can help you release the lock of the CPU faster.
I’m unlikely to be able to help any further here, and probably won’t be able to respond to additional requests for the same information.
You may wish to google for information about CPU threads and locks, to get more general information/background.
The performance of lock operations typically depends on (1) the operating system and (2) the single-thread performance of the CPU.
It has been decades since I last looked at the lock-handling performance of different operating systems. If for some reason you do not wish to follow the helpful advice to issue GPU work from a single thread, you may want to do performance comparisons between different operating systems and/or check whether there are OS configuration settings that influence lock-handling performance.
As for the single-thread performance of the CPU, this is primarily driven by clock frequency. I recommend using a CPU with a base clock of >= 3.5 GHz.
Sorry for the late reply. I wrote the same program twice, once as a multi-threaded program and once as a single-threaded one. The kernel launch latency decreased a lot in the single-threaded case, but the total time taken by the multi-threaded program is less than the total time taken by the single-threaded program.
So do you see anything else that can be done?
issue GPU work from a single thread != single-threaded program
The description provided to this point appears to indicate that the application-level performance is not dominated by kernel launch latency. Since the app overall benefits from multi-threading, hypothetically an optimal solution would involve a multi-threaded application that uses a single thread to issue work to the GPU, with all other threads coordinating with that thread. How to do that in a lockless fashion, or with minimal locking, is left as an exercise to the application programmer.
In a sense, this simply moves the problem of coordinating access to a shared resource up one level in the software stack. The potential advantage of this is that the details of this coordination are no longer a black box, but are fully transparent to the application programmer, allowing potentially higher-performance solutions based on contextual knowledge.
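For what it’s worth, one possible shape of such a design is sketched below. It is not lockless: it uses a plain mutex-protected queue with a condition variable, which is the simplest variant rather than the fastest one. SingleIssuer, submit, and demo are hypothetical names, and the work items are ordinary callables so that the sketch runs without a GPU; in a real application each item would wrap the actual GPU API calls.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: one dedicated thread drains a queue of work items, so
// only that thread ever issues work. Other threads just enqueue.
class SingleIssuer {
public:
    SingleIssuer() : worker_([this] { run(); }) {}
    ~SingleIssuer() {
        {
            std::lock_guard<std::mutex> g(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the issuer drains remaining items before exiting
    }
    void submit(std::function<void()> item) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> item;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;  // shutting down and nothing left
                item = std::move(q_.front());
                q_.pop();
            }
            item();  // executed outside the queue lock, on the issuer thread
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist first
};

// Several producer threads hand work to the single issuer; returns how many
// items the issuer thread executed.
int demo(int producers, int items_each) {
    assert(producers > 0 && items_each >= 0);
    int issued = 0;  // written only by the issuer thread, so no extra lock
    {
        SingleIssuer issuer;
        std::vector<std::thread> threads;
        for (int p = 0; p < producers; ++p) {
            threads.emplace_back([&issuer, &issued, items_each] {
                for (int i = 0; i < items_each; ++i)
                    issuer.submit([&issued] { ++issued; });
            });
        }
        for (auto& t : threads) t.join();
    }  // issuer destructor drains the queue and joins the issuing thread
    return issued;
}
```

The producers still contend on the queue lock, but that lock is held only for a push or a pop, and because it lives in application code it can be replaced (for example by per-thread queues or a lock-free queue) where profiling shows it matters.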