In a recent thread, we established that the launch overhead of null kernels (kernels that do nothing) appears to have dropped to about 3 microseconds with recent hardware and software, which constitutes a new “speed of light”.
As a general principle, any time multiple software instances attempt to access a single physical resource, latency is likely to increase, as some form of communication has to occur to negotiate access between these instances.
As GPU performance increases, launch overhead is increasingly likely to account for a noticeable share of total kernel execution time. Programmers should therefore strive to pack a sufficient amount of work into each kernel launch. As a rule of thumb, one might want to target a minimum kernel runtime of around 1 millisecond on high-end GPUs, which keeps a launch overhead of about 3 microseconds well below one percent of the total. Obviously, that is not always achievable.