I am trying to find an explanation for something I observed recently while optimizing a signal processing pipeline running on a Volta GPU. Basically, when I change the implementation of one kernel, the other kernels in the pipeline see a non-trivial slowdown: e.g., one kernel's average execution time goes from around 1 ms to 1.3 ms, and the other kernels see a similar 20-30% increase in execution time.
I could somewhat understand that effect if the GPU were originally underutilized when running the kernel being optimized, but it is not. I used nsys to look at the execution profile, and it is obvious that both before and after the change, the changed kernel does not overlap with any other kernel on the execution timeline. So I would expect the reduced execution time of the optimized kernel to have no effect on the execution time of the other kernels, but that is not what I see. Has anybody seen this before, and why would it happen?
Btw, all execution times were measured using nsys.
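For reference, the measurement workflow looks roughly like this (a sketch only: `my_pipeline` is a placeholder for the actual binary, and the report name `cuda_gpu_kern_sum` is from recent Nsight Systems releases; older versions call it `gpukernsum`):

```shell
# Capture a timeline of the whole pipeline (CUDA tracing is enabled by default)
nsys profile -o pipeline_report ./my_pipeline

# Summarize per-kernel GPU execution times from the capture
nsys stats --report cuda_gpu_kern_sum pipeline_report.nsys-rep
```

The per-kernel summary is where the before/after averages quoted above would come from.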
Have any details of the functionality changed with the change in implementation? For example, is the exact same data being produced, and stored into exactly the same locations, written in the same order? Were there any numerical changes such as a change of a scale factor? Were there any inadvertent changes to the compilation switches of other kernels (e.g. removal of -use_fast_math).
Make sure you fix clocks as much as possible when doing these comparisons. Otherwise the dynamic clocking of the GPU based on power and temperature headroom might influence performance: A power-hungry kernel could lead to depressed operating frequency while one or more subsequent kernels are running.
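One way to pin the clocks is sketched below with nvidia-smi (requires root and a Volta-or-newer GPU with a sufficiently recent driver; the 1380 MHz value is only an example, pick a frequency your GPU actually supports as shown by the query):

```shell
# Query supported and current clocks
nvidia-smi -q -d SUPPORTED_CLOCKS
nvidia-smi -q -d CLOCK

# Lock the graphics clock to a fixed frequency (example value in MHz)
sudo nvidia-smi --lock-gpu-clocks=1380,1380

# ... run the before/after measurements here ...

# Restore default dynamic clock management afterwards
sudo nvidia-smi --reset-gpu-clocks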
Have you tried using the profiler to look at performance characteristics of the other kernels before and after changing this one kernel? I do not know what any of the kernels are doing, but looking at metrics related to the memory subsystem would be the first thing I would look at.
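A sketch of how that per-kernel comparison might look with the Nsight Compute CLI (ncu) — the metric names below are examples of memory-subsystem counters from recent ncu versions, and `my_pipeline` is a placeholder:

```shell
# Collect DRAM and L2 traffic for every kernel, before the change
ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum,lts__t_sectors.sum \
    -o before ./my_pipeline

# ...apply the change, rebuild, then collect again...
ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum,lts__t_sectors.sum \
    -o after ./my_pipeline

# Print a per-kernel summary of a captured report for comparison
ncu --import before.ncu-rep --print-summary per-kernel
```

Large shifts in DRAM or L2 traffic for the *unchanged* kernels would be a strong clue that the changed kernel is altering the memory-subsystem state they inherit.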
Thanks for the quick reply. The optimization does change the memory access pattern in the changed kernel, it was not coalesced and now it is, so even though the same data is produced and written to the same location, the ordering has changed. Also, shared memory was not used in the previous implementation, but it is now.
My assumption was that since the profiler shows the kernels executing sequentially, these kinds of changes would not affect the other kernels (e.g. shared memory/cache partitioning is done on a per-kernel basis, and different kernels are not competing for memory bandwidth since they run one after the other), but I could be wrong.
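To illustrate the kind of rewrite described (this is not the actual kernel, just the textbook pattern): a classic uncoalesced-to-coalesced change is a matrix transpose, where staging through shared memory turns a strided global access into coalesced ones:

```cuda
#define TILE 32

// Naive version: the write to out is strided, i.e. uncoalesced.
__global__ void transpose_naive(float *out, const float *in, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        out[x * n + y] = in[y * n + x];  // strided write across threads
}

// Shared-memory version: both global read and write are coalesced.
__global__ void transpose_shared(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices for output
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```

A change like this alters which cache lines are touched and in what order, so the L2 contents that the *next* kernel starts from are different even though the kernels never overlap.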
Given that this changed kernel accounts for only ~5% of the entire pipeline before optimization, it is probably not causing frequency scaling, but yes, I should watch out for that, and I can certainly check the other kernels' metrics. Thanks for these suggestions.
Do you know of any pointers to NVIDIA documentation mentioning that these kinds of changes can have an effect across kernel boundaries? Thank you.
Everything I mentioned is not specific to programming with CUDA; it applies to modern systems in general. Some effects that come into play on CPUs do not apply, or rarely apply, to GPUs, for example coupling through branch prediction mechanisms or instruction cache effects.
Other than brainstorming for ideas, there is little we can do here. You have access to the code and all the data and can observe the system while it is running, whereas the information shared in this thread represents approximately 0.1% of what is available to you.
Since this is a situation where one can go back and forth at will between the old and the new version (I am assuming version control is being used), you could try to isolate the effect of portions of the changes if they are separable. That might narrow down the proximate cause. Beyond that, comparing profiler stats before and after the change should result in detectable changes to at least some metrics, providing additional clues.