Hello,
we are currently developing software that uses dynamic parallelism to launch different versions of the same template kernel, in order to get the best performance for a computation. It is a dispatch approach: a base kernel launches a small number of different kernels all at the same time, so that they can run in parallel. These different versions of our template kernel enable increasingly complex features through template parameters. The goal is to handle the simple cases with small, efficient kernels that can run in large blocks, and the complex cases with heavier kernels that really cannot be advantageously simplified or split into smaller kernels, given the structure of our computation.
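For context, here is a minimal, simplified sketch of the dispatch structure we mean (kernel names, template parameters and launch configurations are hypothetical placeholders; the real code is compiled with -rdc=true and linked against cudadevrt):

```
#include <cuda_runtime.h>

// One template kernel; non-type template parameters enable increasingly
// complex features at compile time (hypothetical flags for illustration).
template <bool FeatureA, bool FeatureB>
__global__ void computeKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    if (FeatureA) v = v * v;   // compiled out when the flag is false
    if (FeatureB) v += 1.0f;
    out[i] = v;
}

// Base kernel: a single thread dispatches the specialized versions into
// separate device-side streams so the child grids can run concurrently.
__global__ void dispatchKernel(const float* in, float* out, int nSimple, int nComplex)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cudaStream_t sSimple, sComplex;
        cudaStreamCreateWithFlags(&sSimple, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&sComplex, cudaStreamNonBlocking);

        computeKernel<false, false>
            <<<(nSimple + 255) / 256, 256, 0, sSimple>>>(in, out, nSimple);
        computeKernel<true, true>
            <<<(nComplex + 255) / 256, 256, 0, sComplex>>>(in + nSimple, out + nSimple, nComplex);

        cudaStreamDestroy(sSimple);
        cudaStreamDestroy(sComplex);
    }
}
```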
While implementing this approach, we are facing several questions regarding the behavior of nvcc and its capabilities, and we would very much appreciate any insight or ways to mitigate the issues we have.
- First, a question regarding live register counts as reported by Nsight Compute (great tool, by the way): we have a non-inlined function texAccum2DP that, according to Nsight Compute, requires at most 26 registers (I assume this includes the cost of storing the parameters, but is that correct?). When it is called via call.abs.noinc in a context where the live register count is 22, the reported live register count goes up to 81, and is then 37 after the call, not that close to the 22 we had just before. This is illustrated below by an Nsight Compute capture of such a case. We do not understand this behavior: starting at 22, we would expect the live register count to peak at around 50 at most; could we have an explanation? We observed this behavior in many different places. We profiled a small kernel; could it be related to the fact that this function is also called by larger kernels, in contexts where the live register count before the call would be higher? We tried to create a specific function called only by the small kernel (or so we tried, maybe we missed something), and the result on live registers was exactly the same. A minimal sketch of this call pattern is included after this list.
- Second, a question partly out of curiosity about how spilling is managed, but also because it has a practical impact: we observed that spilling seems to be handled at the beginning of a non-inlined function, and restoration at its end. In any case, whatever the context in which a non-inlined function is called, the number of stores to and loads from local memory is the same, and seems to correspond to the context where the most spilling is required (which seems normal when spilling is done “statically” like that). We would have expected it to be done before and after a call to the function, in the context of the caller, based on what is actually live at the call site. For functions called from kernels with massively different complexities, and thus live register counts that differ greatly from one call site to another, we fear that the spilling overhead required by the large kernels hurts the execution efficiency of the simpler kernels, both because of the memory accesses themselves, and because the reported local-memory consumption prevents the scheduler from allocating many threads per block, or from keeping as many blocks active at the same time as would really be possible, thus severely limiting the amount of parallelism the small kernels could ideally reach. A small sketch of the situation we have in mind is also included after this list.
- Third and final question, about per-kernel compiler optimisation: as we want to use dynamic parallelism, we must have, as far as we know, a single cubin file containing all the kernels we want to dispatch. As the different versions of the kernels have very different complexities and register requirements, we would ideally like to specify a very low max register count for the small kernels, so that many of them can run per multiprocessor, and less restrictive settings for the larger kernels, for which the spilling cost (both the stores/loads themselves and the local-memory requirements) would become very problematic, to the point where the summed execution time of highly optimised small kernels on one side and “over-spilling” large kernels on the other would be larger than that of “mildly optimised” small and large kernels. Is there any way we can approach this behavior? If not, would it lead to an execution efficiency similar to dynamic parallelism if we used one stream per kernel version, had one cubin per kernel (or per group of kernels with the same maximum allowed register count), and launched all the kernels at the same time from the CPU? A minimal sketch of this CPU-side alternative is included below as well.
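To make the first point more concrete, here is a sketch of the call pattern we are profiling; the signature and body of texAccum2DP shown here are purely hypothetical, only the non-inlined call from a small kernel matters:

```
// Hypothetical signature and body: the real texAccum2DP is different. The only
// relevant aspect is the __noinline__ call (call.abs.noinc at the SASS level)
// around which Nsight Compute reports the live register counts.
__device__ __noinline__ float texAccum2DP(cudaTextureObject_t tex,
                                          float u, float v, int taps)
{
    float acc = 0.0f;
    for (int i = 0; i < taps; ++i)
        acc += tex2D<float>(tex, u + (float)i, v);
    return acc;
}

__global__ void smallKernel(cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Few values are live at this call site; in our real kernel Nsight Compute
    // reports 22 live registers before the call, 81 at the call and 37 after it.
    out[i] = texAccum2DP(tex, (float)i + 0.5f, 0.5f, 4);
}
```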
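For the second point, a sketch of the situation we have in mind: the same non-inlined helper called from a very simple and a much heavier kernel (again hypothetical code). Our concern is that the spill stores/loads placed at the helper's entry and exit are sized for the worst calling context, so the simple kernel pays for them as well; the per-function spill counts are what we read from nvcc -Xptxas -v and from the profiler.

```
// Hypothetical helper shared by kernels of very different complexity.
__device__ __noinline__ float sharedHelper(const float* data, int count)
{
    float acc = 0.0f;
    for (int i = 0; i < count; ++i)
        acc += data[i];
    return acc;
}

// Simple kernel: very few values are live at the call site.
__global__ void lightKernel(const float* data, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sharedHelper(data + i, 8);
}

// Complex kernel: many values are live across the call, so spilling may be
// required. In our real code, the spill/fill instructions sit in the helper's
// prologue and epilogue and are identical for every caller, so the simple
// kernels seem to pay the same local-memory traffic and footprint as the heavy ones.
__global__ void heavyKernel(const float* data, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = data[i], b = a * a, c = b + a, d = c * b;   // values live across the call
    float s = sharedHelper(data + i, 64);
    out[i] = s + a + b + c + d;
}
```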
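And for the third point, a minimal sketch of the CPU-side alternative we are considering, with hypothetical names: each kernel (or group of kernels sharing the same register cap) would live in its own compilation unit, compiled with its own -maxrregcount, and the host would launch everything concurrently in separate streams instead of going through the device-side dispatch kernel. Whether this would reach an efficiency similar to the dynamic-parallelism dispatch is exactly what we are asking above.

```
#include <cuda_runtime.h>

// Kernels defined in separately compiled units, each with its own register cap, e.g.
//   nvcc -c light_kernels.cu -maxrregcount=32
//   nvcc -c heavy_kernels.cu                  (no cap, or a higher one)
__global__ void lightKernel(const float* in, float* out, int n);
__global__ void heavyKernel(const float* in, float* out, int n);

void launchAll(const float* in, float* out, int nLight, int nHeavy)
{
    cudaStream_t sLight, sHeavy;
    cudaStreamCreate(&sLight);
    cudaStreamCreate(&sHeavy);

    // Both launches are asynchronous; the grids can overlap on the GPU in the
    // same way the device-side dispatch intends them to.
    lightKernel<<<(nLight + 255) / 256, 256, 0, sLight>>>(in, out, nLight);
    heavyKernel<<<(nHeavy + 255) / 256, 256, 0, sHeavy>>>(in + nLight, out + nLight, nHeavy);

    cudaStreamSynchronize(sLight);
    cudaStreamSynchronize(sHeavy);
    cudaStreamDestroy(sLight);
    cudaStreamDestroy(sHeavy);
}
```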
Thanks for any answer or insight you can provide on these subjects.
Best regards.
