I am using a Titan V GPU on a CentOS 7 workstation to profile two preexisting CUDA kernels. The kernels convolve a filter with input data, and to do this efficiently they use a lot of registers for each convolution. Under CUDA 10.2, each thread in the first kernel, P3_M32, uses 127 registers, whereas under CUDA 9.2 it used only 80. Similarly, the second kernel, P4_M32, uses 128 registers under CUDA 10.2 but required only 80 under CUDA 9.2. The increased register usage reduces the number of thread blocks that can be resident at any given time and adds at least 40% to each kernel's execution time. I know that the standard solution to this problem is to set the maximum register count, but when I have tried this in the past, the kernel spilled the excess values to local memory (which resides in off-chip device memory), slowing things down even more. Is there a reason why the CUDA 10.2 compiler does not make efficient use of registers? How do I overcome this problem?
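To make the occupancy effect of the register counts above concrete, here is a rough sketch of the arithmetic. It is a simplified model, not the full occupancy calculation: it ignores shared memory and block-size limits, and the per-SM constants (65,536 registers, 64 resident warps, a 256-register allocation unit per warp, a warp granularity of 4) are assumptions taken from the published figures for compute capability 7.0. The function name regLimitedWarps is hypothetical.

```cuda
// Simplified estimate of how many warps fit on one SM when registers are
// the only limit. Host-only code; it compiles with nvcc or any C++ compiler.
#include <algorithm>
#include <cstdio>

// Assumed limits for compute capability 7.0 (e.g. Titan V). Check the
// CUDA Occupancy Calculator for your device before relying on them.
constexpr int kRegsPerSm = 65536;         // 32-bit registers per SM
constexpr int kWarpSize = 32;             // threads per warp
constexpr int kMaxWarpsPerSm = 64;        // resident warp limit per SM
constexpr int kRegAllocUnit = 256;        // registers are allocated per warp in units of 256
constexpr int kWarpAllocGranularity = 4;  // warp count is rounded down to a multiple of 4

// Hypothetical helper: warps per SM allowed by register pressure alone.
int regLimitedWarps(int regsPerThread) {
    // Round the per-warp register demand up to the allocation unit.
    int regsPerWarp = regsPerThread * kWarpSize;
    regsPerWarp = ((regsPerWarp + kRegAllocUnit - 1) / kRegAllocUnit) * kRegAllocUnit;

    // Whole warps that fit, rounded down to the warp allocation granularity.
    int warps = kRegsPerSm / regsPerWarp;
    warps = (warps / kWarpAllocGranularity) * kWarpAllocGranularity;

    return std::min(warps, kMaxWarpsPerSm);
}

int main() {
    const int counts[] = {80, 127, 128};  // register counts quoted in the post
    for (int regs : counts) {
        std::printf("%3d registers/thread -> %2d warps/SM\n",
                    regs, regLimitedWarps(regs));
    }
    return 0;
}
```

Under these assumptions the estimate falls from 24 resident warps per SM at 80 registers to 16 at 127 or 128 registers. How that translates into resident thread blocks depends on the block size, which is not stated here.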
Without any code to look at, it is impossible to give recommendations. A 40% performance regression is sufficiently large that it warrants filing a bug report with NVIDIA, so that is one course of action you may want to consider.
Compilers use many code transformation stages, each of which is typically controlled by one or several heuristics. Heuristics have a tendency to change over time, as they are tweaked to deliver better code across the majority of use cases. Unfortunately, this process can also introduce performance regressions for some use cases.
Thanks for your response. I regret not being able to share the proprietary kernels with you. I understand that compilers have a lot going on. For me, compilers are a lot like sausage making: don't ask what goes on to get the code to the state where it can run.
I have tried a few things that have improved the performance somewhat. The most promising was compiling for the Pascal architecture (compute capability 6.0), for which this code was originally developed. That lowered the register use from 127 to 95 per thread. It still isn't the 80 registers that CUDA 9.2 gives, but it's something.
The problem will return when we have to move up to the Turing architecture and are required to use CUDA 10.2.
With some exceptions (Maxwell -> Pascal) the various GPU architectures have sufficiently different ISAs that one should expect register usage to differ between them for any given kernel.
The compiler backend, PTXAS, is responsible for register allocation, and it is this part of the toolchain that incorporates most of the machine-specific optimizations and heuristics. You could try reducing its optimization level: the default is full optimization, i.e. -Xptxas -O3, so maybe try -Xptxas -O2. Based on your latest description, it is not clear to me whether the performance differences you observe are due solely to a change in compiler version, or to a change in compiler version plus a change in target architecture.
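Since the original kernels cannot be shown, here is a minimal sketch of one way to see what PTXAS decided for a given kernel, and to constrain registers per kernel rather than for a whole file. It uses the __launch_bounds__ qualifier and the runtime calls cudaFuncGetAttributes and cudaOccupancyMaxActiveBlocksPerMultiprocessor. The kernel scaleKernel, the block size of 256, the minimum of 3 blocks per SM, and the file name in the build comment are all hypothetical stand-ins, not taken from the kernels discussed in this thread.

```cuda
// Build (file name is hypothetical); -Xptxas -v makes PTXAS print register
// and spill statistics for each kernel at compile time:
//   nvcc -arch=sm_70 -Xptxas -v launch_bounds_demo.cu -o launch_bounds_demo
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock = 256;  // hypothetical launch configuration
constexpr int kMinBlocksPerSm = 3;     // hypothetical occupancy target

// __launch_bounds__ tells the compiler the largest block size this kernel is
// launched with and how many blocks per SM we would like resident, so it can
// limit register use for this kernel only. 3 blocks of 256 threads on an SM
// with 65,536 registers leaves roughly 85 registers per thread at most.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSm)
scaleKernel(const float* in, float* out, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = gain * in[i];
    }
}

int main() {
    // Ask the runtime what the compiled kernel actually uses.
    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("registers per thread:    %d\n", attr.numRegs);
    // Non-zero local memory means per-thread data lives off-chip; register
    // spills are one possible cause (local arrays are another).
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);

    // Ask how many blocks of this size can be resident on one SM.
    int blocksPerSm = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, scaleKernel, kThreadsPerBlock, 0 /* dynamic shared memory */);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("resident blocks per SM:  %d\n", blocksPerSm);
    return 0;
}
```

Running the same query on builds from two toolkit versions, or for two target architectures, would show whether a tighter register bound is being paid for with local memory traffic. Whether such a bound helps the kernels in question can only be determined by measurement.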
As I said, filing a bug with NVIDIA may be the best way forward. Bug reports are confidential: only the filer and relevant NVIDIA personnel have access, so you can attach code there that you cannot share here.
As I recall, changing the optimization level to -O2 under CUDA 8.0 minimized the register usage, but with CUDA 10.2 it provides negligible improvement. Something must have changed in the heuristics with the current compiler. I will request permission to submit this code, or something similar, with a bug report, but I do not hold out too much hope there. Thanks for your help.