Different Number of Registers Used in a Compiled Kernel

Hi everyone,

I recently developed a kernel and ran it on two different platforms, an NVIDIA RTX 3090 and a V100. For compilation, I used CUDA 12.0.1 with the compiler options “-O3 -arch=sm_70”. Surprisingly, I observed a significant performance discrepancy, with the V100 achieving only half the performance of the 3090.

Upon further investigation, I discovered that the occupancy on the V100 is low. This appears to be due to a higher number of registers being used compared to the 3090. I gathered the register count using the following code: `cudaFuncGetAttributes(&funcAttrib1, kernel); printf(" kernel: numRegs:%d\n", funcAttrib1.numRegs);`
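For anyone who wants to reproduce this kind of check, here is a minimal self-contained sketch (the dummy kernel is a hypothetical stand-in for the real one) that queries the register count and also the theoretical occupancy at an assumed block size of 256 threads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute your own.
__global__ void kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("kernel: numRegs:%d\n", attr.numRegs);

    // Theoretical upper bound on resident blocks per SM for this kernel
    // at 256 threads per block and no dynamic shared memory.
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, kernel, 256, 0);
    printf("max active blocks per SM at 256 threads: %d\n", maxBlocksPerSM);
    return 0;
}
```

Running this on each GPU shows both numbers side by side, which makes it easy to see whether the register difference actually moves the occupancy limit.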

What confuses me is that the register count differs even when I use the same compilation configuration. I even tried compiling on the 3090 and copying the executable to the V100, only to find that the register count still changed.

Why does the register count change, and can the lower occupancy explain the worse performance on the V100 compared to the 3090? Thank you.

The portion of the compiler that translates PTX (an intermediate representation) to SASS (machine code) is an optimizing compiler that performs many architecture-specific optimizations. It also performs register allocation and instruction scheduling. Therefore one should expect different register usage across different GPU architectures for the same source code. Usually, these differences are fairly small, but larger differences may occur. Note that with -arch=sm_70 the executable contains SASS only for sm_70, plus compute_70 PTX; when that executable runs on the 3090 (sm_86), the driver JIT-compiles the embedded PTX to sm_86 SASS at load time, which is why the register count differs even for the same binary. Without buildable reproducer code, it is impossible to say what specifically is happening with your code base, and wildly speculating about possible reasons does not seem indicated. Your source code may also contain architecture-specific code paths which could cause such differences.
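One way to see the per-architecture register allocation directly (assuming nvcc and cuobjdump from the CUDA toolkit are on the path; kernel.cu and a.out are placeholder names) is to ask ptxas for its resource-usage report for each target:

```shell
# Report register/shared-memory usage for each target architecture.
nvcc -O3 -Xptxas -v \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_86,code=sm_86 \
     -c kernel.cu

# Inspect the SASS actually embedded in an existing binary.
cuobjdump --dump-sass a.out
```

The -Xptxas -v output prints a "Used N registers" line per kernel per architecture, so the sm_70 and sm_86 allocations can be compared without running anything on the GPUs.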

Maybe. While there is often some correlation between occupancy and application-level performance, it is not a particularly strong correlation. Instead of engaging in speculation, it would be best to examine the performance characteristics of the code in question with the help of the CUDA profiler, which allows programmers to pinpoint the bottlenecks in their code.
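For example, Nsight Compute can report achieved occupancy and run its broader bottleneck analysis directly (./app is a placeholder for your binary):

```shell
# Achieved occupancy: active warps as a percentage of the SM's peak.
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./app

# Or collect the full section set for a complete bottleneck overview.
ncu --set full ./app
```

Comparing the achieved-occupancy number on both GPUs against the memory- and compute-throughput sections will show whether occupancy is actually the limiter, or whether something else (e.g. memory bandwidth) dominates.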

Agree with njuffa, regarding absence of specifics, but at the hardware level there are some generalities to consider.

The 3090 has the advantage in both GPU clock speed and memory bandwidth, and with respect to FP32 throughput it has a considerable one: 35.58 vs. 14.13 TFLOPS.

The SM8.6 whitepaper gives some detail.

Thanks a lot

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.