Same kernel function uses a different number of registers on different platforms

markusxwr · April 26, 2024, 8:22am

Hi everyone, when I profile the same two kernel functions on different platforms (a GTX 1660 Ti in my laptop and an RTX 2080 Ti on the lab server, both with the same compute capability), I get different results. Let's call them kernel function 1 and kernel function 2. Both do the same thing, and both are launched with 1 block of 1024 threads. I designed kernel function 1 so that it should use more registers and take less time to run, and that is what I see on the GTX 1660 Ti in my laptop: kernel function 2 uses 37 registers and has a duration of 488, while kernel function 1 uses 41 registers (+10.8%) and has a duration of 459 (-5.9%).
But when I profile my program on the lab server, kernel function 2 uses 41 registers and has a duration of 22.94, while kernel function 1 uses 37 registers (even 1 less) and has a duration of 26.59 (+15.9%). I cannot understand why kernel function 2's register count increases when I run it on the 2080 Ti. And do the extra registers then give better memory locality and save time?
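For anyone who wants to reproduce the comparison outside the profiler, here is a minimal sketch of how the register count could be read at runtime on both machines with `cudaFuncGetAttributes`. The bodies of `kernel1` and `kernel2` are hypothetical placeholders (my real kernels are not shown here), and `report` is just an illustrative helper name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; the real kernel function 1 and 2 are not shown in this post.
__global__ void kernel1(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

__global__ void kernel2(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] + 1.0f;
}

// Print the per-thread register count and the code versions reported for one kernel.
static void report(const char *name, const void *func) {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, func);
    if (err != cudaSuccess) {
        printf("%s: %s\n", name, cudaGetErrorString(err));
        return;
    }
    printf("%s: numRegs=%d localSizeBytes=%zu binaryVersion=%d ptxVersion=%d\n",
           name, attr.numRegs, attr.localSizeBytes,
           attr.binaryVersion, attr.ptxVersion);
}

int main() {
    // The runtime version helps tell apart builds made with different toolkits.
    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("CUDA runtime version: %d\n", runtimeVersion);

    // Device 0 is assumed to be the GPU being profiled.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        printf("%s: compute capability %d.%d, regsPerBlock=%d\n",
               prop.name, prop.major, prop.minor, prop.regsPerBlock);
    }

    report("kernel1", (const void *)kernel1);
    report("kernel2", (const void *)kernel2);
    return 0;
}
```

Running the same source on both machines and comparing `numRegs`, `binaryVersion`, `ptxVersion` and the runtime version should at least show whether the two builds are really the same code. Compiling with `nvcc -Xptxas -v` prints the register usage at compile time as a cross-check.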

Another situation: I ran a similar program (slightly modified so that kernel function 1 uses even more registers, while kernel function 2 stays the same). This time I launched both kernel functions with 1 block of 512 threads. As I expected, kernel function 1 uses more registers than kernel function 2, and its duration is also shorter. This profiling result (the comparison of duration and register count) agrees with my expectation. But I also found that the same kernel function can use very different register counts on the two GPUs: for example, kernel function 1 uses 43 registers on the GTX 1660 Ti but 60 registers on the RTX 2080 Ti. What causes the register count to differ so much?
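To rule out that the launch configuration itself limits the register count, here is a rough sketch of the per-block register arithmetic for the numbers above. It is a simplification that ignores the hardware's register allocation granularity. `registersPerBlock` is a hypothetical helper, and the 65536 fallback is an assumed limit that is only used if the device query fails.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Registers needed by one block: registers per thread times threads per block.
// Simplification: the hardware's allocation granularity is ignored.
static int registersPerBlock(int regsPerThread, int threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

int main() {
    // Assumed fallback limit; replaced by the queried value when a device is present.
    int limit = 65536;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        limit = prop.regsPerBlock;
    }

    // {registers per thread, threads per block} as reported in this post.
    const int cases[4][2] = {{37, 1024}, {41, 1024}, {43, 512}, {60, 512}};
    for (int i = 0; i < 4; ++i) {
        int needed = registersPerBlock(cases[i][0], cases[i][1]);
        printf("%d regs x %d threads = %d of %d registers per block (%s)\n",
               cases[i][0], cases[i][1], needed, limit,
               needed <= limit ? "fits" : "exceeds limit");
    }
    return 0;
}
```

Under this simplified view, the largest case (41 registers with 1024 threads) needs 41984 registers per block, so as long as the queried `regsPerBlock` is at least that, the block size alone would not explain the different counts.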