Oddly high regcounts in sm_70 compared to sm_61

Hello everybody, and thanks in advance.

We have a fairly large CUDA codebase with a relatively hefty number of kernels in our project.

We target all architectures starting at sm_30, and running some tests this week we found that our code runs more slowly on a GV100 than it does on a 1080Ti.

After further inspection, I discovered that the regcounts for nearly all our kernels are higher for sm_70 than they are for sm_61. In some cases, much higher (+20…40 registers).

We are using NVCC / CUDA 9.1, and as of now, there is no arch-dependent code branching between sm_6x and sm_70. We control register usage via launch_bounds, and we use the exact same constraints in sm_6x and sm_70. All constraints are 64 or 128 regs/thread, depending on how hungry each kernel is. Those are the values that work best for us.
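For illustration (these kernels are made up, ours are much bigger, but the launch_bounds usage is the same), the constraints look roughly like this. The 64/128 figures follow from the 64K-register SM register file divided by the number of resident threads the bounds allow:

// Illustrative sketch only, not our real kernels.
// Both sm_61 and sm_70 have 65536 32-bit registers per SM, so:
//   256 threads/block * 4 blocks/SM = 1024 threads -> 65536 / 1024 = 64 regs/thread
//   256 threads/block * 2 blocks/SM =  512 threads -> 65536 /  512 = 128 regs/thread
__global__ void __launch_bounds__(256, 4)      // capped at 64 regs/thread
lightKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

__global__ void __launch_bounds__(256, 2)      // capped at 128 regs/thread
heavyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i] + in[i];
}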

The total sum of registers (counting all our kernels) in sm_61 is about 6100, and about 6700 in sm_70.

Any ideas as to why the register usage may be so different? The code (as is) runs approx. 10% slower on our GV100 compared to the 1080Ti cards we own.

Again, thank you very much!

Does the compiler honor or violate the constraints you are imposing with launch_bounds? If it is violating the constraints, consider filing a bug report with NVIDIA.

If the compiler is honoring the constraints, what you are observing would be a feature, not a bug. The compiler’s register allocation stage is free to use as many registers as it desires, subject to the constraints imposed by launch_bounds or other mechanisms. Using more registers (within the allowed limit) might allow it to schedule instructions better (maybe to account for increased latencies), use common subexpression elimination more often (requiring temp storage for those subexpressions), or create additional induction variables.

You seem to imply that the higher register usage is responsible for the performance difference you are observing. That may or may not be the case. You could test such a hypothesis by modifying your launch_bounds settings so that the compiler is forced to use fewer registers, and check whether this leads to improved performance.
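A minimal sketch of such an experiment (the kernel, sizes, and the bound itself are placeholders): build one binary with the current bound, another with a tighter bound, and compare the reported times on the GV100.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; rebuild once with __launch_bounds__(256, 2) (128 regs/thread)
// and once with __launch_bounds__(256, 4) (64 regs/thread), then compare timings.
__global__ void __launch_bounds__(256, 2)
testKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i] + 1.0f;
}

int main()
{
    const int n = 1 << 24;
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    testKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}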

Have you profiled the kernels in question with the CUDA profiler, and compared relevant statistics on the two platforms? That might provide some solid clues as to why the kernels are running more slowly on the GV100. Based on your understanding of performance characteristics (e.g. roofline analysis), what performance did you expect on the GV100?

If it is possible, you may want to try the CUDA 9.2 toolchain as well. Every major GPU architecture is different (no binary compatibility), and typically requires new (machine-specific) compiler backend components, which tend to be less mature when a new architecture (such as Volta) first ships. I haven’t had a chance to try Volta, so I wouldn’t know how different code generation is for the new architecture. The differences used to be fairly significant with each new GPU generation; only Pascal and Maxwell are close because these architectures are quite closely related.

Thank you very much for the detailed reply, @njuffa.

The compiler does honor the constraints. I tried different combinations for launch_bounds on the Volta card and nothing improved performance. The outcome ranged between same speed and slower.

I understand that the compiler is free to choose how to make use of registers, and I don’t think that what I am experiencing is a bug at all. My hypothesis is more along these lines: “I may be writing code here and there in a way that works great on previous architectures, but not so great on a GV100”.

I haven’t profiled on the GV100 yet because the board is on a remote computer I have somewhat limited access to, but eventually I will.

I will start by trying CUDA 9.2 out just in case.

I will post my findings here, if any.

That is entirely possible. It takes some time to get an intuitive feel for a new architecture. The CUDA profiler should be able to help with that, as could looking at generated machine code (SASS). Looking at more than one application is likely also helpful.

In general, CUDA compilers improved years ago to the point where CUDA programmers do not have to “write to the architecture”. Once more people have access to Volta than just the privileged few, there should also be a growing public body of shared experience on this and other forums.

I tracked the problem down and found a solution. Here’s a summary of my findings:

1- After the fix, our render engine runs +50% faster on a GV100 than it does on a 1080Ti, which is awesome and even more than I expected.

2- I tried CUDA v9.2 as @njuffa suggested. After doing so, our code started crashing on Volta, while running normally on every other architecture. With CUDA v9.1, however, our code was running “oddly slow” on Volta, and normally on every other architecture.

3- I managed to narrow the problem down to the area in the code where we do something like what’s described in C.2.5.1. Discovery Pattern in the CUDA C Programming Guide, i.e., “aggregating atomic increment across threads in a warp”.

3.1- That piece of code lives in a device function that I found to deliver more or less performance depending on whether the function is force-inlined or marked noinline, which is a decision we make on a case-by-case basis. This works fine on older architectures…

3.2- …but on Volta, when said code is wrapped in a noinline function, the calling kernel may either behave erratically or crash outright. The oddly slow speed I was measuring was probably nothing but a side effect (we use this function to fill up some temporary result buffers).

So, the solution that works (at least for me, in our code) is to simply inline those calls. Now everything works identically on all architectures, with no arch-dependent branching whatsoever, and we get that sweet +50% boost on Volta.
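For reference, here is a simplified sketch of the pattern we now force-inline. This is not our actual code: names are made up, the lanes-below mask is computed from the lane index rather than the %lanemask_lt special register, and 1-D blocks are assumed.

// Simplified sketch: warp-aggregated atomic increment, along the lines of the
// guide's Discovery Pattern, force-inlined into the calling kernel.
__device__ __forceinline__ int atomicAggInc(int *counter)
{
    unsigned int active  = __activemask();            // lanes currently executing this call
    int          leader  = __ffs(active) - 1;         // lowest-numbered active lane
    unsigned int lane    = threadIdx.x & 31;          // this thread's lane id (1-D blocks assumed)
    unsigned int ltMask  = (1u << lane) - 1u;         // mask of lanes below this one
    unsigned int rank    = __popc(active & ltMask);   // position among the active lanes
    int          warpRes = 0;

    if (rank == 0)                                     // one atomic per warp, done by the leader
        warpRes = atomicAdd(counter, __popc(active));

    warpRes = __shfl_sync(active, warpRes, leader);    // broadcast the base offset
    return warpRes + rank;                             // each active lane gets its own slot
}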

Also,

4- The register usage for older architectures is pretty much the same in CUDA v9.1 and v9.2, but the regcount sum for all our kernels went from about 6700 (v9.1) down to 6400 (v9.2). This has nothing to do with the crashes/erratic behavior, but it is worth noting as proof that the compiler indeed gets better over time.

Could this be a bug, or is there something I am missing regarding Independent Thread Scheduling and function calling?

Again, thank you very much.

I think it could be a compiler bug, but that statement could probably be made in many situations. A more definitive statement could likely be made with a test case in hand.

If you are able to develop a self-contained test case, you may wish to file a bug at developer.nvidia.

If you are using any deprecated intrinsics (you will usually get compiler/ptxas warnings if you are), then that would certainly be cause for concern as you move to CUDA 9/Volta, especially in the presence of divergent code. I mention this because the warp aggregation techniques I am familiar with generally use warp-level intrinsics for part of the work. With Volta, it’s necessary to specify which threads in the warp you expect to participate in warp-level intrinsics, using the mask parameter on said intrinsics (the non-deprecated versions).
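For example (illustrative only), the difference between the deprecated form and the _sync form is the explicit participation mask:

// Illustrative only: deprecated warp intrinsic vs. the _sync replacement.
__device__ unsigned int countActiveOld(int pred)
{
    return __popc(__ballot(pred));              // deprecated: no participation mask
}

__device__ unsigned int countActiveNew(int pred)
{
    unsigned int mask = __activemask();          // lanes expected to participate
    return __popc(__ballot_sync(mask, pred));    // explicit mask states who takes part
}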

Yes, I am aware of the changes required for warp-level intrinsics. Actually, the piece of code I am talking about looks identical to the C.2.5.1 example in the CUDA C Programming Guide, i.e., I am not getting any warnings / we are not using the deprecated versions of those intrinsics.

I can reproduce the problem by simply switching the device function that hosts the atomic increment from forceinline to noinline. Now, what I haven’t tried is to reproduce the problem outside the context of our codebase.