NVCC produces a significantly slower binary in CUDA 11 compared to CUDA 10.1

Dear community,

After upgrading to CUDA 11, NVCC compiles our codebase into a binary that runs 40-80% slower than the one produced by CUDA 10.x.

Let me elaborate a bit:

We are the developers of Maverick Render, a CUDA-based render engine composed of 75-100 fairly hefty CUDA kernels that carry out a variety of heterogeneous tasks (e.g., render logic, scene ray-tracing, physics calculations, …). We have an in-house unit-testing system that runs a large batch of hundreds of scenes, compares the rendered outputs with references, and also measures performance.

If we compile with NVCC from 11.0.1 (or 11.0.2), the total time it takes to run all our tests on an RTX 2080 Ti becomes 9’45", compared to 5’30" if we compile the exact same codebase with the exact same settings with NVCC from CUDA 10.1. We have tried the other GPU architectures we have at our disposal (sm_3x, sm_5x, sm_6x, …); there the performance degradation is a bit less dramatic, but equally measurable.

We have tried to dig a bit deeper and played around with our launch_bounds configurations. We have also reviewed the NVCC documentation to see if there have been any changes in the compiler settings. To no avail.
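For context, by launch_bounds configurations I mean the per-kernel hints that cap register allocation. A minimal sketch (the kernel name and body are illustrative, not from our actual codebase):

```cuda
// __launch_bounds__ tells ptxas the maximum block size and (optionally) the
// minimum number of resident blocks per SM, which bounds how many registers
// it may allocate per thread.
__global__ void __launch_bounds__(256, 2)  // max 256 threads/block, >=2 blocks/SM
shade_kernel(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 0.5f;  // placeholder for the real shading work
}
```

We tried loosening and tightening these bounds on the worst-affected kernels, with no meaningful change in the regression.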

I have also profiled each of our kernels individually to see whether some pattern in our code happens to be especially disliked by the new compiler. But the performance degradation seems to be scattered evenly across all the kernels. Since they are fairly heterogeneous, the symptoms could be summarized as: “the code generated by the new compiler seems to be less optimized (?)”.

Given the same launch_bounds settings, I have observed that the new compiler tends to use fewer registers overall, especially in the fatter kernels. This looks good at first sight, but performance is damaged badly.
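For anyone wanting to reproduce this observation: the per-kernel register counts can be compared between toolchains by asking ptxas for its verbose resource-usage report (installation paths and file names below are illustrative):

```shell
# ptxas prints register/shared-memory usage per kernel to stderr.
/usr/local/cuda-10.1/bin/nvcc -arch=sm_75 --ptxas-options=-v -c kernels.cu -o /dev/null 2> regs_10.1.txt
/usr/local/cuda-11.0/bin/nvcc -arch=sm_75 --ptxas-options=-v -c kernels.cu -o /dev/null 2> regs_11.0.txt
diff regs_10.1.txt regs_11.0.txt
```

This is how we spotted that CUDA 11 assigns noticeably fewer registers to our heaviest kernels under identical launch_bounds.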

There must be something we are overlooking. We have upgraded to every new CUDA release since CUDA 3, and this is the first time something like this has happened.

Are there any changes in NVCC, its optimization policies, or its command-line settings that could be causing this and that we are not aware of?

For the moment we will be forced to stay on CUDA 10.x. But this will become a real problem when the first Ampere cards hit the market.

Thank you all in advance for your help.

A performance regression of this magnitude strongly suggests you should file a bug report with NVIDIA right away. What I would do here is extract the kernel with the worst regression, surround it with scaffolding code to make it buildable and runnable in isolation, and attach that to the bug report as a repro case.
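A minimal repro harness might look something like this (the kernel body, sizes, and launch configuration are placeholders; the point is a single self-contained .cu file that launches the suspect kernel and times it with CUDA events, so NVIDIA engineers can build it with both toolchains):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: paste the real extracted kernel here.
__global__ void suspect_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up launch, then time a fixed number of iterations.
    suspect_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(start);
    for (int it = 0; it < 100; ++it)
        suspect_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernel time: %.3f ms\n", ms / 100.0f);

    cudaFree(d);
    return 0;
}
```

Building this file once with each toolchain and comparing the reported times makes the regression trivially demonstrable in the bug report.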

If the product is built using best-practice fat binaries (embed binary code for every supported architecture, plus PTX for the latest architecture), this should get you code that is functional on Ampere platforms, at the expense of JIT overhead. You would have to benchmark whether the resulting performance is better or worse than that of the native Ampere support in the CUDA 11 toolchain.
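For concreteness, such a fat-binary build would use `-gencode` clauses along these lines (the architecture list and file names are illustrative; adjust to whatever your product actually supports):

```shell
# SASS for each supported architecture, plus PTX (code=compute_75) for the
# newest one so the driver can JIT-compile it for future GPUs such as Ampere.
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_75,code=compute_75 \
     -o renderer kernels.cu
```

The final `code=compute_75` clause is what embeds the PTX that a newer driver can JIT for architectures the compiler did not know about.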