After upgrading to CUDA 11, NVCC compiles our codebase into a binary that runs 40-80% slower than the one built with CUDA 10.x.
Let me elaborate a bit.
We are the developers of Maverick Render, a CUDA-based render engine made up of 75-100 fairly hefty CUDA kernels that carry out a range of heterogeneous tasks (e.g., render logic, scene ray-tracing, physics calculations, …). We have an in-house unit-testing system that runs a batch of hundreds of scenes, compares the rendered outputs against references, and also measures performance.
If we compile with NVCC from 11.0.1 (or 11.0.2), the total time it takes to run all our tests on an RTX 2080 Ti becomes 9’45" (about 77% longer), compared to 5’30" if we compile the exact same codebase with the exact same settings with NVCC from CUDA 10.1. We have tried other GPU architectures that we have at our disposal (sm_3x, sm_5x, sm_6x, …), and the performance degradation there is a bit less dramatic, but still clearly measurable.
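To put the two wall-clock figures above on a common scale, here is a small Python sketch of the arithmetic. The helper name `to_seconds` and the variable names are just for illustration. It reports the slowdown both as extra run time and as lost throughput, since the two percentages differ.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes'seconds" timing into seconds."""
    return minutes * 60 + seconds

# Total test-suite times quoted above (RTX 2080 Ti).
cuda11_s = to_seconds(9, 45)  # built with NVCC 11.0.x
cuda10_s = to_seconds(5, 30)  # built with NVCC 10.1

ratio = cuda11_s / cuda10_s                            # how many times longer
extra_time_pct = (ratio - 1.0) * 100.0                 # extra wall-clock time
throughput_loss_pct = (1.0 - cuda10_s / cuda11_s) * 100.0  # lost tests per second

print(f"slowdown: {ratio:.2f}x, "
      f"extra time: {extra_time_pct:.1f}%, "
      f"throughput loss: {throughput_loss_pct:.1f}%")
# prints: slowdown: 1.77x, extra time: 77.3%, throughput loss: 43.6%
```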
We have tried to dig a bit deeper and experimented with our launch_bounds configurations. We have also reviewed the NVCC documentation to see whether any compiler settings have changed. Neither helped.
I have also profiled each of our kernels individually to see whether some pattern in our code happens to be especially disliked by the new compiler. But the performance degradation seems to be spread evenly across all the kernels. And they are fairly heterogeneous, so the symptoms could be summarized as: “the code generated by the new compiler seems to be less optimized (?)”.
Given the same launch_bounds settings, I have observed that the new compiler tends to use fewer registers overall, especially in the fatter kernels. This looks good at first sight, but performance suffers badly.
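The reason a lower register count looks good at first sight is that it can raise theoretical occupancy. Below is a rough Python sketch of that estimate, not NVIDIA's occupancy calculator. It assumes the per-SM limits I understand to apply to compute capability 7.5 (the RTX 2080 Ti), ignores shared memory, and uses made-up register counts (128 and 80 per thread, 256 threads per block) rather than measurements from our kernels.

```python
import math

# Assumed per-SM limits for compute capability 7.5; verify against
# NVIDIA's occupancy tables before relying on them.
REGS_PER_SM = 65536       # 32-bit registers per SM
MAX_WARPS_PER_SM = 32     # i.e. 1024 resident threads
MAX_BLOCKS_PER_SM = 16
WARP_SIZE = 32
REG_ALLOC_UNIT = 256      # registers are allocated per warp in chunks of this size

def theoretical_occupancy(regs_per_thread: int, threads_per_block: int) -> float:
    """Estimate occupancy limited by registers and thread count only."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Round the per-warp register footprint up to the allocation unit.
    regs_per_warp = math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    warps_by_regs = REGS_PER_SM // regs_per_warp
    blocks = min(warps_by_regs // warps_per_block,
                 MAX_WARPS_PER_SM // warps_per_block,
                 MAX_BLOCKS_PER_SM)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

# Hypothetical register counts for one "fat" kernel under two compilers.
for regs in (128, 80):
    print(regs, "regs/thread ->", theoretical_occupancy(regs, 256))
# prints: 128 regs/thread -> 0.5
#         80 regs/thread -> 0.75
```

Higher theoretical occupancy does not by itself mean faster code. If the lower register count is obtained by spilling to local memory or by giving up instruction-level parallelism, a kernel can still end up slower.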
There must be something we are overlooking. We have upgraded to every new CUDA release since CUDA 3, and this is the first time something like this has happened.
Are there any changes in NVCC, its optimization policies, or its command-line settings that could be causing this and that we are not aware of?
For the moment we are forced to stay on CUDA 10.x. But this will become a real problem when the first Ampere cards hit the market.
Thank you all in advance for your help.