Hello everybody, and thanks in advance.
Our project has a fairly large CUDA codebase with a relatively large number of kernels.
We target all architectures starting at sm_30. While running some tests this week, we found that our code runs more slowly on a GV100 than on a 1080Ti.
After further inspection, I discovered that the register counts for nearly all our kernels are higher for sm_70 than for sm_61, in some cases much higher (+20…40 registers).
We are using NVCC / CUDA 9.1, and as of now there is no arch-dependent code branching between sm_6x and sm_70. We control register usage via __launch_bounds__, and we use the exact same constraints for sm_6x and sm_70. All constraints work out to 64 or 128 regs/thread, depending on how hungry each kernel is. Those are the values that work best for us.
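For reference, here is a rough sketch of how launch bounds translate into a per-thread register cap. It is host-only C++ and only illustrates the arithmetic; it is not the compiler's actual heuristic. The constants (65536 registers per SM, 255 registers per thread, rounding down to a multiple of 8) and the function name regCapFromLaunchBounds are assumptions made for illustration, and the (256, 4) and (256, 2) bounds mentioned afterwards are example values, not necessarily the ones used in our kernels.

```cpp
#include <algorithm>

// Assumed hardware constants (hypothetical names): sm_61 and sm_70 are both
// taken to have 65536 32-bit registers per SM and a 255 regs/thread ceiling.
const int kRegsPerSM = 65536;
const int kMaxRegsPerThread = 255;
// Assumed per-thread rounding unit for register allocation.
const int kRegAllocGranularity = 8;

// Approximate upper limit on registers per thread implied by
//   __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM)
// so that minBlocksPerSM blocks of maxThreadsPerBlock threads fit on one SM.
int regCapFromLaunchBounds(int maxThreadsPerBlock, int minBlocksPerSM) {
    // Split the SM's register file evenly across all resident threads.
    int cap = kRegsPerSM / (maxThreadsPerBlock * minBlocksPerSM);
    // Round down to the assumed allocation granularity.
    cap -= cap % kRegAllocGranularity;
    // Never exceed the architectural per-thread ceiling.
    return std::min(cap, kMaxRegsPerThread);
}
```

Under these assumptions, regCapFromLaunchBounds(256, 4) gives 64 and regCapFromLaunchBounds(256, 2) gives 128, which matches the two budgets mentioned above. The cap is only an upper limit, so the compiler is free to use a different number of registers below it on each architecture.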
Summed over all our kernels, the register count is about 6100 for sm_61 and about 6700 for sm_70 (roughly 10% more).
Any ideas as to why the register usage may be so different? The code (as is) runs approx. 10% slower on our GV100 compared to the 1080Ti cards we own.
Again, thank you very much!