Code compiled for Fermi runs slower on a GTX 465 than code compiled for CC 1.3; compiler fails to compile it properly

I compile and link code with nvcc version 41.28. My code compiled with arch=sm_20 runs slower on a GTX 465 than code compiled with arch=sm_13 (by a factor of 1.30). I think this is due to register spilling.

I implemented Montgomery multiplication with all data in registers. The data size for Montgomery is 3*s+4 words, where s is the operand size in 32-bit words; s equals 32 for a 1024-bit modulus. So for s=32 the Montgomery algorithm uses exactly 100 registers. Lim-Lee exponentiation takes some registers too, so the overall count of register variables is about 105. Ptxas information for the 1.3 build:

ptxas info : Used 106 registers, 5136+0 bytes lmem, 48+16 bytes smem, 264 bytes cmem[1]

When I set arch to sm_20, register spilling occurs and no more than 63 registers are used. When arch=sm_13, each mul.{hi,lo}.32 instruction is substituted with multiple instructions, which perform poorly on a 2.0 device. I did not actually inspect the resulting code; I am trusting the documentation that comes with the CUDA toolchain.

In both cases (sm_13 or sm_20), the code is crippled.

My questions are:

  1. Where does the upper limit of 63 registers come from? How do I remove this limit?

  2. How do I compile my code without having one fast instruction substituted with multiple slow ones, and without spilling heavily-used registers?

  3. Are there any tools (besides the NVIDIA CUDA toolchain) that convert CUDA C/asm into .exe or .o files? I use Linux, if it matters.

Code for compute capability 1.x is never directly executed on compute capability 2.x devices, because they are not able to run it. What happens behind the scenes instead is that the driver takes the PTX representation of your code and dynamically recompiles it for compute capability 2.x. PTX code, however, does not care about the number of registers at all, so some other difference between the compute_13 and compute_20 PTX representations must be responsible for the speed difference you see.
In the special case of CUDA 4.1, this is most likely due to the different choice of compilers for compute capability 1.x and 2.x: 1.x code is by default compiled using an Open64-derived compiler, while 2.x code is compiled using an LLVM-based compiler. You can however choose the compiler manually by passing nvcc the -open64 or -nvvm flag, respectively. This is what I recommend doing in your case.

You can see the actual code that is executed on your device by using cuobjdump -sass, but keep in mind that only code compiled for sm_20 or sm_21 will ever execute on your device.
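Putting the two suggestions together, a command-line sketch (file names are hypothetical; -Xptxas -v additionally prints register and spill statistics per kernel):

```shell
# Compile for sm_20 with the Open64-derived front end,
# printing register usage and spill counts:
nvcc -open64 -arch=sm_20 -Xptxas -v -o montexp montexp.cu

# Same target, but with the LLVM-based front end:
nvcc -nvvm -arch=sm_20 -Xptxas -v -o montexp montexp.cu

# Disassemble the machine code actually generated for the device:
cuobjdump -sass montexp
```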

To answer your questions:

  1. You cannot use more than 63 registers in sm_20 compiled code because the architecture simply doesn't have more: there are no spare bits in the binary instruction format to encode additional registers.

  2. Multiplication instructions are most likely not transformed into multiple machine instructions in your case, because ptxas only does that when compiling for compute capability 1.x. And code compiled for compute capability 1.x does not run on your device at all.

  3. PGI has commercial compilers but given my answers to 1) and 2) I don’t see how this would help you.

Thank you for the quick answer.

My PTX code, plus the PTX code generated from my C code, is present inside the executable… and it seems to use lots of registers… and it is dynamically compiled into machine code suitable for Fermi within a fraction of a second… and the resulting code is faster than the code produced by ptxas/cicc…

Either the dynamic compiler is very good, or the static compiler is very bad.

I did more benchmarks with 4.1.28 and earlier versions. Results:

ver    flags    CC  productivity
41.28  -open64  13  4.39e5
41.28  -open64  20  4.47e5  (winner)
41.28  -nvvm    13  3.32e5
41.28  -nvvm    20  3.31e5
40.17  -        13  4.38e5
40.17  -        20  4.12e5
32.16  -        13  4.41e5  (2nd place)
32.16  -        20  4.11e5

It looks like only nvcc -open64 -arch=sm_20 likes my code. All other static compiler configurations produce code worse than the dynamic PTX compiler. nvcc -nvvm spends more than 9 minutes and produces very slow code; compilation with nvcc -open64 -arch=sm_20 takes 22 seconds.

Goodbye, thanks everybody