Why does cudacc tend to use different registers for input and output for a += b * c?

I found that cudacc seems to prefer compiling ‘a += b * c’ into ‘fma d, b, c, a’ instead of ‘fma a, b, c, a’.

Are there any benefits to doing this? I end up with some registers spilled in the final SASS code.

P.S. Do register bank conflicts involve only the input operands, or all operands?

I believe the PTX code is generated in static single assignment (SSA) form ( https://en.wikipedia.org/wiki/Static_single_assignment_form ) and it is the job of the PTX compiler to optimize this into as few hardware registers as required (in some cases this is done by the NVIDIA driver, e.g. when embedded PTX code is in the binary).

Have you also tried setting the -maxrregcount option during compilation, or tried setting launch bounds kernel attributes?

Your point about avoiding register bank conflicts could be a valid one, but I will leave that to the experts to answer.

If you think the PTX-to-SASS compiler PTXAS could have done a better job in your specific use case, create a short but complete repro case and file a bug at the NVIDIA developer site http://developer.nvidia.com/

PTX does double duty as both a virtual ISA and an intermediate compiler representation. I am not a compiler engineer, but have worked closely with compiler engineers for many years. My understanding is that use of SSA is extremely common in modern compilers as it makes various optimizations easier to apply. That is why you see a new virtual register used every time a new result is written (each virtual register is written to just once).

Since the number of registers available is architecture dependent, it is the job of the PTXAS optimizing compiler to allocate physical registers when compiling PTX into SASS (machine code). There are various conflicting goals in this process. For example, to increase latency tolerance the compiler will try to schedule loads early, but this often extends the live range of variables leading to an increase in register pressure. Likewise, common subexpression elimination reduces dynamic instruction count but can increase register pressure due to temporary variables created to hold the values of subexpressions. On the other hand, increased register use can lead to decreased occupancy and therefore lower performance.
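To make the SSA point concrete, here is a minimal C++ sketch of how a single-assignment instruction stream can be mapped back onto a small set of physical registers. It is a toy greedy allocator for straight-line code and is not a description of how PTXAS works; `Instr`, `allocate`, and the value names are hypothetical. The idea is that a register is released at the last read of the value it holds, so ‘fma a1, b, c, a0’ can end up writing the register that held a0.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// One SSA instruction: dst is written exactly once, srcs are read.
struct Instr {
    std::string dst;
    std::vector<std::string> srcs;
};

// Toy greedy allocation for straight-line SSA code (no branches, no spills).
// Returns a map from virtual register name to physical register index.
std::map<std::string, int> allocate(const std::vector<Instr>& code) {
    // Index of the last instruction that reads each value.
    std::map<std::string, std::size_t> last_use;
    for (std::size_t i = 0; i < code.size(); ++i)
        for (const auto& s : code[i].srcs) last_use[s] = i;

    std::map<std::string, int> phys;
    std::vector<int> free_regs;  // registers whose value is dead
    int next_reg = 0;
    auto take = [&]() -> int {
        if (!free_regs.empty()) {
            int r = free_regs.back();
            free_regs.pop_back();
            return r;
        }
        return next_reg++;
    };

    for (std::size_t i = 0; i < code.size(); ++i) {
        // Deduplicate sources so a value read twice is released only once.
        std::set<std::string> uniq(code[i].srcs.begin(), code[i].srcs.end());
        // Values that are read but were never written (inputs) get a
        // register on first use.
        for (const auto& s : uniq)
            if (!phys.count(s)) phys[s] = take();
        // Release every source whose last read is this instruction.
        for (const auto& s : uniq)
            if (last_use.at(s) == i) free_regs.push_back(phys[s]);
        // The destination may now reuse a register released just above.
        phys[code[i].dst] = take();
    }
    return phys;
}
```

Fed the two instructions ‘fma a1, b, c, a0’ and ‘fma a2, b, c, a1’, this sketch puts a0 and a1 in the same physical register. A real allocator also has to weigh instruction scheduling and other constraints, so it will not always make that choice.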

Note that a small number of spill operations that occur infrequently may not be detrimental to performance, and the increased register use may help with other optimizations that improve performance on balance. PTXAS is reasonably smart about spills, e.g. putting them in outer rather than inner loops, trying to group and vectorize them. Not all local memory stores seen in SASS disassembly are necessarily spills.

95% of the time the compiler manages to strike the right balance that delivers close to optimal performance. Note that the allocation of physical registers per thread in the hardware typically has a granularity > 1, and PTXAS may well take advantage of that to achieve other optimization goals. When you file a bug against the compiler regarding register pressure issues, it will likely be seen as non-actionable unless the resulting decrease in performance exceeds 5%. In other words, don’t sweat the small stuff.
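The effect of allocation granularity on occupancy can be sketched with a little arithmetic. The C++ below is a simplified model, and every constant in it is an assumption for sm_61 that should be checked against NVIDIA's occupancy calculator: 65536 registers per SM, at most 64 resident warps, and per-thread register counts rounded up to a multiple of 8. It ignores block size, shared memory, and every other occupancy limit. The function names are hypothetical.

```cpp
// Assumed figures for sm_61; verify against NVIDIA's documentation.
constexpr int kRegsPerSM = 65536;    // 32-bit registers per SM (assumed)
constexpr int kMaxWarpsPerSM = 64;   // resident warp limit (assumed)
constexpr int kWarpSize = 32;
constexpr int kRegAllocUnit = 8;     // per-thread rounding unit (assumed)

// Registers actually reserved per thread after rounding up to the unit.
int allocated_regs_per_thread(int requested) {
    return (requested + kRegAllocUnit - 1) / kRegAllocUnit * kRegAllocUnit;
}

// Upper bound on resident warps per SM imposed by register use alone.
int max_resident_warps(int regs_per_thread) {
    int per_warp = allocated_regs_per_thread(regs_per_thread) * kWarpSize;
    int warps = kRegsPerSM / per_warp;
    return warps < kMaxWarpsPerSM ? warps : kMaxWarpsPerSM;
}
```

Under these assumptions a kernel that needs 65 registers per thread reserves 72, so using anywhere from 65 to 72 registers costs the same occupancy. That is the kind of slack the compiler can spend on other optimization goals.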

Guess I hit that 5%. The result of an FMA is not used for several dozen cycles, and PTXAS puts the result of a memory load into local storage… immediately after the load instruction…

Impossible to comment on that since we don’t know your code. It is entirely possible that the spill is not relevant to the performance of your code.

Are you using CUDA 8? Are you compiling for sm_5x or sm_6x? Based on informal observations, PTXAS seems to do a better job of allocating registers for these recent architectures compared to previous architectures.

I’m using sm_61. It’s quite weird that if I let the compiler unroll the outer loop, it does not spill in this way. And if I unroll it myself, I get a memory dependency stall.

I’m trying to write the kernel in PTX to see if it gets any better.

BTW, how can I view a register value as a float in NSight?

It is doubtful (but possible) that re-writing the code in PTX will lead to much different machine code. As I mentioned, PTXAS is an optimizing compiler which is responsible for the allocation of physical registers. The machine code is therefore largely decoupled from minor source code changes.

You may get lucky in exercising PTXAS artifacts (I have done so reasonably extensively in the past), but such optimization is brittle in that it will likely stop working with the next version of the compiler.

I worry about that, too. But no better option than just giving it a try now…

And who cares about compatibility in such critical code that never needs to be portable :|

I take it you have thoroughly examined and exploited all higher-level optimization opportunities? That’s usually preferable to any sort of low-level ninja programming that is obsolete half a year down the line.

I have extensive experience (~30 years) with software optimizations from top-level algorithm all the way down to squeezing the last cycle via hand-selecting machine instructions. Finding optimizations at a higher level is always preferable and almost always possible: one can make five passes over the same problem and still find ways to simplify one’s code, without ever going near machine language.

I have done all the high-level optimizations I can currently think of. It’s just a modified matrix multiplication after all. Actually, without that STL following the LDG, it can reach around 9500 GFLOPS on a 1080Ti. It drops to 8000 with that STL. This is… kind of itchy…

Generally speaking, achieving maximum performance on matrix multiplies requires assembly-language programming on pretty much every platform I know (x86, RISC, GPUs). Compiled solutions usually do not achieve more than 75% to 80% efficiency.

Unfortunately, NVIDIA refuses to make a SASS assembler publicly available, which means people have had to roll their own if they wanted to go down that path. You may be able to modify Scott Gray’s Maxwell assembler (https://github.com/NervanaSystems/maxas) for sm_61, with a bit of additional reverse engineering.

Well, 9500 GFLOPS is OK to me. Maybe 8000 GFLOPS would be OK, too, if I hadn’t seen 9500 GFLOPS.

The SASS compiled from PTX doesn’t use local memory now. Hope it will work.

If your research will ultimately be documented in a paper, a pointer to it (e.g. ArXiv draft) would certainly be welcome. I am a bit curious based on the secrecy surrounding the work :-)

Never wrote a paper, especially in English :S But worth some writing in case it actually worked.

Finally made the code work, and it freaked me out with 114xx GFLOPS out of 11791 GFLOPS peak… It will probably not be so high after I make the result right. Maybe I added some extra computation code. I remember the total samples should be around 350000; now it’s 510000…

I found that when I was profiling FLOPS with NSight, the GPU was overclocked to 1873MHz, which will result in a 13425 GFLOPS peak instead of 11791 GFLOPS.
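For reference, both peak figures follow from the usual estimate of two floating-point operations (one FMA) per CUDA core per clock. The C++ below is a small sketch of that arithmetic; the function name `peak_gflops` is hypothetical, and the core count of 3584 for the 1080 Ti is an assumption not stated in this thread.

```cpp
// Peak single-precision throughput in GFLOPS, counting each FMA as two
// floating-point operations per core per clock.
double peak_gflops(int cuda_cores, double clock_mhz) {
    return 2.0 * cuda_cores * clock_mhz / 1000.0;
}
```

With 3584 cores, 1873 MHz gives roughly 13425.7 GFLOPS, and the 11791 GFLOPS figure corresponds to a clock of about 1645 MHz.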

The original code was written for one transposed matrix and one normal matrix. When I switch to two normal matrices, I can only reach 108xx GFLOPS. Still beaten by cuBLAS’s 113xx GFLOPS.