Can't make ptxas generate efficient code

Hello,

I have to compute a lengthy taylor series expansion of the form:
s = st+constant1
s = s
t+constant2

s = s*t+constant40

I have to compute this for multiple inputs inside my kernel. The problem is (looking at the output of cuobjdump).

The problem is that logic in my kernel doesn’t have too heavy register usage, since the kernel computes expansions on a few (4to8) variables, thus there is no need to require many registers to hold 8 temporaries (+ some extra for pointers to write out the result). However, my kernel wants to use maximum amount of registers available plus a kilobyte of local memory to store intermediate results.
In order investigate the problem closer I rewrote my kernel in ptx (using fma.rn) and got almost the same result in register usage. Then I looked at cuobjdump to see how my code looks like in the final cubin file and discovered the following:
for all constants used in calculations ptxas tries to allocate a separate register into which it loads a result, further down that register is used in fma instruction to accumulate the expansion. Since I have many constants, ptxas acquires almost all registers for holding constants, and starts moving accumulation variables to the local memory.
The final code generated by ptx looks like this:
mov r1, constant1
mov r2, constant2

mov r50, constant2

fma r60,r61,r1;
fma r60,r61,r2;
fma r60,r61,r3;

As you can see, not only it eats all the registerst (and starts to spill to the stack) but it also uses extra instruction for each fma instruction.
What I’d expect the code to do is to fetch the constants from constant memory (which actually works for some predefined constants like 1.0, 0.5,etc)
fmat r0,r1,c[2][constant_offs1]
fmat r0,r1,c[2][constant_offs2]
fmat r0,r1,c[2][constant_offs3]
fmat r0,r1,c[2][constant_offs4]

But no matter how I try to write the code, using immediate constants, using explicit constant tables, it always ends up using registers for loading constants instead of using constants directly.

I’m using latest toolkit 5.0 .

Anyone can help me beat ptxas stupidity ?

Thanks.

I’m not sure, but if you don’t need Kepler support you can try doing it again with ptxas 4.0 or something earlier, before nv went to llvm. There’s a slight chance it might just get fixed.

When did nvidia switch to the new llvm ? I’ve tried cuda 4.2, things are even worse there.

Switching to old compiler is not an option, just put them in shared memory or global memory or texture.

Global memory is even worse because ptxas tries to increase the distance between the read of a constant from global memory and the actual use of it, which makes it use even more temporaries. Shared memory doesn’t help - you get the same story - tons of registers used for to hold loaded constants.

Another funny thing is that ptxas completely disregards your own optimizations. I’ve tried to do ‘batched’ expansions, i.e. - fetch constant into temp register manually, and then use it to execute 1 step of expansion for all 8 variables (i.e - do 8 fma’s using that constant). Ptxas simply undoes my batched approach and executes sequentially expansion for 1st variable, then for 2nd variable and so on, reloading constants on every expansion again and again (and again).

Here’s a guess…

Are you qualifying your constants with constant ? If so, then that’s most likely your problem.

Constants qualified with constant can be modified by the host so seemingly obvious compile-time optimizations are not possible because the symbol value could be altered before kernel launch.

Try recompiling with your constants as basic #define’d immediates and see if it helps.

As I mentioned, I rewrote my logic with ptxas. There is not such thing as constant there. And I’m 110% sure it ends up in constant space.

Could you please show the actual code (or provide a link to it)? The CUDA math library uses a lot of polynomial approximations that are evaluated using Horner’s scheme for maximum accuracy, and from looking at the SASS generated I only ever saw a single accumulation register (pair) being used. So I suspect that more is going on in the code at hand (e.g. each constant is used more than once, so the compiler tries to cache it in a register).

Note that allocation of physical registers occurs during the compilation of PTX to SASS, performed by PXTAS, so changing the virtual registers used by PTX tends to make no difference. The PTX generated by the CUDA toolchain generates SSA-style code, in which each new result register is written exactly once. See Wikipedia for a brief description of SSA: http://en.wikipedia.org/wiki/Static_single_assignment_form

Ah, I see. My guess was wrong and I missed you said you tried immediates. :)

One question, did you look at the SASS? PTX can generate some voluminous sequences which look bad but wind up being tight in SASS.

I wrote two quick tiny kernels. One with __fmaf_rn() and the other with __fmul_rn()+__fadd_rn() (yes, I know they’re not quite the same).

As you note, the PTX moves the constant into a register:

mov.f32      %f3, 0f3F3AE148;
fma.rn.f32   %f4, %f1, %f2, %f3;

But dumping the SASS shows that the FFMA instruction pulls its third argument directly from constant state space… which is probably what you were hoping for. ( I suspect that it can’t be an immediate simply because there isn’t enough room to encode a float as an immediate – @hyqneuron probably knows )

/*0070*/     /*0x0030dc0030008800*/ 	FFMA R3, R3, R0, c [0x2] [0x0];

But if you’re seeing unexpected spills to local in the SASS (as indicated by -Xptxas=-v) then I would file a bug. I’ve filed similar “runaway” register allocation bugs in the past and they were fixed.

Gists: fma.cu, fmuladd.cu

A Horner-style evaluation of a polynomial approximation should result exactly in the kind of SASS you are showing above, namely a sequence of FFMAs (or DFMAs) where the addend is a constant bank reference. In addition, the FFMAs (or DFMAs) should share the same destination register (pair), and very few registers should be used overall. As I said, the CUDA math library contains numerous instances of that pattern.

reply

ok, sorry for spam, but it looks like forum doesn’t like not very short messages. will try to split …

Here is an example. I’ve added a switch between bad and how it should be case.

__device__ double* out;

#if 1
// BAD - no progre constant usage

__global__ void test1(double x)
{
  double a = 0.51;
  a = a*x + 0.251;
  a = a*x + 1.01;
  a = a*x + 2.01;
  a = a*x + 4.01;
  a = a*x + 8.01;
  a = a*x + 3.01;
  a = a*x + 5.01;
  a = a*x + -0.51;
  a = a*x + -0.251;
  a = a*x + -2.01;
  a = a*x + -4.01;
  *out = a;
}
#else

// GOOD
__global__ void test1(double x)
{
  double a = 0.5;
  a = a*x + 0.25;
  a = a*x + 1.0;
  a = a*x + 2.0;
  a = a*x + 4.0;
  a = a*x + 8.0;
  a = a*x + 3.0;
  a = a*x + 5.0;
  a = a*x + -0.5;
  a = a*x + -0.25;
  a = a*x + -2.0;
  a = a*x + -4.0;
  *out = a;
}
#endif

Compile as :

nvcc --ptx -O3 --fmad=true -o test1.ptx -m 64 -gencode arch=compute_20,code=\"sm_21\" --ptxas-options=-v test1.cu
ptxas --fmad=true --machine=64 --gpu-name=sm_21 -v -O3 --allow-expensive-optimizations=true -o test1.cubin test1.ptx
cuobjdump --function  _Z5test1d --dump-sass test1.cubin > test1.txt

Output of assembly (test1.txt) where constants are not used:

code for sm_21
		Function : _Z5test1d
	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];
	/*0008*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0010*/     /*0x80009de428004000*/ 	MOV R2, c [0x0] [0x20];
	/*0018*/     /*0x9000dde428004000*/ 	MOV R3, c [0x0] [0x24];
	/*0020*/     /*0x48011de21a147ae1*/ 	MOV32I R4, -0x7ae147ae;
	/*0028*/     /*0xac015de218ff8147*/ 	MOV32I R5, 0x3fe051eb;
	/*0030*/     /*0xa8019de219374bc6*/ 	MOV32I R6, 0x4dd2f1aa;
	/*0038*/     /*0x8801dde218ff4041*/ 	MOV32I R7, 0x3fd01062;
	/*0040*/     /*0x10211c01200c0000*/ 	DFMA R4, R2, R4, R6;
	/*0048*/     /*0x50021de21b851eb8*/ 	MOV32I R8, -0x1eb851ec;
	/*0050*/     /*0xa4019de21b0a3d70*/ 	MOV32I R6, -0x3d70a3d7;
	/*0058*/     /*0xd401dde218ffc0a3*/ 	MOV32I R7, 0x3ff028f5;
	/*0060*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*0068*/     /*0xe8025de21b000051*/ 	MOV32I R9, -0x3fffeb86;
	/*0070*/     /*0x50019de21b851eb8*/ 	MOV32I R6, -0x1eb851ec;
	/*0078*/     /*0xe801dde219000051*/ 	MOV32I R7, 0x4000147a;
	/*0080*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*0088*/     /*0x28029de219c28f5c*/ 	MOV32I R10, 0x70a3d70a;
	/*0090*/     /*0x28019de219c28f5c*/ 	MOV32I R6, 0x70a3d70a;
	/*0098*/     /*0xf401dde219004028*/ 	MOV32I R7, 0x40100a3d;
	/*00a0*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*00a8*/     /*0xf402dde21b004028*/ 	MOV32I R11, -0x3feff5c3;
	/*00b0*/     /*0x14019de21ae147ae*/ 	MOV32I R6, -0x47ae147b;
	/*00b8*/     /*0x7801dde219008014*/ 	MOV32I R7, 0x4020051e;
	/*00c0*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*00c8*/     /*0x50019de21b851eb8*/ 	MOV32I R6, -0x1eb851ec;
	/*00d0*/     /*0xe801dde219002051*/ 	MOV32I R7, 0x4008147a;
	/*00d8*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*00e0*/     /*0x28019de219c28f5c*/ 	MOV32I R6, 0x70a3d70a;
	/*00e8*/     /*0xf401dde219005028*/ 	MOV32I R7, 0x40140a3d;
	/*00f0*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*00f8*/     /*0x48019de21a147ae1*/ 	MOV32I R6, -0x7ae147ae;
	/*0100*/     /*0xac01dde21aff8147*/ 	MOV32I R7, -0x401fae15;
	/*0108*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*0110*/     /*0xa8019de219374bc6*/ 	MOV32I R6, 0x4dd2f1aa;
	/*0118*/     /*0x8801dde21aff4041*/ 	MOV32I R7, -0x402fef9e;
	/*0120*/     /*0x08411c01200c0000*/ 	DFMA R4, R4, R2, R6;
	/*0128*/     /*0x00019de428007800*/ 	MOV R6, c [0xe] [0x0];
	/*0130*/     /*0x1001dde428007800*/ 	MOV R7, c [0xe] [0x4];
	/*0138*/     /*0x08421c0120100000*/ 	DFMA R8, R4, R2, R8;
	/*0140*/     /*0x08809c0120140000*/ 	DFMA R2, R8, R2, R10;
	/*0148*/     /*0x00611ca58c000000*/ 	LDU.E.64 R4, [R6];
	/*0150*/     /*0x00409ca594000000*/ 	ST.E.64 [R4], R2;
	/*0158*/     /*0x00001de780000000*/ 	EXIT;
		..........................

Output of assembly when constands are used how they are supposed to be used

/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];
	/*0008*/     /*0x00001de440000000*/ 	NOP CC.T;
	/*0010*/     /*0x80009de428004000*/ 	MOV R2, c [0x0] [0x20];
	/*0018*/     /*0x9000dde428004000*/ 	MOV R3, c [0x0] [0x24];
	/*0020*/     /*0x00011de218000000*/ 	MOV32I R4, 0x0;
	/*0028*/     /*0x00015de218ff4000*/ 	MOV32I R5, 0x3fd00000;
	/*0030*/     /*0x00211c012008cff8*/ 	DFMA R4, R2, 0x3fe00, R4;
	/*0038*/     /*0x00019de428007800*/ 	MOV R6, c [0xe] [0x0];
	/*0040*/     /*0x14411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x4];
	/*0048*/     /*0x34411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0xc];
	/*0050*/     /*0x54411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x14];
	/*0058*/     /*0x74411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x1c];
	/*0060*/     /*0x94411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x24];
	/*0068*/     /*0xb4411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x2c];
	/*0070*/     /*0xd4411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x34];
	/*0078*/     /*0xf4411c0120048000*/ 	DFMA R4, R4, R2, c [0x10] [0x3c];
	/*0080*/     /*0x1001dde428007800*/ 	MOV R7, c [0xe] [0x4];
	/*0088*/     /*0x14421c0120048001*/ 	DFMA R8, R4, R2, c [0x10] [0x44];
	/*0090*/     /*0x34809c0120048001*/ 	DFMA R2, R8, R2, c [0x10] [0x4c];
	/*0098*/     /*0x00611ca58c000000*/ 	LDU.E.64 R4, [R6];
	/*00a0*/     /*0x00409ca594000000*/ 	ST.E.64 [R4], R2;
	/*00a8*/     /*0x00001de780000000*/ 	EXIT;

One thing I’ve noticed - the bad case is only bad when compiled for sm_20/sm_21 arch. sm_30 is using constants properly. But why the same doesn’t work for sm_21/sm_20 ? It is obviously supported by hardware, as seen in the good case version of my sample

Very interesting example!

Unless I am missing something, the only difference between your “good” and your “bad” case seems to be the numeric value of the constants? If so, I am genuinely puzzled about the SASS differences. I am not in front of a CUDA-cpaable machine right now to run my own experiments. Were these two snippets compiled for different GPU architectures?

As far as I recall, the DFMA instruction in sm_2x does not support a constant bank reference for the addend (while it does on sm_35), so the typical pattern I see for Horner scheme evaluations in the math library on sm_2x is:

MOV32I c_lo, const_low_part
MOV32I c_hi, const_high_part
DFMA a, a, b, c

This uses a single register pair as temporary storage (in your code example that is R6:R7), and results in code with the highest possible performance for sm_2x. Note that your “bad” snippet shows appropriate re-use of the register pairs R2:R3, R4:R5, and R6:R7. Why there are early loads into R8:R9 and R10:R11 of the operands for the DFMAs at the end I cannot say. However the minimum register count target for sm_2x is 16 registers, so if there are unused register available, as in this case, the compiler might decide to use them, as there is no reason not to do so.

While the code differences are interesting, I see no evidence of a code generation issue in the code snippets above. I assume these snippets are not identical to your actual code?

For what it’s worth, those “trimmed” constants can be represented exactly:

2 MOV32I'S + 1 DFMA

	fma.rn.f64 	%fd2, %fd1, 0d3FE051EB851EB852, 0d3FD010624DD2F1AA;
	fma.rn.f64 	%fd3, %fd2, %fd1, 0d3FF028F5C28F5C29;
	fma.rn.f64 	%fd4, %fd3, %fd1, 0d4000147AE147AE14;
	fma.rn.f64 	%fd5, %fd4, %fd1, 0d40100A3D70A3D70A;
	fma.rn.f64 	%fd6, %fd5, %fd1, 0d4020051EB851EB85;
	fma.rn.f64 	%fd7, %fd6, %fd1, 0d4008147AE147AE14;
	fma.rn.f64 	%fd8, %fd7, %fd1, 0d40140A3D70A3D70A;
	fma.rn.f64 	%fd9, %fd8, %fd1, 0dBFE051EB851EB852;
	fma.rn.f64 	%fd10, %fd9, %fd1, 0dBFD010624DD2F1AA;
	fma.rn.f64 	%fd11, %fd10, %fd1, 0dC000147AE147AE14;
	fma.rn.f64 	%fd12, %fd11, %fd1, 0dC0100A3D70A3D70A;

1 DFMA USING CONSTANT
        
	fma.rn.f64 	%fd2, %fd1, 0d3FE0000000000000, 0d3FD0000000000000;
	fma.rn.f64 	%fd3, %fd2, %fd1, 0d3FF0000000000000;
	fma.rn.f64 	%fd4, %fd3, %fd1, 0d4000000000000000;
	fma.rn.f64 	%fd5, %fd4, %fd1, 0d4010000000000000;
	fma.rn.f64 	%fd6, %fd5, %fd1, 0d4020000000000000;
	fma.rn.f64 	%fd7, %fd6, %fd1, 0d4008000000000000;
	fma.rn.f64 	%fd8, %fd7, %fd1, 0d4014000000000000;
	fma.rn.f64 	%fd9, %fd8, %fd1, 0dBFE0000000000000;
	fma.rn.f64 	%fd10, %fd9, %fd1, 0dBFD0000000000000;
	fma.rn.f64 	%fd11, %fd10, %fd1, 0dC000000000000000;
	fma.rn.f64 	%fd12, %fd11, %fd1, 0dC010000000000000;

I’ll leave it to you guys to figure out. Interesting problem!

I’ve compiled both examples for sm_2_1 using the following batch file:

nvcc --ptx -O3 --fmad=true -o test1.ptx -m 64 -gencode arch=compute_20,code=\"sm_21\" --ptxas-options=-v test1.cu
ptxas --fmad=true --machine=64 --gpu-name=sm_21 -v -O3 --allow-expensive-optimizations=true -o test1.cubin test1.ptx
cuobjdump --function  _Z5test1d --dump-sass test1.cubin > test1.txt

The only difference between bad and good case is replacing #if 1 to #if 0 .

My concern is that fermi (i.e - sm_2_0/sm_2_1 ) can actually fetch directly from a constant table as the “good” case demonstrates.

Also please understand that this is a very simplified case just to demonstrate the issue with constant handling. I can’t unfortunately share the full blown version.

Njuffa - here you can see register reuse, but no sharing since the constant is used only once. In the bigger kernel I have code like shown above executed for multiple variables with other operations besides fma used. There ptxas goes crazy with register usage.

I noticed the bit representation differences, but this should only make a difference when the constant can be represented in the immediate bits of the SASS instruction, not switch between the use of MOV32I and constant bank reference. I am also fairly sure that DFMA on sm_2x does not allow constant bank references as operands, although I cannot check that right now. So the only two choices are loading a double-precision constant via MOV32I or representing it as a “short immediate” inside the instruction itself (if it is a compile-time literal constant, not a constant operand).

If I rememeber correctly, sm_2x SASS can represent a double precision constant as an immediate if the most significant 12 (??) bits suffice to represent the entire number (meaning the trailing bits in the bit representation are zero). So double-precision constants like 1.0, 0.5, and 2.0 can be represented as immediates. I do not recall whether this applies to all source operands in a DFMA, though. As far as I am aware, sm_35 SASS no longer supports “short immediates” in double-precision instructions, and instead allows constant bank references, which seems more useful for the general case of arbitrary constants.

In any event these SASS code differences don’t hint at a problem, yet sergeyn reports severe register pressure in his kernel, so I assume the snippets above are not representative of his code. I suspect there are other issues at work, e.g. sharing of coefficients between multiple polynomials in the same kernel causing the compiler to cache these in registers, or the register pressure coming from other parts of the kernel code than the polynomial evaluation. It is impossible to tell without seeing the complete actual source code.

If there does not seem to be a reasonable explanation for the high register pressure observed in the actual code, and there is a noticeable loss of performance due to the high register pressure, please file a bug through the registered developer website.