My kernel is using far more registers than I anticipated, so I took a look at the disassembly with nvdisasm. The generated code is less clever than I expected.
Exhibit A:
FMUL R11, RZ, R4;
A multiply by zero, just to produce a zero. I’ve seen adds with zero, too.
The main problem I’m having is that I have a lot of parameters to pass to my kernel, so I moved them into a structure that I pass to the kernel by value. Here’s the resulting disassembly:
MOV R19, c[0x0][0x15c];
STL.64 [R0+0x8], R10;
MOV R22, c[0x0][0x160];
STL.64 [R0+0x10], R14;
MOV R23, c[0x0][0x164];
STL.64 [R0+0x18], R18;
MOV R26, c[0x0][0x168];
STL.64 [R0+0x20], R22;
MOV R27, c[0x0][0x16c];
MOV R30, c[0x0][0x170];
STL.64 [R0+0x28], R26;
…
followed by three local loads for the three values that are actually needed (I commented out most of the kernel).
In other words, the compiler copies the values from constant memory into local memory and then loads them back from local memory. I guess that means I’ll have to pass every parameter explicitly after all.
I’ve lost some confidence in the compiler, so I’d like some pointers on what I can reasonably expect it to get right.
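For concreteness, here is a minimal sketch of what I’m doing (the names and fields are made up, but the shape is the same). Since the struct is passed by value, it should land in the same kernel-parameter constant bank c[0x0][...] as individually passed scalars, so in principle nothing forces a round trip through local memory:

```cuda
// Minimal sketch (hypothetical names): kernel parameters grouped in a
// struct and passed by value. The struct is placed in the kernel
// parameter constant bank, so field reads could in principle stay
// direct c[0x0][...] loads rather than going through local memory.
struct Params {
    float alpha;
    float beta;
    int   n;
};

__global__ void kernel(Params p, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        out[i] = p.alpha * out[i] + p.beta;
}
```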
Question 1)
Can the compiler resolve static accesses to kernel parameters in constant memory correctly, as in the following (same index for all threads)?

__global__ void kernel(int foo[3])
{
    int val = foo[0];
}

Does it still work if I use a dynamic index that is the same for all threads in the warp?
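By a dynamic index I mean something like the following sketch. (Note: an array parameter such as int foo[3] decays to a pointer, so only the pointer sits in the parameter constant bank; in this sketch I wrap the values in a struct passed by value so the data itself is in the constant bank. Names are made up.)

```cuda
// Sketch of the dynamic-but-warp-uniform case (hypothetical names).
// The values are wrapped in a struct passed by value so they sit in
// the kernel parameter constant bank; idx depends only on blockIdx,
// so it is uniform across the warp but not a compile-time constant.
struct Foo { int v[3]; };

__global__ void kernel(Foo foo, int *out)
{
    int idx = blockIdx.x % 3;  // uniform within the block, dynamic
    out[blockIdx.x] = foo.v[idx];
}
```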
Question 2)
Can the compiler fold all the constants if I use static iteration counts in several nested loops, with monotonically increasing indices for register-array accesses?
In which cases do I need to unroll loops manually using macros?
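To make this concrete, the pattern I mean looks roughly like this (sizes are made up): static trip counts, a small per-thread array, and indices that become compile-time constants once the loops are unrolled, which is what should let the array live in registers rather than local memory.

```cuda
// Sketch of the Question 2 pattern (hypothetical sizes): nested loops
// with static trip counts over a per-thread array. After full
// unrolling, every index into acc[] is a compile-time constant, so
// the array can be kept entirely in registers.
__global__ void kernel(const float *in, float *out)
{
    float acc[4];

    #pragma unroll
    for (int i = 0; i < 4; ++i)
        acc[i] = 0.0f;

    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        #pragma unroll
        for (int j = 0; j < 4; ++j)
            acc[i] += in[4 * i + j];
    }

    #pragma unroll
    for (int i = 0; i < 4; ++i)
        out[i] = acc[i];
}
```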
Question 3)
Does the compiler fail to optimize when it sees computations involving uninitialized registers? I’ve seen some odd instruction sequences in the disassembly.
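The kind of pattern where I’ve seen the odd output, reduced by hand, looks roughly like this (names made up). I realize the read is undefined behavior on one path, which may itself explain the strange code:

```cuda
// Reduced reproducer (hypothetical): val is assigned on only one
// path, so the read below is undefined behavior when flag == 0.
// That alone can license the compiler to emit arbitrary-looking
// instructions for this expression.
__global__ void kernel(float *out, int flag)
{
    float val;            // deliberately left uninitialized
    if (flag)
        val = out[0];
    out[1] = val * 2.0f;  // UB on the flag == 0 path
}
```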
Thanks.