reducing unnecessary register spilling

jeffiek · November 8, 2010, 10:38pm

Hi all,

I could use some help reducing unnecessary register spilling in a fairly complex program. It seems to me that the compiler doesn’t re-use registers. Am I missing a switch or something?
A simplified example of the problem:
__global__ …
{
    int a;
    a = 5;
    // some code
    a = 7;
    // some more code
}

Obviously, “a” only needs one register, but when I look at the ptx code generated by the compiler, I see two registers used. The compiler doesn’t re-use the first one. This leads to register spilling in the “some more code” section. Of course, my program is more complex, but that’s the idea. I use only about 40 registers at any one time, but by the time the compiler is done, it needs over 512.
Any ideas?

Thanks

LSChien · November 9, 2010, 12:13am

PTX code is not optimized.

You should use -Xptxas -v to print the number of registers, the size of local memory, and the size of shared memory used by your kernel.

seibert · November 9, 2010, 1:50am

To expand on LSChien’s answer: nvcc emits PTX in static single assignment form (see Wikipedia for details), so every assignment gets a fresh virtual register. ptxas (usually called by nvcc on your behalf) does the final register assignment, so you have to pass it an option to print the number of registers used in the actual machine code. I usually do this by passing --ptxas-options=-v to nvcc.
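To make the SSA point concrete, here is a toy model in plain C++. It is only a sketch and not how ptxas actually allocates registers; the names `LiveRange`, `virtualRegisters`, and `maxLive` are made up for illustration. Each assignment gets its own virtual register (as in the PTX listing), while the number of physical registers needed is bounded below by how many values are live at the same time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy model: one live range per SSA virtual register.
struct LiveRange {
    int def;      // index of the instruction that writes the value
    int lastUse;  // index of the final instruction that reads it
};

// In SSA form every assignment creates a new virtual register,
// so the virtual register count is just the number of ranges.
std::size_t virtualRegisters(const std::vector<LiveRange>& ranges) {
    return ranges.size();
}

// Peak number of values live at once: a lower bound on the number
// of physical registers needed without spilling.
std::size_t maxLive(const std::vector<LiveRange>& ranges) {
    std::vector<std::pair<int, int>> events;  // (position, +1 or -1)
    for (const LiveRange& r : ranges) {
        assert(r.lastUse >= r.def);
        events.push_back({r.def, +1});          // value becomes live
        events.push_back({r.lastUse + 1, -1});  // value is dead after its last use
    }
    // Pairs sort by position first; at equal positions -1 comes before +1,
    // so a register freed at position p can be reused by a definition at p.
    std::sort(events.begin(), events.end());

    int live = 0;
    int peak = 0;
    for (const auto& e : events) {
        live += e.second;
        peak = std::max(peak, live);
    }
    return static_cast<std::size_t>(peak);
}
```

For the example in the first post, the two assignments to “a” could be modelled as the ranges {0, 3} and {4, 7}: `virtualRegisters` reports 2, matching what shows up in the PTX, while `maxLive` reports 1, which is closer to what the final register assignment has to deal with.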

jeffiek · November 9, 2010, 7:12pm

Hi,
Thanks for the quick help. Not directly applicable, since I’m using the driver API, but it pointed me in the right direction. I started by copying the host code from the VectorAdd sample, modified it, and didn’t notice the max register setting it uses when loading .ptx files.
Oops.

It’s a shame the PTX file isn’t optimized. It helps (me, anyway) to see what’s generated from the C code. Now I think I have the opposite problem: it’s not using enough registers.

I have limited parallelism (threads/block = 64). I’m wondering if more registers could reduce register latency. It’s fast enough, one block ~= one CPU. But of course I can launch many blocks, beating the CPU by about 20. Still, faster is better.

Thanks again.
