Ways to reduce the number of registers

When I compile I get the following:

This only gives me 25% occupancy. What steps can be taken to reduce the number of regs used? Will combining computations help?

For example, I have: c = .5*D*F + C*G - cn2; (all are floats)

Would I be better off calculating D, F, C, G, and cn2 inline?

In the kernel I allocate 33 floats and 6 ints. Does this set the number of regs used?

The simplest way is to pass the “--maxrregcount N” option to nvcc. In that case the compiler will try to limit the number of registers used. It will spill to local (slow!) memory to do this, so test your performance.
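For example (just a sketch; the register limit of 32 is an arbitrary illustration, and exact option spellings can differ between toolkit versions):

nvcc --maxrregcount=32 --ptxas-options=-v -c matrixMul_kernel.cu

The --ptxas-options=-v part makes ptxas report per-kernel register and local memory usage, so you can see whether the limit forced spills into local memory.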

Other options include redesigning your kernel (if possible) to reduce the number of variables. You may also try to use shared memory instead of some registers, as in the sketch below.
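A rough illustration of the shared-memory idea (only a sketch with made-up names and a placeholder body, assuming a fixed block size):

#define BLOCK_SIZE 128   // assumed block size

__global__ void myKernel(float *out, int total)
{
    // Keep a couple of per-thread temporaries in shared memory instead of
    // registers; one slot per thread, indexed by threadIdx.x.
    __shared__ float s0[BLOCK_SIZE];
    __shared__ float s1[BLOCK_SIZE];

    int tx = threadIdx.x;
    int i;
    for (i = blockIdx.x * BLOCK_SIZE + tx; i < total; i += BLOCK_SIZE * gridDim.x)
    {
        s0[tx] = 0.5f * i;          // placeholder computation
        s1[tx] = s0[tx] * s0[tx];
        out[i] = s0[tx] + s1[tx];
    }
}

Whether this actually lowers register pressure depends on what the compiler keeps live; it mainly pays off when several values must survive across a long stretch of code.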

Make sure you’re using CUDA 1.1 since it’s much better in register allocation than 1.0.

Will I have any problems moving to 1.1? Do I just need to do the CUDA 1.1 install and nothing else? In actuality I only need to store 4 of the floats I mentioned. I put all the calculations inline to see if it would reduce the number of registers used. It actually increased them to 53. I have a hard time understanding this. Does anybody have any insight?

You should not have any problems moving to 1.1. If you wish, I can compile your kernel on 1.1 and check whether register usage actually decreases.

And it’s difficult to comment on increased register usage without the original and modified source code…

I have included the source. Check the vortex panel folder for the Foil_CUDA.txt and Field_CUDA.txt files used for importing data.

If you see anything that would decrease my run time, it would be appreciated.

I have also noticed a weird trend with BLOCK_SIZE and GRID_SIZE. I'll try BLOCK_SIZE = 128 and GRID_SIZE = 8 and it runs in 14 ms with correct results when compared to the Matlab code I have. I then try BLOCK_SIZE = 192 and GRID_SIZE = 8. The code runs faster, but some of the data comes back as 0 instead of what is expected. So I go back to 128 and 8, but it won't work! It returns 0's in some of the results. Any thoughts on this?

I am using CUDA 1.0 with VS2005 on a Quadro FX4600
VortexPanel.zip (495 KB)

OK, I found a huge error in my kernel loop that would explain the problem I had with changing block and grid sizes.

I had

for (i = tx; i < total; i += BlockSize*GridSize)

instead of

for (i = bx*BlockSize + tx; i < total; i += BlockSize*GridSize)
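For reference, the corrected indexing in a minimal kernel (a sketch; panelKernel and the empty body are placeholders, and BlockSize/GridSize just mirror the names above):

__global__ void panelKernel(float *out, int total)
{
    int BlockSize = blockDim.x;
    int GridSize  = gridDim.x;
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int i;

    // Each thread starts at its global index and strides by the total
    // number of threads in the grid, so every element is visited exactly once.
    for (i = bx * BlockSize + tx; i < total; i += BlockSize * GridSize)
    {
        out[i] = 0.0f;   // placeholder body
    }
}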

But I still need help with the registers.

Trying to compile your code on 1.1 gives me the following result:

1>"matrixMul_kernel.cu", line 55: warning: variable "bx" was declared but never
1>          referenced
1>      int bx = blockIdx.x;
1>          ^
1>"matrixMul_kernel.cu", line 71: warning: variable "B" was declared but never
1>          referenced
1>   float A, B, C, D, E, F, G, P, Q, at_x, at_y, st, sn;
1>            ^
1>matrixMul_kernel.cudafe1.gpu
1>matrixMul_kernel.cudafe2.gpu
1>Internal error
1>nvcc error : 'ptxas' died with status 0xC0000005 (ACCESS_VIOLATION)

I guess you’d like to wait until this is fixed…

I really love these errors. I like dying computers / code / compilers :blink:

I’ve reproduced this locally and filed a bug. Keep 'em coming!

Thanks,
Mark

Always good to see that bugs are fixed.

So it sounds like I'm better off sticking with CUDA 1.0.

Not necessarily. It is likely that you can work around the compiler failure with some code changes – just experiment. CUDA 1.1 has a lot of improvements – I recommend checking it out.

Mark

Do you have any insight on the types of changes I should be looking to make?

No, sorry. You can localize the error by commenting out portions of the code and seeing what makes the compiler internal error go away.

Mark

I also ran into the same problem. The compiler returns:

Internal error

nvcc error : 'ptxas' died with status 0xC0000005 (ACCESS_VIOLATION)

I hope this can be fixed soon.