Reducing the number of registers used

Hello,

What are the strategies for reducing the number of registers used by a kernel? I have ported some code to CUDA and, even after much tinkering, I cannot get its register usage down (it currently stands at 38 per thread!). This restricts the number of concurrent threads quite severely, so I am unable to make full use of my card.
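To make the link between register count and concurrent threads concrete, here is a rough sketch of the arithmetic. It assumes a register file of 8192 registers per multiprocessor (an assumption; check the figure for your own card) and ignores allocation granularity, shared memory, and the hardware limits on threads and blocks. The function name and parameters are made up for illustration.

```cpp
#include <cassert>

// Upper bound on resident threads per multiprocessor imposed by registers alone.
// regs_per_sm is an assumed hardware figure, not something queried from the device.
int max_threads_by_registers(int regs_per_sm, int regs_per_thread, int block_size) {
    assert(regs_per_thread > 0 && block_size > 0);
    int regs_per_block = regs_per_thread * block_size;  // registers one whole block needs
    int blocks = regs_per_sm / regs_per_block;          // only whole blocks can be resident
    return blocks * block_size;
}
```

Under those assumptions, 38 registers per thread with 64-thread blocks gives 192 resident threads, with 128-thread blocks only 128, and a 256-thread block does not fit at all. At 16 registers per thread, two 256-thread blocks (512 threads) would fit.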

I packed a lot of variables into shared memory but I am still unable to get the register usage to an acceptable range. I looked at breaking up my kernel into multiple kernels but it seems quite impossible, I am afraid.

So, what are the things one should look for when attempting to reduce register usage? Sometimes I am flummoxed by how the compiler decides to use registers (and there is a difference in optimization between Windows and Linux as well, for the same CUDA version).

Cheers,

/x

Looks like you’ve mentioned most of the options :)

Maybe you can post the kernel code… Are you using doubles?

eyal

Nah, not using doubles. Currently, I am looking into maybe rewriting the whole thing to somehow split the damn thing into multiple kernels, but it is quite tough.

I take it you already know about the -maxrregcount compiler flag (you don’t mention it). I have used packed integer types with some success before (I seem to recall I had four 8-bit integers packed into a single int). Apart from that, my only other suggestion would be rearranging code.

Yeah, I used the -maxrregcount option, but that just kills my kernel :(

Yes, I will try and rearrange code. Would be a good exercise.

/x

Reusing variables seems to have helped somewhat in the past.

Declare the per-thread variables (registers and local memory arrays) as volatile. It has reduced my register usage by a significant amount… every time.

NA
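A minimal sketch of the volatile suggestion, written so that it also compiles as plain C++ on the host. The `__device__` macro guard, the function name `scaled_sum`, and its parameters are hypothetical. Whether the qualifier actually lowers the register count depends on the compiler version and the kernel, so it is worth comparing the counts reported by `nvcc --ptxas-options=-v` with and without it.

```cpp
#ifndef __CUDACC__
#define __device__  // lets the sketch compile as plain C++ when nvcc is not used
#endif

// Hypothetical per-thread helper: sums n inputs scaled by a constant factor.
__device__ float scaled_sum(const float* in, int n, float scale) {
    // volatile: every use below is a real read of this one variable, which
    // limits how freely the compiler can duplicate or hoist the value.
    volatile float s = scale;
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += in[i] * s;
    }
    return acc;
}
```

In a real kernel the same qualifier would go on the per-thread temporaries whose live ranges you suspect the compiler is stretching.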

I thought bit arrays were not allowed? Is there an advantage to this over having four individual int8_t variables?

I seem to recall that using one variable and accessing it with bitwise operations used fewer registers than declaring many variables. I think I tried many separate variables and it didn’t make a difference. I can’t find the code, unfortunately. It might have just been my special case.
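Since the original code is lost, the following is only a sketch of what packing four 8-bit values into one 32-bit word might look like. The helper names `get_lane` and `set_lane` are invented for illustration, and the code is plain C++ that should also compile under nvcc if the functions are marked `__device__`. Whether it saves registers in a given kernel has to be measured.

```cpp
#include <cassert>
#include <cstdint>

// Read the 8-bit value stored in lane 0..3 of a packed 32-bit word.
inline std::uint32_t get_lane(std::uint32_t packed, unsigned lane) {
    assert(lane < 4);
    return (packed >> (lane * 8)) & 0xFFu;
}

// Return a copy of packed with the given lane replaced by the low 8 bits of value.
inline std::uint32_t set_lane(std::uint32_t packed, unsigned lane, std::uint32_t value) {
    assert(lane < 4);
    const unsigned shift = lane * 8;
    return (packed & ~(0xFFu << shift)) | ((value & 0xFFu) << shift);
}
```

The trade-off is extra shift and mask instructions on every access in exchange for holding four small values in a single 32-bit variable.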