Reduce no. of registers

What practices have you applied (if any…) in order to achieve it?

I have a kernel that uses 18 registers and I desperately need to get down to 16, because I launch many threads over a lot of data and otherwise the program crashes. I use very little shared memory (I can’t use much of it, since my data are far too large to fit into it).

I have realized that the number of variables is not what matters. Instead, I have managed to reduce register usage by re-thinking some of the calculations.

For example:

for (int i = array[j] + x*y + z; i < array[j] + x*y + z + k; i++) {
     ...
}

is better if transformed to:

int tmp = x*y + z;
for (int i = array[j] + tmp; i < array[j] + tmp + k; i++) {
     ...
}
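
For context, here is a minimal, hypothetical kernel sketch of that same transformation; array, out, x, y, z and k just stand in for whatever the real code uses:

__global__ void example(const int *array, float *out, int x, int y, int z, int k)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int tmp = x*y + z;                              // hoisted out of the loop header
    for (int i = array[j] + tmp; i < array[j] + tmp + k; i++) {
        out[i] += 1.0f;                             // placeholder body
    }
}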

I know that all these are code dependent but I’m just curious to test any other ideas…

Interesting, thanks for sharing – that is counterintuitive, though I guess it makes sense. I haven’t written anything too difficult yet, but it seems there are a lot of places where shared variables work well. You can always emulate the stack with offsets into a shared or global array.

Since you’re not using much shared memory, you can use it for register-like storage, though you’ll have to manage it yourself… (meaning the compiler won’t handle reusing the storage).

Threads access shared memory at effectively the same speed as registers.
You do have to watch for bank conflicts, but in this usage you can avoid them entirely: each thread has its own slots, and you control the layout, so you can make sure simultaneous accesses never land in the same bank.

So, say you need 6 more variables. The easy trick to avoid bank conflicts is to use an odd per-thread stride, so call it 7 (an odd stride is coprime with the 16 banks, so the threads of a half-warp all hit different banks). Your block then needs 7 * numthreads * sizeof(int) bytes of extra storage.

Define one register as a pointer to shared storage:
extern __shared__ int blockData[];
int *myData = blockData + threadIdx.x * 7;

Now, in your code, you can freely use myData[0], myData[5], myData[3], etc. as variables. You could define macros to rename them if it gets confusing.

Elegant? Nope! But it works.
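
Put together, a sketch of the whole scheme might look like the following (the kernel name, block size, and slot macros are just illustrative; the essential parts are the odd per-thread stride and the dynamic shared-memory size passed at launch):

#define EXTRA_VARS 7   // odd stride, so a half-warp's accesses land in 16 different banks

__global__ void myKernel(float *out)
{
    extern __shared__ int blockData[];                  // dynamic shared memory
    int *myData = blockData + threadIdx.x * EXTRA_VARS; // this thread's private slots

    // Optional macros to give the slots readable names.
    #define counter myData[0]
    #define bestIdx myData[1]

    counter = 0;
    bestIdx = threadIdx.x;
    // ... real work would use the myData[] slots instead of extra registers ...
    out[blockIdx.x * blockDim.x + threadIdx.x] = (float)(counter + bestIdx);

    #undef counter
    #undef bestIdx
}

// Host side: request the extra storage when launching.
void launchMyKernel(float *d_out, int numBlocks)
{
    int numThreads = 256;
    size_t extraShared = EXTRA_VARS * numThreads * sizeof(int);
    myKernel<<<numBlocks, numThreads, extraShared>>>(d_out);
}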

An alternative way to avoid bank conflicts is to space each thread’s variables by multiples of 16, but that gets confusing. It does have the advantage of letting you use any number of variables that fits, not just an odd number.
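
A sketch of that alternative layout, assuming the block size is a multiple of 16 so half-warp accesses stay conflict-free (the kernel name and slot usage are illustrative):

__global__ void myKernel16(float *out)
{
    extern __shared__ int blockData[];
    // Variable-major layout: slot k for this thread sits at
    // blockData[k * blockDim.x + threadIdx.x]. A thread's own slots are
    // spaced blockDim.x apart (a multiple of 16), and a half-warp reading
    // slot k touches 16 consecutive words, i.e. 16 different banks.
    #define SLOT(k) blockData[(k) * blockDim.x + threadIdx.x]

    SLOT(0) = threadIdx.x;        // any number of slots that fits, not just an odd count
    SLOT(1) = SLOT(0) * 2;
    out[blockIdx.x * blockDim.x + threadIdx.x] = (float)SLOT(1);

    #undef SLOT
}

// Launch with numSlots * numThreads * sizeof(int) bytes of dynamic shared memory.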

Also, if there is anything that can be computed on the host side and placed in constant memory, that can help as well (though not always, since the compiler likes to load values from constant memory into registers).
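
For example (a minimal sketch; the table name and size are hypothetical), host-computed values go into __constant__ memory with cudaMemcpyToSymbol and are then read directly in the kernel:

#include <cuda_runtime.h>

#define TABLE_SIZE 64                               // hypothetical size

__constant__ float precomputedTable[TABLE_SIZE];    // computed on the host, read by the kernel

__global__ void useTable(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] *= precomputedTable[i % TABLE_SIZE];
}

void uploadTable(const float *hostTable)
{
    // Copy the host-computed values into constant memory before launching.
    cudaMemcpyToSymbol(precomputedTable, hostTable, TABLE_SIZE * sizeof(float));
}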

Finally, there is a compiler option that places a hard limit on the number of registers. It forces the compiler to spill the extra data to local memory, which is likely to reduce speed.
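
With nvcc that cap is the --maxrregcount flag, for example (file names here are just placeholders):

nvcc --maxrregcount=16 -o mykernel mykernel.cu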

I’ve been told that a good way to reduce register usage on the CPU is to block your code, i.e. declare variables as deep in the nesting as possible and put the different jobs in their own {} blocks (even when a block isn’t functionally required), so that variables go out of scope as soon as possible and their registers can be reused.

I haven’t tried it in CUDA, but it may help here as well.
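
A quick sketch of what that blocking looks like (the work inside each block is placeholder code):

__global__ void scoped(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float result;
    {
        // Job 1: these temporaries die at the closing brace, so (in principle)
        // their registers are free to be reused below.
        float a = out[i] * 2.0f;
        float b = a + 1.0f;
        result = a * b;
    }
    {
        // Job 2: fresh temporaries that can reuse the same registers.
        float c = result - 3.0f;
        result = c * c;
    }
    out[i] = result;
}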

Anybody else have good code-level tricks for reducing register count?

The CUDA compiler is clever enough to detect variables that have gone out of scope and reuse their registers, without needing extra {} blocks for that.

It should be, but I’m paranoid and don’t trust it 100%. However, I have yet to run experiments on it. Have you tried?