Arrays and Local Memory

I have a small array: “int data[16];”.

Without this array I use 0 lmem, but with it I get 48 lmem.

The problem is that the program is very complex, and it took me a lot of time to get down to very few “divergent branches” (0-8) and so on.

So when I try to get around lmem, my speed falls by around 25-40%.

Using shared memory is also a pain (128 threads => data[0]…data[127]).

Therefore, is there any way to put arrays into registers?

Nope, registers are not addressable. The compiler will only put arrays up to size 4 into registers, and only where all indexing is done with constants. Any indexing that is not known at compile time will move the array into local memory.

Thanks for your quick reply!

You mean “int data[4];” would be OK? (I am not at the other computer, so I can’t try it out now.)

Yeah, but only if you use data[0], data[1], … data[3] in all following code.

As soon as you’ve got something like data[i] where i is a non-const variable, you’re back to local memory.
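To illustrate the distinction, here is a minimal sketch. The function names sum_static and pick_dynamic are made up for this example, and the macro shim at the top is only there so the same code also builds as plain C++ for a quick host-side check. In the first function every index is a compile-time constant, so the compiler is free to keep the array in registers; in the second the index is a runtime value, which is the case that typically ends up in local memory. Whether lmem is actually used on a given toolchain has to be confirmed from the ptxas report (for example nvcc --ptxas-options=-v).

```cpp
// Shim (assumption: nvcc defines __CUDACC__; a plain C++ compiler does not).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// All indices are literal constants, so no addressable storage is required
// and the four elements can live in registers.
__host__ __device__ inline int sum_static(int a, int b, int c, int d)
{
    int data[4] = {a, b, c, d};
    return data[0] + data[1] + data[2] + data[3];
}

// idx is only known at run time, so the array needs addressable storage,
// which on the GPU means local memory.
__host__ __device__ inline int pick_dynamic(int a, int b, int c, int d, int idx)
{
    int data[4] = {a, b, c, d};
    return data[idx];
}
```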

I am having a similar problem with matrix structs consisting of four float2 types. As soon as I pass them as arguments into a function (even by reference), they are passed through local memory. So local memory takes on the role of the stack that you find in “normal” CPU architectures.

However when I pass all the float2s explicitly, the data remains in registers. It’s annoying that I have to do more typing, but at least I’ve found the workaround.
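A rough sketch of the two calling styles described above. Mat4x2, sum_by_ref and sum_explicit are hypothetical names, not taken from the original code, and the shim (including the stand-in float2) only exists so the example also compiles without the CUDA headers. Both functions compute the same result; whether the by-reference version really goes through local memory depends on the compiler version and on inlining, so check the lmem figure in the ptxas output.

```cpp
// Shim (assumption: nvcc defines __CUDACC__ and provides float2 itself).
#ifndef __CUDACC__
#define __host__
#define __device__
struct float2 { float x, y; };  // host-only stand-in for CUDA's float2
#endif

// Hypothetical matrix type built from four float2 values.
struct Mat4x2 { float2 r0, r1, r2, r3; };

// Variant A: the struct is passed by reference. Taking its address may
// force it into addressable (local) memory if the call is not inlined.
__host__ __device__ inline float sum_by_ref(const Mat4x2 &m)
{
    return m.r0.x + m.r0.y + m.r1.x + m.r1.y
         + m.r2.x + m.r2.y + m.r3.x + m.r3.y;
}

// Variant B: each float2 is passed explicitly by value, the workaround
// from the post. More typing, but nothing needs an address.
__host__ __device__ inline float sum_explicit(float2 r0, float2 r1,
                                              float2 r2, float2 r3)
{
    return r0.x + r0.y + r1.x + r1.y
         + r2.x + r2.y + r3.x + r3.y;
}
```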

Christian

Since your array is not so big, you could try replacing indexing with calculations, so instead of writing val = data[i] you’d write something like this:

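int data0, data1, data2, data3, ... data15;

int SampleDataArray(int idx)
{
  // (k == idx) is 1 for the matching index and 0 otherwise,
  // so only the selected variable contributes to val.
  int val;
  val  = data0 * (0 == idx);
  val += data1 * (1 == idx);
  val += data2 * (2 == idx);
  val += data3 * (3 == idx);
  ...
  val += data15 * (15 == idx);
  return val;
}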

I’ve replaced data[16] with separate variables to avoid the case cbuncher1 mentions. You’d probably want to use __umul24 instead of ‘*’ to make the math faster (if your values fit in 24 bits), or something else like val += datax & ((x != idx)-1). Maybe the fastest one would be val = x == idx ? datax : val; Just check the assembly output and pick the fastest code.

Also note that putting your array into registers will increase register use, which can decrease occupancy, so just try it and see if you get any benefit.
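For reference, the three selection variants mentioned above could look roughly like this, shortened to four values to keep it readable. The function names are invented for this sketch, and the shim only lets it compile as plain C++ as well. All three return the value whose position matches idx, or 0 if idx is out of range; which one is fastest is something only the generated assembly and a timing run can tell you.

```cpp
// Shim (assumption: nvcc defines __CUDACC__; a plain C++ compiler does not).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Variant 1: multiply by the comparison result (1 or 0).
__host__ __device__ inline int select_mul(int d0, int d1, int d2, int d3, int idx)
{
    int val = d0 * (0 == idx);
    val += d1 * (1 == idx);
    val += d2 * (2 == idx);
    val += d3 * (3 == idx);
    return val;
}

// Variant 2: bit mask. (k != idx) - 1 is -1 (all bits set) when k == idx
// and 0 otherwise, so the AND keeps only the selected value.
__host__ __device__ inline int select_mask(int d0, int d1, int d2, int d3, int idx)
{
    int val = d0 & ((0 != idx) - 1);
    val += d1 & ((1 != idx) - 1);
    val += d2 & ((2 != idx) - 1);
    val += d3 & ((3 != idx) - 1);
    return val;
}

// Variant 3: conditional select, which the compiler can usually turn
// into predicated moves instead of branches.
__host__ __device__ inline int select_ternary(int d0, int d1, int d2, int d3, int idx)
{
    int val = 0;
    val = (0 == idx) ? d0 : val;
    val = (1 == idx) ? d1 : val;
    val = (2 == idx) ? d2 : val;
    val = (3 == idx) ? d3 : val;
    return val;
}
```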

Thanks sergeyn!

As soon as I could, I tried it out. Unfortunately I then ran into problems with occupancy: the speed increased when I started with very few blocks, but with a higher number of blocks it decreased by more than 30%. Anyway, at least I learned a few more tricks! :)