Variables that are put in local memory
a) Arrays that are dynamically indexed, i.e. array[index] where index is only known at run time, instead of array indexed by a constant.
b) Any variable, when the number of required registers exceeds the maximum. You do not know which one or when, but the compiler will try to minimise the number of stores and loads, so variables that you access infrequently but whose values must be kept for a big part of your kernel are the most likely candidates.
c) Apparently, a struct that you load from global memory into register space all at once (that is my current problem, btw). Try loading it element by element instead.
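For (c), here is a rough sketch of what "element by element" could look like. The struct `Particle`, the kernel `weightedX` and the host helper `weightedXOnHost` are made-up names for illustration only, and whether this actually keeps the data out of local memory depends on your compiler version, so treat it as something to try rather than a guaranteed fix.

```cuda
#include <cuda_runtime.h>

// Hypothetical struct, used only for illustration.
struct Particle {
    float x, y, z;
    float mass;
};

__global__ void weightedX(const Particle* particles, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Whole-struct copy, the pattern described in (c):
    //   Particle p = particles[i];
    // Field-by-field loads into plain scalars instead:
    float x    = particles[i].x;
    float mass = particles[i].mass;

    out[i] = x * mass;
}

// Host helper (hypothetical): runs the kernel on a single particle
// and copies the result back.
float weightedXOnHost(Particle p)
{
    Particle* dIn = nullptr;
    float* dOut = nullptr;
    float result = 0.0f;
    cudaMalloc(&dIn, sizeof(Particle));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, &p, sizeof(Particle), cudaMemcpyHostToDevice);
    weightedX<<<1, 1>>>(dIn, dOut, 1);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Only the fields the kernel really needs are loaded, which also cuts down on global memory traffic.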
Variables that are not put in local memory must conform to these constraints:
There must be enough registers to store the data
Which element is accessed must be known precisely at compile time
You should keep in mind that spilling to local memory happens at the register level, not at the source code level. A variable might not be assigned to local memory, but the intermediate value in a complex expression could be put into local memory. Variables in the source code do not have a one-to-one correlation with registers.
This is actually the easy part! When you address an array by the value of another variable, that is indirect addressing. Loop unrolling may help you, but not necessarily.
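To make the difference concrete, here is a minimal sketch with two toy kernels. All names (`pickDynamic`, `sumUnrolled` and the `run...` host helpers) are invented for this example. In the first kernel the index comes from memory at run time. In the second, the loops have a fixed trip count, so after unrolling every index is a constant. Whether the compiler really uses local memory in the first case depends on the toolchain.

```cuda
#include <cuda_runtime.h>

// Run-time index: `pick` comes from memory, so the compiler cannot tell which
// element of `vals` is read and may place the array in local memory.
__global__ void pickDynamic(const int* pick, float base, float* out)
{
    float vals[4];
    for (int k = 0; k < 4; ++k)
        vals[k] = base * (k + 1);
    out[0] = vals[pick[0] & 3];   // mask keeps the index within 0..3
}

// Compile-time indices: once both loops are fully unrolled, every access is
// vals[0] .. vals[3], so the elements can be kept in registers.
__global__ void sumUnrolled(float base, float* out)
{
    float vals[4];
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        vals[k] = base * (k + 1);

    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        sum += vals[k];
    out[0] = sum;
}

// Host helpers (hypothetical): launch a single thread and fetch the result.
float runPickDynamic(float base, int pick)
{
    int* dPick = nullptr;
    float* dOut = nullptr;
    float result = 0.0f;
    cudaMalloc(&dPick, sizeof(int));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dPick, &pick, sizeof(int), cudaMemcpyHostToDevice);
    pickDynamic<<<1, 1>>>(dPick, base, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dPick);
    cudaFree(dOut);
    return result;
}

float runSumUnrolled(float base)
{
    float* dOut = nullptr;
    float result = 0.0f;
    cudaMalloc(&dOut, sizeof(float));
    sumUnrolled<<<1, 1>>>(base, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return result;
}
```

Compiling with `nvcc -Xptxas -v` prints the per-kernel resource usage, which is the easiest way to check whether a given kernel ended up using local memory.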
On a device with compute capability 1.2 or higher you can launch up to 1024 threads, that's 16 bytes of shared memory per thread. On a device with capability 1.1 or lower you have only 8192 bytes of shared memory in total, and you can launch up to 768 threads, that is about 10.7 bytes per thread.
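The per-thread numbers above are just the total budget divided by the number of threads. A tiny host-side sketch, taking the quoted totals at face value (`perThread` is a made-up helper, not part of any CUDA API):

```cuda
// Per-thread share of a fixed budget when `threads` threads have to split it.
// Plain host code, so it also compiles inside a .cu file.
double perThread(int budget, int threads)
{
    return static_cast<double>(budget) / threads;
}
```

With the figures from the post, `perThread(16384, 1024)` gives exactly 16, and `perThread(8192, 768)` gives roughly 10.67.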
ATI's current best card is probably better than NVIDIA's current best card, but the ATI one is fairly new, while NVIDIA's architecture is over a year old. Fermi is on the way, but you will need to wait a few more months.
Apart from the newest ATI cards, which can handle DirectX 11, programming ATI hardware has been somewhat more complicated…
ispla, you are quite demanding. Make me this, make me that… :P