At the top of my wish list are memory optimization directives:
register : would force a variable into registers, even short arrays. This one is my most important wish.
shared_local : would force a variable LOCAL to one thread into shared memory. This one is a “nice to have” wish.
nodevice : a combination of register and shared_local where the compiler can decide. This one is the least important wish.
I would guess that the most important of these directives (register) is also the easiest to implement.
Usage scenarios:
These directives would especially help scenarios where single-threaded functions are ported to kernel functions just to make use of multiple processors, and where multiple threads do not work on a common large data structure. One example area is running simulations on the GPU: there you have reduced data as input (usually simulation bounds and seeds), reduced data as output, and a single kernel thread tends to run long, usually without needing access to large data structures. Therefore avoiding local/device memory is possible and could result in huge performance improvements.
Basically, this can benefit any algorithm that can be executed in parallel but does not have large data I/O.
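To make the scenario concrete, here is a minimal sketch of such a kernel (the names simulate, seeds, results and the LCG constants are just illustrative, not from any real code):
__global__ void simulate(const unsigned int *seeds, int *results, int steps)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int state = seeds[tid];   // reduced input: one seed per thread
    int score = 0;                     // all working state fits in registers
    for (int i = 0; i < steps; i++) {  // long-running loop, no large data I/O
        state = state * 1664525u + 1013904223u;  // standard LCG step
        score += (state >> 24) & 1;
    }
    results[tid] = score;              // reduced output: one int per thread
}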
register explanation:
This one is needed because the compiler right now will put any long structure or array in local memory. So if I have a kernel that needs something like this:
int seed,a,b,c;
int n[8];
for (int i=0; i<8; i++) n[i]=0; // example of accessing the array
n[seed % 8]+=a;
this should fit easily in 16 registers per kernel thread, but unfortunately the compiler will put n[8] into local memory, significantly slowing down the kernel. There is no elegant workaround for this - what I use right now is:
int seed,a,b,c;
int n0=0,n1=0,n2=0,n3=0,n4=0,n5=0,n6=0,n7=0;
switch(seed % 8){
 case 0: n0+=a; break;
 case 1: n1+=a; break;
 ...
 case 7: n7+=a; break;
}
The above is just an example of accessing short array elements, not a real kernel, and while it is very unreadable and against my normal C practices, it still performs hundreds of times faster than letting the compiler put the array in local memory.
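For completeness: another workaround that sometimes helps (depending on the compiler version - this is not guaranteed) is to fully unroll every loop that touches the array, so all indices become compile-time constants and the compiler is free to keep the array in registers. A sketch (the kernel name and parameters are made up for the example):
__global__ void histogramStep(int seed, int a, int *out)
{
    int n[8];
    #pragma unroll
    for (int i = 0; i < 8; i++) n[i] = 0;  // unrolled: all constant indices

    int target = seed % 8;                 // the dynamic index, computed once
    #pragma unroll
    for (int i = 0; i < 8; i++)      // after unrolling, every n[i] access has a
        if (i == target) n[i] += a;  // compile-time-constant index, so the write
                                     // can become a predicated register operation
                                     // instead of a local-memory store
    #pragma unroll
    for (int i = 0; i < 8; i++) out[threadIdx.x * 8 + i] = n[i];
}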
So what I want here is:
__register__ int seed,a,b,c;
__register__ int n[8];
for (int i=0; i<8; i++) n[i]=0; // example of accessing the array
n[seed % 8]+=a;
and if the compiler cannot fit the variables marked register into registers, it should report an error.
shared_local explanation:
This one is a bit harder to explain or implement than the above, but has the same rationale - avoiding local/device memory. In cases where I need more “register” memory for a single kernel thread than I have registers, I would like to be able to use shared memory as per-thread local memory. I can do that even now for arrays by indexing into shared memory by thread ID, but I cannot do it for an int, float, or struct per thread, and also the shared memory layout is currently optimized to avoid bank conflicts between multiple threads, which is not good if per-thread usage is wanted.
For example, if I have the following variables local to a thread (which I would like to be in shared memory only because register space is full):
int seed;
int n[32];
int a;
for (int i=0; i<32; i++) n[i]=seed+a*i; //example of usage
I could work around that right now by indexing into a shared array, like:
extern __shared__ int shared[];
...
const int numberOfRegularSharedInts=1024;
const int threadSharedLocalSize=1+32+1; // seed + n[32] + a
int idx=threadIdx.x*threadSharedLocalSize+numberOfRegularSharedInts;
#define map(k) shared[idx+k]
#define seed map(0)
#define n(k) map(1+k)
#define a map(33)
for (int i=0; i<32; i++) n(i)=seed+a*i; //example of usage
but that has disadvantages like:
- complicated to write
- error prone (offsets need to be added carefully)
- lower performance due to the addition and indexing: a simple a=3, instead of an assignment to a static address, takes 3 basic instructions - the addition (computing idx+33 for a), the indexing, and the assignment
- most important, it is not optimized to avoid bank conflicts
This last one is the most important reason why doing it by hand is hard - right now consecutive addresses in shared memory are in different banks, to avoid conflicts when multiple threads work on consecutive data elements. BUT in this case we want a single thread to work on consecutive addresses. So in the above example, n(0)=1 in thread #0 can clash with n(0)=3 in thread #1.
The solution would be to put consecutive integers from a single thread into the same bank (the bank pattern repeating every 16 integers, if I understood correctly).
So the above example would look something like:
extern __shared__ int shared[];
...
const int numberOfRegularSharedInts=1024;
const int threadSharedLocalSize=48; // needs to be divisible by 16
int idx16=threadIdx.x%16;
int idxM=threadIdx.x/16;
#define map(k) shared[numberOfRegularSharedInts + idxM*16*threadSharedLocalSize + idx16 + k*16]
#define seed map(0)
#define n(k) map(1+k)
#define a map(33)
for (int i=0; i<32; i++) n(i)=seed+a*i; //example of usage
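A quick host-side sanity check of that layout (assuming 16 banks with bank = word address % 16, as I understand compute 1.x devices work) shows that one thread stays in one bank while neighbouring threads land in different banks:
#include <stdio.h>
int main(void)
{
    const int base = 1024, size = 48; // numberOfRegularSharedInts, threadSharedLocalSize
    for (int tid = 0; tid < 3; tid++) {
        int idx16 = tid % 16, idxM = tid / 16;
        for (int k = 0; k < 3; k++) {
            int addr = base + idxM*16*size + idx16 + k*16;
            printf("thread %d, k=%d -> addr %d, bank %d\n", tid, k, addr, addr % 16);
        }
    }
    // thread 0 hits bank 0 for every k, thread 1 stays in bank 1, and so on
    return 0;
}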
What I’m suggesting here is much simpler:
__shared_local__ int seed;
__shared_local__ int n[32];
__shared_local__ int a;
for (int i=0; i<32; i++) n[i]=seed+a*i; //example of usage
Advantages of that would be:
- the compiler would work out the index math, eliminating the chance of errors
- the compiler can map non-array integers to fixed addresses, so “int a;” would map to “shared[1316]” at compile time, making it as fast as if a were a register variable (as opposed to 2-3 times slower if shared[1000+n] were done)
- the compiler could even map structs in the above way (doing that by hand would be a nightmare; see the sketch after this list)
- the compiler can “upgrade” __shared_local__ to register whenever enough registers are present
- it would be optimized to avoid bank conflicts
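To illustrate the struct point from the list above, usage under the proposed directive might look like this (hypothetical syntax, since the directive does not exist; the Particle struct is made up for the example):
struct Particle { float x, y, vx, vy; };

__shared_local__ struct Particle p;  // compiler works out per-field offsets
p.x += p.vx;                         // each field access would compile to a fixed,
p.y += p.vy;                         // bank-conflict-free shared memory address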
nodevice explanation:
This one is not as essential as the other two. It is a combination of register and shared_local where the compiler can decide which one to use, as long as it is not device local memory.
Some guidelines for the compiler could be:
- give preference to registers, just as it does now
- spread the usage of registers and __shared_local__ evenly, increasing the possibility of more concurrent threads
For example, if my kernel function needs 768 nodevice integers, the total registers (per multiprocessor) are 8192 int32, and let's say total shared memory is 4096 int32 per multiprocessor, then the compiler could split the variables into:
- 512 int32 into registers, used for the more important variables
- 256 int32 into __shared_local__, used for the less important ones, arrays, etc.
This way one kernel thread would use 1/16th of both the register and shared pools, allowing 16 threads to run at the same time on one multiprocessor. Without such a “smart” split - if everything went into registers, for example - only 10 threads would run in parallel (8192/768 rounds down to 10).
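As a sketch of the arithmetic (the helper function and its names are made up, just to show the balancing rule): using the same fraction of both pools maximizes the number of concurrent threads.
#include <stdio.h>

/* Hypothetical helper: split 'needed' per-thread int32 values between the
   register pool and the shared pool so both are used in equal proportion. */
int balancedSplit(int regsPerSM, int sharedIntsPerSM, int needed,
                  int *toRegs, int *toShared)
{
    *toRegs = (int)((long long)needed * regsPerSM / (regsPerSM + sharedIntsPerSM));
    *toShared = needed - *toRegs;
    int byRegs = regsPerSM / *toRegs;           /* threads limited by registers */
    int byShared = sharedIntsPerSM / *toShared; /* threads limited by shared mem */
    return byRegs < byShared ? byRegs : byShared;
}

int main(void)
{
    int r, s;
    int threads = balancedSplit(8192, 4096, 768, &r, &s);
    printf("%d in registers + %d in shared -> %d threads/SM\n", r, s, threads);
    // prints: 512 in registers + 256 in shared -> 16 threads/SM
    return 0;
}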
This split could be done manually, but since the number of registers and the amount of shared memory per multiprocessor will change across CUDA devices, if the compiler knows what type of device it is compiling for (say via a 1.1 directive), it could optimize accordingly. When tomorrow we switch to a 2.1 device, it could reoptimize without the programmer needing to revisit every kernel function (which would be needed in the case of manual balancing).
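The per-device limits the compiler would balance against can already be queried at runtime through the CUDA runtime API (this part is real API, shown just to illustrate that the numbers vary per device generation):
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute %d.%d: %d registers per block, %u bytes shared per block\n",
           prop.major, prop.minor, prop.regsPerBlock,
           (unsigned)prop.sharedMemPerBlock);
    return 0;
}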