I have a problem with shared memory that I don’t fully understand.
My kernel declares three shared arrays like this:
__shared__ unsigned char sh_progs[2048]; // 2048 = maximum number of nodes in a binary tree of depth 11 (2047 actually)
__shared__ unsigned char sh_adf1_progs[2048];
__shared__ unsigned char sh_adf2_progs[2048];
The arrays sh_adf1_progs and sh_adf2_progs are only filled with data if another parameter, nrProgsArrays, is > 1. The copy looks like this (part of a loop):
// Main program: always staged into shared memory.
sh_progs[offset] = progs[base+offset];
// First ADF array: needed whenever there are at least two program arrays.
if(nrProgsArrays>=2)
    sh_adf1_progs[offset] = adfProgs1[(int)adfProgs1[programIndex]+offset];
// Second ADF array: only needed with three program arrays.
if(nrProgsArrays==3)
    sh_adf2_progs[offset] = adfProgs2[(int)adfProgs2[programIndex]+offset];
What I do here is copy a small part of a very big array from global memory into shared memory. The copied data is unique to each block (I'm using 2048 blocks) and is read very often, so I expected keeping it in shared memory to improve performance noticeably.
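To make the pattern concrete, here is a minimal, self-contained sketch of what I mean by staging a block's data in shared memory. It is not my real kernel: stageAndSum, runStageDemo, MAX_NODES, and the assumption that the programs are stored back to back in progs are all made up for illustration, and the summing loop only stands in for the real work.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_NODES 2048  // same size as sh_progs in my kernel

// Hypothetical kernel: each block copies its own slice of progs into
// shared memory, then reads it back from there.
__global__ void stageAndSum(const unsigned char *progs, int progLen, unsigned int *out)
{
    __shared__ unsigned char sh_progs[MAX_NODES];

    // Assumption for this sketch: programs are stored back to back.
    int base = blockIdx.x * progLen;

    // The threads of the block cooperate: each copies every blockDim.x-th byte.
    for (int offset = threadIdx.x; offset < progLen; offset += blockDim.x)
        sh_progs[offset] = progs[base + offset];

    // Every byte must be written before any thread reads the array.
    __syncthreads();

    // Stand-in for the real work: thread 0 sums the staged bytes.
    if (threadIdx.x == 0) {
        unsigned int sum = 0;
        for (int i = 0; i < progLen; ++i)
            sum += sh_progs[i];
        out[blockIdx.x] = sum;
    }
}

// Host helper: returns 0 if every block's sum matches the CPU result, 1 otherwise.
int runStageDemo()
{
    const int numBlocks = 4, progLen = 256, threads = 64;  // progLen <= MAX_NODES
    std::vector<unsigned char> h_progs(numBlocks * progLen);
    for (size_t i = 0; i < h_progs.size(); ++i)
        h_progs[i] = (unsigned char)(i % 7);

    unsigned char *d_progs = NULL;
    unsigned int *d_out = NULL;
    if (cudaMalloc(&d_progs, h_progs.size()) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, numBlocks * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_progs);
        return 1;
    }
    cudaMemcpy(d_progs, h_progs.data(), h_progs.size(), cudaMemcpyHostToDevice);

    stageAndSum<<<numBlocks, threads>>>(d_progs, progLen, d_out);

    std::vector<unsigned int> h_out(numBlocks);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out,
                                 numBlocks * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_progs);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    // Compare each block's result with a sum computed on the host.
    for (int b = 0; b < numBlocks; ++b) {
        unsigned int expected = 0;
        for (int i = 0; i < progLen; ++i)
            expected += h_progs[b * progLen + i];
        if (h_out[b] != expected) return 1;
    }
    return 0;
}
```

Calling runStageDemo() from main and checking for a return value of 0 is enough to confirm that the staged copy matches the global data.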
The only place I use these shared arrays is in a call to another device function, stackInterpreter, which reads the data frequently. The call inside the kernel looks like this:
if(nrProgsArrays==1)
    stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==2)
    stackInterpreter(sh_progs, sh_adf1_progs, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==3)
    stackInterpreter(sh_progs, sh_adf1_progs, sh_adf2_progs, stack, bar, lastSignal);
My problem is this: when nrProgsArrays is 1, so that sh_adf1_progs and sh_adf2_progs are never actually used, the execution takes X time. But if I replace sh_adf1_progs and sh_adf2_progs with NULL in the other two calls as well (those branches are never taken, because nrProgsArrays is 1), so that the code looks like this:
if(nrProgsArrays==1)
    stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==2)
    stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
else if(nrProgsArrays==3)
    stackInterpreter(sh_progs, NULL, NULL, stack, bar, lastSignal);
then the execution takes 0.5 X time. So for some reason the kernel takes twice as long when the arguments in the branches that are never taken (because nrProgsArrays is 1) are the shared arrays instead of NULL.
How can this happen?
I also tried not using shared memory for sh_adf1_progs and sh_adf2_progs at all and passing the global-memory arrays directly into the function (so I assume they stay in global memory), and that was also faster than using shared memory.
I think I may be missing some important aspect of shared memory here. Is there perhaps a threshold beyond which using more shared memory stops paying off, because some kind of swapping overhead comes into play?
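One check I could run is whether the amount of static shared memory per block changes how many blocks fit on a multiprocessor at once. The following is only a sketch of that check, not a measurement: oneArray, threeArrays, staticSharedBytes, and blocksPerSM are hypothetical stand-ins for my kernel with one and with three 2048-byte arrays, and the block size of 256 is an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel using a single 2048-byte shared array.
__global__ void oneArray(unsigned char *out)
{
    __shared__ unsigned char a[2048];
    a[threadIdx.x] = (unsigned char)threadIdx.x;
    __syncthreads();
    int n = (threadIdx.x + 1) % blockDim.x;  // read a neighbour's element
    out[blockIdx.x * blockDim.x + threadIdx.x] = a[n];
}

// Stand-in kernel using three 2048-byte shared arrays.
__global__ void threeArrays(unsigned char *out)
{
    __shared__ unsigned char a[2048];
    __shared__ unsigned char b[2048];
    __shared__ unsigned char c[2048];
    a[threadIdx.x] = (unsigned char)threadIdx.x;
    b[threadIdx.x] = (unsigned char)(threadIdx.x + 1);
    c[threadIdx.x] = (unsigned char)(threadIdx.x + 2);
    __syncthreads();
    int n = (threadIdx.x + 1) % blockDim.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = (unsigned char)(a[n] + b[n] + c[n]);
}

// Static shared memory the compiled kernel reserves per block (0 on error).
size_t staticSharedBytes(void (*kernel)(unsigned char *))
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return 0;
    return attr.sharedSizeBytes;
}

// Resident blocks per multiprocessor for a given block size (-1 on error).
int blocksPerSM(void (*kernel)(unsigned char *), int blockSize)
{
    int numBlocks = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel, blockSize, 0)
            != cudaSuccess)
        return -1;
    return numBlocks;
}

// Prints both numbers for each stand-in kernel.
void reportSharedUsage()
{
    const int blockSize = 256;  // assumption; must not exceed the array size
    printf("oneArray:    %zu bytes shared, %d blocks per SM\n",
           staticSharedBytes(oneArray), blocksPerSM(oneArray, blockSize));
    printf("threeArrays: %zu bytes shared, %d blocks per SM\n",
           staticSharedBytes(threeArrays), blocksPerSM(threeArrays, blockSize));
}
```

Running reportSharedUsage() on my card, and the same two queries on my real kernel in both variants, should show whether the NULL version ends up with less shared memory per block and more resident blocks. Whether that accounts for the factor of two is exactly what I am unsure about.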