Optimal 2D shared memory layout for doubles; one dimension variable

Hi all…

My algorithm requires each thread to store a “dim”-length array of doubles. Let’s call this variable “state”. Typically the value of dim will vary from 6 to 12, but it can be as high as 128. In most cases dim will be 6.

The value of “dim” is fixed within a single kernel call; the final code will make multiple kernel calls with different dim values.

The “state” variable will be reused many times per thread, hence I planned to put it in shared memory.

Currently am planning to define the shared variable for each block as follows:

__shared__ double state[blockSize][dim];

// and access it as

state[tid][0] = some computation ;

state[tid][2] = some computation by same thread ;

...

blockSize is 64 for now.

Each thread would access the whole second “dim” dimension of the variable to do various calculations on a per-thread basis.
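Roughly, a stripped-down version of what I have in mind looks like the sketch below (the kernel name and the placeholder computation are just illustrative, not my real code). Since a statically sized __shared__ array needs compile-time bounds, dim is a template parameter here and the host instantiates the kernel once per dim value it uses:

#define BLOCK_SIZE 64

template <int DIM>
__global__ void stateKernel(double *out)
{
    // row-per-thread layout, as described above
    __shared__ double state[BLOCK_SIZE][DIM];

    const int tid = threadIdx.x;

    // each thread fills its own row of "state" ...
    for (int k = 0; k < DIM; ++k)
        state[tid][k] = (double)(tid + k);   // placeholder for the real computation

    // ... and reuses the whole row for per-thread work
    double acc = 0.0;
    for (int k = 0; k < DIM; ++k)
        acc += state[tid][k];

    out[blockIdx.x * blockDim.x + tid] = acc;
}

// host side: pick the instantiation matching this call's dim, e.g.
// stateKernel<6><<<gridSize, BLOCK_SIZE>>>(d_out);
// stateKernel<12><<<gridSize, BLOCK_SIZE>>>(d_out);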

As I am new to Fermi (using an M2090), I am not really sure if this is the best way to lay out such a variable in shared memory, or whether there is a better way to define/read it (a 1D layout?). Will the above-mentioned access pattern lead to bank conflicts? (I suspect it will.) Is there any way to avoid them or minimize their impact?

The Programming Guide (4.2) doesn’t give much information on double access on Fermi for a 2D data layout. The 2D layout is just for convenience, but I am not sure how to build a 1D layout that will not lead to bank conflicts, given that dim is variable.
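Would a transposed 1D layout along these lines be the right idea? (Again just a sketch; the kernel name and the placeholder computation are made up.) Element k of thread tid would live at state[k * blockDim.x + tid], so for a fixed k the warp touches consecutive doubles, and the array is sized at launch via the third <<<>>> parameter so dim can stay a runtime value:

__global__ void stateKernelTransposed(double *out, int dim)
{
    extern __shared__ double state[];        // dim * blockDim.x doubles

    const int tid = threadIdx.x;

    for (int k = 0; k < dim; ++k)
        state[k * blockDim.x + tid] = (double)(tid + k);   // placeholder computation

    double acc = 0.0;
    for (int k = 0; k < dim; ++k)
        acc += state[k * blockDim.x + tid];

    out[blockIdx.x * blockDim.x + tid] = acc;
}

// host side: shared memory size is dim * blockSize * sizeof(double)
// stateKernelTransposed<<<gridSize, 64, dim * 64 * sizeof(double)>>>(d_out, dim);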

Thanks for all the help.

So I tried my code with global memory only, and with global + shared memory, and I get approximately the same speed. One reason is bank conflicts from the access pattern stated above; another may be the high register usage, which may be what is limiting me. (See the compiler output below.)

Without #pragma unroll over each loop:

176 bytes stack frame, 72 bytes spill stores, 72 bytes spill loads

ptxas info	: Used 63 registers, 3072+0 bytes smem, 40 bytes cmem[0], 56 bytes cmem[2], 80 bytes cmem[16]

With #pragma unroll over the loops:

152 bytes stack frame, 708 bytes spill stores, 872 bytes spill loads

ptxas info	: Used 63 registers, 40 bytes cmem[0], 56 bytes cmem[2], 80 bytes cmem[16]

Now what I don’t understand is: where are the variables being stored when I am not unrolling the loops, given that the register usage is at the maximum in both cases?

Also, the unrolled code is slower by 30 to 40%!

Anyone ?

Unrolling the loop completely could result in very lengthy straight-line code that allows the compiler to find many more common subexpressions that it can precompute and store in a register for later use. This can result in very high register pressure. There may be other reasons; it is hard to tell without knowing what the code does in detail. A look at the intermediate PTX file (use -keep to preserve it) may hint at what is happening in terms of subexpressions being pulled to the front of the code.
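If full unrolling turns out to be counterproductive, you could also try a partial unroll by giving #pragma unroll an explicit factor, so the compiler can still overlap a few independent iterations without keeping the entire unrolled loop body live in registers at once. Something along these lines (the loop below is just a stand-in for your real loop):

__global__ void partialUnrollSketch(const double *in, double *out, int dim)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    double acc = 0.0;
    #pragma unroll 4                 // unroll by 4 instead of fully; tune the factor
    for (int k = 0; k < dim; ++k)
        acc += in[tid * dim + k];    // placeholder body

    out[tid] = acc;
}

Another knob worth experimenting with is -maxrregcount (or __launch_bounds__) to cap register usage, at the possible cost of more spilling.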