Efficient use of shared data

Dear all,

I am using the finite element method to simulate fluid flows. I have defined a vector of field variables named Q. For the integration procedures, which are carried out by the threads, this vector should be shared among the threads of a block on the GPU. What is the best way to optimize the memory transactions? If shared memory is used, how can bank conflicts be avoided when the number of threads in a block is much greater than the size of the shared data, e.g. Q[16] with 128 threads per block?

Many thanks,

Bank conflicts occur per warp, so you only have to worry about conflicts among the threads of a single warp. If the threads in a warp read consecutive elements of an array, there are no conflicts. Note that the Fermi architecture has 32 banks, not 16.
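
To make the per-warp rule concrete, here is a small host-side C++ sketch (not GPU code) that counts conflicts for a given access pattern. It is a simplified model resting on several assumptions: 32 banks, 32-bit elements such as float, word w stored in bank w % 32, and threads of a warp that read the same word being served by a single broadcast. The names conflict_degree and warp_accesses are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Simplified model (an assumption, not vendor code): shared memory is split
// into `num_banks` banks of 4-byte words, and word w lives in bank w % num_banks.
// Threads of one warp that read the same word are served by one broadcast.
// Returns the largest number of distinct words requested from a single bank:
// 1 means conflict-free, k means a k-way bank conflict.
int conflict_degree(const std::vector<std::size_t>& word_index_per_thread,
                    int num_banks = 32) {
    assert(num_banks > 0);
    std::map<std::size_t, std::set<std::size_t>> words_in_bank;
    for (std::size_t w : word_index_per_thread) {
        words_in_bank[w % static_cast<std::size_t>(num_banks)].insert(w);
    }
    std::size_t worst = 1;
    for (const auto& entry : words_in_bank) {
        worst = std::max(worst, entry.second.size());
    }
    return static_cast<int>(worst);
}

// Word indices touched by one warp of 32 threads when the thread in lane t
// reads element (t * stride) % array_len of a shared array.
std::vector<std::size_t> warp_accesses(std::size_t stride, std::size_t array_len) {
    assert(array_len > 0);
    std::vector<std::size_t> idx(32);
    for (std::size_t lane = 0; lane < 32; ++lane) {
        idx[lane] = (lane * stride) % array_len;
    }
    return idx;
}
```

Under these assumptions, conflict_degree(warp_accesses(1, 16)) is 1: with Q[16] and each thread reading Q[threadIdx.x % 16], the 32 threads of a warp touch 16 different words in 16 different banks, and the two threads that share a word are handled by broadcast. A stride of 2 over a larger array gives a 2-way conflict in the same model, and a stride of 32 puts every access in bank 0.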