I am using the finite element method to simulate fluid flows. I have defined a vector of field variables named Q. The integration is carried out in parallel by the threads of a block, so Q has to be shared among all threads of that block on the GPU. What is the best way to optimize the memory transactions in this case? If Q is placed in shared memory, how can bank conflicts be avoided when the number of threads per block is much greater than the size of the shared data, e.g. the vector Q with 128 threads per block?
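To make the access pattern concrete, here is a minimal CUDA sketch of one possible setup. Everything in it beyond Q and the 128 threads per block is an assumption for illustration: one block per element, one thread per quadrature point, a hypothetical size `Q_SIZE = 8`, a hypothetical `weights` array, and the helper `run_integrate`. On current NVIDIA GPUs, threads of a warp that read the same shared-memory word in the same instruction are served by a broadcast rather than serialized, so a loop in which every thread reads the same `Q[i]` should not cause bank conflicts even though the block has far more threads than Q has entries. Conflicts would arise only if threads of one warp read different words that fall in the same bank.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

constexpr int Q_SIZE  = 8;    // hypothetical number of field variables per element
constexpr int THREADS = 128;  // threads per block, as in the question

// Assumed mapping: one block per element, one thread per quadrature point.
__global__ void integrate(const float* Q_global, const float* weights, float* out)
{
    __shared__ float Q[Q_SIZE];

    // Cooperative load: the first Q_SIZE threads copy one entry each.
    // Consecutive threads touch consecutive addresses, so the global read
    // is coalesced and the shared write hits distinct banks.
    if (threadIdx.x < Q_SIZE)
        Q[threadIdx.x] = Q_global[blockIdx.x * Q_SIZE + threadIdx.x];
    __syncthreads();  // make Q visible to every thread in the block

    // In each iteration all threads read the same word Q[i] (a broadcast).
    // weights is laid out so that consecutive threads read consecutive floats.
    float acc = 0.0f;
    for (int i = 0; i < Q_SIZE; ++i)
        acc += Q[i] * weights[i * THREADS + threadIdx.x];

    out[blockIdx.x * THREADS + threadIdx.x] = acc;
}

// Hypothetical host helper: copies inputs to the device, launches one block
// per element, and returns THREADS results per block.
std::vector<float> run_integrate(const std::vector<float>& Q_host,
                                 const std::vector<float>& w_host,
                                 int num_blocks)
{
    assert(Q_host.size() == static_cast<std::size_t>(num_blocks) * Q_SIZE);
    assert(w_host.size() == static_cast<std::size_t>(Q_SIZE) * THREADS);

    std::vector<float> out_host(static_cast<std::size_t>(num_blocks) * THREADS);
    float *dQ = nullptr, *dW = nullptr, *dOut = nullptr;

    cudaMalloc(&dQ,   Q_host.size()   * sizeof(float));
    cudaMalloc(&dW,   w_host.size()   * sizeof(float));
    cudaMalloc(&dOut, out_host.size() * sizeof(float));
    cudaMemcpy(dQ, Q_host.data(), Q_host.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, w_host.data(), w_host.size() * sizeof(float), cudaMemcpyHostToDevice);

    integrate<<<num_blocks, THREADS>>>(dQ, dW, dOut);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out_host.data(), dOut, out_host.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dQ);
    cudaFree(dW);
    cudaFree(dOut);
    return out_host;
}
```

Calling `run_integrate` with a single block, Q = {1, ..., 8} and all weights set to 1 should give 36 in every output slot. If Q were identical for all blocks and read-only, constant memory might be an alternative worth measuring; with a different Q per block, as assumed here, shared memory is the more natural fit.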