Shared memory performance: load/access patterns

Is there a guide or example on how best to load and access shared memory within each block? My Nsight profile shows low shared memory efficiency, which looks like it could be improved with some tuning.

I have found that this book, Professional CUDA C Programming (http://www.wrox.com/WileyCDA/WroxTitle/Professional-CUDA-C-Programming.productCd-1118739329.html), gives the best explanation I have seen of getting maximum performance out of shared memory.
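
For anyone landing here with the same profiler result: one common cause of low shared memory efficiency is bank conflicts, where several threads of a warp hit the same bank in a single request. Whether that is the cause in a given kernel is an assumption you would need to confirm in the profiler. The sketch below is not taken from the book. It is a minimal illustration that assumes a device with 32 four-byte banks, and the names transposePadded, runTransposeCheck and TILE are made up for this example. It transposes a square matrix through a shared tile and pads each tile row by one element so that the column-wise reads fall into different banks.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 32  // assumed tile width: one warp per tile row

// Hypothetical kernel: transpose an n x n matrix via a shared-memory tile.
__global__ void transposePadded(const float *in, float *out, int n)
{
    // The extra column shifts each row by one bank, so reading a tile
    // column (stride TILE + 1 floats) touches 32 different banks instead
    // of one. Without the +1 the stride would be 32 floats and every
    // thread in the warp would map to the same bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced global read, row-wise shared write.
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();  // the whole tile must be loaded before it is read

    // Swap the block indices so the global write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    // Column-wise shared read; conflict-free because of the padding.
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: runs the kernel and returns the number of
// mismatching elements, or -1 if a CUDA call failed.
int runTransposeCheck(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> hIn((size_t)n * n), hOut((size_t)n * n);
    for (size_t i = 0; i < hIn.size(); ++i)
        hIn[i] = (float)i;

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposePadded<<<grid, block>>>(dIn, dOut, n);

    cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (hOut[(size_t)r * n + c] != hIn[(size_t)c * n + r])
                ++mismatches;
    return mismatches;
}

int main()
{
    int mismatches = runTransposeCheck(1024);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

To see the effect, change the declaration to tile[TILE][TILE] and profile both versions. The shared memory efficiency and bank conflict counters should differ between the two, though the exact numbers depend on the GPU and the profiler version.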