Shared memory performance: load/access

Is there a guide or example on how best to load and access shared memory for each block? My Nsight profile shows low shared memory efficiency, which looks like it could benefit from some performance tuning.

I have found that the book Professional CUDA C Programming (Wiley) gives the best explanation of how to get maximum performance out of shared memory.
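
In case a concrete starting point helps: one common cause of low shared memory efficiency is bank conflicts, where threads of the same warp hit the same bank at different addresses and the accesses get serialized. Below is a minimal sketch of the usual padding workaround, using a tiled matrix transpose as the example. It is not taken from the book. The names transpose_tiled, transpose_matches_reference and TILE_DIM are made up for illustration, and it assumes 32 banks of 4-byte words, a square n x n float matrix, and 32x32 thread blocks. Your kernel's access pattern may need a different fix.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tile size, chosen to match the assumed 32 shared memory banks.
constexpr int TILE_DIM = 32;

// Transpose an n x n row-major matrix through a shared memory tile.
__global__ void transpose_tiled(const float* in, float* out, int n)
{
    // The extra column (+1) shifts each row by one bank, so reading a
    // column of the tile touches 32 different banks instead of one.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // row-wise store

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Swap the block indices so the global write stays coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read
}

// Host helper (hypothetical): runs the kernel and checks out[c][r] == in[r][c].
bool transpose_matches_reference(int n)
{
    std::vector<float> h_in(n * n), h_out(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) h_in[i] = static_cast<float>(i);

    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((n + TILE_DIM - 1) / TILE_DIM, (n + TILE_DIM - 1) / TILE_DIM);
    transpose_tiled<<<grid, block>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (h_out[c * n + r] != h_in[r * n + c]) return false;
    return true;
}

int main()
{
    bool ok = transpose_matches_reference(4 * TILE_DIM);
    std::printf("%s\n", ok ? "transpose OK" : "transpose MISMATCH");
    return ok ? 0 : 1;
}
```

To see whether padding matters in your case, change TILE_DIM + 1 back to TILE_DIM, profile both builds in Nsight, and compare the shared memory numbers. If they barely move, bank conflicts in this pattern are probably not your bottleneck.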