From my research so far, the following optimizations can be applied to CUDA kernels for linear algebra solvers:
Global Memory Coalescing
Thread Block Merging
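
To make the first item concrete, here is a minimal host-side sketch of what I understand by coalescing. It is plain C++ rather than a real kernel, and it rests on a simplifying assumption: a warp of 32 threads issues one float load per thread, and memory is served in aligned 128-byte segments. The names `segments_touched`, `kWarpSize` and `kSegmentBytes` are my own, not part of any CUDA API.

```cpp
#include <cstddef>
#include <set>

// Simplified model (an assumption, not a hardware specification):
// a warp of 32 threads issues one load each, and the memory system
// serves those loads in aligned 128-byte segments. The fewer distinct
// segments a warp touches, the better its accesses coalesce.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSegmentBytes = 128;

// Count the distinct segments touched when thread t reads the float at
// element index base + t * stride_elems.
std::size_t segments_touched(std::size_t base, std::size_t stride_elems) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t byte_addr = (base + t * stride_elems) * sizeof(float);
        segments.insert(byte_addr / kSegmentBytes);
    }
    return segments.size();
}
```

In this model, `segments_touched(0, 1)` (threads walking along a row of a row-major matrix) touches 1 segment, while `segments_touched(0, 1024)` (threads walking down a column of a matrix with 1024 floats per row) touches 32, one per thread. That gap is why I try to map consecutive thread indices to consecutive addresses.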
Please share what you know about CUDA kernel optimization. What other optimizations can be applied to linear algebra kernels?