CUDA Kernel Optimization

From my research so far, the following optimizations can be applied to CUDA kernels for algorithms related to linear algebra solvers:

Global Memory Coalescing
Thread Block Merging
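
To make the first item concrete, here is a minimal sketch of what I understand by global memory coalescing. It is only an illustration under my own assumptions: the kernel names (scale_coalesced, scale_strided), the helper run_scale_demo, and the row-major layout are hypothetical and not taken from any library. Both kernels compute the same result; they differ only in which element each thread touches.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Coalesced: thread t touches element t, so the threads of a warp
// read and write one contiguous run of floats.
__global__ void scale_coalesced(const float* in, float* out, int n, float alpha) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = alpha * in[t];
}

// Strided: neighbouring threads walk down a column of a row-major
// rows x cols matrix, so their addresses are `cols` floats apart.
__global__ void scale_strided(const float* in, float* out,
                              int rows, int cols, float alpha) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < rows * cols) {
        int row = t % rows;          // consecutive threads -> consecutive rows
        int col = t / rows;
        int idx = row * cols + col;  // row-major index, stride of `cols`
        out[idx] = alpha * in[idx];
    }
}

// Hypothetical host helper: runs both kernels and checks that each one
// produces alpha * in[i] for every element. Returns true on success.
bool run_scale_demo(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    const int n = rows * cols;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    const float alpha = 2.0f;

    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    scale_coalesced<<<blocks, threads>>>(d_in, d_out, n, alpha);
    cudaMemcpy(h_a.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaMemset(d_out, 0, bytes);  // clear so the second check is independent
    scale_strided<<<blocks, threads>>>(d_in, d_out, rows, cols, alpha);
    cudaMemcpy(h_b.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaDeviceSynchronize() == cudaSuccess) &&
              (cudaGetLastError() == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n && ok; ++i) {
        const float expected = alpha * h_in[i];
        ok = (h_a[i] == expected) && (h_b[i] == expected);
    }
    return ok;
}
```

Calling run_scale_demo(256, 128) exercises both access patterns on the same data. I would expect the strided kernel to use global memory bandwidth less efficiently, since each warp's accesses are spread over many memory segments, but I have not measured the difference here.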

Please share your knowledge of CUDA kernel optimization. What other optimizations can be applied to linear algebra kernels?