Optimizing Sequential cuBLAS Calls for Matrix Operations—Alternatives to Kernel Fusion?

I am currently working on a CUDA project where my code involves a sequence of matrix multiplications, each followed by an activation function. Typically, such dependent, sequential operations can be optimized using kernel fusion: instead of writing an intermediate result to global memory and launching a second kernel to read it back, the follow-up operation is applied while the data is still in registers or shared memory, cutting global memory traffic and enhancing overall performance.
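For illustration, here is a minimal sketch of the kind of fusion I mean (not my actual code; the naive GEMM is only for clarity, a real kernel would use tiling and shared memory):

```cpp
#include <cuda_runtime.h>

// Naive fused GEMM + ReLU: C = max(A * B, 0).
// A is m x k, B is k x n, C is m x n, all row-major.
// The activation is applied in-register before the single store to C,
// so it costs no extra kernel launch or pass over global memory.
__global__ void gemm_relu(const float* A, const float* B, float* C,
                          int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m || col >= n) return;

    float acc = 0.0f;
    for (int i = 0; i < k; ++i)
        acc += A[row * k + i] * B[i * n + col];

    // Fused epilogue: activation applied before the write-back.
    C[row * n + col] = fmaxf(acc, 0.0f);
}
```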

To streamline my implementation, I opted to use cuBLAS for the matrix multiplications. However, I’ve found that the classic cuBLAS API doesn’t support kernel fusion, which seems like a missed opportunity: each activation ends up as a separate kernel launch with an extra round trip through global memory.

Given this context, I am seeking advice on alternative methods to optimize these sequential cuBLAS calls. Are there techniques within CUDA or associated libraries that can mimic the effects of kernel fusion, or perhaps a way to efficiently manage these operations to achieve similar performance gains? Any suggestions on optimizing memory usage or overlapping computations would also be greatly appreciated.

I share this problem. The way I understand it, cuBLASDx aims to facilitate kernel fusion for BLAS operations, but it currently appears to support only matrix multiplication, which is not enough in my case.

You can use cuBLASLt, which supports some fusion cases through its matmul epilogues, or use CUTLASS directly. See the sketch below for what the epilogue approach looks like.
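A minimal sketch of an FP32 GEMM with a ReLU fused into the epilogue via cuBLASLt (error checking omitted; the handle, sizes, and device pointers are assumed to be set up by the caller):

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

// C = ReLU(alpha * A * B + beta * C), with the ReLU fused into the GEMM
// epilogue instead of running as a separate elementwise kernel.
void matmul_relu(cublasLtHandle_t lt, int m, int n, int k,
                 const float* dA, const float* dB, float* dC,
                 cudaStream_t stream) {
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Request the fused epilogue; variants such as CUBLASLT_EPILOGUE_RELU_BIAS
    // or CUBLASLT_EPILOGUE_GELU are also available depending on version.
    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
    cublasLtMatmulDescSetAttribute(op, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));

    // Column-major layouts with leading dimension = number of rows.
    cublasLtMatrixLayout_t aL, bL, cL;
    cublasLtMatrixLayoutCreate(&aL, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&bL, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&cL, CUDA_R_32F, m, n, m);

    float alpha = 1.0f, beta = 0.0f;
    // algo == nullptr lets cuBLASLt pick a kernel via an implicit heuristic
    // query; no extra workspace is provided in this sketch.
    cublasLtMatmul(lt, op, &alpha, dA, aL, dB, bL, &beta,
                   dC, cL, dC, cL, /*algo=*/nullptr,
                   /*workspace=*/nullptr, /*workspaceSize=*/0, stream);

    cublasLtMatrixLayoutDestroy(cL);
    cublasLtMatrixLayoutDestroy(bL);
    cublasLtMatrixLayoutDestroy(aL);
    cublasLtMatmulDescDestroy(op);
}
```

With a bias-style epilogue, the same descriptor also takes a bias pointer via CUBLASLT_MATMUL_DESC_BIAS_POINTER, so a matmul + bias + activation chain can run as a single call.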

Please take a look at cuDNN’s Graph API (Graph API — NVIDIA cuDNN v9.1.0 documentation). It supports prologue and epilogue fusions with convolutions and matmuls, and it offers an abstraction layer on top of cuBLAS and CUTLASS.

Generic Runtime Fusion Engines (Graph API — NVIDIA cuDNN v9.1.0 documentation)
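If it helps, here is a rough sketch of a fused matmul + ReLU graph using the open-source cudnn-frontend C++ API (v1.x style; exact names can differ between frontend versions, error checking is omitted, and the dimensions and device buffers are placeholders):

```cpp
#include <cudnn_frontend.h>
#include <cuda_runtime.h>
#include <memory>
#include <unordered_map>
namespace fe = cudnn_frontend;

// Builds and runs a cuDNN graph computing O = ReLU(A @ B) as one fused op.
// dA, dB, dO are pre-allocated device buffers; matmul tensors are 3-D
// (batch, rows, cols). Error checking is omitted for brevity.
void fused_matmul_relu(cudnnHandle_t handle, int64_t m, int64_t n, int64_t k,
                       void* dA, void* dB, void* dO) {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::FLOAT)
         .set_intermediate_data_type(fe::DataType_t::FLOAT)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    auto A = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("A").set_dim({1, m, k})
                              .set_stride({m * k, k, 1}));
    auto B = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("B").set_dim({1, k, n})
                              .set_stride({k * n, n, 1}));

    // Matmul node followed by a pointwise ReLU node; the runtime fusion
    // engine can compile both into a single kernel.
    auto C = graph.matmul(A, B, fe::graph::Matmul_attributes());
    auto O = graph.pointwise(C, fe::graph::Pointwise_attributes()
                                    .set_mode(fe::PointwiseMode_t::RELU_FWD));
    O->set_output(true);

    // Lower the graph to an execution plan.
    graph.validate();
    graph.build_operation_graph(handle);
    graph.create_execution_plans({fe::HeurMode_t::A});
    graph.check_support(handle);
    graph.build_plans(handle);

    int64_t ws_size = 0;
    graph.get_workspace_size(ws_size);
    void* workspace = nullptr;
    cudaMalloc(&workspace, ws_size);

    std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void*>
        variant_pack = {{A, dA}, {B, dB}, {O, dO}};
    graph.execute(handle, variant_pack, workspace);
    cudaFree(workspace);
}
```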