Is there a kernel-level cuBLAS API that we can use at the warp or block level?

I want to run many matrix-matrix multiplications inside a GPU kernel (a __global__ function), so I need an API that can invoke cuBLAS from a thread, warp, or block.


The quick answer is no, but we are working on a new device-side cuBLAS library that should be available through the Math Library Early Access Program later this year. Sign up for updates here

Just curious, what size matrices are you interested in?

In the meantime, you might want to check out CUTLASS to see if it can satisfy your needs. GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines

The GEMM size for each warp is around 10000x128x128 (MxNxK).
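
For anyone reading along, here is a minimal sketch of what "one GEMM per warp inside a kernel" can look like when written by hand in plain CUDA. It is not cuBLAS, not CUTLASS, and not the device-side library mentioned above. The names warp_gemm, many_gemms, and run_many_gemms are hypothetical, and the row-major float layout with one independent MxNxK problem per warp is an assumption made for illustration. It is a naive loop meant to show the indexing, not a tuned implementation, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one warp computes C = A * B for its own problem.
// Assumed layout: A is MxK, B is KxN, C is MxN, all row-major float.
__device__ void warp_gemm(const float* A, const float* B, float* C,
                          int M, int N, int K)
{
    const int lane = threadIdx.x % warpSize;
    // The 32 lanes stride over the M*N outputs; neighbouring lanes mostly
    // touch neighbouring columns, so B reads and C writes stay coalesced.
    for (int idx = lane; idx < M * N; idx += warpSize) {
        const int row = idx / N;
        const int col = idx % N;
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[idx] = acc;
    }
}

// Each warp in the grid picks one of `batch` independent GEMMs.
__global__ void many_gemms(const float* A, const float* B, float* C,
                           int M, int N, int K, int batch)
{
    const int warps_per_block = blockDim.x / warpSize;
    const int warp_id = blockIdx.x * warps_per_block + threadIdx.x / warpSize;
    if (warp_id >= batch) return;  // the whole warp exits together
    warp_gemm(A + (size_t)warp_id * M * K,
              B + (size_t)warp_id * K * N,
              C + (size_t)warp_id * M * N, M, N, K);
}

// Hypothetical host wrapper: copies inputs, launches, copies the result back.
std::vector<float> run_many_gemms(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int K, int batch)
{
    assert(A.size() == (size_t)batch * M * K);
    assert(B.size() == (size_t)batch * K * N);
    std::vector<float> C((size_t)batch * M * N);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;                   // 4 warps per block
    const int warps_per_block = threads / 32;
    const int blocks = (batch + warps_per_block - 1) / warps_per_block;
    many_gemms<<<blocks, threads>>>(dA, dB, dC, M, N, K, batch);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With M = 10000 and N = K = 128 as in the post above, each warp would loop over 1.28 million output elements. A tiled approach that stages A and B through shared memory or uses tensor cores would be the usual next step, and that is what CUTLASS provides.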