Kernel-level cuBLAS

Hi all,

Is there a kernel-level cuBLAS API that we can use at the warp or block level?

I want to run many matrix-matrix multiplications inside a GPU kernel (a __global__ function), so I need an API that lets me invoke cuBLAS from a thread, warp, or block.
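For concreteness, the usage I have in mind looks roughly like this. Note that `warp_gemm` is a hypothetical placeholder for the device-side API I am asking about, not a real cuBLAS function:

```cuda
// Sketch of the desired usage: each warp computes one independent GEMM.
// warp_gemm is hypothetical -- it stands in for the device-side API in question.
__global__ void batched_gemm_kernel(const float *A, const float *B, float *C,
                                    int M, int N, int K, int num_gemms) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    if (warp_id >= num_gemms) return;

    // Each warp would cooperatively multiply its own M x K and K x N operands.
    warp_gemm(A + warp_id * M * K,   // this warp's A
              B + warp_id * K * N,   // this warp's B
              C + warp_id * M * N,   // this warp's C
              M, N, K);
}
```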

Thanks

Hi Daniel_Wong,

The quick answer is no, but we are working on a new device-side cuBLAS library that should be available through the Math Library Early Access Program later this year. Sign up for updates here: CUDA Math Library Early Access Program | NVIDIA Developer

Just curious, what size matrices are you interested in?

In the meantime, you might want to check out CUTLASS to see if it can satisfy your needs: GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines
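For reference, the basic host-side entry point looks roughly like this (a minimal sketch assuming CUTLASS 2.x with a single-precision, row-major GEMM; CUTLASS also ships threadblock- and warp-level tile components that you can compose inside your own __global__ kernel, though wiring those up takes considerably more code):

```cuda
#include <cutlass/gemm/device/gemm.h>

// Minimal sketch, assuming CUTLASS 2.x: single-precision GEMM with
// row-major A, B, and C, launched from the host.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::RowMajor,   // A
    float, cutlass::layout::RowMajor,   // B
    float, cutlass::layout::RowMajor>;  // C

cutlass::Status run_gemm(int M, int N, int K,
                         const float *A, const float *B, float *C,
                         float alpha, float beta) {
    Gemm gemm_op;
    // Arguments: problem size, A/B/C/D tensor refs with leading dimensions,
    // and the epilogue scalars (D = alpha * A * B + beta * C).
    return gemm_op({{M, N, K},
                    {A, K},          // A, lda
                    {B, N},          // B, ldb
                    {C, N},          // C, ldc
                    {C, N},          // D, ldd (written in place over C)
                    {alpha, beta}});
}
```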


The GEMM size for each warp is around 10000x128x128 (MxNxK).

Thanks!