Kernel-level cuBLAS

Hi all,

Is there any kernel-level cuBLAS API, that we can use at the warp or block level?

I want to run many matrix-matrix multiplications inside a GPU kernel (__global__) function, so I need an API that lets me invoke cuBLAS from a thread/warp/block.


Hi Daniel_Wong,

Quick answer is no, but we are working on a new device-side cuBLAS library that should be available through the Math Library Early Access Program later this year. Sign up for updates here

Just curious, what size matrices are you interested in?

In the meantime, you might want to check out CUTLASS to see if it can satisfy your needs. GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines
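Until a device-side cuBLAS is available, one workaround is to hand-roll the GEMM inside the kernel, with each block computing one independent problem. Below is a minimal, naive sketch of that idea; the names `block_gemm` and `batched_gemm_kernel` are hypothetical, not part of any NVIDIA library, and a real implementation (e.g. via CUTLASS's tiled, tensor-core device-level templates) would be far faster:

```cuda
#include <cuda_runtime.h>

// Naive block-level GEMM: C = A * B for a single problem, computed
// cooperatively by the threads of one block. A is MxK, B is KxN,
// C is MxN, all row-major. Hypothetical helper for illustration only.
__device__ void block_gemm(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    // Each thread strides over the output elements of C.
    for (int idx = threadIdx.x; idx < M * N; idx += blockDim.x) {
        int row = idx / N;
        int col = idx % N;
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// One independent GEMM per block: block b reads its own A/B slices
// and writes its own C slice (simple strided-batched layout).
__global__ void batched_gemm_kernel(const float* A, const float* B,
                                    float* C, int M, int N, int K) {
    int b = blockIdx.x;
    block_gemm(A + (size_t)b * M * K,
               B + (size_t)b * K * N,
               C + (size_t)b * M * N,
               M, N, K);
}
```

This sketch uses no shared memory or tensor cores, so it is mainly a starting point for correctness testing before switching to CUTLASS's optimized device-side GEMM components.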


The GEMM size for each warp is around 10000x128x128 (MxNxK).