I want to use the hybrid CPU-GPU feature of cublasXt to reduce the device-memory usage of matrix multiplication, and it works for a single (non-batched) matmul. Can cublasXt handle batched matmul as well? If not, is there another library with a similar feature that reduces the device-memory usage of batched matmul?
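For reference, here is a minimal sketch of the single-matmul case that already works for me (the sizes, device list, and error handling are simplified placeholders): cublasXt accepts host pointers directly and streams tiles of the operands through device memory, so the full matrices never have to fit on the GPU.

```c
#include <stdlib.h>
#include <cublasXt.h>

int main(void) {
    // Host-resident operands; cublasXt tiles them onto the GPU internally,
    // so device memory only needs to hold a few tiles at a time.
    const size_t m = 8192, n = 8192, k = 8192;
    float *A = malloc(m * k * sizeof *A);
    float *B = malloc(k * n * sizeof *B);
    float *C = malloc(m * n * sizeof *C);
    // ... fill A and B ...

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[1] = {0};                      // run on GPU 0 only
    cublasXtDeviceSelect(handle, 1, devices);

    // C = alpha * A * B + beta * C (column-major); A, B, C are host pointers
    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, A, m, B, k, &beta, C, m);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

This out-of-core behavior is what I'd like to keep for the batched case, ideally without falling back to a plain loop of per-matrix cublasXtSgemm calls.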