about using batched multiplications

a.alexeev3 · April 19, 2019, 1:19pm

Hello.
I have an NVIDIA Quadro K4000M GPU with 4 GB of GDDR5 memory, and I'd like to do parallel batched matrix multiplication. I know about cublasSgemmStridedBatched(), but everywhere it is discussed, only small matrices are mentioned. Given my GPU, how can I properly calculate the optimal batch size and matrix sizes? At the moment I have, for example, batch = 16, where each matrix A is 4x512 and the second matrix B is 512x512. The result C is 16 matrices of size 4x512. Maybe there is another, more appropriate way to do this kind of calculation. I tried streamed matrix multiplication before, but it didn't give me any gain in speed.
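To make the shapes concrete, here is a minimal CPU sketch in plain C++ (no cuBLAS calls) of what a strided-batched product computes. It assumes that B is the same 512x512 matrix for every batch item, which the post does not state explicitly, and that the 16 A matrices are stored back to back in row-major order. Under those assumptions the batch gives the same result as one ordinary GEMM of a 64x512 matrix with B. The function names are hypothetical, and cuBLAS itself uses column-major storage, so the leading dimensions and strides would need adapting before passing such buffers to a real cuBLAS call.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major reference GEMM: C (m x n) = A (m x k) * B (k x n).
void gemm_ref(const float* A, const float* B, float* C,
              std::size_t m, std::size_t n, std::size_t k) {
    assert(A && B && C);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}

// Strided-batched reference: for each b, C_b = A_b * B_b, where
// X_b starts at X + b * strideX. A stride of 0 reuses the same matrix.
void gemm_strided_batched_ref(const float* A, std::size_t strideA,
                              const float* B, std::size_t strideB,
                              float* C, std::size_t strideC,
                              std::size_t m, std::size_t n, std::size_t k,
                              std::size_t batch) {
    for (std::size_t b = 0; b < batch; ++b)
        gemm_ref(A + b * strideA, B + b * strideB, C + b * strideC, m, n, k);
}

// Compares the batched product (shared B, strideB = 0) with a single GEMM
// over the vertically stacked A matrices; returns the largest difference.
float max_abs_diff_batched_vs_single(std::size_t batch, std::size_t m,
                                     std::size_t n, std::size_t k) {
    std::vector<float> A(batch * m * k), B(k * n);
    std::vector<float> Cb(batch * m * n), Cs(batch * m * n);
    // Deterministic filler values, chosen only for illustration.
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = float(i % 7) * 0.25f - 0.5f;
    for (std::size_t i = 0; i < B.size(); ++i) B[i] = float(i % 5) * 0.125f;

    // 'batch' small products, each (m x k) * (k x n).
    gemm_strided_batched_ref(A.data(), m * k, B.data(), 0,
                             Cb.data(), m * n, m, n, k, batch);
    // One product, (batch*m x k) * (k x n), over the same memory.
    gemm_ref(A.data(), B.data(), Cs.data(), batch * m, n, k);

    float diff = 0.0f;
    for (std::size_t i = 0; i < Cb.size(); ++i)
        diff = std::max(diff, std::fabs(Cb[i] - Cs[i]));
    return diff;
}
```

Calling max_abs_diff_batched_vs_single(16, 4, 512, 512) exercises the sizes from the post. If B really is shared, a single larger GEMM may be worth trying instead of 16 small ones; whether it is actually faster on this GPU would have to be measured.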
I have been working on my own neural network framework in C++; maybe somebody would be interested in developing it with me. The main aim is research and publishing Scopus-level papers in the object detection field. All calculations are currently done with Eigen matrices.
Alexey.
