NVIDIA Developer Forums

How are threads/blocks mapped on the GPU when calling cublasSgemm routines?


Gopal_HC February 13, 2013, 10:40am 1

I would like to know how the cublasSgemm routine is mapped onto the GPU when computing a matrix multiplication (C = A * B).

Basically, I want to know:

1) How are these routines implemented on the GPU?
2) Are the m and n values mapped onto a single compute unit (SM)? If not, what are the maximum values for m and n?
3) Do we have any control over the threads/blocks?
