I am interested in knowing how the cublasSgemm routine is mapped onto the GPU when it computes a matrix multiplication (C = A * B).
Basically, I want to know:
- How are these routines implemented on the GPU?
- Are the m and n dimensions mapped onto a single compute unit (SM)? If not, what is the maximum value for m and n?
- Do we have any control over the threads/blocks?
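
To be clear about what I mean by m and n: as far as I understand, with no transposes cublasSgemm computes C = alpha * A * B + beta * C on column-major matrices, where A is m x k, B is k x n and C is m x n. Below is a plain CPU sketch of that operation. The name sgemm_reference is my own hypothetical helper, not part of cuBLAS, and it says nothing about how the GPU version is actually implemented.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference for the SGEMM operation (no transposes):
//   C = alpha * A * B + beta * C
// Matrices are column-major, as in cuBLAS: element (i, j) of a matrix
// with leading dimension ld is stored at index i + j * ld.
// A is m x k, B is k x n, C is m x n.
void sgemm_reference(int m, int n, int k, float alpha,
                     const std::vector<float>& A, int lda,
                     const std::vector<float>& B, int ldb,
                     float beta, std::vector<float>& C, int ldc) {
    // Leading dimensions must cover the row count of each matrix.
    assert(lda >= m && ldb >= k && ldc >= m);

    for (int j = 0; j < n; ++j) {          // columns of C
        for (int i = 0; i < m; ++i) {      // rows of C
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {  // inner (shared) dimension
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

So my question is how the m x n output elements computed by the two outer loops here end up distributed over SMs, blocks and threads inside cublasSgemm.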