How are threads/blocks mapped on the GPU when calling cublasSgemm routines?

I am interested in knowing how the cublasSgemm routine is mapped onto the GPU when computing a matrix multiplication (C = A * B); a minimal sketch of the call I mean is included below for reference.

Basically, I want to know:

  1.) How are these routines implemented on the GPU?
  2.) Are the values of m and n mapped onto one compute unit (SM)? If not, what are the maximum values for m and n?
  3.) Do we have control over the threads/blocks?
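
For context, this is the kind of call I mean (a minimal sketch using the cuBLAS v2 API; the dimensions m, n, k and the fill values here are just placeholders for illustration):

```cpp
// Minimal sketch of a cublasSgemm call computing C = A * B (cuBLAS v2 API).
// Matrices are stored column-major, as cuBLAS expects.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int m = 4, n = 3, k = 2;           // C is m x n, A is m x k, B is k x n
    std::vector<float> A(m * k, 1.0f);       // placeholder data
    std::vector<float> B(k * n, 2.0f);
    std::vector<float> C(m * n, 0.0f);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, A.size() * sizeof(float));
    cudaMalloc(&d_B, B.size() * sizeof(float));
    cudaMalloc(&d_C, C.size() * sizeof(float));
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, no transposition
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha,
                d_A, m,      // lda = m
                d_B, k,      // ldb = k
                &beta,
                d_C, m);     // ldc = m

    cudaMemcpy(C.data(), d_C, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", C[0]);             // expect 4.0 for these fill values

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```

In other words, I never launch a kernel or pick a grid/block configuration myself here, which is why I am asking how the library decides the thread/block mapping internally.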