Question about using cublasSgemm() for matrix multiplication

I tried to use cublasSgemm() to do matrix multiplication, following the sample code provided in the CUDA SDK package. The sample specifies the number of threads per block and the number of blocks per grid with the following code:

// setup execution parameters
dim3 threads(block_size, block_size);  // block_size = 32
dim3 grid(matrix_size.uiWC / threads.x, matrix_size.uiHC / threads.y);

However, when I changed the size of both the threads and the grid (for example, just one thread per block and only one block in the grid), the run time didn't really change much, even for large matrices. Can I actually control how many blocks and threads are used when calling cublasSgemm()?

CUBLAS API calls make their own decisions as to which kernel(s) to launch and how to configure those kernels. This is based on heuristics and should be close to optimal, modulo specific bugs that could exist. There is nothing for the user to configure in terms of runtime configuration: CUBLAS functions are a convenient black box.
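To illustrate the point, here is a minimal sketch of calling cublasSgemm() directly (a 2x2 product against the identity, assuming the cuBLAS v2 API and column-major storage): notice that nowhere in the call is there a grid or block dimension to supply — the launch configuration is entirely internal to the library.

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    // Column-major 2x2 matrices: compute C = alpha*A*B + beta*C
    const int n = 2;
    float hA[] = {1, 2, 3, 4};   // A = [1 3; 2 4] in column-major order
    float hB[] = {1, 0, 0, 1};   // B = identity
    float hC[4] = {0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // No <<<grid, threads>>> anywhere: cuBLAS chooses the kernel(s)
    // and their launch configuration internally.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    assert(hC[0] == 1.0f && hC[3] == 4.0f);  // A * I == A
    printf("ok\n");

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Compile with `nvcc -lcublas`; this requires a CUDA-capable GPU to run.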

Thank you very much for your reply. The reason I asked is that the CUDA sample does schedule the number of threads and blocks itself, and I thought that was the right way to do it.

Which CUDA sample are you looking at? When you write your own CUDA kernels for matrix computation, you will of course have to decide on a launch configuration. But if you use a pre-packaged library such as CUBLAS, CUFFT, CUDNN, NPP, etc., that work has been done for you.
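For contrast, here is a minimal sketch of the hand-written case, where the launch configuration is yours to choose, just as in the SDK sample (the naive one-thread-per-output-element kernel below is illustrative, not the SDK sample's actual tiled kernel):

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

// Naive hand-written kernel: each thread computes one element of C.
__global__ void matmul(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

int main() {
    const int n = 64;
    const size_t bytes = n * n * sizeof(float);
    float *hA = new float[n * n], *hB = new float[n * n], *hC = new float[n * n];
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Here YOU pick the launch configuration, as in the SDK sample --
    // and a bad choice (e.g. one thread per block) WILL slow this kernel down.
    dim3 threads(32, 32);
    dim3 grid((n + threads.x - 1) / threads.x, (n + threads.y - 1) / threads.y);
    matmul<<<grid, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    assert(hC[0] == 2.0f * n);  // each element is a sum of n products 1*2
    printf("ok\n");

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```

With a hand-written kernel like this, the grid/block choice is a real tuning knob; with cublasSgemm() there is no such knob exposed.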