Threads in Matrix Multiplication

I want to plot the time taken by matrix multiplication for square matrices of different sizes (8x8, 16x16, and so on). My block size is fixed at 16. I also have to log the time for each matrix multiplication while varying the number of threads, from the minimum number of threads up to a maximum of 512 threads. What is the minimum number of threads used at the beginning, and is there any command that makes the program use a fixed number of threads?

You can look at the matrix multiplication example in the NVIDIA_CUDA_SDK. You only have to change the parameters. Note that in that example the matrix sizes are multiples of the block size.
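
To make the thread counts concrete: in CUDA the number of threads is not picked by the runtime; it is whatever you pass in the kernel launch configuration (`kernel<<<grid, threads>>>(...)`), so it is already fixed by your code. Below is a rough host-side sketch of the arithmetic, assuming "block size 16" means a 16x16 thread block and one thread per output element. The names `LaunchConfig`, `launchConfigFor` and `printLaunchTable` are made up for illustration and are not part of the SDK.

```cpp
#include <cstdio>

// Hypothetical helper type: summarizes a 2D launch for an n x n matrix.
struct LaunchConfig {
    int blocksPerSide;    // the grid is blocksPerSide x blocksPerSide blocks
    int threadsPerBlock;  // blockSize * blockSize threads in each block
    long totalThreads;    // threads launched across the whole grid
};

// Assumption: one thread per output element, square thread blocks of
// blockSize x blockSize threads (16 x 16 = 256 for block size 16).
LaunchConfig launchConfigFor(int n, int blockSize) {
    LaunchConfig cfg;
    // Ceiling division, so a matrix smaller than one block still gets one block.
    cfg.blocksPerSide = (n + blockSize - 1) / blockSize;
    cfg.threadsPerBlock = blockSize * blockSize;
    cfg.totalThreads = static_cast<long>(cfg.blocksPerSide) *
                       cfg.blocksPerSide * cfg.threadsPerBlock;
    return cfg;
}

// Prints one line per matrix size: 8, 16, 32, ... up to maxN.
void printLaunchTable(int maxN, int blockSize) {
    for (int n = 8; n <= maxN; n *= 2) {
        LaunchConfig cfg = launchConfigFor(n, blockSize);
        std::printf("N=%d grid=%dx%d threads/block=%d total=%ld\n",
                    n, cfg.blocksPerSide, cfg.blocksPerSide,
                    cfg.threadsPerBlock, cfg.totalThreads);
    }
}
```

Under these assumptions every launch uses 256 threads per block, and only the number of blocks grows with the matrix size. In the kernel launch this would correspond to `dim3 threads(16, 16); dim3 grid(blocksPerSide, blocksPerSide);`. An 8x8 matrix is not a multiple of 16, so it would still launch one full block and the kernel would need a bounds check for the unused threads. To vary the thread count for your log, change the block dimensions; 512 threads per block is the limit on older (compute capability 1.x) devices.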