I am writing CUDA code that uses CUBLAS functions. My question is: how do I set the grid and thread-block dimensions when calling a CUBLAS function (e.g. cublasSgemm(…))? Or is the launch configuration chosen and optimized internally, and if so, which SM architecture is it best suited for? Is it optimized in general? Are there matrix sizes that work best for it (aside from dimensions being a multiple of 32)?
Or, simply put, is there a guide somewhere on how to get the best performance out of these functions?
It is quite difficult to find an answer in the toolkit documentation.
Thanks for any answer!