I have a CUDA program that (thankfully) does what it is supposed to. Now I am looking to optimise the code and see whether there is anything I can make faster or better. An important part of the program uses a cuBLAS library function, cublasSgemm, and I was wondering whether there is any way to check the occupancy of that function (that is, whether it uses all of the available processors). Compiling with -cubin only gives me the shared memory and register usage of my own kernels, not of the library kernels. I need this information to calculate the occupancy with the CUDA occupancy calculator. Is there some way I can check the shared memory and register usage, and thus the occupancy, of the cuBLAS functions?
Or is it assumed that the cuBLAS functions are already maximally optimised and will therefore achieve maximum occupancy anyway?
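For context, this is roughly the calculation I would feed the numbers into, written as a minimal sketch. The function name `estimate_occupancy` and all of the per-multiprocessor limits used as defaults are assumptions for illustration only; the real limits depend on the GPU's compute capability, and the real occupancy calculator also accounts for register and shared memory allocation granularity, which this sketch ignores.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       max_threads_per_sm=1024, max_blocks_per_sm=8,
                       regs_per_sm=16384, smem_per_sm=16384, warp_size=32):
    """Rough occupancy estimate: active warps / maximum warps per multiprocessor.

    All default limits are hypothetical placeholders, not values for any
    particular GPU. Allocation granularity is ignored.
    """
    # Warps needed by one block (rounded up to a whole warp).
    warps_per_block = -(-threads_per_block // warp_size)
    max_warps_per_sm = max_threads_per_sm // warp_size

    # Blocks per multiprocessor allowed by each resource.
    limit_warps = max_warps_per_sm // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    limit_regs = regs_per_sm // regs_per_block if regs_per_block else max_blocks_per_sm
    limit_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm

    # The tightest limit decides how many blocks are resident at once.
    active_blocks = min(max_blocks_per_sm, limit_warps, limit_regs, limit_smem)
    return active_blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    # Example inputs are made up; for cuBLAS these are exactly the values I cannot get.
    print(estimate_occupancy(threads_per_block=256, regs_per_thread=16, smem_per_block=4096))
```

For my own kernels I can fill in the three inputs from the -cubin output; for the cuBLAS kernels the register and shared memory figures are the missing piece.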
Would appreciate any response.