Hi, when I perform a matrix multiplication using the function cublasSgemm, I get bad performance, even with small matrices. I have profiled the code, and the problem is that the function uses a huge number of registers, so the Jetson can launch only two blocks in parallel. Is there a way to reduce the number of registers used by the function? I have tried the command-line option --maxrregcount=xx, but it doesn't work.
No, there is no way to modify the register footprint of code that is already compiled, as is the case with any precompiled library such as cuBLAS. The --maxrregcount switch affects only code that you compile yourself with nvcc, not code that you merely link against from a library.
Having said that, if you have ideas for improving the performance of the cuBLAS library on Jetson, you might file a bug at developer.nvidia.com.
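For anyone wondering how register usage turns into "only two blocks in parallel", here is a rough back-of-the-envelope sketch in plain C++. The function name blocksPerSmByRegisters and all the numbers used with it are illustrative assumptions, not values taken from cuBLAS or from a specific Jetson board. It also ignores register allocation granularity, shared memory, and the hardware caps on blocks and threads per SM, so treat the result as an upper bound rather than an exact occupancy figure.

```cpp
// Rough estimate of how many thread blocks can be resident on one SM when
// registers are the only limiting resource. Hypothetical helper: it ignores
// allocation granularity, shared memory, and per-SM block/thread limits.
int blocksPerSmByRegisters(int regsPerSm, int regsPerThread, int threadsPerBlock) {
    if (regsPerSm <= 0 || regsPerThread <= 0 || threadsPerBlock <= 0) {
        return 0;  // invalid input, nothing can be scheduled
    }
    // Registers one block needs: every thread gets its own copy.
    const long long regsPerBlock =
        static_cast<long long>(regsPerThread) * threadsPerBlock;
    // Integer division: only whole blocks can be resident.
    return static_cast<int>(regsPerSm / regsPerBlock);
}
```

As an example, assuming 65536 registers per SM (check regsPerMultiprocessor from cudaGetDeviceProperties for your device) and 256 threads per block, a kernel using 128 registers per thread gives blocksPerSmByRegisters(65536, 128, 256) == 2, while 64 registers per thread would allow 4. For a kernel inside a precompiled library you can observe this limit in the profiler, but you cannot change it.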