cuBLAS: very low performance on Jetson TK1

Hi, when I perform a matrix multiplication using the function cublasSgemm I get bad performance, even with small matrices. I have profiled the code, and the problem is that the function uses a huge number of registers, so the Jetson can run only two blocks in parallel. Is there a way to reduce the number of registers used by the function? I have tried the command-line option --maxrregcount=xx, but it doesn't work.

THX

No, there is no way to modify the register footprint of code that is already compiled, as is the case with any compiled library such as cuBLAS. The --maxrregcount switch affects code that you are compiling, not code that you are only linking against from a library.

Having said that, if you have ideas for improving the performance of the cuBLAS library on Jetson, you might file a bug at developer.nvidia.com

If you’re really looking to lower the power consumption, the new TX1 is a much nicer platform.