I’m trying to optimize a program where sgemm and strsv calls are the bottleneck, by tuning the block sizes. I’m particularly interested in:
- comparison between cublas 1.1 and cublas 2.0 versions
- performance for different (including non-square) matrix sizes
- comparison of different devices
I’d appreciate it if you could tell me where to find some well-founded benchmarks, or under which conditions sgemm performs optimally.