Ok, I posted some exhaustive benchmarks for sgemm here (scroll to bottom):
https://code.google.com/p/maxas/wiki/sgemm
I didn’t realize I was actually running 4.8% faster than cuBLAS at the 4096 matrix size, and as much as 25% faster at smaller matrix sizes. Read the rest of the lengthy article to find out how.
CudaaduC: there are memory-bound plots in these benchmarks. You can see how well the L2 is able to hide this up to about the 4096 matrix size. As for advice, if you can hold out a couple of months, GM200 is just around the corner with more cores and a 384-bit bus:
http://www.loadthegame.com/2014/09/27/new-gm200-gpu-rumored-nvidia-gtx-980-ti-gtx-titan-black/
Maddy: indeed the reuse flag drops power levels, enough so that an implementation with higher reuse coverage can run at a higher sustained clock. I’ll have the normalized float code checked in shortly, though it’s pretty simple stuff. You just change the size of your memory allocation, initialize it appropriately, then pass in a different flag for the texture format. The kernel requires no changes.
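To make the host-side part of that concrete, here is a rough sketch in plain Python rather than the actual checked-in code. It assumes a 16-bit unsigned normalized format (the post above doesn't say which format is used), and the helper names are made up for illustration. It only shows how the allocation size changes and emulates the integer-to-float mapping that a normalized texture read performs, which is presumably why the kernel itself needs no changes.

```python
from array import array

def alloc_bytes(n, bytes_per_element):
    """Bytes needed for an n x n matrix at the given element width."""
    return n * n * bytes_per_element

def quantize_unorm16(values):
    """Host-side init (assumed format): clamp floats to [0, 1] and store as 16-bit unsigned ints."""
    return array("H", (round(min(max(v, 0.0), 1.0) * 65535) for v in values))

def read_unorm16(raw):
    """Emulate a normalized texture read: the stored integer comes back as integer / 65535."""
    return [r / 65535.0 for r in raw]

n = 4096
fp32_bytes = alloc_bytes(n, 4)   # regular 32-bit float matrix
u16_bytes = alloc_bytes(n, 2)    # half the memory (and bandwidth) per matrix

sample = [0.0, 0.25, 0.5, 1.0]
roundtrip = read_unorm16(quantize_unorm16(sample))
max_err = max(abs(a - b) for a, b in zip(sample, roundtrip))

print(fp32_bytes, u16_bytes, max_err)
```

The kernel still sees ordinary floats coming out of the texture fetch; what changes is how many bytes sit behind each one and how much precision the input data keeps.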
I’ll see if I can play around with XMADs today. The first thing I want to understand is how all the flags work so I know what’s possible with that instruction. I strongly suspect that Nvidia has optimized the hell out of FFMA and XMAD isn’t likely to touch it efficiency-wise. On a related note, I confirmed that FFMA.FTZ consumes the same amount of power as FFMA, so it isn’t of much value.