I am releasing my assembler for Kepler GPUs.
It uses the Maxas frontend to parse assembly; I modified the code so that it can assemble for Kepler.
Anyone who wants to optimize an application on Kepler for extreme performance can try my assembler.
I also give an example of how to optimize SGEMM on Kepler:
I optimized SGEMM to reach 3104 Gflop/s (88%), which is 15% higher than cuBLAS.
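To put those figures in context, here is a rough back-of-the-envelope sketch in Python. It assumes the 88% is measured against the GPU's theoretical single-precision peak and that "15% higher" means a 1.15x ratio over cuBLAS; neither is stated explicitly above, and the function names are made up for illustration. The only standard fact used is that an M x N x K SGEMM performs 2*M*N*K floating-point operations.

```python
# Back-of-the-envelope arithmetic for the SGEMM figures quoted above.
# Assumptions (not stated in the post): 88% is relative to theoretical
# FP32 peak, and "15% higher" means measured = 1.15 * cuBLAS.

def sgemm_gflops(m, n, k, seconds):
    """Gflop/s of an M x N x K SGEMM: 2*M*N*K flops (one mul + one add per term)."""
    return 2.0 * m * n * k / seconds / 1e9

def implied_peak_gflops(measured_gflops, efficiency):
    """Peak throughput implied by a measured rate and an efficiency fraction."""
    return measured_gflops / efficiency

def implied_baseline_gflops(measured_gflops, gain_fraction):
    """Baseline rate implied by 'measured is gain_fraction higher than baseline'."""
    return measured_gflops / (1.0 + gain_fraction)

measured = 3104.0  # Gflop/s, from the post
peak = implied_peak_gflops(measured, 0.88)
cublas = implied_baseline_gflops(measured, 0.15)

print(f"implied peak:   {peak:.0f} Gflop/s")
print(f"implied cuBLAS: {cublas:.0f} Gflop/s")
```

Under those assumptions the implied peak is roughly 3527 Gflop/s and the implied cuBLAS rate is roughly 2699 Gflop/s. To reproduce the measured number yourself, time the kernel and pass the matrix sizes and elapsed seconds to `sgemm_gflops`.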
I also provide an automatic method to crack several generations of GPUs (I've tested it on Fermi and Maxwell GPUs).