Apparently someone has built a GEMM implementation for CUDA that is significantly faster than cuBLAS (4x in some cases), targeting Maxwell and later GPUs.
Yes, Scott Gray. He has posted quite a bit here on these forums, mainly in the CUDA performance area.
If past history is an indication, NVIDIA’s CUBLAS team will already be aware of this, and if the source code is available under a BSD license, they may even incorporate it directly into CUBLAS (note the list of BSD licenses for various codes in the CUBLAS manual).
BLAS, and GEMM in particular, is notorious for researchers being ahead of vendor libraries with regard to specific variants of the functionality (e.g., specific sizes, matrix aspect ratios, matrix element types, transpose modes, architecture generations). This has a long tradition in the field, going back at least to the times when Kazushige Goto bested the BLAS libraries shipping with the DEC Alpha (around 1990, I think, but my memory is hazy).