I am a PhD student researching parallel programming. In my next research paper, I aim to present high-performance OpenCL implementations of the Basic Linear Algebra Subprograms (BLAS) – especially the matrix multiplication routine GEMM – for matrix sizes as used in deep learning; my target hardware is the NVIDIA Tesla K20 GPU. To strengthen my evaluation, I want to compare against the fastest state-of-the-art BLAS implementation that targets NVIDIA Tesla GPUs.
My question is: which BLAS implementation is currently the fastest for NVIDIA Tesla GPUs on matrix sizes as used in deep learning? Is it the cuBLAS library, or is there a faster alternative?
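For context on how I plan to run the comparison: a minimal sketch of the throughput metric I intend to report, using the standard 2·m·n·k operation count for GEMM (C = αAB + βC). The matrix sizes and the timing value in the example are hypothetical placeholders, not measurements:

```python
# Sketch: computing GEMM throughput for a benchmark comparison.
# The sizes and timing below are hypothetical placeholders.

def gemm_flops(m, n, k):
    """Standard floating-point operation count for C = alpha*A*B + beta*C."""
    return 2 * m * n * k

def gflops_per_sec(m, n, k, seconds):
    """Achieved throughput in GFLOP/s for one GEMM call timed at `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9

# Example: a hypothetical deep-learning-sized GEMM timed at 5 ms.
print(gflops_per_sec(1024, 1024, 4096, 5e-3))
```

I would time both my OpenCL kernels and the baseline (e.g. cuBLAS) on the same sizes and report this metric for each.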
Many thanks in advance.