Why is performance worse with cuBLAS than with my own kernel function?

System:
CPU: Intel Core i5-4570
MSVS Community 2017 v15.9.7
Platform Toolset: Visual Studio 2017 (v141)
Build: Release x64
GPU: GeForce GT 640 (c.c. 3.0)
CUDA Compilation tools R10.1, V10.1.105
CUDA Driver Ver.: 10.1

I have been using CUDA for the last couple of months. The goal of my research is to develop a performance-optimized 2D DCT transform kernel function; the optimization target is short processing time. Since the transform is used for video processing, batches of data are processed. The transform can be described by the equation C = A * B * AT, where A is a predefined matrix and AT is its transpose. All matrices are of size 32 x 32.
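
For reference, here is a minimal CPU sketch of the per-block math, assuming row-major 32 x 32 float matrices (the names are illustrative, not taken from my actual code). The transform is computed as T = A * B followed by C = T * AT:

```cpp
// Minimal CPU reference for one 32x32 block: C = A * B * A^T
// (row-major float matrices; purely a correctness reference, not the GPU code)
constexpr int N = 32;

void dct2d_reference(const float A[N][N], const float B[N][N], float C[N][N])
{
    float T[N][N];                       // intermediate T = A * B
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < N; ++k)
                s += A[i][k] * B[k][j];
            T[i][j] = s;
        }

    // C = T * A^T, i.e. C[i][j] = sum_k T[i][k] * A[j][k]
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float s = 0.0f;
            for (int k = 0; k < N; ++k)
                s += T[i][k] * A[j][k];
            C[i][j] = s;
        }
}
```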

My own kernel function was developed first, and to check the potential for improvement a variant based on cuBLAS was developed as well. The function cublasSgemmBatched() was used for this purpose; it is called twice, once for each of the two multiplications in the equation above. The batch size is 12960. The results of both variants were compared at the end. I expected the cuBLAS variant of the transform to be faster, but the variant with my own kernel function is almost 10x faster. How can this be explained?
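
The two cuBLAS calls look roughly like the sketch below (names and the surrounding setup are simplified; filling the device pointer arrays is assumed to happen elsewhere):

```cpp
#include <cublas_v2.h>

// Hypothetical wrapper around the two batched GEMMs; the caller supplies device
// arrays of batchCount pointers (one per 32x32 matrix) for A, B, the
// intermediate T and the result C. cuBLAS assumes column-major storage.
void dct2d_batched(cublasHandle_t handle,
                   const float *const *d_Aptrs, const float *const *d_Bptrs,
                   float *const *d_Tptrs, float *const *d_Cptrs,
                   int batchCount)
{
    const int   n = 32;
    const float alpha = 1.0f, beta = 0.0f;

    // T = A * B for every matrix in the batch
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                       d_Aptrs, n, d_Bptrs, n, &beta, d_Tptrs, n, batchCount);

    // C = T * A^T for every matrix in the batch
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n, &alpha,
                       d_Tptrs, n, d_Aptrs, n, &beta, d_Cptrs, n, batchCount);
}
```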

Any idea where to search for an answer?

Should I try strided batched matrix multiplication with cublasSgemmStridedBatched(), since I operate with square matrices only? Or is there another library I should use to outperform my kernel function?

I know I am asking too many questions :) but any suggestion is welcome.

It turned out the problem was that I did not have a warmup call. Its absence mattered most for the cuBLAS variant: after a warmup call was added for both the cuBLAS variant and my own kernel, the processing time of the cuBLAS variant was 1.38x shorter.
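
For illustration, the timing now looks roughly like this (runTransform() is a hypothetical stand-in for whichever variant is being measured):

```cpp
#include <cuda_runtime.h>

void runTransform();   // hypothetical stand-in for the variant being timed

float timeTransform()
{
    // Untimed warmup call so one-time costs (context creation, cuBLAS kernel
    // selection, module loading) are not charged to the measured run.
    runTransform();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    runTransform();                   // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```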

In the next iteration I moved to cublasSgemmStridedBatched(). There cuBLAS showed even better performance: it was 1.73x faster.
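
A rough sketch of the strided-batched calls, assuming each operand is stored as one contiguous device buffer of batchCount consecutive 32 x 32 matrices (names are illustrative):

```cpp
#include <cublas_v2.h>

// Hypothetical sketch of the strided-batched variant: no per-matrix pointer
// arrays are needed, only a constant stride between consecutive matrices.
void dct2d_strided_batched(cublasHandle_t handle,
                           const float *d_A, const float *d_B,
                           float *d_T, float *d_C, int batchCount)
{
    const int       n = 32;
    const long long stride = (long long)n * n;  // elements between matrices
    const float     alpha = 1.0f, beta = 0.0f;

    // T = A * B for every matrix in the batch
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                              d_A, n, stride, d_B, n, stride,
                              &beta, d_T, n, stride, batchCount);

    // C = T * A^T for every matrix in the batch
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, n, &alpha,
                              d_T, n, stride, d_A, n, stride,
                              &beta, d_C, n, stride, batchCount);
}
```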

Using the NVIDIA Visual Profiler I identified that the low shared memory efficiency (50%) of my own kernel left room for improvement. After rewriting my kernel for vectorized memory access with the float2 data type, shared memory efficiency increased to about 98% and my kernel was 1.3x faster than cuBLAS.
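
The idea is sketched below; this is not my actual kernel, just a fragment showing how a 32 x 32 tile can be staged through float2 so each thread moves 8 bytes per transaction:

```cpp
// Illustrative fragment only (not the actual DCT kernel): one block handles one
// 32x32 matrix, blockDim = (16, 32), so 16 float2 loads cover one 32-element row.
__global__ void load_tile_vectorized(const float2 *__restrict__ src, float *dst)
{
    __shared__ float2 tile[32][16];       // 32x32 floats viewed as 32x16 float2

    const int x      = threadIdx.x;       // 0..15  (float2 column index)
    const int y      = threadIdx.y;       // 0..31  (row index)
    const int matrix = blockIdx.x;        // one 32x32 matrix per block

    // coalesced, 8-byte-wide load from global memory into shared memory
    tile[y][x] = src[matrix * 32 * 16 + y * 16 + x];
    __syncthreads();

    // ... per-block DCT arithmetic would go here ...

    // write back (placeholder: just copies the tile out as plain floats)
    float2 v = tile[y][x];
    dst[matrix * 32 * 32 + y * 32 + 2 * x]     = v.x;
    dst[matrix * 32 * 32 + y * 32 + 2 * x + 1] = v.y;
}
```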

Now I don’t know why cuBLAS exhibits such low performance. In the profiler I see that the block size of the cuBLAS kernel is relatively small (8x8), which reduces the number of possible active warps per SM and lowers occupancy (30%). Global store efficiency is also low, at 25%. Could the issue be that cuBLAS GEMM is optimized for specific matrix sizes and my application with 32x32 matrices is not in that group? I will probably move to a Tesla K40 GPU and see whether the GPU architecture makes a difference.

Generally speaking, GEMM maps to dozens of different kernels, optimized for different GPU architectures, matrix sizes, transpose modes, matrix aspect ratios etc. For a given GEMM invocation a heuristic picks the most appropriate kernel(s). The heuristic may not always pick the optimal kernel, or none of the available kernels may be the perfect fit for a particular call to GEMM.

As I recall, batched GEMMs in particular were introduced primarily to deal with very small matrices, as some applications need to handle tons of matrices of size 3x3, 4x4, or thereabouts. Matrices of size 32x32 may be close to the upper limit of what batched GEMMs were targeted to handle; check the documentation.

With regard to the sub-optimal performance observed, consider filing an enhancement request with NVIDIA. You can file one by using the bug reporting form and prefixing the synopsis with “RFE:” to mark it as an enhancement request.

Realistically, given the age of the Kepler architecture, it is unlikely that improvements will be made for compute capability 3.x, but equivalent issues may affect newer architectures. The primary targets for performance improvements in libraries are the latest GPU architectures (at this time: Turing and Volta) although some amount of back-porting of such improvements to older architectures may occur.