System:

CPU: Intel Core i5-4570

MSVS Community 2017 v15.9.7

Platform Toolset: Visual Studio 2017 (v141)

Build: Release x64

GPU: GeForce GT 640 (compute capability 3.0)

CUDA Compilation tools R10.1, V10.1.105

CUDA Driver Ver.: 10.1

I have been using CUDA for the last couple of months. The goal of my research is to develop a performance-optimized 2D DCT transform kernel function; the optimization target is short processing time. Since the transform is used for video processing, batches of data are processed. The transform can be described by the mathematical equation C = A * B * AT, where A and AT are predefined matrices. All matrices are of size 32 x 32.
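For reference, a minimal CPU sketch of the transform as described above (not the actual GPU kernel). The orthonormal DCT-II basis used for A is an assumption on my part, since the post does not spell out the exact matrix; any predefined 32 x 32 A would slot into the same C = A * B * AT structure.

```cpp
#include <array>
#include <cmath>

constexpr int N = 32;
using Mat = std::array<double, N * N>;  // row-major N x N matrix

// C = X * Y for N x N row-major matrices
Mat matmul(const Mat& X, const Mat& Y) {
    Mat C{};
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            double x = X[i * N + k];
            for (int j = 0; j < N; ++j)
                C[i * N + j] += x * Y[k * N + j];
        }
    return C;
}

Mat transpose(const Mat& X) {
    Mat T{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            T[j * N + i] = X[i * N + j];
    return T;
}

// Assumed basis: orthonormal DCT-II matrix, so A * AT = I
Mat dct_matrix() {
    Mat A{};
    const double pi = std::acos(-1.0);
    for (int k = 0; k < N; ++k) {
        double s = (k == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
        for (int n = 0; n < N; ++n)
            A[k * N + n] = s * std::cos(pi * (2 * n + 1) * k / (2.0 * N));
    }
    return A;
}

// 2D DCT of one block: C = A * B * AT
Mat dct2d(const Mat& A, const Mat& B) {
    return matmul(matmul(A, B), transpose(A));
}
```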

My own kernel function was developed first, and to check the potential improvement, a variant using cuBLAS was developed as well. The function cublas<t>gemmBatched() was used for this purpose, called twice for the two multiplications from the equation above. The batch size is 12960. The results of both variants were compared at the end. I expected the cuBLAS variant of the transform to be faster, but the processing time with my own kernel function is almost 10x shorter. How can this be explained?
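To make the comparison concrete, here is a hedged CPU sketch of the two program structures being compared; the function names are hypothetical and all CUDA/cuBLAS details are elided. The two-pass variant mirrors the two batched GEMM calls, which by construction must write the whole batch of intermediate products T = A * B out to memory between the calls; the fused variant mirrors a single kernel in which each block's intermediate stays local (on the GPU: in shared memory or registers).

```cpp
#include <array>
#include <vector>

constexpr int N = 32;
using Mat = std::array<float, N * N>;  // row-major N x N matrix

// C = X * Y for N x N row-major matrices
Mat matmul(const Mat& X, const Mat& Y) {
    Mat C{};
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            float x = X[i * N + k];
            for (int j = 0; j < N; ++j)
                C[i * N + j] += x * Y[k * N + j];
        }
    return C;
}

// Mirrors two cublas<t>gemmBatched() calls: the entire batch of
// intermediates T[i] = A * B[i] is materialized before the second
// pass computes C[i] = T[i] * AT.
std::vector<Mat> two_pass(const Mat& A, const Mat& AT,
                          const std::vector<Mat>& B) {
    std::vector<Mat> T(B.size()), C(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) T[i] = matmul(A, B[i]);
    for (std::size_t i = 0; i < B.size(); ++i) C[i] = matmul(T[i], AT);
    return C;
}

// Mirrors a fused kernel: each item's intermediate is consumed
// immediately and never stored with the rest of the batch.
std::vector<Mat> fused(const Mat& A, const Mat& AT,
                       const std::vector<Mat>& B) {
    std::vector<Mat> C(B.size());
    for (std::size_t i = 0; i < B.size(); ++i)
        C[i] = matmul(matmul(A, B[i]), AT);
    return C;
}
```

Both variants compute the same C batch; they differ only in where the intermediate A * B lives between the two multiplications.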