CPU: Intel Core i5-4570
MSVS Community 2017 v15.9.7
Platform Toolset: Visual Studio 2017 (v141)
Build: Release x64
GPU: GeForce GT 640 (compute capability 3.0)
CUDA Compilation tools R10.1, V10.1.105
CUDA Driver Ver.: 10.1
I have been using CUDA for the last couple of months. The goal of my research is to develop a performance-optimized 2D DCT transform kernel function, i.e. to minimize processing time. Since the transform is used for video processing, data is processed in batches. The transform can be described by the equation C = A * B * AT, where A and its transpose AT are predefined matrices. All matrices are of size 32 x 32.
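For clarity, the transform above can be sketched as plain CPU code (a minimal illustration only; the function name `transform_2d` is hypothetical and not taken from my implementation):

```cpp
// CPU reference for the transform C = A * B * A^T with 32x32 matrices.
// A minimal sketch; transform_2d is a hypothetical name, not code from
// the original implementation.
const int DIM = 32;

void transform_2d(const double A[DIM][DIM], const double B[DIM][DIM],
                  double C[DIM][DIM])
{
    double T[DIM][DIM];  // scratch buffer: T = A * B
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double s = 0.0;
            for (int k = 0; k < DIM; k++)
                s += A[i][k] * B[k][j];
            T[i][j] = s;
        }
    // C = T * A^T: the transpose shows up as A[j][k] instead of A[k][j]
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double s = 0.0;
            for (int k = 0; k < DIM; k++)
                s += T[i][k] * A[j][k];
            C[i][j] = s;
        }
}
```

The GPU variants compute the same result, just for 12960 such tiles at once.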
I first developed my own kernel function, and to check for potential improvement I also developed a variant using cuBLAS. The function cublas&lt;t&gt;gemmBatched() was used for this purpose, called twice for the two multiplications in the equation above. The batch size is 12960. The results of both variants were compared at the end. I expected the cuBLAS variant to be faster, but the variant with my own kernel function is almost 10x faster. How can this be explained?
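The cuBLAS call pattern looks roughly like the following sketch (the handle, the pointer-array names dA/dB/dT/dC, single precision, and column-major 32x32 tiles are all assumptions for illustration, not my actual code; since A is the same matrix for every batch element, every entry of dA can point to the same device matrix):

```cpp
#include <cublas_v2.h>

// Sketch of the two batched multiplications for C = A * B * A^T.
// cuBLAS expects column-major storage and arrays of device pointers,
// one pointer per batch element.
void dct_batched(cublasHandle_t handle,
                 const float *const dA[],  // A (same matrix repeated)
                 const float *const dB[],  // input tiles B
                 float *const dT[],        // scratch: T = A * B
                 float *const dC[])        // output:  C = T * A^T
{
    const int n = 32, batch = 12960;
    const float one = 1.0f, zero = 0.0f;

    // First multiplication: T = A * B for every tile in the batch
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n, &one, dA, n, dB, n, &zero,
                       dT, n, batch);

    // Second multiplication: C = T * A^T (second operand transposed)
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                       n, n, n, &one, dT, n, dA, n, &zero,
                       dC, n, batch);
}
```

Note that this variant writes the intermediate result T to global memory between the two calls, whereas my own kernel performs both multiplications in a single launch.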