cuBLAS vs CUDA kernels Performance

Hi there,
I was trying to test the performance of the tensor cores on an Nvidia Jetson, which can be accessed through cuBLAS. I wrote three programs to perform matrix multiplication: the first used cuBLAS's `cublasSgemm`, the second was a copy of the first but with the tensor cores enabled, and the third did the matrix multiplication directly in a hand-written CUDA kernel, without cuBLAS.
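To make the comparison concrete, here is a minimal sketch of what I mean by the first two programs. Names and dimensions are simplified: `d_A`, `d_B`, `d_C` stand for device buffers allocated and filled elsewhere, and the tensor-core variant differs only in the math mode set on the handle (using the `CUBLAS_TENSOR_OP_MATH` opt-in flag from cuBLAS 9/10; it is deprecated in newer versions but still accepted).

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: n x n single-precision GEMM via cuBLAS.
// The tensor-core version only changes the math mode on the handle.
// d_A, d_B, d_C are placeholder device pointers (allocated elsewhere).
void gemm_cublas(int n, const float* d_A, const float* d_B, float* d_C,
                 bool use_tensor_cores) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    if (use_tensor_cores) {
        // Opt in to tensor cores (cuBLAS 9/10 naming, deprecated later).
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    }
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major: C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);
}
```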

Testing with a variety of matrix sizes, the CUDA kernel version was always significantly faster than the cuBLAS versions, with execution times 5-100x shorter. The CUDA kernel's runtime scaled roughly linearly with the matrix size (i.e., a 10x larger matrix led to a 10x longer execution time). The cuBLAS versions always took around 0.5 seconds, increasing only slightly for larger matrices.

Has anybody else seen this kind of performance gap between cuBLAS and hand-written CUDA kernels performing the same calculation? I had assumed the cuBLAS routines would be as optimized as possible. Could the consistently long execution times of the cuBLAS versions be caused by lazy initialization of the cuBLAS library, which wouldn't happen in the CUDA kernel version?


How exactly are you timing these executions?
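A common pitfall is timing the very first call, which includes one-time CUDA context and cuBLAS library initialization, and that alone can account for a roughly constant ~0.5 s overhead. One way to check is to do an untimed warm-up call and then time a second call with CUDA events. A rough sketch (assuming a hypothetical helper with device buffers `d_A`/`d_B`/`d_C` of `n*n` floats allocated elsewhere):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: time one SGEMM with CUDA events, excluding one-time
// initialization cost by issuing an untimed warm-up call first.
float time_sgemm_ms(cublasHandle_t handle, int n,
                    const float* d_A, const float* d_B, float* d_C) {
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up: absorbs lazy context/library initialization.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the GEMM has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

If the gap disappears once the warm-up call is excluded, the difference was initialization cost rather than GEMM throughput.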

Can you provide reproducer code?