I’ve been trying to optimize a CUDA kernel where I perform a series of dot products between a pivot row and multiple other rows in a matrix. To improve performance, I attempted to use cuBLAS and leverage Tensor Cores, but the time taken actually increased compared to my custom kernel. I’m not sure what I’m missing, so I’d really appreciate some insights.
Here’s a quick rundown of what my kernel does (a simplified sketch follows the list):
I compute dot products between a single pivot row and multiple rows of a matrix.
The pivot row is reused multiple times, so I load it into shared memory to save on global memory accesses.
Each dot product result is then used to update the corresponding matrix row.
The kernel is parallelized over rows and columns, and I use shared memory to accumulate partial results for each thread block.
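To make that concrete, here’s a simplified sketch of the structure — not my exact code, and the update rule in the last loop is a placeholder:

```cpp
// Simplified sketch: one block per matrix row, threads strided over columns.
// Assumes blockDim.x is a power of two.
// Launch: pivotDotUpdate<<<numRows, 256, (n + 256) * sizeof(double)>>>(...);
__global__ void pivotDotUpdate(double* A, const double* pivot, int n, int lda)
{
    extern __shared__ double smem[];
    double* sPivot   = smem;        // n doubles: staged pivot row
    double* sPartial = smem + n;    // blockDim.x doubles: reduction scratch

    double* row = A + (size_t)blockIdx.x * lda;

    // Load the pivot row into shared memory once per block.
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        sPivot[j] = pivot[j];
    __syncthreads();

    // Per-thread partial dot product of the pivot row and this row.
    double acc = 0.0;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        acc += sPivot[j] * row[j];
    sPartial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory to get the full dot product.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sPartial[threadIdx.x] += sPartial[threadIdx.x + s];
        __syncthreads();
    }
    double dot = sPartial[0];

    // Use the dot product to update the row (placeholder update rule).
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        row[j] -= dot * sPivot[j];
}
```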
When I replaced the dot product computation with cublasDdot and tried batching the updates using cublasGemmEx, the performance dropped. My suspicion is that the problem lies in the granularity or the overhead of using Tensor Cores for this type of operation.
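For reference, the cuBLAS variant followed roughly this call pattern (a simplified sketch with placeholder names, not my actual code):

```cpp
#include <cublas_v2.h>

// d_A: row-major matrix on the device, d_pivot: pivot row on the device,
// h_dots: host buffer receiving the per-row dot products.
void dotsViaCublas(cublasHandle_t handle, const double* d_A, int lda,
                   const double* d_pivot, double* h_dots,
                   int numRows, int n)
{
    for (int r = 0; r < numRows; ++r) {
        // One kernel launch per row; with the default host pointer mode,
        // every call also synchronizes to copy one scalar back to the host.
        cublasDdot(handle, n, d_pivot, 1, d_A + (size_t)r * lda, 1, &h_dots[r]);
    }
    // ...the batched row updates then went through cublasGemmEx.
}
```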
My Questions:
Are Tensor Cores suitable for workloads involving repeated dot products between a single row and multiple others?
Could the overhead of preparing data for cuBLAS (e.g., arranging the pivot row for repeated use) be the reason for the slowdown?
Should I stick with my custom kernel for small-scale operations like this, or is there a better way to make use of Tensor Cores for this problem?
Are there any specific optimizations or tricks for aligning a row-based computation like this to the strengths of Tensor Cores?
Thanks in advance for your help! I’d love to hear your suggestions and learn what I might be missing.
The Tensor Cores perform a small matrix-matrix multiplication, so of the three dimensions involved they reuse data along two of them and calculate dot products along the third (the common, inner one).
If you only calculate dot products with a single pivot row, you are reusing the pivot row (across the many ‘multiple rows’), but you are not reusing the multiple rows (across several pivot rows).
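In formulas (my notation, not from the post above):

```latex
C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj},
\qquad A \in \mathbb{R}^{m \times k},\ B \in \mathbb{R}^{k \times n}
```

Each A_{il} is reused n times (once per column of C) and each B_{lj} is reused m times. A single pivot row means m = 1, so every element of the large matrix is loaded for exactly one multiply-add: the GEMM degenerates into a bandwidth-bound matrix-vector product and the Tensor Core throughput sits idle.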
If your algorithm cannot productively use a multiplication with multiple pivot rows at the same time, there are potentially other ways to reuse that dimension:
If you are operating on complex numbers: complex-complex multiplication reuses each component twice, as (a+bi)(c+di) = (ac−bd) + (ad+bc)i, where each of a, b, c and d enters two products.
You can switch from floating-point to fixed-point calculations with INT8 accuracy and divide your numbers into two, three or four blocks of INT8 numbers. This also lends itself to reusing each part several times (see the sketch after this list).
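A minimal sketch of that splitting idea (my illustration, host-side for clarity; it assumes your fixed-point values fit in about 30 bits, and the scaling is application-specific):

```cpp
#include <cstdint>

// Split a 32-bit fixed-point value x (|x| < 2^30) into four signed INT8
// "limbs" so that x == l[0] + l[1]*2^8 + l[2]*2^16 + l[3]*2^24.
void splitInt32(int32_t x, int8_t limb[4])
{
    for (int i = 0; i < 4; ++i) {
        int8_t lo = (int8_t)(x & 0xFF);  // signed low byte in [-128, 127]
        x = (x - lo) >> 8;               // carry the exact remainder upward
        limb[i] = lo;
    }
}

// The product of two split values expands into 4 x 4 = 16 limb products,
// so every limb is reused four times -- the reuse an INT8 Tensor Core
// (IMMA) GEMM needs. Each shifted partial sum maps to one INT8 GEMM
// accumulated in INT32.
int64_t mulViaLimbs(const int8_t a[4], const int8_t b[4])
{
    int64_t acc = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            acc += ((int64_t)a[i] * b[j]) << (8 * (i + j));
    return acc;
}
```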
Apart from that, even if you give up some of the Tensor Cores’ speed, they could still be faster than the normal compute units.
More likely, your cuBLAS setup is suffering from inefficient memory accesses rather than from inefficient computations.
The profiling data tells me that the kernel computations changed to fused computations; other than that, the GPU is not under-utilized, and there are fewer bank conflicts.
Even if the GPU is not under-utilized, that does not mean that all of its operations are necessary — it may simply be doing more work than needed.
You can compare both kernels, yours and cuBLAS’s: Do they have the same memory transfers from/to global memory? How many compute operations does each need? Do they use Tensor Cores or the normal FP32 units?
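As a back-of-envelope example (my numbers, assuming FP64 and a single pivot pass over an n × n matrix that is read and written once):

```latex
\text{FLOPs} \approx \underbrace{2n^2}_{\text{dot products}} + \underbrace{2n^2}_{\text{row updates}} = 4n^2,
\qquad
\text{traffic} \approx \underbrace{8n^2}_{\text{read}} + \underbrace{8n^2}_{\text{write}} = 16n^2~\text{bytes}
```

That is about 0.25 FLOP per byte, far below the compute-to-bandwidth balance of any recent GPU, so both kernels should be bandwidth-bound. If the cuBLAS version moves more bytes (for example, extra passes to stage the pivot row for cublasGemmEx), it will be slower regardless of how fast the Tensor Cores compute.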