I am working on performance comparisons in CUDA C++ and want to implement Hadamard operations (element-wise multiplication) on matrices using tensor cores. I would like to know which approaches I can use to achieve this. I know that the cuBLAS and cuTENSOR libraries provide optimized tensor and matrix operations, but I'm not sure whether they support Hadamard products. With wmma kernels, I couldn't see how such an operation would be expressed. So: is there a way to implement this Hadamard operation using tensor cores?
Here is an example of a Hadamard operation:
A = [[2 3][4 1]]
B = [[5 1][2 3]]
A ⊙ B = [[2×5 3×1][4×2 1×3]]
A ⊙ B = [[10 3][8 3]]
For an elementwise operation like this, you’re going to be memory-bound in the general case. Ordinary CUDA kernel methods should be able to achieve nearly optimal performance (the highest throughput your memory bandwidth allows). The primary issues to address are optimal use of memory (coalesced global loads/stores) and exposing enough parallelism to saturate your GPU. A basic CUDA tutorial will give you enough skill to write such a kernel yourself. If you want a library-based approach, a thrust::transform should work well; a sketch of both options follows.
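Here is a minimal sketch of both options, assuming the matrices are stored contiguously so they can be treated as flat arrays; the grid/block sizes are illustrative, not tuned:

```
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Grid-stride elementwise (Hadamard) multiply: C[i] = A[i] * B[i].
// Adjacent threads touch adjacent elements, so global loads/stores coalesce.
__global__ void hadamard(const float* __restrict__ A,
                         const float* __restrict__ B,
                         float* __restrict__ C,
                         size_t n)
{
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        C[i] = A[i] * B[i];
}

int main()
{
    const size_t n = 1 << 24;                 // a 2D matrix viewed as a flat array
    thrust::device_vector<float> A(n, 2.0f);
    thrust::device_vector<float> B(n, 3.0f);
    thrust::device_vector<float> C(n);

    // Option 1: plain CUDA kernel
    hadamard<<<256, 256>>>(thrust::raw_pointer_cast(A.data()),
                           thrust::raw_pointer_cast(B.data()),
                           thrust::raw_pointer_cast(C.data()), n);
    cudaDeviceSynchronize();

    // Option 2: thrust::transform with a multiplies functor
    thrust::transform(A.begin(), A.end(), B.begin(), C.begin(),
                      thrust::multiplies<float>());

    float c0 = C[0];
    printf("C[0] = %f\n", c0);                // expect 6.0
    return 0;
}
```

Either path issues one load of A, one load of B, and one store of C per element, which is the floor for this operation, so the measured time should track your memory bandwidth.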
cuBLAS doesn’t have support for elementwise matrix multiplication, and I don’t think a wmma method is sensible either. Tensor cores are designed to deliver a conventional matrix product (rows dotted with columns), not anything elementwise. I can’t imagine a method to trick either of those approaches into working elementwise. Perhaps someone will come up with a clever approach, but my question would be: why bother? Since you will be memory-bound, there is no way that a higher-throughput compute path (if one exists; I don’t think it does) will provide a meaningful benefit.
So really make sure that element-wise multiplication (it does not matter whether it is 1D vectors or 2D matrices, as long as it is strictly element-wise) is what you actually want and need.
One can trick the Tensor Cores into doing element-wise multiplies by padding the matrices with lots of zeroes so that only the multiplications you need survive, but then the trick is on you: you would perhaps use only around 1%-5% of the Tensor Core performance productively, since every useful product drags a pile of multiplications by zero along with it.
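To illustrate with the example above (one possible embedding, assumed here purely for the sake of argument, not a recommendation): flatten A and B and place their elements on the diagonals of otherwise-zero tiles, so the matrix product reduces to the Hadamard product along the diagonal:

diag([2 3 4 1]) × diag([5 1 2 3]) = diag([10 3 8 3])

Only 4 of the 4×4×4 = 64 multiply-adds in that product contribute to the result, and at a real 16×16×16 wmma tile size the productive fraction stays similarly tiny.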