I am working on performance comparisons in CUDA C++ and want to implement Hadamard operations (element-wise multiplication) on matrices using tensor cores. I would like to know which approaches I can use to achieve this. I know that the cuBLAS and cuTENSOR libraries provide optimized tensor and matrix operations, but I'm not sure whether they support Hadamard products. With wmma kernels, I couldn't see how such an operation would be expressed. So: is there a way to implement this Hadamard operation using tensor cores?
Here is an example of a Hadamard operation:
A = [[2 3][4 1]]
B = [[5 1][2 3]]
A ⊙ B = [[2×5 3×1][4×2 1×3]]
A ⊙ B = [[10 3][8 3]]
For an elementwise operation like this, you’re going to be memory-bound in the general case. Ordinary CUDA kernel methods should be able to achieve nearly optimal performance (the highest throughput your memory bandwidth allows). The primary issues to address are optimal use of memory (coalesced global loads/stores) and exposing enough parallelism to saturate your GPU. A basic CUDA tutorial will give you enough skill to write such a kernel yourself. If you want a library-based approach, a thrust::transform should work well; a sketch of both options follows.
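Here is a minimal sketch of both options, assuming the matrices are stored contiguously so they can be treated as flat arrays; the grid/block sizes are illustrative, not tuned:

```
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Grid-stride elementwise (Hadamard) multiply: C[i] = A[i] * B[i].
// Adjacent threads touch adjacent elements, so global loads/stores coalesce.
__global__ void hadamard(const float* __restrict__ A,
                         const float* __restrict__ B,
                         float* __restrict__ C,
                         size_t n)
{
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        C[i] = A[i] * B[i];
}

int main()
{
    const size_t n = 1 << 24;                 // a 2D matrix viewed as a flat array
    thrust::device_vector<float> A(n, 2.0f);
    thrust::device_vector<float> B(n, 3.0f);
    thrust::device_vector<float> C(n);

    // Option 1: plain CUDA kernel
    hadamard<<<256, 256>>>(thrust::raw_pointer_cast(A.data()),
                           thrust::raw_pointer_cast(B.data()),
                           thrust::raw_pointer_cast(C.data()), n);
    cudaDeviceSynchronize();

    // Option 2: thrust::transform with a multiplies functor
    thrust::transform(A.begin(), A.end(), B.begin(), C.begin(),
                      thrust::multiplies<float>());

    float c0 = C[0];
    printf("C[0] = %f\n", c0);                // expect 6.0
    return 0;
}
```

Either path issues one load of A, one load of B, and one store of C per element, which is the floor for this operation, so the measured time should track your memory bandwidth.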
cuBLAS doesn’t have support for elementwise matrix multiplication, and I don’t think a wmma method is sensible either. Tensor cores are designed to deliver a conventional matrix product (rows dotted with columns), not anything elementwise. I can’t imagine a method to trick either of those approaches into working elementwise. Perhaps someone will come up with a clever approach, but my question would be: why bother? Since you will be memory-bound, there is no way that a higher-throughput compute path (if one exists; I don’t think it does) will provide a meaningful benefit.
So really make sure that element-wise multiplication (it does not matter whether it is 1D vectors or 2D matrices, as long as it is strictly element-wise) is what you actually want and need.
One can trick the Tensor Cores into doing element-wise multiplies by padding the matrices with lots of zeroes so that only the multiplications you need survive, but then the trick is on you: you would perhaps use only around 1%-5% of the Tensor Core performance productively, since every useful product drags a pile of multiplications by zero along with it.
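To illustrate with the example above (one possible embedding, assumed here purely for the sake of argument, not a recommendation): flatten A and B and place their elements on the diagonals of otherwise-zero tiles, so the matrix product reduces to the Hadamard product along the diagonal:

diag([2 3 4 1]) × diag([5 1 2 3]) = diag([10 3 8 3])

Only 4 of the 4×4×4 = 64 multiply-adds in that product contribute to the result, and at a real 16×16×16 wmma tile size the productive fraction stays similarly tiny.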