I am looking for a fast kernel that does a pointwise (element-by-element) matrix multiplication. Is there a way to pull this off using the CUBLAS library?
Specifically, I want to perform frequency-domain filtering, multiplying one 32K FFT result element by element with each of multiple 32K filters. Does anyone know of existing code that I can use instead of pulling apart the MatrixMultiply SDK kernel to write my own?
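To make the operation concrete, here is a minimal CPU reference sketch in plain C of what I would like the GPU to do. The names `cpx` and `pointwise_multiply`, and the layout (filters stored back to back, complex values as interleaved floats), are only my assumptions for illustration and are not taken from CUBLAS or the SDK.

```c
#include <stddef.h>

/* Hypothetical complex type: interleaved single-precision real/imaginary parts. */
typedef struct { float re, im; } cpx;

/*
 * Multiply one spectrum of n bins against num_filters filters, bin by bin:
 *   out[f*n + k] = spectrum[k] * filters[f*n + k]   (complex product)
 * Assumes the filters are stored contiguously, one after another.
 */
void pointwise_multiply(const cpx *spectrum, const cpx *filters, cpx *out,
                        size_t num_filters, size_t n)
{
    for (size_t f = 0; f < num_filters; ++f) {
        for (size_t k = 0; k < n; ++k) {
            cpx a = spectrum[k];
            cpx b = filters[f * n + k];
            /* (a.re + i a.im) * (b.re + i b.im) */
            out[f * n + k].re = a.re * b.re - a.im * b.im;
            out[f * n + k].im = a.re * b.im + a.im * b.re;
        }
    }
}
```

In my case n would be 32K, and each (f, k) pair is independent of the others, so I would expect one GPU thread per output element to be a natural mapping.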
Thanks for your help,