Pointwise Matrix Multiply (filtering in frequency domain)

I am looking for a fast kernel that performs a pointwise (element-wise) matrix multiplication. Is there a way to do this using the cuBLAS library?

Specifically, I want to perform frequency-domain filtering by multiplying one 32K FFT result against multiple 32K filters. Does anyone know of existing code that I can use instead of pulling apart the MatrixMultiply SDK kernel to write my own?
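
To make the operation concrete, here is a minimal CPU reference sketch in C++ of what I mean. The function name `pointwise_filter` and the back-to-back filter layout are my own illustrative assumptions, not existing code. Each output bin is just the complex product of the matching signal bin and filter bin, so this is an element-wise product rather than a true matrix multiply.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical reference routine (name and memory layout are assumptions).
// signal:  one spectrum of n complex bins (e.g. n = 32768).
// filters: numFilters spectra of n bins each, stored back to back.
// returns: numFilters filtered spectra, in the same back-to-back layout.
std::vector<cfloat> pointwise_filter(const std::vector<cfloat>& signal,
                                     const std::vector<cfloat>& filters,
                                     std::size_t numFilters) {
    const std::size_t n = signal.size();
    std::vector<cfloat> out(n * numFilters);
    for (std::size_t f = 0; f < numFilters; ++f) {
        for (std::size_t i = 0; i < n; ++i) {
            // Same signal bin is reused for every filter.
            out[f * n + i] = signal[i] * filters[f * n + i];
        }
    }
    return out;
}
```

On the GPU I would expect this to map to roughly one thread per (filter, bin) pair, but I have not written or timed that version.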

Thanks for your help,