cuFFTDx performance does not reach cuFFT performance


I’m trying to improve performance by using the cuFFTDx library instead of cuFFT.
I created a 1024×1024 matrix of complex numbers and convolved each row with a complex vector (using an FFT, a vector multiplication, and an IFFT).
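When comparing two GPU implementations of the same convolution, it can help to have a slow CPU reference to check both against before looking at timings. The following is a minimal sketch, assuming the complex vector is a time-domain filter of the same length as a row, so that FFT, pointwise multiplication, and IFFT together compute a circular convolution. The names `circular_convolve` and `max_abs_diff` are hypothetical helpers, not part of cuFFT or cuFFTDx.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Direct O(n^2) circular convolution of one row with a filter of the same
// length. Assumption: this is the operation the FFT -> multiply -> IFFT
// pipeline is meant to compute for each row.
std::vector<cfloat> circular_convolve(const std::vector<cfloat>& row,
                                      const std::vector<cfloat>& filter) {
    const std::size_t n = row.size();
    std::vector<cfloat> out(n, cfloat(0.0f, 0.0f));
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j) {
            // (k + n - j) % n is (k - j) modulo n without going negative.
            out[k] += row[j] * filter[(k + n - j) % n];
        }
    }
    return out;
}

// Largest absolute element-wise difference between two equally sized rows.
float max_abs_diff(const std::vector<cfloat>& a, const std::vector<cfloat>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::abs(a[i] - b[i]));
    }
    return worst;
}
```

A row copied back from the GPU can then be compared with `max_abs_diff(gpu_row, circular_convolve(row, filter))`. Note that cuFFT transforms are unnormalized, so the GPU result of a forward plus inverse transform needs to be scaled by 1/N before the comparison.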

With the cuFFT library, I used an FFT and an IFFT planned with cufftPlanMany, plus a vector multiplication kernel.

With cuFFTDx, I implemented the whole convolution in one kernel, so I expected better performance because of more efficient L1 cache usage.
I wrote the convolution kernel so that each block works on a few rows of the matrix and performs the convolution on those rows. Thus, every SM executes the convolution on an amount of data that is smaller than the L1 cache. This way, L1 cache usage should be efficient and the execution time of the convolution is supposed to decrease.
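The "fits in L1" reasoning can be checked with simple arithmetic. The sketch below assumes single-precision complex samples (the post does not state the precision) and uses a hypothetical 48 KiB on-chip budget per block, which is a placeholder and not a measured value for any specific GPU. `row_bytes` and `rows_that_fit` are illustrative helper names.

```cpp
#include <cstddef>

// Bytes occupied by one row of complex samples (real part + imaginary part).
constexpr std::size_t row_bytes(std::size_t fft_size, std::size_t bytes_per_real) {
    return fft_size * 2 * bytes_per_real;
}

// Number of whole rows that fit into a given on-chip memory budget.
constexpr std::size_t rows_that_fit(std::size_t budget_bytes,
                                    std::size_t fft_size,
                                    std::size_t bytes_per_real) {
    return budget_bytes / row_bytes(fft_size, bytes_per_real);
}

// 1024-point rows as in the post; float precision is an assumption.
static_assert(row_bytes(1024, sizeof(float)) == 8 * 1024,
              "one single-precision row is 8 KiB");
// Hypothetical 48 KiB budget: at most 6 such rows per block.
static_assert(rows_that_fit(48 * 1024, 1024, sizeof(float)) == 6,
              "six 8 KiB rows fit in 48 KiB");
```

Under these assumptions a block can hold only a handful of rows at once, and the full 1024×1024 matrix (8 MiB in single precision) still has to be read from and written to global memory once, so the expected gain comes from avoiding the extra round trips between the three separate cuFFT stages rather than from caching the whole data set.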

It didn’t work, and I got better results with cuFFT. Any ideas?

You would need to post your code and hardware specs to get a performance evaluation.
Please note that cuFFTDx is still in EA (early access), so maximum performance isn’t guaranteed.