Efficient DFT implementation
Tags: DFT, FFT

I am interested in CUDA implementations of the DFT for sizes that are not powers of 2, such as 1200-point and 12-point transforms. cuFFT supports these sizes, but I am not entirely satisfied with its performance. For example, a 1200-point DFT takes more than 3 times as long as a 2048-point FFT. Profiling shows that the 1200-point cuFFT transform launches 5 kernels: radix2, radix4, radix5, radix5, and radix6. Can someone point me to fast CUDA DFT implementations? Alternatively, if you believe cuFFT's performance for 1200 points is reasonable, I would appreciate an explanation. Thanks.
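For context, the five kernels in the profile multiply out to the transform length (2 x 4 x 5 x 5 x 6 = 1200), which is what a mixed-radix Cooley-Tukey decomposition would produce. The following is a minimal CPU-side sketch of that idea in C++, not cuFFT's actual kernels. The names `naive_dft`, `smallest_factor`, and `mixed_radix_dft` are hypothetical, and the sketch splits by smallest prime factor rather than by the radix grouping seen in the profile. It may be useful as a correctness reference when testing a custom CUDA implementation.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Reference O(N^2) DFT: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/N).
std::vector<cd> naive_dft(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce j*k modulo n to keep the angle small and accurate.
            const double ang = -2.0 * pi * static_cast<double>((j * k) % n)
                               / static_cast<double>(n);
            acc += x[j] * std::polar(1.0, ang);
        }
        out[k] = acc;
    }
    return out;
}

// Smallest prime factor of n; returns n itself when n is prime.
std::size_t smallest_factor(std::size_t n) {
    assert(n >= 2);
    for (std::size_t p = 2; p * p <= n; ++p) {
        if (n % p == 0) return p;
    }
    return n;
}

// Recursive mixed-radix decimation-in-time Cooley-Tukey (hypothetical helper).
// With n = r * m, the input is split into r interleaved subsequences of
// length m; their DFTs are combined using twiddle factors exp(-2*pi*i*q*k/n).
std::vector<cd> mixed_radix_dft(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    if (n <= 1) return x;
    const std::size_t r = smallest_factor(n);
    if (r == n) return naive_dft(x);  // prime length: fall back to O(N^2)
    const std::size_t m = n / r;
    const double pi = std::acos(-1.0);

    // Transform each subsequence x[q], x[q + r], x[q + 2r], ...
    std::vector<std::vector<cd>> sub(r);
    for (std::size_t q = 0; q < r; ++q) {
        std::vector<cd> s(m);
        for (std::size_t j = 0; j < m; ++j) s[j] = x[j * r + q];
        sub[q] = mixed_radix_dft(s);
    }

    // Radix-r butterfly: each output combines one value from every sub-DFT.
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd acc(0.0, 0.0);
        for (std::size_t q = 0; q < r; ++q) {
            const double ang = -2.0 * pi * static_cast<double>((q * k) % n)
                               / static_cast<double>(n);
            acc += std::polar(1.0, ang) * sub[q][k % m];
        }
        out[k] = acc;
    }
    return out;
}
```

Calling `mixed_radix_dft` on a 1200-element vector recurses through the prime factors 2, 2, 2, 2, 3, 5, 5. A GPU library would presumably merge some of these into larger radices (such as 4 and 6) to reduce the number of kernel launches, since each separate kernel likely implies another pass over the data in global memory.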