My understanding is that the FFT is mathematically equivalent to the DFT, and that the only purpose of using the FFT is to reduce the computational complexity (of the transform, and of convolutions built on it) from O(n^2) to O(n log n) and the storage complexity from O(n^2), for a dense DFT matrix, to O(n). The tricky part is that the FFT uses a hierarchical data structure, which is not ideal for exploiting parallelism. In my experience, a direct matrix-vector multiplication can be much faster on a GPU than working through a hierarchical data structure. So how does NVIDIA implement the FFT? What is the difference between cuFFT and the DFT?

The cuFFT developers decide what information is shareable and what is not. Since cuFFT is closed source, various implementation details are not documented or shared. You can get a general idea of how an FFT can be implemented efficiently by studying a high-quality open-source FFT project such as FFTW.

The DFT is a function of an array, usually one representing equally spaced samples of some function or signal. The FFT is an algorithm that computes that function. I can't answer your question about how NVIDIA implements it, but the difference between cuFFT and the DFT is that the former is an implementation (cuFFT) of an algorithm (the FFT) that computes the latter (the DFT).
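
To make the "definition versus algorithm" distinction concrete, here is a minimal sketch in plain Python. The names `dft_direct` and `fft_recursive` are made up for this illustration, the input length is assumed to be a power of two, and none of this reflects how cuFFT works internally. It simply evaluates the DFT definition directly and compares the result with a textbook radix-2 Cooley-Tukey FFT.

```python
import cmath


def dft_direct(x):
    """Evaluate the DFT definition term by term: O(n^2) operations.

    Equivalent to multiplying x by the dense n-by-n DFT matrix.
    """
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_recursive(x):
    """Textbook radix-2 Cooley-Tukey FFT: O(n log n) operations.

    Assumption for this sketch: len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_recursive(x[0::2])  # DFT of the even-indexed samples
    odd = fft_recursive(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
a = dft_direct(signal)
b = fft_recursive(signal)
max_diff = max(abs(p - q) for p, q in zip(a, b))
print("results agree:", max_diff < 1e-9)
```

Both functions compute the same mathematical object; they differ only in operation count and in floating-point rounding.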

Thanks for the conceptual explanation. The reason I am curious about the implementation is that different strategies lead to different memory usage. One benefit of the tree structure (the traditional way of implementing the FFT) is that it saves memory and is cheaper when run sequentially. In a parallel setting, the traditional operation-count analysis does not help much, and GPU occupancy matters more for throughput. One of our colleagues observed large memory usage from cuFFT, which led me to guess that cuFFT may not use a tree, for the sake of SM utilization. One of the dilemmas of using a library is that you hardly know when it does not work…
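
On the tree-versus-parallelism point, one thing worth noting is that the FFT recursion can be unrolled into log2(n) flat stages over a single length-n array. The sketch below is a generic textbook iterative radix-2 FFT in plain Python, with the hypothetical names `reverse_bits` and `fft_iterative`, and again assumes a power-of-two length. It is not a description of cuFFT, whose internals are not public. It is only meant to show that within each stage the n/2 butterflies touch disjoint pairs of elements, so no explicit tree needs to be stored.

```python
import cmath


def reverse_bits(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_iterative(x):
    """Iterative radix-2 FFT working on one length-n buffer.

    Assumption for this sketch: len(x) is a power of two.
    """
    n = len(x)
    bits = n.bit_length() - 1
    assert n >= 1 and (1 << bits) == n, "length must be a power of two"

    # Bit-reversal permutation replaces the recursive even/odd splitting.
    data = [0j] * n
    for i in range(n):
        data[reverse_bits(i, bits)] = complex(x[i])

    size = 2
    while size <= n:  # log2(n) stages, executed one after another
        half = size // 2
        # Within a stage, every (start, k) butterfly reads and writes its own
        # pair of elements, so all n/2 of them could in principle run in parallel.
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                top = data[start + k]
                bottom = data[start + k + half] * w
                data[start + k] = top + bottom
                data[start + k + half] = top - bottom
        size *= 2
    return data


spectrum = fft_iterative([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(spectrum)
```

In this form the working storage is the single array of n values plus whatever twiddle factors are cached. A library may still allocate extra scratch space for other reasons, so this does not explain any particular memory figure observed with cuFFT.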