Why is CUFFT able to schedule so many warps, while the same is not possible with CUBLAS? Does this mean CUBLAS has to suffer more data-access latency?
Hi,
BLAS and FFT are two families of routines with very different profiles: generally speaking, BLAS level 3 routines such as xGEMM are mainly compute bound, thanks to careful reuse of data via tiling techniques; FFTs, however, are mainly memory bound, which is a totally different matter.
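To make the tiling point concrete, here is a minimal sketch of a shared-memory tiled matrix multiply. This is a deliberately simplified toy kernel, not the actual CUBLAS implementation, and it assumes square row-major matrices whose size is a multiple of the tile width:

```
#define TILE 16

// Toy tiled SGEMM: C = A * B, with A, B, C square N x N row-major matrices,
// N assumed to be a multiple of TILE, grid of (N/TILE, N/TILE) blocks of
// TILE x TILE threads. Each tile of A and B is loaded from global memory
// once and then reused TILE times from shared memory, which is what shifts
// the kernel from memory bound towards compute bound.
__global__ void tiledSgemm(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk over the tiles of A and B needed for this output tile.
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Each element loaded above is read TILE times from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Each global load is amortised over TILE multiply-adds, so arithmetic rather than bandwidth becomes the limiting factor; the real library kernels push this idea much further with register blocking.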
Given that both BLAS and FFT libraries are crucially important for many applications, and that their performance is scrutinised when evaluating the potential of a platform, you can be sure their developers spend countless hours of effort to squeeze out the last possible percent of efficiency the hardware can offer.
So why on earth, with such different profiles, would you expect to see the same type of implementation choices for both?
GEMM is compute bound: use just as many threads as necessary to fetch the data, and keep them busy computing
FFT is memory bound: spawn as many threads as possible to hide the data-access latencies
This is sort of a caricature, but the idea is there. Having a lot of threads/warps scheduled is not a goal in itself; it is just one of the many possible strategies for reaching your ultimate goal, which is performance.
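For contrast, a memory-bound kernel like the toy one below does almost no arithmetic per byte moved, so there is nothing to reuse; the only lever left is keeping many warps in flight so that loads from some warps overlap the stalls of others (again, this is just an illustration, not how CUFFT is actually written):

```
// Toy memory-bound kernel: one multiply per element loaded and stored.
// Performance comes from having enough resident warps to hide the
// global-memory latency, not from data reuse.
__global__ void scale(const float *in, float *out, float alpha, size_t n)
{
    // Grid-stride loop: launch as many threads as the hardware will take
    // and let each one stream through its share of the array.
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        out[i] = alpha * in[i];
}
```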
Thank you, I understand that they are very different and cannot be compared. However, I have difficulty picturing how the GPU behaves under compute-bound versus memory-bound workloads. I see that for DGEMM the occupancy is only 33%; since it is compute bound, I would have expected more threads to be allocated, given that the number of instructions per thread is higher than for FFTs. I might be misunderstanding something; if possible, please explain or point me to some document.
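For reference, the occupancy ceiling of a kernel can be queried directly with the CUDA occupancy API; a minimal sketch is below (myKernel is a placeholder standing in for a library kernel, not the actual CUBLAS DGEMM):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel under discussion.
__global__ void myKernel(float *data) { data[threadIdx.x] += 1.0f; }

int main()
{
    int device = 0, blockSize = 256, maxBlocksPerSM = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // How many blocks of this kernel fit on one SM, given its register
    // and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  myKernel, blockSize, 0);

    double occupancy = (double)(maxBlocksPerSM * blockSize) /
                       prop.maxThreadsPerMultiProcessor;
    printf("Occupancy ceiling: %.0f%%\n", occupancy * 100.0);
    return 0;
}
```

A kernel that needs many registers or a lot of shared memory per thread will report a lower ceiling even when it is the faster implementation.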