CUBLAS and CUFFT profile - vast differences in warps launched

I am profiling CUBLAS DGEMM and CUFFT (running FFTs in batches) on 2 GPUs, and I see the following behavior:

FFT - 2 GPUs, data size = 8192
[16384 1 1] [32 8 1]
[32768 1 1] [8 8 2]
[16384 1 1] [32 8 1]
[32768 1 1] [8 8 2]

Warps and threads launched: 10,000 X 4 X 2 and 30,000 X 4 X 2 (4 batches and 2 GPUs)
Instruction count: 32945420
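
For context, the FFT side is driven roughly as in the minimal sketch below: one batched 1-D double-precision complex transform per GPU via cufftPlanMany. The transform length (8192) and the 4-batch / 2-GPU split follow the numbers above; the in-place execution and contiguous layout are assumptions for illustration.

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

// Minimal sketch of the FFT side: one batched 1-D Z2Z transform per GPU.
int main()
{
    const int n = 8192;   // transform length
    const int batch = 4;  // batches per GPU
    const int nGpus = 2;

    for (int dev = 0; dev < nGpus; ++dev) {
        cudaSetDevice(dev);

        cufftDoubleComplex *data;
        cudaMalloc(&data, sizeof(cufftDoubleComplex) * n * batch);

        cufftHandle plan;
        int dims[1] = { n };
        // Batched plan: the [16384 1 1][32 8 1] style launch configurations
        // reported by the profiler are chosen inside the library.
        cufftPlanMany(&plan, 1, dims,
                      NULL, 1, n,   // input layout (contiguous)
                      NULL, 1, n,   // output layout (contiguous)
                      CUFFT_Z2Z, batch);

        cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(data);
    }
    return 0;
}
```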

CUBLAS DGEMM - 2 GPUs, data size = 8192
[64 128 1] [64 4 1] (each GPU)

Warps and threads launched: 4680 X 2 and 149760 X 2 (2 GPUs)
Instruction count: 1980502000
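
The GEMM side corresponds to a single library call per GPU along the lines of the sketch below. Square 8192 x 8192 matrices and the one-call-per-GPU split are assumptions here; the matrices are left uninitialized because only the launch shape matters for the profile.

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Minimal sketch of the GEMM side: one 8192 x 8192 DGEMM per GPU.
int main()
{
    const int N = 8192;
    const double alpha = 1.0, beta = 0.0;

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);

        double *A, *B, *C;
        cudaMalloc(&A, sizeof(double) * N * N);
        cudaMalloc(&B, sizeof(double) * N * N);
        cudaMalloc(&C, sizeof(double) * N * N);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // CUBLAS picks the kernel and its [64 128 1][64 4 1] style launch
        // configuration internally; the caller only sees this one call.
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, N, N, &alpha, A, N, B, N, &beta, C, N);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
    }
    return 0;
}
```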

Why is CUFFT able to schedule so many warps, while the same is not possible with CUBLAS? Does this mean CUBLAS has to suffer more data-access latency?

Hi,
BLAS and FFT are two families of functions with very different profiles: generally speaking, BLAS level 3 and xGEMM are mainly compute bound, thanks to careful reuse of data via tiling techniques; FFTs, however, are mainly memory bound, which is a totally different matter.
Given that both BLAS and FFT libraries are crucially important for many applications, and that their performance is scrutinised when evaluating the potential of a platform, you can be sure that their developers spend countless hours of effort to extract the last possible percent of efficiency the hardware can offer.
So why, with such different profiles, would you expect to see the same kind of implementation choices for both?
GEMM is compute bound: use just as many threads as necessary to fetch the data, and keep them busy computing (see the tiling sketch below).
FFT is memory bound: spawn as many threads as possible to hide the data-access latencies.
This is something of a caricature, but the idea is there. Having a lot of threads/warps scheduled is not a goal in itself; it is just one of many possible strategies to achieve the ultimate goal, which is performance.
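
To make the data-reuse point concrete, here is a deliberately simplified tiled GEMM kernel. It is nothing like the actual CUBLAS implementation (the tile size, layout and the assumption that N is a multiple of the tile size are all toy choices), but it shows why GEMM can get away with comparatively few threads: each element loaded into shared memory is reused many times, so the kernel spends most of its time computing rather than waiting on memory.

```cpp
#define TILE 16

// Toy tiled DGEMM, C = A * B, for square N x N matrices with N % TILE == 0.
// Each TILE x TILE thread block stages one tile of A and one tile of B in
// shared memory and reuses every loaded element TILE times, which is what
// pushes GEMM towards being compute bound instead of memory bound.
__global__ void tiledDgemm(const double *A, const double *B, double *C, int N)
{
    __shared__ double As[TILE][TILE];
    __shared__ double Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    double acc = 0.0;

    for (int t = 0; t < N / TILE; ++t) {
        // One global load of A and B per thread per tile...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // ...followed by TILE multiply-adds that reuse the shared tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```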

Thank you, I understand that they are very different and cannot be compared. However, I have difficulty fully picturing how the GPU behaves under compute-bound versus memory-bound workloads. I see that for DGEMM the occupancy is only 33%; since it is compute bound, I thought more threads would be allocated, given that the number of instructions per thread is higher than for the FFTs. I might be misunderstanding something; if possible, please explain or point me to some document?

A good reference in that regard is this presentation.
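
As a side note, if your toolkit is recent enough to provide the occupancy API, you can query the theoretical occupancy for a given block size directly, as in the sketch below. The dummy kernel is only a placeholder for the (inaccessible) CUBLAS kernel, whose register and shared-memory usage is what actually limits it to around 33%.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: stands in for the real CUBLAS DGEMM kernel, whose
// register/shared-memory footprint is what actually determines occupancy.
__global__ void dummyKernel(double *out) { out[threadIdx.x] = 0.0; }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;  // matches the [64 4 1] = 256-thread blocks
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  dummyKernel, blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    // Theoretical occupancy = resident warps / maximum resident warps per SM.
    printf("theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}
```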