cuFFT 2D FP16 and BF16 are slower than FP32

I ran 2D cuFFT transforms in FP16 and BF16 on an RTX 3080 with CUDA 11.1. Why are FP16 and BF16 slower than FP32 at most sizes?

Size         FP32 (us)   FP16 (us)   BF16 (us)
512*512      17          27          18
1024*1024    59          76          33
2048*2048    233         288         279
4096*4096    1161        1124        1099

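The code that calls the FP16 cuFFT API:
cufftHandle plan_fp16;
cufftCreate(&plan_fp16);  // the handle must be created before cufftXtMakePlanMany
const int rank = 2;       // const so it can be used as an array size
int batch = 1;
size_t ws = 0;            // receives the work-area size in bytes
long long size_arr[rank] = {N, N};  // N is the transform size per dimension
// NULL inembed/onembed selects the default contiguous layout,
// so the stride and distance arguments are ignored.
cufftXtMakePlanMany(plan_fp16, rank, size_arr, NULL, 0, 0,
                    CUDA_C_16F, NULL, 0, 0, CUDA_C_16F, batch, &ws, CUDA_C_16F);
cufftXtExec(plan_fp16, d_in_fp16, d_out_fp16, CUFFT_FORWARD);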