I ran 2D cuFFT transforms in FP16 and BF16 on an RTX 3080 with CUDA 11.1. Why are FP16 and BF16 slower than FP32?
size | FP32 (us) | FP16 (us) | BF16 (us) |
---|---|---|---|
512*512 | 17 | 27 | 18 |
1024*1024 | 59 | 76 | 33 |
2048*2048 | 233 | 288 | 279 |
4096*4096 | 1161 | 1124 | 1099 |
The code that calls the FP16 cuFFT API:
cufftHandle plan_fp16;
cufftCreate(&plan_fp16);
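const int rank = 2;  // 2D transform; const so it can be used as an array bound
int batch = 1;
size_t ws = 0;  // receives the work area size in bytes
long long size_arr[rank] = {N, N};  // N x N transform
// Input, output, and execution types are all half-precision complex.
cufftXtMakePlanMany(plan_fp16, rank, size_arr, NULL, 0, 0,
                    CUDA_C_16F, NULL, 0, 0, CUDA_C_16F, batch, &ws, CUDA_C_16F);
// Out-of-place forward transform.
cufftXtExec(plan_fp16, d_in_fp16, d_out_fp16, CUFFT_FORWARD);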