I trained my model in fp16, intending to speed it up by using Tensor Cores.
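Here's a minimal sketch of the kind of fp16 training I mean (assuming a PyTorch setup with torch.cuda.amp; the model, shapes, and data below are just placeholders, not my real training loop):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model/optimizer; the real loop is larger but has this shape.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()  # loss scaling to avoid fp16 gradient underflow

for step in range(10):
    x = torch.randn(64, 1024, device="cuda")
    target = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with autocast():  # eligible ops (matmul, conv, ...) run in fp16
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```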
I profiled the training to check Tensor Core utilization; here's the result:
I'm wondering why kernels like volta_fp16_sgemm_fp16_128x64_tn cannot use Tensor Cores, yet volta_fp16_s884gemm_fp16_128x256_ldg8_f2f_tn can. And what does s884 stand for?
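In case it helps to reproduce what I'm seeing, a per-kernel breakdown with names like these can be collected roughly as below (a minimal sketch assuming torch.profiler; the fp16 layer and shapes are placeholders, and which GEMM kernel actually runs depends on the GPU and problem sizes):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder fp16 layer; dimensions are multiples of 8, which Tensor Cores prefer.
model = torch.nn.Linear(1024, 1024).half().cuda()
x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x).sum().backward()

# GEMM kernel names (sgemm vs. s884gemm variants on a V100) show up in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```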