I trained my model in fp16, intending to speed it up by using Tensor Cores.
I profiled the training to see Tensor Core utilization; here's the result:
I'm wondering why kernels like
volta_fp16_sgemm_fp16_128x64_tn cannot use Tensor Cores, yet
volta_fp16_s884gemm_fp16_128x256_ldg8_f2f_tn can. Also, what does
s884 stand for?
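For context, here is a minimal sketch of the kind of fp16 GEMM these kernels compute. The shapes are hypothetical (borrowed from the 128x64 tile size in the first kernel's name), and this is plain NumPy on the CPU, not the actual CUDA kernel:

```python
import numpy as np

# Hypothetical fp16 matrix multiply in the spirit of the profiled kernels.
# The 128x64 output shape echoes the tile size in the kernel name; it is
# illustrative only -- the real kernels run on the GPU, not in NumPy.
M, K, N = 128, 256, 64

A = np.random.rand(M, K).astype(np.float16)  # fp16 inputs, as in fp16 training
B = np.random.rand(K, N).astype(np.float16)

C = A @ B  # NumPy keeps the fp16 dtype for the result

print(C.shape, C.dtype)  # (128, 64) float16
```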