I had assumed that cublasGemmEx is a Tensor Core operation, and based on that I expected to be able to run other operations on the CUDA cores at the same time. But it seems the two cannot always run concurrently.
So my question is:
Is cublasGemmEx a Tensor Core operation or a CUDA core operation? If it is a Tensor Core operation, why can't I use all of the CUDA core resources while it is running?
Also, on an Ampere machine, nsys shows the kernel for the double-precision Tensor Core GEMM with a cutlass name. How is that possible? I thought it would be a cuBLAS kernel.