Is cublasGemmEx a Tensor Core operation or a CUDA core operation?


I was treating cublasGemmEx as a Tensor Core operation, and on that basis I assumed I could execute other operations on the CUDA cores at the same time. But it seems they cannot always run concurrently.

So my question is:
Is cublasGemmEx a Tensor Core operation or a CUDA core operation? If it is a Tensor Core operation, why can't I use all of the CUDA core resources at the same time?

Also, on an Ampere machine I see in Nsight Systems (nsys) that the kernel name for a double-precision Tensor Core GEMM is a CUTLASS kernel. How is that possible? I thought it would be a cuBLAS kernel.

Best regards,

First, the GPU does not consist only of Tensor Cores and CUDA cores. There are other on-chip resources (registers, shared memory, warp schedulers, memory bandwidth) that are shared among kernels when they run. Just because your primary compute is done on Tensor Cores doesn't mean there are enough resources left over to prepare data for concurrent CUDA core work. Tensor Cores are used to accelerate GEMMs.
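To make the first point concrete: cublasGemmEx is a cuBLAS library call, not a hardware instruction. Whether the dispatched kernel actually uses Tensor Cores depends on the data types, compute type, and alignment you pass in; cuBLAS selects the implementation. A minimal sketch (untested, assuming a device with Tensor Core support and column-major data already on the GPU):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs with FP32 accumulation: with these types, cuBLAS is free to
// (and on Volta and later typically will) dispatch a Tensor Core kernel.
void gemm_fp16(cublasHandle_t handle,
               const __half* A, const __half* B, __half* C,
               int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;  // scale type matches compute type
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,   // lda = m (column-major)
                 B, CUDA_R_16F, k,   // ldb = k
                 &beta,
                 C, CUDA_R_16F, m,   // ldc = m
                 CUBLAS_COMPUTE_32F, // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}
```

Even when a Tensor Core kernel is selected, the kernel still occupies SMs, registers, and shared memory: the loads/stores and address arithmetic around the Tensor Core MMA instructions run on ordinary pipelines, which is why the rest of the chip isn't simply free.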

What you're seeing in nsys is a cuBLAS kernel that has been built with CUTLASS. It does not affect your program; it simply means someone determined the CUTLASS implementation was optimal and chose to distribute it with cuBLAS for that release.


Thanks for the explanation. Is there any documentation that clearly describes these shared resources?

Not at that level. Just know this: if the resources were available to allow parallel work, the hardware scheduler would handle it for you.
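One way to see this for yourself is to launch the GEMM and some plain CUDA-core work on separate streams and look at the timeline in Nsight Systems. A rough sketch (untested; `scale` and `try_overlap` are hypothetical names, and the device pointers are assumed to be allocated already):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;   // plain CUDA-core work
}

void try_overlap(cublasHandle_t handle,
                 const float* A, const float* B, float* C, int m,
                 float* x, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, s1);            // GEMM on stream 1
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m,
                &alpha, A, m, B, m, &beta, C, m);

    scale<<<(n + 255) / 256, 256, 0, s2>>>(x, 2.0f, n);  // stream 2

    cudaDeviceSynchronize();  // then inspect overlap in the nsys timeline
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

Separate streams make concurrent execution *permitted*, not guaranteed: a large GEMM typically saturates the SMs, so the second kernel runs afterward; with a small GEMM you may see genuine overlap. The hardware scheduler makes that call.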