Running Tensor Core GEMM and CUDA core GEMM in parallel

Hi,

Is it possible to run a Tensor Core GEMM and a CUDA core GEMM in parallel, e.g. in two different streams?

Is the hardware shared here, or are the CUDA cores separate from the Tensor Cores?

You can certainly try, and if resources are available, the two kernels could execute in parallel.
It is true that Tensor Cores are separate hardware from the ALUs used for CUDA GEMMs.
There are other resources to consider, mainly registers and shared memory. I expect one of these to be your limiting factor.
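A minimal sketch of the experiment described above, assuming cuBLAS is used for both GEMMs: a DGEMM and an SGEMM are enqueued on two separate streams through two handles. The helper name `run_two_stream_gemm` and the matrix size are illustrative assumptions, not something from this thread. The code only makes overlap possible; whether the kernels actually run concurrently depends on free registers, shared memory and SMs, and has to be checked in a profiler such as nsys.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): enqueue an n x n DGEMM and an
// n x n SGEMM on two different streams. Returns 0 on success, 1 on failure.
int run_two_stream_gemm(int n) {
    const size_t elems = static_cast<size_t>(n) * n;
    double *dA = nullptr, *dB = nullptr, *dC = nullptr;
    float *sA = nullptr, *sB = nullptr, *sC = nullptr;
    cudaStream_t s0 = nullptr, s1 = nullptr;
    cublasHandle_t h0 = nullptr, h1 = nullptr;
    bool ok = true;

    // Allocate and zero-fill the inputs; the values are irrelevant here.
    ok &= cudaMalloc(&dA, elems * sizeof(double)) == cudaSuccess;
    ok &= cudaMalloc(&dB, elems * sizeof(double)) == cudaSuccess;
    ok &= cudaMalloc(&dC, elems * sizeof(double)) == cudaSuccess;
    ok &= cudaMalloc(&sA, elems * sizeof(float)) == cudaSuccess;
    ok &= cudaMalloc(&sB, elems * sizeof(float)) == cudaSuccess;
    ok &= cudaMalloc(&sC, elems * sizeof(float)) == cudaSuccess;
    if (ok) {
        ok &= cudaMemset(dA, 0, elems * sizeof(double)) == cudaSuccess;
        ok &= cudaMemset(dB, 0, elems * sizeof(double)) == cudaSuccess;
        ok &= cudaMemset(sA, 0, elems * sizeof(float)) == cudaSuccess;
        ok &= cudaMemset(sB, 0, elems * sizeof(float)) == cudaSuccess;
    }

    // One stream and one cuBLAS handle per GEMM.
    ok &= cudaStreamCreate(&s0) == cudaSuccess;
    ok &= cudaStreamCreate(&s1) == cudaSuccess;
    ok &= cublasCreate(&h0) == CUBLAS_STATUS_SUCCESS;
    ok &= cublasCreate(&h1) == CUBLAS_STATUS_SUCCESS;
    if (ok) {
        ok &= cublasSetStream(h0, s0) == CUBLAS_STATUS_SUCCESS;
        ok &= cublasSetStream(h1, s1) == CUBLAS_STATUS_SUCCESS;
    }

    const double d1 = 1.0, d0 = 0.0;
    const float f1 = 1.0f, f0 = 0.0f;
    if (ok) {
        // Both calls return after enqueueing, so the GPU is free to overlap
        // the two kernels if it has the resources to do so.
        ok &= cublasDgemm(h0, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                          &d1, dA, n, dB, n, &d0, dC, n) == CUBLAS_STATUS_SUCCESS;
        ok &= cublasSgemm(h1, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                          &f1, sA, n, sB, n, &f0, sC, n) == CUBLAS_STATUS_SUCCESS;
        ok &= cudaDeviceSynchronize() == cudaSuccess;
    }

    if (h0) cublasDestroy(h0);
    if (h1) cublasDestroy(h1);
    if (s0) cudaStreamDestroy(s0);
    if (s1) cudaStreamDestroy(s1);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(sA); cudaFree(sB); cudaFree(sC);
    return ok ? 0 : 1;
}
```

Calling `run_two_stream_gemm(4096)` from `main` under `nsys profile` shows on the timeline whether the two kernels overlap. Link with `-lcublas`.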


Thanks for the reply. Do we have any control over the resources used by Tensor Cores? I mean forcing them not to run at maximum occupancy (maybe by changing some variable inside CUDA).

I’m not following… Tensor Cores are a resource by themselves.

Data is loaded into the Tensor Cores from the registers of the threads.
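To illustrate that data path, here is a small sketch using the WMMA API with the FP64 shape m8n8k4. The kernel and helper names are made up for illustration, and it assumes a GPU with compute capability 8.0 or higher (compile with `-arch=sm_80`). Each thread of the warp holds part of the fragments in its registers; `load_matrix_sync` fills them from memory and `mma_sync` feeds them to the Tensor Cores.

```cuda
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes an 8x8 tile C = A * B with FP64 WMMA (m8n8k4).
__global__ void wmma_fp64_tile(const double *a, const double *b, double *c) {
    // Fragments are distributed across the registers of the 32 threads.
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> c_frag;

    wmma::fill_fragment(c_frag, 0.0);
    wmma::load_matrix_sync(a_frag, a, 4);  // A: 8x4, row-major, ld = 4
    wmma::load_matrix_sync(b_frag, b, 4);  // B: 4x8, column-major, ld = 4
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core operation
    wmma::store_matrix_sync(c, c_frag, 8, wmma::mem_row_major);
}

// Hypothetical host helper: A and B are all ones, so every element of C
// must equal 4.0. Returns 0 if that holds, 1 otherwise.
int run_wmma_fp64_tile() {
    double hA[8 * 4], hB[4 * 8], hC[8 * 8];
    for (int i = 0; i < 32; ++i) { hA[i] = 1.0; hB[i] = 1.0; }
    for (int i = 0; i < 64; ++i) { hC[i] = 0.0; }

    double *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = true;
    ok &= cudaMalloc(&dA, sizeof(hA)) == cudaSuccess;
    ok &= cudaMalloc(&dB, sizeof(hB)) == cudaSuccess;
    ok &= cudaMalloc(&dC, sizeof(hC)) == cudaSuccess;
    if (ok) {
        ok &= cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice) == cudaSuccess;
        ok &= cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        wmma_fp64_tile<<<1, 32>>>(dA, dB, dC);  // exactly one warp
        ok &= cudaGetLastError() == cudaSuccess;
        ok &= cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    for (int i = 0; ok && i < 64; ++i) ok &= (hC[i] == 4.0);
    return ok ? 0 : 1;
}
```

The fragment sizes count against the per-thread register budget of the kernel, which is one reason registers are a shared resource between Tensor Core and regular CUDA code.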

I tried to disable Tensor Cores to see whether that is possible.

Based on the documentation below, I added this line to switch the computation to the default math mode.
https://docs.nvidia.com/cuda/cublas/index.html#tensorop-restrictions

> cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);

But after checking with nsys, I am seeing

cutlass::kernel<cutlass_80_tensorop_d884gemm_64x32_16x4_nn_align1>

which means that Tensor Cores are used.
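One way to rule out that the call itself failed is to check its return status and read the mode back with `cublasGetMathMode`. This is only a sketch with a made-up helper name; it confirms what the handle reports, not which kernel cuBLAS ends up selecting.

```cuda
#include <cublas_v2.h>

// Hypothetical helper: set CUBLAS_DEFAULT_MATH and read the mode back.
// Returns 0 if both calls succeed and the handle reports CUBLAS_DEFAULT_MATH.
int check_default_math_mode() {
    cublasHandle_t handle = nullptr;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    cublasMath_t mode = CUBLAS_DEFAULT_MATH;
    bool ok = cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH) == CUBLAS_STATUS_SUCCESS;
    ok = ok && cublasGetMathMode(handle, &mode) == CUBLAS_STATUS_SUCCESS;
    ok = ok && mode == CUBLAS_DEFAULT_MATH;

    cublasDestroy(handle);
    return ok ? 0 : 1;
}
```

If this returns 0 and the profiler still lists a `tensorop` kernel, the math mode setting is not what decides the kernel choice for that data type.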

Depending on the data type, the default math mode may be to use Tensor Cores. What precision/compute type are you using?

I am using DGEMM, so the data type is double for all inputs and outputs. I hope to disable Tensor Cores for this stream and run my normal CUDA core operations. I am using an A100 GPU.

Did some more digging. FP64 Tensor Cores can’t be disabled with cuBLAS. They are IEEE compliant, so there’s really no need.
For your scenario of Tensor Core and CUDA core work in parallel, they can’t be issued in parallel. They do share resources; I wasn’t aware of that earlier. Sorry for leading you down a rabbit hole.

Thanks for the information. In this picture from profiling magma_dpotrf, we can see that two different types of GEMM kernel may be in use (maybe this is an artifact of the wrapper or of nsys). It actually does happen for large dimensions.