Parallel execution on Tensor Cores and CUDA cores on the same SM

On the Xavier AGX, is there a way to check/confirm that Tensor Cores and CUDA cores can execute in parallel on the same SM?

I have done quite a bit of research online, including reading this post: Run Parallel Tensor Cores GEMM and Cuda GEMM - #8 by mnicely

I also wrote a dummy program that creates multiple streams, some running GEMM on Tensor Cores and the rest running on CUDA cores (non-MMA ops). I see some overlap between the two kinds of streams, but I cannot confirm whether the overlapping streams run on the same SM, since nvprof + nvvp does not report that information.
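For reference, the dummy program is roughly the following (a minimal sketch, not my exact code; the fma_loop kernel, the matrix size, and the iteration counts are made up for illustration, and the matrices are left uninitialized since only scheduling matters here):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void fma_loop(float* out, int iters) {
    // Plain FP32 FMA work that cannot be mapped to Tensor Cores.
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i) v = v * 1.0001f + 0.5f;
    out[threadIdx.x + blockIdx.x * blockDim.x] = v;
}

int main() {
    const int n = 1024;
    __half *A, *B, *C;
    float* out;
    cudaMalloc(&A, n * n * sizeof(__half));
    cudaMalloc(&B, n * n * sizeof(__half));
    cudaMalloc(&C, n * n * sizeof(__half));
    cudaMalloc(&out, 256 * 256 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cublasHandle_t h;
    cublasCreate(&h);
    cublasSetStream(h, s0);
    // Allow cuBLAS to use Tensor Core kernels (CUDA 10.x-style math mode).
    cublasSetMathMode(h, CUBLAS_TENSOR_OP_MATH);

    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    for (int i = 0; i < 10; ++i) {
        // Half-precision GEMM that cuBLAS may run on Tensor Cores (stream s0).
        cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        // Non-MMA FP32 work on the CUDA cores (stream s1).
        fma_loop<<<256, 256, 0, s1>>>(out, 1 << 16);
    }
    cudaDeviceSynchronize();
    cublasDestroy(h);
    return 0;
}
```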

Can someone from NVIDIA confirm whether this is possible? If so, how do I explicitly program the GPU to run with more parallelism?

Nsight Systems should provide the insight you’re looking for.
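For example, something along these lines should work, assuming the Nsight Systems CLI (nsys) is installed with your JetPack (the report name and app path are placeholders):

```
nsys profile --trace=cuda -o report ./your_app
```

Opening the resulting report in the Nsight Systems GUI gives a per-stream timeline of kernel executions, which makes any overlap between the GEMM stream and the non-MMA stream easy to see.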

If so, how do I explicitly program the GPU to run with more parallelism?

Please don’t try to do this. The hardware will do a better job scheduling than programmatically. If streams don’t overlap it’s usually due to lack of resources.

Nsight Systems should provide the insight you’re looking for.

I was using nvprof + nvvp and wasn't able to find information such as which stream runs on which SM. Also, can you provide a pointer on how to explicitly control which stream works on which SM's Tensor/CUDA cores?

The reason that we’re asking this question is because we wanted to confirm the ability to run parallel computing between cuda and tensor cores for future computing resource planning purposes. We didn’t intend to explicitly program things in production.

Also, can you provide a pointer on how to explicitly control which stream works on which SM's Tensor/CUDA cores

This is not possible. The closest thing to controlling what is launched on a particular number of SMs is cublasSetSmCountTarget.
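For completeness, usage is along these lines (a minimal sketch; it requires a cuBLAS version that ships cublasSetSmCountTarget, i.e. a newer CUDA 11.x toolkit, so availability on Xavier depends on your JetPack version). As I understand it, this is only a hint that influences cuBLAS kernel selection, not a hard pinning of work to specific SMs:

```cpp
#include <cublas_v2.h>
#include <cstdio>

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Ask cuBLAS to select kernels as if only 4 SMs were available,
    // leaving the remaining SMs free for kernels in other streams.
    // Passing 0 restores the default (use the real SM count).
    cublasStatus_t st = cublasSetSmCountTarget(handle, 4);
    if (st != CUBLAS_STATUS_SUCCESS) {
        printf("cublasSetSmCountTarget not supported here (status %d)\n", st);
    }

    // ... issue GEMMs on this handle as usual ...

    cublasDestroy(handle);
    return 0;
}
```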