I also wrote a dummy program that creates multiple streams, some of them running GEMM on Tensor Cores and the rest running non-MMA ops on CUDA cores. I see some overlap between the two kinds of streams, but I cannot confirm whether the overlapping streams run on the same SM, since nvprof + nvvp doesn't expose that information.
Can someone from NVIDIA help confirm whether this is possible? If so, how do I explicitly program the GPUs to run with more parallelism?
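For reference, the stream setup of the dummy program looks roughly like this. This is a minimal sketch, not the exact program: the kernel, matrix size `N`, and allocation sizes are illustrative, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublas_v2.h>

// Plain FP32 arithmetic intended for the CUDA (non-MMA) cores.
__global__ void fma_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            x[i] = x[i] * 1.0001f + 0.0001f;
}

int main() {
    const int N = 4096;  // illustrative GEMM dimension
    cudaStream_t sTensor, sCuda;
    cudaStreamCreate(&sTensor);
    cudaStreamCreate(&sCuda);

    // Tensor Core GEMM issued on one stream via cuBLAS.
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, sTensor);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    __half *A, *B;
    float *C, *x;
    cudaMalloc(&A, (size_t)N * N * sizeof(__half));
    cudaMalloc(&B, (size_t)N * N * sizeof(__half));
    cudaMalloc(&C, (size_t)N * N * sizeof(float));
    cudaMalloc(&x, N * sizeof(float));

    float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                 &beta,  C, CUDA_R_32F, N,
                 CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    // Non-MMA kernel on the other stream; the hardware may
    // overlap it with the GEMM, which is what shows up in the trace.
    fma_kernel<<<(N + 255) / 256, 256, 0, sCuda>>>(x, N);

    cudaDeviceSynchronize();
    return 0;
}
```

Whether the two streams actually overlap, and on which SMs, is exactly what the profiler timeline is needed to answer.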
Nsight Systems should provide the insight you’re looking for.
If so, how do I explicitly program the GPUs to run with more parallelism?
Please don't try to do this. The hardware scheduler will do a better job than anything you can arrange programmatically. If streams don't overlap, it's usually due to a lack of resources.
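For example, a timeline showing per-stream kernel activity can be captured with the Nsight Systems CLI (the binary name `./app` is a placeholder for your program):

```shell
# Capture CUDA and cuBLAS activity; nsys ships with the CUDA Toolkit
nsys profile -o overlap_report --trace=cuda,cublas ./app

# Open the resulting report in the Nsight Systems GUI to inspect
# whether kernels on different streams overlap in time
```

The GUI timeline shows which kernels ran concurrently, though it reports occupancy and timing rather than a per-SM assignment of streams.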
Nsight Systems should provide the insight you’re looking for.
I was using nvprof + nvvp, and I wasn't able to find information such as which stream runs on which SM. Also, can you please provide a pointer on how to explicitly control which stream runs on which SM's Tensor/CUDA cores?
We're asking this question because we wanted to confirm the ability to run CUDA cores and Tensor Cores in parallel, for future computing-resource-planning purposes. We don't intend to explicitly program this in production.