Parallel execution on tensor cores and CUDA cores on the same SM

Also, can you please provide any pointers on how to explicitly control which stream runs on which SM's tensor/CUDA cores?

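This is not possible. The closest thing to controlling how many SMs a piece of work is launched on is cublasSetSmCountTarget. It applies only to cuBLAS kernels, and it sets a target SM count for the library rather than letting you pick specific SMs or specific cores.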