Multiple Streams on Tensor Cores


I am trying to run multiple GEMMs on Tensor Cores on different streams concurrently. However, it seems from the nvprof timeline that cuBLAS is explicitly serializing the GEMMs by recording an event on the first stream and then polling for it before launching the second GEMM. Is my understanding of this correct? Is there a way to extract higher throughput from Tensor Cores using multiple streams etc.?

I am using a V100 with CUDA 10.0.

GPUs are designed as throughput machines. As long as each kernel is able to utilize the GPU fully, there is no point in running kernels concurrently: throughput will not increase. You may observe minimal overlap between kernels as one is winding down while the other is starting up.

The GPU is able to run kernels from different non-default streams concurrently if each kernel only partially utilizes the GPU. However, this use case is rare in practice and should be avoided: it is best practice to provide enough parallelism that each kernel utilizes the GPU fully.
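To illustrate the point above, here is a minimal sketch (hypothetical kernel and sizes, not from the original thread) of launching deliberately small kernels on separate non-default streams; because each launch uses only one block, it underutilizes the GPU and the scheduler is free to overlap them:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Deliberately tiny kernel: a single block cannot fill the GPU,
// so launches on different streams may run concurrently.
__global__ void busyKernel(float *data, int iters) {
    float v = data[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.0000001f;
    data[threadIdx.x] = v;
}

int main() {
    const int nStreams = 4, n = 256;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buf[i], n * sizeof(float));
    }
    // One small launch per stream, issued back to back; check the
    // profiler timeline to see whether they overlap.
    for (int i = 0; i < nStreams; ++i)
        busyKernel<<<1, n, 0, streams[i]>>>(buf[i], 1 << 20);
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
    }
    printf("done\n");
    return 0;
}
```

Note that once the grid is made large enough to saturate the SMs, the same program shows essentially no overlap, which is the behavior described above.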

You can find numerous questions in these forums asked by people who unsuccessfully tried to create a concurrent kernel scenario.

Sure, I understand that. However, my question is specifically about Tensor Cores on V100s, where cuBLAS seems to be explicitly serializing by recording and polling for events. So I just want to know whether there are hardware limitations that prevent kernels from running concurrently on Tensor Cores?

Tensor Cores are execution resources available to kernels just like any other execution resources, so my previous comments still apply: if a kernel is able to fully utilize the hardware, running a second kernel concurrently won’t improve throughput, so the scheduler doesn’t do that.

I cannot speak to “serializing by recording and polling for events”; it’s not something I have looked at (maybe Robert Crovella has). Have you checked that this behavior is actually different from GEMM calls that don’t use the Tensor Cores?

Yes, I verified this by recording the nvprof trace. For normal (FP32) GEMMs, the cudaLaunchKernel API calls execute one after the other. However, for mixed-precision GEMMs using Tensor Cores, I see a cudaEventRecord after the first cudaLaunchKernel and then a couple of cudaEventQuery calls. Only after the first kernel finishes executing does the second cudaLaunchKernel start.

My microbenchmark to replicate this setting:
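The original microbenchmark was not included in this excerpt. For reference, a sketch of what such a microbenchmark might look like (hypothetical matrix sizes; assumes one cuBLAS handle per stream, `cublasGemmEx` with FP16 inputs, and `CUBLAS_TENSOR_OP_MATH` to opt in to Tensor Cores on CUDA 10.0):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int nStreams = 2;
    const int m = 1024, n = 1024, k = 1024;  // hypothetical sizes
    const float alpha = 1.0f, beta = 0.0f;

    cudaStream_t streams[nStreams];
    cublasHandle_t handles[nStreams];
    __half *A[nStreams], *B[nStreams];
    float *C[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cublasCreate(&handles[i]);
        cublasSetStream(handles[i], streams[i]);
        // Opt in to Tensor Core math.
        cublasSetMathMode(handles[i], CUBLAS_TENSOR_OP_MATH);
        cudaMalloc(&A[i], m * k * sizeof(__half));
        cudaMalloc(&B[i], k * n * sizeof(__half));
        cudaMalloc(&C[i], m * n * sizeof(float));
    }

    // Issue one mixed-precision GEMM per stream, back to back,
    // then inspect the nvprof timeline for event record/query
    // activity between the two launches.
    for (int i = 0; i < nStreams; ++i) {
        cublasGemmEx(handles[i], CUBLAS_OP_N, CUBLAS_OP_N,
                     m, n, k, &alpha,
                     A[i], CUDA_R_16F, m,
                     B[i], CUDA_R_16F, k, &beta,
                     C[i], CUDA_R_32F, m,
                     CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) {
        cudaFree(A[i]); cudaFree(B[i]); cudaFree(C[i]);
        cublasDestroy(handles[i]);
        cudaStreamDestroy(streams[i]);
    }
    printf("done\n");
    return 0;
}
```

Running this under `nvprof --print-api-trace` should show whether cudaEventRecord/cudaEventQuery calls appear between the two GEMM launches, as described above.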