Multiple Streams on Tensor Cores

Hi,

I am trying to run multiple GEMMs on Tensor Cores on different streams concurrently. However, the nvprof timeline suggests that cuBLAS is explicitly serializing the GEMMs: it records an event on the first stream and then polls for it before launching the second GEMM. Is my understanding of this correct? Is there a way to extract higher throughput from the Tensor Cores, e.g., by using multiple streams?

I am using a V100 with CUDA 10.0.
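
Condensed sketch of my launch pattern (not the full benchmark; the matrix size, stream count, and the single-handle/cublasSetStream choice here are illustrative, and error checking is omitted):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int N = 4096;          // illustrative size, not my actual workload
    const int NSTREAMS = 4;
    float alpha = 1.0f, beta = 0.0f;

    // FP16 inputs, FP32 accumulate/output: the mixed-precision path that
    // maps onto the Tensor Cores. Inputs are left uninitialized; the
    // values don't matter for timing.
    half *A, *B;
    float *C;
    cudaMalloc(&A, (size_t)N * N * sizeof(half));
    cudaMalloc(&B, (size_t)N * N * sizeof(half));
    cudaMalloc(&C, (size_t)NSTREAMS * N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH); // enable Tensor Cores

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // One GEMM per stream; the intent is concurrent execution, but the
    // nvprof timeline shows event record/query calls between the launches.
    for (int i = 0; i < NSTREAMS; ++i) {
        cublasSetStream(handle, streams[i]);
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                     &beta, C + (size_t)i * N * N, CUDA_R_32F, N,
                     CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(streams[i]);
    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```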

GPUs are designed as throughput machines. As long as each kernel is able to utilize the GPU fully, there is no point in running kernels concurrently: throughput will not increase. You may observe minimal overlap between kernels as one is winding down while the other is starting up.

The GPU is able to run kernels from different non-default streams concurrently if each kernel only partially utilizes the GPU. However, this use case is rare in practice and should be avoided: it is best practice to provide enough parallelism that each kernel utilizes the GPU fully.
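
For illustration only, a minimal sketch of that partial-utilization scenario (the kernel here is a hypothetical busy-loop, and each launch occupies a single small block, leaving most SMs idle; whether the two launches actually overlap still depends on resource availability):

```cpp
#include <cuda_runtime.h>

// Hypothetical busy-loop kernel; one small block per launch leaves most
// of the GPU idle, which is the (rare) precondition for overlap.
__global__ void busyKernel(float *x, int iters) {
    float v = x[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.5f;
    x[threadIdx.x] = v;
}

int main() {
    float *buf;
    cudaMalloc(&buf, 2 * 64 * sizeof(float));
    cudaMemset(buf, 0, 2 * 64 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Two launches on distinct non-default streams, each a single block of
    // 64 threads; a profiler timeline may show them running concurrently.
    busyKernel<<<1, 64, 0, s0>>>(buf, 1 << 20);
    busyKernel<<<1, 64, 0, s1>>>(buf + 64, 1 << 20);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(buf);
    return 0;
}
```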

You can find numerous questions in these forums asked by people who unsuccessfully tried to create a concurrent kernel scenario.

Sure, I understand that. However, my question is specifically about Tensor Cores on V100s, where cuBLAS seems to be explicitly serializing by recording and polling for events. I just want to know whether there are hardware limitations that prevent kernels from using the Tensor Cores concurrently.

Tensor Cores are execution resources available to kernels just like any other execution resources, so my previous comments still apply: if a kernel is able to fully utilize the hardware, running a second kernel concurrently won’t improve throughput, so the scheduler doesn’t do that.

I cannot speak to “serializing by recording and polling for events”; it’s not something I have looked at (maybe Robert Crovella has). Have you checked whether this behavior actually differs from GEMM calls that don’t use the Tensor Cores?

Yes, I verified this by recording the nvprof trace. For normal (FP32) GEMMs, the cudaLaunchKernel API calls execute one after the other. However, for mixed-precision GEMMs using Tensor Cores, I see a cudaEventRecord after the first cudaLaunchKernel, followed by a couple of cudaEventQuery calls. Only after the first kernel finishes executing does the second cudaLaunchKernel start.

My microbenchmark to replicate this setting: https://gist.github.com/chughtapan/0cf5f50ccf5ca6565c30eb88f38ec26b