I am trying to run multiple GEMMs on Tensor Cores concurrently, each on a different stream. However, the nvprof timeline suggests that cuBLAS is explicitly serializing the GEMMs: it appears to record an event on the first stream and then poll for it before launching the second GEMM. Is my understanding correct? Is there a way to get higher throughput from Tensor Cores, for example by using multiple streams?
I am using a V100 with CUDA 10.0.