Concurrent Kernels with the cuSPARSE Library

I would like to ask a question about concurrent kernel execution on NVIDIA GPUs. Let me explain my situation. I have code that launches one sparse matrix multiplication for each of two different matrices. These multiplications are performed with the cuSPARSE library. I want both operations to run concurrently, so I launch them in two separate streams. With the NVIDIA Visual Profiler, I have observed that both operations (cuSPARSE kernels) overlap. The timestamps for both kernels are:

  • Kernel 1) Start Time: 224.963 ms - End Time: 267.19 ms.
  • Kernel 2) Start Time: 229.359 ms - End Time: 359.158 ms.

I'm using a Tesla K20c with 13 SMs, each of which can execute up to 16 blocks. Both kernels have 100% occupancy and launch a sufficient number of blocks:

  • Kernel 1) 13738 blocks, 32 registers/thread, 1.125 KB shared memory.
  • Kernel 2) 4521 blocks, 32 registers/thread, 1.266 KB shared memory.

With this configuration, I think the two kernels should not overlap, since each one launches enough blocks to fill all the SMs of the GPU. However, the NVIDIA Visual Profiler shows that they do overlap. Why? Could anyone explain why this behaviour can occur?

    Many thanks in advance.
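
As a rough check on the figures quoted in the question, the sketch below computes how long the two kernels overlap and how many "waves" of blocks each kernel would need, assuming at most 13 × 16 = 208 blocks can be resident at once (the SM count and per-SM block limit stated above). The function names are hypothetical, the timestamps are read as decimal milliseconds, and the wave count is only a simplified model that ignores register, shared-memory, and thread limits; the real block scheduler may behave differently.

```python
import math

def overlap_ms(start_a, end_a, start_b, end_b):
    """Length of the time interval during which both kernels are running."""
    return max(0.0, min(end_a, end_b) - max(start_a, start_b))

def waves(num_blocks, num_sms=13, max_blocks_per_sm=16):
    """Number of full-GPU 'waves' needed if every resident-block slot is used.

    Simplified model: counts block slots only (an assumption, not a
    description of how the hardware actually schedules blocks).
    """
    resident_slots = num_sms * max_blocks_per_sm
    return math.ceil(num_blocks / resident_slots)

# Profiler timestamps from the question, in milliseconds.
k1_start, k1_end = 224.963, 267.19
k2_start, k2_end = 229.359, 359.158

shared = overlap_ms(k1_start, k1_end, k2_start, k2_end)
print(f"overlap: {shared:.3f} ms")
print(f"share of kernel 1 runtime: {shared / (k1_end - k1_start):.1%}")
print(f"kernel 1 waves: {waves(13738)}")
print(f"kernel 2 waves: {waves(4521)}")
```

With these numbers the kernels overlap for about 37.8 ms, which is most of kernel 1's runtime, even though under this model each kernel alone needs many waves to run all of its blocks.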

    It seems like this was discussed here:

    I don’t think there is any specification for how the device will schedule blocks from concurrent kernels. Given this, your observation is certainly possible.

    Hi!

    Then why have I only observed this behaviour when I use the cuSPARSE library with streams? I have other codes where four kernels are launched, each in its own stream. Each kernel saturates the GPU resources, and no overlap among the kernels occurs.

    Certainly you won’t see the behavior if you don’t use streams - then all kernels will serialize.

    Unless things have changed in recent GPUs, if each kernel saturates the GPU you will not see overlap except possibly for a very brief period at the very start and very end where a given kernel does not use all the resources as it is starting up or winding down.

    If your concern is about performance, consider this: If each kernel already saturates the GPU by itself, no additional throughput is gained by running the kernels concurrently. In fact, running such kernels concurrently could easily increase run time due to competition for finite hardware resources.

    If you try a different scenario where each kernel uses only a small fraction of the GPU resources, and each such kernel runs in a different stream, you should see the kernels running concurrently. Such situations can occur in practice when operating on batches of small matrices, for example.
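
To put a rough number on this last scenario, the sketch below counts how many small kernels could be resident at once if block slots were the only limit. It reuses the 13 SMs and 16 blocks per SM quoted in the question for the K20c; the 8-block kernel size and the function name are hypothetical, and real co-residency also depends on registers, shared memory, threads per block, and the hardware's limit on concurrently running kernels.

```python
def max_coresident_kernels(blocks_per_kernel, num_sms=13, max_blocks_per_sm=16):
    """Upper bound on kernels that fit on the GPU at once, counting block slots only."""
    resident_slots = num_sms * max_blocks_per_sm
    return resident_slots // blocks_per_kernel

# Hypothetical small kernel: 8 blocks per launch (e.g. one small matrix per stream).
print(max_coresident_kernels(8))
# A kernel from the question: 13738 blocks exceed the slot count on their own.
print(max_coresident_kernels(13738))
```

A result of zero for the large kernel simply means that one launch already needs more block slots than the GPU offers, which is the "saturating" case described earlier in the thread.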