Slow cusparseDcsrgeam in CUDA 9.2


I’m having performance issues with the cuSPARSE routine cusparseDcsrgeam.

After switching from CUDA 8.0 to CUDA 9.2, cusparseDcsrgeam is about 8 times slower, using the exact same input. I have no idea why, so I’m hoping to get some help and find out if this is due to recent changes in the library or if there’s anything I can do on my side to work around the problem.

The matrices I’m adding together are 2,500,000 × 2,500,000 or 7,560,000 × 2,500,000 and have at most 17,380,000 non-zero elements. Their size difference doesn’t seem to affect the runtime much.

You can see the runtimes of the csrgeam_windowBased_core kernel measured with nvprof below:

CUDA 9.2

Time(%)      Time     Calls       Avg       Min       Max  Name
28.97%  5.17855s       120  43.155ms  36.771ms  49.312ms  void csrgeam_windowBased_core<double, bool=0>(cusparseCsrgeamParams<double>)
21.62%  3.86527s       120  32.211ms  27.838ms  39.385ms  void csrgeam_windowBased_core<float, bool=1>(cusparseCsrgeamParams<float>)

CUDA 8.0

6.90%  756.84ms       120  6.3070ms  4.9622ms  8.5097ms  void csrgeam_windowBased_core<double, bool=0>(cusparseCsrgeamParams<double>)
3.88%  425.22ms       120  3.5435ms  2.7723ms  4.7569ms  void csrgeam_windowBased_core<float, bool=1>(cusparseCsrgeamParams<float>)

Analyzing with the visual profiler also shows that with CUDA 9.2 the kernels are launched with a grid size of (1890000, 1, 1) and a block size of (32, 4, 1). It also reports that the kernels are limited by shared memory bandwidth, with 429,329,352 transactions for csrgeam_windowBased_core<double, bool=0> alone. None of the other memory types show any issues; their utilization is between Idle and Low.

The first thing I notice when I run the same analysis on the CUDA 8.0 version is that the grid size is exactly one eighth, at (236250, 1, 1), and the block size has the same dimensions with x and y swapped: (4, 32, 1). Also, device memory utilization is at Medium, whereas for CUDA 9.2 it was between Idle and Low.
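Doing the arithmetic on those two launch configurations: the block size is 128 threads in both cases, so the 8× larger grid means the CUDA 9.2 kernel launches 8× as many threads in total (numbers taken from the profiler output above):

```python
# Launch configurations as reported by the visual profiler.
grid_92, block_92 = 1_890_000, (32, 4, 1)   # CUDA 9.2
grid_80, block_80 = 236_250,   (4, 32, 1)   # CUDA 8.0

threads_92 = grid_92 * block_92[0] * block_92[1] * block_92[2]
threads_80 = grid_80 * block_80[0] * block_80[1] * block_80[2]

print(threads_92 // threads_80)  # -> 8
```

That factor of 8 in total threads lines up suspiciously well with the roughly 8× slowdown in kernel time.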

All of this is running on a GTX 1060 in a laptop, with Ubuntu 16.04. I compile with the arch option set to sm_61. For additional info, here’s the deviceQuery output:

CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6078 MBytes (6373572608 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
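For completeness, the build and profiling commands look like this on my side (file and binary names here are placeholders for my actual project files):

```shell
# Placeholder names; adjust to your own source file.
nvcc -O2 -arch=sm_61 -o csrgeam_test csrgeam_test.cu -lcusparse
nvprof ./csrgeam_test
```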

Please help!
Best Regards

I would suggest providing a complete test code, or filing a bug (they will also ask you for a complete test code).

Thanks for your answer txbob. I can file a bug, but I can’t find the right place on the page. If you can reproduce the issue, I can also do some more research on gemm2 and geam2 (new in CUDA 9, though); gemm2 likewise runs very slowly compared to gemm.

Here’s a complete sample that perfectly reproduces the problem for me.

#include <cuda.h>
#include <cuda_runtime.h>
#include <cusparse_v2.h>
#include <cstdio>
#include <iostream>
#include <thrust/fill.h>
#include <thrust/sequence.h>
#include <thrust/execution_policy.h>

#define CLEANUP(s)                                   \
do {                                                 \
    printf ("%s\n", s);                              \
    if (rowPtrA)     cudaFree(rowPtrA);              \
    if (rowPtrB)     cudaFree(rowPtrB);              \
    if (rowPtrC)     cudaFree(rowPtrC);              \
    if (colIdxA)     cudaFree(colIdxA);              \
    if (colIdxB)     cudaFree(colIdxB);              \
    if (colIdxC)     cudaFree(colIdxC);              \
    if (valsA)       cudaFree(valsA);                \
    if (valsB)       cudaFree(valsB);                \
    if (valsC)       cudaFree(valsC);                \
    if (descr)       cusparseDestroyMatDescr(descr); \
    if (handle)      cusparseDestroy(handle);        \
    cudaDeviceReset();                               \
    fflush (stdout);                                 \
} while (0)

int main(int argc, char* argv[])
{
    /* The example builds two diagonal matrices containing 0.3
       and 0.4, and adds them together. */
    cusparseHandle_t handle = NULL;
    cusparseMatDescr_t descr = NULL;
    cusparseStatus_t cusparseStat;
    // Create the handle and descriptor before configuring them.
    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);
    cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    cudaError_t err1, err2, err3, err4, err5, err6, err7;

    const int dimension = 2500000;
    const double alpha = 1.0;
    const double beta = 1.0;

    const int rows = dimension;
    const int cols = dimension;
    const int nnzA = dimension;
    const int nnzB = dimension;
    int nnzC = 0;

    int* rowPtrA = NULL;
    int* rowPtrB = NULL;
    int* rowPtrC = NULL;
    int* colIdxA = NULL;
    int* colIdxB = NULL;
    int* colIdxC = NULL;
    double* valsA = NULL;
    double* valsB = NULL;
    double* valsC = NULL;

    err1 = cudaMalloc((void**)&rowPtrA, (rows+1)*sizeof(int));
    err2 = cudaMalloc((void**)&rowPtrB, (rows+1)*sizeof(int));
    err3 = cudaMalloc((void**)&rowPtrC, (rows+1)*sizeof(int));
    err4 = cudaMalloc((void**)&colIdxA, (nnzA)*sizeof(int));
    err5 = cudaMalloc((void**)&colIdxB, (nnzB)*sizeof(int));
    err6 = cudaMalloc((void**)&valsA,   (nnzA)*sizeof(double));
    err7 = cudaMalloc((void**)&valsB,   (nnzB)*sizeof(double));
    if(err1 != cudaSuccess ||
       err2 != cudaSuccess ||
       err3 != cudaSuccess ||
       err4 != cudaSuccess ||
       err5 != cudaSuccess ||
       err6 != cudaSuccess ||
       err7 != cudaSuccess) {
        CLEANUP("CUDA error when allocating memory for matrix A and B.");
        return 1;
    }

    thrust::fill(thrust::device, valsA, valsA+nnzA, 0.3);
    thrust::fill(thrust::device, valsB, valsB+nnzB, 0.4);
    // Row pointers for a diagonal matrix are 0..rows inclusive.
    thrust::sequence(thrust::device, rowPtrA, rowPtrA+rows+1);
    thrust::sequence(thrust::device, rowPtrB, rowPtrB+rows+1);
    thrust::sequence(thrust::device, colIdxA, colIdxA+nnzA);
    thrust::sequence(thrust::device, colIdxB, colIdxB+nnzB);
    err1 = cudaDeviceSynchronize();
    if(err1 != cudaSuccess){
        CLEANUP("CUDA synchronization error");
        return 1;
    }
    cusparseStat = cusparseXcsrgeamNnz(handle, rows, cols,
                        descr, nnzA,
                        rowPtrA, colIdxA,
                        descr, nnzB,
                        rowPtrB, colIdxB,
                        descr, rowPtrC,
                        &nnzC);
    if(cusparseStat != CUSPARSE_STATUS_SUCCESS){
        CLEANUP("Error in csrgeamNnz");
        return 1;
    }
    err2 = cudaDeviceSynchronize();
    if(err2 != cudaSuccess){
        CLEANUP("CUDA synchronization error");
        return 1;
    }
    err1 = cudaMalloc((void**)&colIdxC, nnzC*sizeof(int));
    // Managed memory so valsC can be read directly from the host below.
    err2 = cudaMallocManaged((void**)&valsC, nnzC*sizeof(double));
    if(err1 != cudaSuccess ||
       err2 != cudaSuccess){
        CLEANUP("CUDA error when mallocing matrix C pointers");
        return 1;
    }
    cusparseStat = cusparseDcsrgeam(handle, rows, cols,
                     &alpha,
                     descr, nnzA,
                     valsA, rowPtrA, colIdxA,
                     &beta,
                     descr, nnzB,
                     valsB, rowPtrB, colIdxB,
                     descr,
                     valsC, rowPtrC, colIdxC);
    if(cusparseStat != CUSPARSE_STATUS_SUCCESS){
        CLEANUP("Error in csrgeam");
        return 1;
    }
    err1 = cudaDeviceSynchronize();
    if(err1 != cudaSuccess){
        CLEANUP("CUDA synchronization error");
        return 1;
    }

    // Print an arbitrary result value. Should print 0.7.
    std::cout << valsC[10] << std::endl;

    CLEANUP("No CUDA errors.");
    return 0;
}

Yes, I can reproduce the observation. It looks like a perf regression. I would suggest filing a bug.

log in as a registered developer
click on your name in the upper right hand corner
click on my account
click on my bugs

Thank you! I’ve filed a bug.