Performance regression when changing [deprecated] cusparse<t>csrmm() to cusparseSpMM()

Dear NVIDIA developers,

I am working on accelerating a scientific codebase, and I currently use the cuSPARSE library to compute sparse×dense and dense×sparse matrix-matrix products. I recently moved to CUDA 10.1 and, while reading the cuSPARSE documentation, found out that

cusparse<t>csrmm()

is deprecated and will be removed in a future release.

Naturally, I changed to the recommended

cusparseSpMM()

routine, but noticed a substantial performance slow-down. To isolate the issue, I profiled the following code segment, which multiplies a random sparse matrix with a random dense matrix (both N by N).

Code below:

t1 = get_time(0.0);

	// Convert the dense matrix cA_dev to CSR form (values cA_nnz_d, row offsets cA_edgei_d,
	// column indices cA_indexj_d), using the per-row nonzero counts in nnzPerRow_d
	cusparse_state = cusparseZdense2csr((cusparseHandle_t)cusparse_handle, N, N, descrA, (cuDoubleComplex*) cA_dev, N, nnzPerRow_d, (cuDoubleComplex*) cA_nnz_d, cA_edgei_d, cA_indexj_d);

	// Copy the CSR row-offset array back to the host
	cudaMemcpy(cA_edgei, cA_edgei_d, (N+1)*sizeof(int), cudaMemcpyDeviceToHost);

	// Create the generic descriptors required by the new cusparseSpMM() API
	cusparse_state = cusparseCreateCsr(&sparse_descriptor, N, N, *nnzTotalHostPtr, cA_edgei_d, cA_indexj_d, (cuDoubleComplex*) cA_nnz_d, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_C_64F);
	cusparse_state = cusparseCreateDnMat(&dense_descriptor, N, N, N, (cuDoubleComplex*) cB_dev, CUDA_C_64F, CUSPARSE_ORDER_COL);
	cusparse_state = cusparseCreateDnMat(&denseC_descriptor, N, N, N, (cuDoubleComplex*) cC_cusparse, CUDA_C_64F, CUSPARSE_ORDER_COL);

	// Repeated multiplication: exactly one of the two calls is active,
	// the other stays commented out
	for(int k = 0; k < number_trials; k++){
		cusparse_state = cusparseSpMM((cusparseHandle_t)cusparse_handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, (cuDoubleComplex*)&alpha, sparse_descriptor, dense_descriptor, (cuDoubleComplex*)&beta, denseC_descriptor, CUDA_C_64F, CUSPARSE_CSRMM_ALG1, NULL);

//		cusparse_state = cusparseZcsrmm((cusparseHandle_t)cusparse_handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, N, *nnzTotalHostPtr, (cuDoubleComplex*)&alpha, descrA, (cuDoubleComplex*)cA_nnz_d, cA_edgei_d, cA_indexj_d, (cuDoubleComplex*)cB_dev, N, (cuDoubleComplex*)&beta, (cuDoubleComplex*)cC_cusparse, N);

		// Copy the result matrix back to the host on every trial
		cudaMemcpy(cC_cshost, cC_cusparse, N*N*sizeof(CPX), cudaMemcpyDeviceToHost);
	}

	// Copy the remaining CSR arrays (values and column indices) back to the host
	cudaMemcpy(cA_nnz, cA_nnz_d, *nnzTotalHostPtr*sizeof(CPX), cudaMemcpyDeviceToHost);
	cudaMemcpy(cA_indexj, cA_indexj_d, *nnzTotalHostPtr*sizeof(int), cudaMemcpyDeviceToHost);

t1 = get_time(t1);
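For completeness: the generic-API documentation also describes a workspace-query step through cusparseSpMM_bufferSize(), while the segment above simply passes NULL as the external buffer. A minimal sketch of that step, reusing the descriptors and scalars from the code above, would look like this:

	size_t buffer_size = 0;
	void *external_buffer = NULL;

	// Ask cuSPARSE how much workspace cusparseSpMM() needs for these operands
	cusparse_state = cusparseSpMM_bufferSize((cusparseHandle_t)cusparse_handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, (cuDoubleComplex*)&alpha, sparse_descriptor, dense_descriptor, (cuDoubleComplex*)&beta, denseC_descriptor, CUDA_C_64F, CUSPARSE_CSRMM_ALG1, &buffer_size);
	cudaMalloc(&external_buffer, buffer_size);

	// external_buffer would then replace the NULL in the cusparseSpMM() call above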

The code was compiled with pgc++ from PGI 19.4.

By moving the comment, I can profile either cusparseZcsrmm() or cusparseSpMM().
The result is that cusparseSpMM() runs at roughly half the speed of cusparseZcsrmm() for double-precision complex numbers, regardless of the size or sparsity of the matrices.

My question is essentially whether I am using cusparseSpMM() in an unintended fashion, and if that is not the case, why cusparse<t>csrmm() was deprecated at all, especially given that there already exist known issues with cusparseSpMM(), detailed in this forum post:

It’s not obvious to me that you are using cusparseSpMM() in an unintended fashion.

For the performance issue, my suggestion would be to file a bug. You will likely be asked for a complete test case.
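For reference, a complete test case along these lines can be quite small. The sketch below uses a hand-coded 4x4 double-complex CSR matrix in place of the random N-by-N one, includes the workspace query, and omits error checking for brevity; the sizes and values are placeholders to adapt:

	#include <cstdio>
	#include <cuda_runtime.h>
	#include <cusparse.h>
	#include <cuComplex.h>

	int main()
	{
	    // Toy 4x4 sparse matrix in CSR form (zero-based indexing), double complex
	    const int N = 4, NNZ = 5;
	    int hRowPtr[N + 1] = {0, 1, 2, 4, 5};
	    int hColInd[NNZ]   = {0, 1, 1, 2, 3};
	    cuDoubleComplex hVal[NNZ], hB[N * N], hC[N * N];
	    for (int i = 0; i < NNZ; i++)   hVal[i] = make_cuDoubleComplex(1.0, 0.0);
	    for (int i = 0; i < N * N; i++) { hB[i] = make_cuDoubleComplex(1.0, 0.0);
	                                      hC[i] = make_cuDoubleComplex(0.0, 0.0); }

	    // Device copies of the CSR arrays and the dense matrices
	    int *dRowPtr, *dColInd;
	    cuDoubleComplex *dVal, *dB, *dC;
	    cudaMalloc(&dRowPtr, (N + 1) * sizeof(int));
	    cudaMalloc(&dColInd, NNZ * sizeof(int));
	    cudaMalloc(&dVal, NNZ * sizeof(cuDoubleComplex));
	    cudaMalloc(&dB, N * N * sizeof(cuDoubleComplex));
	    cudaMalloc(&dC, N * N * sizeof(cuDoubleComplex));
	    cudaMemcpy(dRowPtr, hRowPtr, (N + 1) * sizeof(int), cudaMemcpyHostToDevice);
	    cudaMemcpy(dColInd, hColInd, NNZ * sizeof(int), cudaMemcpyHostToDevice);
	    cudaMemcpy(dVal, hVal, NNZ * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
	    cudaMemcpy(dB, hB, N * N * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
	    cudaMemcpy(dC, hC, N * N * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

	    cusparseHandle_t handle;
	    cusparseCreate(&handle);

	    // Generic-API descriptors for A (sparse), B and C (dense, column-major)
	    cusparseSpMatDescr_t matA;
	    cusparseDnMatDescr_t matB, matC;
	    cusparseCreateCsr(&matA, N, N, NNZ, dRowPtr, dColInd, dVal,
	                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
	                      CUSPARSE_INDEX_BASE_ZERO, CUDA_C_64F);
	    cusparseCreateDnMat(&matB, N, N, N, dB, CUDA_C_64F, CUSPARSE_ORDER_COL);
	    cusparseCreateDnMat(&matC, N, N, N, dC, CUDA_C_64F, CUSPARSE_ORDER_COL);

	    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
	    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

	    // Query and allocate the external workspace, then run C = alpha*A*B + beta*C
	    size_t bufSize = 0;
	    void *dBuf = NULL;
	    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
	                            CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
	                            &beta, matC, CUDA_C_64F, CUSPARSE_CSRMM_ALG1, &bufSize);
	    cudaMalloc(&dBuf, bufSize);
	    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
	                 CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matA, matB,
	                 &beta, matC, CUDA_C_64F, CUSPARSE_CSRMM_ALG1, dBuf);

	    // Copy the result back and print one entry as a sanity check
	    cudaMemcpy(hC, dC, N * N * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);
	    printf("C[0] = (%f, %f)\n", cuCreal(hC[0]), cuCimag(hC[0]));

	    cusparseDestroySpMat(matA);
	    cusparseDestroyDnMat(matB);
	    cusparseDestroyDnMat(matC);
	    cusparseDestroy(handle);
	    return 0;
	}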