CUSPARSE conversion routines not working... cusparseSnnz and cusparseSdense2csr misbehaving...

I’ve put together a little demo of my problem. See the attached file. Here is the output of my program:

Initializing CUSPARSE…done
This tests shows that the CUSPARSE format conversion
functions are not working as expected. We have a matrix in device memory that we want to
convert to CSR, but things don’t work correctly. The example below is taking from
page 10 of the CUSPARSE Library PDF. This was tested on CUDA 3.2 Nov 2010.
Yes I know that the matrix is already sparse, but the use case is that we already have a sparse matrix in device
memory that we want to convert to CSR format.
h_A =
1 4 0 0 0
0 2 3 0 0
5 0 0 7 8
0 0 9 0 6

Calling cusparseSnnz with lda = 4. We are using CUSPARSE_DIRECTION_ROW (e.g nnzPerVector stores nnz per row)
nnz= 9 - CORRECT!

h_nnzPerVector - WRONG
1 3 3 2
Should be: 2 2 3 2

Calling cusparseSdense2csr

h_csrValA - WRONG
1 4 7 9 2 5 8 3 6
Should be: 1 4 2 3 5 7 8 9 6

h_csrRowPtrA - WRONG, though first and last enteries are correct
0 1 4 7 9
Should be: 0 2 4 7 9

h_csrColIndA - WRONG
0 0 3 4 1 2 3 1 4
Should be: 0 1 1 2 0 3 4 2 4

So as you can see, the results are just wrong. If we instead do things by column (I forget the exact setup),
we get the correct results for the variables above. The problem then is that if you later
want to use your CSR matrix in, say, a call to cusparseScsrmv, you have to specify for transA that the matrix
is a transpose. This slows the multiplication to a crawl, and a regular CUBLAS dense multiply is 15x faster! Go figure!

So I think this may be a bug. Any help greatly appreciated.

You’ve allocated nnzPerVector size m instead of mxn

Hi there,

One of the CUSPARSE engineers took a look at this for us, and here’s what he says:

Hope this helps.



This is a failure of the CUSPARSE documentation.

In the CUBLAS documentation, we find the following:

“For maximum compatibility with existing Fortran environments,
CUBLAS uses column‐major storage and 1‐based indexing. Since C
and C++ use row‐major storage, applications cannot use the native
array semantics for two‐dimensional arrays. Instead, macros or inline
functions should be defined to implement matrices on top of one‐dimensional arrays.”

Such a statement should be at the beginning of the CUSPARSE documentation.

Further confusion is introduced by footnotes throughout the CUSPARSE documentation saying that various arrays are in row-major format.
For instance, on page 8:

“Note: It is assumed that the indices are given in row-major format …”
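For anyone landing on this thread later: the "wrong" numbers in the first post can be reproduced on the CPU by reading the same 20-float buffer as a column-major 4x5 matrix with lda = 4. The sketch below is not CUSPARSE code. The names `dense_colmajor_to_csr` and `h_A_rowmajor` are made up for illustration, and the `IDX2C` macro follows the 0-based column-major indexing idea from the CUBLAS docs.

```c
#include <assert.h>

/* 0-based column-major index of element (i, j) with leading dimension ld. */
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* The 4 x 5 example matrix from the first post, laid out row by row
   (row-major), as a C programmer would naturally write it. */
const float h_A_rowmajor[20] = {
    1, 4, 0, 0, 0,
    0, 2, 3, 0, 0,
    5, 0, 0, 7, 8,
    0, 0, 9, 0, 6
};

/* Hypothetical CPU reference: build CSR from a dense m x n matrix that is
   READ as column-major. row_ptr must hold m + 1 entries; val and col_ind
   must hold at least nnz entries. Returns the total number of nonzeros. */
int dense_colmajor_to_csr(const float *a, int m, int n, int lda,
                          int *nnz_per_row, float *val,
                          int *row_ptr, int *col_ind)
{
    assert(lda >= m); /* leading dimension must cover one full column */
    int nnz = 0;
    for (int i = 0; i < m; ++i) {
        row_ptr[i] = nnz;
        nnz_per_row[i] = 0;
        for (int j = 0; j < n; ++j) {
            float v = a[IDX2C(i, j, lda)];
            if (v != 0.0f) {
                val[nnz] = v;
                col_ind[nnz] = j;
                ++nnz;
                ++nnz_per_row[i];
            }
        }
    }
    row_ptr[m] = nnz;
    return nnz;
}
```

Calling `dense_colmajor_to_csr(h_A_rowmajor, 4, 5, 4, ...)` gives nnz per row 1 3 3 2, values 1 4 7 9 2 5 8 3 6, row pointers 0 1 4 7 9 and column indices 0 0 3 4 1 2 3 1 4, which are the same numbers the first post reported as wrong. So the library appears to be doing the right thing for a column-major input.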

Thanks for the feedback. I’ll pass this along to the CUSPARSE team.

Yes, I guess I was under the impression that CUSPARSE worked in row-major. Well anyway, good to know that it’s not a bug! OK, so let me try to get it to work now that I know things are done per column.

One question:
what would be the best conversion route to make a call to cusparseScsrmv the most efficient? I did get things working with the above code by fiddling with the variables, but then I had to use TRANSPOSE for cusparseScsrmv, which slowed my app to a crawl. I want to use NON_TRANSPOSE (which, BTW, the doc says is the only one that is supported, though I’ve got it to work with TRANSPOSE). Using NON_TRANSPOSE should be faster, right?

I guess I should clarify my question above. What would I need to do if my matrix is stored in row-major format in device memory and I want to convert it to CSR? Assume I have an m×n matrix; could you tell me the parameters for the conversion calls and for cusparseScsrmv? I’m guessing for cusparseScsrmv I would then have to use TRANSPOSE, right? And if I do, then things will be slow (I’ve already seen this in my local test: 20x for CUBLAS mv versus 5x for CUSPARSE mv)…

Would it be best to convert my device matrix into column-major to get the benefit of cusparseScsrmv? Is there a CUBLAS method for matrix transpose? I know there is an SDK example…
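One observation that may help here, sketched on the CPU rather than with real CUSPARSE calls: a row-major m×n buffer is bit-for-bit the same as a column-major n×m buffer holding the transpose, and the CSC arrays of that transpose are the same three arrays as the CSR of the original matrix. So no explicit transpose of the dense data should be needed. The function name `dense_colmajor_to_csc` and the array `A_rowmajor` below are hypothetical, and whether the library's own dense-to-CSC routine behaves identically is an assumption I have not tested.

```c
#include <assert.h>

/* The 4 x 5 example matrix from the first post, stored row by row. */
const float A_rowmajor[20] = {
    1, 4, 0, 0, 0,
    0, 2, 3, 0, 0,
    5, 0, 0, 7, 8,
    0, 0, 9, 0, 6
};

/* Hypothetical CPU reference: CSC of a dense column-major matrix with
   `rows` rows and `cols` columns. col_ptr must hold cols + 1 entries. */
int dense_colmajor_to_csc(const float *b, int rows, int cols, int ldb,
                          float *val, int *row_ind, int *col_ptr)
{
    assert(ldb >= rows);
    int nnz = 0;
    for (int j = 0; j < cols; ++j) {
        col_ptr[j] = nnz;
        for (int i = 0; i < rows; ++i) {
            float v = b[i + j * ldb]; /* column-major element (i, j) */
            if (v != 0.0f) {
                val[nnz] = v;
                row_ind[nnz] = i;
                ++nnz;
            }
        }
    }
    col_ptr[cols] = nnz;
    return nnz;
}
```

Treating `A_rowmajor` as a 5×4 column-major matrix (rows = 5, cols = 4, ldb = 5) yields col_ptr = 0 2 4 7 9, row_ind = 0 1 1 2 0 3 4 2 4 and val = 1 4 2 3 5 7 8 9 6. Read as csrRowPtrA, csrColIndA and csrValA, those are the "Should be" values from the first post, i.e. the CSR of the original 4×5 matrix, which could then be used without a transpose flag.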


Try the following:

Hope this helps,


Hi Cliff,
thanks for your detailed response. I will try this, but it’s going to be hard for me to finish off my paper (text mining on GPU) and recode things at the same time. I guess I want to ask one final thing. You talk of an “implicit transpose”. All I want to know is: will I see a huge hit on performance when I call csrmv? Because, as I mentioned, I did get things working (not by your mechanism of using CSC), but I was hit big time when using csrmv with TRANSPOSE.
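A possible explanation for the TRANSPOSE slowdown, offered as an assumption rather than anything confirmed in this thread: with CSR storage, y = A*x is a per-row gather where every output element is written by exactly one loop iteration, while y = A^T*x on the same arrays is a scatter where many rows add into the same output element. On a GPU the scatter form typically needs atomics or an extra reduction. The CPU sketch below only illustrates the two access patterns; `csr_mv` and `csr_mv_transpose` are hypothetical names, not CUSPARSE functions.

```c
#include <assert.h>

/* y = A * x for an m x n CSR matrix with 0-based indices.
   x has n entries, y has m entries. Each y[i] is a private dot product. */
void csr_mv(int m, const float *val, const int *row_ptr, const int *col_ind,
            const float *x, float *y)
{
    assert(row_ptr[0] == 0); /* this sketch assumes 0-based CSR */
    for (int i = 0; i < m; ++i) {
        float sum = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_ind[k]];
        y[i] = sum;
    }
}

/* y = A^T * x using the SAME CSR arrays of A.
   x has m entries, y has n entries. Different rows i update the same y[j],
   which is what makes a parallel version need synchronization. */
void csr_mv_transpose(int m, int n, const float *val, const int *row_ptr,
                      const int *col_ind, const float *x, float *y)
{
    assert(row_ptr[0] == 0);
    for (int j = 0; j < n; ++j)
        y[j] = 0.0f;
    for (int i = 0; i < m; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[col_ind[k]] += val[k] * x[i];
}
```

If that is the cause, building the CSR arrays so that NON_TRANSPOSE can be used should avoid the penalty, but that would need to be measured.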

Can’t say I got it to work. Before I post the code for what I think Cliff meant, here is some code that corresponds to using TRANSPOSE on the call to csrmv. Doing it this way makes things work, though it makes my app more than 15x slower.

int nnz = 0;
int *d_nnzPerVector;
int m = cd->numDocuments; // the rows
int n = cd->numTerms;     // the columns
int count_nnzPerVector = n;

cutilSafeCall( cudaMalloc((void**)&d_nnzPerVector, count_nnzPerVector*sizeof(*d_nnzPerVector) ) );
cudaMemset(d_nnzPerVector, -1, count_nnzPerVector*sizeof(*d_nnzPerVector));

if (CUSPARSE_STATUS_SUCCESS != cusparseSnnz(g_cusparse_handle,CUSPARSE_DIRECTION_ROW,

    printf("Error: Couldn't initialize conversion of dense to sparse matrix.\n");

cutilSafeCall( cudaMalloc((void**)&cd->d_csrValA, nnz*sizeof(*cd->d_csrValA) ) );
cutilSafeCall( cudaMalloc((void**)&cd->d_csrColIndA, nnz*sizeof(*cd->d_csrColIndA) ) );
cutilSafeCall( cudaMalloc((void**)&cd->d_csrRowPtrA, (n+1)*sizeof(*cd->d_csrRowPtrA) ) );

if (CUSPARSE_STATUS_SUCCESS != cusparseSdense2csr(g_cusparse_handle,n,m,cd->cusparseAMatDesc,cd->A,

    printf("Error: Couldn't convert dense to sparse matrix.\n");

// The call to csrmv
int m = cd->numDocuments;
int n = cd->numTerms;

    printf("Error: couldn't perform matrix-vector multiply.\n");
    return false;


which results in the following:

0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.5000  0.0000  0.5000  0.5000  0.0000  

0.0000  0.7789  0.0000  0.0000  0.0000  0.4435  0.0000  0.0000  0.4435  0.0000  0.0000  0.0000  

0.0000  0.0000  0.0000  0.7120  0.4054  0.4054  0.0000  0.0000  0.0000  0.0000  0.0000  0.4054  

0.2641  0.0000  0.9277  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.2641  

0.5774  0.0000  0.0000  0.0000  0.5774  0.0000  0.0000  0.0000  0.5774  0.0000  0.0000  0.0000  


2 1 1 1 2 2 1 1 2 1 1 2 

nnz= 17


0.2641 0.5774 0.7789 0.9277 0.7120 0.4054 0.5774 0.4435 0.4054 0.5000 0.5000 0.4435 0.5774 0.5000 0.5000 0.4054 0.2641 


0 2 3 4 5 7 9 10 11 13 14 15 17 


3 4 1 3 2 2 4 1 2 0 0 1 4 0 0 2 3

Now I’m going to (re)try what Cliff mentioned, but I don’t see how it will work. If we call csrmv and pass it CSC-type parameters, the number of cells in the row vector differs between the two.

I’ve come to the conclusion that if one is going to use CUBLAS or CUSPARSE, one should store one’s matrices in column-major format, as required by the docs. This avoids any expensive transpose operation. At present I’m getting a 20x speedup using CUBLAS, row-major + transpose, sgemv. I’m going to convert this into CUSPARSE, column-major, no transpose, sgemv, and will put the results here once I’m done.

Could you describe your d_nnzRowVector, please? I don’t quite understand the values of this vector. Thanks.
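As far as I can tell from the output posted above, the vector holds one nonzero count per row of the matrix as the library sees it. The code above passes n and m swapped, so the row-major 5×12 buffer is read as a 12×5 column-major matrix, and "per row" then means per column of the printed matrix. Here is a CPU sketch under that assumption (numDocuments = 5, numTerms = 12); `count_nnz_per_row_colmajor` and `docs_by_terms` are hypothetical names.

```c
#include <assert.h>

/* The 5 x 12 matrix printed in the post above, stored row by row. */
const float docs_by_terms[5 * 12] = {
    0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.5000f, 0.5000f, 0.0000f, 0.5000f, 0.5000f, 0.0000f,
    0.0000f, 0.7789f, 0.0000f, 0.0000f, 0.0000f, 0.4435f, 0.0000f, 0.0000f, 0.4435f, 0.0000f, 0.0000f, 0.0000f,
    0.0000f, 0.0000f, 0.0000f, 0.7120f, 0.4054f, 0.4054f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.4054f,
    0.2641f, 0.0000f, 0.9277f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.0000f, 0.2641f,
    0.5774f, 0.0000f, 0.0000f, 0.0000f, 0.5774f, 0.0000f, 0.0000f, 0.0000f, 0.5774f, 0.0000f, 0.0000f, 0.0000f
};

/* Count nonzeros in each row of a rows x cols matrix that is read as
   column-major with leading dimension lda. Returns the total count. */
int count_nnz_per_row_colmajor(const float *a, int rows, int cols, int lda,
                               int *nnz_per_row)
{
    assert(lda >= rows);
    int total = 0;
    for (int i = 0; i < rows; ++i) {
        nnz_per_row[i] = 0;
        for (int j = 0; j < cols; ++j) {
            if (a[i + j * lda] != 0.0f) {
                ++nnz_per_row[i];
                ++total;
            }
        }
    }
    return total;
}
```

With rows = 12, cols = 5 and lda = 12 this returns 17 and fills the vector with 2 1 1 1 2 2 1 1 2 1 1 2, matching the numbers printed above: each entry is the number of nonzeros in one column (term) of the 5×12 matrix.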