cuSPARSE: cusparseXnnz in CUDA 8 and 9 - SEGFAULT / BUG?

Hi,

I’m having an issue with the cuSPARSE library when the number of non-zero elements in a CSR matrix is > 2^31 - 1 (the signed int limit?).
Everything works fine when I stay below that limit.

Specifically, the code that causes the SEGFAULT in cusparseSnnz for dimensions Nrows = 262162 x Ncols = 11064 (262162 * 11064 = 2,900,560,368 > 2^31 - 1) is below. Everything works fine if I reduce each dimension by 20%. Also, if I skip the nnz computation and manually set nnz = Nrows * Ncols, everything works too.

Any ideas? Overflow in the function itself?

Thanks,
Daniel

void SparseMatrix::fromDense(float *h_dataDense, size_t Nrows, size_t Ncols)
{
    if (allocated) this->cleanup();

    this->Nrows = Nrows;
    this->Ncols = Ncols;
    cusparseHandle_t handle;
    cusparseSafeCall(cusparseCreate(&handle));
    // --- Host side dense matrix

    printf("Trying to allocate %f GBs on GPU\n", (float)(Nrows * Ncols * (float)sizeof(float))/(8.*1024.*1024.*1024.));
    float *d_A_dense;  gpuErrchk(cudaMallocManaged(&d_A_dense, Nrows * Ncols * sizeof(*d_A_dense)));
    gpuErrchk(cudaMemcpy(d_A_dense, h_dataDense, Nrows * Ncols * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
    // --- Descriptor for sparse matrix A
    cusparseSafeCall(cusparseCreateMatDescr(&descr));
    cusparseSetMatType      (descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase (descr, CUSPARSE_INDEX_BASE_ZERO);
    nnz = 0;                                // --- Number of nonzero elements in dense matrix
    const int lda = Nrows;  // --- Leading dimension of dense matrix
    // --- Device side number of nonzero elements per row
    gpuErrchk(cudaMallocManaged(&(d_nnzPerVector), Nrows  * sizeof(int)));
    cudaDeviceSynchronize();

    cusparseSafeCall(cusparseSnnz(handle, CUSPARSE_DIRECTION_ROW, Nrows, Ncols, descr, d_A_dense, lda, d_nnzPerVector, &nnz));   // <-- SEGFAULTS!
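
As a cross-check, the nonzeros can also be counted with a 64-bit accumulator via Thrust, which sidesteps the int counter entirely. A minimal sketch, assuming a plain x != 0.0f test matches what cusparseSnnz counts:

    #include <thrust/count.h>
    #include <thrust/execution_policy.h>

    // Predicate for thrust::count_if: true for every non-zero entry
    struct is_nonzero {
        __host__ __device__ bool operator()(float x) const { return x != 0.0f; }
    };

    // count_if returns a 64-bit difference_type, so the total cannot overflow
    size_t nnz64 = thrust::count_if(thrust::device,
                                    d_A_dense,
                                    d_A_dense + (size_t)Nrows * Ncols,
                                    is_nonzero());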

Referring to the documentation for the function:

http://docs.nvidia.com/cuda/cusparse/index.html#cusparse-lt-t-gt-nnz

we see that in the function prototype:

cusparseStatus_t
cusparseSnnz(cusparseHandle_t         handle,
             cusparseDirection_t      dirA,
             int                      m,
             int                      n,
             const cusparseMatDescr_t descrA,
             const float             *A,
             int                      lda,
             int                     *nnzPerRowColumn,
             int                     *nnzTotalDevHostPtr)

the final parameter is a pointer to an int quantity:

int *nnzTotalDevHostPtr

It stands to reason that the function was not designed for situations where the number of nonzero entries in your dense matrix exceeds what can be reliably stored in an int.
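
A caller-side guard is straightforward. A minimal sketch, assuming Nrows and Ncols are size_t as in the posted code:

    #include <climits>

    // Worst case is a fully dense matrix: nnz == Nrows * Ncols. If even that
    // upper bound does not fit in an int, the 32-bit count cannot be trusted.
    if ((size_t)Nrows * (size_t)Ncols > (size_t)INT_MAX) {
        // fall back to a 64-bit counting path, or refuse the conversion
    }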

I don’t see any issue here except perhaps more graceful fault handling.

If you’d like the function to be usable when the number of nonzeros is larger than what can be stored in an int, or to see more graceful fault handling in this case, I suggest filing an RFE (bug) at developer.nvidia.com

As an aside, a matrix of dimension 262162 x 11064 where nearly every element is non-zero hardly qualifies as a “sparse” matrix by any definition I am accustomed to. To each his own, but I’m not sure what value a sparse representation of such a matrix would bring.

I don’t think anymore that this is just an int overflow issue. If it were, the counter should return a negative value, not segfault. Something else must be going on inside this function. It must be a bug.

Re: sparsity – Sure, it’s not sparse, but that shouldn’t really matter. In our case it is part of software QA.

Signed integer overflow in C++ is undefined behavior:

https://stackoverflow.com/questions/18195715/why-is-unsigned-integer-overflow-defined-behavior-but-signed-integer-overflow-is

You cannot predict the behavior reliably, and anything is possible in the presence of UB.

Quite evidently, the burden is on the user of the function to ensure that signed integer overflow does not occur. Since the library routine may, for example, be using the running count as some sort of index, it’s not irrational to think that a negative index could result in a seg fault. Imposing the requirement that the library do careful overflow checking at each update of the variable could impose a runtime performance hit that many “ordinary” users of the function might find to be an undesirable tradeoff.
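
For illustration, this is the kind of thing that makes “it should just go negative” a non-guarantee:

    #include <climits>

    int count = INT_MAX;
    ++count;   // signed overflow is undefined behavior; the compiler may assume
               // it never happens, so a wrap to a negative value is not promised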

You might not like the behavior, but you haven’t offered any evidence that seems convincing to me to call it a bug.

In any event, file a bug if you wish.

Also, the “sparse” matrix would actually use more memory than the dense representation.

Assuming floats:
Dense representation: 10.8 GiB
CSR (w/ 2^31 entries): 16 GiB (values and indices; row offsets are negligible)
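
Back of the envelope, assuming 4-byte floats and 4-byte int indices:

    // Dense: 262162 * 11064 * 4 bytes = 11,602,241,472 bytes ~= 10.8 GiB
    size_t dense_bytes = 262162ull * 11064ull * sizeof(float);

    // CSR with 2^31 entries: values + column indices dominate at 16 GiB;
    // the (Nrows + 1) row offsets add only ~1 MiB
    size_t csr_bytes = (1ull << 31) * (sizeof(float) + sizeof(int))
                     + (262162ull + 1) * sizeof(int);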


However, I would probably report this. Not because it’s a bug (it’s working as intended), but because with increasing memory sizes on the device, this should be an int64 in the API. The int32 made sense when device memory was 4 GB, but as devices go beyond 16 GB, you should get some information from the function about whether it’s worth it to leave the matrix dense or convert it to a sparse format.

?

Every float nz value will use 4 bytes, and every column index will also use 4 bytes, so CSR should at least double the storage requirements for a dense matrix with no zeroes (completely ignoring any contribution from row pointers, as well as matrix storage overhead).

At 2^31 nnz values, the CSR representation will not fit in the memory of a 16 GB GPU.

However, there are GPUs with more than 16 GB of memory, and there is also unified memory oversubscription on Pascal and newer architectures.

For a completely dense matrix, yes. I assumed the CSR matrix was very full, but not dense. As per the OP, the original matrix is 262162 x 11064, which is > 2^31 elements. And you’re correct, the row offsets for that matrix are an extra MiB ((262162 + 1) * 4 bytes ~= 1.0 MiB). So that CSR would not fit on a 16 GB card.