cuSPARSE, Kepler and big matrices


I’m currently working with cuSPARSE and the CUDA Toolkit 6.5, and I couldn’t find any information on whether cuSPARSE is already optimized for the Kepler architecture (I checked the release notes for 5.0, 5.5, 6.0 and 6.5). I’m using a C2075 for now, so it isn’t critical, but I may switch to something like a K20 (and/or a GTX 780 Ti for single-precision tests) later.

Another question: I’m working with big matrices, created in COO format and converted to CSR (as in the cuSPARSE example), but when I reach 300K columns (for 10K rows) and 10% nnz, I get an illegal memory access, whether I use float or double. I don’t see why it happens at that point; the reported memory usage shows there is still memory available after the allocations. All the smaller matrices work fine.

Best regards

Fragmentation can result in a single allocation failing even though the requested size is less than what is reported as “available”.

300K columns × 10K rows × 0.1 × 4 bytes/float is over a gigabyte just for the nonzero values (the CSR column indices add another gigabyte). With double you’re over 3 GB (values + column indices). If you have ECC turned on, you’ve only got ~5 GB available on the C2075. COO also consumes more storage than CSR, and if you’re converting one to the other, you need storage for both at once… so I think fragmentation could well be the reason for an allocation failure at that point.

All the CUDA libraries (cuBLAS, cuSPARSE, cuFFT, NPP, etc.) are constantly being updated for performance, features, and bug fixes. It’s fairly common for these libraries to detect the underlying architecture/compute capability and use routines or kernels optimized for specific architectures. For example, Kepler devices offer warp shuffle intrinsics, which can be performance-significant in reductions or any time inter-thread communication is needed that might previously have been done through shared memory. I can’t give you specifics, but I would certainly expect that various routines in cuSPARSE have been optimized for Kepler.
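As an illustration of that kind of Kepler-specific optimization, here is a minimal warp-level sum reduction built on the shuffle intrinsic. This is just a sketch of the technique, not cuSPARSE’s actual code, and the one-warp-per-row kernel below is only a hypothetical usage:

```cuda
// Warp-level sum reduction with Kepler's (sm_30+) shuffle intrinsics:
// no shared memory and no __syncthreads() needed within the warp.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);  // pre-CUDA-9 spelling; later: __shfl_down_sync
    return val;                           // lane 0 ends up with the warp's total
}

// Hypothetical example: one warp per CSR row, each lane summing a strided slice.
__global__ void rowSums(const int *rowPtr, const float *vals, float *out, int m) {
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int lane = threadIdx.x % warpSize;
    if (row >= m) return;

    float sum = 0.0f;
    for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += warpSize)
        sum += vals[j];

    sum = warpReduceSum(sum);
    if (lane == 0) out[row] = sum;       // one result per row, written by lane 0
}
```

On pre-Kepler hardware the same reduction would need a shared-memory scratch array and a synchronization barrier, which is why architecture detection in the libraries matters.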

Thanks for the answer. I’ll try to check for memory fragmentation. As for the memory, I had done the calculations before: with 300K × 10K, I’m at 2.23 GB for the values, 1.12 GB for the column indices, 1.12 GB for the row indices and 11 KB for the CSR row pointers, which is roughly 4.5 GB, and I have 5.3 GB available on the C2075.

The error apparently comes from the line where I call “cusparseDcsrmv()”, not from a memory allocation.

Best regards

I found the error. It had nothing to do with the amount of memory; I simply had the rows and columns swapped: m is the number of rows and n the number of columns (sometimes you learn it the other way around).
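For reference, a small host-side sanity check like the following can catch that kind of m/n swap before the cusparseDcsrmv call, since with the dimensions swapped the row-pointer array has the wrong length and/or the column indices fall out of range. This is a sketch with hypothetical names, not a cuSPARSE API:

```c
#include <assert.h>

/* Returns 1 if a CSR matrix with m rows and n columns looks consistent:
   csrRowPtr must have m+1 non-decreasing entries bracketing [0, nnz],
   and every entry of csrColInd must lie in [0, n). */
int csr_is_consistent(int m, int n, int nnz,
                      const int *csrRowPtr, const int *csrColInd) {
    if (csrRowPtr[0] != 0 || csrRowPtr[m] != nnz) return 0;
    for (int i = 0; i < m; ++i)
        if (csrRowPtr[i] > csrRowPtr[i + 1]) return 0;   /* must be non-decreasing */
    for (int k = 0; k < nnz; ++k)
        if (csrColInd[k] < 0 || csrColInd[k] >= n) return 0;  /* column in range */
    return 1;
}
```

Running this on the host makes the mistake show up as a clear validation failure instead of an illegal memory access inside the kernel.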