Questions about CUB DeviceSpmv

My code currently uses ModernGPU 1.1 to perform a “unary” SpMV using SpmvCsrUnary(). The vector elements are structs of uint64s, and the “+” operator is a struct-wide bitwise XOR. All nonzero matrix values are 1s, so we don’t use the CSR values array, only the row_offsets and column_indices arrays.

We would like to move to a more recently updated library optimized for better performance on newer GPUs. To this end, does CUB’s DeviceSpmv() support only primitive data types or can it use a derived data type for the vectors with an overloaded “+” operator?

Can it be easily modified to support a “unary” SpMV, perhaps by setting and detecting a null pointer for the values array?

This is not something that is currently available in cuSPARSE or CUB.
This feels like a GraphBLAS request. Can you confirm?

I took a quick look at GraphBLAS. It supports derived data types for vectors with custom operators, but it seems to store matrices and vectors in opaque objects, which raises two issues.

For matrices, we don’t want to store the values since all nonzero entries are 1s, but GrB_Matrix_new() and GrB_Matrix_build() require a domain and a values array that I’m not sure can be empty. Is this possible? Our (sub-)matrices have up to ~20 million rows and columns with ~2^30 nonzeros, so we don’t have the GPU memory to store these.
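As a rough sanity check on the memory concern, here is a small sketch of the CSR footprint with and without a values array. The 32-bit index width and the 1-byte and 8-byte value sizes are assumptions for illustration only; the function and constant names are hypothetical.

```cpp
#include <cstdint>

// Bytes needed for a CSR matrix. value_bytes == 0 models the "unary" case
// in which the values array is omitted entirely.
constexpr std::uint64_t csr_bytes(std::uint64_t rows, std::uint64_t nnz,
                                  std::uint64_t index_bytes,
                                  std::uint64_t value_bytes) {
    return (rows + 1) * index_bytes   // row_offsets
         + nnz * index_bytes          // column_indices
         + nnz * value_bytes;         // values (0 if omitted)
}

constexpr std::uint64_t kRows = 20000000;                // ~20 million rows
constexpr std::uint64_t kNnz  = std::uint64_t{1} << 30;  // ~2^30 nonzeros

// Assumed 4-byte indices; value sizes are illustrative guesses.
constexpr std::uint64_t kPatternOnly = csr_bytes(kRows, kNnz, 4, 0);
constexpr std::uint64_t kWithByteValues = csr_bytes(kRows, kNnz, 4, 1);
constexpr std::uint64_t kWithU64Values = csr_bytes(kRows, kNnz, 4, 8);
```

Under these assumptions the pattern alone takes about 4.07 GiB, a 1-byte value per nonzero adds 1 GiB, and an 8-byte value per nonzero adds 8 GiB.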

For the vectors, this is one step in a processing chain. We need to know the in-memory storage format (currently an array of structs) of the vectors for the other steps. This likely wouldn’t be too hard to figure out for a particular GraphBLAS implementation, but that could break if we upgrade or switch implementations of the library.

Are there straightforward solutions to these issues that I’m overlooking?

In the end I modified CUB’s DeviceSpmv() to do what we need. I turned DeviceSpmv() into DeviceUnarySpmv(), which doesn’t use the values array, and eliminated alpha and beta so that it only computes y = A x or, optionally, y = A x + y. This removes the need for any multiplication operation. The elements of x and y can be any derived type (a struct of uint64s for us), and the addition operation is given by overloading + (a struct-wide bitwise XOR for us). Depending on the matrix and GPU, it runs about 30-40% faster than ModernGPU’s SpmvCsrUnary().
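For anyone who wants to check results against a simple baseline, here is a minimal sequential host-side sketch of the operation described above. It is not the modified CUB code. The names Block and unary_spmv_reference, the 4-word struct width, and the 32-bit int indices are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical element type; the width (4 words) is an assumption.
struct Block {
    std::uint64_t w[4];
};

// "+" is a struct-wide bitwise XOR, as described in the post.
inline Block operator+(const Block& a, const Block& b) {
    Block r{};
    for (int i = 0; i < 4; ++i) r.w[i] = a.w[i] ^ b.w[i];
    return r;
}

// Sequential reference for y = A x (accumulate == false) or y = A x + y
// (accumulate == true). A is given only by its CSR pattern: every stored
// entry is an implicit 1, so there is no values array and no multiplication.
template <typename T>
void unary_spmv_reference(const std::vector<int>& row_offsets,
                          const std::vector<int>& column_indices,
                          const std::vector<T>& x,
                          std::vector<T>& y,
                          bool accumulate) {
    assert(!row_offsets.empty());
    const std::size_t num_rows = row_offsets.size() - 1;
    assert(y.size() == num_rows);
    for (std::size_t row = 0; row < num_rows; ++row) {
        // T{} must be the identity of "+" (all-zero bits for XOR).
        T sum = accumulate ? y[row] : T{};
        for (int k = row_offsets[row]; k < row_offsets[row + 1]; ++k) {
            sum = sum + x[column_indices[k]];
        }
        y[row] = sum;
    }
}
```

Because the template only relies on operator+ and on T{} being the additive identity, the same sketch works for any element type that meets those two requirements. With XOR as the addition, calling it a second time with accumulate set to true on the same inputs returns y to all zeros, which makes a handy self-check.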
