About Hardware Memory Compression

Hello,

I have a question on the Hardware Memory Compression feature of the cuSPARSE library.

I have a sparse linear system of the form Ax=b and I'm solving it using iterative methods. While prototyping the optimization features of cuSPARSE, I was able to get a significant speed-up using the "graph capture" feature. I also noticed that functions such as cusparseSpMV and cusparseSpSV can benefit from "hardware memory compression".

I successfully implemented that feature too, however I didn't notice any gains from it. If I understand correctly, the arrays I pass into these functions should contain repeating values so that the hardware can compress them as they move between the GPU's global memory and its caches. A sparse matrix in CSR format consists of three dense arrays: the ROW offsets array, which contains only unique (monotonically increasing) numbers; the VALUES array, which in my case most probably contains all unique values; and the COLUMN indices array, which holds the column numbers and is the only one that can contain repeating values. Furthermore, the dense vector that the sparse matrix is multiplied with has no repeating values either (in my case).
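
To make that concrete, here is a small made-up 4x4 example (just for illustration, not my actual data) of what those three CSR arrays look like:

//     | 10  0  0  2 |
// A = |  3  9  0  0 |
//     |  0  7  8  7 |
//     |  0  0  4  5 |
int             rowOffsets[] = { 0, 2, 4, 7, 9 };             // strictly increasing, all unique
int             colIndices[] = { 0, 3, 0, 1, 1, 2, 3, 2, 3 }; // repeats are possible here
cuDoubleComplex values[9];                                    // the 9 nonzeros; in my case mostly unique complex numbers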

My question would be: if I don't have many repeating values in those arrays, I cannot benefit from hardware memory compression in terms of reduced timings. Is that right? Or, if not, what is the intended use case of this feature?

Regards
Deniz

P.S. I have an RTX 4090 and an RTX 3090 for prototyping.

Hi Deniz,
As mentioned in the CUDA Hopper Tuning Guide [link], the compressor automatically chooses between several possible compression algorithms, or none if there is no suitable pattern. So I suspect that your data simply does not benefit from compression.

Can you please explain your use case and application? Also, can you share your code and the input matrix?

Thanks

Sure,

I'm solving a sparse linear system (Ax=b) arising from geophysical problems. A is a sparse, complex symmetric, non-Hermitian matrix, usually obtained from a finite difference or finite element discretization of a mesh. b is also dense and complex-valued. I have written a conjugate-gradient-type solver myself, based on the GPBiCG algorithm.

The most time-consuming parts of the iterative solver are the sparse matrix-dense vector multiplications and the sparse triangular solves for the preconditioning step. The preconditioner is factorized with cuSPARSE's ILU(0). The solver usually needs 100 to 200 iterations to converge to my desired level of accuracy. Since every iteration calls the same cuSPARSE and cuBLAS functions, I use the graph capture feature, which I implemented successfully (and I'm happy with its performance).
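
For reference, the capture itself looks roughly like the sketch below (simplified; matA, vecX, vecY, dBuffer, handle and stream are placeholders, the descriptors are created and cusparseSpMV_bufferSize is called before the capture, and the scalars here are fixed host values, which would need to live in device memory if they changed between graph launches):

cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);    // host scalars get baked in at capture time

cudaGraph_t     graph;
cudaGraphExec_t graphExec;
cusparseSetStream(handle, stream);                         // all library work must go to the captured stream

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
             &alpha, matA, vecX, &beta, vecY,
             CUDA_C_64F, CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);
// ... the SpSV preconditioner solves and the cuBLAS vector updates are captured the same way ...
cudaStreamEndCapture(stream, &graph);

cudaGraphInstantiate(&graphExec, graph, 0);                // CUDA 12-style signature

// One graph launch per solver iteration instead of re-issuing every library call.
for (int iter = 0; iter < maxIter; ++iter)
    cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);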

My implementation details are a little bit different. I use MATLAB as the main language and write my CUDA code inside MEX functions, which I then compile so they can be called directly from MATLAB. I pass my matrix A and the vector b, which already live in GPU memory on the MATLAB side, to the compiled CUDA code, and it solves the system for me. This strategy has worked perfectly fine for other CUDA-C/MEX codes I have created over the years.
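
For context, the MEX entry point receives the gpuArray inputs roughly like this (a trimmed sketch using the mxGPUArray API, compiled with mexcuda; names are placeholders):

#include "mex.h"
#include "gpu/mxGPUArray.h"
#include <cuComplex.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxInitGPU();                                            // initialize the MathWorks GPU API first

    // prhs[0] is assumed to be a complex double gpuArray that already lives on the device.
    const mxGPUArray      *b   = mxGPUCreateFromMxArray(prhs[0]);
    const cuDoubleComplex *d_b = (const cuDoubleComplex *)mxGPUGetDataReadOnly(b);

    // ... hand d_b (and the CSR arrays of A, obtained the same way) to the GPBiCG solver ...

    mxGPUDestroyGPUArray(b);                                // releases the wrapper, not the gpuArray data
}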

I can share the whole code or the matrix and vectors, but I think I should first show how I define the CUdeviceptr variables for hardware memory compression. (I defined every vector in my code/algorithm this way, hoping to see a speed-up.) I figured out how to do it from the examples online; maybe it reveals a mistake or a missing element, if there is one.

// Allocation properties: pinned device memory with generic compression enabled.
CUmemGenericAllocationHandle handle;
CUmemAllocationProp prop = {};
prop.type                       = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type              = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id                = dev;
prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;
prop.win32HandleMetaData        = 0;

// Access descriptor: read/write access from the same device.
CUmemAccessDesc accessDesc = {};
accessDesc.location = prop.location;
accessDesc.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

size_t   granularity = 0;
size_t   regularsize;
size_t   paddedsize;
CUresult result;

// Pad the requested size up to the recommended allocation granularity.
regularsize = N * sizeof(cuDoubleComplex);
cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);
paddedsize  = round_up(regularsize, granularity);   // round_up: pads size to a multiple of granularity

// Reserve a VA range, create the compressible physical allocation, map it, and enable access.
CUdeviceptr d_compressed_variable = 0;
cuMemAddressReserve(&d_compressed_variable, paddedsize, 0, 0, 0);
result = cuMemCreate(&handle, paddedsize, &prop, 0);
result = cuMemMap(d_compressed_variable, paddedsize, 0, handle, 0);
result = cuMemSetAccess(d_compressed_variable, paddedsize, &accessDesc, 1);
// At this stage I may use cudaMemcpy to populate d_compressed_variable, or leave it as a zero-valued initial vector.
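
One extra check I am planning to add right after cuMemCreate, to confirm both that the device reports generic compression support and that this particular allocation actually received a compressible backing (based on the driver API queries I found in the docs, so treat it as a sketch):

int compressionSupported = 0;
cuDeviceGetAttribute(&compressionSupported,
                     CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, dev);

CUmemAllocationProp allocProp = {};
cuMemGetAllocationPropertiesFromHandle(&allocProp, handle);

if (!compressionSupported ||
    allocProp.allocFlags.compressionType != CU_MEM_ALLOCATION_COMP_GENERIC)
    mexPrintf("Warning: allocation is not backed by compressible memory\n");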

The functions that don't accept a CUdeviceptr directly can take this variable with a (cuDoubleComplex*) cast (see the small example below), so there is no problem there. I can access its contents from my custom kernels too, so no problem there either. I lean towards the idea that my data simply has no pattern suitable for compression/decompression.
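
For example, this is roughly how the cast pointer goes into the cuSPARSE descriptors (N and CUDA_C_64F match my complex double vectors):

cuDoubleComplex *d_x = (cuDoubleComplex *)d_compressed_variable;   // plain device pointer view

cusparseDnVecDescr_t vecX;
cusparseCreateDnVec(&vecX, N, d_x, CUDA_C_64F);                    // then passed to cusparseSpMV / cusparseSpSV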

Please also take a look at CUDALibrarySamples/cuSPARSE/compression at master · NVIDIA/CUDALibrarySamples · GitHub, in case it is useful.

Yes, that example is mainly what I used to figure it out. I also checked other PDFs from NVIDIA, and one of them suggested a scenario with a time-domain wave propagation example. At early time steps, most of the values in the medium are zero, and they only become non-zero as the wave propagates towards the other side of the medium. That example made sense to me, and hardware memory compression could be utilized perfectly in such a case. On the other hand, a frequency-domain problem like mine may not benefit from it, since all the values are solved for in one go.