Sure,

I’m solving a sparse linear system (Ax=b) arising from geophysical problems. A is a sparse, complex symmetric (non-Hermitian) matrix, usually obtained by a finite-difference or finite-element discretization on a mesh. b is dense and also complex-valued. I’ve written a conjugate-gradient-type solver myself, based on the GPBiCG algorithm.
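To make "complex symmetric but non-Hermitian" concrete, here is a minimal sketch on a made-up 2x2 dense matrix (the names `is_symmetric`, `is_hermitian`, and `example_matrix` are hypothetical and not part of my solver). The matrix equals its plain transpose but not its conjugate transpose.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
using Matrix = std::vector<std::vector<cplx>>;

// True if A equals its plain (unconjugated) transpose.
bool is_symmetric(const Matrix& A) {
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A.size(); ++j)
            if (A[i][j] != A[j][i]) return false;
    return true;
}

// True if A equals its conjugate transpose.
bool is_hermitian(const Matrix& A) {
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < A.size(); ++j)
            if (A[i][j] != std::conj(A[j][i])) return false;
    return true;
}

// Illustrative values only: equal off-diagonal entries, complex diagonal.
Matrix example_matrix() {
    return Matrix{{cplx(2.0, 1.0), cplx(0.0, -1.0)},
                  {cplx(0.0, -1.0), cplx(3.0, 0.5)}};
}
```

This is why Hermitian-only methods do not apply directly and I use a GPBiCG-based solver instead.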

The most time-consuming parts of an iterative solver are the sparse matrix-dense vector multiplications and the sparse triangular solves in the preconditioning step. The preconditioner itself is computed with CUDA’s ILU(0) factorization. The iterative solver usually needs 100 to 200 steps to converge to my desired level of accuracy. Since every step calls the same cuSPARSE and cuBLAS functions, I use the CUDA graph capture feature. I implemented that successfully and I’m happy with its performance.

The implementation details are a little different in my case. I use MATLAB as the main language and write my CUDA code as MEX functions, which I compile so that they can be called directly from MATLAB. I pass the matrix A and the vector b, which already reside in GPU memory in MATLAB, to the compiled CUDA code, and it solves the equation for me. This strategy has worked perfectly well for years for other codes I have written with CUDA C and MEX functions.

I can share the whole code or the matrix and vectors, but I think I should first show how I define the “CUdeviceptr” variables for hardware memory compression. (I defined every vector in my code/algorithm this way, hoping to see a speed-up.) I figured out how to do it from the examples online. It may reveal a mistake or a missing element, if there is one.

```
CUmemGenericAllocationHandle handle;
CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = dev;
// Request compressible memory
prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;
prop.win32HandleMetaData = 0;
CUmemAccessDesc accessDesc = {};
accessDesc.location = prop.location;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
size_t granularity = 0;
size_t regularsize;
size_t paddedsize;
CUresult result;
regularsize = N * sizeof(cuDoubleComplex);
// Pad the size to a multiple of the recommended allocation granularity
result = cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED);
paddedsize = round_up(regularsize, granularity);
CUdeviceptr d_compressed_variable = 0; // CUdeviceptr is an integer type, so 0 rather than NULL
// Reserve a virtual address range, create the physical allocation, map it, and enable read/write access
result = cuMemAddressReserve(&d_compressed_variable, paddedsize, 0, 0, 0);
result = cuMemCreate(&handle, paddedsize, &prop, 0);
result = cuMemMap(d_compressed_variable, paddedsize, 0, handle, 0);
result = cuMemSetAccess(d_compressed_variable, paddedsize, &accessDesc, 1);
// At this stage I may use cudaMemcpy to populate d_compressed_variable, or skip the copy and use it as a zero-valued initial vector
```
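The snippet above calls `round_up`, which is not shown. A minimal sketch of such a helper, assuming it simply rounds the byte count up to the next multiple of the granularity (the body is my illustration of that assumption, not a CUDA API function):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `size` up to the next multiple of `granularity`.
// Assumes granularity > 0, e.g. a value returned by cuMemGetAllocationGranularity.
static inline std::size_t round_up(std::size_t size, std::size_t granularity) {
    assert(granularity != 0);
    return ((size + granularity - 1) / granularity) * granularity;
}
```

A size that is already a multiple of the granularity comes back unchanged, so no extra padding is added in that case.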

The functions that don’t take a CUdeviceptr directly accept this variable with a (cuDoubleComplex*) cast, so there is no problem there. I can access its contents from my custom kernels too, again without any problem. I lean towards the idea that my data may simply have no pattern suitable for compression.