cuSOLVER into python (possibly for pyCUDA)

Dear NVIDIA community,

since we were not very successful yet posting this problem on, we hope that we can solve our problem directly with you developers here.

We are experiencing problems while using cuSOLVER’s cusolverSpScsrlsvchol function, probably due to misunderstanding of the cuSOLVER library…

Motivation: we are solving the Poisson equation -divgrad x = b on a rectangular grid. In 2 dimensions with a 5-stencil (1, 1, -4, 1, 1), the Laplacian on the grid provides a (quite sparse) matrix A. Moreover, the charge distribution on the grid gives a (dense) vector b. A is positive definite and symmetric.

Now we solve A*x = b for x using nvidia’s new cuSOLVER library that comes with cuda-7.0 . It provides a function cusolverSpScsrlsvchol which should do the sparse Cholesky factorisation for floats.

Note: we are able to correctly solve the system with the alternative sparse QR factorisation function cusolverSpScsrlsvqr. For a 4x4 grid with all b entries on the edge being 1 and the rest 0 we get for x:

1 1 0.999999 1 1 1 0.999999 1 1 1 1 1 1 1 1 1

cusolverSpScsrlsvchol returns wrong results for x:

1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1

What are we doing wrong? Thanks for your help!

Our C++ CUDA code:

#include <iostream>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cusolverSp.h>
#include <cusparse.h>
#include <vector>
#include <cassert>

// create poisson matrix with Dirichlet bc. of a rectangular grid with
// dimension NxN
void assemble_poisson_matrix_coo(std::vector<float>& vals, std::vector<int>& row, std::vector<int>& col,
                     std::vector<float>& rhs, int Nrows, int Ncols) {

        //nnz: 5 entries per row (node) for nodes in the interior
    // 1 entry per row (node) for nodes on the boundary, since we set them explicitly to 1.
    int nnz = 5*Nrows*Ncols - (2*(Ncols-1) + 2*(Nrows-1))*4;

    int counter = 0;
    for(int i = 0; i < Nrows; ++i) {
        for (int j = 0; j < Ncols; ++j) {
            int idx = j + Ncols*i;
            if (i == 0 || j == 0 || j == Ncols-1 || i == Nrows-1) {
                vals[counter] = 1.;
                row[counter] = idx;
                col[counter] = idx;
                rhs[idx] = 1.;
//                if (i == 0) {
//                    rhs[idx] = 3.;
//                }
            } else { // -laplace stencil
                // above
                vals[counter] = -1.;
                row[counter] = idx;
                col[counter] = idx-Ncols;
                // left
                vals[counter] = -1.;
                row[counter] = idx;
                col[counter] = idx-1;
                // center
                vals[counter] = 4.;
                row[counter] = idx;
                col[counter] = idx;
                // right
                vals[counter] = -1.;
                row[counter] = idx;
                col[counter] = idx+1;
                // below
                vals[counter] = -1.;
                row[counter] = idx;
                col[counter] = idx+Ncols;

                rhs[idx] = 0;
    assert(counter == nnz);

int main() {
    // --- create library handles:
    cusolverSpHandle_t cusolver_handle;
    cusolverStatus_t cusolver_status;
    cusolver_status = cusolverSpCreate(&cusolver_handle);
    std::cout << "status create cusolver handle: " << cusolver_status << std::endl;

    cusparseHandle_t cusparse_handle;
    cusparseStatus_t cusparse_status;
    cusparse_status = cusparseCreate(&cusparse_handle);
    std::cout << "status create cusparse handle: " << cusparse_status << std::endl;

    // --- prepare matrix:
    int Nrows = 4;
    int Ncols = 4;
    std::vector<float> csrVal;
    std::vector<int> cooRow;
    std::vector<int> csrColInd;
    std::vector<float> b;

    assemble_poisson_matrix_coo(csrVal, cooRow, csrColInd, b, Nrows, Ncols);

    int nnz = csrVal.size();
    int m = Nrows * Ncols;
    std::vector<int> csrRowPtr(m+1);

    // --- prepare solving and copy to GPU:
    std::vector<float> x(m);
    float tol = 1e-5;
    int reorder = 0;
    int singularity = 0;

    float *db, *dcsrVal, *dx;
    int *dcsrColInd, *dcsrRowPtr, *dcooRow;
    cudaMalloc((void**)&db, m*sizeof(float));
    cudaMalloc((void**)&dx, m*sizeof(float));
    cudaMalloc((void**)&dcsrVal, nnz*sizeof(float));
    cudaMalloc((void**)&dcsrColInd, nnz*sizeof(int));
    cudaMalloc((void**)&dcsrRowPtr, (m+1)*sizeof(int));
    cudaMalloc((void**)&dcooRow, nnz*sizeof(int));

    cudaMemcpy(db,, b.size()*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dcsrVal,, csrVal.size()*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dcsrColInd,, csrColInd.size()*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dcooRow,, cooRow.size()*sizeof(int), cudaMemcpyHostToDevice);

    cusparse_status = cusparseXcoo2csr(cusparse_handle, dcooRow, nnz, m,
                                       dcsrRowPtr, CUSPARSE_INDEX_BASE_ZERO);
    std::cout << "status cusparse coo2csr conversion: " << cusparse_status << std::endl;

    cudaDeviceSynchronize(); // matrix format conversion has to be finished!

    // --- everything ready for computation:

    cusparseMatDescr_t descrA;

    cusparse_status = cusparseCreateMatDescr(&descrA);
    std::cout << "status cusparse createMatDescr: " << cusparse_status << std::endl;

    // optional: print dense matrix that has been allocated on GPU

    std::vector<float> A(m*m, 0);
    float *dA;
    cudaMalloc((void**)&dA, A.size()*sizeof(float));

    cusparseScsr2dense(cusparse_handle, m, m, descrA, dcsrVal,
                       dcsrRowPtr, dcsrColInd, dA, m);

    cudaMemcpy(, dA, A.size()*sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "A: \n";
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < m; ++j) {
            std::cout << A[i*m + j] << " ";
        std::cout << std::endl;


    std::cout << "b: \n";
    cudaMemcpy(, db, (m)*sizeof(int), cudaMemcpyDeviceToHost);
    for (auto a : b) {
        std::cout << a << ",";
    std::cout << std::endl;

// --- solving!!!!

// // does not work:
//    cusolver_status = cusolverSpScsrlsvchol(cusolver_handle, m, nnz, descrA, dcsrVal,
//                       dcsrRowPtr, dcsrColInd, db, tol, reorder, dx,
//                       &singularity);

     cusolver_status = cusolverSpScsrlsvqr(cusolver_handle, m, nnz, descrA, dcsrVal,
                        dcsrRowPtr, dcsrColInd, db, tol, reorder, dx,


    std::cout << "singularity (should be -1): " << singularity << std::endl;

    std::cout << "status cusolver solving (!): " << cusolver_status << std::endl;

    cudaMemcpy(, dx, m*sizeof(float), cudaMemcpyDeviceToHost);

    cusparse_status = cusparseDestroy(cusparse_handle);
    std::cout << "status destroy cusparse handle: " << cusparse_status << std::endl;

    cusolver_status = cusolverSpDestroy(cusolver_handle);
    std::cout << "status destroy cusolver handle: " << cusolver_status << std::endl;

    for (auto a : x) {
        std::cout << a << " ";
    std::cout << std::endl;


    return 0;

We encountered a subsequent problem when interfacing this now into python – hence the question title. (

We have tried wrapping the methods the same way the dense cuSolver routines are wrapped in scikits-cuda (

However, the code crashes with a segmentation fault when calling the cusolverSpDcsrlsvqr() function. Debugging with cuda-gdb (cuda-gdb --args python -m pycuda.debug; run;bt) yields the following stacktrace,

#0 0x00007fffd9e3b71a in cusolverSpXcsrissymHost () from /usr/local/cuda/lib64/ #1 0x00007fffd9df5237 in hsolverXcsrqr_zeroPivot () from /usr/local/cuda/lib64/
#2 0x00007fffd9e0c764 in hsolverXcsrqr_analysis_coletree () from /usr/local/cuda/lib64/
#3 0x00007fffd9f160a0 in cusolverXcsrqr_analysis () from /usr/local/cuda/lib64/
#4 0x00007fffd9f28d78 in cusolverSpScsrlsvqr () from /usr/local/cuda/lib64/

which is weird, since we do not call cusolverSpScsrlsvqr() nor do we think it should call a host function (cusolverSpXcsrissymHost).

The corresponding python code looks like this:

# ### Interface cuSOLVER PyCUDA

import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import scipy.sparse as sp
import ctypes

# #### wrap the cuSOLVER cusolverSpDcsrlsvqr() using ctypes

# cuSparse
_libcusparse = ctypes.cdll.LoadLibrary('')

class cusparseMatDescr_t(ctypes.Structure):
    _fields_ = [
        ('MatrixType', ctypes.c_int),
        ('FillMode', ctypes.c_int),
        ('DiagType', ctypes.c_int),
        ('IndexBase', ctypes.c_int)
_libcusparse.cusparseCreate.restype = int
_libcusparse.cusparseCreate.argtypes = [ctypes.c_void_p]

_libcusparse.cusparseDestroy.restype = int
_libcusparse.cusparseDestroy.argtypes = [ctypes.c_void_p]

_libcusparse.cusparseCreateMatDescr.restype = int
_libcusparse.cusparseCreateMatDescr.argtypes = [ctypes.c_void_p]

_libcusolver = ctypes.cdll.LoadLibrary('')

_libcusolver.cusolverSpCreate.restype = int
_libcusolver.cusolverSpCreate.argtypes = [ctypes.c_void_p]

_libcusolver.cusolverSpDestroy.restype = int
_libcusolver.cusolverSpDestroy.argtypes = [ctypes.c_void_p]

_libcusolver.cusolverSpDcsrlsvqr.restype = int
_libcusolver.cusolverSpDcsrlsvqr.argtypes= [ctypes.c_void_p,

#### Prepare the matrix and parameters, copy to Device via gpuarray

# coo to csr
val = np.arange(1,5,dtype=np.float64)
col = np.arange(0,4,dtype=np.int32)
row = np.arange(0,4,dtype=np.int32)
A = sp.coo_matrix((val,(row,col))).todense()
Acsr = sp.csr_matrix(A)
b = np.ones(4)
x = np.empty(4)
print('A:' + str(A))
print('b: ' + str(b))

dcsrVal = gpuarray.to_gpu(
dcsrColInd = gpuarray.to_gpu(Acsr.indices)
dcsrIndPtr = gpuarray.to_gpu(Acsr.indptr)
dx = gpuarray.to_gpu(x)
db = gpuarray.to_gpu(b)
m = ctypes.c_int(4)
nnz = ctypes.c_int(4)
descrA = cusparseMatDescr_t()
reorder = ctypes.c_int(0)
tol = ctypes.c_double(1e-10)
singularity = ctypes.c_int(99)

#create cusparse handle
_cusp_handle = ctypes.c_void_p()
status = _libcusparse.cusparseCreate(ctypes.byref(_cusp_handle))
print('status: ' + str(status))
cusp_handle = _cusp_handle.value

#create MatDescriptor
status = _libcusparse.cusparseCreateMatDescr(ctypes.byref(descrA))
print('status: ' + str(status))

#create cusolver handle
_cuso_handle = ctypes.c_void_p()
status = _libcusolver.cusolverSpCreate(ctypes.byref(_cuso_handle))
print('status: ' + str(status))
cuso_handle = _cuso_handle.value

print('cusp handle: ' + str(cusp_handle))
print('cuso handle: ' + str(cuso_handle))

### Call solver

# destroy handles
status = _libcusolver.cusolverSpDestroy(cuso_handle)
print('status: ' + str(status))
status = _libcusparse.cusparseDestroy(cusp_handle)
print('status: ' + str(status))

Thanks a ton for your help!!

One problem, as indicated in the answer at SO, is that your A matrix is not symmetric. That is a requirement for the cholesky version, and a differentiator between the cholesky and QR version.

I just ran the program and txbob is correct, the A matrix is not symmetric.

From the manual on csrlsvchol:

The function csrlsvchol takes a CUSPARSE_MATRIX_TYPE_GENERAL as its input and assumes it’s symmetric (i.e. it only uses the lower half). This is why the error is appears when you solve it via Cholesky and not QR (they’re not solving the same system).

The Laplacian operator should be symmetric, the root issue is the matrix generator code you are using is not correct. It should generate an entry for every (i,j) and (j,i) pair, the value should be the same for ij and ji.

Thanks a lot for all your replies, indeed we did not take into account that the Dirichlet boundary conditions make the Laplacian matrix asymmetric… But this should be the case, right?

The post on SO came directly after I had posted here, sorry.

For a Laplacian with open boundary conditions, I agree that one obtains a completely symmetric Laplacian – however, for Dirichlet boundary conditions we need to explicitly only set the diagonal entry to 1 (all off-diagonals to 0) and the RHS to 0 as well. In this case, one actually breaks the symmetry of the matrix, no?

Concerning the interfacing to python, the problem is still open… It is obscure to me why the device function would ever call a host function on a (previously) device-stored matrix.

You can also move the Dirichlet conditions into the b vector. Then your A matrix would be symmetric. This is easier to think about if you use finite volume instead of finite difference (i.e. your values are cell centered instead of on the nodes).

In other words, set cells just outside the boundary to your Dirichlet conditions, then move them into the b vector. For example, if your boundary conditions are 1 outside the square on all sides, for the first cell (0,0), your eqn would become:

(c(-1,0)-c(0,0)) - (c(0,0)-c(1,0)) + (c(0,-1)-c(0,0))-(c(0,0)-c(0,1)) =
c(-1,0) + c(0,-1) - 4c(0,0) + c(0,1) + c(1,0) =
1 + 1 -4c(0,0) + c(0,1) + c(1,0) =
-4c(0,0) + c(0,1) + c(1,0) = -2
4c(0,0) - c(0,1) - c(1,0) = 2

Note that c(-1,0) = c(0,-1) = 1 in this case. This would preserve your A symmetry. If you don’t like this approximate Dirichlet condition, you can approximate the condition at the cell boundary and use a Taylor series to approximate the cell centered value outside the grid.

For an example like this at steady state, it’s not necessary to use the Taylor approximation (both boundary conditions will yield the same solution), but for transient problems it can be important.

Actually you’re right, trf86, thanks a lot for the hint! The Dirichlet boundary conditions make the corresponding columns of the boundaries in A redundant: 1*x = 0 will lead to x = 0 in any case such that these phi entries do not play any role for calculating b entries for inner nodes.

So we’ll just use a modified matrix A with only the inner nodes present which makes it symmetric. The corresponding Taylor approximation is then not needed (since we have the luxury to have both x and b being zero in these entries ;-) ).

Thanks a lot for all your comments!