Help Improving Performance using cuSolver/cuSparse Routines

I wanted to report, and ask for help with, CUDA cuSolver/cuSparse GPU routines that are running slower than their CPU counterparts (Python → SciPy sparse solvers), and of course to find out whether anything is being done incorrectly so that performance can be improved.

  • CPU Model:
>wmic cpu get caption, deviceid, name, numberofcores, maxclockspeed, status
Caption                              DeviceID  MaxClockSpeed  Name                                     NumberOfCores  Status
AMD64 Family 25 Model 97 Stepping 2  CPU0      3601           AMD Ryzen 7 7745HX with Radeon Graphics  8              OK
  • GPU Model:
Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
  CUDA Driver Version / Runtime Version          12.3 / 12.3
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 8188 MBytes (8585216000 bytes)
  (036) Multiprocessors, (128) CUDA Cores/MP:    4608 CUDA Cores
  GPU Max Clock rate:                            1605 MHz (1.61 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 33554432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 1
Result = PASS

Example 1: cusolverSpDcsrlsvqr() vs scipy.sparse.linalg.spsolve()
Adapted from: "Interfacing cuSOLVER-sparse using PyCUDA" (Stack Overflow)

### Interface cuSOLVER/cuSPARSE with PyCUDA
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from scipy import sparse
import ctypes
import timeit
from tabulate import tabulate
import os

### Wrap the cuSOLVER cusolverSpDcsrlsvqr() using ctypes

# cuSparse
_libcusparse = ctypes.cdll.LoadLibrary("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\cusparse64_12.dll")

class cusparseMatDescr_t(ctypes.Structure):
    _fields_ = [
        ('MatrixType', ctypes.c_int),
        ('FillMode', ctypes.c_int),
        ('DiagType', ctypes.c_int),
        ('IndexBase', ctypes.c_int)
        ]
_libcusparse.cusparseCreate.restype = int
_libcusparse.cusparseCreate.argtypes = [ctypes.c_void_p]

_libcusparse.cusparseDestroy.restype = int
_libcusparse.cusparseDestroy.argtypes = [ctypes.c_void_p]

_libcusparse.cusparseCreateMatDescr.restype = int
_libcusparse.cusparseCreateMatDescr.argtypes = [ctypes.c_void_p]


# cuSOLVER
_libcusolver = ctypes.cdll.LoadLibrary("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\cusolver64_11.dll")

_libcusolver.cusolverSpCreate.restype = int
_libcusolver.cusolverSpCreate.argtypes = [ctypes.c_void_p]

_libcusolver.cusolverSpDestroy.restype = int
_libcusolver.cusolverSpDestroy.argtypes = [ctypes.c_void_p]

_libcusolver.cusolverSpDcsrlsvqr.restype = int
_libcusolver.cusolverSpDcsrlsvqr.argtypes= [ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_int,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_double,
                                            ctypes.c_int,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p]


os.system('cls')

resultsNNZ=["NNZ (log10)"]
resultsGPU=["GPU (s)"]
resultsCPU=["CPU (s)"]
resultsRatio=["Slowdown (GPU/CPU)"]

nRuns = 100

for nElements in [10, 100, 100_000, 1_000_000]:
    et_cpu=0
    et_gpu=0
    os.system('cls')
    print(tabulate([resultsNNZ, resultsGPU, resultsCPU, resultsRatio]))

    for ii in range(0,nRuns):
        #### Prepare the matrix and parameters
        val = numpy.arange(1,nElements+1,dtype=numpy.float64)
        col = numpy.arange(0,nElements,dtype=numpy.int32)
        row = numpy.arange(0,nElements,dtype=numpy.int32)
        A = sparse.coo_matrix((val,(row,col)),shape=(nElements,nElements))
        Acsr = sparse.csr_matrix(A)
        b = numpy.ones(nElements)
        x = numpy.empty(nElements)
        vals_gpu = numpy.empty_like(x)

        # Copy Data Host -> Device
        dcsrVal = gpuarray.to_gpu(Acsr.data)
        dcsrColInd = gpuarray.to_gpu(Acsr.indices)
        dcsrIndPtr = gpuarray.to_gpu(Acsr.indptr)
        dx = gpuarray.to_gpu(x)
        db = gpuarray.to_gpu(b)
        m = ctypes.c_int(nElements)
        nnz = ctypes.c_int(nElements)
        descrA = ctypes.c_void_p()
        reorder = ctypes.c_int(0)
        tol = ctypes.c_double(1e-10)
        singularity = ctypes.c_int(99)


        # Create cusparse handle
        _cusp_handle = ctypes.c_void_p()
        assert 0 == _libcusparse.cusparseCreate(ctypes.byref(_cusp_handle))

        cusp_handle = _cusp_handle.value

        # Create MatDescriptor
        assert 0 == _libcusparse.cusparseCreateMatDescr(ctypes.byref(descrA))

        # Create cusolver handle
        _cuso_handle = ctypes.c_void_p()
        assert 0 == _libcusolver.cusolverSpCreate(ctypes.byref(_cuso_handle))

        cuso_handle = _cuso_handle.value

        ### Call GPU Solver
        tic=timeit.default_timer()

        _libcusolver.cusolverSpDcsrlsvqr(cuso_handle,
                                        m,
                                        nnz,
                                        descrA,
                                        int(dcsrVal.gpudata),
                                        int(dcsrIndPtr.gpudata),
                                        int(dcsrColInd.gpudata),
                                        int(db.gpudata),
                                        tol,
                                        reorder,
                                        int(dx.gpudata),
                                        ctypes.byref(singularity))
        
        toc=timeit.default_timer()
        et_gpu += toc-tic

        # Copy Results from Device -> Host
        vals_gpu = dx.get() #cuda.memcpy_dtoh(vals_gpu, dx)

        # Destroy handles
        assert 0 == _libcusolver.cusolverSpDestroy(cuso_handle)
        assert 0 == _libcusparse.cusparseDestroy(cusp_handle)

        ### Call CPU Solver:
        tic = timeit.default_timer()

        vals_cpu = sparse.linalg.spsolve(Acsr, b)

        toc = timeit.default_timer()
        et_cpu += toc-tic

        # Check matching results (sum of absolute differences must be tiny)
        assert 1e-12 > numpy.sum(numpy.abs(vals_gpu - vals_cpu))

    resultsNNZ.append(numpy.log10(nElements))
    resultsGPU.append(et_gpu/nRuns)
    resultsCPU.append(et_cpu/nRuns)
    resultsRatio.append(et_gpu/et_cpu)

# Show Final Results
os.system('cls')
print(tabulate([resultsNNZ, resultsGPU, resultsCPU, resultsRatio]))
    • Console output:
------------------  ------------  ------------  ---------  --------
NNZ (log10)          1             2            5          6
GPU (s)              0.000669252   0.000793483  0.0265892  0.230813
CPU (s)              4.0802e-05    4.6252e-05   0.0135402  0.172922
Slowdown (GPU/CPU)  16.4024       17.1556       1.96372    1.33478
------------------  ------------  ------------  ---------  --------
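
Two things might skew these numbers: the handles are re-created on every iteration (wasted work, even if it falls outside tic/toc), and a host timer measures the ctypes call overhead along with the solve itself. Below is a minimal sketch (not a definitive fix) that creates the handles once and times the call with CUDA events; it reuses the ctypes wrappers defined above, and timed_solve() is a hypothetical helper name, not a cuSOLVER API.

# Minimal sketch: reuse one handle pair and time on the GPU timeline.
# Assumes the ctypes wrappers (_libcusparse, _libcusolver) defined above.
import ctypes
import pycuda.driver as cuda
import pycuda.autoinit

start_evt = cuda.Event()
stop_evt = cuda.Event()

# Create the handles ONCE, outside the benchmark loop.
_cusp_handle = ctypes.c_void_p()
assert 0 == _libcusparse.cusparseCreate(ctypes.byref(_cusp_handle))
_cuso_handle = ctypes.c_void_p()
assert 0 == _libcusolver.cusolverSpCreate(ctypes.byref(_cuso_handle))

def timed_solve(m, nnz, descrA, dVal, dIndPtr, dColInd, db, dx, tol, reorder, singularity):
    """Run cusolverSpDcsrlsvqr once and return its time in seconds."""
    start_evt.record()
    status = _libcusolver.cusolverSpDcsrlsvqr(
        _cuso_handle.value, m, nnz, descrA,
        int(dVal.gpudata), int(dIndPtr.gpudata), int(dColInd.gpudata),
        int(db.gpudata), tol, reorder, int(dx.gpudata),
        ctypes.byref(singularity))
    assert status == 0
    stop_evt.record()
    stop_evt.synchronize()                         # wait for the solve to finish
    return start_evt.time_till(stop_evt) * 1e-3    # time_till() returns ms

# Inside the benchmark loop:
#   et_gpu += timed_solve(m, nnz, descrA, dcsrVal, dcsrIndPtr,
#                         dcsrColInd, db, dx, tol, reorder, singularity)

Even so, the benchmark matrix is purely diagonal, so spsolve (SuperLU) has essentially no factorization work to do, while csrlsvqr always performs a full sparse QR; the WDDM driver mode reported by deviceQuery also adds per-launch overhead. A matrix from an actual problem would probably be a fairer comparison.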

Example 2: cusolverSpDcsreigvsi() vs scipy.sparse.linalg.eigs()

import numpy as np
import scipy
from scipy import sparse
import time
from tabulate import tabulate
import os

# sigma shift for the shift-invert eigensolver
lm = 1.5                  # wavelength
ko2 = (2*np.pi/lm)**2     # squared wavenumber, (2*pi/lm)^2
sgm = 4*ko2               # initial guess for the shift

results=[]
Ntimes=10

for jj in ['040']:
    print('Computing...')
    # Read Matrix
    fileName=f"Kmat_d{jj}.txt"
    fullPath="./mat/"+fileName

    t_cpu = time.process_time()
    sp_data=np.loadtxt(fullPath)
    n_lines=np.shape(sp_data)[0]
    m=int(sp_data[0][0])
    n=int(sp_data[0][1])
    et1 = (time.process_time() - t_cpu)

    t_cpu = time.process_time()
    mat=sparse.coo_matrix((sp_data[1:,2].T,sp_data[1:,0:2].T),shape=(m,n))
    et2 = (time.process_time() - t_cpu)

    # CPU
    t_cpu = time.process_time()
    for ii in np.arange(Ntimes):
        vals,_ =scipy.sparse.linalg.eigs(mat, k=3, sigma=sgm)
    et3 = (time.process_time() - t_cpu)
    neff=vals
    
    results.append([jj, et1, et2, et3, neff])
    
    os.system('cls')
    print(tabulate(results))

    • Console output:
matrix  load (s)  build (s)  eigs x10 (s)  eigenvalues
------  --------  ---------  ------------  -------------------------------------------------
040     0.046875  0          1.26562       [56.13836141+0.j 55.3195665 +0.j 41.94037716+0.j]
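
To feed the same matrix to the C++ code below, I convert the text file into MatrixMarket format so loadMMSparseMatrix() can read it. A minimal sketch, assuming the file stores 0-based (row, col, value) triplets as the COO construction above implies (file names are illustrative):

# Minimal sketch: export the matrix to .mtx for the C++ example below.
import numpy as np
from scipy import sparse, io

sp_data = np.loadtxt("./mat/Kmat_d040.txt")       # same layout as above
m, n = int(sp_data[0][0]), int(sp_data[0][1])
rows = sp_data[1:, 0].astype(np.int64)            # assumed 0-based indices
cols = sp_data[1:, 1].astype(np.int64)
mat = sparse.coo_matrix((sp_data[1:, 2], (rows, cols)), shape=(m, n))
io.mmwrite("Kmat_d040.mtx", mat)                  # mmwrite emits 1-based indices

mmwrite writes 1-based indices, which matches the base=1 that the C++ run reports.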
  • C++ GPU Code:
#include <assert.h>
#include <ctype.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <chrono>
#include <iostream>

#include "cusolverSp.h"
#include "cusolverSp_LOWLEVEL_PREVIEW.h"
#include "helper_cuda.h"
#include "helper_cusolver.h"

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>


// testOpts is not defined in the helper headers used here, so a minimal
// definition is included (the full CUDA-samples version has more fields).
struct testOpts {
	char* sparse_mat_filename;  // path to the MatrixMarket input file
};

template <typename T_ELEM>
int loadMMSparseMatrix(char* filename, char elem_type, bool csrFormat, int* m,
	int* n, int* nnz, T_ELEM** aVal, int** aRowInd,
	int** aColInd, int extendSymMatrix);

void UsageSP(void) {
	printf("<options>\n");
	printf("-h          : display this help\n");
	printf("-file=<filename> : filename containing a matrix in MM format\n");
	printf("-device=<device_id> : <device_id> if want to run on specific GPU\n");

	exit(0);
}

void parseCommandLineArguments(int argc, char* argv[], struct testOpts& opts) {
	memset(&opts, 0, sizeof(opts));

	if (checkCmdLineFlag(argc, (const char**)argv, "-h")) {
		UsageSP();
	}

	if (checkCmdLineFlag(argc, (const char**)argv, "file")) {
		char* fileName = 0;
		getCmdLineArgumentString(argc, (const char**)argv, "file", &fileName);

		if (fileName) {
			opts.sparse_mat_filename = fileName;
		}
		else {
			printf("\nIncorrect filename passed to -file \n ");
			UsageSP();
		}
	}
}

int main(int argc, char* argv[]) {
	struct testOpts opts;
	cusolverSpHandle_t cusolverSpH = NULL;
	cusparseMatDescr_t descrA = NULL;

	int rowsA = 0;  // number of rows of A
	int colsA = 0;  // number of columns of A
	int nnzA = 0;   // number of nonzeros of A
	int baseA = 0;  // base index in CSR format
	double mu0 = 70.0;       // <double> mu eigenvalue initial guess
	double tol = 1e-3; // tol for cusolver<>csreigvsi
	int maxiter = 1000; // max iter for cusolver<>csreigvsi

	// CSR(A) from I/O
	// CPU
	int* h_csrRowPtrA = NULL;  // <int> n+1
	int* h_csrColIndA = NULL;  // <int> nnzA
	double* h_csrValA = NULL;  // <double> nnzA

	// GPU
	int* d_csrRowPtrA = NULL;  // <int> n+1
	int* d_csrColIndA = NULL;  // <int> nnzA
	double* d_csrValA = NULL;  // <double> nnzA

	parseCommandLineArguments(argc, argv, opts);

	findCudaDevice(argc, (const char**)argv);

	if (opts.sparse_mat_filename == NULL) {
		opts.sparse_mat_filename = sdkFindFilePath("Kmat_d040.mtx", argv[0]);
		if (opts.sparse_mat_filename != NULL)
			printf("Using default input file [%s]\n", opts.sparse_mat_filename);
		else
			printf("Could not find Kmat_d040.mtx\n");
	}
	else {
		printf("Using input file [%s]\n", opts.sparse_mat_filename);
	}

	printf("Step 1: read matrix market format\n");

	if (opts.sparse_mat_filename) {
		if (loadMMSparseMatrix<double>(opts.sparse_mat_filename, 'd', true, &rowsA,
			&colsA, &nnzA, &h_csrValA, &h_csrRowPtrA,
			&h_csrColIndA, false)) {
			return 1;
		}
		baseA = h_csrRowPtrA[0];  // baseA = {0,1}
	}
	else {
		fprintf(stderr, "Error: input matrix is not provided\n");
		return 1;
	}

	if (rowsA != colsA) {
		fprintf(stderr, "Error: only support square matrix\n");
		return 1;
	}

	printf("sparse matrix A is %d x %d with %d nonzeros, base=%d\n", rowsA, colsA,
		nnzA, baseA);


	double* d_mu = NULL;       // <double> mu eigenvalue initial guess
	thrust::device_vector<double> d_x0(rowsA, 1.);       // <double> mu eigenvector initial guess
	thrust::device_vector<double> d_x(rowsA, 0.);       // <double> mu eigenvector
	double* h_mu = NULL;       // <double> mu eigenvalue initial guess
	thrust::host_vector<double> h_x0(rowsA, 1.);       // <double> mu eigenvector initial guess
	thrust::host_vector<double> h_x(rowsA, 0.);       // <double> mu eigenvector

	// Create cusolver Handle and matrix Descriptor
	checkCudaErrors(cusolverSpCreate(&cusolverSpH));
	checkCudaErrors(cusparseCreateMatDescr(&descrA));
	checkCudaErrors(cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL));

	// Set Matrix base 0 or 1
	if (baseA) {
		checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE));
	}
	else {
		checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO));
	}


	printf("Step 2: Solve Eigenmodes (GPU):");

	// Allocate Memory in the GPU
	checkCudaErrors(
		cudaMalloc((void**)&d_csrRowPtrA, sizeof(int) * (rowsA + 1)));
	checkCudaErrors(cudaMalloc((void**)&d_csrColIndA, sizeof(int) * nnzA));
	checkCudaErrors(cudaMalloc((void**)&d_csrValA, sizeof(double) * nnzA));
	checkCudaErrors(cudaMalloc(&d_mu, sizeof(double)));
	checkCudaErrors(cudaMemset(d_mu, 0, sizeof(double)));

	checkCudaErrors(cudaMemcpy(d_csrRowPtrA, h_csrRowPtrA,
		sizeof(int) * (rowsA + 1),
		cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(d_csrColIndA, h_csrColIndA, sizeof(int) * nnzA,
		cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(d_csrValA, h_csrValA, sizeof(double) * nnzA,
		cudaMemcpyHostToDevice));
	auto start = std::chrono::high_resolution_clock::now();
	checkCudaErrors(cusolverSpDcsreigvsi(
		cusolverSpH,
		rowsA,
		nnzA,
		descrA,
		d_csrValA,
		d_csrRowPtrA,
		d_csrColIndA,
		mu0,
		thrust::raw_pointer_cast(d_x0.data()),
		maxiter,
		tol,
		d_mu,
		thrust::raw_pointer_cast(d_x.data())
	));
	checkCudaErrors(cudaDeviceSynchronize());  // ensure the solver has finished before stopping the timer
	auto end = std::chrono::high_resolution_clock::now();
	std::chrono::duration<double> duration = end - start;


	double result;
	checkCudaErrors(cudaMemcpy(&result, d_mu, sizeof(double),
		cudaMemcpyDeviceToHost));

	std::cout << result << std::endl;
	std::cout << "Computation time (GPU): " << duration.count() << "s" << std::endl;

	// Release device memory and library handles
	checkCudaErrors(cudaFree(d_csrRowPtrA));
	checkCudaErrors(cudaFree(d_csrColIndA));
	checkCudaErrors(cudaFree(d_csrValA));
	checkCudaErrors(cudaFree(d_mu));
	checkCudaErrors(cusparseDestroyMatDescr(descrA));
	checkCudaErrors(cusolverSpDestroy(cusolverSpH));

	return 0;
}
    • Console output (C++):
GPU Device 0: "Ada" with compute capability 8.9

Using default input file [./Kmat_d040.mtx]
Step 1: read matrix market format
sparse matrix A is 17790 x 17790 with 189294 nonzeros, base=1
Step 2: Solve Eigenmodes (GPU):56.1384
Computation time (GPU): 50.1298s
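
For an apples-to-apples CPU reference, the closest SciPy analogue to this single-eigenpair cusolverSpDcsreigvsi() call would be eigs() with k=1 in shift-invert mode around the same guess. A sketch, assuming the matrix mat from Example 2 is already loaded; sigma matches the C++ initial guess mu0 = 70.0:

# Minimal sketch: one eigenpair near the shift, like the GPU call above.
import time
from scipy.sparse import linalg as spla

tic = time.perf_counter()
vals, vecs = spla.eigs(mat.tocsc(), k=1, sigma=70.0)  # shift-invert mode
toc = time.perf_counter()
print(f"eig ~ {vals[0].real:.4f}, CPU time: {toc - tic:.3f}s")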

I'm wondering if there is anything wrong with the way I'm using the cuSolver routines. I also noticed that the profiler shows a small grid size; however, I'm not launching the kernels myself, as this is done internally by the routines.

Thanks for your attention!
