Help Improving Performance using cuSolver/cuSparse Routines

I wanted to report, and ask for help with, CUDA cuSolver/cuSparse GPU routines that are running slower than their CPU counterparts (Python → SciPy sparse solvers), and of course to find out whether anything is being done incorrectly so that performance can be improved.

  • CPU Model:
>wmic cpu get caption, deviceid, name, numberofcores, maxclockspeed, status
Caption                              DeviceID  MaxClockSpeed  Name                                     NumberOfCores  Status
AMD64 Family 25 Model 97 Stepping 2  CPU0      3601           AMD Ryzen 7 7745HX with Radeon Graphics  8              OK
  • GPU Model:
Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
  CUDA Driver Version / Runtime Version          12.3 / 12.3
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 8188 MBytes (8585216000 bytes)
  (036) Multiprocessors, (128) CUDA Cores/MP:    4608 CUDA Cores
  GPU Max Clock rate:                            1605 MHz (1.61 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 33554432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 1
Result = PASS

Example 1: cusolverSpDcsrlsvqr() vs scipy.sparse.linalg.spsolve()
Adapted from: "Interfacing cuSOLVER-sparse using PyCUDA" (Stack Overflow)

### Interface cuSOLVER/cuSPARSE with PyCUDA
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from scipy import sparse
import ctypes
import timeit
from tabulate import tabulate
import os

### Wrap the cuSOLVER cusolverSpDcsrlsvqr() using ctypes

# cuSparse
_libcusparse = ctypes.cdll.LoadLibrary("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\cusparse64_12.dll")

class cusparseMatDescr_t(ctypes.Structure):
    _fields_ = [
        ('MatrixType', ctypes.c_int),
        ('FillMode', ctypes.c_int),
        ('DiagType', ctypes.c_int),
        ('IndexBase', ctypes.c_int)
        ]
_libcusparse.cusparseCreate.restype = int
_libcusparse.cusparseCreate.argtypes = [ctypes.c_void_p]

_libcusparse.cusparseDestroy.restype = int
_libcusparse.cusparseDestroy.argtypes = [ctypes.c_void_p]

_libcusparse.cusparseCreateMatDescr.restype = int
_libcusparse.cusparseCreateMatDescr.argtypes = [ctypes.c_void_p]


# cuSOLVER
_libcusolver = ctypes.cdll.LoadLibrary("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\cusolver64_11.dll")

_libcusolver.cusolverSpCreate.restype = int
_libcusolver.cusolverSpCreate.argtypes = [ctypes.c_void_p]

_libcusolver.cusolverSpDestroy.restype = int
_libcusolver.cusolverSpDestroy.argtypes = [ctypes.c_void_p]

_libcusolver.cusolverSpDcsrlsvqr.restype = int
_libcusolver.cusolverSpDcsrlsvqr.argtypes= [ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_int,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_double,
                                            ctypes.c_int,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p]


os.system('cls')

resultsNNZ=["NNZ (log10)"]
resultsGPU=["GPU (s)"]
resultsCPU=["CPU (s)"]
resultsRatio=["Slowdown (GPU/CPU)"]

nRuns = 100

for nElements in [10, 100, 100_000, 1_000_000]:
    et_cpu=0
    et_gpu=0
    os.system('cls')
    print(tabulate([resultsNNZ, resultsGPU, resultsCPU, resultsRatio]))

    for ii in range(0,nRuns):
        #### Prepare the matrix and parameters
        val = numpy.arange(1,nElements+1,dtype=numpy.float64)
        col = numpy.arange(0,nElements,dtype=numpy.int32)
        row = numpy.arange(0,nElements,dtype=numpy.int32)
        A = sparse.coo_matrix((val,(row,col)),shape=(nElements,nElements))
        Acsr = sparse.csr_matrix(A)
        b = numpy.ones(nElements)
        x = numpy.empty(nElements)
        vals_gpu = numpy.empty_like(x)

        # Copy Data Host -> Device
        dcsrVal = gpuarray.to_gpu(Acsr.data)
        dcsrColInd = gpuarray.to_gpu(Acsr.indices)
        dcsrIndPtr = gpuarray.to_gpu(Acsr.indptr)
        dx = gpuarray.to_gpu(x)
        db = gpuarray.to_gpu(b)
        m = ctypes.c_int(nElements)
        nnz = ctypes.c_int(nElements)
        descrA = ctypes.c_void_p()
        reorder = ctypes.c_int(0)
        tol = ctypes.c_double(1e-10)
        singularity = ctypes.c_int(99)


        # Create cusparse handle
        _cusp_handle = ctypes.c_void_p()
        assert 0 == _libcusparse.cusparseCreate(ctypes.byref(_cusp_handle))

        cusp_handle = _cusp_handle.value

        # Create MatDescriptor
        assert 0 == _libcusparse.cusparseCreateMatDescr(ctypes.byref(descrA))

        # Create cusolver handle
        _cuso_handle = ctypes.c_void_p()
        assert 0 == _libcusolver.cusolverSpCreate(ctypes.byref(_cuso_handle))

        cuso_handle = _cuso_handle.value

        ### Call GPU Solver
        tic=timeit.default_timer()

        _libcusolver.cusolverSpDcsrlsvqr(cuso_handle,
                                        m,
                                        nnz,
                                        descrA,
                                        int(dcsrVal.gpudata),
                                        int(dcsrIndPtr.gpudata),
                                        int(dcsrColInd.gpudata),
                                        int(db.gpudata),
                                        tol,
                                        reorder,
                                        int(dx.gpudata),
                                        ctypes.byref(singularity))
        
        toc=timeit.default_timer()
        et_gpu += toc-tic

        # Copy Results from Device -> Host
        vals_gpu = dx.get() #cuda.memcpy_dtoh(vals_gpu, dx)

        # Destroy handles
        assert 0 == _libcusolver.cusolverSpDestroy(cuso_handle)
        assert 0 == _libcusparse.cusparseDestroy(cusp_handle)

        ### Call CPU Solver:
        tic = timeit.default_timer()

        vals_cpu = sparse.linalg.spsolve(Acsr, b)

        toc = timeit.default_timer()
        et_cpu += toc-tic

        # Check matching results (sum of absolute differences must be tiny)
        assert 1e-12 > numpy.sum(numpy.abs(vals_gpu - vals_cpu))

    resultsNNZ.append(numpy.log10(nElements))
    resultsGPU.append(et_gpu/nRuns)
    resultsCPU.append(et_cpu/nRuns)
    resultsRatio.append(et_gpu/et_cpu)

# Show Final Results
os.system('cls')
print(tabulate([resultsNNZ, resultsGPU, resultsCPU, resultsRatio]))
    • Console output:
------------------  ------------  ------------  ---------  --------
NNZ (log10)          1             2            5          6
GPU (s)              0.000669252   0.000793483  0.0265892  0.230813
CPU (s)              4.0802e-05    4.6252e-05   0.0135402  0.172922
Slowdown (GPU/CPU)  16.4024       17.1556       1.96372    1.33478
------------------  ------------  ------------  ---------  --------
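
Two things might skew these numbers: the handles are re-created on every iteration (wasted work, even if it falls outside tic/toc), and a host timer measures the ctypes call overhead along with the solve itself. Below is a minimal sketch (not a definitive fix) that creates the handles once and times the call with CUDA events; it reuses the ctypes wrappers defined above, and timed_solve() is a hypothetical helper name, not a cuSOLVER API.

# Minimal sketch: reuse one handle pair and time on the GPU timeline.
# Assumes the ctypes wrappers (_libcusparse, _libcusolver) defined above.
import ctypes
import pycuda.driver as cuda
import pycuda.autoinit

start_evt = cuda.Event()
stop_evt = cuda.Event()

# Create the handles ONCE, outside the benchmark loop.
_cusp_handle = ctypes.c_void_p()
assert 0 == _libcusparse.cusparseCreate(ctypes.byref(_cusp_handle))
_cuso_handle = ctypes.c_void_p()
assert 0 == _libcusolver.cusolverSpCreate(ctypes.byref(_cuso_handle))

def timed_solve(m, nnz, descrA, dVal, dIndPtr, dColInd, db, dx, tol, reorder, singularity):
    """Run cusolverSpDcsrlsvqr once and return its time in seconds."""
    start_evt.record()
    status = _libcusolver.cusolverSpDcsrlsvqr(
        _cuso_handle.value, m, nnz, descrA,
        int(dVal.gpudata), int(dIndPtr.gpudata), int(dColInd.gpudata),
        int(db.gpudata), tol, reorder, int(dx.gpudata),
        ctypes.byref(singularity))
    assert status == 0
    stop_evt.record()
    stop_evt.synchronize()                         # wait for the solve to finish
    return start_evt.time_till(stop_evt) * 1e-3    # time_till() returns ms

# Inside the benchmark loop:
#   et_gpu += timed_solve(m, nnz, descrA, dcsrVal, dcsrIndPtr,
#                         dcsrColInd, db, dx, tol, reorder, singularity)

Even so, the benchmark matrix is purely diagonal, so spsolve (SuperLU) has essentially no factorization work to do, while csrlsvqr always performs a full sparse QR; the WDDM driver mode reported by deviceQuery also adds per-launch overhead. A matrix from an actual problem would probably be a fairer comparison.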

Example 2: cusolverSpDcsreigvsi() vs scipy.sparse.linalg.eigs()

import numpy as np
import scipy
from scipy import sparse
import time
from tabulate import tabulate
import os

# sigma shift for the shift-invert eigensolver
lm = 1.5                  # wavelength
ko2 = (2*np.pi/lm)**2     # squared wavenumber, (2*pi/lm)^2
sgm = 4*ko2               # initial guess for the shift

results=[]
Ntimes=10

for jj in ['040']:
    print('Computing...')
    # Read Matrix
    fileName=f"Kmat_d{jj}.txt"
    fullPath="./mat/"+fileName

    t_cpu = time.process_time()
    sp_data=np.loadtxt(fullPath)
    n_lines=np.shape(sp_data)[0]
    m=int(sp_data[0][0])
    n=int(sp_data[0][1])
    et1 = (time.process_time() - t_cpu)

    t_cpu = time.process_time()
    mat=sparse.coo_matrix((sp_data[1:,2].T,sp_data[1:,0:2].T),shape=(m,n))
    et2 = (time.process_time() - t_cpu)

    # CPU
    t_cpu = time.process_time()
    for ii in np.arange(Ntimes):
        vals,_ =scipy.sparse.linalg.eigs(mat, k=3, sigma=sgm)
    et3 = (time.process_time() - t_cpu)
    neff=vals
    
    results.append([jj, et1, et2, et3, neff])
    
    os.system('cls')
    print(tabulate(results))

    • Console output:
matrix  load (s)  build (s)  eigs x10 (s)  eigenvalues
------  --------  ---------  ------------  -------------------------------------------------
040     0.046875  0          1.26562       [56.13836141+0.j 55.3195665 +0.j 41.94037716+0.j]
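
To feed the same matrix to the C++ code below, I convert the text file into MatrixMarket format so loadMMSparseMatrix() can read it. A minimal sketch, assuming the file stores 0-based (row, col, value) triplets as the COO construction above implies (file names are illustrative):

# Minimal sketch: export the matrix to .mtx for the C++ example below.
import numpy as np
from scipy import sparse, io

sp_data = np.loadtxt("./mat/Kmat_d040.txt")       # same layout as above
m, n = int(sp_data[0][0]), int(sp_data[0][1])
rows = sp_data[1:, 0].astype(np.int64)            # assumed 0-based indices
cols = sp_data[1:, 1].astype(np.int64)
mat = sparse.coo_matrix((sp_data[1:, 2], (rows, cols)), shape=(m, n))
io.mmwrite("Kmat_d040.mtx", mat)                  # mmwrite emits 1-based indices

mmwrite writes 1-based indices, which matches the base=1 that the C++ run reports.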
  • C++ GPU Code:
#include <assert.h>
#include <ctype.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <chrono>
#include <iostream>

#include "cusolverSp.h"
#include "cusolverSp_LOWLEVEL_PREVIEW.h"
#include "helper_cuda.h"
#include "helper_cusolver.h"

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>


// testOpts is not defined in the helper headers used here, so a minimal
// definition is included (the full CUDA-samples version has more fields).
struct testOpts {
	char* sparse_mat_filename;  // path to the MatrixMarket input file
};

template <typename T_ELEM>
int loadMMSparseMatrix(char* filename, char elem_type, bool csrFormat, int* m,
	int* n, int* nnz, T_ELEM** aVal, int** aRowInd,
	int** aColInd, int extendSymMatrix);

void UsageSP(void) {
	printf("<options>\n");
	printf("-h          : display this help\n");
	printf("-file=<filename> : filename containing a matrix in MM format\n");
	printf("-device=<device_id> : <device_id> if want to run on specific GPU\n");

	exit(0);
}

void parseCommandLineArguments(int argc, char* argv[], struct testOpts& opts) {
	memset(&opts, 0, sizeof(opts));

	if (checkCmdLineFlag(argc, (const char**)argv, "-h")) {
		UsageSP();
	}

	if (checkCmdLineFlag(argc, (const char**)argv, "file")) {
		char* fileName = 0;
		getCmdLineArgumentString(argc, (const char**)argv, "file", &fileName);

		if (fileName) {
			opts.sparse_mat_filename = fileName;
		}
		else {
			printf("\nIncorrect filename passed to -file \n ");
			UsageSP();
		}
	}
}

int main(int argc, char* argv[]) {
	struct testOpts opts;
	cusolverSpHandle_t cusolverSpH = NULL;
	cusparseMatDescr_t descrA = NULL;

	int rowsA = 0;  // number of rows of A
	int colsA = 0;  // number of columns of A
	int nnzA = 0;   // number of nonzeros of A
	int baseA = 0;  // base index in CSR format
	double mu0 = 70.0;       // <double> mu eigenvalue initial guess
	double tol = 1e-3; // tol for cusolver<>csreigvsi
	int maxiter = 1000; // max iter for cusolver<>csreigvsi

	// CSR(A) from I/O
	// CPU
	int* h_csrRowPtrA = NULL;  // <int> n+1
	int* h_csrColIndA = NULL;  // <int> nnzA
	double* h_csrValA = NULL;  // <double> nnzA

	// GPU
	int* d_csrRowPtrA = NULL;  // <int> n+1
	int* d_csrColIndA = NULL;  // <int> nnzA
	double* d_csrValA = NULL;  // <double> nnzA

	parseCommandLineArguments(argc, argv, opts);

	findCudaDevice(argc, (const char**)argv);

	if (opts.sparse_mat_filename == NULL) {
		opts.sparse_mat_filename = sdkFindFilePath("Kmat_d040.mtx", argv[0]);
		if (opts.sparse_mat_filename != NULL)
			printf("Using default input file [%s]\n", opts.sparse_mat_filename);
		else
			printf("Could not find Kmat_d040.mtx\n");
	}
	else {
		printf("Using input file [%s]\n", opts.sparse_mat_filename);
	}

	printf("Step 1: read matrix market format\n");

	if (opts.sparse_mat_filename) {
		if (loadMMSparseMatrix<double>(opts.sparse_mat_filename, 'd', true, &rowsA,
			&colsA, &nnzA, &h_csrValA, &h_csrRowPtrA,
			&h_csrColIndA, false)) {
			return 1;
		}
		baseA = h_csrRowPtrA[0];  // baseA = {0,1}
	}
	else {
		fprintf(stderr, "Error: input matrix is not provided\n");
		return 1;
	}

	if (rowsA != colsA) {
		fprintf(stderr, "Error: only support square matrix\n");
		return 1;
	}

	printf("sparse matrix A is %d x %d with %d nonzeros, base=%d\n", rowsA, colsA,
		nnzA, baseA);


	double* d_mu = NULL;       // <double> mu eigenvalue initial guess
	thrust::device_vector<double> d_x0(rowsA, 1.);       // <double> mu eigenvector initial guess
	thrust::device_vector<double> d_x(rowsA, 0.);       // <double> mu eigenvector
	double* h_mu = NULL;       // <double> mu eigenvalue initial guess
	thrust::host_vector<double> h_x0(rowsA, 1.);       // <double> mu eigenvector initial guess
	thrust::host_vector<double> h_x(rowsA, 0.);       // <double> mu eigenvector

	// Create cusolver Handle and matrix Descriptor
	checkCudaErrors(cusolverSpCreate(&cusolverSpH));
	checkCudaErrors(cusparseCreateMatDescr(&descrA));
	checkCudaErrors(cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL));

	// Set Matrix base 0 or 1
	if (baseA) {
		checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE));
	}
	else {
		checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO));
	}


	printf("Step 2: Solve Eigenmodes (GPU):");

	// Allocate Memory in the GPU
	checkCudaErrors(
		cudaMalloc((void**)&d_csrRowPtrA, sizeof(int) * (rowsA + 1)));
	checkCudaErrors(cudaMalloc((void**)&d_csrColIndA, sizeof(int) * nnzA));
	checkCudaErrors(cudaMalloc((void**)&d_csrValA, sizeof(double) * nnzA));
	checkCudaErrors(cudaMalloc(&d_mu, sizeof(double)));
	checkCudaErrors(cudaMemset(d_mu, 0, sizeof(double)));

	checkCudaErrors(cudaMemcpy(d_csrRowPtrA, h_csrRowPtrA,
		sizeof(int) * (rowsA + 1),
		cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(d_csrColIndA, h_csrColIndA, sizeof(int) * nnzA,
		cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(d_csrValA, h_csrValA, sizeof(double) * nnzA,
		cudaMemcpyHostToDevice));
	auto start = std::chrono::high_resolution_clock::now();
	checkCudaErrors(cusolverSpDcsreigvsi(
		cusolverSpH,
		rowsA,
		nnzA,
		descrA,
		d_csrValA,
		d_csrRowPtrA,
		d_csrColIndA,
		mu0,
		thrust::raw_pointer_cast(d_x0.data()),
		maxiter,
		tol,
		d_mu,
		thrust::raw_pointer_cast(d_x.data())
	));
	checkCudaErrors(cudaDeviceSynchronize());  // ensure the solver has finished before stopping the timer
	auto end = std::chrono::high_resolution_clock::now();
	std::chrono::duration<double> duration = end - start;


	double result;
	checkCudaErrors(cudaMemcpy(&result, d_mu, sizeof(double),
		cudaMemcpyDeviceToHost));

	std::cout << result << std::endl;
	std::cout << "Computation time (GPU): " << duration.count() << "s" << std::endl;

	// Release device memory and library handles
	checkCudaErrors(cudaFree(d_csrRowPtrA));
	checkCudaErrors(cudaFree(d_csrColIndA));
	checkCudaErrors(cudaFree(d_csrValA));
	checkCudaErrors(cudaFree(d_mu));
	checkCudaErrors(cusparseDestroyMatDescr(descrA));
	checkCudaErrors(cusolverSpDestroy(cusolverSpH));

	return 0;
}
    • Console output (C++):
GPU Device 0: "Ada" with compute capability 8.9

Using default input file [./Kmat_d040.mtx]
Step 1: read matrix market format
sparse matrix A is 17790 x 17790 with 189294 nonzeros, base=1
Step 2: Solve Eigenmodes (GPU):56.1384
Computation time (GPU): 50.1298s
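
For an apples-to-apples CPU reference, the closest SciPy analogue to this single-eigenpair cusolverSpDcsreigvsi() call would be eigs() with k=1 in shift-invert mode around the same guess. A sketch, assuming the matrix mat from Example 2 is already loaded; sigma matches the C++ initial guess mu0 = 70.0:

# Minimal sketch: one eigenpair near the shift, like the GPU call above.
import time
from scipy.sparse import linalg as spla

tic = time.perf_counter()
vals, vecs = spla.eigs(mat.tocsc(), k=1, sigma=70.0)  # shift-invert mode
toc = time.perf_counter()
print(f"eig ~ {vals[0].real:.4f}, CPU time: {toc - tic:.3f}s")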

I'm wondering if there is anything wrong with the way I'm using the cuSolver routines. I also noticed that the profiler shows a small grid size; however, I'm not launching the kernels myself, as this is done internally by the routines.

Thanks for your attention!
