I wanted to report and ask for help when using CUDA cuSolver/cuSparse GPU routines that are slower than CPU versions (Python ā Scipy Sparse Solvers). And, of course, ask for help if something is being done incorrectly in order to improve performance.
- CPU Model:
>wmic cpu get caption, deviceid, name, numberofcores, maxclockspeed, status
Caption DeviceID MaxClockSpeed Name NumberOfCores Status
AMD64 Family 25 Model 97 Stepping 2 CPU0 3601 AMD Ryzen 7 7745HX with Radeon Graphics 8 OK
- GPU Model:
Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
CUDA Driver Version / Runtime Version 12.3 / 12.3
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 8188 MBytes (8585216000 bytes)
(036) Multiprocessors, (128) CUDA Cores/MP: 4608 CUDA Cores
GPU Max Clock rate: 1605 MHz (1.61 GHz)
Memory Clock rate: 8001 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 33554432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 12.3, NumDevs = 1
Result = PASS
Example 1: cusolverSpDcsrlsvqr vs scipy.sparse.linalg.spsolve
Adapted from: python - Interfacing cuSOLVER-sparse using PyCUDA - Stack Overflow
-
NVIDIA Nsight Compute CLI:
nRuns=1
out_profiler.zip (919.5 KB) -
Python Code:
### Interface cuSOLVER/cuSPARSE with PyCUDA
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from scipy import sparse
import ctypes
import timeit
from tabulate import tabulate
import os
### Wrap the cuSOLVER cusolverSpDcsrlsvqr() using ctypes
# cuSparse
_libcusparse = ctypes.cdll.LoadLibrary("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\cusparse64_12.dll")
class cusparseMatDescr_t(ctypes.Structure):
_fields_ = [
('MatrixType', ctypes.c_int),
('FillMode', ctypes.c_int),
('DiagType', ctypes.c_int),
('IndexBase', ctypes.c_int)
]
_libcusparse.cusparseCreate.restype = int
_libcusparse.cusparseCreate.argtypes = [ctypes.c_void_p]
_libcusparse.cusparseDestroy.restype = int
_libcusparse.cusparseDestroy.argtypes = [ctypes.c_void_p]
_libcusparse.cusparseCreateMatDescr.restype = int
_libcusparse.cusparseCreateMatDescr.argtypes = [ctypes.c_void_p]
# cuSOLVER
_libcusolver = ctypes.cdll.LoadLibrary("C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.3\\bin\\cusolver64_11.dll")
_libcusolver.cusolverSpCreate.restype = int
_libcusolver.cusolverSpCreate.argtypes = [ctypes.c_void_p]
_libcusolver.cusolverSpDestroy.restype = int
_libcusolver.cusolverSpDestroy.argtypes = [ctypes.c_void_p]
_libcusolver.cusolverSpDcsrlsvqr.restype = int
_libcusolver.cusolverSpDcsrlsvqr.argtypes= [ctypes.c_void_p,
ctypes.c_int,
ctypes.c_int,
ctypes.c_void_p,
ctypes.c_void_p,
ctypes.c_void_p,
ctypes.c_void_p,
ctypes.c_void_p,
ctypes.c_double,
ctypes.c_int,
ctypes.c_void_p,
ctypes.c_void_p]
os.system('cls')
resultsNNZ=["NNZ (log10)"]
resultsGPU=["GPU (s)"]
resultsCPU=["CPU (s)"]
resultsRatio=["SpeedUp (GPU/CPU)"]
nRuns = 100
for nElements in [10, 100, 100_000, 1_000_000]:
et_cpu=0
et_gpu=0
os.system('cls')
print(tabulate([resultsNNZ, resultsGPU, resultsCPU, resultsRatio]))
for ii in range(0,nRuns):
#### Prepare the matrix and parameters
val = numpy.arange(1,nElements+1,dtype=numpy.float64)
col = numpy.arange(0,nElements,dtype=numpy.int32)
row = numpy.arange(0,nElements,dtype=numpy.int32)
A = sparse.coo_matrix((val,(row,col)),shape=(nElements,nElements))
Acsr = sparse.csr_matrix(A)
b = numpy.ones(nElements)
x = numpy.empty(nElements)
vals_gpu = numpy.empty_like(x)
# Copy Data Host -> Device
dcsrVal = gpuarray.to_gpu(Acsr.data)
dcsrColInd = gpuarray.to_gpu(Acsr.indices)
dcsrIndPtr = gpuarray.to_gpu(Acsr.indptr)
dx = gpuarray.to_gpu(x)
db = gpuarray.to_gpu(b)
m = ctypes.c_int(nElements)
nnz = ctypes.c_int(nElements)
descrA = ctypes.c_void_p()
reorder = ctypes.c_int(0)
tol = ctypes.c_double(1e-10)
singularity = ctypes.c_int(99)
# Create cusparse handle
_cusp_handle = ctypes.c_void_p()
assert 0 == _libcusparse.cusparseCreate(ctypes.byref(_cusp_handle))
cusp_handle = _cusp_handle.value
# Create MatDescriptor
assert 0 == _libcusparse.cusparseCreateMatDescr(ctypes.byref(descrA))
# Create cusolver handle
_cuso_handle = ctypes.c_void_p()
assert 0 == _libcusolver.cusolverSpCreate(ctypes.byref(_cuso_handle))
cuso_handle = _cuso_handle.value
### Call GPU Solver
tic=timeit.default_timer()
_libcusolver.cusolverSpDcsrlsvqr(cuso_handle,
m,
nnz,
descrA,
int(dcsrVal.gpudata),
int(dcsrIndPtr.gpudata),
int(dcsrColInd.gpudata),
int(db.gpudata),
tol,
reorder,
int(dx.gpudata),
ctypes.byref(singularity))
toc=timeit.default_timer()
et_gpu += toc-tic
# Copy Results from Device -> Host
vals_gpu = dx.get() #cuda.memcpy_dtoh(vals_gpu, dx)
# Destroy handles
assert 0 == _libcusolver.cusolverSpDestroy(cuso_handle)
assert 0 == _libcusparse.cusparseDestroy(cusp_handle)
### Call CPU Solver:
tic = timeit.default_timer()
vals_cpu = sparse.linalg.spsolve(Acsr, b)
toc = timeit.default_timer()
et_cpu += toc-tic
# Check Matching results:
assert 1e-12 > 0.0-numpy.sum(numpy.abs(vals_gpu-vals_cpu))
resultsNNZ.append(numpy.log10(nElements))
resultsGPU.append(et_gpu/nRuns)
resultsCPU.append(et_cpu/nRuns)
resultsRatio.append(et_gpu/et_cpu)
# Show Final Results
os.system('cls')
print(tabulate([resultsNNZ, resultsGPU, resultsCPU, resultsRatio]))
-
- Prompt result:
----------------- ------------ ------------ --------- --------
NNZ (log10) 1 2 5 6
GPU (s) 0.000669252 0.000793483 0.0265892 0.230813
CPU (s) 4.0802e-05 4.6252e-05 0.0135402 0.172922
SpeedUp (GPU/CPU) 16.4024 17.1556 1.96372 1.33478
----------------- ------------ ------------ --------- --------
Example 2: cusolverSpcsreigvsi() vs scipy.sparse.linalg.eigs()
-
Input Data:
Kmat_d040.zip (669.6 KB)
Kmat_d080.zip (180.3 KB) -
Visual Studio 2022 Project Files:
cuSolverSp_csreigvsi_vs2022.zip (688.9 KB) -
GPU C++ NVIDIA Nsigth Compute CLI:
It was taking too long to profile with Kmat_d040.mtx, so I run with Kmat_d080.mtx ā
out_profiler.zip (4.1 MB) -
CPU - Python Code
import numpy as np
import scipy
from scipy.linalg import sqrtm
from scipy import sparse
import time
from tabulate import tabulate
import os
# sigma shift
lm=1.5
ko2=(2*np.pi/lm)**2
sgm=4*ko2 # Initial guess
results=[]
Ntimes=10
for jj in ['040']:
print('Computing...')
# Read Matrix
fileName=f"Kmat_d{jj}.txt"
fullPath="./mat/"+fileName
t_cpu = time.process_time()
sp_data=np.loadtxt(fullPath)
n_lines=np.shape(sp_data)[0]
m=int(sp_data[0][0])
n=int(sp_data[0][1])
et1 = (time.process_time() - t_cpu)
t_cpu = time.process_time()
mat=sparse.coo_matrix((sp_data[1:,2].T,sp_data[1:,0:2].T),shape=(m,n))
et2 = (time.process_time() - t_cpu)
# CPU
t_cpu = time.process_time()
for ii in np.arange(Ntimes):
vals,_ =scipy.sparse.linalg.eigs(mat, k=3, sigma=sgm)
et3 = (time.process_time() - t_cpu)
neff=vals
results.append([jj, et1, et2, et3, neff])
os.system('cls')
print(tabulate(results))
-
- Prompt Results
--- -------- - ------- -------------------------------------------------
040 0.046875 0 **1.26562** [56.13836141+0.j 55.3195665 +0.j 41.94037716+0.j]
--- -------- - ------- -------------------------------------------------
- C++ GPU Code:
#include <assert.h>
#include <ctype.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <vector>
#include <chrono>
#include "cusolverSp.h"
#include "cusolverSp_LOWLEVEL_PREVIEW.h"
#include "helper_cuda.h"
#include "helper_cusolver.h"
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
template <typename T_ELEM>
int loadMMSparseMatrix(char* filename, char elem_type, bool csrFormat, int* m,
int* n, int* nnz, T_ELEM** aVal, int** aRowInd,
int** aColInd, int extendSymMatrix);
void UsageSP(void) {
printf("<options>\n");
printf("-h : display this help\n");
printf("-file=<filename> : filename containing a matrix in MM format\n");
printf("-device=<device_id> : <device_id> if want to run on specific GPU\n");
exit(0);
}
void parseCommandLineArguments(int argc, char* argv[], struct testOpts& opts) {
memset(&opts, 0, sizeof(opts));
if (checkCmdLineFlag(argc, (const char**)argv, "-h")) {
UsageSP();
}
if (checkCmdLineFlag(argc, (const char**)argv, "file")) {
char* fileName = 0;
getCmdLineArgumentString(argc, (const char**)argv, "file", &fileName);
if (fileName) {
opts.sparse_mat_filename = fileName;
}
else {
printf("\nIncorrect filename passed to -file \n ");
UsageSP();
}
}
}
int main(int argc, char* argv[]) {
struct testOpts opts;
cusolverSpHandle_t cusolverSpH =
NULL;
cusparseMatDescr_t descrA = NULL;
int rowsA = 0; // number of rows of A
int colsA = 0; // number of columns of A
int nnzA = 0; // number of nonzeros of A
int baseA = 0; // base index in CSR format
double mu0 = 70.0; // <double> mu eigenvalue initial guess
double tol = 1e-3; // tol for cusolver<>csreigvsi
int maxiter = 1000; // max iter for cusolver<>csreigvsi
// CSR(A) from I/O
// CPU
int* h_csrRowPtrA = NULL; // <int> n+1
int* h_csrColIndA = NULL; // <int> nnzA
double* h_csrValA = NULL; // <double> nnzA
// GPU
int* d_csrRowPtrA = NULL; // <int> n+1
int* d_csrColIndA = NULL; // <int> nnzA
double* d_csrValA = NULL; // <double> nnzA
parseCommandLineArguments(argc, argv, opts);
findCudaDevice(argc, (const char**)argv);
if (opts.sparse_mat_filename == NULL) {
opts.sparse_mat_filename = sdkFindFilePath("Kmat_d040.mtx", argv[0]);
if (opts.sparse_mat_filename != NULL)
printf("Using default input file [%s]\n", opts.sparse_mat_filename);
else
printf("Could not find Kmat_d040.mtx\n");
}
else {
printf("Using input file [%s]\n", opts.sparse_mat_filename);
}
printf("Step 1: read matrix market format\n");
if (opts.sparse_mat_filename) {
if (loadMMSparseMatrix<double>(opts.sparse_mat_filename, 'd', true, &rowsA,
&colsA, &nnzA, &h_csrValA, &h_csrRowPtrA,
&h_csrColIndA, false)) {
return 1;
}
baseA = h_csrRowPtrA[0]; // baseA = {0,1}
}
else {
fprintf(stderr, "Error: input matrix is not provided\n");
return 1;
}
if (rowsA != colsA) {
fprintf(stderr, "Error: only support square matrix\n");
return 1;
}
printf("sparse matrix A is %d x %d with %d nonzeros, base=%d\n", rowsA, colsA,
nnzA, baseA);
double* d_mu = NULL; // <double> mu eigenvalue initial guess
thrust::device_vector<double> d_x0(rowsA, 1.); // <double> mu eigenvector initial guess
thrust::device_vector<double> d_x(rowsA, 0.); // <double> mu eigenvector
double* h_mu = NULL; // <double> mu eigenvalue initial guess
thrust::host_vector<double> h_x0(rowsA, 1.); // <double> mu eigenvector initial guess
thrust::host_vector<double> h_x(rowsA, 0.); // <double> mu eigenvector
// Create cusolver Handle and matrix Descriptor
checkCudaErrors(cusolverSpCreate(&cusolverSpH));
checkCudaErrors(cusparseCreateMatDescr(&descrA));
checkCudaErrors(cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL));
// Set Matrix base 0 or 1
if (baseA) {
checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE));
}
else {
checkCudaErrors(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO));
}
printf("Step 2: Solve Eigenmodes (GPU):");
// Allocate Memory in the GPU
checkCudaErrors(
cudaMalloc((void**)&d_csrRowPtrA, sizeof(int) * (rowsA + 1)));
checkCudaErrors(cudaMalloc((void**)&d_csrColIndA, sizeof(int) * nnzA));
checkCudaErrors(cudaMalloc((void**)&d_csrValA, sizeof(double) * nnzA));
checkCudaErrors(cudaMalloc(&d_mu, sizeof(double)));
checkCudaErrors(cudaMemset(d_mu, 0, sizeof(double)));
checkCudaErrors(cudaMemcpy(d_csrRowPtrA, h_csrRowPtrA,
sizeof(int) * (rowsA + 1),
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_csrColIndA, h_csrColIndA, sizeof(int) * nnzA,
cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_csrValA, h_csrValA, sizeof(double) * nnzA,
cudaMemcpyHostToDevice));
auto start = std::chrono::high_resolution_clock::now();
checkCudaErrors(cusolverSpDcsreigvsi(
cusolverSpH,
rowsA,
nnzA,
descrA,
d_csrValA,
d_csrRowPtrA,
d_csrColIndA,
mu0,
thrust::raw_pointer_cast(d_x0.data()),
maxiter,
tol,
d_mu,
thrust::raw_pointer_cast(d_x.data())
));
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;
double result;
checkCudaErrors(cudaMemcpy(&result, d_mu, sizeof(double),
cudaMemcpyDeviceToHost));
std::cout << result << std::endl;
std::cout << "Computation time (GPU): " << duration.count() << "s" << std::endl;
}
-
- C++ Prompt Results:
GPU Device 0: "Ada" with compute capability 8.9
Using default input file [./Kmat_d040.mtx]
Step 1: read matrix market format
sparse matrix A is 17790 x 17790 with 189294 nonzeros, base=1
Step 2: Solve Eigenmodes (GPU):56.1384
Computation time (GPU): 50.1298s
Iām wondering if there is anything wrong with the way Iām using the cuSolver routines. But I also noticed that the profiler results show that the Grid Size is small. However, Iām not launching the kernels myself; this seems to be done automatically by the routines themselves.
Thanks for your attention!