Does the cuSPARSE tri-diagonal solver have an offset cost?

I used the cusparseDgtsv_nopivot(…) call from the cuSPARSE library with different row sizes, and it took (0.268, 0.347, 0.433) ms for sizes (36, 4194, 41994) respectively on a K40 GPU. In 0.268 ms, a CPU can solve a tridiagonal system of size ~10000! The matrices I am dealing with are small, so I would like to know whether there is some offset cost built into the call. Following is the code I used to collect the timings.

#include <string>
#include <algorithm>
#include <iostream>

#include <cstdlib>
#include <cstdio>
#include <assert.h>
#include <sys/time.h>

#include <cuda_runtime.h>
#include <cusparse_v2.h>

using namespace std;

int main(int argc, char* argv[]){

    int lin_rows = atoi(argv[1]);

    double* h_linear_tridiag;
    double* h_linear_x;
    double* h_linear_rhs;

    h_linear_tridiag = new double[3*lin_rows]();
    h_linear_rhs = new double[lin_rows]();
    h_linear_x = new double[lin_rows]();

    for(int i=0;i<3*lin_rows;i++)
        h_linear_tridiag[i] = rand()%1000+2;

    for(int i=0;i<lin_rows;i++)
        h_linear_rhs[i] = rand()%1000+2;


    double *d_lin_tridiag, *d_lin_rhs, *d_lin_x;

    cudaMalloc((void**)&d_lin_tridiag, 3*lin_rows*sizeof(double));
    cudaMalloc((void**)&d_lin_rhs, lin_rows*sizeof(double));
    cudaMallocManaged((void**)&d_lin_x, lin_rows*sizeof(double));

    // Transferring linear solver data
    cudaMemcpy(d_lin_tridiag, h_linear_tridiag, 3*lin_rows*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lin_rhs, h_linear_rhs, lin_rows*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lin_x, h_linear_rhs, lin_rows*sizeof(double), cudaMemcpyHostToDevice);

    cusparseHandle_t cusparseH;
    cusparseMatDescr_t descrA;

    // Initializing variables
    cusparseCreate(&cusparseH);

    cusparseCreateMatDescr(&descrA);
    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);

    cudaEvent_t start,stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cusparseDgtsv_nopivot(cusparseH, lin_rows, 1,
                &d_lin_tridiag[0], &d_lin_tridiag[lin_rows], &d_lin_tridiag[2*lin_rows], d_lin_x, lin_rows);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // ensure the stop event has completed before reading the timer

    cudaMemcpy(h_linear_x, d_lin_x, lin_rows*sizeof(double), cudaMemcpyDeviceToHost);

    float milliSec = 0;
    cudaEventElapsedTime(&milliSec, start, stop);

    cout << lin_rows << " " << milliSec << endl;

    return 0;
}

Yes, many library operations on the GPU have some overhead. Some of this overhead is incurred per call, and some is related to one-time library initialization.
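If you want to see the initialization cost by itself, here is a minimal host-side sketch (not part of your code; it just wraps std::chrono around the handle creation, and the exact number will vary with GPU, driver, and CUDA version):

#include <chrono>
#include <cstdio>

#include <cusparse_v2.h>

int main(){
    // The first cuSPARSE call typically pays for CUDA context creation
    // and library initialization, so it is far slower than later calls.
    auto t0 = std::chrono::steady_clock::now();
    cusparseHandle_t handle;
    cusparseCreate(&handle);
    auto t1 = std::chrono::steady_clock::now();

    std::printf("cusparseCreate took %f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    cusparseDestroy(handle);
    return 0;
}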

To overcome this overhead, the usual advice is to do “small” operations on the CPU and reserve the GPU for larger ones. From a benchmarking perspective, it’s common practice to run a call a number of times and average the time, or else to run it twice and throw out the first measurement.

When I modify your test code to run cusparseDgtsv_nopivot twice but time only the second call, I see an improvement in the measured execution time.
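Concretely, the modification is along these lines (a sketch against your code above; note that gtsv_nopivot solves in place and overwrites d_lin_x, so the second solve uses the first solution as its right-hand side, which is harmless for timing purposes but not for checking the answer):

    // Warm-up call (not timed): absorbs one-time initialization and other
    // first-call costs.
    cusparseDgtsv_nopivot(cusparseH, lin_rows, 1,
                &d_lin_tridiag[0], &d_lin_tridiag[lin_rows], &d_lin_tridiag[2*lin_rows], d_lin_x, lin_rows);
    cudaDeviceSynchronize();

    // Timed call.
    cudaEventRecord(start);
    cusparseDgtsv_nopivot(cusparseH, lin_rows, 1,
                &d_lin_tridiag[0], &d_lin_tridiag[lin_rows], &d_lin_tridiag[2*lin_rows], d_lin_x, lin_rows);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);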

For your CPU case, are you using a library routine to do the tridiagonal solve? If so, which library are you using?